What happens when you need to analyze a ton of data using Python, but it’s taking too long? You implement multiprocessing pools in Python, of course!
So recently at work, I was tasked with programmatically looking through around 30 GB of log files to determine test status. I figured it would be an interesting side project to learn some additional Python, so I dug in. After some trial and error, I got everything working correctly, with one minor issue: it was taking over an hour to complete.
The good part was that I was essentially doing the same operations eight different times for eight different sets of scripts. So I decided the best course of action would be to implement multiprocessing pools in Python. A few Google searches later, I came up with the basics.
from multiprocessing import Pool
import time

def subtask(count):
    # Simple task to sleep for 2 seconds
    # and return count squared
    time.sleep(2)
    return count * count

result_list = list()

def add_result(result):
    # This is the callback function that is
    # called whenever subtask completes
    result_list.append(result)

def run_async_subtasks_with_callback():
    # Define a pool
    my_pool = Pool()
    # Add asynchronous tasks to the pool
    for i in range(8):
        my_pool.apply_async(subtask,
                            args=(i,), callback=add_result)
    # Close pool - no more tasks can be submitted
    my_pool.close()
    # Wait for all tasks in the pool to complete
    my_pool.join()
    print(result_list)

if __name__ == '__main__':
    run_async_subtasks_with_callback()
Running this, we get the following:
[4, 0, 1, 9, 16, 25, 36, 49]
It’s important to realize that the order in which the sub-tasks finish is not guaranteed, which is why the results above are not in numerical order. It’s equally important to realize that any exception raised in a sub-task will not show up in the output; instead, it simply appears as though nothing happened. I found a good solution, but that will have to wait for another blog post.
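In the meantime, here’s a minimal sketch that demonstrates the silent failure; failing_subtask is just a made-up stand-in that always raises, not code from my actual project:

from multiprocessing import Pool

def failing_subtask(count):
    # Every call raises, simulating a bug in a sub-task
    raise ValueError('sub-task %d failed' % count)

def add_result(result):
    # Never called, because failing_subtask never returns normally
    print(result)

if __name__ == '__main__':
    my_pool = Pool()
    for i in range(8):
        # The exception is stored on the AsyncResult object that
        # apply_async returns; since we discard it and never call
        # .get() on it, no traceback is ever printed
        my_pool.apply_async(failing_subtask,
                            args=(i,), callback=add_result)
    my_pool.close()
    my_pool.join()
    # The script exits with no output and no errors at all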
So, how did the performance improve using multiprocessing pools in Python? About what you’d expect. I was running these scripts on an Intel Core i7 with eight logical cores, so I initialized a pool of size eight. My overall run time decreased from over an hour to about 12 minutes. That’s not quite an eight-fold speedup, but I wasn’t expecting one: every sub-task was reading from the same disk, so disk I/O became the bottleneck. Overall, I was very pleased.
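As an aside, Pool() with no argument already defaults to the machine’s CPU count, but you can size the pool explicitly. Here’s a minimal sketch that does that and also uses map, which, unlike apply_async, blocks until everything finishes and returns the results in input order; the timing comment assumes a machine with at least eight cores:

import time
from multiprocessing import Pool

def subtask(count):
    # Stand-in for real work (my actual sub-tasks parsed log files)
    time.sleep(2)
    return count * count

if __name__ == '__main__':
    start = time.time()
    # processes=8 makes the pool size explicit; Pool() would
    # default to os.cpu_count() workers anyway
    with Pool(processes=8) as my_pool:
        results = my_pool.map(subtask, range(8))
    print(results)  # map preserves input order: [0, 1, 4, ..., 49]
    print('%.1f seconds' % (time.time() - start))  # roughly 2, not 16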
I hope to do a small series on Python now that I’ve been using it more, so stay tuned if you’re interested.