What happens when you need to analyze a ton of data using Python, but it's taking too long? You implement multiprocessing pools in Python, of course!
So recently at work, I was tasked with programmatically looking through around 30 GB of log files to determine test status. I figured it would be an interesting side project to learn some additional Python, so I dug in. After some trial and error, I got everything working correctly, with one minor issue: it was taking over an hour to complete.
The good part was that I was essentially doing the same operations eight times, for eight different sets of scripts. So I decided the best course of action would be to implement multiprocessing pools in Python. A few Google searches later, I had the basics.
from multiprocessing import Pool
import time

def subtask(count):
    # Simple task to sleep for 2 seconds
    # and return count squared
    time.sleep(2)
    return count * count

result_list = list()

def add_result(result):
    # This is the callback function that is
    # called whenever subtask completes
    result_list.append(result)

def run_async_subtasks_with_callback():
    # Define a pool
    my_pool = Pool()
    # Add asynchronous tasks to the pool
    for i in range(8):
        my_pool.apply_async(subtask, args=(i,), callback=add_result)
    # Close pool - no more tasks can be submitted
    my_pool.close()
    # Wait for all tasks in the pool to complete
    my_pool.join()
    print(result_list)

if __name__ == '__main__':
    run_async_subtasks_with_callback()
Running this, we get the following:
[4, 0, 1, 9, 16, 25, 36, 49]
It's important to realize that the order in which sub-tasks finish is not guaranteed, which is why the results above are not in numerical order. It's equally important to realize that any exception raised in a sub-task will not show up in the output; instead, it appears as though nothing happened at all. I found a good solution, but that will have to wait for another blog post.
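For the curious, one quick way to surface those hidden exceptions, and to get results back in submission order as a bonus, is to hang on to the AsyncResult handles that apply_async returns and call .get() on each. Here's a minimal sketch of that approach (the helper name run_subtasks_and_collect is just for illustration, and this isn't necessarily the solution I'll write up later):

from multiprocessing import Pool
import time

def subtask(count):
    # Same toy task as before: sleep, then square
    time.sleep(2)
    return count * count

def run_subtasks_and_collect():
    # Hypothetical helper, shown for illustration only
    with Pool() as my_pool:
        # Keep the AsyncResult handles instead of using a callback
        async_results = [my_pool.apply_async(subtask, (i,)) for i in range(8)]
        # get() blocks until each task finishes and re-raises any
        # exception from the sub-task here, in the parent process;
        # iterating in submission order also keeps the results ordered
        return [result.get() for result in async_results]

if __name__ == '__main__':
    print(run_subtasks_and_collect())  # [0, 1, 4, 9, 16, 25, 36, 49]

Because get() re-raises worker exceptions in the parent process, failures are no longer silent, and the printed list comes back in numerical order.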
So, how did the performance improve using multiprocessing pools in Python? About what you'd expect. I was running these scripts on an Intel Core i7 with 8 virtual cores, so I initialized a pool of size 8. My overall run-time dropped from over an hour to about 12 minutes, roughly a five-fold improvement. Not quite eight times better, but I wasn't expecting that: all of the sub-tasks were accessing the disk, so it became the bottleneck. Overall, I was very pleased.
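As an aside, Pool() with no arguments already defaults to os.cpu_count() workers, but passing the count explicitly makes it easy to experiment with smaller pools when disk I/O, not the CPU, is the limiting factor. A minimal sketch (show_pid is just a throwaway task for illustration):

import os
from multiprocessing import Pool

def show_pid(_):
    # Trivial task: report which worker process ran it
    return os.getpid()

if __name__ == '__main__':
    # Pool(processes=...) sets the worker count explicitly; try
    # smaller values here when the disk is the real bottleneck
    with Pool(processes=os.cpu_count()) as my_pool:
        pids = my_pool.map(show_pid, range(8))
    print(sorted(set(pids)))  # one PID per worker that actually ran a task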
I hope to do a small series on Python now that I’ve been using it more, so stay tuned if you’re interested.