What happens when you need to analyze a ton of data using Python, but it’s taking too long? You implement multiprocessing pools in Python, of course!
So recently at work, I was tasked with programmatically looking through around 30 GB of log files to determine test status. I figured it would be an interesting side project to learn some additional Python, so I dug in. After some trial and error, I got everything working correctly, with one minor issue: it was taking over an hour to complete.
The good part was that I was essentially doing the same operations eight different times for eight different sets of scripts. So I decided the best course of action would be to implement multiprocessing pools in Python. A few Google searches later, I came up with the basics.
from multiprocessing import Pool
import time

def subtask(count):
    # Simple task to sleep for 2 seconds
    # and return count squared
    time.sleep(2)
    return count * count

result_list = list()

def add_result(result):
    # This is the callback function that is
    # called whenever subtask completes
    result_list.append(result)

def run_async_subtasks_with_callback():
    # Define a pool
    my_pool = Pool()
    # Add asynchronous tasks to the pool
    for i in range(8):
        my_pool.apply_async(subtask,
                            args=(i,), callback=add_result)
    # Close pool - no more tasks can be submitted
    my_pool.close()
    # Wait for all tasks in the pool to complete
    my_pool.join()
    print(result_list)

if __name__ == '__main__':
    run_async_subtasks_with_callback()
Running this, we get the following:
[4, 0, 1, 9, 16, 25, 36, 49]
It’s important to realize that the order in which the sub-tasks finish is not guaranteed, which is why the results above are not in numerical order. It’s equally important to realize that any exception raised in a sub-task will not show up in the output; instead, it simply appears as though nothing happened. I found a good solution, but that will have to wait for another blog post.
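In the meantime, here’s a minimal sketch that demonstrates the silent failure; failing_subtask is just a made-up stand-in that always raises, not code from my actual project:

from multiprocessing import Pool

def failing_subtask(count):
    # Every call raises, simulating a bug in a sub-task
    raise ValueError('sub-task %d failed' % count)

def add_result(result):
    # Never called, because failing_subtask never returns normally
    print(result)

if __name__ == '__main__':
    my_pool = Pool()
    for i in range(8):
        # The exception is stored on the AsyncResult object that
        # apply_async returns; since we discard it and never call
        # .get() on it, no traceback is ever printed
        my_pool.apply_async(failing_subtask,
                            args=(i,), callback=add_result)
    my_pool.close()
    my_pool.join()
    # The script exits with no output and no errors at all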
So, how did the performance improve using multiprocessing pools in Python? About what you’d expect. I was running these scripts on an Intel Core i7 with eight logical cores, so I initialized a pool of size eight. My overall run time decreased from over an hour to about 12 minutes. That’s not quite an eight-fold speedup, but I wasn’t expecting one: every sub-task was reading from the same disk, so disk I/O became the bottleneck. Overall, I was very pleased.
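As an aside, Pool() with no argument already defaults to the machine’s CPU count, but you can size the pool explicitly. Here’s a minimal sketch that does that and also uses map, which, unlike apply_async, blocks until everything finishes and returns the results in input order; the timing comment assumes a machine with at least eight cores:

import time
from multiprocessing import Pool

def subtask(count):
    # Stand-in for real work (my actual sub-tasks parsed log files)
    time.sleep(2)
    return count * count

if __name__ == '__main__':
    start = time.time()
    # processes=8 makes the pool size explicit; Pool() would
    # default to os.cpu_count() workers anyway
    with Pool(processes=8) as my_pool:
        results = my_pool.map(subtask, range(8))
    print(results)  # map preserves input order: [0, 1, 4, ..., 49]
    print('%.1f seconds' % (time.time() - start))  # roughly 2, not 16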
I hope to do a small series on Python now that I’ve been using it more, so stay tuned if you’re interested.