Multiprocessing Pools in Python

What happens when you need to analyze a ton of data using Python, but it’s taking too long? You implement multiprocessing pools in Python, of course!

Recently at work, I was tasked with programmatically searching through around 30 GB of log files to determine test status. I figured it would be an interesting side project for learning some more Python, so I dug in. After some trial and error, I got everything working correctly, with one issue: the script took over an hour to complete.

The good part was that I was essentially performing the same operations eight times, once for each of eight sets of scripts. So I decided the best course of action would be to implement multiprocessing pools in Python. A few Google searches later, I had the basics.

from multiprocessing import Pool
import time

def subtask(count):
    # Simple task to sleep for 2 seconds 
    # and return count squared
    time.sleep(2)
    return count*count

result_list = list()

def add_result(result):
    # This is the callback function. It runs in the
    # parent process each time a subtask completes successfully
    result_list.append(result)


def run_async_subtasks_with_callback():
    # Define a pool
    my_pool = Pool()
    
    # Add asynchronous tasks to pool
    for i in range(8):
        my_pool.apply_async(subtask,
            args=(i,), callback=add_result)
    
    # Close pool - no more tasks can be submitted
    my_pool.close()
    
    # Wait for all tasks in the pool to complete
    my_pool.join()
    
    print(result_list)
    
if __name__ == '__main__':
    run_async_subtasks_with_callback()
    

Running this, we get the following:

[4, 0, 1, 9, 16, 25, 36, 49]

It’s important to realize that the order in which sub-tasks finish is not guaranteed, which is why the results shown above are not in numerical order. It’s also important to realize that an exception raised inside a sub-task will not show up in the output window. With apply_async, the exception is stored on the returned result object and only surfaces if you ask for that result, so it appears as though nothing happened. I found a good solution, but that will have to wait for another blog post.
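In the meantime, here is a minimal sketch of one common way to deal with both points. It is not necessarily the solution I have in mind for that future post. It keeps the AsyncResult object that apply_async returns for each task and calls get() on each one in submission order, which keeps the results ordered and re-raises any exception from the worker. The failing input (count == 3) and the run_and_collect name are made up for illustration.

```python
from multiprocessing import Pool


def subtask(count):
    # Hypothetical failure: raise for one input to mimic a bug in a sub-task
    if count == 3:
        raise ValueError(f"subtask failed for count={count}")
    return count * count


def run_and_collect(n_tasks=8):
    results, errors = [], []
    with Pool() as pool:
        # Keep every AsyncResult so each task can be queried later
        pending = [pool.apply_async(subtask, args=(i,)) for i in range(n_tasks)]

        # get() blocks until that task is done and re-raises the worker's
        # exception in the parent; looping in submission order keeps the
        # successful results in input order
        for i, async_result in enumerate(pending):
            try:
                results.append(async_result.get())
            except ValueError as exc:
                errors.append((i, str(exc)))
    return results, errors


if __name__ == "__main__":
    results, errors = run_and_collect()
    print(results)
    print(errors)
```

If you would rather stay with callbacks, apply_async also accepts an error_callback argument that is called with the exception instance when a task fails.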

So, how much did multiprocessing pools in Python improve performance? About what you’d expect. I was running these scripts on an Intel Core i7 with 8 virtual cores, so I initialized a pool of size 8. My overall run time dropped from over an hour to about 12 minutes, roughly a fivefold speedup. That is not eight times faster, but I didn’t expect it to be: all sub-tasks were accessing the disk, so the disk became the bottleneck. Overall, I was very pleased.
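If you want to measure this kind of speedup on your own machine, the sketch below times the same eight tasks serially and then with an explicitly sized pool. It is only an illustration: the 0.2 second sleep stands in for the real log parsing, and the helper names (timed, run_serial, run_pooled) are made up for this example.

```python
import time
from multiprocessing import Pool


def subtask(count):
    # Stand-in for one unit of work (the real job parsed log files)
    time.sleep(0.2)
    return count * count


def timed(label, func):
    # Run func once and report how long it took
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


def run_serial(n_tasks=8):
    return [subtask(i) for i in range(n_tasks)]


def run_pooled(n_tasks=8, workers=8):
    # processes= sets the pool size explicitly; Pool() with no argument
    # sizes the pool from the CPU count reported by the system
    with Pool(processes=workers) as pool:
        # map() returns the results in input order
        return pool.map(subtask, range(n_tasks))


if __name__ == "__main__":
    serial, t_serial = timed("serial", run_serial)
    pooled, t_pooled = timed("pooled", run_pooled)
    print(f"speedup: {t_serial / t_pooled:.1f}x")
```

Because sleeping does not compete for the disk, this toy version should come much closer to the ideal eightfold speedup than my disk-bound log job did.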

I hope to do a small series on Python now that I’ve been using it more, so stay tuned if you’re interested.


Microsoft Excel SYLK error opening a CSV file

Today I ran into an interesting problem while trying to open a Python-generated CSV file: Microsoft Excel gave a SYLK file error when opening it. The confusing thing was that I’ve been using Python to generate CSV files for graphing and other Excel-type work for a while now, and had never run into this. What was more, it was repeatable.

Here is the actual error:

Excel has detected that ‘filename.csv’ is a SYLK file, but cannot load it. Either the file has errors or it is not a SYLK file format. Click OK to try to open the file in a different format.

Googling, I found the issue. Apparently Microsoft decided that, regardless of the file extension, Excel would look at the first few characters of your file. If the file starts with “ID”, Excel assumes it is a SYLK file. What’s more, it seems that this has been an issue for a long time! I found one helpful site showing that Microsoft Excel has been returning SYLK errors like this since January of 2012. That’s over seven years! Even more annoying is Microsoft’s insistence on ignoring the actual file extension.

I updated my Python script to call the first column of data something else, and all was well again with the world. I also found out that the error is triggered by literally anything that starts with uppercase “ID”. Lowercase (“id”) works just fine.
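For reference, here is a minimal sketch of that kind of guard using the standard csv module. It is not my actual script: the safe_header and write_csv names and the sample columns are made up for illustration, and lowercasing the first column name is just one possible rename, chosen because lowercase “id” is the case I know works.

```python
import csv
import io


def safe_header(header):
    # The first column name ends up at the very start of the file, so
    # lowercase it if it begins with uppercase "ID"
    header = list(header)
    if header and header[0].startswith("ID"):
        header[0] = header[0].lower()
    return header


def write_csv(rows, header):
    # Build the CSV text in memory; a real script would write to a file
    # opened with newline=""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(safe_header(header))
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    text = write_csv([[1, "pass"], [2, "fail"]], ["ID", "status"])
    print(text)
```

Only the first column matters here, since it is the one that lands at the start of the file. A later column named “ID” is left alone.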