Getting around memory limitations with Django and multi-processing

I’ve spent the last few weeks writing a data migration for a large, high-traffic website and have had a lot of fun trying to squeeze every bit of processing power out of my machine. While playing around locally I can cluster the migration so it executes on fractions of the queryset. For instance:

./manage.py run_my_migration --cluster=1/10
./manage.py run_my_migration --cluster=2/10
./manage.py run_my_migration --cluster=3/10
./manage.py run_my_migration --cluster=4/10

All this does is take the queryset that is generated in the migration and chop it up into tenths. No big deal. The part that is a big deal is that the queryset contains 30,000 rows. In itself that isn’t a bad thing, but there are a lot of memory- and CPU-heavy operations that happen on each row. I was finding that when I tried to run the migration on our Rackspace Cloud servers, the machine would exhaust its memory and terminate my processes. This was a bit frustrating, because presumably the operating system should be able to make use of swap and just deal with it. I tried to make the clusters smaller, but was still running into issues. Even more frustrating was that this happened at irregular intervals: sometimes it took 20 minutes and sometimes it took 4 hours.
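
For context, the actual run_my_migration command isn’t shown in this post, but the slicing behind --cluster=part/total is just arithmetic on the queryset. A rough sketch, where process_row stands in for the real per-row work:

def run_cluster(cluster, queryset):
    # cluster is a string like "3/10": process the 3rd of 10 roughly equal slices.
    part, total = [int(n) for n in cluster.split("/")]
    count = queryset.count()
    chunk = (count + total - 1) // total          # ceiling division
    start, end = (part - 1) * chunk, part * chunk
    for row in queryset.order_by("pk")[start:end].iterator():
        process_row(row)                          # the memory/CPU heavy work on each row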

Threading & Multi-processing

My solution to the problem utilized the clustering ability I already had built into the program. If I could break the migration down into 10,000 small migrations, then I should be able to get around any memory limitations. My plan was as follows:

  1. Break down the migration into 10,000 clusters of roughly 3 rows apiece.
  2. Execute 3 clustered migrations concurrently.
  3. Start the next migration after one has finished.
  4. Log the state of the migration so we know where to start if things go poorly.

One of the issues with doing concurrency work in Python is the global interpreter lock (GIL). It makes writing code a lot easier, but it means the threads in a single Python process never run Python code truly in parallel. However, it’s easy to skirt around if you just spawn new processes like I did.
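
As an aside, if the heavy lifting lived in-process instead of in a separate manage.py call, the standard library’s multiprocessing module is the usual way around the GIL, since each worker is a full interpreter with its own memory. A rough sketch, where migrate_cluster is a stand-in for the real per-cluster work:

from multiprocessing import Pool

def migrate_cluster(cluster):
    # Stand-in for the real per-cluster migration; each call runs in its
    # own process, with its own GIL and its own memory.
    print "migrating", cluster

if __name__ == '__main__':
    pool = Pool(processes=3)  # three workers, like the 3 concurrent clusters below
    pool.map(migrate_cluster, ["{0}/10000".format(i) for i in range(1, 10001)])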

Borrowing some thread pooling code from here, I was able to get a pretty sweet script running in no time at all.
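
The ThreadPool imported from util below isn’t reproduced in this post (it comes from that borrowed snippet), but a minimal Queue-based version along the same lines, saved as util.py next to the script, looks roughly like this:

from Queue import Queue
from threading import Thread

class Worker(Thread):
    """Thread that pulls tasks off a shared queue and executes them."""
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.start()

    def run(self):
        while True:
            func, args, kwargs = self.tasks.get()
            try:
                func(*args, **kwargs)
            except Exception as e:
                print e
            finally:
                self.tasks.task_done()

class ThreadPool(object):
    """Pool of worker threads consuming tasks from a queue."""
    def __init__(self, num_threads):
        self.tasks = Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args, **kwargs):
        """Queue up a (function, args, kwargs) task."""
        self.tasks.put((func, args, kwargs))

    def wait_completion(self):
        """Block until every queued task has finished."""
        self.tasks.join()

And the launcher script itself: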

import sys
import os.path
 
from util import ThreadPool
 
def launch_import(cluster_start, cluster_size, python_path, command_path):
    import subprocess
 
    # Build the full command, e.g. "/path/to/python /path/to/manage.py import --cluster=42/10000"
    command = python_path
    command += " " + command_path
    command += "{0}/{1}".format(cluster_start, cluster_size)
 
    # Open completed list.
    completed = []
    with open("clusterlog.txt") as f:
        completed = f.readlines()
 
    # Check to see if we should be running this command.
    if command+"\n" in completed:
        print "lowmem.py ==> Skipping {0}".format(command)
    else:
        print "lowmem.py ==> Executing {0}".format(command)
        proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output, errors = proc.communicate() # Read both pipes so the child can't block on a full buffer; don't print the output.
 
        # Log completed cluster
        with open('clusterlog.txt', 'a') as logfile:
            logfile.write("{0}\n".format(command))
 
 
if __name__ == '__main__':
 
    # Simple command line args checking
    try:
        lowmem, clusters, pool_size, python_path, command_path = sys.argv
    except ValueError:
        print "Usage: python lowmem.py <clusters> <pool_size> <path/to/python> <path/to/manage.py>"
        sys.exit(1)
 
    # Initiate log file.
    if not os.path.isfile("clusterlog.txt"):
        logfile = open('clusterlog.txt', 'w+')
        logfile.close()
 
    # Build in some extra space.
    print "\n\n"
 
    # Initiate the thread pool
    pool = ThreadPool(int(pool_size))
 
    # Start adding tasks
    for i in range(1, int(clusters) + 1): # Clusters are numbered 1..N, so include the last one.
        pool.add_task(launch_import, i, clusters, python_path, command_path)
 
    pool.wait_completion()

Utilizing the code above, I can now run a command like:

python lowmem.py 10000 3 /srv/www/project/bin/python "/srv/www/project/src/manage.py import --cluster=" &

This breaks the queryset up into 10,000 parts and runs the import 3 clusters at a time. It has done a great job of keeping the memory footprint of the import low, while still getting some concurrency so it doesn’t take forever.
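
One nice side effect of logging each completed command to clusterlog.txt is that the whole run is resumable: if the box falls over partway through, re-running the exact same lowmem.py invocation skips everything already in the log and picks up where it left off.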