Categories
Django Python

Using the Django Per-Site Cache with the Nginx HTTP Memcached Module

For a long time I thought that the most interesting problems in my field were in scalability. Some people may be more interested in scaling, and others might be more into slick interfaces and fast animations. But for me, scalability has continued to be my passion. For awhile though, it was a unicorn. That unattainable thing that I wanted to work on but couldn’t find anywhere to do it at. That is, until I started work at Future US.

Future is a media company. Originally they started in old media focusing heavily on gaming and tech magazines. Eventually the internet became prominent in everyday life, so more of their old media properties made the transition to the web. The one that really matters to me though is PC Gamer. I’ve been a huge fan of PC Gamer since I was about 7 years old. I still have fond memories getting demo disks in the mail with my subscription.

When I was hired at Future it was to help facilitate the move of PC Gamer from its existing platform (WordPress) to Django. Future had experienced success moving other properties to Django, so it made sense to do it with PC Gamer. When it eventually came time to implement our caching layer, we thought about a lot of different ways that it could be done. Varnish came up as an option, but we decided against it since nobody on the team had experience configuring it (and people elsewhere in the organization had experienced issues with it). Eventually we settled on having Nginx serve pages directly from Memcache. For us, this method works great because PC Gamer doesn’t have a lot of interaction (its almost completely consumption from the user end). Anything that does require back-and-forth between the server is handled via javascript, which makes full page caching super easy to do.

The high level architecture for pc gamer.
The high level architecture for pc gamer.

So how does it all work? The image above describes PC Gamer’s server architecture from a high level. Its pretty basic and works quite well for us. We end up having two types of requests: cache hits & cache misses. The flow for a cache hit is: request -> load balancer -> nginx -> memcache -> your browser. The flow for a cache miss is: request -> load balancer -> nginx -> application server (django) -> (store page in cache) -> your browser.

Since we’re basically running a static site, deciding what content to cache is easy: EVERYTHING!

Cache all the things!
Cache all the things!

Luckily for us Django already has a nice way of doing this: The per-site cache. But it is not without its issues. First of all, the cache keys it creates are insane. We needed something a little simpler for our setup so Nginx could build the cache key of the current request on the fly.

How It Works

The meat and potatoes of overriding Django’s per-site cache key comes in the `_generate_cache_key` function.

def _generate_cache_key(request, method, headerlist, key_prefix):
    if key_prefix is None:
        key_prefix = settings.CACHE_MIDDLEWARE_KEY_PREFIX
    cache_key = key_prefix + get_absolute_uri(request)
    return hashlib.md5(cache_key).hexdigest()

To make things easier for Nginx to understand we just take the url and md5 it. Simple!

On the Nginx side of things, the setup is equally simple.

        set            $combined_string "$host$request_uri";
        set_by_lua     $memcached_key "return ngx.md5(ngx.arg[1])" $combined_string;
 
        # 404 for cache miss
        # 502 for memcached down
        error_page     404 502 504 = @fallback;
 
        memcached_pass {{ cache.private_ip }}:11211;

All this setup does is take the MD5 of the host + request URI and then check to see if that cache key exists in memcache. If it does then we serve the content at that cache key, if it doesn’t we fall back to our Django application servers and they generate the page.

Thats it. Seriously. It’s simple, extremely fast, and works for us. Your mileage may vary, but if you have relatively simple caching requirements I highly suggest looking into this method before looking at something like Varnish. It could help you remove quite a bit of complexity from your setup.

Categories
Django Python Uncategorized

Getting around memory limitations with Django and multi-processing

I’ve spent the last few weeks writing a data migration for a large high traffic website and have had a lot of fun trying to squeeze every bit of processing power out of my machine. While playing around locally I can cluster the migration so it executes on fractions of the queryset. For instance.

./manage.py run_my_migration --cluster=1/10
./manage.py run_my_migration --cluster=2/10
./manage.py run_my_migration --cluster=3/10
./manage.py run_my_migration --cluster=4/10

All this does is take the queryset that is generated in the migration and chop it up into tenths. No big deal. The part that is a big deal is that the queryset contains 30,000 rows. In itself that isn’t a bad thing, but there are a lot of memory and cpu heavy operations that happen on each row. I was finding that when I tried to run the migration on our Rackspace Cloud servers the machine would exhaust its memory and terminate my processes. This was a bit frustrating because presumably the operating system should be able to make use of the swap and just deal with it. I tried to make the clusters smaller, but was still running into issues. Even more frustrating was that this happened at irregular intervals. Sometimes it took 20 minutes and sometimes it took 4 hours.

Threading & Multi-processing

My solution to the problem utilized the clustering ability I already had built into the program. If I could break the migration down into 10,000 small migrations, then I should be able to get around any memory limitations. My plan was as follows:

  1. Break down the migration into 10,000 clusters of roughly 3 rows a piece.
  2. Execute 3 clustered migrations concurrently.
  3. Start the next migration after one has finished.
  4. Log the state of the migration so we know where to start if things go poorly.

One of the issues with doing concurrency work with Python is the global interpreter lock (GIL). It makes writing code a lot easier, but doesn’t allow Python to spawn proper threads. However, its easy to skirt around if you just spawn new processes like I did.

Borrowing some thread pooling code here, I was able to get pretty sweet script running in no time at all.

import sys
import os.path
 
from util import ThreadPool
 
def launch_import(cluster_start, cluster_size, python_path, command_path):
    import subprocess
 
    command = python_path
    command += " " + command_path
    command += "{0}/{1}".format(cluster_start, cluster_size)
 
    # Open completed list.
    completed = []
    with open("clusterlog.txt") as f:
        completed = f.readlines()
 
    # Check to see if we should be running this command.
    if command+"\n" in completed:
        print "lowmem.py ==> Skipping {0}".format(command)
    else:
        print "lowmem.py ==> Executing {0}".format(command)
        proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output = proc.stdout.read() # Capture the output, don't print it.
 
        # Log completed cluster
        logfile = open('clusterlog.txt', 'a+')
        logfile.write("{0}\n".format(command))
        logfile.close()
 
 
if __name__ == '__main__':
 
    # Simple command line args checking
    try:
        lowmem, clusters, pool_size, python_path, command_path = sys.argv
    except:
        print "Usage: python lowmem.py <clusters> <pool_size> <path/to/python> <path/to/manage.py>"
        sys.exit(1)
 
    # Initiate log file.
    if not os.path.isfile("clusterlog.txt"):
        logfile = open('clusterlog.txt', 'w+')
        logfile.close()
 
    # Build in some extra space.
    print "\n\n"
 
    # Initiate the thread pool
    pool = ThreadPool(int(pool_size))
 
    # Start adding tasks
    for i in range(1, int(clusters)):
        pool.add_task(launch_import, i, clusters, python_path, command_path)
 
    pool.wait_completion()

Utilizing the code above, I can now run a command like:

python lowmem.py 10000 3 /srv/www/project/bin/python "/srv/www/project/src/manage.py import --cluster=" &

Which breaks the queryset up into 10,000 parts and runs the import 3 sets at a time. This has done a great job of keeping the memory footprint of the import low, while still getting some concurrency so it doesn’t take forever.

Categories
Python

Virtualenv Error: Could not find the …site.py element of the Setuptools distribution

For a few hours I was running into a problem whenever I would try to install my PIP requirements file. The install would go alright until it got to Distribute, at which point I would get an that ended up in a stack trace with:

Could not find the /lib/python2.7/site-packages/site.py element of the Setuptools distribution

After much searching and digging, the issue was that my virtual environment needed to be instantiated with the –distribute flag.

virtualenv venv --distribute

Problem solved.

Categories
Behavior Driven Development (BDD) Django Lettuce Python Testing

Integration Testing With Django and Lettuce: Getting Started

The Ruby on Rails community has long been a proponent of Behavior Driven Development(BDD) and has a great ecosystem around it supporting that testing methodology. From Cucumber to Capybara, RoR developers have it made when it comes to BDD. But what about Django? What about Python? Django and Python don’t have access to Cucumber or Capybara, but what we do have is a fantastic port of Cucumber called Lettuce.

What is Behavior Driven Development

Before we can get started talking about Lettuce and all the cool things you can do with it, we first need to talk about BDD.

Behavior-driven development combines the general techniques and principles of TDD with ideas from domain-driven design and object-oriented analysis and design to provide software developers and business analysts with shared tools and a shared process to collaborate on software development. (Wikipedia)

BDD arose out of the need for the business side of software and the engineering side of software to communicate more effectively. Prior to BDD, it was a lot more difficult to communicate the business requirements of a project to developers. Sure there were spec documents, but those still needed to be translated into a language the computer can understand. With BDD, tests and acceptance criteria are more accessible to everybody involved. Dan North suggested a few guidelines for BDD, and then the development community took it from there.

  • Tests should be grouped into user stories. Essentially narratives about the expected functionality.
  • Stories should have a title. The title should be clear and explicit.
  • There should be a short narrative at the beginning of the story, that explains who the primary stakeholder of the story is, what effect the story should have, and what business value the stakeholder derives from this from this effect.
  • Scenarios(tests) should follow the format of first describing the initial conditions for the scenario, then which event(s) triggers the start of the scenario, and finally what the expected outcome of the scenario should be.
  • All of these steps should be written out in natural language, preferably using the Gherkin syntax.

An example feature using Gherkin.

Feature: Authentication
    In order to protect private information
    As a registered user
    I want to log in to the admin portal
 
    Scenario: I enter my password correctly
        Given the user "Jack" exists with password "password"
        And I am at "/login/"
        When I fill in "Login" with "Jack"
        And I fill in "Password" with "password"
        And I press "Login"
        Then I should be at "/portal/"
        And I should see "Welcome to the admin portal"

So now that we know the gist of BDD, why would you want to use it? There are probably more reasons than the 3 I’m going to list, but I found these to justify my use of BDD in most cases.

  1. It’s easy for business minded people to understand what you’re trying to test.
  2. It’s easier to translate complicated business requirements into tests.
  3. Some things are easier to explain in natural language.

Alright, now we’re done with the background information. Let’s get rolling on some testing.

Getting Started

To follow the rest of this article, you’re going to need the following:

  • A little Python experience
  • A little Django experience
  • Extremely basic knowledge of regular expressions
  • Knowledge of how to set up a virtual environment using virtualenv (I also use virtualenvwrapper to make my life a bit easier)
  • Firefox – Yes, I know you don’t need Firefox to do this, but its probably the easiest to use with Selenium.

On the bright side, no previous testing experience required!

The best place to start with all this getting the virtual environment set up.

jacks$ mkvirtualenv learning_lettuce

After that, lets get Django installed.

(learning_lettuce)jack:repos jacks$ pip install django

And now we’ll need to create a new Django project.

(learning_lettuce)jack:repos jacks$ django-admin.py startproject learning_lettuce
(learning_lettuce)jack:repos jacks$ cd learning_lettuce/
(learning_lettuce)jack:learning_lettuce jacks$ ls -la
total 8
drwxr-xr-x   4 jacks  staff  136 Jun 19 07:50 .
drwxrwxrwx  28 jacks  staff  952 Jun 19 07:50 ..
drwxr-xr-x   6 jacks  staff  204 Jun 19 07:50 learning_lettuce
-rw-r--r--   1 jacks  staff  259 Jun 19 07:50 manage.py

At this point, I also like to CHMOD manage.py so I can execute it without calling Python directly.

(learning_lettuce)jack:learning_lettuce jacks$ chmod +x manage.py
(learning_lettuce)jack:learning_lettuce jacks$ ./manage.py runserver
Validating models...
 
0 errors found
June 19, 2013 - 06:53:29
Django version 1.5.1, using settings 'learning_lettuce.settings'
Development server is running at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

If you can run the server and see the image below, then we can proceed.
Django Powered!

Now that we have Django set up, lets go ahead and create the app we’ll be testing in.

(learning_lettuce)jack:learning_lettuce jacks$ ./manage.py startapp blog

And then go ahead and add the blog app to INSTALLED_APPS in your settings.py file.

INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    'django.contrib.messages',
    'django.contrib.staticfiles',
 
    # Authored
    'blog',
)

Also, lets configure our project to use SQLite3.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'learning_lettuce.db',
    }
}

Now that we have a Django project and one app set up, its time to take a break and talk about Lettuce.

Lettuce

No, not the vegetable you add to salads. I’m talking about the the BDD testing framework for Python (http://www.lettuce.it). Lettuce is basically a port of a BDD testing framework from the RoR community called Cucumber. The Lettuce website contains extensive documentation and is a great source for learning best practices with it. It’s worth noting however, that at the time of this writing the Lettuce website is undergoing some design changes. They’re incomplete and have made it pretty hard to extract information from the site. Hopefully by the time you need it for reference it’s back to being usable again.

Alright, back to work. Lets install Lettuce, Selenium, and Nose and then freeze a requirements file so we can replicate this environment if we ever need to.

(learning_lettuce)jack:learning_lettuce jacks$ pip install lettuce selenium django-nose
(learning_lettuce)jack:learning_lettuce jacks$ pip freeze > requirements.txt
(learning_lettuce)jack:learning_lettuce jacks$ cat requirements.txt 
Django==1.5.1
django-nose==1.1
fuzzywuzzy==0.2
ipdb==0.7
ipython==0.13.2
lettuce==0.2.18
nose==1.3.0
selenium==2.33.0
sure==1.2.2
wsgiref==0.1.2

You’ll also need to add Lettuce to INSTALLED_APPS in your settings.py file.

INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    'django.contrib.messages',
    'django.contrib.staticfiles',
 
    # 3rd party
    'lettuce.django',
 
    # Authored
    'blog',
)

So now that you have Lettuce installed, lets see that it actually works.

(learning_lettuce)jack:learning_lettuce jacks$ ./manage.py harvest
Django's builtin server is running at 0.0.0.0:8000
Oops!
could not find features at ./blog/features

Great, Lettuce worked! It didn’t find any tests to run, but thats ok. At least we’ve verified that we installed everything correctly.

Your First Test

Before you can test anything, you should probably have some content to test on. So let’s quickly wire up a simple view in the blog app.

# blog/views.py
from django.http import HttpResponse
 
def quick_test(request):
	return HttpResponse("Hello testing world!");
# learning_lettuce/urls.py
from django.conf.urls import patterns, include, url
 
from blog.views import quick_test
 
urlpatterns = patterns('',
    url(r'^quick-test/$', quick_test),
)

Great! Now when you go to http://127.0.0.1:8000/quick-test/ you should see “Hello testing world!”.

The next step is to create a folder inside of the blog app called “features”. And inside of that create a file called “test.feature”. It’s worth noting that Lettuce doesn’t actually care what your file is named, so long as the extension is “.feature”. In “test.feature”, add the following:

Feature: Test
	As someone new to testing
	So I can learn behavior driven development
	I want to write some scenarios
 
	Scenario: I can view the test page
		Given I am at "/quick-test/"
		Then I should see "Hello testing world!"

Look at all that plain english! Even without me telling you anything, you can probably figure out what we’re trying to test. But let me break it down for you.

  • Line 1: This loosely describes what all of the scenarios below are testing. Think of it as a way to logically group tests together.
  • Lines 2-4: This is the narrative. It explains why you’re testing in the first place.
  • Line 6: The title of your scenario. This describes what you are specifically testing in this instance.
  • Lines 7-8: These are called “steps”. Steps are how you test your scenario. Each step maps to a method in your code.

Alright, so now that you have your first test written, run it using “./manage.py harvest”. You should see the following:
Learning Lettuce - Unimplemented Steps
Look at all that beautiful output! But what does it all mean?! It’s telling you that Lettuce attempted to run one scenario, and that the two steps within that scenario aren’t implemented yet (remember, each step maps to a method in your code). And because Lettuce is great, it gives you some code to help you implement those two steps.

The Terrain File

Lettuce keeps all of it’s settings and configuration is a file called terrain.py in the root of your Django project. It’s here that we’re going to configure the test database, Firefox, and Selenium. Go ahead and create a terrain.py file in the root of your Django project, and drop the following in it.

from django.core.management import call_command
from django.test.simple import DjangoTestSuiteRunner
 
from lettuce import before, after, world
from logging import getLogger
from selenium import webdriver
 
try:
	from south.management.commands import patch_for_test_db_setup
except:
	pass
 
logger = getLogger(__name__)
logger.info("Loading the terrain file...")
 
@before.runserver
def setup_database(actual_server):
	'''
	This will setup your database, sync it, and run migrations if you are using South.
	It does this before the Test Django server is set up.
	'''
	logger.info("Setting up a test database...")
 
	# Uncomment if you are using South
	# patch_for_test_db_setup()
 
	world.test_runner = DjangoTestSuiteRunner(interactive=False)
	DjangoTestSuiteRunner.setup_test_environment(world.test_runner)
	world.created_db = DjangoTestSuiteRunner.setup_databases(world.test_runner)
 
	call_command('syncdb', interactive=False, verbosity=0)
 
	# Uncomment if you are using South
	# call_command('migrate', interactive=False, verbosity=0)
 
@after.runserver
def teardown_database(actual_server):
	'''
	This will destroy your test database after all of your tests have executed.
	'''
	logger.info("Destroying the test database ...")
 
	DjangoTestSuiteRunner.teardown_databases(world.test_runner, world.created_db)
 
@before.all
def setup_browser():
	world.browser = webdriver.Firefox()
 
@after.all
def teardown_browser(total):
	world.browser.quit()

In your settings.py file, you’re going to need some additions too.

# Nose
TEST_RUNNER = 'django_nose.NoseTestSuiteRunner'
 
# Lettuce
LETTUCE_SERVER_PORT = 9000

We use the Nose test runner because it’s faster than Django’s default test runner, and we change the server port for running tests so it doesn’t collide with our development server. At this point if you run `./manage.py harvest` again, you’ll still get notices for unimplemented steps, but you’ll also see Firefox open and close real quick. That means we’ve done our job correctly.

Your First Step Definition

Alright, lets make something happen. If you look at the output from the harvest command, you’ll see that it gave you some code to help you implement the new steps that you wrote. Go ahead and copy that code into the bottom of the terrain.py file (and make sure to import ‘step’ at the top). Now, re-run ./manage.py harvest. You should get the following output.
Lettuce - Failing ouput
So why did our steps fail? If you look that the code that was generated for you, there is a line that essentially says “False is equal to some string”. This is obviously not true, so our step fails. So why don’t we make the test pass? We’re going to change a few things:

  • Change the decorator – We want this step to match even if we use other Gherkin keywords like “when”, “and”, and “then”.
  • Change the function name and args – “group1” isn’t very descriptive
  • Write the code – We need this to do something, and right now it doesn’t!
@step(u'I am at "([^"]*)"')
def i_am_at_url(step, url):
    world.browser.get(url)

Now if you run ./manage.py harvest command again your tests will still fail, but this time for a different reason. The reason is the url that we’re passing into the step definition isn’t well formed. We were hoping to be able to pass relative urls in, but we can’t. So go ahead and modify the step in your scenario to look like this.

Given I am at "http://127.0.0.1:9000/quick-test/"

Run ./manage.py harvest again. You’ll see one passing test and one failing test!
Lettuce - Passing and failing

To make the next step pass, we need to make our web page a bit more formal. Go ahead and create a folder called “templates” inside of the “blog” app. Inside that folder, add a file called “base.html” and populate it with:

<html>
	<head>
		<title>Learning Lettuce!</title>
	</head>
	<body id='content'>
		{% block content %}{% endblock %}
	</body>
</html>

Now create a file called “blog.html” inside the same folder. Give it the following content:

{% extends "base.html" %}
 
{% block content %}
Hello testing world!
{% endblock %}

You’ll also need to update the view:

from django.shortcuts import render_to_response
 
def quick_test(request):
	return render_to_response("blog.html", {})

And you’ll need to update your settings file.

## Add this at the top of settings.py
import os.path
root = os.path.dirname(__file__).replace('\\','/')
 
## Make your TEMPLATE_DIRS variable look like this
TEMPLATE_DIRS = (
    root + "/../blog/templates/",
)

Now that our template is more formalized, lets update the step definition in “terrain.py”.

@step(u'I should see "([^"]*)"')
def i_should_see_content(step, content):
	if content not in world.browser.find_element_by_id("content").text:
		raise Exception("Content not found.")

This code explains itself pretty easily. We check to see if the content that is passed in via the step exists inside the body of the page. This has a few drawbacks:

  • What if we don’t want to check the body? What if we want to check a different element?
  • What if the content isn’t visible? (CSS hidden)

Since this is a simple example, we’re going to ignore these issues for now and just run our tests.
Lettuce - Passing tests!
Passing tests!

Next Steps

Now that you have passing steps, you’re well on your way to writing serious integration tests for your code. But there is still a lot more to learn. The next article in this series will cover using Lettuce Webdriver to handle common step definitions, tables, scenario outlines, and much much more.

All code related to this post can be found at https://github.com/vital101/Learning-Lettuce

Categories
Django Python

Lettuce Tags

Lettuce is a BDD (Behavior Driven Development) testing tool for Python based on the excellent Cucumber project. It has most of the same features that Cucumber has, and has proven invaluable in my projects. I discovered an undocumented feature the other day called “Tags”. Cucumber has them, so I also assumed that Lettuce had them. Tags allow you to selectively skip or run scenarios. For instance:

Feature: Some Feature
 
	Scenario: This is scenario 1
		Given I do stuff
		And I see stuff
		Then I am stuff
 
	@mytag
	Scenario: This is scenario 2
		Given I do more stuff
		And I see more stuff
		Then I am more stuff

You can use tags in many ways.

lettuce --tag=mytag # Run only scenarios with this tag
lettuce --tag=-mytag # Don't run scenarios with this tag
./manage.py harvest --tag=mytag # Django/Lettuce way of using tags.
Categories
Django Python

Saving within a post_save signal in Django

One of the more useful features of the Django framework is it’s extensive signaling capabilities. The ORM throws off a handful of signals every time a model is initialized, modified, saved, or deleted. They include:

  • pre_init
  • post_init
  • pre_save
  • post_save
  • pre_delete
  • post_delete
  • m2m_changed
  • class_prepared

I tend to use the post_save signal fairly often as a good way to get around overriding the default save method on models. Recently though I ran into an issue where I was hitting the “maximum recursion depth exceeded” error when I was saving the current model from within the post_save signal. If you think about it, that makes a lot of sense. You save once, then save again in the signal and then it triggers the signal again. BOOM, infinite loop.

To get around the saving within a post_save signal problem, you just need to disconnect the post_save signal before you call save. After save, you can re-connect it.

from django.db.models import signals
signals.post_save.disconnect(some_method, sender=SomeModel)
some_instance.save()
signals.post_save.connect(some_method, sender=SomeModel)
Categories
Django Python

Programmatically Adding Users to Groups in Django

I recently needed to add a large number of users to a permission group in Django. I had a hard time finding a way to do this in the documentation, so I thought I’d share my solution.

from django.contrib.auth.models import Group, User
 
g = Group.objects.get(name='My Group Name')
users = User.objects.all()
for u in users:
    g.user_set.add(u)

Easy as pie. I originally attempted to do this from the perspective of a user, but as it turns out, doing it from the group perspective is much easier.

Categories
Django Python

Copying Django Model Objects

Django’s ORM is top notch. It provides facilities to do almost anything you can think of with a database, and if it doesn’t, it still lets you execute arbitrary SQL to your hearts content. I’ve been developing Django for close to 2 years now, and still discover facets of it that I never knew existed. For instance, I had a need to duplicate a row in a table, but give it a different primary key. After a quick Google search, I discovered that Django allows you to do the following to copy instantiated model objects.

my_model = MyModel.objects.get(pk=4)
my_model.id = None
my_model.save()

There are a few caveats with doing things this way.

  • Unique Constraints – If you have any unique constraints on the model, the save will not pass validation and fail.
  • ManyToMany Fields – If you need new copies of ManyToMany field values, you’ll need to handle this yourself.

That being said, in many cases duplicating a model instance is as easy as changing it’s ID and saving.