The Dramatic Benefits of Django Subqueries and Annotations
If a web request is stretching into seconds, 99% of the time the database isn’t being used correctly. When using the ORM, its natural to want to think in Python. This mindset can be disastrous for your applications performance. Its possible to see 20–100x speedups for requests by switching to subqueries and annotations.
The Django 1.11 release turned the framework into a serious way to express queries. More robust annotations and subqueries make the vast majority of SQL possible in stock Django.
The Problem
One day you might be wondering why you django code is slow. You open up your monitoring and see a graph like this:
The red lines in the graph are trips to the database. 664 trips to be exact. Datadog saying that the database is only taking 20% of the time is deceptive. Datadog is monitoring the time it takes the database to fulfill a request. Each request is fast, but every piece of data that comes from DB has overhead attached to it. It cost resources to allocate memory, serialize and deserialize, and send data across the network.
Lowering the load on a properly indexed application database means taking less data out. The more work kept within the DB, the more efficient it is.
Doing this will turn 664 queries into 1 query and turn an 8 second request into a 250ms request.
Performance Checklist
Efficiency in Django is largely dependent on how you are using the ORM.
- Is the number of queries in a request constant regardless of data size?
- Are you only getting the data you absolutely need?
- Is the only loop over models in a template or serializer?
Fulfilling this checklist requires a move from thinking in Python to thinking in SQL.
Breaking out of the Python Mindset
Take an example of a City and imagine you want to add a computation to it. Its pretty straightforward to put it on the model.
This is the Python way way of thinking about things. The only time doing this makes sense is if you are presenting the data in a view. The pattern breaks down when doing anything else with the data. Using the computation across collections will lead to significant performance degradation. For example, filtering must be in the same place as the logic.
If only 5% of cities are dense, 19x the data retrieved is discarded. Filtering in the database is always faster than loading all the data and using Python. The difference is tiny with small data, but quickly grows as the amount of columns selected and the rows retrieved increase. For every unneeded row or column, Django has to do extra work to transform it into a model instance. Besides performance everything gets messy by confusing Python loops and SQL. What happens if you want to filter by another column or sort? Luckily, Django has answer to put things that deal with a City
collection: the QuerySet
!
Think in QuerySets instead of Models
Fixing the query for dense cities requires the use of an annotation. A convenient place for an annotation is on a custom queryset.
Annotations are surprisingly expressive. It can do pretty much everything Python can do (as long as it isn’t turing complete). It’s rarely more would efficient to pull the data into python and do the computation .
1 Query is Always Better than N Queries.
The fastest way to kill a Django app is to put a query inside of a loop.
[
state for state in State.objects.all()
if state.cities.dense_cities().exists()
]
Checking each exists()
will cause an extra query to the database. It might not be noticeable on a few, but it will quickly make your site crawl. Hundreds of 1–10ms queries add up.
A subquery can fix this. The most basic subquery can be passed directly into a filter statement. Remember QuerySets are lazy. Django will properly compose this into a subquery. Only 1 query will be made.
This is limited in its functionality. It is much more powerful to use subqueries as annotations.
OuterRef('id')
refers to the id
column of the publisher set. exists()
returns a boolean, so the subquery annotationExists
is used. Everything else is a normal queryset. The custom queryset method with annotations dense_cities
still works.
A more powerful example is counting the cities per state. Like exists()
, count()
returns a value and not a queryset. More complicated logic is needed.
Its a bit more messy than using .count()
because we have to group the rows on state_id
and then aggregate them. The subquery count logic should be in its own method.
Note that filtering on dense_cities=True
will always be more efficient than filtering on dense_city_count__gt=0.
Subqueries Instead of Joins
Django makes it so easy to use forward and reverse joins. The related name for city -> state
is cities
and we can use that in join within an aggregate to calculate the biggest city population by state.
State.objects.annotate(biggest_city_size=Max('cities__population'))
This works for simple querysets. It doesn’t work if you have default filtering on the CityQueryset. For instance if the QuerySet by default excludes certain items, these items will be included in the join. A subquery is more fitting here.
More verbose, but also more accurate. The join will also make Django tell the DB to group on all columns, including annotations, causing a performance hit. Also doing a reverse join will create a cartesian product and you will have more rows. This will mess with aggregates if you start to put subqueries in subqueries.
The biggest benefit of Subqueries is that they more closely follow they Django paradigm. Django’s support for arbitrary joins is very limited. SQL query planner should be smart enough to interchange joins for subqueries where they make sense.
Composition
One of the biggest bennefits of subqueries and annotations is that they are nestable.
Drawbacks
You have to know what calculations are needed so that annotations can be added before you make the query. This is difficult and the reason doing things the Python way is so appealing. Sometimes they are messy, but its easy to write helper methods for subquery_count
and subquery_aggregate
.
If you are having difficulty debugging performance of the query, take the subquery out, use EXPLAIN
, fix the problem, and put it back in.
When using subqueries and annotations make sure you aren’t annotating what you don’t need. Django will include everything you add with annotate. Do not add annotations by default. They need to be explicit.
Conclusions
When looking for performance issues, optimizing database access should be the first step. Subqueries and annotations are powerful tools to do this. They are made to be nested and composed into more complicated computations. They are compound interest; the more they are used in an application, the more powerful they become.