Thinking Functionally with Python and Django at eShares Inc.
Be ready to revise any system, scrap any method, abandon any theory, if the success of the job requires it. — Henry Ford
When it comes to problem solving in Python, we often feel locked into the OOP way. Python has more than just an imperative style. Functional programming in Python is well known but not well used. Sometimes the benefits of a functional approach are unclear, and it becomes difficult to break out of the imperative world.
Switching to functional programming won’t magically make programs faster, but it can help to provide structure and expose problem areas.
At eShares, we used FP to make our data exports 20x faster by breaking report generation into discrete stages, which made it far easier to control how resources are consumed from the Django ORM.
Understanding the Problem at eShares
Our reports took several minutes to run. One central problem our team constantly faces with Django is that it makes accessing the database too easy. When we first approached report generation, the number of queries grew exponentially with the number of entries in the Excel document. Most of these were foreign key lookups in loops: Django made it very easy to query for data while populating a cell's value. This was bad.
The biggest problem was that the complicated Excel documents consisted of hundreds of nested functions. It was nearly impossible to isolate where the queries were coming from. Not only did this cause performance issues, but it increased our risk that we wouldn’t be able to understand the calculations being performed.
Creative Destruction
When dealing with optimizing a large project, it is hard to separate activity from progress. Patches to the existing code might seem great by shaving off 5% here or there, but sometimes the solution needs to be orders of magnitude better.
When it comes to revising code, there is a fear of letting go. We become worried that all the time spent building the current version will be wasted. But this is often illogical. It is hard to separate the code from the knowledge behind it. The model of the problem transcends code and programming languages and some code styles fit the model better than others.
So instead of iterating on the old solution, we took a leap, pushed delete, and tried something new. Sometimes starting over will be the most productive thing you ever do.
It was clear that whatever was going to replace the old report not only needed to be faster, but it also needed to be easy to understand. When dealing with complicated domains such as the math behind a cap table, it is important to code for humans first and speed second. Don’t compromise clarity for performance. Clear code has a tendency to be performant.
Our new model of the problem breaks Excel generation into three steps: data collection, data transformation, and data views (traditionally, views in Django are for the web, but an Excel view is just as valid a view as an HTML one). Each of these steps is discrete and provides very clear boundaries for debugging performance problems. There is a natural data flow, so pipelines map nicely as a solution.
A Pipeline Primer
A very useful concept in functional programming is a pipeline. Pipelines compose individual steps together and form one big function. The output of one function gets fed into the next function.
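As a minimal sketch with made-up stages, a pipeline can be as simple as each step consuming the previous step's output:

```python
# Each step consumes the previous step's output. (Hypothetical stages,
# purely for illustration.)
def strip_whitespace(lines):
    return (line.strip() for line in lines)

def drop_blanks(lines):
    return (line for line in lines if line)

raw = ["  alpha ", "", " beta", "   "]
cleaned = list(drop_blanks(strip_whitespace(raw)))  # ['alpha', 'beta']
```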
Start to think of data as being on an assembly line of specialized machines instead of on a workbench with a list of instructions. Henry Ford saw the power of this idea and became the father of mass production. The same 100-year-old insight can apply to your data. (Software design can be influenced by industry in many ways.)
Chaining Calls
Probably the most Pythonic pipeline comes in the form of a class whose methods return instances of the same class. Django's querysets exhibit this ability to build up an object.
Surprisingly, Python does not have anything similar for building iterators. Let's take an example where a generator of CSV rows gets converted to Book objects, filtered by year, sorted, and finally formatted. It quickly gets unwieldy.
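Here's a sketch of what that nesting looks like (in-memory rows stand in for the CSV reader, and the Book fields are illustrative):

```python
from collections import namedtuple

Book = namedtuple('Book', ['title', 'author', 'year'])

# In-memory rows stand in for a csv.reader generator.
rows = [
    ['Fluent Python', 'Ramalho', '2015'],
    ['Dive Into Python', 'Pilgrim', '2004'],
    ['Learning Python', 'Lutz', '1999'],
]

# Read the nesting inside out: parse, filter, sort, then format.
formatted = list(
    map(
        lambda book: '{0.title} by {0.author} ({0.year})'.format(book),
        sorted(
            filter(
                lambda book: book.year >= 2000,
                map(lambda row: Book(row[0], row[1], int(row[2])), rows),
            ),
            key=lambda book: book.year,
        ),
    )
)
```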
The major problem with this syntax is that the first action is the deepest in the nest. It’s not how a person thinks.
Expressing iterator transformation by chaining together functions would be better.
The conduit class is dead simple. If you want more robust features, there is PyFunctional and even an underscore.js port which does the same thing and much more.
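Those libraries offer far more, but a bare-bones version along these lines (the names here are a reconstruction, not our exact code) fits in a dozen lines:

```python
class Conduit:
    """Wraps an iterable so transformations chain left to right."""

    def __init__(self, iterable):
        self.iterable = iterable

    def map(self, fn):
        return Conduit(map(fn, self.iterable))

    def filter(self, predicate):
        return Conduit(filter(predicate, self.iterable))

    def sort(self, key=None):
        return Conduit(sorted(self.iterable, key=key))

    def __iter__(self):
        return iter(self.iterable)

# Reusing Book and rows from the previous sketch; the first action now
# comes first, and the steps read top to bottom.
formatted = list(
    Conduit(rows)
    .map(lambda row: Book(row[0], row[1], int(row[2])))
    .filter(lambda book: book.year >= 2000)
    .sort(key=lambda book: book.year)
    .map(lambda book: '{0.title} by {0.author} ({0.year})'.format(book))
)
```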
The conduit isn’t a performance upgrade, but it expresses data transformation in a clear, natural way.
Function Composition
Function composition is a pipeline technique that pieces together small functions into a larger function. It is very similar to chaining, except the end result is a function and not a result.
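A generic compose helper can be written in a couple of lines with functools.reduce; this is a common formulation, not necessarily the exact one we used:

```python
from functools import reduce

def compose(*functions):
    """Compose left to right: compose(f, g)(x) == g(f(x))."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions)

normalize = compose(str.strip, str.lower)
normalize('  Hello World  ')  # 'hello world'
```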
Chaining method calls is actually doing something very similar to function composition. Remember that obj.method() is the same as ObjClass.method(obj).
Using composition, each function is explicit. Using dot notation and chaining, if any method in the chain returns a subclass of Obj, that subclass's method is used. In Python we frequently want to turn everything into a class, but sometimes it is simpler and better to use small, freestanding functions. Function composition gives some of the power of classes with the simplicity of compact functions.
The Flat Data Pipeline
In our case, the data flow needs to be explicit. Our report consists of pulling Django objects which represent financial securities and doing computations on them. Django objects are dangerous: query counts can explode unexpectedly. Since we wanted to pull all data up front, passing Django objects to the Excel view was not an option. We needed to flatten the data before it reached the view. The view should only deal with strings, decimals, namedtuples, and other basic Python objects.
Our first attempt was a pipeline of functions, each of which took a two-element tuple: the "flat" object and the Django model instance. Each function copied over everything the Excel writer needed from the model to the flat object. This ensured that it was impossible to make a query while writing the CSV file.
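A reconstruction of that first attempt, with hypothetical field names, looked roughly like this:

```python
# Each step takes and returns the (flat, instance) pair, so the steps
# can be composed. Field names here are hypothetical.
def copy_label(state):
    flat, security = state
    flat['label'] = security.label
    return flat, security

def copy_issue_date(state):
    flat, security = state
    flat['issue_date'] = security.issue_date
    return flat, security

def flatten(security):
    flat, _ = copy_issue_date(copy_label(({}, security)))
    return flat
```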
Unfortunately, Python 3 no longer accepts pattern-matched tuples in function arguments, and the functions being composed were overly verbose. We decided to embrace Python's mutability and redesign our pipeline to better express what it was supposed to do: copy attributes over.
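The redesigned transformers simply assign attributes onto a mutable flat object; again, the names are illustrative:

```python
from types import SimpleNamespace

# Each transformer mutates the flat object in place; the signature says
# exactly what the step does: copy attributes over.
def copy_label(flat, security):
    flat.label = security.label

def copy_issue_date(flat, security):
    flat.issue_date = security.issue_date

TRANSFORMERS = [copy_label, copy_issue_date]

def flatten(security):
    flat = SimpleNamespace()
    for transform in TRANSFORMERS:
        transform(flat, security)
    return flat
```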
While it isn’t a purely functional approach, recognizing the strengths and weaknesses of the language can lead to cleaner code. Don’t swim upstream trying to achieve theoretical, immutable perfection in a mutable world.
Efficient Pipes
If the report is simple, the data can be directly piped into a CSV.
Using iterator() to pull data from the database makes the steps simple: read a row from the database, transform that row through the pipeline, and write the result to a CSV. Then repeat for the next row. It's an incredibly efficient way to write a report. Only one row is kept in memory at a time, and the roles of each piece of code are clearly defined.
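Assuming the flatten sketch above and a Security queryset, the streaming loop might look like this:

```python
import csv

def export_securities(queryset, path):
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Label', 'Issue Date'])
        # iterator() streams rows from the database instead of caching
        # the whole queryset, so only one row is alive at a time.
        for security in queryset.iterator():
            flat = flatten(security)
            writer.writerow([flat.label, flat.issue_date])
```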
Pure, Higher-Order Functions and Resource Management
Unfortunately, not all reports are that simple. Usually the transform step requires information not found directly on the element being transformed. The problem areas we identified were clarity and resource management.
Using classes invites complexity. It also makes it difficult to tell whether a function is pure. A pure function, when run with the same input, will always give the same output. This means no database access, no cache access, and even no clock access. To keep our transformers pure, we pass external resources in explicitly rather than hiding them in class state; the result is higher-order functions which conveniently document their requirements, and everything is explicit.
For example, a transformer that relies on the current export time to tell whether a security is canceled needs to be given the date.
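A higher-order function makes that dependency explicit: the outer function takes the export time, and the transformer it returns depends only on its arguments (the canceled_at field here is hypothetical):

```python
# The outer function declares the external resource; the transformer it
# returns is pure with respect to everything else.
def make_cancellation_transformer(export_time):
    def copy_is_canceled(flat, security):
        flat.is_canceled = (
            security.canceled_at is not None
            and security.canceled_at <= export_time
        )
    return copy_is_canceled

# Built once at the start of the export, e.g. with timezone.now().
```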
Now the transformation documents itself: to know whether a security is canceled, an external resource not contained on the Django object is needed.
This becomes especially important for database access. Some of the transformations for securities depend on other securities. We made it a goal to keep only one copy of each object in memory. Using select_related and prefetch_related won't help us here.
Instead of using iterator() to load the data, we load every piece of data into lists, then create indexes into the data using basic Python dictionaries. It's fast, efficient, and requires no dependencies.
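In sketch form, with hypothetical models:

```python
# Load everything up front, then index it with plain dictionaries.
# Model and field names are hypothetical.
securities = list(Security.objects.all())
holders = list(Holder.objects.all())

holders_by_id = {holder.id: holder for holder in holders}
securities_by_holder_id = {}
for security in securities:
    # security.holder_id is the raw FK value, so no query is triggered.
    securities_by_holder_id.setdefault(security.holder_id, []).append(security)

# Transformers now resolve relations in memory, so each object exists
# exactly once.
def copy_holder_name(flat, security):
    flat.holder_name = holders_by_id[security.holder_id].name
```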
Because all of the transformation functions are isolated, an assert statement can be added to the flatten function to be absolutely certain there are no queries during the transformations.
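One way to enforce that, extending the flatten sketch above and assuming settings.DEBUG is enabled wherever the check runs (Django only records connection.queries in debug mode):

```python
from types import SimpleNamespace
from django.db import connection

def flatten(security):
    queries_before = len(connection.queries)
    flat = SimpleNamespace()
    for transform in TRANSFORMERS:
        transform(flat, security)
    # Any accidental query during the transformations fails loudly here.
    assert len(connection.queries) == queries_before, 'query during transformation'
    return flat
```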
Final Thoughts
Everything that we built required no external dependencies. The glue, classes and support functions were only a few lines each. We were able to build a powerful data pipeline by taking advantage of the native features of Python.
We learned some important lessons:
- Break your process into discrete steps.
- Have one copy of data in memory.
- Embrace Python’s OOP and functional styles. Don’t go for purity.
- Make it as simple as possible. Rely on the standard lib.
- Focus on a solid foundation. Save optimizations till the end.
- Don’t be afraid to throw code away or comporomise.
The results are dramatic. Exponentially growing query counts that reached up to 7,000 for a single report are down to about 30, and generation times for some customers dropped from 600 seconds to 15 seconds. Now it is easier to detect where the performance problems are. Separating the different stages lets us see that openpyxl's styles cause a 2x performance drop and which transformations are expensive. Those have not been optimized yet.
Our code is faster, cleaner, and more reusable. Since the code is broken into distinct parts, there is nothing stopping us from piping the data to a JSON API instead of an Excel document.
Software design is only one component of the problem. The reports run on worker servers and are processed through a queue system. Improving them is a cross-domain effort. By optimizing other parts of the system, we believe we can get all reports down to 5 seconds or less. These are things that can be optimized now that we have a solid foundation.
Functional programming didn’t do anything magical. All it did was bring clarity and structure to a complex process.
This project was a combined effort of the eShares team. Special shout-out to Eli Wilson who pulled the project together.