Scalable Backend Secret Sauce

There’s one implementation principle for scalable backend systems that I consider among the most important: minimizing the compute resource footprint of the request handler for every possible request. The principle isn’t limited to handlers in request/response scenarios such as REST or GraphQL APIs. Much the same logic applies to session-based handlers such as WebSocket handlers, and it remains key if you aim to squeeze the most scalability out of the backend system you work on.

A simple example to illustrate

Let’s dive right in with a simple Python example. Consider the following two simple Django view functions:

from django.http import HttpRequest, Http404
from myapp.models import MyModel


def view_with_len(request: HttpRequest, model_pk: int):
    if len(MyModel.objects.filter(pk=model_pk)) == 0:
        raise Http404("No model with such PK")
    # Rest of the code isn't relevant for this example


def view_with_orm_count(request: HttpRequest, model_pk: int):
    if MyModel.objects.filter(pk=model_pk).count() == 0:
        raise Http404("No model with such PK")
    # Rest of the code isn't relevant for this example

Let’s put aside the somewhat controversial way of checking whether the specified row exists in MyModel’s database table. In practice we would likely use an EAFP-style try/except around the Django object manager’s .get method in this particular case. Nevertheless, this implementation makes for a good demo of the footprint minimization approach.

The problem for a scalable backend

At first sight both pieces of code do pretty much the same thing. Moreover, while system load is low, both behave almost identically. The situation changes drastically as the system’s load increases. There are two main factors: the number of rows matched in MyModel’s database table and the parallel execution of request handlers.

Caveat: by parallel execution I mean any way of having multiple in-flight requests. The difference in compute footprint between these two snippets is about memory rather than CPU, so all it takes is multiple request-handling structures held in memory at the same time. That can be achieved with user-level or OS-level threads of execution in the same process space. Coroutines/greenlets or multiple OS processes have a similar effect.
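To make the multiplication concrete, here is a back-of-the-envelope sketch. The function name and all the numbers are hypothetical, chosen only for illustration; real per-row costs depend on the model and the Python runtime.

```python
def estimated_peak_bytes(in_flight_requests: int, rows_per_request: int, bytes_per_row: int) -> int:
    """Rough upper bound: every in-flight handler holds its own copy of the rows."""
    return in_flight_requests * rows_per_request * bytes_per_row


# Hypothetical load: 200 in-flight requests, each materializing 10,000 rows
# of roughly 500 bytes apiece.
materializing = estimated_peak_bytes(200, 10_000, 500)

# Same load, but each handler only holds a single 8-byte counter.
counting = estimated_peak_bytes(200, 1, 8)

print(f"materializing rows: {materializing / 1e9:.1f} GB")
print(f"counting in the DB: {counting} bytes")
```

The concurrency model does not appear in the formula at all: what matters is only how many handlers hold their data at the same moment.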

With that clarified, let’s continue. As both the number of parallel executions and the number of matched rows rise, view_with_len consumes more and more memory compared to view_with_orm_count, which makes the former far less scalable than the latter. The reason is simple: the QuerySet .count method issues a single SQL query (a SELECT COUNT(*)) that returns only the number of rows. The built-in len() function, by contrast, forces the ORM to fetch every matching row and populate a Python MyModel object for each one before calculating the length of the resulting collection. With a primary-key filter that is at most one row, but the same len() pattern applied to a filter that matches many rows does an enormous amount of unnecessary work on large datasets. It can even lead to stability issues, e.g. timeouts, rather than just hampering the scalability of the system.
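The difference can be observed without a Django project. The sketch below uses the standard library’s sqlite3 module as a stand-in for the ORM and tracemalloc to measure peak Python memory. The table layout, the helper names and the row count are assumptions made for this illustration; the two functions only mimic what len() on a queryset and .count() do, they are not Django’s implementation.

```python
import sqlite3
import tracemalloc


def build_db(row_count: int) -> sqlite3.Connection:
    """Create an in-memory table filled with `row_count` rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO mymodel (payload) VALUES (?)",
        (("x" * 100,) for _ in range(row_count)),
    )
    conn.commit()
    return conn


def count_with_len(conn: sqlite3.Connection) -> int:
    # Mimics len(queryset): pull every row into Python, then count the list.
    return len(conn.execute("SELECT id, payload FROM mymodel").fetchall())


def count_in_db(conn: sqlite3.Connection) -> int:
    # Mimics queryset.count(): the database counts, Python receives one integer.
    return conn.execute("SELECT COUNT(*) FROM mymodel").fetchone()[0]


def peak_bytes(func, conn: sqlite3.Connection):
    """Return (result, peak bytes of Python allocations made while running func)."""
    tracemalloc.start()
    try:
        result = func(conn)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    conn = build_db(20_000)
    for func in (count_with_len, count_in_db):
        result, peak = peak_bytes(func, conn)
        print(f"{func.__name__}: rows={result}, peak={peak} bytes")
```

Both functions return the same number, but the first one’s peak grows with the number of rows while the second one’s stays flat. Multiply that gap by the number of in-flight requests and the scalability difference becomes visible.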

Application of the principle

The principle of a minimal compute resource footprint sounds easy in theory, but it can be really hard to apply consistently in practice when the differences are as subtle as in the example we just covered. For every request handler there is likely an optimal logic flow that minimizes its footprint. Getting to a scalable backend usually comes down to just two things. The first is short-circuiting execution as early as possible, for lower CPU consumption. The second is loading into memory only the data that is strictly required at the current stage of the request handler’s execution, for lower memory usage. Here are some examples of applying these practices:

  • Validate the input present in the request right away and short-circuit execution on any failed validation.
  • Validate system state while materializing only the minimal required dataset, and short-circuit execution if the operation shouldn’t be permitted.
  • Ensure the dataset used by the request handler is always as small as possible for the task at hand.
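The three practices above can be sketched in a framework-agnostic handler. Everything here is hypothetical: the handle_rename function, the mymodel table with its name and locked columns, and the status codes are assumptions for illustration, again with sqlite3 standing in for the real data layer.

```python
import sqlite3
from typing import Tuple


def handle_rename(conn: sqlite3.Connection, raw_pk: str, new_name: str) -> Tuple[int, str]:
    """Hypothetical handler: rename a row unless it is locked."""
    # 1. Validate input first; no database round trip for bad requests.
    try:
        pk = int(raw_pk)
    except ValueError:
        return 400, "pk must be an integer"
    if pk <= 0:
        return 400, "pk must be positive"
    if not new_name or len(new_name) > 50:
        return 400, "name must be 1-50 characters"

    # 2. Validate system state with the smallest dataset that answers the
    #    question: a single column of a single row.
    row = conn.execute("SELECT locked FROM mymodel WHERE id = ?", (pk,)).fetchone()
    if row is None:
        return 404, "No model with such PK"
    if row[0]:
        return 409, "model is locked"

    # 3. Only now do the actual work, still without loading the full row.
    conn.execute("UPDATE mymodel SET name = ? WHERE id = ?", (new_name, pk))
    conn.commit()
    return 200, "ok"
```

Each early return ends the request before the next, more expensive step starts, and at no point does the handler hold more than one small tuple from the database.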

Awareness across the development team or organization is key to the long-term maintainability of well-optimized request handlers. An unaware developer can unintentionally break the optimal flow by introducing new behavior, or even during refactoring. The best practice is to make this an inherent part of the overall development culture if scalability really matters for the systems people work on.