I’ve been tinkering with performance engineering (PE) as one of my areas of interest for a while now. Along the way I’ve built up a few mental models / metaphors that help with reasoning about systems performance. Let’s go through these mental models for performance engineering one by one.
The most significant milestone so far was getting through the astonishing book “Systems Performance” by Brendan Gregg =) That’s 800+ pages of hardcore, deep technical content, but it’s totally worth the effort. And it’s full of practical advice which helps a lot in the field.

System as a Request Funnel
Most IT systems out there process requests / transactions in one way or another. Viewed as a black box, such a system accepts requests on one side and spits out responses on the other. Setting aside the effect of load shape, in simplistic terms one of the properties of such a black box is its performance in requests / transactions per second (RPS / TPS).
Here comes the first useful metaphor: the system as a black box contains a funnel inside. Requests pour into it and responses flush out of it. Whenever the input stream (in RPS / TPS) is above the funnel’s capacity, we end up with water pooling on top. That results in increased latency for the requests which were unlucky enough to get caught on top of the funnel.
If we zoom into the funnel itself, we find out that it’s not just a single straight pipe. It’s actually composed of a lot of shorter pipes of different radii. And in full accordance with Goldratt’s theory of constraints, the resulting RPS/TPS of the whole funnel is constrained by the narrowest pipe.
When you want to increase overall RPS/TPS, you should look for the constraint and make that pipe wider. At some point it’ll become as wide as the next narrowest pipe, and widening it even more won’t make any difference. To increase overall RPS/TPS further, you should turn your attention to the next constraint.
Typical approaches to “widen the pipe” are increasing the resources available to the corresponding component or making it work more efficiently by tuning the implementation.
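To make the constraint idea concrete, here’s a tiny back-of-the-envelope sketch in Python. The stage names and capacities are made up for illustration: end-to-end throughput is simply the minimum of the per-stage capacities, so widening anything but the current bottleneck changes nothing.

```python
# Toy model of the funnel: hypothetical per-stage capacities in RPS.
# End-to-end throughput is capped by the narrowest "pipe".
stages = {"load balancer": 5000, "app workers": 800, "database": 1200}

bottleneck = min(stages, key=stages.get)
print(f"overall capacity: {stages[bottleneck]} RPS, constrained by '{bottleneck}'")

# Widening a non-bottleneck stage doesn't help...
stages["database"] = 2400
print(min(stages.values()))   # still 800

# ...and widening the bottleneck helps only until the next constraint kicks in.
stages["app workers"] = 1500
print(min(stages.values()))   # now 1200, the database is the new bottleneck
```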
Resource Utilization as a Leaky Bucket
There’s a couple of fundamental concepts in PE: resource utilization & saturation. One way to express utilization is the fraction of time the resource was busy, which suits a CPU. Another is how much of the resource’s capacity is in use, which suits RAM. Either way, utilization is always a fraction from 0 to 1 which can be expressed as a percentage. Typical tools report both CPU & RAM utilization that way.
Saturation is the amount of excess work piled up “in front of” the resource, queued for processing. For a CPU core that would be the corresponding run queue length or scheduler latency.
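As a minimal illustration (assuming Linux and its /proc interface), utilization can be sampled as the busy-time fraction from /proc/stat, and a rough saturation signal is the count of runnable tasks from /proc/loadavg:

```python
# Sketch: CPU utilization as the fraction of time the CPU was busy over a
# 1-second interval, plus a rough saturation indicator (runnable task count).
import time

def cpu_idle_and_total():
    with open("/proc/stat") as f:
        values = list(map(int, f.readline().split()[1:]))  # aggregate "cpu" line
    return values[3] + values[4], sum(values)               # idle + iowait, total

idle_1, total_1 = cpu_idle_and_total()
time.sleep(1.0)
idle_2, total_2 = cpu_idle_and_total()

utilization = 1 - (idle_2 - idle_1) / (total_2 - total_1)

with open("/proc/loadavg") as f:
    runnable = int(f.read().split()[3].split("/")[0])       # includes this process

print(f"CPU utilization: {utilization:.0%}, runnable tasks: {runnable}")
```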
With that in mind, you can think of resource utilization / saturation as a leaky bucket. When you pour in just enough water that it gets through right away, your resource is still underutilized. Whenever water stays in the bucket for a while, that’s saturation. Saturation always results in additional latency for work items while they sit in the bucket.
Resource performance may also degrade under load. For instance, rotational hard disks are well known to have a tipping point around 60% utilization. Before that point latency grows insignificantly and in a roughly linear fashion. After the 60% utilization point latency grows exponentially. So in such a case we effectively get a saturation-like situation well before the resource is reported as 100% utilized.
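The non-linear blow-up is easy to feel with the textbook M/M/1 queueing approximation, where mean response time is service time divided by (1 - utilization). The service time below is an arbitrary number just to show the shape of the curve:

```python
# Rough illustration of why latency blows up as utilization approaches 100%:
# for a single M/M/1-style queue, mean response time R = S / (1 - U), where
# S is the service time and U the utilization. Numbers are made up.
service_time_ms = 5.0

for utilization in (0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95):
    response_ms = service_time_ms / (1 - utilization)
    print(f"U = {utilization:.0%}  ->  mean latency {response_ms:6.1f} ms")
```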
The basic rule of thumb is simple: a resource should always stay underutilized, or you get additional latency. In reality it’s much more convoluted, of course. Sporadic spikes won’t hurt much, though they still add latency to the affected work items. And the system as a whole is a complex network of resources to manage.
Individual Request Latency as a Conveyor
It’s typically possible to reason about some parts of the system as simple queueing systems with one service center (the part that actually processes work items) and a queue in front of it which holds the items to process next. Individual PHP-FPM worker processes are one example, as are the threads in a thread pool.
You can think of such a component as a conveyor. In the general case boxes on the conveyor can be vastly different in size, resulting in very different processing times. Moreover, a particularly big box can “clog” the conveyor, so while it’s being processed even the tiniest boxes are stuck behind it. That results in a huge spread in individual request latency.
Let’s consider an example: we have a steady arrival rate of 1 RPS but varying request processing times. Typically requests are processed in under 100ms, so everything runs super smooth. Then we get one request which takes 5 seconds to process. By the time it’s finally processed there will be 4 requests waiting in the queue. If these are all typical ones, they’ll be done within a second. However, instead of 100ms their latencies will range roughly from 1400 to 4100 milliseconds, depending on when each of them arrived.
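Here’s a minimal single-server FIFO simulation of that exact scenario; it confirms that the four queued requests come out with latencies of roughly 4.1, 3.2, 2.3 and 1.4 seconds instead of 100 ms:

```python
# Tiny simulation of the conveyor example: steady 1 RPS arrivals, one server,
# FIFO queue. Request 0 takes 5 s, every following request takes 100 ms.
arrival_interval = 1.0
service_times = [5.0] + [0.1] * 9     # seconds

server_free_at = 0.0
for i, service in enumerate(service_times):
    arrival = i * arrival_interval
    start = max(arrival, server_free_at)  # wait until the server is free
    server_free_at = start + service
    latency = server_free_at - arrival
    print(f"request {i}: latency {latency * 1000:.0f} ms")
```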
From a practical standpoint this means that for predictable latency it’s unproductive to mix massively different work items. For instance, allowing users to fetch arbitrarily large data sets is usually a bad idea. It’s better to have a defined maximum page size, which then corresponds to an upper bound on the latency of such requests. I touched on the topic of making backend logic scalable in one of the earlier posts. Another example is arbitrary SQL queries: that’s a whole world to explore, so I’ll just mention it here!
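For example, a hypothetical list endpoint could clamp the requested page size to a hard cap, which puts an upper bound on the amount of work (and thus latency) any single request can cause. The names and numbers here are illustrative only:

```python
# Hypothetical pagination guard: never let a single request fetch an unbounded
# amount of data, so per-request work stays within a known upper bound.
MAX_PAGE_SIZE = 100
DEFAULT_PAGE_SIZE = 20

def clamp_page_size(requested_size: int) -> int:
    # Fall back to a sane default for non-positive values, cap everything else.
    if requested_size <= 0:
        return DEFAULT_PAGE_SIZE
    return min(requested_size, MAX_PAGE_SIZE)

print(clamp_page_size(25))      # 25
print(clamp_page_size(10_000))  # 100, the hard cap
```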
Wrap Up on Mental Models for Performance Engineering
That’s all folks! Hope you find these mental models useful for day-to-day tasks.