503 Backend Fetch Failed – WTF?

At a high level, “503 backend fetch failed” boils down to an HTTP status code from the 5xx group. The 5xx group consists of so-called “server errors”, which means that the client’s request seems valid overall but the server failed to fulfill it. Some common reasons for such behavior include:

A bug in the application code. That’s typically the 500 status (Internal Server Error), since it’s generic and the bug is unexpected.
The client talks to a proxy. The proxy forwards requests to an upstream service. The upstream service fails to respond correctly to the forwarded request. That’s the 502 status (Bad Gateway).
The server ran out of disk space. The actual status code depends on the server/app implementation, but usually that’s still the generic 500 status.

The 503 status itself (Service Unavailable) means a temporary failure. In other words, the client should be able to send the exact same request later and get a valid response. The server may include an HTTP header that tells the client when it’s a good time to retry the request. That’s the Retry-After header, which carries either a full date/time after which to retry or a delay in seconds to wait.
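The two forms of Retry-After mentioned above can be handled on the client side roughly like this. It is a minimal sketch using only the Python standard library; the function name `retry_after_seconds` and its `now` parameter are made up for illustration, and a real client would also cap the delay and limit the number of retries.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Return how many seconds to wait, or None if the header is missing or unusable."""
    if header_value is None:
        return None
    value = header_value.strip()
    # Form 1: a delay given as a non-negative integer number of seconds.
    if re.fullmatch(r"[0-9]+", value):
        return float(value)
    # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        # Treat a date without a usable zone as UTC (an assumption of this sketch).
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry right away.
    return max(0.0, (retry_at - now).total_seconds())
```

A client would call it with the raw header value from the failed response and sleep for the returned number of seconds before resending the same request.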

How is the 5xx status code to return determined?

That’s mostly up to app developers. That’s also the reason why a 5xx status code can be misleading: it happens when app developers aren’t careful about which status codes the app returns.

For instance, the 503 status sometimes appears when there are too many requests from a particular client. In other words, there’s a throttling rule / request quota set up for the client. That’s pretty common for API usage these days. However, there’s a specific status code for exactly this situation: 429 (Too Many Requests). The 503 status should be returned only when no request quota is exceeded but the server’s processing queue is already full. While we’re at it: I touched on the topic of backend scalability in this article.
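The difference between a per-client quota and a full processing queue can be sketched as a small decision function. This is only an illustration: the function `overload_status` and all of its parameters are hypothetical, and real servers track quotas and queue depth in their own ways.

```python
from http import HTTPStatus


def overload_status(client_requests, client_quota, queue_length, queue_capacity):
    """Pick a status code for a request arriving at a loaded server (illustrative only)."""
    # This particular client exceeded its own quota: the problem is on the client side.
    if client_requests > client_quota:
        return HTTPStatus.TOO_MANY_REQUESTS  # 429
    # The client is within quota, but the server has no capacity left right now.
    if queue_length >= queue_capacity:
        return HTTPStatus.SERVICE_UNAVAILABLE  # 503
    return HTTPStatus.OK
```

Checking the quota first keeps the 503 status reserved for cases where the server itself is the bottleneck.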

This also explains why the “503 backend fetch failed” error per se doesn’t mean anything specific. To understand why it appears, it’s essential to know which apps/servers are involved.

Why do I encounter the “503 backend fetch failed” error?

Here are some common examples of where one can bump into the “503 backend fetch failed” error in the wild:

Pokemon lovers who surf Bulbapedia encounter that from time to time:
- That means Bulbapedia’s ops team has an incident to resolve.
- As a user you can only wait for them to do their job.
- Since they don’t have a dedicated status page, all you can do is check the site itself or use a down detector service.
Developers using the Varnish caching server can get complaints about this error from users:
- This error is displayed when Varnish fails to fetch the requested data from the upstream server.
- There could be numerous reasons why it was unable to retrieve the data: the upstream could be down or unavailable; the connection could’ve closed before the cache server was done reading the response; the upstream code or plugins could be at fault, and so on.
- A few common pitfalls to check:
  - By default Varnish accepts upstream response headers of at most 8 KB per header (http_resp_hdr_len) and 32 KB for the whole header section (http_resp_size). If the upstream returns larger headers, Varnish will return the 503 status. In that case it should be reconfigured by adjusting http_resp_size and probably http_resp_hdr_len as well.
  - If Varnish health checks fail, it’ll also return the 503 status. Learn more about health checks and how to configure them from the docs.
Magento / Adobe Commerce developers can encounter that too:
- That may be related to Varnish cache which is used by Magento. Check out the above section for remedies.
- There’s a Magento-specific reason for this: the var/.maintenance.flag file. It’s generated by Magento when maintenance mode is enabled. If it wasn’t properly removed after the maintenance, the 503 status will pop up. Just remove the file via a file manager and refresh the website.
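For the Varnish header-size pitfall above, a quick way to test the hypothesis is to measure the headers the upstream actually sends. The sketch below is a rough check, not a Varnish tool: the function `oversized_headers` is hypothetical, the defaults of 8 KB and 32 KB are assumed values for http_resp_hdr_len and http_resp_size and should be verified on your own instance (for example with `varnishadm param.show http_resp_hdr_len`), and the total ignores the status line.

```python
def oversized_headers(headers, max_header_len=8 * 1024, max_total=32 * 1024):
    """Return a list of human-readable problems; empty if headers fit the limits."""
    problems = []
    total = 0
    for name, value in headers:
        # Approximate the header as it travels on the wire: "Name: value\r\n".
        line_len = len(f"{name}: {value}\r\n".encode("utf-8"))
        total += line_len
        if line_len > max_header_len:
            problems.append(f"{name}: {line_len} bytes exceeds http_resp_hdr_len")
    if total > max_total:
        problems.append(f"headers total {total} bytes exceeds http_resp_size")
    return problems
```

Feeding it the header pairs captured from a direct request to the upstream (bypassing Varnish) shows whether a large Set-Cookie or a long list of cache tags is a likely culprit.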
