It turns out that quite a few people have bumped into a particular Nginx upstream handling behavior, and not that long ago I joined the club too. It usually manifests as unavailability of some service behind Nginx acting as a reverse proxy: clients start getting “502 Bad Gateway” after some change in the service behind Nginx. There clearly were some 500s from that service, but not enough to justify an “all instances are down” situation. Another strong clue is “no live upstreams” in the Nginx logs and $upstream_response_time being zero, if you are lucky enough to log that.
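If you want that kind of visibility up front, exposing the upstream variables in the access log makes this failure mode much easier to spot. A minimal sketch, assuming you can edit the http block; the format name upstream_debug and the log path are just examples, while $upstream_addr, $upstream_status and $upstream_response_time are standard ngx_http_upstream_module variables:
# goes into the http {} context
log_format upstream_debug '$remote_addr [$time_local] "$request" '
                          'status=$status upstream=$upstream_addr '
                          'upstream_status=$upstream_status '
                          'upstream_time=$upstream_response_time';

access_log /var/log/nginx/access.log upstream_debug;
When a request gets retried on another server, $upstream_addr and $upstream_status turn into comma-separated lists of all attempts, which is exactly the breadcrumb trail you want during an incident like this.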

Load balancing in ngx_http_proxy_module
Obviously, Nginx is solid software with a ton of settings; it may take a lifetime to experience all their permutations in production under different workloads. No surprise that I didn’t even know the proxy_next_upstream setting existed before that incident. Usually you may have something as simple as the following:
upstream backend {
    server backend1:80;
    server backend2:80;
}

server {
    listen 80;
    server_name proxy;

    location / {
        proxy_pass http://backend;
    }
}
This basically tells Nginx to split incoming traffic round-robin between the backend1 and backend2 hostnames. The proxy_next_upstream setting is implicitly included in the following form:
proxy_next_upstream error timeout;
That means: for an incoming request intended for backend1, try the next upstream (backend2) if there’s an error connecting to backend1 or a timeout. It’s a sensible default that increases overall availability in case of temporary glitches with upstream servers (or even planned downtime).
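For reference, here’s what that default looks like spelled out, along with two stock Nginx directives, proxy_next_upstream_tries and proxy_next_upstream_timeout, that put an upper bound on the retrying; the values below are illustrative assumptions, not recommendations:
location / {
    proxy_pass http://backend;

    # the implicit default: fail over to the next server only on
    # connection errors and timeouts
    proxy_next_upstream error timeout;

    # optional caps: at most 2 attempts in total,
    # and stop retrying after 5 seconds
    proxy_next_upstream_tries 2;
    proxy_next_upstream_timeout 5s;
}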
When proxy_next_upstream in Nginx can hurt you badly
The proxy_next_upstream setting is not the whole story, though. When defining the upstream block itself there’s a bunch of failover-related settings (weights, for instance), including these two guys (both shown explicitly in the sketch below):
- max_fails: how many failed attempts to a server within fail_timeout it takes before Nginx considers that server unavailable; by default just one
- fail_timeout: two-fold, it is both the sliding window for counting failures AND how long to keep a failed server disabled
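Here’s the same upstream block with those two settings written out explicitly; the values shown are just the documented defaults, so this sketch behaves exactly like the original config:
upstream backend {
    # max_fails=1 fail_timeout=10s are the defaults;
    # writing them out makes the failover behavior visible
    server backend1:80 max_fails=1 fail_timeout=10s;
    server backend2:80 max_fails=1 fail_timeout=10s;
}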
So let’s pull it all together for the case of a temporary glitch on backend1 with the above config:
- backend1 is online, but the service on port 80 isn’t listening (it’s reloading, for instance)
- Nginx tries backend1 and gets an RST, which counts as “error” in terms of the proxy_next_upstream setting
- since max_fails is equal to 1, that immediately excludes backend1 from the pool of healthy instances, per pretty standard load balancing logic
- Nginx won’t even try to send requests to backend1 for the duration defined by fail_timeout (hence the zero $upstream_response_time in the logs)
So far so good, it all looks like a robust way of handling things, especially if there’s a whole bunch of upstream servers.
This nice logic fails miserably, though, when proxy_next_upstream lists more failure modes to react to. In particular, http_500 is a tough one.
Broken route + improper proxy_next_upstream = a match made in heaven
Let’s imagine your latest release has just one broken route that yields a 500. And you have this in your reverse proxy config:
proxy_next_upstream error timeout http_500;
Congrats, that’s a recipe for disaster! Let’s replay the scenario:
- backend1 is all good
- Nginx tries backend1 with that particular route, which yields a 500
- since max_fails is equal to 1, that immediately excludes backend1 from the pool of healthy instances, per pretty standard load balancing logic; Nginx won’t even try to send requests to backend1 for the duration defined by fail_timeout
- as soon as Nginx tries backend2 (which runs exactly the same code with exactly the same broken route), it also gets removed from the pool of healthy instances
What is especially lovely: now even a request to a totally healthy route won’t be able to make it through. Clients will get 502 from Nginx for the duration of fail_timeout. You may get really confused about what causes these intermittent availability issues. Here are a few tips on how to get out (with a combined config sketch after the list):
- Remove http_500 altogether if there’s no strong argument for why it’s even there
- If you really do have transient, instance-specific 500s, I’d be really interested to look into that case =) Anyway, it’s still possible to set max_fails above 1 or use a shorter fail_timeout if that helps. Beware that the broken route will still hit you for whatever fail_timeout is.
- It’s also possible to set max_fails to zero, so Nginx won’t disable upstream servers at all. The downside is a reverse proxy that is generally a bit more chatty.
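Putting the tips together, a more forgiving setup might look like the sketch below; the particular numbers are assumptions you’d tune for your own workload:
upstream backend {
    # tolerate a few failures within a shorter window before
    # taking an instance out of rotation
    server backend1:80 max_fails=3 fail_timeout=5s;
    server backend2:80 max_fails=3 fail_timeout=5s;

    # alternative: max_fails=0 never marks a server as unavailable
    # server backend1:80 max_fails=0;
}

server {
    listen 80;
    server_name proxy;

    location / {
        proxy_pass http://backend;
        # no http_500 here: an application-level 500 is not
        # a reason to fail over to the next server
        proxy_next_upstream error timeout;
    }
}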
Nginx Plus has full-fledged health checks for upstream instances instead of the “poor man’s” version we just discussed. That may be an option too, if you can afford it.
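For completeness, here’s a rough sketch of an Nginx Plus active health check (directive names come from the ngx_http_upstream_hc_module documentation; the /healthz endpoint, the match name and the zone size are assumptions):
upstream backend {
    zone backend 64k;   # shared memory zone, required for active health checks
    server backend1:80;
    server backend2:80;
}

# treat only 2xx/3xx probe responses as healthy
match backend_ok {
    status 200-399;
}

server {
    listen 80;
    server_name proxy;

    location / {
        proxy_pass http://backend;
        # Nginx Plus only: probe /healthz every 5s, take a server out after
        # 3 failed probes and bring it back after 2 successful ones
        health_check uri=/healthz interval=5s fails=3 passes=2 match=backend_ok;
    }
}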