Unsettling story of proxy_next_upstream in Nginx

It turns out that quite a few people have bumped into a particular Nginx upstream handling behavior. Not that long ago I also joined the club. It usually manifests as unavailability of some service behind Nginx acting as a reverse proxy: clients start to get “502 Bad Gateway” after some change in the service behind Nginx. There are clearly some 500s on that service, but not nearly enough to justify an “all instances are down” situation. Another strong clue is “no live upstreams” in the Nginx error log and $upstream_response_time being zero, if you’re lucky enough to log it.
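If you don’t already log upstream details, a log format along these lines makes this failure mode much easier to spot, since $upstream_addr and $upstream_status list every server Nginx tried for a given request (the format name and log path are just examples):

# Goes into the http {} context.
log_format upstream_debug '$remote_addr "$request" $status '
                          'upstream_addr=$upstream_addr '
                          'upstream_status=$upstream_status '
                          'upstream_time=$upstream_response_time';

access_log /var/log/nginx/access.log upstream_debug;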


Load balancing in ngx_http_proxy_module

Obviously, Nginx is a solid piece of software with a ton of settings. It may take a lifetime to experience all their permutations in production under different workloads. No surprise that I didn’t even know about the existence of the proxy_next_upstream setting before that incident. Usually you have something as simple as the following:

upstream backend {
  server backend1:80;
  server backend2:80;
}

server {
  listen 80;
  server_name proxy;

  location / {
    proxy_pass http://backend;
  }
}

This basically tells Nginx to split incoming traffic in round-robin fashion between the backend1 and backend2 hostnames. The proxy_next_upstream setting is implicitly included in the following form:

proxy_next_upstream error timeout;

That means: for an incoming request intended for backend1, try the next upstream (backend2) if there is an error while connecting to backend1, passing the request, or reading the response header, or if a timeout occurs. It’s a sensible default that increases overall availability in case of temporary glitches with upstream servers (or even planned downtime).
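Two related knobs from the same module are worth knowing about: proxy_next_upstream_tries and proxy_next_upstream_timeout, which cap how many servers and how much time a single request may spend on retries. A minimal sketch (the numbers are purely illustrative):

location / {
  proxy_pass http://backend;

  # Retry on connection errors and timeouts (the default behavior).
  proxy_next_upstream error timeout;

  # Give up after trying at most 2 servers or after 5 seconds of retrying.
  proxy_next_upstream_tries 2;
  proxy_next_upstream_timeout 5s;
}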

When proxy_next_upstream in Nginx can hurt you badly

The proxy_next_upstream setting is not the whole story, though. When defining the upstream block itself, there is a bunch of per-server settings for failover (weights, for instance), including these two:

  • max_fails — how many failed attempts it takes to mark a server as unavailable, just one by default
  • fail_timeout — two-fold: the sliding window in which failures are counted AND how long to keep a failed server disabled; 10 seconds by default (both are spelled out explicitly in the snippet below)
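Spelled out explicitly in the upstream block from above, the defaults Nginx assumes look roughly like this:

upstream backend {
  # These are the implicit defaults: one failure within 10 seconds
  # takes the server out of rotation for those same 10 seconds.
  server backend1:80 max_fails=1 fail_timeout=10s;
  server backend2:80 max_fails=1 fail_timeout=10s;
}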

So let’s put it all together for the case of a temporary glitch on backend1 with the above config:

  • backend1 is online, but the service on port 80 isn’t listening (it’s restarting, for instance)
  • Nginx tries backend1 and gets a connection refused (RST), which counts as an “error” in terms of the proxy_next_upstream setting
  • since max_fails equals 1, this immediately excludes backend1 from the pool of healthy instances, per pretty standard load balancing logic
  • Nginx won’t even try to send requests to backend1 for the duration defined by fail_timeout (hence the zero $upstream_response_time in the logs)

So far so good; it all looks like a robust way of handling failures, especially if there’s a whole bunch of upstream servers.

This nice logic fails miserably, though, when proxy_next_upstream lists additional failure modes to react to. In particular, http_500 is a tough one.

Broken route + improper proxy_next_upstream = a match made in heaven

Let’s imagine your latest release has just one broken route that yields 500. And you have this in your reverse proxy config:

proxy_next_upstream error timeout http_500;

Congrats, that’s a recipe for disaster! Let’s replay the scenario (a reproduction sketch follows the list):

  • backend1 is all good
  • Nginx tries backend1 with that particular route, which yields a 500
  • since max_fails equals 1, this immediately excludes backend1 from the pool of healthy instances, per pretty standard load balancing logic
  • Nginx won’t even try to send requests to backend1 for the duration defined by fail_timeout
  • as soon as Nginx tries backend2 (which runs the exact same code with the exact same broken route), it also gets removed from the pool of healthy instances
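The whole scenario can be reproduced with nothing but Nginx itself, using plain Nginx servers as stand-ins for the backends. A minimal sketch; the ports and the /broken route are made up for the example, and everything goes inside the http {} block of a throwaway test config:

# Two stand-in backends, each with one intentionally broken route.
server {
  listen 8081;
  location /broken { return 500; }
  location /       { return 200 "ok from backend1"; }
}

server {
  listen 8082;
  location /broken { return 500; }
  location /       { return 200 "ok from backend2"; }
}

upstream backend {
  server 127.0.0.1:8081;
  server 127.0.0.1:8082;
}

# The reverse proxy with the dangerous setting.
server {
  listen 8080;
  location / {
    proxy_pass http://backend;
    proxy_next_upstream error timeout http_500;
  }
}

A single request to /broken on port 8080 comes back as 500 and quietly takes both backends out of rotation; a request to the perfectly healthy / within the next 10 seconds then gets a 502, and “no live upstreams” shows up in the error log.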

What is especially lovely: now a request to a totally healthy route won’t be able to make it through either. Clients will get 502 from Nginx for the duration of fail_timeout. You may get really confused about what causes these intermittent availability issues. Here are a few tips on how to get out of this:

  • Remove http_500 altogether if there’s no strong argument for why it’s even there
  • If transient, instance-specific 500s are really possible in your setup, I’d be really interested to look into that case =) Anyway, it’s still possible to set max_fails above 1 or a shorter fail_timeout if that helps (see the sketch after this list). Beware that a broken route will still hit you for whatever fail_timeout is.
  • It’s also possible to set max_fails to zero, so Nginx won’t disable upstream servers at all. The downside is a generally somewhat chattier reverse proxy.
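Here is roughly what the last two options could look like in the upstream block; the exact numbers are illustrative, not recommendations:

upstream backend {
  # Tolerate a few failures before taking a server out,
  # and take it out for a shorter period.
  server backend1:80 max_fails=3 fail_timeout=5s;
  server backend2:80 max_fails=3 fail_timeout=5s;

  # Or: never take servers out of rotation based on failed requests.
  # server backend1:80 max_fails=0;
  # server backend2:80 max_fails=0;
}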

Nginx Plus has full-fledged active health checks for upstream instances instead of the “poor man’s” version we just discussed. That may be an option too if you can afford it.
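For the record, an active health check there looks roughly like this (Nginx Plus only; the /healthz endpoint and the numbers are assumptions made for the sake of the example):

upstream backend {
  zone backend 64k;   # shared memory zone, needed for active health checks
  server backend1:80;
  server backend2:80;
}

server {
  listen 80;

  location / {
    proxy_pass http://backend;

    # Probe each upstream out-of-band instead of judging health
    # from real client traffic.
    health_check uri=/healthz interval=5s fails=2 passes=2;
  }
}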