While running a load test that injects 10k TPS across 3 Nginx instances, we are seeing spikes of errors where Nginx returns HTTP 502 and logs the message 'no live upstreams while connecting to upstream'. No other errors, such as connection errors, are logged.
Also, we have a single upstream virtual IP (we use iptables to balance load across the backend) and according to the docs the upstream should never be marked as down in this case:
'If there is only a single server in a group, max_fails, fail_timeout and slow_start parameters are ignored, and such a server will never be considered unavailable'
Testing locally with our config confirms this and I cannot reproduce the 'no live upstreams while connecting to upstream' message when simulating connection and read errors with a single upstream.
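For anyone who wants to repeat that kind of local test, here is a minimal Python sketch of a backend that fails on purpose. It only illustrates the idea and is not the exact setup used here; `FaultyBackend`, `start_faulty_backend` and the `mode` values are hypothetical names. In `close` mode it drops the connection without replying, and in `hang` mode it stalls before closing so that the proxy's read timeout can fire.

```python
import socket
import socketserver
import threading
import time


class FaultyBackend(socketserver.BaseRequestHandler):
    """Accepts a connection and then misbehaves, depending on server.mode."""

    def handle(self):
        if self.server.mode == "hang":
            # Stall without sending anything so the proxy's read timeout fires.
            time.sleep(self.server.hang_seconds)
        # In both modes we return without writing a response;
        # socketserver then closes the client socket for us.


def start_faulty_backend(mode="close", hang_seconds=5.0,
                         host="127.0.0.1", port=0):
    """Start the faulty backend in a background thread and return the server.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    server = socketserver.ThreadingTCPServer((host, port), FaultyBackend)
    server.daemon_threads = True
    server.mode = mode
    server.hang_seconds = hang_seconds
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_faulty_backend("close", port=9000)` (the port is an arbitrary example) and pointing the single `server` entry of the upstream at `127.0.0.1:9000` should give Nginx a backend that always fails; `server.shutdown()` stops it again.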
To debug this, I tried enabling debug logs, but under load that degraded performance too much. I also traced the worker process with strace and didn't find any socket or other errors during the 502 spike.
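Since debug logging is too expensive under load, a cheaper option is to post-process the normal error log. The following is a rough Python sketch, assuming the default error log layout where each line starts with a `YYYY/MM/DD HH:MM:SS` timestamp; the function names are my own. It counts the 'no live upstreams' lines per second, so the spikes can be lined up against the load test timeline, and it lists any other upstream-related lines that might have been missed.

```python
import re
from collections import Counter

# Nginx error log lines start with a "YYYY/MM/DD HH:MM:SS" timestamp.
TIMESTAMP = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")
MESSAGE = "no live upstreams while connecting to upstream"


def count_per_second(lines, message=MESSAGE):
    """Return a Counter mapping each timestamp (1 s resolution) to the
    number of log lines in that second that contain `message`."""
    counts = Counter()
    for line in lines:
        if message not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def other_upstream_errors(lines, message=MESSAGE):
    """Return lines that mention an upstream but are not the
    'no live upstreams' message, e.g. connect or read failures."""
    return [line for line in lines
            if "upstream" in line and message not in line]
```

It can be run against the real log with something like `count_per_second(open(path))`, where `path` is wherever the `error_log` directive points.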
I have seen this issue on both Nginx 1.12.2 and 1.15.3.
So, given that we don't see any underlying error and we have a single upstream, what other scenarios could result in a 502 with the log message 'no live upstreams while connecting to upstream'?