When running load tests against an Nginx server, I have seen a failure mode in which Nginx returns 5xx errors and the error log fills with messages like:
> [crit] 140#0: *1572276 bind(0.0.0.0) failed (98: Address already in use) while connecting to upstream
My theories for what might be happening here were:
1) File handles exhausted
2) Ephemeral ports or sockets exhausted
3) Nginx crashed, came back up and tried to re-bind to the same port
For 1), I think we should see `24: Too many open files` according to https://blog.serverdensity.com/troubleshoot-nginx/
For 2), I think we should see `99: Cannot assign requested address` according to https://www.nginx.com/blog/overcoming-ephemeral-port-exhaustion-nginx-plus/
If 3) had happened, I think we would have seen health check failures from the external load balancer sitting in front of Nginx (which we did not).
Note that our health check is implemented in Nginx as:
> location = /health {
>     return 204;
> }
So I'm guessing the health check did not fail because it never opens a connection to the upstream server. I still think this makes 3) less likely: if the Nginx process had crashed, the health checks would have failed as well.
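For reference, the numbers in these log messages are errno values, and Python's standard `errno` module gives a quick way to double-check which symbolic name each one maps to. This is only a sketch: the helper name `describe` is made up for illustration, and the numeric values are platform-specific (24, 98 and 99 are the Linux values). It may also be worth noting that the Linux bind(2) man page lists EADDRINUSE as the error returned when a socket is bound to port 0 and every port in the ephemeral range is already in use, so 98 on its own might not rule out theory 2).

```python
import errno
import os


def describe(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # The three errno values mentioned in the theories above (Linux numbering).
    for code in (24, 98, 99):
        name, message = describe(code)
        print(f"{code}: {name} ({message})")
```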
Does anyone have any insight into what's happening here, or how to diagnose further?
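One way to check theory 2) further might be to compare the size of the ephemeral port range with how many sockets sit in each TCP state while the load test is running. A rough sketch, assuming a Linux host where `/proc/sys/net/ipv4/ip_local_port_range` and `/proc/net/tcp` are readable (IPv4 only; the function names are hypothetical):

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def parse_port_range(text):
    """Parse the contents of ip_local_port_range into (low, high)."""
    low, high = (int(part) for part in text.split())
    return low, high


def count_states(lines):
    """Count sockets per TCP state from /proc/net/tcp-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and blank lines; data rows start with "N:".
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def main():
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        low, high = parse_port_range(f.read())
    with open("/proc/net/tcp") as f:
        counts = count_states(f)
    print(f"ephemeral port range: {low}-{high} ({high - low + 1} ports)")
    for state, n in counts.most_common():
        print(f"{state}: {n}")


if __name__ == "__main__":
    try:
        main()
    except OSError as exc:
        print(f"could not read /proc: {exc}")
```

If the TIME_WAIT or ESTABLISHED counts get close to the size of the port range during the test, that would seem to point towards 2) rather than 1) or 3).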
Thanks!
Edited 1 time(s). Last edit at 06/27/2019 07:03PM by jarstewa.