Hi,
I've seen a pattern of errors on Nginx under load, where a single host returns a large burst of 500 errors along with messages in the error log like:
> 2021/02/03 00:52:32 [crit] 217#0: *289501 bind(1.2.3.4) failed (98: Address already in use) while connecting to upstream, client: 127.0.0.2, server: local_upstream, request: "HEAD /library/movie.mp4 HTTP/1.1", upstream: "http://10.0.177.216:80/library/movie.mp4", host: "local_upstream_server"
In the past I have seen this error under failure modes that had led to ephemeral port exhaustion. However, based on the TCP connection metrics on the host, I don't think that's happening here. So I am wondering if anyone has suggestions about what might be happening, or what data/metrics it would be useful to collect to investigate this further.
A brief note about my setup: I'm running two layers of Nginx servers on the same host. The IP in the bind errors here was a remote IP for the backend fleet.
Front door load balancer -> Nginx layer 1 -> Nginx layer 2 -> Backend fleet
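
On the data/metrics question, here is a minimal sketch of one thing that could be collected: a count of TCP sockets per local address and state, restricted to the ephemeral port range, since the failing call is bind() on a specific address. It assumes a Linux host, IPv4 only, a little-endian machine, and the common default port range of 32768-60999 (the real range is in /proc/sys/net/ipv4/ip_local_port_range). The names `decode_addr` and `summarize` are made up for illustration.

```python
import socket
import struct
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(hex_addr):
    """Decode an 'IIIIIIII:PPPP' field into (dotted IPv4, port).

    Assumes a little-endian host, where the address bytes appear reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def summarize(proc_net_tcp_text, port_range=(32768, 60999)):
    """Count sockets per (local IP, state) whose local port is ephemeral.

    port_range is an assumed default; pass the host's actual range.
    """
    low, high = port_range
    counts = Counter()
    # The first line of /proc/net/tcp is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        local_ip, local_port = decode_addr(fields[1])
        if not low <= local_port <= high:
            continue
        state = TCP_STATES.get(fields[3], fields[3])
        counts[(local_ip, state)] += 1
    return counts
```

On the affected host this could be run against the contents of /proc/net/tcp at a regular interval during a burst, to see whether any single local address approaches the size of the port range even when host-wide totals look healthy.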