Every other 60 second timeouts

bdmesh

Every other 60 second timeouts
December 17, 2015 03:38PM

Registered: 8 years ago
Posts: 6

Hi,

We're using NGINX as a reverse proxy and are having difficulty getting consistent throughput. The service we have stood up behind the proxy is a NodeJS service running on CoreOS on an AWS ec2 instance, sitting behind an ELB. For the proxy, we're using NGINX as part of KONG running on a CoreOS Docker container. The version is openresty/1.9.3.1. We're using a simply proxy_pass to route the traffic though.

We have a few of these NGINX instances all configured the same. They'll work great for a while (1-4 days) and then the traffic passing through will enter a pattern where every other call proxies through instantaneously, the other calls will delay in NGINX for 60 seconds before making it out to the upstream service and getting a response back to proxy back to the caller.

I've configured other test upstream services to see if there's anything tied to the service being fronted, however as soon as I apply the config changes to NGINX, the report goes away for a while, so this path is a slow process and also hard to validate once we think we have a fix as we basically need to just wait 2-3X times the standard failure time to see if it fails again.

Has anyone seen this before or know how to fix it?

Thanks!
-Brian

Reply Quote

itpp2012

Re: Every other 60 second timeouts
December 17, 2015 05:11PM

Admin
Registered: 11 years ago
Posts: 1,297

Sounds like a DNS issue.

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

bdmesh

Re: Every other 60 second timeouts
December 17, 2015 06:13PM

Registered: 8 years ago
Posts: 6

Thanks ittp2012,

How do you mean exactly? A DNS issue would cause it to consistently oscillate back and forth between working and not working with each call? The speed/timing of calls doesn't seem to have any bearing on the back and forth nature of the response. So are you suggesting that somehow nginx just doing the DNS lookup somehow effects the results coming back? What would I look for to track this down any further?

Thanks!

Reply Quote

itpp2012

Re: Every other 60 second timeouts
December 17, 2015 06:17PM

Admin
Registered: 11 years ago
Posts: 1,297

If you are using DNS names and the lookup of those names take longer then they should for whatever reason this can seriously impact nginx performance. If possible use IP's to see if the delays are gone or use a debug build version to see where in the code it is waiting and for what.

If it is really a dns issue you might have to do some caching of dns lookups.

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

bdmesh

Re: Every other 60 second timeouts
December 17, 2015 07:51PM

Registered: 8 years ago
Posts: 6

Thanks -

I'll try to get a debug version up to check things out... the unfortunate part is if I touch one of the existing repros, it immediately temporarily resolves. I made a change to my config to add another mapping to a test service and just doing the "nginx -s reload" made the repro just go away. I've now reset all repos but one and want to leave it alone until the others start reproing again.

Wanting to continue tracking this down while I wait for a repro on a debug instance...

For the DNS caching, I see in a search result that it appears NGINX is already doing some sort of caching by default? Either way, my pattern is happening multiple times per second if I cancel hanging calls. Ie, curl it - success, curl it - hang (cancel under half second), curl it - success, curl it -hang.... seems awfully quick to just be a cache timing issue. Additionally, if I leave one of the hanging calls going and start curl'ing it from somewhere else, all those other calls will succeed for the duration of that other hang. Does that still fit the DNS theory? I'm just not knowledgable enough there to know.

I will try ip-based routing as a debug step, but the service that we currently front (as well as many to come) use AWS's Elastic Load Balanacer as their entry point and have rolling dynamic IPs as they scale/swap out instances to meet demand. For production, we'll need to stay with domain name based proxy_pass definitions.

I did find the following logs (dns name and ip addresses modified):
2015/12/15 03:40:13 [warn] 10#0: *32460 upstream server temporarily disabled while connecting to upstream, client: 10.0.1.127, server: , request: "HEAD /receipts HTTP/1.1", upstream: "http://52.52.52.52:80/", host: "www.mydomain.com"
2015/12/15 03:55:16 [error] 10#0: *32520 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.1.127, server: , request: "GET /v4/fca509b424b04cbf8f58cca76faba5b0 HTTP/1.1", upstream: "http://52.52.52.52:80/v4/fca509b424b04cbf8f58cca76faba5b0", host: "www.mydomain.com"

There was nothing of interest in error.log or access.log. The only thing of note was that in access.log, the entries for a given request only showed up when the call completed (so in the 60 sec delay case it only shows in the log after 60 seconds when the response comes through.),

Thanks again for the help!

Reply Quote

itpp2012

Re: Every other 60 second timeouts
December 18, 2015 02:57AM

Admin
Registered: 11 years ago
Posts: 1,297

bdmesh Wrote:
-------------------------------------------------------
> I did find the following logs (dns name and ip addresses modified):
> 2015/12/15 03:40:13 [warn] 10#0: *32460 upstream server temporarily
> disabled while connecting to upstream, client: 10.0.1.127, server: ,

This indicates a backend is causing this, not nginx, it may be going down (and marked as down) for some reason. There are a lot of reasons why this is happening still including a dns issue somewhere. Maybe curl-ing the backend rapidly may show something.

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

bdmesh

Re: Every other 60 second timeouts
December 18, 2015 08:10PM

Registered: 8 years ago
Posts: 6

Yeah, we've thought of that - logs on the backend aren't showing the calls come in until the delay is finished. We have other mirror instances of nginx (with same configs) that succeed the whole time. We also have test scripts that are curling the same backend service via the "working" nginx proxy, the "flaky" nginx proxy", and the backend service directly. Those loop multiple times per second and the "working" nginx proxy (until it turns flaky) and the direct calls to the backend always succeed even while the "flaky proxy" reliably shows the error.

Still digging... thanks for the continued brainstorm!

Reply Quote

itpp2012

Re: Every other 60 second timeouts
December 19, 2015 04:10AM

Admin
Registered: 11 years ago
Posts: 1,297

See also https://forum.nginx.org/read.php?15,263481,263482#msg-263482
Have a look how you are dealing with: error timeout http_500 http_502 http_503 http_504 http_404, etc.

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

hheiko

Re: Every other 60 second timeouts
May 22, 2016 03:31PM

Registered: 8 years ago
Posts: 13

Did you've find a solution for this problem?
We're facing probably a similiar problem, where the apache backend is running fine for some hours and then nginx get 504 error from time to time, with an exact time difference of 1/3/5/9 minutes, the the backends run fine again for 1-2-3 hours.

The upstream is an Centos7/Apache2.4/PHP 5.6 cluster and there in absolutely no load on nginx and the upstreams, maxium 1 client from time to time. Both backend servers shows these problems, with an Apache running on windows this problem has never been occured.

I'm pretty sure it's not directly an nignx issue but after days of researching I have no idea what to try next.....

2016/05/22 11:33:55 [notice] 4904#6056: using sockapi from "4;"
2016/05/22 11:33:55 [notice] 4268#5176: sockinit() attempting to access sockapi
2016/05/22 11:33:55 [notice] 4268#5176: Access to sockapi succeded!
2016/05/22 11:33:55 [notice] 4268#5176: using sockapi from "4;"
2016/05/22 11:33:55 [notice] 2396#1404: sockinit() attempting to access sockapi
2016/05/22 11:33:55 [notice] 2396#1404: Access to sockapi succeded!
2016/05/22 11:33:55 [notice] 2396#1404: using sockapi from "4;"

2016/05/22 13:03:11 [warn] 5884#1340: *1770011 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:03:11 [error] 5884#1340: *1770011 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:06:11 [warn] 2292#5648: *1770018 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:06:11 [error] 2292#5648: *1770018 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:17:11 [warn] 4268#6072: *1770041 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.14:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:17:11 [error] 4268#6072: *1770041 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.14:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:18:11 [warn] 4268#6072: *1770044 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:18:11 [error] 4268#6072: *1770044 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:19:11 [warn] 5884#1340: *1770048 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.14:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:19:11 [error] 5884#1340: *1770048 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.14:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:21:11 [warn] 5884#1340: *1770053 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.14:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:21:11 [error] 5884#1340: *1770053 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.14:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:22:11 [warn] 6088#540: *1770056 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:22:11 [error] 6088#540: *1770056 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:43:11 [warn] 592#5524: *1770100 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:43:11 [error] 592#5524: *1770100 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:45:11 [warn] 4996#2736: *1770105 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:45:11 [error] 4996#2736: *1770105 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:51:11 [warn] 1768#5656: *1770145 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:51:11 [error] 1768#5656: *1770145 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:56:11 [warn] 1812#4888: *1770156 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 13:56:11 [error] 1812#4888: *1770156 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 14:16:11 [warn] 4268#6072: *1770198 upstream server temporarily disabled while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"
2016/05/22 14:16:11 [error] 4268#6072: *1770198 upstream timed out (10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond) while reading response header from upstream, client: 69.162.124.236, server: wahl2.hannover-stadt.de, request: "HEAD / HTTP/1.1", upstream: "http://192.168.57.88:80/wrs/", host: "wahl2.hannover-stadt.de"

Reply Quote

itpp2012

Re: Every other 60 second timeouts
May 22, 2016 04:42PM

Admin
Registered: 11 years ago
Posts: 1,297

Your best bet is to use wireshark to see exactly what is going on on the wire and who's reacting to who and mostly when it logs a failure what wireshark logs.

It feels like some sort of handshake error.

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

itpp2012

Re: Every other 60 second timeouts
May 23, 2016 04:15AM

Admin
Registered: 11 years ago
Posts: 1,297

Some other 'solutions' you can try;

https://www.scalescale.com/tips/nginx/upstream-timed-out/
http://serverfault.com/questions/647411/nginx-lots-of-connection-timed-out-errors
https://kb.plesk.com/en/120210
http://stackoverflow.com/questions/28347184/upstream-timed-out-110-connection-timed-out-for-static-content

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

itpp2012

Re: Every other 60 second timeouts
May 25, 2016 03:06AM

Admin
Registered: 11 years ago
Posts: 1,297

The only way to really dig into this issue is wireshark/tcpdump on both ends to determine if calls arrive, what happens with them and what happens when they return to nginx (port closures, handshake error, timeout, etc.)

We serve loads of tomcat backends which are really slow sometimes and rarely see errors like these, yes they do occur but like once a week when too many users hit the backend (filling out their time-sheets).

Also have a good look at the settings of the network cards, energy saving mode, tcp offloading, qos, top/dis-responders, dns settings (like auto register this machine at....), unused netbios, calls to active directory which may interrupt, etc.

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

itpp2012

Re: Every other 60 second timeouts
June 13, 2016 04:45AM

Admin
Registered: 11 years ago
Posts: 1,297

Might be unrelated but this could also be a reason:
https://social.technet.microsoft.com/Forums/office/en-US/c277b434-fb99-4694-9933-15fbe3761e70/no-network-activity-for-60-sec-after-a-destination-host-unreachable-error?forum=w7itpronetworking

Multiple LAN cards and using more than one default gateway (ARP/MAC ‘blocking’).

---
nginx for Windows http://nginx-win.ecsds.eu/

Reply Quote

Every other 60 second timeouts

Online Users