Transient, Load Related Slow response_time / upstream_response_time vs App Server Reported Times

Jordan von Kluck
October 29, 2020 02:04PM
Hello -

I am hoping someone on the community list can help steer me in the right
direction for troubleshooting the following scenario:

I am running a cluster of 4 virtualized nginx open source 1.16.0 servers
with 4 vCPU cores and 4 GB of RAM each. They serve HTTP (REST API) requests
to a pool of about 40 different upstream clusters, which range from 2 to 8
servers within each upstream definition. The upstream application servers
themselves have multiple workers per server.

I've recently started seeing an issue where the response_time, and
typically the upstream_response_time, reported in the nginx access log are
drastically different from the response times reported on the application
servers themselves. For example, on some requests the typical average
response_time would be around 5ms with an upstream_response_time of 4ms.
During transient periods of high load (approximately 1200-1400 rps), the
reported nginx response_time and upstream_response_time spike up to
somewhere around 1 second, while the application logs on the upstream
servers are still reporting the same 4ms response time.

The upstream definitions are very simple and look like:
upstream rest-api-xyz {
    least_conn;
    server 10.1.1.33:8080 max_fails=3 fail_timeout=30; # production-rest-api-xyz01
    server 10.1.1.34:8080 max_fails=3 fail_timeout=30; # production-rest-api-xyz02
}

One possibility I've considered is that the app servers are accepting the
requests and queueing them in a TCP socket locally, although the
instrumentation on the app servers does not suggest this. In fact, running
a packet capture on both the nginx server and the app server shows the HTTP
request leaving nginx at the end of the time window. I have not looked at
this down to the TCP handshake to see whether the actual negotiation is
taking an excessive amount of time. I can produce this queueing scenario
artificially, but it does not appear to be what's happening in my
production environment in the scenario described above.

Does anyone here have any experience sorting out something like this? The
upstream_connect_time is not currently part of the log, but if that number
turned out to be high, I'm not entirely sure what would cause it.
Similarly, if the upstream_connect_time does not account for most of the
delay, is there anything else I should be looking at?
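
For reference, a minimal sketch of a log format that would record the
connect time next to the other timings. The format name upstream_timing,
the field labels, and the log path are hypothetical, and $request_time is
assumed to be the variable logged as response_time above. The variables
themselves are standard nginx variables, and log_format belongs in the
http context:

```nginx
# Hypothetical format name and field labels; place inside the http { } block.
log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                           'rt=$request_time '
                           'uct=$upstream_connect_time '
                           'uht=$upstream_header_time '
                           'urt=$upstream_response_time '
                           'upstream=$upstream_addr';

# Hypothetical log path; adjust to the existing access_log location.
access_log /var/log/nginx/upstream_timing.log upstream_timing;
```

With something like this in place, a high uct during the slow periods would
point at connection establishment to the upstream, while a low uct with uht
close to urt would suggest the time is spent after the connection is
established and before the response header arrives.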

Thanks
Jordan
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx