Hi,
I finally reproduced this, with debug logging enabled --- I found the problematic request in the error log preceding the kill signal, saying it was being buffered to a temporary file:
2014/07/21 11:39:39 [warn] 21182#0: *32332838 an upstream response is buffered to a temporary file /var/cache/nginx/uwsgi_temp/9/90/0000186909 while reading upstream, client: x.x.x.x, server: foo.com, request: "GET /api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1 HTTP/1.1", upstream: "uwsgi://123.45.67.89:3003", host: "foo.com"
2014/07/21 11:41:18 [alert] 16885#0: worker process 21182 exited on signal 9
and retrying that request reproduces the problem, nginx growing in size without bound. (The request never made it to the access log because of the OOM kill, which is why my previous testing didn't reproduce it)
So here are debug log segments for single epoll events, during a healthy streaming request (one that doesn't trigger this problem), and during the problematic request. These requests *ought* to behave just the same (I'm not aware of any behavior difference in our upstream app, invoked via uwsgi_pass, except that the CSV lines are a little longer in the problematic response).
Healthy request:
epoll: fd:43 ev:0005 d:00000000023FF960
*1 http upstream request: "/api/well-behaved/request.csv?el=fnord!a=1.2:b=3.4.5:c&dates_as_dates=1"
*1 http upstream process upstream
*1 pipe read upstream: 1
*1 readv: 3:4096
*1 pipe recv chain: 44
*1 readv: 3:4096
*1 readv() not ready (11: Resource temporarily unavailable)
*1 pipe recv chain: -2
*1 pipe buf free s:0 t:1 f:0 00000000020EED30, pos 00000000020EED30, size: 2267 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 000000000220B1C0, pos 000000000220B1C0, size: 0 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001FE5900, pos 0000000001FE5900, size: 0 file: 0, size: 0
*1 pipe length: -1
*1 pipe write downstream: 1
*1 pipe write busy: 0
*1 pipe write: out:0000000000000000, f:0
*1 pipe read upstream: 0
*1 pipe buf free s:0 t:1 f:0 00000000020EED30, pos 00000000020EED30, size: 2267 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 000000000220B1C0, pos 000000000220B1C0, size: 0 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001FE5900, pos 0000000001FE5900, size: 0 file: 0, size: 0
*1 pipe length: -1
*1 event timer: 43, old: 1405973335524, new: 1405973335560
*1 http upstream request: "/api/well-behaved/request.csv?el=fnord!a=1.2:b=3.4.5:c&dates_as_dates=1"
*1 http upstream dummy handler
timer delta: 0
posted events 0000000000000000
worker cycle
accept mutex lock failed: 0
epoll timer: 500
For the problematic request, the epoll events *often* look identical (all the same kinds of lines, plus sometimes an extra pipe recv chain and readv line pair, presumably just because of data being streamed in slightly bigger chunks) ... but sometimes they have some extra lines, which I've highlighted with an XXX prefix here:
epoll: fd:42 ev:0005 d:000000000150DE70
*1 http upstream request: "/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
*1 http upstream process upstream
*1 pipe read upstream: 1
*1 readv: 3:4096
*1 pipe recv chain: 519
XXX *1 input buf #135
*1 readv: 2:4096
*1 readv() not ready (11: Resource temporarily unavailable)
*1 pipe recv chain: -2
XXX *1 pipe buf in s:1 t:1 f:0 0000000001417550, pos 0000000001417550, size: 8192 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001447610, pos 0000000001447610, size: 326 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 00000000014305C0, pos 00000000014305C0, size: 0 file: 0, size: 0
*1 pipe length: -1
*1 pipe write downstream: 1
*1 pipe write busy: 0
XXX *1 pipe write buf ls:1 0000000001417550 8192
XXX *1 pipe write: out:0000000001431878, f:0
XXX *1 http output filter "/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
XXX *1 http copy filter: "/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
XXX *1 http postpone filter "/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1" 0000000001431878
XXX *1 http chunk: 8192
XXX *1 write new buf t:1 f:0 0000000001431888, pos 0000000001431888, size: 6 file: 0, size: 0
XXX *1 write new buf t:1 f:0 0000000001417550, pos 0000000001417550, size: 8192 file: 0, size: 0
XXX *1 write new buf t:0 f:0 00000000014317E0, pos 00000000004ADFFD, size: 2 file: 0, size: 0
XXX *1 http write filter: l:0 f:1 s:8200
XXX *1 http write filter limit 0
XXX *1 writev: 8200
XXX *1 http write filter 0000000000000000
XXX *1 http copy filter: 0 "/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
XXX *1 pipe write busy: 0
*1 pipe write: out:0000000000000000, f:0
*1 pipe read upstream: 0
*1 pipe buf free s:0 t:1 f:0 0000000001447610, pos 0000000001447610, size: 326 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001417550, pos 0000000001417550, size: 0 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 00000000014305C0, pos 00000000014305C0, size: 0 file: 0, size: 0
*1 pipe length: -1
*1 event timer: 42, old: 1405971008168, new: 1405971008452
*1 http upstream request: "/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
*1 http upstream dummy handler
timer delta: 1
posted events 0000000000000000
worker cycle
accept mutex lock failed: 0
epoll timer: 500
These extra lines *never* appear in the healthy requests, so I imagine they point to the problem (but I am not at all familiar with Nginx debug output); in particular those "write new buf" lines look relevant; they are output right after ngx_alloc_chain_link is called.
All the possibly relevant Nginx config:
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '...';
access_log /var/log/nginx/access.log main buffer=32k;
sendfile on;
keepalive_timeout 65;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
gzip on;
gzip_min_length 1280;
gzip_types text/css application/json application/javascript text/csv text/xml text/turtle;
log_format combined_upstream '...';
log_format internal_proxy '...';
uwsgi_hide_header X-...;
uwsgi_buffers 64 8k;
proxy_buffers 64 8k;
include /etc/nginx/sites-enabled/*;
}
upstream fooapp.foo.com {
server 123.45.67.89:3003 down;
server 123.45.67.88:3003 down;
server 123.45.67.87:3003;
server 123.45.67.86:3003;
least_conn;
}
server {
listen 80;
server_name foo.com;
listen 443 ssl;
ssl_certificate wildcard.foo.com.crt;
ssl_certificate_key wildcard.foo.com.crt.key;
ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
set_real_ip_from 123.45.67.89/27;
access_log /var/log/nginx/foo.com.access.log combined_upstream buffer=8k flush=2s;
client_max_body_size 32m;
...
location /api/ {
include uwsgi_params;
uwsgi_pass fooapp-foo.com;
uwsgi_read_timeout 180;
}
}
There are no other occurrences of _buffer anywhere in the config:
$ sudo grep -r _buffer /etc/nginx/
/etc/nginx/nginx.conf: uwsgi_buffers 64 8k;
/etc/nginx/nginx.conf: proxy_buffers 64 8k;
Beginning of output of well-behaved request:
HTTP/1.1 200 OK
Server: nginx/1.6.0
Date: Mon, 21 Jul 2014 20:48:50 GMT
Content-Type: text/csv; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: keep-alive
Content-Disposition: attachment; filename=Foo-export.csv
Content-Language: en
Sex,Age group,ZIP code,Year-begin-date,Value
Female,Under 5 years,00601,2012-01-01,572
Female,Under 5 years,00602,2012-01-01,1132
Female,Under 5 years,00603,2012-01-01,1589
Female,Under 5 years,00606,2012-01-01,189
Female,Under 5 years,00610,2012-01-01,784
...
Beginning of output of problematic request:
HTTP/1.1 200 OK
Server: nginx/1.6.0
Date: Mon, 21 Jul 2014 20:49:07 GMT
Content-Type: text/csv; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: keep-alive
Content-Disposition: attachment; filename=Foo-export.csv
Content-Language: en
Location,Measurement,Month-begin-date,Value
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-01-01,30.6
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-02-01,35.5
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-03-01,41.4
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-04-01,41.6
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-05-01,44.9
...
Does this narrow down the problem? Can I provide anything further?
Cheers,
Gulli