Hello All.
I was hoping someone would be able to provide some insight into a difficult problem we are seeing.
We use nginx as a reverse proxy. Every day at the same time the nginx server stops responding.
We have tried a variety of steps to solve this problem with no resolve.
Before I get into the symptoms, here is some relevant version information :
Here is the output of a nginx -V on our two servers
nginx version: nginx/0.8.32
built by gcc 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --with-http_realip_module --with-http_addition_module --with-http_stub_status_module --with-http_ssl_module --with-http_stub_status_module --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/log/nginx/nginx.pid --with-syslog
nginx version: nginx/0.7.65
built by gcc 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --with-http_realip_module --with-http_addition_module --with-http_stub_status_module --with-http_ssl_module --with-http_stub_status_module --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/log/nginx/nginx.pid
Here is the linux kernel information : 2.6.28-11-server #42-Ubuntu SMP
There are two nginx servers connected directly to our router.
Each of these has its own real server IP address.
As an example lets say they are 100.100.10.20 and 100.100.10.21
We also have VIP addresses for websites running through the nginx proxies.
As example lets say we have 5 sites (3 VIP's), 100.100.20.1 , .2, .3, .4 ,.5 .
The problem seems to follow a particular website, but causes whatever other sites running through
that nginx server at the time to fail, as nginx is no longer responding.
tcptraceroutes to both the real server IP and the VIP's show that the port is open.
If we ssh into the server and attempt to connect locally via real IP or VIP, we get the same
result, port open but no repsonse to an http get or anything else. Normally nginx would serve a
static page with the line "<hostname> is up>" when we attempt to connect to the real server IP,
and again this is on the lcoal nginx server that failed, so it is not using the network to connect.
This is not a cronjob or other timed event on our end. we have changed the cron job time, and also changed the date/time on the server with no change.
In trying to solve this issue we have contact numerous people and we have collected data via
running "strace" with nginx. Checking /proc/net/ values at the time of failure, and running tshark on
that server to collect a traffic dump, which shows alot of tcp retransmissions.
Additionally we have tried other versions of nginx, and we have tried checking sysctl values as well
as making changes to those values as per below ,
(first line we check the current value, middle line we change the value, last line we verify the value has changed.) :
-----
cat /proc/sys/net/ipv4/tcp_rfc1337
0
echo 1 > /proc/sys/net/ipv4/tcp_rfc1337
cat /proc/sys/net/ipv4/tcp_rfc1337
1
----
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
1024
echo 5120 > /proc/sys/net/ipv4/tcp_max_syn_backlog
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
5120
----
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
180000
echo 280000 > /proc/sys/net/ipv4/tcp_max_tw_buckets
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
280000
----
cat /proc/sys/net/ipv4/tcp_frto
2
echo 0 > /proc/sys/net/ipv4/tcp_frto
cat /proc/sys/net/ipv4/tcp_frto
0
-----------------------------------------------
As my colleague suggested, here is a printout of sysctl -a , the sysctl current values.
I have excluded ipv6 , kernel , dev, and fs from this list to keep it smaller :
net.ipv4.route.gc_thresh = 32768
net.ipv4.route.max_size = 524288
net.ipv4.route.gc_min_interval = 0
net.ipv4.route.gc_min_interval_ms = 500
net.ipv4.route.gc_timeout = 300
net.ipv4.route.gc_interval = 60
net.ipv4.route.redirect_load = 2
net.ipv4.route.redirect_number = 9
net.ipv4.route.redirect_silence = 2048
net.ipv4.route.error_cost = 100
net.ipv4.route.error_burst = 500
net.ipv4.route.gc_elasticity = 8
net.ipv4.route.mtu_expires = 600
net.ipv4.route.min_pmtu = 552
net.ipv4.route.min_adv_mss = 256
net.ipv4.route.secret_interval = 600
error: permission denied on key 'net.ipv4.route.flush'
net.ipv4.neigh.default.mcast_solicit = 3
net.ipv4.neigh.default.ucast_solicit = 3
net.ipv4.neigh.default.app_solicit = 0
net.ipv4.neigh.default.retrans_time = 100
net.ipv4.neigh.default.base_reachable_time = 30
net.ipv4.neigh.default.delay_first_probe_time = 5
net.ipv4.neigh.default.gc_stale_time = 60
net.ipv4.neigh.default.unres_qlen = 3
net.ipv4.neigh.default.proxy_qlen = 64
net.ipv4.neigh.default.anycast_delay = 100
net.ipv4.neigh.default.proxy_delay = 80
net.ipv4.neigh.default.locktime = 100
net.ipv4.neigh.default.retrans_time_ms = 1000
net.ipv4.neigh.default.base_reachable_time_ms = 30000
net.ipv4.neigh.default.gc_interval = 30
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
net.ipv4.neigh.lo.mcast_solicit = 3
net.ipv4.neigh.lo.ucast_solicit = 3
net.ipv4.neigh.lo.app_solicit = 0
net.ipv4.neigh.lo.retrans_time = 100
net.ipv4.neigh.lo.base_reachable_time = 30
net.ipv4.neigh.lo.delay_first_probe_time = 5
net.ipv4.neigh.lo.gc_stale_time = 60
net.ipv4.neigh.lo.unres_qlen = 3
net.ipv4.neigh.lo.proxy_qlen = 64
net.ipv4.neigh.lo.anycast_delay = 100
net.ipv4.neigh.lo.proxy_delay = 80
net.ipv4.neigh.lo.locktime = 100
net.ipv4.neigh.lo.retrans_time_ms = 1000
net.ipv4.neigh.lo.base_reachable_time_ms = 30000
net.ipv4.neigh.eth0.mcast_solicit = 3
net.ipv4.neigh.eth0.ucast_solicit = 3
net.ipv4.neigh.eth0.app_solicit = 0
net.ipv4.neigh.eth0.retrans_time = 100
net.ipv4.neigh.eth0.base_reachable_time = 30
net.ipv4.neigh.eth0.delay_first_probe_time = 5
net.ipv4.neigh.eth0.gc_stale_time = 60
net.ipv4.neigh.eth0.unres_qlen = 3
net.ipv4.neigh.eth0.proxy_qlen = 64
net.ipv4.neigh.eth0.anycast_delay = 100
net.ipv4.neigh.eth0.proxy_delay = 80
net.ipv4.neigh.eth0.locktime = 100
net.ipv4.neigh.eth0.retrans_time_ms = 1000
net.ipv4.neigh.eth0.base_reachable_time_ms = 30000
net.ipv4.neigh.eth1.mcast_solicit = 3
net.ipv4.neigh.eth1.ucast_solicit = 3
net.ipv4.neigh.eth1.app_solicit = 0
net.ipv4.neigh.eth1.retrans_time = 100
net.ipv4.neigh.eth1.base_reachable_time = 30
net.ipv4.neigh.eth1.delay_first_probe_time = 5
net.ipv4.neigh.eth1.gc_stale_time = 60
net.ipv4.neigh.eth1.unres_qlen = 3
net.ipv4.neigh.eth1.proxy_qlen = 64
net.ipv4.neigh.eth1.anycast_delay = 100
net.ipv4.neigh.eth1.proxy_delay = 80
net.ipv4.neigh.eth1.locktime = 100
net.ipv4.neigh.eth1.retrans_time_ms = 1000
net.ipv4.neigh.eth1.base_reachable_time_ms = 30000
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.ip_default_ttl = 64
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.ip_nonlocal_bind = 0
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 32768
net.ipv4.tcp_max_tw_buckets = 280000
net.ipv4.ip_dynaddr = 0
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_max_syn_backlog = 5120
net.ipv4.ip_local_port_range = 5000 65000
net.ipv4.igmp_max_memberships = 20
net.ipv4.igmp_max_msf = 10
net.ipv4.inet_peer_threshold = 65664
net.ipv4.inet_peer_minttl = 120
net.ipv4.inet_peer_maxttl = 600
net.ipv4.inet_peer_gc_mintime = 10
net.ipv4.inet_peer_gc_maxtime = 120
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 77376 103168 154752
net.ipv4.tcp_wmem = 4096 16384 3301376
net.ipv4.tcp_rmem = 4096 87380 3301376
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_frto = 0
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = reno
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.cipso_cache_enable = 1
net.ipv4.cipso_cache_bucket_size = 10
net.ipv4.cipso_rbm_optfmt = 0
net.ipv4.cipso_rbm_strictvalid = 1
net.ipv4.tcp_available_congestion_control = reno cubic
net.ipv4.tcp_allowed_congestion_control = reno cubic
net.ipv4.tcp_max_ssthresh = 0
net.ipv4.udp_mem = 388416 517888 776832
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
net.ipv4.conf.all.forwarding = 0
net.ipv4.conf.all.mc_forwarding = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 1
net.ipv4.conf.all.shared_media = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.proxy_arp = 0
net.ipv4.conf.all.medium_id = 0
net.ipv4.conf.all.bootp_relay = 0
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.all.tag = 0
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.arp_announce = 0
net.ipv4.conf.all.arp_ignore = 0
net.ipv4.conf.all.arp_accept = 0
net.ipv4.conf.all.disable_xfrm = 0
net.ipv4.conf.all.disable_policy = 0
net.ipv4.conf.all.force_igmp_version = 0
net.ipv4.conf.all.promote_secondaries = 0
net.ipv4.conf.default.forwarding = 0
net.ipv4.conf.default.mc_forwarding = 0
net.ipv4.conf.default.accept_redirects = 1
net.ipv4.conf.default.secure_redirects = 1
net.ipv4.conf.default.shared_media = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.default.accept_source_route = 1
net.ipv4.conf.default.proxy_arp = 0
net.ipv4.conf.default.medium_id = 0
net.ipv4.conf.default.bootp_relay = 0
net.ipv4.conf.default.log_martians = 0
net.ipv4.conf.default.tag = 0
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.arp_announce = 0
net.ipv4.conf.default.arp_ignore = 0
net.ipv4.conf.default.arp_accept = 0
net.ipv4.conf.default.disable_xfrm = 0
net.ipv4.conf.default.disable_policy = 0
net.ipv4.conf.default.force_igmp_version = 0
net.ipv4.conf.default.promote_secondaries = 0
net.ipv4.conf.lo.forwarding = 0
net.ipv4.conf.lo.mc_forwarding = 0
net.ipv4.conf.lo.accept_redirects = 1
net.ipv4.conf.lo.secure_redirects = 1
net.ipv4.conf.lo.shared_media = 1
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.lo.send_redirects = 1
net.ipv4.conf.lo.accept_source_route = 1
net.ipv4.conf.lo.proxy_arp = 0
net.ipv4.conf.lo.medium_id = 0
net.ipv4.conf.lo.bootp_relay = 0
net.ipv4.conf.lo.log_martians = 0
net.ipv4.conf.lo.tag = 0
net.ipv4.conf.lo.arp_filter = 0
net.ipv4.conf.lo.arp_announce = 0
net.ipv4.conf.lo.arp_ignore = 0
net.ipv4.conf.lo.arp_accept = 0
net.ipv4.conf.lo.disable_xfrm = 1
net.ipv4.conf.lo.disable_policy = 1
net.ipv4.conf.lo.force_igmp_version = 0
net.ipv4.conf.lo.promote_secondaries = 0
net.ipv4.conf.eth0.forwarding = 0
net.ipv4.conf.eth0.mc_forwarding = 0
net.ipv4.conf.eth0.accept_redirects = 1
net.ipv4.conf.eth0.secure_redirects = 1
net.ipv4.conf.eth0.shared_media = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.eth0.send_redirects = 1
net.ipv4.conf.eth0.accept_source_route = 1
net.ipv4.conf.eth0.proxy_arp = 0
net.ipv4.conf.eth0.medium_id = 0
net.ipv4.conf.eth0.bootp_relay = 0
net.ipv4.conf.eth0.log_martians = 0
net.ipv4.conf.eth0.tag = 0
net.ipv4.conf.eth0.arp_filter = 0
net.ipv4.conf.eth0.arp_announce = 0
net.ipv4.conf.eth0.arp_ignore = 0
net.ipv4.conf.eth0.arp_accept = 0
net.ipv4.conf.eth0.disable_xfrm = 0
net.ipv4.conf.eth0.disable_policy = 0
net.ipv4.conf.eth0.force_igmp_version = 0
net.ipv4.conf.eth0.promote_secondaries = 0
net.ipv4.conf.eth1.forwarding = 0
net.ipv4.conf.eth1.mc_forwarding = 0
net.ipv4.conf.eth1.accept_redirects = 1
net.ipv4.conf.eth1.secure_redirects = 1
net.ipv4.conf.eth1.shared_media = 1
net.ipv4.conf.eth1.rp_filter = 1
net.ipv4.conf.eth1.send_redirects = 1
net.ipv4.conf.eth1.accept_source_route = 1
net.ipv4.conf.eth1.proxy_arp = 0
net.ipv4.conf.eth1.medium_id = 0
net.ipv4.conf.eth1.bootp_relay = 0
net.ipv4.conf.eth1.log_martians = 0
net.ipv4.conf.eth1.tag = 0
net.ipv4.conf.eth1.arp_filter = 0
net.ipv4.conf.eth1.arp_announce = 0
net.ipv4.conf.eth1.arp_ignore = 0
net.ipv4.conf.eth1.arp_accept = 0
net.ipv4.conf.eth1.disable_xfrm = 0
net.ipv4.conf.eth1.disable_policy = 0
net.ipv4.conf.eth1.force_igmp_version = 0
net.ipv4.conf.eth1.promote_secondaries = 0
net.ipv4.ip_forward = 0
net.ipv4.ipfrag_high_thresh = 262144
net.ipv4.ipfrag_low_thresh = 196608
net.ipv4.ipfrag_time = 30
net.ipv4.icmp_echo_ignore_all = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.icmp_errors_use_inbound_ifaddr = 0
net.ipv4.icmp_ratelimit = 1000
net.ipv4.icmp_ratemask = 6168
net.ipv4.ipfrag_secret_interval = 600
net.ipv4.ipfrag_max_dist = 64
net.token-ring.rif_timeout = 60000
net.unix.max_dgram_qlen = 10
net.core.wmem_max = 131071
net.core.rmem_max = 131071
net.core.wmem_default = 111616
net.core.rmem_default = 111616
net.core.dev_weight = 64
net.core.netdev_max_backlog = 1000
net.core.message_cost = 5
net.core.message_burst = 10
net.core.optmem_max = 10240
net.core.xfrm_aevent_etime = 10
net.core.xfrm_aevent_rseqth = 2
net.core.xfrm_larval_drop = 1
net.core.xfrm_acq_expires = 30
net.core.netdev_budget = 300
net.core.warnings = 1
net.core.somaxconn = 128
crypto.fips_enabled = 0
-----------------------------------------
We are at a loss as to what might be causing the issue.
If anyone know what might be going on here or steps we have not already taken to try to unwrap
this mystery, please help.