Hi
I am trying to set up a UDP load balancer using Nginx. Initially, I configured 4 upstream servers with two server processes running on each of them.
This gave a throughput of around 24,000 queries per second when tested with dnsperf. When I try to add two more upstream servers, the throughput does not increase as expected; in fact, it deteriorates to around 5,000 queries per second, with the following errors:
[warn] 5943#0: *10433175 upstream server temporarily disabled while proxying connection, udp client: xxx.xxx.xxx.29, server: 0.0.0.0:53, upstream: "xxx.xxx.xxx.224:53", bytes from/to client:80/0, bytes from/to upstream:0/80
[error] 5943#0: *10085077 no live upstreams while connecting to upstream, udp client: xxx.xxx.xxx.224, server: 0.0.0.0:53, upstream: "dns_upstreams", bytes from/to client:80/0, bytes from/to upstream:0/0
I understand that the above errors appear when Nginx doesn't receive responses from an upstream in time, and that upstream is temporarily marked unavailable. I used to get these errors even with 4 upstream servers, but after adding the following configuration, the problem was resolved:
user nginx;
worker_processes 4;
worker_rlimit_nofile 65535;
load_module "/usr/lib64/nginx/modules/ngx_stream_module.so";
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections 10240;
}
stream {
    upstream dns_upstreams {
        server xxx.xxx.xxx.0:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.0:6363 max_fails=2000 fail_timeout=0s;
        server xxx.xxx.xxx.187:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.187:6363 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.183:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.183:6363 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.212:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.212:6363 max_fails=2000 fail_timeout=30s;
    }
    server {
        listen 53 udp;
        proxy_pass dns_upstreams;
        proxy_timeout 1s;
        proxy_responses 1;
    }
}
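One experiment I am planning, to rule out the passive health check as the trigger: per the Nginx docs, setting max_fails=0 disables failure accounting for a peer, so it is never marked temporarily unavailable (addresses below are illustrative, following the redacted ones above):

    upstream dns_upstreams {
        server xxx.xxx.xxx.0:53 max_fails=0;
        server xxx.xxx.xxx.0:6363 max_fails=0;
        # ... remaining peers, likewise with max_fails=0
    }

If throughput recovers with this, the problem would seem to be the interaction between failure accounting and the 1s proxy_timeout rather than raw upstream capacity.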
Even though this configuration works fine with 4 upstream servers, it doesn't help when I increase the number of servers.
The Nginx host has plenty of CPU and memory headroom when running with either 4 or 6 upstream servers. The dnsperf client is not the bottleneck either, since it can generate a much higher load against a different setup. Also, each individual upstream server can serve somewhat more than 5,000 requests per second on its own.
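For what it's worth, here is my back-of-the-envelope check on the connection budget implied by the config above. It is only a sketch: the two-connections-per-session figure is the usual Nginx accounting (one client-side plus one upstream-side connection per proxied UDP session), and the worst-case hold time assumes a session that waits out the full 1s proxy_timeout:

```python
# Connection budget from the nginx.conf above
worker_processes = 4
worker_connections = 10240
conns_per_session = 2  # one client-facing + one upstream-facing connection

# Maximum concurrent proxied UDP sessions across all workers
max_sessions = worker_processes * worker_connections // conns_per_session
print(max_sessions)  # 20480

# If unanswered sessions are held for the full proxy_timeout (1s) before
# being torn down, the sustainable query rate is capped at roughly:
proxy_timeout_s = 1.0
print(max_sessions / proxy_timeout_s)  # 20480.0 queries/second worst case
```

If that reasoning is right, the 24,000 qps I measured only works while upstreams answer quickly (sessions close as soon as the single expected response arrives, because of proxy_responses 1); once responses start arriving late, sessions pile up toward the 1s timeout and the effective ceiling drops, which might explain the collapse rather than a plateau.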
I am trying to get some hints about why I observe more upstream failures, and eventual unavailability, when I add more servers. If anybody has faced a similar issue in the past and can give me some pointers to solve it, that would be of great help.
Thanks,
Ajmal