On Thu, Apr 30, 2009 at 06:28:16AM -0400, JacobSingh wrote:
> Hello,
>
> I've been using nginx pretty successfully for awhile, and this problem started recently. I've got CentOS 5 and the nginx from yum (nginx/0.6.32).
>
> Here is my config:
>
> upstream search1.us_seach {
>
> server slave1.search1.xxxxxx.us:8080 weight=3 max_fails=40 fail_timeout=20s;
>
> server master.search1.xxxxxx.us:8080 weight=1 max_fails=0;
> }
>
> When the master server is down, instead of failing over to the slave server (as expected), I get this:
>
> 2009/04/30 02:23:34 15116#0: *99 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /mypath/to/something HTTP/1.1", upstream: "http://11.111.111.11:8080//mypath/to/something", host: "localhost:81"
> 2009/04/30 02:23:34 13227#0: signal 17 (SIGCHLD) received
> 2009/04/30 02:23:34 13227#0: worker process 15116 exited on signal 8
> 2009/04/30 02:23:34 13227#0: start worker process 16779
>
> Where (11.111.111.11 == master.search1.xxxxxx.us).
>
> It never even tries the slave server because it just bails right there.
>
> I made an strace of the problem, it is available here:
> http://pastebin.ca/1408224
>
> Here is the very end of it:
>
> gettimeofday({1241072060, 198274}, NULL) = 0
> getsockopt(20, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
> write(9, "2009/04/30 02:14:20 1322"..., 335) = 335
> --- SIGFPE (Floating point exception) @ 0 (0) ---
>
> I tried compiling 0.6.7. The problem remains, but is slightly better. Now, it will actually try the slave server first sometimes, but it still dies every time it tries the master instead of failing over to the slave.

The problem is due to

    max_fails=0;

It was fixed in 0.6.33:

Changes with nginx 0.6.33                                        20 Nov 2008

    *) Bugfix: if the "max_fails=0" parameter was used in upstream with
       several servers, then a worker process exited on a SIGFPE signal.
       Thanks to Maxim Dounin.
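
If upgrading is not possible right away, a possible workaround on 0.6.32 is
to avoid max_fails=0 altogether. The fragment below is only a sketch: it
assumes, based on the changelog entry above, that the crash is triggered
solely by max_fails=0 in an upstream with several servers, and it has not
been tested against your setup. The hostnames are the ones from your config.

```nginx
upstream search1.us_seach {
    server slave1.search1.xxxxxx.us:8080 weight=3 max_fails=40 fail_timeout=20s;

    # max_fails=0 removed: with no value given, nginx uses the default max_fails=1
    server master.search1.xxxxxx.us:8080 weight=1;
}
```

Note that this changes behaviour slightly: without max_fails=0 the master can
be marked as unavailable for fail_timeout after a failed attempt, which
max_fails=0 was meant to prevent.
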
--
Igor Sysoev
http://sysoev.ru/en/