Pools randomly hang

Michael Tabolsky

Pools randomly hang
May 23, 2013 01:18AM

Hi List,

I really hope someone can help debug this problem, since I am running
out of options here ...

I have a setup of two nodes, one running nginx, the other php-fpm, with about
30 hosts/pools. All was good for a few months, but since some upgrade (I
just can't track down which one) I've started to see random pools hang at
random intervals, with children stuck like this:
[pid 7939] write(3, "122\"><a href=\"http://www.b"..., 456 <unfinished ...>
[pid 7672] write(3, "122\"><a href=\"http://www.b"..., 456^C <unfinished ...>

The fd is the connection to nginx (TCP, naturally), which has already been
dropped by nginx because of a timeout. There are no errors or warnings in the
debug log. As soon as the pool hits the children limit, the master starts to
refuse connections from nginx. Just before the stuck write of the response
starts, the php processes don't do anything suspicious; they just mmap
the files as usual, without any errors. If I kill these children, the master
spawns new ones as it should and they get stuck immediately in the same way.
This doesn't affect other pools running under different or the same UIDs;
they keep going. The only way to "recover" the "broken" pool is to
restart the master.

PHP (5.3.23) is running on CentOS 6.4 x86_64 with memcache for sessions
and no accelerators.

I also cannot correlate the problem to any external factor, like high loads
or network outages.

Any guesses, please?

Thanks a lot in advance!

--

---
You received this message because you are subscribed to the Google Groups "highload-php-en" group.
To unsubscribe from this group and stop receiving emails from it, send an email to highload-php-en+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Jérôme Loyet

Re: Pools randomly hang
May 23, 2013 02:34AM

Hi,

Is it possible for you to update to the latest PHP 5.4 version to see if the
problem still occurs?

++ Jerome


Michael Tabolsky

Re: Pools randomly hang
May 27, 2013 01:46PM

Hi,

thanks for the reply.
After reading all the php-fpm related bugs ... out of desperation I tried to
upgrade just now.
5.4.15 changes the symptoms somewhat but generally doesn't solve the
problem.
I have a pool, one single pool (and there are 5 other pools under the
same UID), that doesn't stop stalling.
With 5.4.15, it serves some requests and then stops in:
[pid 12104] recvfrom(3, <unfinished ...>
[pid 12097] recvfrom(3,

where fd 3 in each process is:
tcp 0 0 10.x:9003 10.x:52258 FIN_WAIT2 12104/php-fpm
tcp 0 0 10.x:9003 10.x:52059 FIN_WAIT2 12097/php-fpm

The new connections from nginx are getting backlogged:
tcp 1704 0 10.x:9003 10.x:52311 ESTABLISHED -
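
The same columns (state, Recv-Q, Send-Q) can also be read from /proc/net/tcp on Linux, which is handy when watching one pool's port in a loop. What follows is only a minimal sketch: the function names are made up for illustration, it covers IPv4 only, and it assumes a little-endian host (such as x86_64), where the kernel prints addresses as byte-swapped hex.

```python
import socket
import struct

# TCP state codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(hex_addr):
    """Turn '0100007F:232B' into ('127.0.0.1', 9003) (little-endian host assumed)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_proc_net_tcp_line(line):
    """Parse one data line of /proc/net/tcp into a small dict."""
    fields = line.split()
    tx_queue, rx_queue = fields[4].split(":")
    return {
        "local": decode_addr(fields[1]),
        "remote": decode_addr(fields[2]),
        "state": TCP_STATES.get(fields[3], fields[3]),
        "send_q": int(tx_queue, 16),  # bytes not yet acknowledged by the peer
        "recv_q": int(rx_queue, 16),  # bytes not yet read by the application
    }


def sockets_on_port(port, path="/proc/net/tcp"):
    """Return all IPv4 TCP sockets whose local port equals `port`."""
    with open(path) as f:
        next(f)  # skip the header line
        rows = [parse_proc_net_tcp_line(line) for line in f if line.strip()]
    return [row for row in rows if row["local"][1] == port]
```

Calling sockets_on_port(9003) on the php-fpm node would list that pool's connections. An ESTABLISHED entry with a non-zero recv_q and no owning process is a connection still waiting in the listen queue.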

and nothing works :(

I don't know what to do with it, really ...


Michael Tabolsky

Re: Pools randomly hang
May 27, 2013 02:08PM

Oh, there is also another difference with 5.4.15:
if I kill the stalled processes, the newly spawned ones serve requests for
some time. With 5.3 they were stuck on the first request.


Michael Tabolsky

Re: Pools randomly hang
June 08, 2013 03:34AM

Update,

It seems that I've succeeded in narrowing down the circumstances of this
strange behavior. The chain of events is as follows:
1. A php worker receives a request from nginx and takes some time to work
on it. During this time, nginx reaches the fastcgi timeout and tries to
close the connection.
2. The php worker doesn't receive, or ignores, the RST packet from nginx and
continues.
3. When done, it starts writing to the socket and fills the buffer, and the
socket ends up in CLOSE_WAIT.
4. At this point, the worker is stuck in
# cat /proc/16983/stack
[<ffffffff8140b756>] sk_stream_wait_memory+0x186/0x270
[<ffffffff8144f585>] tcp_sendmsg+0x705/0xa30
[<ffffffff81400ef1>] sock_aio_write+0x151/0x160
[<ffffffff8116d05a>] do_sync_write+0xfa/0x140
[<ffffffff8116d424>] vfs_write+0x184/0x1a0
[<ffffffff8116dd91>] sys_write+0x51/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
# strace -f -p 16983
Process 16983 attached - interrupt to quit
write(3, "ef=\"http://www.\" >"..., 51048

tcp 9 14608 php:9019 nginx:36970 CLOSE_WAIT 16983/php-fpm

and the kernel is backlogging the connections.
5. Now, if I kill the process, the situation recurs immediately, because
the freshly spawned process picks up a backlogged connection that is already
abandoned by nginx, and again the socket is in CLOSE_WAIT.
6. The only way to make it work again is to restart the master, so that the
backlogged connections are dropped.
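
Steps 3 and 4 can be approximated on a single machine. The sketch below is an illustration under simplifying assumptions, not a reproduction of the EC2 case: the peer here is alive and simply never reads, whereas in the thread the peer is gone and its resets never arrive. The effect on the writer is the same in both cases: once the kernel buffers are full, a blocking write() has nowhere to put the data and waits. The function name and its parameters are hypothetical.

```python
import socket


def bytes_until_send_would_block(chunk_size=4096, limit=64 * 1024 * 1024):
    """Write to a connected TCP socket whose peer never reads.

    Returns how many bytes the kernel accepted before another write
    would have blocked (or roughly `limit` if that never happened).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free port on loopback
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()  # the peer accepts but never calls recv()

    # Non-blocking mode turns "would hang" into an exception we can observe.
    sender.setblocking(False)
    chunk = b"x" * chunk_size
    sent = 0
    try:
        while sent < limit:
            sent += sender.send(chunk)
    except BlockingIOError:
        pass  # a blocking socket would be stuck inside write() right here
    finally:
        sender.close()
        peer.close()
        listener.close()
    return sent
```

The returned number depends on the kernel's buffer sizing, so it differs between systems. The point is only that it is finite: a response larger than the remaining buffer space, like the 51048-byte write in the strace output, cannot complete until the peer drains data or the connection is reset.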

There are two problems for which I don't know whom to blame. Firstly, I don't
know why backlogged connections remain in the established state and are not
dropped. Secondly, I don't know why the RST packets from nginx never arrive
at the php node.
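
On the first question, one likely factor is ordinary listen/accept behavior: the kernel completes the TCP handshake and queues the connection before any worker calls accept(), and a queued connection stays in that queue even if the client has since given up. The sketch below shows this on loopback with a client that sends a request and closes before anyone accepts. It assumes standard Linux socket semantics, the function name is hypothetical, and it is not a statement about php-fpm internals.

```python
import socket


def accept_after_client_left(payload=b"request"):
    """Accept a connection only after the client has already closed it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free port on loopback
    listener.listen(5)

    # The kernel finishes the handshake here; nobody has called accept() yet.
    client = socket.create_connection(listener.getsockname())
    client.sendall(payload)
    client.close()  # the client gives up, like nginx after its timeout

    # Only now does a "worker" pick the connection up from the backlog.
    listener.settimeout(5)
    conn, _ = listener.accept()
    conn.settimeout(5)
    data = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # empty read: the client's FIN has been seen
            break
        data += chunk
    conn.close()
    listener.close()
    return data
```

The worker still receives the full request from a client that is no longer there, which matches step 5: a freshly spawned child picks up an abandoned connection and then tries to answer it.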

The latter obviously seems to be a network problem. I think I
have mitigated it for now by setting up a point-to-point tunnel
between the hosts, so that no switch or firewall interferes with the traffic.
Most probably I've hit some "intelligent" configuration in Amazon's EC2
network. It's been 3 days since I started tunneling the traffic and the
problem hasn't shown up again. I'll give it a week and will try
to ask Amazon for clarification. I don't expect much to come out of that
conversation.

As for the first problem, I am not sure what's wrong here, but I have a gut
feeling that something is. Any comments, anyone?

Thanks!
