Hi,
An update on this: we found that the problem occurs when the limit on outstanding aio requests per worker process (worker_aio_requests, which defaults to 32) is exceeded. When that happens, nginx falls back to regular (synchronous) I/O, and for some reason this causes the kernel not to send completion notifications for some of the pending aio requests. Increasing worker_aio_requests to a larger value (we use 1024) solved the problem for us.
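For anyone hitting the same issue, the change looks roughly like the fragment below. This is a sketch rather than our exact configuration: only the value 1024 comes from our setup, while the location path and the aio/directio lines are illustrative assumptions about a typical Linux file-aio setup.

```nginx
events {
    # Raise the per-worker limit on outstanding aio operations.
    # Applies when aio is used with the epoll method; the default is 32.
    worker_aio_requests 1024;
}

http {
    server {
        # Hypothetical location, shown only to give the aio directive a context.
        location /video/ {
            aio on;
            # Assumed value; on Linux, file aio needs directio enabled,
            # otherwise reads are blocking.
            directio 512;
        }
    }
}
```

The right value depends on how many concurrent file reads a single worker is expected to have in flight, so 1024 is just the number that worked for us, not a general recommendation.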
IMHO, it would be better if nginx failed the request in this case instead of falling back to regular I/O, or at least wrote a message to the error log.
Thanks
Eran