vedranf Wrote:
-------------------------------------------------------
> Hello,
>
> I'm having an issue where the nginx (1.8) cache manager suddenly just
> stops deleting content, so the disk soon ends up full until I restart
> it by hand. After it is restarted, it works normally for a couple of
> days, but then it happens again. The cache has some 30-40k files,
> nothing huge. The relevant config lines are:
>
> proxy_cache_path /home/cache/ levels=2:2 keys_zone=cache:25m inactive=7d max_size=2705g use_temp_path=on;
> proxy_temp_path /dev/shm/temp;   # reduces parallel writes on the disk
> proxy_cache_lock on;
> proxy_cache_lock_age 10s;
> proxy_cache_lock_timeout 30s;
> proxy_ignore_client_abort on;
>
> The server gets roughly 100 rps, and normally the cache manager
> deletes a couple of files every few seconds; however, when it gets
> stuck this is all it does for 20-30 minutes or more, i.e. there are
> 0 unlinks (until I restart it and it rereads the on-disk cache):
>
> ...
> epoll_wait(14, {}, 512, 1000) = 0
> epoll_wait(14, {}, 512, 1000) = 0
> epoll_wait(14, {}, 512, 1000) = 0
> epoll_wait(14, {}, 512, 1000) = 0
> gettid() = 11303
write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked inactive cache entry 380d3f178017bcd92877ee322b006bbb, count:1\n", 123) = 123
> gettid() = 11303
write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked inactive cache entry 7b9239693906e791375a214c7e36af8e, count:24\n", 124) = 124
> epoll_wait(14, {}, 512, 1000) = 0
> ...
>
> I assume the mentioned error is due to relatively frequent nginx
> restarts and is benign. There's nothing else in the error log (except
> for occasional upstream timeouts). I'm aware this likely isn't enough
> info to debug the issue, but do you at least have some ideas on what
> might be causing it, or where to look? My wild guess is that the
> cache manager waits for some lock to be released, but it never gets
> released, so it just waits indefinitely.
>
> Thanks,
> Vedran
We have the same problem, but I'm not sure it is caused by frequent nginx restarts.
As far as I know, the problem has existed since version 1.6 (maybe even earlier; 1.4.6 from the Ubuntu repo is not affected) up to the current 1.9.9.
I've collected the related forum posts (they should help in analyzing the problem):
https://forum.nginx.org/read.php?21,258292,258292#msg-258292
https://forum.nginx.org/read.php?21,260990,260990#msg-260990
https://forum.nginx.org/read.php?2,263625,263625#msg-263625
Also, I think it's somehow related to a leak of writing connections (see the graph linked below):
https://s3.eu-central-1.amazonaws.com/drive-public-eu/nginx/betelgeuse_nginx_connections.PNG
Here is our standard nginx configuration (before January 28) with a 7-day inactive time:
proxy_cache_path /mnt/cache1/nginx levels=2:2 keys_zone=a.d-1_cache:2143M inactive=7d max_size=643G loader_sleep=1ms;
Every ~8 days (when the number of writing connections reaches the ~10k mark) the cache starts growing and fills the disk. The drops in writing connections on the graph are nginx restarts.
On January 28 I changed the inactive time to 8h. After the number of writing connections hits the ~10k mark, nginx starts filling the logs with the "ignore long locked inactive cache entry" message (1-2 messages per minute on average).
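To be concrete, and assuming nothing else in the directive was touched, the changed line now reads roughly like this:

proxy_cache_path /mnt/cache1/nginx levels=2:2 keys_zone=a.d-1_cache:2143M inactive=8h max_size=643G loader_sleep=1ms;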
As you can see, the number of writing connections grows continuously (by the time we had to power off the machine it had reached ~60k).
For counting nginx connections we use the standard http_stub_status_module.
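For reference, the counters come from a minimal stub_status endpoint, roughly like the sketch below (the location path and the allow/deny rules are just an example, not necessarily our exact setup):

# inside the server block used for monitoring
location = /nginx_status {
    stub_status on;      # versions before 1.7.5 require an argument here
    access_log off;
    allow 127.0.0.1;     # only local monitoring may read the counters
    deny all;
}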
I think that an nginx "reference counter" could be broken, because the total number of established TCP connections remains the same the whole time.