limit_rate based on User-Agent; how to exempt /robots.txt ?

Cameron Kerr
August 06, 2018 10:46PM
Hi all, I’ve recently deployed a rate-limiting configuration aimed at protecting myself from spiders.

nginx version: nginx/1.15.1 (RPM from nginx.org)

I did this based on the excellent Nginx blog post at https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted the documentation for limit_req and limit_req_zone.

I understand that you can have multiple zones in play, and that the most restrictive of all matches will apply to any matching request. I want to go the other way, though: I want to exempt /robots.txt from being rate-limited for spiders.
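For reference, my mental model (from the limit_req_zone documentation) is that a request whose key evaluates to an empty string is simply not accounted against that zone. Here is a minimal sketch of that understanding with two zones stacked; the per_ip zone and its numbers are invented for illustration and are not part of my real config:

limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;

server {
    location / {
        # Both zones apply; a request is rejected if it exceeds either.
        limit_req zone=per_ip burst=20;
        limit_req zone=per_spider_class;
        # A request whose key is empty (e.g. $user_agent_rate_key for a
        # normal browser, given the map below) is not counted at all.
        proxy_pass http://routing_layer_http/;
    }
}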

To put this in context, here is the gist of the relevant config, which aims to implement a caching (and rate-limiting) layer in front of a much more complex request routing layer (httpd).

http {
    map $http_user_agent $user_agent_rate_key {
        default                            "";
        "~our-crawler"                     "wanted-robot";
        "~*(bot/|crawler|robot|spider)"    "robot";
        "~ScienceBrowser/Nutch"            "robot";
        "~Arachni/"                        "robot";
    }

    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
}



Option 1: (working, but has issues)

Should I instead put the limit_req inside the "location / {}" stanza, add a separate "location /robots.txt {}" stanza (or some generalised form using a map), and leave limit_req out of that stanza?

That would mean that any other configuration inside the location stanzas would get duplicated, which would be a manageability concern. I just want to override the limit_req.

server {
    location /robots.txt {
        proxy_pass http://routing_layer_http/;
    }

    location / {
        limit_req zone=per_spider_class;
        proxy_pass http://routing_layer_http/;
    }
}

I've tested this, and it works.
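To keep the duplication manageable, I suppose I could factor the shared proxy settings into a snippet and include it from both stanzas. A sketch (untested; the snippet path is invented):

# /etc/nginx/snippets/routing_layer.conf would contain:
#     proxy_pass http://routing_layer_http/;
#     (plus any shared proxy_set_header, caching directives, etc.)

server {
    location /robots.txt {
        include /etc/nginx/snippets/routing_layer.conf;   # no limit_req here
    }

    location / {
        limit_req zone=per_spider_class;
        include /etc/nginx/snippets/routing_layer.conf;
    }
}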


Option 2: (working, but has issues)

Should I create a "location /robots.txt {}" stanza that has a limit_req with a high burst, say burst=500? It's not a whitelist, but perhaps it's still useful?

But I still end up with replicated location stanzas... I don't think I like this approach.

server {
    limit_req zone=per_spider_class;

    location /robots.txt {
        limit_req zone=per_spider_class burst=500;
        proxy_pass https://routing_layer_https/;
    }

    location / {
        proxy_pass https://routing_layer_https/;
    }
}
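One detail I noticed in the limit_req docs: without nodelay, requests inside the burst are queued and delayed to fit the zone rate, so a spider fetching /robots.txt would still be slowed rather than exempted. If I went this way I would probably want something like (untested):

location /robots.txt {
    # Excess requests within the burst are served immediately instead of
    # being queued at 100r/m -- still limited overall, just not delayed.
    limit_req zone=per_spider_class burst=500 nodelay;
    proxy_pass https://routing_layer_https/;
}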


Option 3: (does not work)

Some other way... perhaps I need to create a map that takes the path and produces a $path_exempt variable, and then somehow combine that with $user_agent_rate_key, returning "" when $path_exempt is set and $user_agent_rate_key otherwise.

map $http_user_agent $user_agent_rate_key {
    default                            "";
    "~otago-crawler"                   "wanted-robot";
    "~*(bot/|crawler|robot|spider)"    "robot";
    "~ScienceBrowser/Nutch"            "robot";
    "~Arachni/"                        "robot";
}

map $uri $rate_for_spider_exempting {
    default          $user_agent_rate_key;
    "/robots.txt"    "";
}

#limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;


However, this does not work: the second map does not appear to return the value of $user_agent_rate_key. The effect is that non-robots are affected as well (the load-balancer health probes start getting rate-limited).

I'm guessing my reasoning about how this works is incorrect, or that there is a limitation or some sort of implicit ordering issue.
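If the variable-as-map-value route turns out to be a dead end, one alternative I can imagine is folding both conditions into a single map keyed on a combined string. An untested sketch, assuming map's first parameter accepts variable combinations (which I believe it does on 1.15.x) and that the decoded $uri contains no spaces:

map "$uri $http_user_agent" $rate_for_spider_exempting {
    default                               "";
    # First matching regex wins, so the exemption must come first.
    "~^/robots\.txt "                     "";
    "~ .*our-crawler"                     "wanted-robot";
    "~* .*(bot/|crawler|robot|spider)"    "robot";
    "~ .*ScienceBrowser/Nutch"            "robot";
    "~ .*Arachni/"                        "robot";
}

limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;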


Option 4: (does not work)

http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate

I see that there is a $limit_rate variable that can be set, and this would seem to be the cleanest approach, except that in testing it doesn't work (requests sent with a bot User-Agent still get 429 responses).

server {
    limit_req zone=per_spider_class;

    location /robots.txt {
        set $limit_rate 0;
    }

    location / {
        proxy_pass http://routing_layer_http/;
    }
}
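Re-reading the core module docs, I suspect the problem is that $limit_rate governs the response transfer rate in bytes per second rather than request admission, so it would never interact with limit_req. A sketch of what it actually seems to be for (the /downloads/ location is invented):

location /downloads/ {
    set $limit_rate 100k;   # cap each response at roughly 100 KB/s
    proxy_pass http://routing_layer_http/;
}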


I'm still fairly new to Nginx, so I'm looking for something that decomposes cleanly into an Nginx configuration. I would quite like to have just one place where I specify the map of URLs I wish to exempt (I imagine there could be others, such as ~/.well-known/something, that could pop up).
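To make that wish concrete, something of this shape, one map maintained in a single place, is what I'm after (sketch only; names invented):

map $uri $rate_exempt_uri {
    default            0;
    "/robots.txt"      1;
    # future exemptions (e.g. something under /.well-known/) would go here
}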

Thank you very much for your time.

--
Cameron Kerr
Systems Engineer, Information Technology Services
University of Otago
