Welcome! Log In Create A New Profile

Advanced

Blocking 'good' bots with multiple conditions for certain off-limits URL's where humans can go

Posted by gplv2 
Hi Everyone,

After 2 days of searching/trying/failing I decided to post this here, I haven't found any example of someone doing the same nor what I tried seems to be working OK. I'm trying to send a 403 to bots not respecting the robots.tx file (even after downloading it several times). Specifically Googlebot. It will support the following robots.txt definition.

User-agent: *
Disallow: /*/*/page/

The intent is to allow Google to browse whatever they can find on the site but return a 403 for the following type of request. Googlebot seems to keep on nesting these links eternally adding paging block after block:

my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//page/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

It's a wordpress site btw. I don't want those pages to show up, even though after the robots.txt info got through, they stopped for a while only to begin crawling again later. It just never stops .... I do want real people to see this. As you can see, google get a 403 but when I try this myself in a browser I get a 404 back. I want browsers to pass.

root@my_domain:# nginx -V
nginx version: nginx/1.2.0

I tried different approaches, using a map and plain old nono if's and they both act the same:
(under http section)

map $http_user_agent $is_bot {
default 0;
~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider 1;
}

(under the server section)


location ~ /(\d+)/(\d+)/page/ {
if ($is_bot) {
return 403; # Please respect the robots.txt file !
}
}

I recently had to polish up my Apache skills for a client where I did about the same thing like this :

# Block real Engines , not respecting robots.txt but allowing correct calls to pass
# Google
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC,OR]
# Bing
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm\)$ [NC,OR]
# msnbot
RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ \(\+http://search\.msn\.com/msnbot\.htm\)$ [NC,OR]
# Slurp
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp\)$ [NC]

# block all page searches, the rest may pass
RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR]

# or with the wpmp_switcher=mobile parameter set
RewriteCond %{QUERY_STRING} wpmp_switcher=mobile

# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule .* - [F,L]
# End if match

This does a bit more than I asked nginx to do but it's about the same principle, I'm having a hard time figuring this out for nginx.

So my question would be, why would nginx serve my browser a 404 ? Why isn't it passing, The regex isn't matching for my UA:

"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5"

There are tons of example to block based on UA alone, and that's easy. It also looks like the matchin location is final, e.g. it's not 'falling' through for regular user, I'm pretty certain that this has some correlation with the 404 I get in the browser.

As a cherry on top of things, I also want google to disregard the parameter wpmp_switcher=mobile , wpmp_switcher=desktop is fine but I just don't want the same content being crawled multiple times.

Even though I ended up adding wpmp_switcher=mobile via the google webmaster tools pages (requiring me to sign up ....). that also stopped for a while but today they are back spidering the mobile sections.

So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone shell out a few minutes of their lives and push me in the right direction please ?

I really appreciate __ANY__ response that makes me think harder ;-)


Glenn



Edited 1 time(s). Last edit at 04/25/2012 06:04AM by gplv2.
Sorry, only registered users may post in this forum.

Click here to login

Online Users

Guests: 126
Record Number of Users: 8 on April 13, 2023
Record Number of Guests: 421 on December 02, 2018
Powered by nginx      Powered by FreeBSD      PHP Powered      Powered by MariaDB      ipv6 ready