Help: How to deal with content scrapers?

April 22, 2009 08:17PM

I need some help. Over the past few months, the site that I administer (a large medical non-profit/charity) has been attacked by content-scraping bots. These content thieves scrape our sites, repost the information on their own domains, and intersperse it with malware and ads. Their copies often rank fairly high on Google, and when a user gets infected, they blame us. I've been asking Google to delist these sites, but that takes days or weeks.

These scrapers obviously don't care about robots.txt; they scrape the content indiscriminately and ignore all the rules. I've been blocking them manually, but by the time I'm aware of the problem, it's already too late. They inflict a lot of damage on our database performance, and many users complain that the site is too slow at times. When we correlate the data, we see that the slowdowns occur while these thieves are scraping the site.

What's the best way to limit the number of requests an IP can make in a given time period, say 15 minutes? Is there a way to block them at the webserver (nginx) layer rather than the application layer, since app-layer blocking incurs too much of a performance hit? I'm looking for something that would simply count the number of requests over a particular time period and add the IP to iptables if it ever crosses the limit.
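
To make the request concrete, here is a rough sketch of the counting logic I have in mind, written in Python purely for illustration. The class name, the threshold of 1000 requests, and the idea of feeding it one (IP, timestamp) pair per request are all assumptions for this example, not an existing tool, and it assumes timestamps for each IP arrive in non-decreasing order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 15 * 60  # the 15-minute period mentioned above
MAX_REQUESTS = 1000       # hypothetical threshold; tune to real traffic


class RequestCounter:
    """Sliding-window request counter keyed by client IP (illustrative only)."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_REQUESTS):
        self.window = window
        self.limit = limit
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if this IP is now over the limit."""
        recent = self.hits[ip]
        recent.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        while recent and recent[0] <= timestamp - self.window:
            recent.popleft()
        return len(recent) > self.limit


if __name__ == "__main__":
    counter = RequestCounter(window=900, limit=3)
    for second in range(5):
        # 192.0.2.1 is a documentation-only address.
        over = counter.record("192.0.2.1", second)
        print(second, over)
```

Whatever calls `record` would be the piece that adds the offending IP to iptables once it returns True; that step is deliberately left out of the sketch.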

Any advice is much appreciated!!

Thank you,

Subject                                         Author              Posted
Help: How to deal with content scrapers?        davidr              April 22, 2009 08:17PM
Re: Help: How to deal with content scrapers?    Kon Wilms           April 22, 2009 08:44PM
Re: Help: How to deal with content scrapers?    Jonathan Vanasco    April 22, 2009 09:41PM
