Re: Help: How to deal with content scrapers?

Jonathan Vanasco
April 22, 2009 09:41PM
Some tips I learned from fighting email spam:

Sometimes the best thing you can do in this situation isn't to block,
but to identify, throttle, and change content:

- If you block, they'll try again and again before swapping IPs or
moving on to the next victim.
- If you throttle to something crazy like 1 byte/second, most bot
operators won't notice, and you'll end up tying up their connections.
- You can also send them alternate content: a mixture of gibberish
and text that identifies them as a spammer, or that would drop their
search relevance. (A rough config sketch follows below.)
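As an illustration of the throttle-and-decoy idea, here's a minimal
nginx sketch. The user-agent patterns, the decoy page, and the rate
are all assumptions to adapt, not known scraper signatures:

http {
    # flag requests whose User-Agent matches an assumed scraper pattern
    map $http_user_agent $is_scraper {
        default                 0;
        ~*(httrack|sitesucker)  1;   # placeholder signatures; tune for your bots
    }

    server {
        listen      80;
        server_name example.org;     # hypothetical host
        root        /var/www/html;

        location / {
            if ($is_scraper) {
                limit_rate 10;               # trickle ~10 bytes/sec; the bot ties up its own connection
                rewrite ^ /decoy.html break; # serve a gibberish page instead of real content
            }
        }
    }
}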

On Apr 22, 2009, at 8:17 PM, davidr wrote:

> Guys,
> I need some help. In the past few months, the site I administer
> (it's a large medical non-profit/charity) has been attacked by
> content-scraping bots. Basically, these content thieves scrape our
> sites and then repost the information on their own domains,
> interspersed with malware and ads. They quite often rank fairly
> high on Google because of it, and when a user gets infected, they
> blame us. I've been asking Google to delist these sites, but that
> takes days or weeks.
> These scrapers obviously don't care about robots.txt; they just
> indiscriminately scrape the content and ignore all the rules. I've
> been blocking them manually, but by the time I'm aware of the
> problem, it's already too late. They really inflict a lot of damage
> on our database performance, and many users complain that the site
> is too slow at times. When we correlate the data, we see that the
> slowdowns occur while these thieves are scraping the site.
> What's the best way to limit the number of requests an IP can make
> in a given time period, say 15 minutes? Is there a way to block
> them at the webserver (nginx) layer and move it away from the
> application layer, since app-layer blocking incurs too much of a
> performance hit? I'm looking for something that would simply count
> the number of requests over a particular time period and add the
> IP to iptables if it ever crosses the limit.
> Any advice is much appreciated!!
> Thank you,
> Dave
> Posted at Nginx Forum: http://forum.nginx.org/read.php?2,1361,1361#msg-1361
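For the nginx-layer rate limiting asked about above, the stock
limit_req module can rate-limit requests per IP in shared memory (a
leaky-bucket rate rather than a fixed 15-minute window), keeping the
work out of the application layer. A minimal sketch, where the zone
name, rate, and burst values are assumptions to tune against real
traffic:

http {
    # track clients by IP; a 10 MB zone holds roughly 160k IP states
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        listen 80;

        location / {
            # allow short bursts; excess requests get a 503 and a line in the error log
            limit_req zone=perip burst=20;
        }
    }
}

From there, a log watcher such as fail2ban can be pointed at the
"limiting requests" lines nginx writes to the error log and told to
add the offending IP to iptables, which covers the firewall handoff
Dave asks about.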

// Jonathan Vanasco

e. jonathan@2xlp.com
w. http://findmeon.com/user/jvanasco
blog. http://destructuring.net

| - - - - - - - - - -
| Founder/CEO - FindMeOn, Inc.
| FindMeOn.com - The cure for Multiple Web Personality Disorder
| - - - - - - - - - -
| CTO - ArtWeLove, LLC
| ArtWeLove.com - Explore Art On Your Own Terms
| - - - - - - - - - -
| Founder - SyndiClick
| RoadSound.com - Tools for Bands, Stuff for Fans
| - - - - - - - - - -