How To Fight Web Scrapers Outranking You With Your Content

Author: Gab Goldenberg

This is a guest post from Everett Sizemore, an eCommerce SEO consultant who works from his 38-acre farm in the Blue Ridge Mountains of Virginia. He enjoys gardening, collecting eggs, and tackling tough SEO problems.

Want to fight the scrapers outranking you with your own content post-Panda?

Google is losing the war against content scrapers and we’re the ones paying the price. It has gotten even worse since Panda, despite efforts to fix the problem. I know of several dozen websites that are being outranked by scrapers.

While Google might hope we’re going to do their job for them with tags like rel=canonical and rel=author, this problem isn’t going away any time soon. You know things are bad when powerhouses like TIME.com are outranked by sites like this.

Fortunately, there are many ways to keep sites from scraping your content, including easy ones like simply asking them to remove it, or bureaucratic messes like filing a DMCA complaint and reporting the site to Google. But neither of these options is practical when you’re dealing with more than a few sites or a few dozen pages of content.

In looking for an effective strategy to combat scraper bots, consider a multi-faceted approach that is scalable, and then pick off any scrapers that slip through the cracks using cease-and-desist letters and reports to Google and/or their host.

The Multi-Faceted Approach To Fighting Scrapers:

#1 – Due Diligence
First, make sure your message is heard loud and clear. Have your copyright visible in the footer of every page, and include something like the following on your Terms page, as well as in an area commented out within the body copy code:

<!-- COPYRIGHT NOTICE: The content of this page is owned by MyWebsite.com and is protected under copyright law. All rights are reserved. Using this content in any way is forbidden, except when permission has been given explicitly in writing from MyWebsite.com. Any use outside the given permissions constitutes copyright violation and will be prosecuted to the full extent of the law. All access to this website is logged. -->

While that probably won’t stop a lot of scrapers, it provides you with the first line of defense and shows that you have done your due diligence from a legal perspective should you have to file a DMCA report.

#2 – Robots.txt
In your robots.txt file, disallow any bots that you don’t want accessing your site. While a lot of scrapers will ignore robots.txt (more on that later), many of the popular tools used to scrape websites do respect it. As an example, you might consider these two entries:

User-agent: Yahoo Pipes 1.0
Disallow: /

User-agent: Yahoo Pipes 2.0
Disallow: /

Also disallow a specific bait file and record any traffic that accesses it; the page itself should carry a robots noindex, nofollow tag as well. For example…
Disallow: /do-not-open.html
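
For reference, the standard robots meta tag to put in the <head> of that bait page looks like this:

<meta name="robots" content="noindex, nofollow">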

#2.5 – Bait the Trap
Now link to the disallowed file from within the body area of every page on your site. It can just be a plain href, or you can wrap the link around a 1-pixel image if you like. All of the “real” search engines will respect your robots.txt file and never visit that URL. Any visitor that ignores it is almost certainly a rogue bot, even if it is “spoofing” Googlebot’s user-agent.
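
As a rough sketch, the bait link in each page’s body could look something like this (the pixel.gif path is just a placeholder for whatever transparent image you use):

<a href="/do-not-open.html"><img src="/images/pixel.gif" width="1" height="1" alt="" /></a>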

#3 – .htaccess (or your Windows server equivalent)
You should be recording which IP addresses, referrers and user agents access that do-not-open.html page, whether with a script or just by reviewing your log files. The next step is to block all of those from accessing your site. You have a few options here: deny them outright, send them to a custom page (like ScrewYou.html), or get a little tricky and redirect them to their own IP address, which crashes some bots by sending them into an infinite loop. Eventually they’ll get the point and stop coming to your site.
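
Here is a minimal .htaccess sketch of two of those options, assuming Apache with mod_rewrite enabled; the IP addresses are placeholders for whatever shows up in your logs against do-not-open.html:

# Deny the logged scraper IPs outright (Apache 2.2-style syntax)
Order Allow,Deny
Allow from all
Deny from 192.0.2.15
Deny from 198.51.100.42

# Or bounce a rogue bot back to its own IP with mod_rewrite
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.15$
RewriteRule .* http://192.0.2.15/ [R=301,L]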

You can try to get a head start on the process by blocking the scrapers that are already on blacklists like…
Referrer blacklist by Perishable Press
User Agent blacklist by Perishable Press

For more technical instructions, including the code to use in your htaccess file, see this great post on PerishablePress.com: Eight Ways to Blacklist with Apache’s mod_rewrite

There are probably as many ways to deal with content scrapers as there are ways to scrape content, including several WordPress plugins that mostly just ensure scrapers give you a link back by embedding links to your own site in your RSS feed. What are your favorite methods and why?

[Editor’s footnote: Conversely, if you have legitimate uses for scrapers, here’s an article on three web screen scrapers for SEO. And if you liked this post, add my rss feed to your reader! ]


Comments

  1. Very interesting. One minor issue - in this case it would be referred to as "providing adequate notice" rather than due diligence. But hey, you are an SEO, not a lawyer, so thanks for the great guest post.

    Comment by Dan - July 16, 2011 @ 4:23pm
  2. Great point there Dan - thanks for sharing!

    Comment by Gabriel Goldenberg - July 19, 2011 @ 5:28am
