3 Cheap Web Robots & Screen Scrapers for SEO Data Collection

This is a guest post by Jeffrey Russo, a Boston-area SEO at a boutique search agency. Jeff is particularly interested in the intersection of search and social, and regularly covers SEO related topics on his personal blog at jeffreyrusso.com.

This post exemplifies the creative and critical thinking I discuss in depth in my book.

One of my favorite things about SEO is that I regularly get to sit down and dig into a massive data set, searching for the non-intuitive insights that have the power to truly move the needle.

But as much as I enjoy uncovering an obscure keyword space or a fantastic link opportunity from deep within an Excel file, the slow and painful process of actually collecting the data to work with can get in the way of doing this kind of detailed analysis.

Thankfully, there are some fantastic scraping tools that can help deliver raw data from the web in a quick and scalable manner. Want to collect metrics on your competitors’ social media profiles? Pull in large amounts of text to analyze for keyword ideas? Tasks that once required either hours of manual collection or serious programming skills are now easy for anyone to carry out.

Here is a rundown of three excellent resources that can make collecting data from the web a lot easier.

Outwit

Outwit is a Firefox extension that is perfect for smaller scraping tasks. With very little prep work, it is easy to collect structured data or build automated scrapers to pull larger data sets from the web. At $38 for the full version, it is also a steal.

For quick, one-off scrapes, Outwit is tough to beat. Point it to a URL and Outwit “guesses” how the data is structured and what you are trying to collect. If you want close control over how data is scraped, that’s easy to set up too – just define the tags or other elements around the text you want, and Outwit does the rest. When you are finished scraping, exporting to Excel or other popular formats is quick and easy.
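
For readers who prefer code to a point-and-click tool, the same “define the tags around the text you want” idea looks roughly like this in Python with requests and BeautifulSoup (the library a couple of commenters below recommend). This is only a sketch of the general technique, not how Outwit works internally; the URL and the product-title class are made-up placeholders you would swap for the markup of the page you are actually scraping.

```python
# Minimal sketch of tag-based extraction. The URL and the CSS class
# below are hypothetical placeholders, not a real site.
import requests
from bs4 import BeautifulSoup

url = "http://example.com/products"  # placeholder URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Grab the text inside every <h2 class="product-title"> element.
titles = [tag.get_text(strip=True)
          for tag in soup.find_all("h2", class_="product-title")]

for title in titles:
    print(title)
```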

Outwit also has some robust macro tools to deal with larger amounts of data. Define a starting page, and Outwit can usually detect the next pages in sequence to collect data from (for instance, when scraping product titles off pages 1-100 on a large e-commerce site).

Outwit can also accept a text file of URLs to scrape when working with multiple sites or different sets of pages. The macro tools do have limitations, though – Outwit sometimes slows down when working with more than a few hundred pages at a time.
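
If you would rather script this kind of batch job yourself, here is a rough Python equivalent of both approaches: walking a numbered page sequence and reading a plain text file of URLs. The URL pattern and the urls.txt filename are assumptions made for the sketch, not anything Outwit requires.

```python
# Sketch of batch scraping: a numbered page sequence (pages 1-100) and a
# text file of URLs, one per line. URL pattern and filename are
# hypothetical placeholders.
import time
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [t.get_text(strip=True)
            for t in soup.find_all("h2", class_="product-title")]

results = []

# Pages 1-100 in sequence, e.g. a large e-commerce category listing.
for page in range(1, 101):
    results.extend(scrape_page("http://example.com/products?page=%d" % page))
    time.sleep(1)  # throttle requests so you don't hammer the site

# A plain text file with one URL per line.
with open("urls.txt") as f:
    for line in f:
        if line.strip():
            results.extend(scrape_page(line.strip()))

print(len(results), "items collected")
```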

Outwit Screen Captures

Outwit consists of several different interfaces, each for scraping different types of data and organizing it in different ways. From top to bottom: a sample of scraped URLs and product names, the scraper control panel, and the macro control panel.

Mozenda

Mozenda is a workhorse tool that handles larger and more complex jobs better than other web scrapers. While its capabilities are similar to those of Outwit, a number of features really make it shine.

For starters, it is easy to use. Even with no HTML/CSS knowledge (which is helpful when using Outwit), anyone can quickly give instructions to a scraper by following Mozenda’s guided process. Mozenda is also capable of filling in known user inputs to do things like generate search results pages.
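
To make the “filling in known user inputs” idea concrete, here is a hedged Python sketch that generates search results pages from a keyword list by building the query URLs directly. The domain, the q parameter, and the result markup are all assumptions for illustration; Mozenda handles this for you through its guided interface.

```python
# Hypothetical sketch: generate search results pages from a keyword list.
# The domain, query parameter, and result-row markup are placeholders.
import requests
from bs4 import BeautifulSoup

keywords = ["running shoes", "trail shoes", "hiking boots"]

for kw in keywords:
    resp = requests.get("http://example.com/search", params={"q": kw}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    count = len(soup.find_all("div", class_="result"))
    print("%-15s %d results on page one" % (kw, count))
```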

Mozenda’s scrapers are built locally but run from the server, an approach that is preferable for a number of reasons. Data stored on the server can be regularly updated from new scrapes, scheduling tasks is painless, and it eliminates a lot of the worries associated with using your single home IP address for scraping tasks.

While Mozenda is a strong tool that I use frequently, there are caveats. You essentially pay for blocks of pages scraped, so it can get expensive pretty quickly. There is also no Mac version of the desktop software.

Needlebase

Needlebase is a start-to-finish data acquisition, validation, and presentation platform. While it too has a strong set of scraping features similar to those of Outwit and Mozenda, Needlebase’s tools for working with your data after it has been scraped make it stand out.

Needlebase makes cleaning, structuring, and presenting your data easy. Rather than making you drag everything into Excel, Needlebase readily accepts imports from other sources, regardless of origin.

Once imported, Needlebase’s “semantic deduplication” features can understand and detect similar entries and other trouble spots that Excel could never catch automatically. Needlebase also easily normalizes commonly known entries like URLs or street addresses.
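
If you want a feel for what that kind of cleanup involves, here is a very rough Python stand-in using only the standard library: a simple URL normalizer plus fuzzy matching with difflib to collapse near-duplicate names. Needlebase’s actual matching is far more sophisticated; the sample records and the 0.8 similarity threshold below are assumptions for illustration.

```python
# Crude stand-in for normalization and near-duplicate detection, using only
# the standard library. Sample records and the threshold are made up.
from difflib import SequenceMatcher
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # Lowercase the scheme and host, drop fragments and trailing slashes.
    p = urlparse(url.strip())
    return urlunparse((p.scheme.lower(), p.netloc.lower(),
                       p.path.rstrip("/"), "", p.query, ""))

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(normalize_url("HTTP://Example.com/About/#team"))  # http://example.com/About

names = ["Acme Widget Co.", "ACME Widget Company", "Globex Corp"]
deduped = []
for name in names:
    if not any(similar(name, kept) for kept in deduped):
        deduped.append(name)

print(deduped)  # near-duplicate company names collapsed to one entry
```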

After your data is cleaned up and ready for prime time, Needlebase’s “Data Publishing Environment” makes it easy to build a clean and well-organized HTML database (either public or private) with pretty much any chart, graph, table or view easily accessible. This is fantastic for presenting your findings to clients, or keeping a structured database if you intend to use it over time.
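
There is no real substitute for Needlebase’s publishing environment, but as a point of comparison, here is a minimal sketch of the do-it-yourself route: turning a cleaned-up record set into a shareable HTML table with pandas. The sample rows and the report.html filename are made up for the example.

```python
# Minimal do-it-yourself sketch: dump a cleaned record set to an HTML table.
# Sample rows and the output filename are hypothetical.
import pandas as pd

records = [
    {"url": "http://example.com/a", "title": "Blue Widget", "price": 19.99},
    {"url": "http://example.com/b", "title": "Red Widget", "price": 24.99},
]

df = pd.DataFrame(records).sort_values("price")
with open("report.html", "w") as f:
    f.write(df.to_html(index=False))
```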

Needlebase Example DB

An example of a Needlebase published data set (credit: needlebase.com). This linked HTML document allows users to sort and visualize data in a number of different formats.

While Needlebase has a strong free offering, if you want to scrape more than 5,000 pages per month or make your data private, you’ll have to pay (monthly packages start at $399).

A few ideas to start with

With the ability to quickly harvest data from the web comes an endless number of new ways to optimize. Here are a few things I’ve either tried or have in the pipeline…

  • Quickly audit on-page optimization of a large site by pulling title tags, metadata, text content, and just about anything else into Excel (see the sketch after this list)
  • Scrape a client’s site to generate product lists, keyword ideas, or to analyze site content en masse
  • Use Needlebase to build a directory or a structured dataset by scraping or importing internal data (this is a quick way to build a great piece of linkbait)
  • Scrape social networks to pull in competitors’ tweets or shared content and engagement metrics (PostRank, Klout score, follower/retweet trends) to see what is working for them on different social sites
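
As an example of the first idea above, here is a hedged Python sketch that crawls a list of URLs and writes each page’s title tag and meta description to a CSV you can open straight in Excel. The urls.txt input and onpage_audit.csv output names are placeholders for the sketch.

```python
# Sketch of a simple on-page audit: titles and meta descriptions to CSV.
# Input and output filenames are hypothetical placeholders.
import csv
import requests
from bs4 import BeautifulSoup

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("onpage_audit.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "title", "meta_description"])
    for url in urls:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        meta = soup.find("meta", attrs={"name": "description"})
        desc = meta["content"] if meta and meta.has_attr("content") else ""
        writer.writerow([url, title, desc])
```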

These ideas are only the tip of the iceberg, and I’d love to hear how you are leveraging scrapers.

What SEO tasks are you using scraping tools for? Have scraping tools played a role in an important project or task that you otherwise wouldn’t have been able to do?


Comments

  1. Well, what about Scrapebox? I have heard a lot about this tool but don't know much about it. Does this fall into the web robot scraper category? But Outwit sounds good; I might give it a try. Thanks

    Comment by Thomas - January 6, 2011 @ 1:26pm
  2. Hey Gab, will have a look. Scraping and botting are definitely becoming important as the web evolves towards semantic sites. Drop me an email whenever you have some free time :)

    Comment by Jeremy - January 10, 2011 @ 2:32am
  3. For the python crowd I can heartily recommend beautiful soup: http://bit.ly/hAlPbz

    Comment by Ryan Underdown - January 10, 2011 @ 1:50pm
  4. Hey Gabe -- still waiting on the book. I really like Scrapebox a lot. I use it to check a large list of URLs to see if they are indexed -- for websites that I'm working on and backlink analysis. Same deal with Google PageRank scraping. I'm hoping that SB adds a feature to scrape cache dates. There's a LOT more that you can do with it -- especially Black Hat stuff. Needlebase sounds very intriguing.

    Comment by Devin - January 24, 2011 @ 10:58pm
  5. It's an ok list, but beautiful soup wins hands down. Take 30 minutes to dive into python - it may SEEM intimidating, but it'll click and you'll never go back.

    Comment by Pavlicko - February 10, 2011 @ 6:54pm
  6. Outwit released a standalone version, so you don't need Firefox anymore, and it supports command line parameters so you can put macros in cron jobs. Working with huge volumes of data (pages, results...) seems to be quicker too.

    Comment by Grubshka - April 13, 2012 @ 2:27pm
