Blog Search Ranking Factors – Google’s BlogSearch Algorithm

Author: Gab Goldenberg

Google’s blog search patent application was published in March 2007 (though it was filed two years earlier). Like Google’s near-legendary paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine, it explains how the search engine may rank documents. In other words, it gives an idea of what makes the first blog show up first and what makes the second show up second. So understanding the patent is essential if you want to rank a blog in Google’s blog search.

Luckily for you, you don’t need to slog through the whole long, technical document. SEO Bill Slawski and fellow good soul Alister Cameron have analyzed the patent and published their analyses for the public benefit. What follows is my summary, which aggregates and simplifies the patent and their analyses.

Three umbrella factors emerge from reading the patent’s abstract.

1) Document Relevance
2) Blog Post Quality (read: assessing authority in the blog world…); possibly also Blog Quality generally, as per Bill Slawski
3) “The blog search engine may also provide information regarding the group of blog documents based on the determined scores.” – Patent abstract. This suggests that the display will not only list results but potentially also offer ratings and reviews or other “information” about the blog.

Bill Slawski’s analysis jumps right to the blog and blog post quality assessment part of the patent since the relevance factors are essentially the algorithm running Google’s main search engine. He writes, roughly quoting the patent:

The next step [after deciding what info to use for the sake of analysis] is to identify positive indicators of quality:

* Popularity of the blog,
* Implied popularity of the blog,
* Inclusion of the blog in blogrolls,
* Existence of the blog in high quality blogrolls,
* Tagging of the blog,
* References to the blog by other sources,
* A pagerank of the blog, and;
* Other indicators could also be used.

Popularity, both our analysts concur, will be based on RSS feed readership. In the world of blogging, instead of subscribing to an email newsletter, readers subscribe to an “RSS feed.” The distinction between email technology and RSS feed technology is unimportant for our purposes. The point is that the more people subscribing, the more popular a blog is considered.

The data on subscriptions would be obtained through Google Reader, Google’s RSS feed reader, as well as through Feedburner, one of the web’s most popular RSS services. Google recently bought Feedburner, which somewhat reinforces what I’ve said before (as have others) about Google knowing too much about the average person. But I digress.

Implied popularity is a fancy name for click-through rate (CTR). The idea isn’t very original; Claria proposed it a while ago. In layman’s terms, the more people click through to your blog from the search results, the more popular it is implied to be.
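
To make the arithmetic concrete, here is a minimal Python sketch of a click-through calculation. The function name and the sample numbers are my own illustration; the patent doesn't publish a formula, so treat this as one assumption about how implied popularity could be measured.

```python
def click_through_rate(clicks, impressions):
    """Share of search-result impressions that led to a click.

    Hypothetical helper: the patent does not say how clicks are counted
    or normalized, so this is only the textbook definition of CTR.
    """
    if impressions <= 0:
        return 0.0  # no impressions means nothing to measure
    return clicks / impressions


# Two made-up blogs shown equally often for the same query.
blog_a = click_through_rate(clicks=30, impressions=1000)
blog_b = click_through_rate(clicks=5, impressions=1000)

# Under this assumption, blog A has the higher implied popularity.
more_popular = "A" if blog_a > blog_b else "B"
```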

Inclusion in blogrolls and high quality blogrolls will be treated the way PageRank was originally intended to treat links, before the system became known and abused. That is, each [blogroll] link to another blog will count as a vote for that blog. If it comes from a high quality blogroll, additional “trust” points will be earned (the points analogy is my own).
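
For the technically inclined, the votes-plus-trust-points analogy can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name, the example domains, and especially the `trust_bonus` weight, since the patent gives no numbers.

```python
from collections import Counter


def blogroll_scores(blogrolls, trusted, trust_bonus=1.0):
    """Count one vote per blogroll link, plus a bonus from trusted blogrolls.

    blogrolls: dict mapping a blog to the list of blogs in its blogroll.
    trusted: set of blogs whose blogrolls are treated as high quality.
    trust_bonus: hypothetical extra weight; the patent gives no figure.
    """
    scores = Counter()
    for source, linked_blogs in blogrolls.items():
        weight = 1.0 + (trust_bonus if source in trusted else 0.0)
        for target in set(linked_blogs):  # ignore duplicate links
            if target != source:          # a blog cannot vote for itself
                scores[target] += weight
    return dict(scores)


# Made-up example data.
blogrolls = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "d.example": ["c.example", "b.example"],
}
scores = blogroll_scores(blogrolls, trusted={"a.example"})
# c.example collects the most points: three votes, one of them trusted.
```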

Tagging of the document might help indicate secondary keywords the blog is relevant for. Essentially, tags would function the way anchor text does in the main Google algorithm, describing the content. Tagging might also point to a Teoma-style clustering algorithm. A clustering algorithm groups websites by subject “neighbourhood”. Links from within the neighbourhood are worth more, the same way links from high quality blogrolls are worth more. If Google’s blog search also integrates Teoma-style “refine search” and “related resources” options, there could be additional targets and real estate for SEOs to rank their clients for.

Alister Cameron at Problogger, one of the analysts whose work I’m summarizing, suggests that this emphasis on tagging will make Social Media Optimization a necessary skill for SEO. I’m inclined to agree.

Outside references to a blog could also be important. Alister suggests that these references may include links in email, meaning Google (or its robots, anyway) might be reading your mail. As far as SEO goes, Alister draws yet another intelligent conclusion: aim to get your post and blog mentioned in email. That can mean word of mouth marketing (think of those jokes that get forwarded by email for years), “email this post” buttons on your blog, and other initiatives.

PageRank is the name for the original Google algorithm, laid out in Google founders Sergey Brin and Larry Page’s landmark paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine. (The same one I mentioned above.) To make a long story short, every web page (blog pages included) gets a score based on the links pointing to it from other pages, be they on the same website or on others. The Google Toolbar displays that score on a scale of 0-10 (10 being the best). The score reflects a page’s importance rather than its relevance to a particular query; relevance is judged separately, from the text found on the page and the anchor text of the links to it.

Bill Slawski continues his analysis a little further than Alister Cameron and also considers the factors that can negatively impact a ranking. These factors are mostly aimed at fighting spam. For those of you who want the most comprehensive understanding of blog search you can get, read on. Otherwise, feel free to ignore what’s below, as there’s nothing that will help you rank higher; you’ll just figure out how to avoid being labelled as spam. Skip ahead to the last paragraph for more tips. The negative indicators are:

* Frequency of new posts,
* Content of the posts,
* Size of the posts,
* Link distribution of the blog,
* The presence of ads in the blog, and;
* Other indicators may also be used.

Frequency of posts will be used in combination with Google’s knowledge of spam blogs, or splogs, in order to fight spam. The blog search patent refers to patterns in the frequency of updates that are highly correlated with spam blogs; when these patterns are noticed, the blog will see its ranking negatively affected.

Where content is concerned, the blog search patent suggests that a disconnect between RSS feeds and what is actually on the blog tends to occur on spammy sites. The feed in such a scenario shows one form of content, in order to get ranked well, while the site shows different content to visitors (e.g. ads).

For those technically inclined, the explanation for how this works is that Google’s blog search indexes blogs through their feeds. This indexing-by-proxy means that spammers can disguise the proxy (the RSS feed) as on-topic material while showing irrelevant, yet profitable ads on the real site. It’s cloaking spam brought to the blog world.
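
Here is a rough Python sketch of how such a feed-versus-page check could work. To be clear, the word-overlap measure (Jaccard similarity) is my own stand-in; the patent doesn't say how Google compares the two, and the function names and sample strings are made up.

```python
import re


def word_set(text):
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def feed_page_overlap(feed_text, page_text):
    """Jaccard similarity between the words in the feed and on the page.

    1.0 means the same vocabulary, 0.0 means no words in common.
    """
    feed_words, page_words = word_set(feed_text), word_set(page_text)
    union = feed_words | page_words
    if not union:
        return 1.0  # two empty documents are trivially identical
    return len(feed_words & page_words) / len(union)


# Made-up example: an honest blog and a cloaking splog.
feed = "Ten tips for growing tomatoes in small gardens"
honest_page = "Ten tips for growing tomatoes in small gardens"
cloaked_page = "Cheap pills casino bonus click here"

honest_score = feed_page_overlap(feed, honest_page)    # identical text
cloaked_score = feed_page_overlap(feed, cloaked_page)  # no shared words
```

A low score would only be a hint, not proof: plenty of legitimate blogs publish partial feeds, so a real system would need something more forgiving than this.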

Other points relating to content: a) Duplicate content will also serve as a flag to Google that a site is spammy. b) Keyword stuffing, and specifically stuffing of keywords related to the generic pharmaceutical/gambling/other spammy industries will serve to flag spam.

Post size is really straightforward. Splogs auto-generate content that tends to be about the same length. If all posts are around 500 words, the blog will set off spam alerts at Google HQ.
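
As a sketch of what "about the same length" could mean in practice, the Python below measures how much post lengths vary relative to their average (the coefficient of variation). The 5% threshold and the function names are assumptions of mine, not anything stated in the patent.

```python
from statistics import mean, pstdev


def length_variation(word_counts):
    """Coefficient of variation of post lengths (std deviation / mean)."""
    if len(word_counts) < 2 or mean(word_counts) == 0:
        return None  # not enough data to judge
    return pstdev(word_counts) / mean(word_counts)


def looks_uniform(word_counts, threshold=0.05):
    """True when post lengths barely vary.

    threshold is a hypothetical cut-off chosen for illustration only.
    """
    variation = length_variation(word_counts)
    return variation is not None and variation < threshold


# Made-up word counts for two blogs.
splog_like = looks_uniform([500, 498, 503, 501])    # nearly identical lengths
human_like = looks_uniform([120, 900, 450, 2000])   # lengths all over the map
```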

Link distribution refers to the percentage of outgoing links in a post that go to any particular site. If there are many links, all of which send visitors to one site – a PageRank manipulation technique – the blog will be marked as spam.
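
The percentage described above is easy to compute. This Python sketch finds the site that receives the largest share of a post's outgoing links; the function name and the example URLs are hypothetical, and the patent doesn't say what share would count as suspicious.

```python
from collections import Counter
from urllib.parse import urlparse


def top_link_share(urls):
    """Return (host, share) for the host receiving the most outgoing links."""
    hosts = [urlparse(url).netloc.lower() for url in urls]
    hosts = [host for host in hosts if host]  # drop relative or malformed links
    if not hosts:
        return None, 0.0
    host, count = Counter(hosts).most_common(1)[0]
    return host, count / len(hosts)


# Made-up outgoing links from a single post.
links = [
    "http://pills.example/buy",
    "http://pills.example/cheap",
    "http://pills.example/now",
    "http://news.example/story",
]
host, share = top_link_share(links)
# Three of the four links point at pills.example, a share of 0.75.
```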

In Bill’s opinion, the advertising factor refers to the location of ads, and specifically to attempts to hide them in content. In other words, this measure would likely be aimed at sellers of text link ads. But that isn’t necessarily so based on the text of the patent alone, which simply reads:

The presence of ads in the blog document may be a negative indication of quality of the blog document. If a blog document contains a large number of ads, this may be a negative indication of the quality of the blog document.

In other words, the judgment may simply be made based on how much advertising the blog has on it. Having made a blog through Google’s Blogger service that had many ads on it, and having been subsequently required to prove I wasn’t an automated spambot, I know that this is already being done, probably with some success.

The flip side is that if you have a popular blog and then add lots of advertising, you probably won’t get flagged as spam. But starting out, you’re best off waiting before you put up ads (the trickle of traffic and handful of clicks probably won’t get you rich off them anyway, so you may as well wait). You may want to see more on Google’s TrustRank and indexing.

Finally, I’d like to discuss the whole “other indicators may be used” point, which was mentioned in both the positive and negative factors sections. What Bill and Alister seem to have overlooked is a line in the abstract that refers to providing more information to Blog search users.

While the patent doesn’t provide any detail on what the other factors/information provided might be, I’d guess that we can expect something similar to the OneBox results in Google’s main search results, whereby there’s rating information (0-5 stars) and the like. Thus if/when the search results are presented with this additional information, search marketers will need to figure out how to modify it to get their message out and bring visitors in. This could easily be the “other” factor being used in Google’s Blog search ranking algorithm.

Before I wrap this up, I want to share one more tip: use your keywords in the blog’s name. Blogs that have their keywords in the name will get listed at the top of Blog search in a sort of OneBox area, ahead of all the blog posts that are returned. I say this from experience. A blog of mine had its keywords in the name (not even in the URL; I just rebranded it to include the keywords) and it routinely showed up in Google’s Blog search OneBox.

