While I was emailing with Aaron Wheeler about SEOmoz’s Link Intersect, he gave me the following insight into how they crawl the web and prioritize their choices of links. It’s pretty fascinating from a tech perspective…
Thanks for writing in and sorry that you weren’t able to find that directory in the Competitive Link Finder tool. Competitive Link Finder is still in the Labs section of the site, which means it’s beta, prone to breaking, and we’re either working on tools to use the technology demonstrated in the Labs tool or we don’t plan on working on it any further (but it’s still cool enough to show off!). It really depends. Regardless, we can’t provide too much support for these Labs tools because they’re not on our main road map. Sorry about that!
If you have specific feedback about why you think http://montrealloft.ca/fr/liste_promoteurs.php should have been listed in those results [my initial email asked why it wasn't], let me know and I’ll pass on the feedback. We really appreciate it! It does look like http://montrealloft.ca/fr/liste_promoteurs.php isn’t in our Linkscape Index of the web, so it’s probably not going to show up throughout our site Index-wise.
Most new sites and links will be indexed by our spiders and available in Linkscape and Open Site Explorer within 60 days, but some take even longer for a plethora of reasons, including crawl-ability of sites, the amount of inbound links to them, and the depth of pages in subdirectories.
Just so you know, here’s how we do our index: we take the last index, take the 10 billion URLs with the highest mozrank (with a fixed limit on some of the larger domains), and start crawling from the top-down until we’ve crawled 40,000,000,000 pages (which is about 1/4 of the amount in Google’s index). Therefore, if the site is not linked to by one of these seed URLs (or one of the URLs linked to by them in the next update) then it won’t show up in our index.
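[An aside from me, not part of Aaron’s email: the seed-selection step he describes can be sketched roughly as below. This is only my own illustration under stated assumptions. The function name, the score dictionary and the cap value are all hypothetical, and the cap is applied per host here, which may not be how SEOmoz actually limits the larger domains.]

```python
from collections import defaultdict
from urllib.parse import urlsplit


def select_seeds(scores, seed_limit, per_domain_cap):
    """Pick the highest-scoring URLs, keeping at most per_domain_cap per host.

    scores: dict mapping URL -> score (a stand-in for mozrank; hypothetical data).
    """
    taken = defaultdict(int)  # how many seeds each host has contributed so far
    seeds = []
    # Highest score first; ties are broken by URL so the result is deterministic.
    for url, _score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(seeds) >= seed_limit:
            break
        host = urlsplit(url).netloc.lower()
        if taken[host] >= per_domain_cap:
            continue  # this host already has its share of seeds
        taken[host] += 1
        seeds.append(url)
    return seeds


# Toy data: three pages on one big site, one page each on two smaller sites.
example_scores = {
    "http://a.example/": 9.0,
    "http://a.example/x": 8.5,
    "http://a.example/y": 8.0,
    "http://b.example/": 7.0,
    "http://c.example/": 6.0,
}
print(select_seeds(example_scores, seed_limit=3, per_domain_cap=2))
# ['http://a.example/', 'http://a.example/x', 'http://b.example/']
```

[With a cap of 2, the third a.example page is skipped even though it outscores b.example, which is presumably the point of limiting the larger domains.]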
We update our Linkscape Index every 3 to 5 weeks. Crawling the whole internet to look for links takes 2-3 weeks. And then we’ve got 1-2 weeks of processing to do on those links to determine which are the most important links etc. You can see a schedule of how often we update, and planned updates here.
Linkscape focuses on a breadth-first approach, and thus we nearly always have content from the homepage of websites, externally linked-to pages and pages higher up in a site’s information hierarchy. However, deep pages that are buried beneath many layers of navigation are sometimes missed and it may be several index updates before we catch all of these.
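[Another aside from me, not part of Aaron’s email: here is a minimal sketch of a breadth-first crawl with a fixed page budget, to show why deep pages get missed. It runs over a small in-memory link graph instead of the real web, and the function name, the graph and the budget are my own hypothetical stand-ins, not anything from Linkscape.]

```python
from collections import deque


def crawl_breadth_first(seeds, links, page_budget):
    """Visit pages level by level from the seeds until the budget is spent.

    links: dict mapping a page to the pages it links to (toy stand-in for fetching).
    """
    queue = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:  # ignore duplicate seeds
            seen.add(seed)
            queue.append(seed)

    crawled = []
    while queue and len(crawled) < page_budget:
        page = queue.popleft()  # FIFO order is what makes this breadth-first
        crawled.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return crawled


# Toy site: the specs page sits three clicks below the homepage.
example_links = {
    "home": ["about", "products"],
    "products": ["widget"],
    "widget": ["widget/specs"],
    "orphan": ["home"],  # nothing links to this page
}
print(crawl_breadth_first(["home"], example_links, page_budget=3))
# ['home', 'about', 'products']
```

[The budget runs out before the crawl reaches "widget" or "widget/specs", and "orphan" would never be found at any budget because nothing links to it.]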
If our crawlers or data sources are blocked from reaching those URLs, they may not be included in our index (though links that point to those pages will still be available). Finally, the URLs seen by Linkscape must be linked-to by other documents on the web or our index will not include them.
I hope this information helps! While the site may not be indexed yet, give it some time – maybe we’ll see it in OSE and Linkscape next month.
Best of luck!
I wrote Aaron back:
Thanks a lot for the in-depth answer – that was really helpful! Can I actually publish that on my site? I’ll bet other folks would be interested in the answer.
Re: Link intersect and the presence/absence of particular pages in it, Link Intersect is one of the key reasons I subscribe, along with Q+A and maybe keyword difficulty. And I know a lot of other advanced link builders value it highly, because it answers the “why” question – why are competitors in the niche being given links – in a way most likely to be imitable. As to that particular directory, being that it’s on a high quality website, I’d encourage you guys to look at how to include more like it in your index. It’ll make Link Intersect and Linkscape/OSE in general a lot more valuable.
And his response:
My pleasure! Glad it was helpful. You can most certainly publish it on your site. =)
It’s really good to hear your feedback about your tool usage; it helps us figure out what to prioritize! It would be pretty awesome to include the LI tool in a future iteration of Open Site Explorer, as we’re already making changes to OSE to roll out in a couple months.
Did you know we have a feature request forum? It’s a great place to share your ideas. Other people can vote on them and it will help us determine priorities. You should check it out: http://seomoz.zendesk.com/forums
Regardless, I’ll pass the message along to our team and we’ll see what we can do to add more functionality to Link Intersect down the line. Thanks again for letting us know! Best of luck with your upcoming book; hope we help you keep your hats in order. =)
Have a great day,