A: Robots.txt is a text file that tells search engine spiders, also known as search engine robots, which parts of your website they can enter and which parts they can’t. You can think of it as a nightclub’s bouncer or doorman, whose job it is to keep certain people (in our case the search engine robots) out of a certain VIP section of the club.
Now you may be thinking, I thought SEO was supposed to get my website fully crawled and indexed by the search engines! Why in the world would I want to keep their robots away from some of my pages? It’s a good question, and the simple answer is that most websites don’t want any of their pages hidden from the search engine spiders.
The main reason why robots.txt would be used is to keep sensitive information private.
For example, if an ecommerce site uses a database with clients’ personal information stored on it, and this information is accessible through the website, robots.txt would be used to tell search engine spiders not to crawl (and therefore not to index) the access pages. If Yahoo’s Slurp robot collected the site’s clients’ credit card information, then anyone searching Yahoo could access those numbers! To avoid this PR and legal nightmare, robots.txt would be used to hide the database from Yahoo.
Another case for using robots.txt would be the situation where expensive research was being posted online. For example, imagine that pharmaceutical giant Pfizer had its research department using a wiki (a sort of collaborative website to which anyone can easily contribute and post updates) through Pfizer.com to share ideas and discoveries. It could use robots.txt to make the wiki off-limits to search engine spiders.
Important Note: Password protection, firewalls, encryption and other security devices should be used in conjunction with robots.txt. Search engines are only one of many ways people come to visit a site, and so the security of sensitive data needs to be considered in light of human surfing behaviour, not just robots. Besides which, malicious spammers also use spiders and robots, and the first part of a website they’ll go to is often what’s been marked “Do not enter” in the robots.txt file. They know that’s where the “valuables” are kept.
At the time of this writing, Mcdonalds.com’s robots.txt excluded robots from crawling a directory called tools. If you were to try to visit Mcdonalds.com/tools, you would get an Error: file not found page (both by trying http://mcdonalds.com/tools and http://www.mcdonalds.com/tools ; see discussion below about canonicalization problems). Other companies simply redirect such queries to the homepage. Both approaches are valid, and worth implementing.
Canonicalization problems occur where multiple pages on a website contain the same information. For example, a product page might also have a “Print” version to make it easier to print out the specs and details on the product. This can pose a problem to search engines, which then have to figure out which version of the page (i.e. site.com/product-a.html or site.com/product-a-print.html) is the canonical one. Robots.txt would be used to keep the secondary version(s) from being indexed.
However, this solution isn’t ideal, because some people will link to a secondary page. If robots have been told to keep out, then the SEO benefit conferred by the inbound link may be lost. A better alternative may be to redirect search engines from all the secondary pages to the primary page, so that all links are counted. In addition, the links will be counted towards the benefit of one specific page (as opposed to two or more), making it easier to achieve high rankings and pull in traffic.
(In case you’re wondering, “canonicalization” and its related words come from the world of religious writing. There, books or theories and other ideas may develop for a certain time until they are “canonized” – declared the best – and may no longer be edited. They are then part of the “Canon” or official doctrine.
As to the problem being referred to as a “duplicate content” issue, this is incorrect because duplicate content refers to a situation where two or more websites share the same content. Obviously, robots.txt can’t disallow pages on someone else’s website and is therefore useless in a duplicate content situation. Here the duplicate versions of the content are all on the same website. The difference is like that of a big company’s IT department and marketing department both publishing separate but identical press releases (the canonicalization issue) as opposed to two rival companies both claiming a new discovery as their own.)
How to write robots.txt and implement it on your website
The first thing you need to do is write a robots.txt file (i.e. create one).
Open Notepad by clicking Start -> Accessories -> Notepad (Mac users can use any text editor that can output “.txt” format files).
Now, type (or copy-paste) the following two lines into Notepad.
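A common two-line starting point, assumed here because it blocks nothing until you customize it, is the record held in STARTER below. For readers who would rather script this step than use Notepad, the Python sketch also saves the text under the lower-case file name. STARTER and write_robots_txt are illustrative names, not part of any standard.

```python
from pathlib import Path

# Assumed starter content: applies to every robot and disallows nothing.
STARTER = "User-agent: *\nDisallow:\n"

def write_robots_txt(folder, content=STARTER):
    """Save `content` as a lower-case robots.txt inside `folder`."""
    path = Path(folder) / "robots.txt"
    path.write_text(content, encoding="ascii")
    return path
```

Calling write_robots_txt(".") would create the file in the current folder, ready to be customized and uploaded.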
Next, save your file as “robots.txt”. The .txt extension should be on by default if you’re using Notepad. You just need to save the file as “robots” (without the quotation marks). Be sure to use lower-case, because some technologies may not work well with a “Robots.txt”, for example.
The fourth step is to customize the robots.txt file by writing out which robots you want to exclude (by altering the User-Agent line) and from what files or folders you want to exclude them. Below are some examples from the official robots.txt site. Once you’re done customizing your robots.txt file, you just upload it to the root directory of your website (it won’t work if you upload it anywhere else, such as a subdirectory).
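To make the "root directory only" rule concrete, here is a small Python sketch that works out where a robot will look for the file, starting from any page on the site. The function name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urljoin

def robots_url(page_url):
    """Return the one place robots look for the file: the site root."""
    return urljoin(page_url, "/robots.txt")

print(robots_url("http://www.example.com/shop/item.html"))
# http://www.example.com/robots.txt
```

A file uploaded to /shop/robots.txt would never be requested, which is why a subdirectory upload has no effect.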
What to put into the robots.txt file
The “/robots.txt” file usually contains a record looking like this:
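A record of this kind, reconstructed on the assumption that it names the cgi-bin, tmp and ~joe directories, is the text inside the triple quotes below. The Python around it is only one way to check the record, using the standard library's urllib.robotparser; AnyBot and example.com are made-up names.

```python
from urllib.robotparser import RobotFileParser

RECORD = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
parser.parse(RECORD.splitlines())

# Any robot is kept out of the listed directories...
print(parser.can_fetch("AnyBot", "http://example.com/tmp/cache.html"))  # False
# ...while everything else stays crawlable.
print(parser.can_fetch("AnyBot", "http://example.com/index.html"))      # True
```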
In this example, three directories are excluded. [Namely the cgi-bin, tmp and ~joe directories. A directory in server jargon is a file folder: it's a place where other files are kept and from which they may be accessed. By disallowing an entire directory, you block all the files it contains from search engine robots. These directories should not be confused with yellow-page style directories that list websites, such as Joe Ant.]
Note that you need a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /tmp/”. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
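Note also that regular expressions are not supported in either the User-agent or Disallow lines under the original robots.txt standard. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot rely on lines like “Disallow: /tmp/*” or “Disallow: *.gif”; some search engines understand such wildcards as an extension, but robots that follow only the original standard do not.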
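To see why such lines fail, here is a minimal Python sketch of the plain prefix matching that the original standard describes. The helper is_disallowed is a hypothetical name and a simplification, not any search engine's actual code.

```python
def is_disallowed(path, disallow_values):
    """Original-standard matching: a rule blocks any path that starts with it."""
    return any(path.startswith(rule) for rule in disallow_values if rule)

# The '*' is treated as an ordinary character, so these rules block nothing.
print(is_disallowed("/tmp/cache.html", ["/tmp/*"]))   # False
print(is_disallowed("/images/logo.gif", ["*.gif"]))   # False
# A plain prefix does the job.
print(is_disallowed("/tmp/cache.html", ["/tmp/"]))    # True
```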
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
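The record for this case is the two lines inside the triple quotes. The surrounding Python is just a quick check with the standard library's urllib.robotparser, and AnyBot and example.com are placeholder names.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "/" is a prefix of every path, so nothing may be fetched.
print(parser.can_fetch("AnyBot", "http://example.com/"))             # False
print(parser.can_fetch("AnyBot", "http://example.com/a/page.html"))  # False
```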
To allow all robots complete access
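Here the Disallow line is left empty, which means nothing is off-limits. As a hedged check, the sketch below feeds that record to Python's urllib.robotparser; AnyBot and example.com are placeholder names.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An empty Disallow value excludes nothing.
print(parser.can_fetch("AnyBot", "http://example.com/any/page.html"))  # True
```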
Or create an empty “/robots.txt” file.
To exclude all robots from part of the server
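A sketch of this case, with cgi-bin, tmp and junk standing in as example directory names; swap in your own folders. The Python wrapper only verifies the record with urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("AnyBot", "http://example.com/junk/old.html"))  # False
print(parser.can_fetch("AnyBot", "http://example.com/products.html"))  # True
```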
To exclude a single robot
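In this sketch the unwanted robot is called BadBot, a made-up name; replace it with the user-agent name of the robot you want to keep out. The check again uses Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Only the named robot is shut out; others have no record and are allowed.
print(parser.can_fetch("BadBot", "http://example.com/page.html"))    # False
print(parser.can_fetch("OtherBot", "http://example.com/page.html"))  # True
```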
To allow a single robot
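This case needs two records separated by a blank line: one that lets the favoured robot in, and a catch-all that keeps everyone else out. The robot name Google follows the style of the official examples, and OtherBot is a placeholder; the Python is only a check with urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named robot matches its own record and may go anywhere.
print(parser.can_fetch("Google", "http://example.com/page.html"))    # True
# Every other robot falls through to the catch-all record.
print(parser.can_fetch("OtherBot", "http://example.com/page.html"))  # False
```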
To exclude all files except one
This is currently a bit awkward, as the original standard has no “Allow” field (some search engines accept one as an extension). The easy way is to put all files to be disallowed into a separate directory, say “docs”, and leave the one file in the level above this directory:
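A sketch of that layout, assuming the files live under a /~joe/ folder and the “docs” directory sits inside it; both paths are illustrative. The check uses Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /~joe/docs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Everything inside docs is blocked...
print(parser.can_fetch("AnyBot", "http://example.com/~joe/docs/draft.html"))  # False
# ...while the one file left a level above stays reachable.
print(parser.can_fetch("AnyBot", "http://example.com/~joe/index.html"))       # True
```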
Alternatively you can explicitly disallow all disallowed pages:
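For instance, with three hypothetical pages named junk.html, foo.html and bar.html under an assumed /~joe/ folder, the record lists each one on its own Disallow line. The Python around it is a quick check with urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("AnyBot", "http://example.com/~joe/foo.html"))    # False
print(parser.can_fetch("AnyBot", "http://example.com/~joe/index.html"))  # True
```

This keeps working as long as the list is updated whenever a new page needs hiding, which is why the separate-directory approach is usually easier to maintain.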