Where do I find out how /robots.txt files work?
You can read the full standard specification, but the basic concept is simple: by writing a structured text file, you tell robots that certain parts of your server are off-limits to some or all of them. It is best explained with an example:
    # /robots.txt file for http://webcrawler.com/
    # mail webmaster@webcrawler.com for constructive criticism

    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /tmp
    Disallow: /logs
The first two lines, starting with '#', are comments.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
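To see how a robot might interpret this file, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name 'otherbot', the sample paths, and the helper allowed() are made up for illustration; they are not part of the example file or the specification.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, one directive per line.
ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(robot, path):
    """Return True if `robot` may fetch `path` under the rules above."""
    return parser.can_fetch(robot, path)


# 'otherbot' is a hypothetical robot that matches only the '*' record.
for robot, path in [
    ("webcrawler", "/tmp/scratch.html"),
    ("lycra", "/index.html"),
    ("otherbot", "/tmp/scratch.html"),
    ("otherbot", "/logs/access.log"),
    ("otherbot", "/index.html"),
]:
    print(robot, path, allowed(robot, path))
```

With this parser, 'webcrawler' should be allowed everywhere, 'lycra' nowhere, and any other robot everywhere except under /tmp and /logs. Other robots may implement the matching differently, so treat this as an illustration of the rules rather than a guarantee of how a given crawler behaves.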
Two common errors:
- Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
- You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).
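Both errors are easy to check for mechanically. The following is a rough sketch of such a check; the function name lint_robots_txt and the message strings are invented for this example, and it assumes paths contain no literal spaces.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for the two common errors."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop any trailing comment, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue  # only Disallow lines are checked here
        value = value.strip()
        if "*" in value:
            problems.append((number, "wildcard in Disallow path"))
        if len(value.split()) > 1:
            problems.append((number, "more than one path on a Disallow line"))
    return problems


sample = """\
User-agent: *
Disallow: /tmp/*
Disallow: /tmp /logs
Disallow: /cgi-bin/
"""

for number, message in lint_robots_txt(sample):
    print(f"line {number}: {message}")
```

The check looks only at Disallow lines, so the '*' in 'User-agent: *' is not flagged; that is the one place the token is meaningful.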