    # /robots.txt file for http://webcrawler.com/
    # mail webmaster@webcrawler.com for constructive criticism

    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /tmp
    Disallow: /logs

The first two lines, starting with '#', are comments.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note that '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.

Two common errors:

Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.

You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).

Surely listing sensitive files is asking for trouble?

Some people are concerned that listing pages or directories in the /robots.txt file may invite unintended access. There are two answers to this.

The first answer is a workaround: you could put all the files you don't want robots to visit in a separate subdirectory, make that directory unlistable on the web (by configuring your server), and list only the directory name in the /robots.txt. Now an ill-willed robot can't traverse that directory unless you or someone else puts a direct link on the web to one of your files, and then it's not /robots.txt's fault.

For example, rather than:

    User-Agent: *
    Disallow: /foo.html
    Disallow: /bar.html

do:

    User-Agent: *
    Disallow: /norobots/

and make a "norobots" directory, put foo.html and bar.html into it, and configure your server to not generate a directory listing for that directory.
Now all an attacker would learn is that you have a "norobots" directory, but the attacker cannot list the files in it and would have to guess their names.

However, in practice this is a bad idea: it's too fragile. Someone may publish a link to your files on their site. Or a link may turn up in a publicly accessible log file, say of your user's proxy server, or show up in someone's web server log as a Referer. Or someone may misconfigure your server at some future date, "fixing" it to show a directory listing.

Which leads me to the real answer.

The real answer is that /robots.txt is not intended for access control, so don't try to use it as such. Think of it as a "No Entry" sign, not a locked door. If you have files on your web site that you don't want unauthorized people to access, then configure your server to do authentication, and configure appropriate authorization. Basic Authentication has been around since the early days of the web (and is trivial to configure in, e.g., Apache on UNIX), and if you're really serious, SSL is commonplace in web servers.
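
To see how a well-behaved robot might read the example /robots.txt at the top of this section, here is a minimal sketch using Python's standard urllib.robotparser module. The names SAMPLE_ROBOTS_TXT, build_parser and allowed, the robot name "otherbot", and the test paths are all made up for illustration; they are not part of any specification. The sketch only shows what a cooperating robot would decide for itself. Nothing here stops a robot that chooses to ignore the file, which is exactly why the file is a "No Entry" sign rather than a locked door.

```python
from urllib.robotparser import RobotFileParser

# The example /robots.txt from the text above.
SAMPLE_ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, path: str) -> bool:
    """Return True if a cooperating robot named `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    # Assumed robot names and paths, chosen only to exercise each paragraph.
    checks = [
        ("webcrawler", "/tmp/scratch.html"),  # nothing disallowed
        ("lycra", "/index.html"),             # whole site disallowed
        ("otherbot", "/index.html"),          # falls under '*', not listed
        ("otherbot", "/tmp/scratch.html"),    # falls under '*', starts with /tmp
        ("otherbot", "/tmpfile.html"),        # also starts with /tmp
    ]
    for agent, path in checks:
        verdict = "allowed" if allowed(rules, agent, path) else "disallowed"
        print(f"{agent:10s} {path:20s} {verdict}")
```

Note the last check: because 'Disallow: /tmp' is a plain prefix match, /tmpfile.html is refused along with everything under /tmp/. Writing 'Disallow: /tmp/' would restrict the rule to the directory.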