It is not an official standard backed by a standards body, nor is it owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.
The latest version of this document can be found on http://info.webcrawler.com/mak/projects/robots/robots.html.
In 1993 and 1994 there were occasions where robots visited WWW servers where they weren't welcome, for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.
This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.
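As an illustration only, here is a minimal sketch (in Python, which this document does not itself use) of that single retrieval; the function name fetch_robots and the choice of library are assumptions, not anything prescribed here:

    from urllib.error import HTTPError
    from urllib.parse import urlsplit
    from urllib.request import urlopen

    def fetch_robots(page_url: str) -> str:
        """Fetch the access policy for the server that serves page_url."""
        parts = urlsplit(page_url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
        try:
            with urlopen(robots_url) as response:    # a single document retrieval
                return response.read().decode("utf-8", errors="replace")
        except HTTPError:
            return ""    # no "/robots.txt" file: robots consider themselves welcome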
A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.
The choice of the URL was motivated by several criteria:

- The filename should fit in file naming restrictions of all common operating systems.
- The filename extension should not require extra server configuration.
- The filename should indicate the purpose of the file and be easy to remember.
- The likelihood of a clash with existing files should be minimal.
The format and semantics of the "/robots.txt" file are as follows:
The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>".
The field name is case insensitive.
Comments can be included in the file using UNIX Bourne shell conventions: the '#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
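To illustrate the record format, here is a minimal parsing sketch in Python; the names parse_robots and Record, and the tuple representation of a record, are illustrative assumptions rather than part of the standard:

    # A minimal sketch of record parsing, assuming the format described above.
    from typing import List, Tuple

    Record = Tuple[List[str], List[str]]          # (User-agent names, Disallow values)

    def parse_robots(text: str) -> List[Record]:
        records: List[Record] = []
        agents: List[str] = []
        disallows: List[str] = []

        def flush() -> None:
            nonlocal agents, disallows
            if agents or disallows:
                records.append((agents, disallows))
                agents, disallows = [], []

        for raw in text.splitlines():
            if not raw.strip():
                flush()                           # blank lines separate records
                continue
            line = raw.split('#', 1)[0].rstrip()  # discard comment and preceding space
            if not line:
                continue                          # comment-only lines are not record boundaries
            if ':' not in line:
                continue                          # not a "<field>:<value>" line
            field, value = line.split(':', 1)
            field, value = field.strip().lower(), value.strip()
            if field == 'user-agent':
                agents.append(value)
            elif field == 'disallow':
                disallows.append(value)
            # unrecognised headers are ignored
        flush()
        return records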
The value of the User-agent field is the name of the robot the record is describing access policy for. If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
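A possible way for a robot to select the applicable record is sketched below; select_record is an invented name, and stripping any version information from the robot's own name is left to the caller:

    from typing import List, Optional, Tuple

    Record = Tuple[List[str], List[str]]          # (User-agent names, Disallow values)

    def select_record(records: List[Record], robot_name: str) -> Optional[Record]:
        """Return the record matching robot_name, else the '*' default record."""
        name = robot_name.lower()
        default: Optional[Record] = None
        for record in records:
            agents, _ = record
            for agent in agents:
                if agent == '*':
                    default = default or record   # remember the default access policy
                elif agent.lower() in name:
                    return record                 # liberal, case-insensitive substring match
        return default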
The value of the Disallow field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
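The prefix check itself could then look like the following sketch (is_allowed is an illustrative name, not part of the standard):

    from typing import List

    def is_allowed(disallows: List[str], path: str) -> bool:
        """True if the URL path may be retrieved given a record's Disallow values."""
        for prefix in disallows:
            if prefix and path.startswith(prefix):   # an empty value disallows nothing
                return False
        return True

    # With ["/help"], both "/help.html" and "/help/index.html" are disallowed;
    # with ["/help/"], only "/help/index.html" is disallowed.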
The presence of an empty "/robots.txt" file has no explicit associated semantics; it will be treated as if it was not present, i.e. all robots will consider themselves welcome.
The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:
    # robots.txt for http://www.example.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    Disallow: /tmp/ # these will soon disappear
    Disallow: /foo.html
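As a usage illustration, Python's standard-library urllib.robotparser module implements this exclusion protocol; a robot written in Python could check the example above like this (the robot name "anybot" is only an illustration):

    from urllib.robotparser import RobotFileParser

    example = """\
    # robots.txt for http://www.example.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    Disallow: /tmp/ # these will soon disappear
    Disallow: /foo.html
    """

    parser = RobotFileParser()
    parser.parse(example.splitlines())

    print(parser.can_fetch("anybot", "/tmp/index.html"))         # False
    print(parser.can_fetch("anybot", "/cyberworld/index.html"))  # True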
This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
    # robots.txt for http://www.example.com/

    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space

    # Cybermapper knows where to go.
    User-agent: cybermapper
    Disallow:
This example indicates that no robots should visit this site further:

    # go away
    User-agent: *
    Disallow: /
Note: The example code formerly referenced here is no longer available. Instead I recommend using the robots exclusion code in the Perl libwww-perl5 library, available from CPAN in the LWP directory.