Database Format
---------------
Records
-------
Records are formatted like RFC 822 messages.
Unless otherwise specified, values may not contain HTML or empty lines,
but may contain 8-bit characters.
Where a value contains "one or more" tokens, they are
to be separated by a comma followed by a space.
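As a minimal sketch of the separator rule above, the following Python helpers split and join multi-token values. The function names split_tokens and join_tokens are hypothetical and not part of the database format.

```python
def split_tokens(value):
    """Split a multi-token value on the comma-plus-space separator."""
    # Empty input yields an empty list rather than [""].
    return [token for token in value.split(", ") if token]


def join_tokens(tokens):
    """Join tokens back into a single field value."""
    return ", ".join(tokens)


platforms = split_tokens("unix, windows95, mac")
availability = join_tokens(["source", "binary"])
```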
Fields can be repeated and grouped by appending a number to the field name,
for example:
robot-owner-name1: Mr A. RobotAuthor
robot-owner-url1: http://webrobot.com/~a/a.html
robot-owner-name2: Mr B. RobotCoAuthor
robot-owner-url2: http://webrobot.com/~b/b.html
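Because records follow the RFC 822 header layout, one way to read them is with the header parser in Python's standard email package. The sketch below is an illustration only: the helper name group_numbered_fields and the sample record are assumptions, and it treats any trailing digits on a field name as a group number.

```python
import re
from email.parser import Parser

# Sample record, assumed for illustration; numbered fields form two owner groups.
RECORD = """\
robot-id: webcrawler
robot-name: WebCrawler
robot-owner-name1: Mr A. RobotAuthor
robot-owner-url1: http://webrobot.com/~a/a.html
robot-owner-name2: Mr B. RobotCoAuthor
robot-owner-url2: http://webrobot.com/~b/b.html
"""


def group_numbered_fields(text):
    """Return (plain, groups): unnumbered fields, and numbered fields by group."""
    message = Parser().parsestr(text, headersonly=True)
    plain, groups = {}, {}
    for name, value in message.items():
        match = re.fullmatch(r"(.*?)(\d+)", name)
        if match:
            # "robot-owner-name2" becomes base "robot-owner-name", group 2.
            base, index = match.group(1), int(match.group(2))
            groups.setdefault(index, {})[base] = value
        else:
            plain[name] = value
    return plain, groups


plain, groups = group_numbered_fields(RECORD)
```

Here groups maps each number to the fields that share it, so the name and URL of each owner stay together.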
Fields Schema
-------------
robot-id:
Short name for the robot,
used internally as a unique reference.
Should match the pattern [a-z_-]+ (lowercase letters, hyphens and underscores).
Example: webcrawler
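A possible check for robot-id values, assuming the recommended character set is meant as lowercase letters, hyphens and underscores. The name is_valid_robot_id is hypothetical.

```python
import re

# Lowercase letters, underscore and hyphen, at least one character.
ROBOT_ID = re.compile(r"[a-z_-]+")


def is_valid_robot_id(value):
    """Return True when the whole value uses only the recommended characters."""
    return ROBOT_ID.fullmatch(value) is not None
```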
robot-name:
Full name of the robot,
for presentation purposes.
Example: WebCrawler
robot-details-url:
URL of the robot home page,
containing further technical details on the robot,
background information etc.
Example: http://webcrawler.com/WebCrawler/Facts/HowItWorks.html
robot-cover-url:
URL of the robot product,
containing marketing details about either the robot,
or the service to which the robot is related.
Example: http://webcrawler.com/
robot-owner-name:
Name of the owner. For service robots this is the person
running the robot, who can be contacted in case of specific
problems.
In the case of robot products this is the person
maintaining the product, who can be contacted if the
robot has bugs.
Example: Brian Pinkerton
robot-owner-url:
Home page of the owner named in robot-owner-name.
Example: http://info.webcrawler.com/bp/bp.html
robot-owner-email:
Email address of the owner.
Example: bp@webcrawler.com
robot-status:
Deployment status of the robot. One of:
- development: robot under development
- active: robot actively in use
- retired: robot no longer used
robot-purpose:
Purpose of the robot. One or more of:
- indexing: gather content for an indexing service
- maintenance: link validation, html validation etc.
- statistics: used to gather statistics
Further details can be given in the description
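Fields with a fixed vocabulary, such as robot-status and robot-purpose above, can be checked against the listed tokens. This is a sketch under the assumption that robot-status takes exactly one token and robot-purpose takes a comma-and-space separated list; the names ALLOWED, SINGLE_VALUED and invalid_tokens are hypothetical.

```python
# Vocabularies copied from the field descriptions above.
ALLOWED = {
    "robot-status": {"development", "active", "retired"},
    "robot-purpose": {"indexing", "maintenance", "statistics"},
}

# Fields described as "One of" rather than "One or more of".
SINGLE_VALUED = {"robot-status"}


def invalid_tokens(field, value):
    """Return the tokens in value that are not allowed for the given field."""
    tokens = [value] if field in SINGLE_VALUED else value.split(", ")
    return [token for token in tokens if token not in ALLOWED[field]]
```

An empty result means the value is acceptable; anything else lists the offending tokens.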
robot-type:
Type of robot software. One or more of:
- standalone: a separate program
- browser: built into a browser
- plugin: a plugin for a browser
robot-platform:
Platform robot runs on. One or more of:
- unix
- windows, windows95, windowsNT
- os2
- mac
etc.
robot-availability:
Availability of robot to general public. One or more of:
- source: source code available
- binary: binary form available
- data: bulk data gathered by robot available
- none
Details should be given on robot-details-url or robot-cover-url.
robot-exclusion:
Standard for Robots Exclusion supported.
yes or no
robot-exclusion-useragent:
Substring to use in /robots.txt
Example: webcrawler
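To illustrate how the robot-exclusion-useragent substring is used, the sketch below feeds a small /robots.txt to Python's standard urllib.robotparser module, which matches the User-agent line against the robot's name case-insensitively. The rules and paths shown are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical /robots.txt that addresses the robot by its substring.
ROBOTS_TXT = [
    "User-agent: webcrawler",
    "Disallow: /private/",
]

rules = RobotFileParser()
rules.parse(ROBOTS_TXT)

# The full User-Agent value still matches the "webcrawler" substring.
agent = "WebCrawler/1.0 libwww/4.0"
private_allowed = rules.can_fetch(agent, "/private/page.html")
public_allowed = rules.can_fetch(agent, "/public/page.html")
```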
robot-noindex:
Whether the noindex directive (robots META tag) is supported:
yes or no
robot-nofollow:
Whether the nofollow directive (robots META tag) is supported:
yes or no
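Several fields (robot-exclusion, robot-noindex, robot-nofollow and robot-from) take yes or no. A small sketch of converting such values to booleans follows; the function name parse_yes_no is hypothetical, and accepting mixed case and surrounding whitespace is an assumption rather than something the format states.

```python
def parse_yes_no(value):
    """Convert a yes/no field value to a bool, rejecting anything else."""
    normalized = value.strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    raise ValueError(f"expected 'yes' or 'no', got {value!r}")
```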
robot-host:
Host the robot is run from. Can be a pattern of DNS and/or IP.
If the robot is available to the general public, add '*'
Example: spidey.webcrawler.com, *.webcrawler.com, 192.216.46.*
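The format does not define the pattern syntax for robot-host precisely. Assuming that '*' behaves like a shell-style wildcard, a host or IP address could be tested against the field with Python's fnmatch module, as in this sketch. The name host_matches is hypothetical.

```python
from fnmatch import fnmatchcase


def host_matches(robot_host_value, candidate):
    """Return True if candidate matches any pattern in a robot-host value."""
    patterns = robot_host_value.split(", ")
    # Host names are compared case-insensitively.
    return any(
        fnmatchcase(candidate.lower(), pattern.lower()) for pattern in patterns
    )


ROBOT_HOST = "spidey.webcrawler.com, *.webcrawler.com, 192.216.46.*"
```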
robot-from:
Whether the HTTP From field, as defined in RFC 1945, can be set:
yes or no
robot-useragent:
The HTTP User-Agent field as defined in RFC 1945
Example: WebCrawler/1.0 libwww/4.0
robot-language:
Languages the robot is written in. One or more of:
c, c++, perl, perl4, perl5, java, tcl, python, etc.
robot-description:
Text description of the robot's functions.
More details should go on robot-details-url.
Example: The WebCrawler robot is used to build the database
for the WebCrawler search service operated by GNN
(part of AOL).
The robot runs weekly, and visits sites in a random order.
robot-history:
Text description of the origins of the robot.
Example: This robot finds its roots in a research project
at the University of Washington in 1994.
robot-environment:
The environment the robot operates in. One or more of:
- service: builds a commercial service
- commercial: is a commercial product
- research: used for research
- hobby: written as a hobby
modified-date:
The date this record was last modified. Format as in HTTP
Example: Fri, 21 Jun 1996 17:28:52 GMT
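HTTP-style dates such as the one above can be read and written with Python's standard email.utils module. This is only a sketch of one way to handle modified-date; the variable names are illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Parse the example value into a timezone-aware datetime.
modified = parsedate_to_datetime("Fri, 21 Jun 1996 17:28:52 GMT")

# Format a UTC datetime back into the same HTTP date layout.
stamp = format_datetime(
    datetime(1996, 6, 21, 17, 28, 52, tzinfo=timezone.utc), usegmt=True
)
```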