From /CN=robots-errors/@nexor.co.uk Wed Jun 1 21:17:14 1994
Return-Path:
Delivery-Date: Wed, 1 Jun 1994 21:17:35 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 1 Jun 1994 21:17:14 +0100
Date: Wed, 1 Jun 1994 21:17:14 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039230:940601201716]
Content-Identifier: WWW robots di...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 1 Jun 1994 21:17:14 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"3912 Wed Jun 1 21:17:00 1994"@nexor.co.uk>
To: Jonathon Fletcher, David Eichmann, Oliver McBryan, Roy Fielding,
    Brian Pinkerton, Fred Barrie, Matthew Gray, Paul De Bra,
    Guido van Rossum, "James E. Pitkow", Andreas Ley, Christophe Tronche,
    Charlie Stross, L.McLoughlin@doc.imperial.ac.uk, Michael L Mauldin
Cc: /CN=robots/@nexor.co.uk
Subject: WWW robots discussion list
Status: RO
Content-Length: 1305

At the WWW'94 Conference the robot authors present expressed an interest in some closer collaboration. I volunteered to set up a mailing list to serve as a platform for these technical discussions. This list is now active.

As you are all developing or administering robots I'd urge you to make use of this facility; together we should be able to reduce the occurrence of problems caused by robots, to reduce some of the duplicate effort, and improve the service to users of robot-generated facilities.

If you'd like to subscribe, send a message to robots-request@nexor.co.uk, with the lines

    subscribe
    help
    stop

in the body of the message.

The list manager is of course NXDLM, which we market as a product, and is configured to keep an archive of traffic on the list. This archive is accessible from the Web via our experimental gateway reachable from . To send messages to the list itself use robots@nexor.co.uk.

Next week (allowing people time to register) I'll post a proposed charter to the list, and list some issues I'd like to see discussed.

Looking forward to your contributions,

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 09:37:38 1994
Return-Path:
Delivery-Date: Mon, 6 Jun 1994 09:38:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 09:37:38 +0100
Date: Mon, 6 Jun 1994 09:37:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:130800:940606083743]
Content-Identifier: Proposed Char...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:37:38 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"13056 Mon Jun 6 09:37:06 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Proposed Charter
Status: RO
Content-Length: 1401

Welcome to you all...,

Here is the proposed charter for this list, for future reference by new subscribers. It's straightforward, but if anybody would like to see any changes let me know.

--

Proposed charter for robots@nexor.co.uk.

This list is intended as a technical forum for authors, maintainers and administrators of WWW robots. Its aim is to maximise the benefits WWW robots can offer while minimising drawbacks and duplication of effort. It is intended to address both development and operational aspects of WWW robots.

This list is not intended for general discussion of WWW development efforts, or as a first line of support for users of robot facilities.

Postings to this list are informal, and decisions and recommendations formulated here do not constitute any official standards.
Postings to this list will be made available publicly through the list manager's archive, and NEXOR doesn't accept any responsibility for the content of the postings.

Related lists:
  www-talk@info.cern.ch: technical WWW development discussions
  www-html@info.cern.ch: HTML specific development discussions
  www-cache@info.cern.ch: technical discussions on proxies and caching
  comp.infosystems.www.*: WWW discussions

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 09:39:16 1994
Return-Path:
Delivery-Date: Mon, 6 Jun 1994 09:39:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 09:39:16 +0100
Date: Mon, 6 Jun 1994 09:39:16 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:131210:940606083924]
Content-Identifier: Topics
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:39:16 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"13118 Mon Jun 6 09:39:07 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Topics
Status: RO
Content-Length: 7634

Here is a long list of topics I'd like to see discussed at some point, in no particular order. I look forward to comments on these topics, other issues, and what the priorities should be.

* public information
  - robot profile matrix
    It would be nice to have a matrix of certain attributes of the various robots that exist, that is available to the Web public at large. The list I maintain on could serve as a basis; are there any additions people would like to see?

* sharing of data
  - format / access protocol of database
    Most indexing robots generate a database of information, that can then be searched through publicly accessible ISINDEX/Form pages.
    It would be nice if the actual database was publicly available, or where applicable an access protocol can be made publicly available (eg. SQL). Others could then run local mirrors of the search engines, write their own search engines, or do analysis of the data.
  - distributed data gathering
    If there was a standard database format / access protocol the data gathering could be distributed over the net, either by separate robots, or multiple copies of the same robot. Jonathon, you mentioned once you were working on some robot database synchronisation scheme. Did you get anywhere?

* data analysis
  As robots traverse the Web, they could do a lot of statistical analysis, either real-time, or on the resulting database. It seems silly that multiple robots go out over the same data, all doing slightly different analysis. It would be really nice to publish:
  - a list of servers
    Like Matthew Gray's list, but then one that is as up-to-date as the latest robot run, has only got hosts that actually exist, and is smart about multiple DNS names for the same IP address.
  - inverse maps
    Robots can create inverse maps, so that I can find out which pages refer to a particular page. Until the Referer HTTP field becomes more used this could be very valuable to find bad links. And it'd be nice to know the average number of links to a page; how interlinked is the Web? We could have a most-referenced league table; which is the most popular page in the web in terms of links?
  - general stats
    like average number of visited documents per server (and min & max), total number of documents visited, total number of hosts visited, percentage of links that are bad, percentage of HTML documents that are a tag soup, percentage of documents not changed in x days, etc. etc.

* sharing of operational tips
  All robot maintainers hit the same problems at certain sites, and get things like:
  - seed documents
    What documents are good to start robots from.
  - site exclusion lists
    Which sites explicitly ask not to be visited.
  - black hole lists
    Which cgi-scripts create infinitely linked web spaces.
  - avoidance lists
    Which data should be avoided (e.g. the UNIX manual gateways).
  - robotsnotwanted proposal
    I'd like to get some more discussion on this. As all the robot writers are on this list we should be able to decide on something that can easily be implemented by robots and users. The only outstanding issue is the name of the file; it is too long for DOS-based servers. Is there any problem with changing the filename to robotsp.txt (for robots policy)?
  - scheduled runs
    It might be nice to know when which robots are running, just in case people start wondering.
  - ALIWEB
    For those sites that have a /site.idx file it might be worth giving the documents referenced in it special consideration.

* sharing of algorithms
  All robots have different algorithms for a lot of the same functions. It should be possible to find the best algorithm that all robots can use:
  - document selection
    Which documents do you visit? A lot of robots go "n levels deep" which seems pretty arbitrary to me. Doing "n levels from the root document" might make more sense.
  - HTML parsing
    This is tricky, with so much bad HTML out there. There must be a "best way" to extract URL's from documents; I am sure that at the moment some robots barf on some documents.
  - load balancing
    How do you decide when to query a site so as to balance the load best? It is by now clear the "visit one site at top speed" approach is nasty; what is used now? Round robin? Can time zones be used? How fast do robots run?
  - search algorithms
    Once you have a database, what algorithm can one use to search it? At the moment there are Perl scripts, SQL scripts, WAIS database etc. If there was a standard database format these could be benchmarked.
  - error recovery
    Robots should be restartable without having to backtrack. How is this best achieved?

* sharing of code
  There is a lot of duplication of effort in the coding and maintenance of robot code. It would be useful if there was one common code base for robots to draw from, implementing the separate algorithms used in robots. I would really like to see a single robot implementation (TUM: The Unified Robot?), that could run cooperatively around the world. Is this me dreaming, or is it something more of you see as beneficial? If so, how can we make this a reality? What language is most suitable (Perl, surely :-) ? What design allows the most flexibility and safety?

* HTML/HTTP extensions
  It may be that there are things in HTTP/HTML that robots could use but don't at the moment, and it may even be worth extending the protocol to put facilities aimed at automated tools in (eg If-modified-since). At WWW'94 one idea was for example to implement a server-side facility to parse an HTML document, and return only the links.

* Caching issues
  The increased use of caching presents special problems for robots: how does a robot recognise a cached document sitting in the cache data area of a caching server? Should it document them? But caches and robots do similar things: a robot uses its own database as a cache (I hope!), but a caching server could also use that data. This comes back to standardising the database; maybe the structure used by the CERN cache can be used as format for robot gathering output. Robots can also be useful for pre-loading a cache, to do mirroring, or to prepare for off-line demo's. Maybe robots should have command-line options to facilitate this. Then again, robot code should probably not be handed out freely.

* Testing
  The person running a robot should keep close tabs on what it is doing at any one time. What sort of monitoring tools are used to do that? Testing robot modifications is another issue. I have noticed in the past that a robot did the same run several times in a day, which it turned out to do "for testing".
  Surely tests should be done locally.

Right, I have been waiting to get all these off my chest. I think TUM is the most challenging long-term topic, but in the short term I think the standard database(s) is the most important; it would bring immediate benefit, and a lot of the other issues can follow on from that.

Any comments?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:06:46 1994
Return-Path:
Delivery-Date: Mon, 6 Jun 1994 10:07:47 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:06:46 +0100
Date: Mon, 6 Jun 1994 10:06:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:136200:940606090649]
Content-Identifier: Inverse Maps
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:06:46 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060905.AA22918@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Inverse Maps
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 770

I am currently working on W4v3.0, and one of the features I have implemented so far is some inverse mapping features. It's yielded some interesting results. Not surprisingly, the most pointed to sites in the documents examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu. Other highly pointed to sites include nearnet.gnn.com (:-), www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.

For the initial portion of the implementation, I am only constructing interconnectivity within sites. That is, I keep track of what documents point to site FOO, not what documents point to what documents.
Any ideas on implementation of the latter that is reasonable?

Has anyone else done such interconnectivity mapping?

...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:36:42 1994
Return-Path:
Delivery-Date: Tue, 7 Jun 1994 09:48:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:36:42 +0100
Date: Mon, 6 Jun 1994 10:36:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140740:940606093644]
Content-Identifier: re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:36:42 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross
Message-ID: <9406061033.aa01127@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk, mkgray@MIT.EDU
Subject: re: Inverse Maps
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 1799

mkgray@MIT.EDU writes ...
>I am currently working on W4v3.0, and one of the features I have implemented
>so far is some inverse mapping features. It's yielded some interesting
>results. Not surprisingly, the most pointed to sites in the documents
>examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu.
>Other highly pointed to sites include nearnet.gnn.com (:-),
>www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.
>For the initial portion of the implementation, I am only constructing
>interconnectivity within sites. That is, I keep track of what documents
>point to site FOO, not what documents point to what documents. Any ideas
>on implementation of the latter that is reasonable?

One idea I was playing with when I was working on websnarf 2 (which is currently on the shelf) was the idea of using a whacking great .dbm file to store either entire HTML files, indexed on their URL, or a list of URLs extracted from such files.
(I ran into a problem in that the standard dbm and Berkeley dbm libraries have a maximum record size of 1024 or 2096 bytes respectively; GNU dbm apparently doesn't have this restriction, but I didn't have time to rebuild my version of Perl with a new library.)

Anyway, the idea is that keeping such a database would reduce the problem of cross-referencing large webs; simply read a record, and for each URL in the record (which contains a list) do a lookup on the database. (The output could then be turned into input for a graph-generating program like AT&T's NEATO.)

-- Charlie
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 18:28:28 1994
Return-Path:
Delivery-Date: Tue, 7 Jun 1994 09:46:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 18:28:28 +0100
Date: Mon, 6 Jun 1994 18:28:28 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:203320:940606172830]
Content-Identifier: Re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 18:28:28 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton
Message-ID: <9406061727.AA09398@biotech.washington.edu>
To: mkgray@MIT.EDU
Cc: /CN=robots/@nexor.co.uk
Subject: Re: Inverse Maps
Original-Received: by NeXT.Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Original-Received: by NeXT Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 403

I've done some inverse mapping with the WebCrawler, but not to any great extent.
Right now, I just generate the "Top 25" list -- a list of the 25 most frequently referenced sites on the Web (at least, based on the WebCrawler's limited experience). This turns out to work pretty well -- you can see the (predictable) results at http://www.biotech.washington.edu/WebCrawler/Top25.html.

bri

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:14:53 1994
Return-Path:
Delivery-Date: Mon, 6 Jun 1994 10:15:25 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:14:53 +0100
Date: Mon, 6 Jun 1994 10:14:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:137830:940606091454]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:14:53 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060914.AA22927@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Avoidance Algorithms
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1101

One of the features that I implemented in W4v1.0 was an avoidance algorithm I called 'boredom'. First a brief implementation profile of W4v1.0:

W4v1.0 was written in June of 1993 as a simple depth first search that kept the entire database in memory of where it had been and dumped to disk when it had exhausted a document tree. Very simple.

So, one issue I was concerned about was infinite trees (this is a bad thing with depth first searches :-) so I added a feature to the Wanderer that allowed it to 'get bored'. Specifically, if it retrieved more than N documents with the same path (except for the last element) and a few other heuristics, it bailed out and found something more interesting to do. For the most part this was very successful.
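
The 'boredom' rule described above (give up on a branch once more than N documents sharing the same path, last element excluded, have been retrieved) can be sketched as follows. This is only an illustration in Python, not W4's code, which does not appear in this thread. The class name `Boredom`, the `limit` parameter and the choice to key on host plus parent path are assumptions, and the "few other heuristics" mentioned in the message are not modelled.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit


class Boredom:
    """Hypothetical helper: counts retrieved documents per (host, parent path)."""

    def __init__(self, limit):
        self.limit = limit          # the "N" from the description
        self.seen = Counter()       # (host, parent path) -> documents retrieved

    @staticmethod
    def parent_key(url):
        # Drop the last path element so siblings share one key.
        parts = urlsplit(url)
        return (parts.netloc.lower(), posixpath.dirname(parts.path))

    def record(self, url):
        # Call once for every document actually retrieved.
        self.seen[self.parent_key(url)] += 1

    def is_bored(self, url):
        # Bored once MORE than `limit` siblings of this URL were retrieved.
        return self.seen[self.parent_key(url)] > self.limit
```

A robot using this sketch would call `record()` after each fetch and skip any candidate URL for which `is_bored()` returns true.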

W4v2.0 was a modification to do breadth first searching, and in that revision 'boredom' got removed, as it was not as useful to the algorithm. I am planning on reimplementing a more advanced version of 'boredom' in W4v3.0, partially based on content parsing.

Suggestions? Comments? Other implementations to avoid large trees?

...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:26:27 1994
Return-Path:
Delivery-Date: Tue, 7 Jun 1994 09:48:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:26:27 +0100
Date: Mon, 6 Jun 1994 10:26:27 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:138740:940606092628]
Content-Identifier: Database/memo...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:26:27 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060925.AA22934@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Database/memory implementation
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1128

How have people in general implemented the DB? By the database (DB) I mean the robot's record of where it has been, not necessarily anything it constructs for later consumption.

W4v1.0 implemented a completely in memory DB. This worked fine when there were 100 sites on the web. It doesn't work any more :-) Plus if the Wanderer crashed, it wouldn't always successfully dump its DB.

W4v2.0 implemented a disk based DB which has a number of advantages:
1) It can get as big as it wants and not kill the machine
2) It saves state, so arbitrary crashes don't lose any substantial data

On the other hand, it is somewhat slower, though most of the time is spent waiting for HTTP responses.
Currently, it maintains one record of where it has been ('log') and another record of where it plans on going ('dq') and another set of analogous in-memory lists which regularly get flushed to disk.

Any other more novel implementations out there? I've given a passing thought to trying a hierarchical DB, but I'm not sure it would be useful. Any ideas on how to make an in-memory DB smaller? Or a disk DB faster?

...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:34:35 1994
Return-Path:
Delivery-Date: Tue, 7 Jun 1994 09:48:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:34:35 +0100
Date: Mon, 6 Jun 1994 10:34:35 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140130:940606093437]
Content-Identifier: Server list
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:34:35 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060934.AA22943@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Server list
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 977

Once I get W4v3.0 finished, I intend to add a number of the modifications mentioned by Martijn in his initial letter (DNS identification of identical servers, bogus servers eliminated, etc.) Additionally, I would welcome any other lists of servers. I can merge such lists with the comprehensive list.

I will continue to maintain the "Comprehensive List of WWW Sites", so anything to make this as up to date and accurate as possible would be great. Suggestions on other useful techniques for sorting the comprehensive list would be great too. If you don't know what I'm talking about, or have lost the URL:

http://www.mit.edu:8001/people/mkgray/compre.bydomain.html

So, please do send me any sitelists. No desperate need to crosscheck with my list, I can do that.
Of course, if you want to, that just makes my life easier.

...Matthew

BTW, I'm sending all these messages out separately to keep the topic threads vaguely separate, in case that wasn't apparent.

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 11:26:46 1994
Return-Path:
Delivery-Date: Tue, 7 Jun 1994 09:47:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 11:26:46 +0100
Date: Mon, 6 Jun 1994 11:26:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:145260:940606102648]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:26:46 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross
Message-ID: <9406061120.aa01408@ruddles.sco.com>
To: mkgray@MIT.EDU, /CN=robots/@nexor.co.uk
Subject: Avoidance Algorithms
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 3168

mkgray@MIT.EDU writes ...
>One of the features that I implemented in W4v1.0 was an avoidance algorithm
>I called 'boredom'. First a brief implementation profile of W4v1.0:
>W4v1.0 was written in June of 1993 as a simple depth first search that kept
>the entire database in memory of where it had been and dumped to disk when
>it had exhausted a document tree. Very simple.
:
>W4v2.0 was a modification to do breadth first searching, and in that revision
>'boredom' got removed, as it was not as useful to the algorithm. I am planning
>on reimplementing a more advanced version of 'boredom' in W4v3.0, partially
>based on content parsing.
>Suggestions? Comments? Other implementations to avoid large trees?

Well, my first cut at websnarf was a recursive depth-first probe. This rapidly ran away into the web, and as my bandwidth is limited to my share of a 64K line this seemed like a bad idea. It also had a tendency to dump core due to stack frame overflows.

I went to the bookshelf and was most interested to read the chapter on graph searching in Sedgewick (Algorithms, 2nd edn, can't remember the year). It turns out that you can use a stack to emulate a recursive depth-first traversal, and a queue to emulate a recursive breadth-first traversal, both without the need for recursion.

Perl provides a handy data structure -- the list -- and calls to use a given list as either a queue or a stack. I modified websnarf so that it could do both breadth- and depth- traversals (with the switchover being handled in a small subroutine that decided whether to push or shift URLs onto the list, and pop or unshift them off the list).

Because my nonrecursive implementation created a list of URLs representing the current state of your tree walk, it was then relatively trivial to scan along the list. If you see two occurrences of the same URL in the list, you know there's some danger of getting into a loop, and you can just prune one of them out of the list.

It's also useful to store the "depth" (i.e. number of links away from home) along with each URL in the list. If two pointers to the same URL occur at the same depth, the odds are that they're fairly safe -- just time-consuming. But if one is above the other, there's the possibility of some kind of weird loop occurring.

Finally, one thing I'd do immediately if I was working on websnarf right now [*] would be to ensure that it avoids any URLs that look like search commands or internal document pointers. That way lies madness ...

-- Charlie

[*] websnarf is a personal effort, not a company-sanctioned project. It's on the shelf due to lack of spare time at work. I'm hoping (fingers firmly crossed) to get a grant for next year to do the job professionally, i.e. to spend all my time on it, not just a couple of hours a week. Meantime, I'm spending my time thinking about how to get right all the things I got wrong the first time round ...
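
The non-recursive traversal described in this message, where one list serves as a stack for depth-first or as a queue for breadth-first walking and each entry carries its depth, can be sketched as below. This is a Python illustration rather than websnarf's Perl, which is not shown in the thread. The function name `traverse`, the `get_links` callback and the `max_depth` cut-off are assumptions. Duplicates are pruned here with a `seen` set rather than by scanning along the list as the message describes, a swapped-in but equivalent way of dropping repeated URLs.

```python
from collections import deque


def traverse(start, get_links, depth_first=False, max_depth=3):
    """Walk a link graph without recursion; returns (url, depth) in visit order.

    get_links(url) is a caller-supplied function returning the URLs found
    in the document at `url` (hypothetical; fetching is out of scope here).
    """
    frontier = deque([(start, 0)])   # one list, used as stack or queue
    seen = {start}                   # URLs already placed on the list
    order = []
    while frontier:
        # Stack behaviour gives depth-first, queue behaviour breadth-first.
        url, depth = frontier.pop() if depth_first else frontier.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue                 # do not expand beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue             # prune repeated URLs to avoid loops
            seen.add(link)
            frontier.append((link, depth + 1))
    return order
```

Switching `depth_first` changes only which end of the list the next URL is taken from, which is the point the message makes about push/shift and pop/unshift.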
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 11:52:10 1994
Return-Path:
Delivery-Date: Tue, 7 Jun 1994 09:47:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 11:52:10 +0100
Date: Mon, 6 Jun 1994 11:52:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:148030:940606105212]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:52:10 +0100;
Alternate-Recipient: Allowed
From: Lee McLoughlin
Message-ID: <"swan.doc.i.352:06.05.94.10.50.10"@doc.ic.ac.uk>
To: Charlie Stross, mkgray@MIT.EDU, /CN=robots/@nexor.co.uk
In-Reply-To:
Subject: Re: Avoidance Algorithms
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
Status: RO
Content-Length: 747

My scanning routine was the usual depth-first search. But this meant that certain sites were scanned before others. Apart from causing sites way down the list to be left out it also meant that one site would get "soaked".

In the end I went for random scanning of all stored URLs looking for a URL and site that I hadn't gone to recently.

My system is also written in perl except that I store all the retrieved data in dbm files. Since I have URL and site timers I can now also run multiple scanning processes at the same time.

--
--
Lee McLoughlin. Phone: +44 71 589 5111 X 5085
Dept of Computing, Imperial College, Fax: +44 71 581 8024
180 Queens Gate, London, SW7 2BZ, UK.
Email: L.McLoughlin@doc.ic.ac.uk

From /CN=robots-errors/@nexor.co.uk Wed Jun 8 10:09:49 1994
Replied: Wed, 08 Jun 1994 11:02:20 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: " (Paul De Bra)"
Replied: nlc@cs.nott.ac.uk
Return-Path:
Delivery-Date: Wed, 8 Jun 1994 10:11:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 8 Jun 1994 10:09:49 +0100
Date: Wed, 8 Jun 1994 10:09:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:090430:940608090957]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 10:09:49 +0100;
Alternate-Recipient: Allowed
From: " (Paul De Bra)"
Message-ID: <199406080910.JAA11791@pcpaul.info.win.tue.nl>
To: /CN=robots/@nexor.co.uk
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 857

Strange to hear that the strategy in W4 changed from depth-first in 1.0 to breadth-first in 2.0. The experiments we ran with the fish-search, both real and simulated, all showed that depth-first is a better navigation algorithm than breadth-first.

We also have boredom, set to 1, meaning that we never retrieve the same url twice.

Another thing the fish-search does is to try to not load url's from the same host in succession. (It searches among the first 30 or so url's in its list to find one from another host.)

A feature still missing in the fish-search, which I would like to hear about from others is the use of ISMAP's. My idea is to try a limited selection of coordinates first, and put a larger selection further back in the "queue" of url's to be tried.

How do other robots find out which url's can be reached by clicking in an ismap?

Paul.
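
The host-alternation rule mentioned in this message (look among the first 30 or so URLs in the list for one on a different host from the last one visited) can be sketched as follows. It is an illustration in Python, not the fish-search source, which is not part of this thread. The function name `pick_next`, the `window` parameter and the fallback to the head of the list are assumptions.

```python
from urllib.parse import urlsplit


def pick_next(queue, last_host, window=30):
    """Remove and return the next URL to fetch from `queue` (a list).

    Prefers the first URL within the leading `window` entries whose host
    differs from `last_host`; falls back to the head of the list (an
    assumed behaviour), or None when the list is empty.
    """
    for i, url in enumerate(queue[:window]):
        if urlsplit(url).hostname != last_host:
            return queue.pop(i)
    return queue.pop(0) if queue else None
```

Keeping the window small bounds the cost of each choice while still spreading consecutive requests across hosts when the list allows it.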

From /CN=robots-errors/@nexor.co.uk Wed Jun 8 11:02:39 1994
Return-Path:
Delivery-Date: Wed, 8 Jun 1994 11:03:19 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 8 Jun 1994 11:02:39 +0100
Date: Wed, 8 Jun 1994 11:02:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:096060:940608100242]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 11:02:39 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"9595 Wed Jun 8 11:02:26 1994"@nexor.co.uk>
To: " (Paul De Bra)"
Cc: /CN=robots/@nexor.co.uk, nlc@computer-science.nottingham.ac.uk
In-Reply-To: <199406080910.JAA11791@pcpaul.info.win.tue.nl>
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 1994

> Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> breadth-first in 2.0.

Not really; most Web server URL spaces are structured hierarchically, growing more specific towards the leaves. So if you start from a server root and you do a breadth-first search for a limited number of documents you'll get a broader (and therefore for the purposes of general indexing better) overview than if you do a depth-first search for a limited number of documents, which can shoot off down one specific area (especially in deep trees).

If you use maximum-depth rather than maximum-documents it shouldn't matter much (depending on the structure of the data).

> The experiments we ran with the fish-search, both real
> and simulated, all showed that depth-first is a better navigation algorithm
> than breadth-first.

Can you elaborate on how they were better?

> A feature still missing in the fish-search, which I would like to hear about
> from others is the use of ISMAP's.
> My idea is to try a limited selection of coordinates first, and put a larger > selection further back in the "queue" of url's to be tried. > > How do other robots find out which url's can be reached by clicking in an > ismap? Dave Ragget would refer you to the HTML 3.0 facilities for specifying links on figures within the figure element. I think that is the only way; there are an infinite number of coordinates in an ISMAP, you don't know where they are, and you can't check two locations for equivalence. Incidentally, I reckon that it is bad HTML if you provide an ismap as sole access to a small set of URL's. A friend of mine is working on a "click on this festival map to show where you are going to be" service. I'd hate to think what a random ISMAP coordinate-trying robot would do to that. ;-) -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From debra@info.win.tue.nl Wed Jun 8 12:50:17 1994 Replied: Wed, 08 Jun 1994 13:45:38 +0100 Replied: robots Replied: debra@info.win.tue.nl (Paul De Bra) Return-Path: Delivery-Date: Wed, 8 Jun 1994 12:50:28 +0100 Received: from svin04.info.win.tue.nl by lancaster.nexor.co.uk with SMTP (XTPP); Wed, 8 Jun 1994 12:50:17 +0100 Received: from pcpaul.info.win.tue.nl by svin04.info.win.tue.nl (8.6.8/1.45) id NAA28716; Wed, 8 Jun 1994 13:50:05 +0200 Received: from localhost by pcpaul.info.win.tue.nl (8.6.4/1.60) id LAA20290; Wed, 8 Jun 1994 11:52:19 GMT From: debra@info.win.tue.nl (Paul De Bra) Message-Id: <199406081152.LAA20290@pcpaul.info.win.tue.nl> Subject: Re: Avoidance Algorithms To: m.koster@nexor.co.uk (Martijn Koster) Date: Wed, 8 Jun 1994 13:52:17 +0200 (MET DST) Cc: /CN=robots/@nexor.co.uk In-Reply-To: <199406081002.MAA06363@svin02.info.win.tue.nl> from "Martijn Koster" at Jun 8, 94 11:02:11 am X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII 
Content-Transfer-Encoding: 7bit Status: RO Content-Length: 2835 > > Strange to hear that the strategy in W4 changed from depth-first in 1.0 to > > breadth-first in 2.0. > > Not really; most Web server URL spaces are structured hierarchically, > with growing more specific towards the leaves. So if you start from a > server root and you do a breadth-first search for a limited number of > documents you'll get a broader (and therefore for the purposes of > general indexing better) overview than if you do a depth-first search > for a limited number of documents, which can shoot off down one > specific area (especially in deep trees). I guess our algorithm avoids shooting down into a specific area by trying to find links to other sites first. When you have limited search time you often don't get anywhere with breadth-first navigation because you don't reach documents that are deep enough to deal with a specific topic. A robot that spends a *lot* of time, visiting very many documents, could work equally well using breadth-first search. > If you use maximum-depth rather than maximum-documents it shouldn't > matter much (depending on the structure of the data). We do use maximum-depth to avoid going too far in a non-relevant direction. > > The experiments we ran with the fish-search, both real > > and simulated, all showed that depth-first is a better navigation algorithm > > than breadth-first. > > Can you elaborate on how they were better? They found more cross-reference links. Which need not mean anything, but considering that cross-reference links may (in the web) be links leading to different sites, this suggests a better chance of penetrating into a larger part of the web. again, the fact that we search for a limited time is important here. > ... > > ismap? > > Dave Ragget would refer you to the HTML 3.0 facilities for specifying > links on figures within the figure element.
I think that is the only > way; there are an infinite number of coordinates in an ISMAP, you > don't know where they are, and you can't check two locations for > equivalence. nice to hear something will be coming along. there are a finite number of coordinates in an ISMAP, but the number is large. we would never consider trying all possible coordinates. which isn't necessary in any ismap i know. > Incidentally, I reckon that it is bad HTML if you provide an ismap as > sole access to a small set of URL's. dunno. the course on hypertext which i have on line does it... and databases that deal with mostly graphical information, providing ismaps to zoom in on things and providing information would only have access through ismaps as well. > A friend of mine is working on a "click on this festival map to show > where you are going to be" service. I'd hate to think what a random > ISMAP coordinate-trying robot would do to that. ;-) we'll work on it and see how it performs. From /CN=robots-errors/@nexor.co.uk Wed Jun 8 13:46:24 1994 Return-Path: Delivery-Date: Wed, 8 Jun 1994 13:47:20 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 8 Jun 1994 13:46:24 +0100 Date: Wed, 8 Jun 1994 13:46:24 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:119080:940608124627] Content-Identifier: Re: Avoidance...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 13:46:24 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"11890 Wed Jun 8 13:45:48 1994"@nexor.co.uk> To: " (Paul De Bra)" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <199406081152.LAA20290@pcpaul.info.win.tue.nl> Subject: Re: Avoidance Algorithms Status: RO Content-Length: 1998 > there are a finite number of coordinates in an ISMAP, I was under the impression you could provide fractional coordinates, which would make the theoretical address space infinite, but I'll settle for large :-) > but the number is large. we would never consider trying all possible > coordinates. which isn't necessary in any ismap i know. Sure, my point was that you don't know which to try, or which map onto the "same" document, if that concept applies (think about a click-on-the-world-map-to-get-lat-long ismap server) > > Incidentally, I reckon that it is bad HTML if you provide an ismap as > > sole access to a small set of URL's. > > dunno. the course on hypertext which i have on line does it... > and databases that deal with mostly graphical information, providing ismaps > to zoom in on things and providing information would only have access through > ismaps as well. And they all rule out non-graphical displays :-( It isn't always possible/ useful to provide textual links in addition, but it is quite often. > > A friend of mine is working on a "click on this festival map to show > > where you are going to be" service. I'd hate to think what a random > > ISMAP coordinate-trying robot would do to that. ;-) > > we'll work on it and see how it performs. It's not the performance I'm worried about in this example, but the fact there is a semantic associated with this action. Imagine this cool hypothetical ismap server that uses a slide to register your appreciation with a page, or even a graphical green(yes), red(no) voting card.
If a robot tries some random clicks in these ismaps this could have a nasty effect. Of course all links in the Web have this danger, but especially with graphical user-interface things like maps and forms you expect to be interfacing with a user... -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Jun 9 16:06:34 1994 Return-Path: Delivery-Date: Thu, 9 Jun 1994 16:13:31 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 9 Jun 1994 16:06:34 +0100 Date: Thu, 9 Jun 1994 16:06:34 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:282050:940609150638] Content-Identifier: Re: Avoidance... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 9 Jun 1994 16:06:34 +0100; Alternate-Recipient: Allowed From: Charlie Stross Message-ID: <9406091059.aa02900@ruddles.sco.com> To: debra@info.win.tue.nl, m.koster@nexor.co.uk Cc: /CN=robots/@nexor.co.uk Subject: Re: Avoidance Algorithms X-Mailer: SCO Portfolio 2.0 Status: RO Content-Length: 2508 (Paul De Bra) writes: >> > The experiments we ran with the fish-search, both real >> > and simulated, all showed that depth-first is a better navigation algorithm >> > than breadth-first. >> >> Can you elaborate on how they were better? >They found more cross-reference links. Which need not mean anything, but >considering that cross-reference links may (in the web) be links leading to >different sites, this suggests a better chance of penetrating into a larger >part of the web. again, the fact that we search for a limited time is important >here. A minor point may be of interest to you: here at SCO we distinguish between "navigation nodes" and "information nodes" in our formally-constructed web pages. 
An information node may be linked to other nodes, but its primary function is to store text; in general it has a link to and items for linear browsing. A navigation node, on the other hand, has scads of URLs pointing both to information nodes and to other navigation nodes. This seems to be a fairly common distinction between web pages, as it naturally falls out of most methods of structuring information in hypertext; and it provides a clue for ways to avoid flooding servers while doing a depth-first search. When grabbing a page and searching it for URLs, count the URLs on the page; if there're three or less, the page is probably an information node and the pointers probably point to adjacent nodes in the same document, while four or more URLs suggest a navigation node (with URLs that could point anywhere on the net). Put URLs from information nodes on one stack, and URLs from navigation nodes on another stack. When selecting the next URL to explore, alternate between the two stacks. Alternatively, maintain a list of local stacks -- one per server polled -- and work along the list, taking a new URL from each stack in turn, so that no server is ever polled twice in succession. There are other ways of avoiding flooding a server, but these should be pretty easy to implement and will, most of the time, ensure that a local lookup will be followed by a remote one. -- Charlie PS : Anyone else run across: http://www.biotech.washington.edu/WebCrawler/WebCrawler.html yet? I'm most impressed ... 
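The two scheduling ideas above (sort URLs by the kind of page they came from and alternate between two stacks, or keep one stack per server and take from each in turn) might look something like this in Python. It is a rough sketch under stated assumptions, not code from any existing robot: the class names, the regular expression used to pull out links, and the behaviour when one stack runs dry are all invented for the example. Only the threshold (three or fewer URLs means an information node, four or more a navigation node) comes from the message.

```python
import re
from collections import deque
from urllib.parse import urlsplit

HREF = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)  # crude link finder
NAV_THRESHOLD = 4  # four or more URLs suggests a navigation node


def classify(urls):
    """Guess the node type from the number of URLs found on the page."""
    return "navigation" if len(urls) >= NAV_THRESHOLD else "information"


class TwoStackFrontier:
    """One stack per node type; next_url() alternates between them."""

    def __init__(self):
        self.stacks = {"information": [], "navigation": []}
        self.turn = "information"

    def add_page(self, html):
        urls = HREF.findall(html)
        self.stacks[classify(urls)].extend(urls)

    def next_url(self):
        for _ in range(2):  # try the stack whose turn it is, then the other
            kind = self.turn
            self.turn = "navigation" if kind == "information" else "information"
            if self.stacks[kind]:
                return self.stacks[kind].pop()
        return None


class PerServerFrontier:
    """One stack per server, visited round-robin."""

    def __init__(self):
        self.stacks = {}
        self.order = deque()  # hosts in the order they will be polled

    def add(self, url):
        host = urlsplit(url).netloc.lower()
        if host not in self.stacks:
            self.stacks[host] = []
            self.order.append(host)
        self.stacks[host].append(url)

    def next_url(self):
        for _ in range(len(self.order)):
            host = self.order[0]
            self.order.rotate(-1)  # move this host to the back of the line
            if self.stacks[host]:
                return self.stacks[host].pop()
        return None
```

Note that the round-robin version only avoids polling a server twice in a row while at least two servers still have URLs queued; once a single server is left, some delay would be needed as well.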
-------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Thu Jun 9 18:27:13 1994 Return-Path: Delivery-Date: Thu, 9 Jun 1994 18:27:57 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 9 Jun 1994 18:27:13 +0100 Date: Thu, 9 Jun 1994 18:27:13 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019820:940609172715] Content-Identifier: Re: Avoidance... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 9 Jun 1994 18:27:13 +0100; Alternate-Recipient: Allowed From: Brian Pinkerton Message-ID: <9406091726.AA02507@biotech.washington.edu> To: Charlie Stross Cc: /CN=robots/@nexor.co.uk Subject: Re: Avoidance Algorithms Original-Received: by NeXT.Mailer (1.100) PP-warning: Illegal Received field on preceding line Original-Received: by NeXT Mailer (1.100) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 1024 I like the idea of distinguishing among nodes based on the number of links they have. For the WebCrawler, this would be a good way to reduce the number of nodes that need to be considered when deciding which nodes to visit next. When it's running in breadth-first mode (and generating an index), the WebCrawler doesn't do any kind of avoidance -- it just visits each server in succession, giving priority to servers that have never been visited before. When the number of known servers is bigger than 100 or so, then there's no chance the WebCrawler will get back to a server before a reasonable amount of time has passed. 
In its "directed searching" mode, the WebCrawler will avoid a server for some period of time after visiting it once, because there's a good chance its search criteria will want to grab another document from that server. Right now, I think I set that time period to 60 seconds, which roughly corresponds to my intuition of how fast a human would do the same operation. bri From /CN=robots-errors/@nexor.co.uk Mon Jun 6 12:01:53 1994 Return-Path: Delivery-Date: Tue, 7 Jun 1994 09:48:03 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 12:01:53 +0100 Date: Mon, 6 Jun 1994 12:01:53 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:150550:940606110159] Content-Identifier: Best environm... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 12:01:53 +0100; Alternate-Recipient: Allowed From: Charlie Stross Message-ID: <9406061158.aa01564@ruddles.sco.com> To: /CN=robots/@nexor.co.uk Subject: Best environment for knowbot development? X-Mailer: SCO Portfolio 2.0 Status: RO Content-Length: 1029 I notice that an awful lot of knowbots seem to be being developed in Perl. There's at least one written in Python, probably a couple in C ... but there are Perl applications crawling all over the place (or so it feels at times!). Off the cuff, I'd attribute this to the rich string-manipulation features and accessible TCP/IP sockets provided by Perl, along with its fairly high execution speed (for an interpreted language). On the other hand, Perl is complex and syntactically dense -- a nightmare for non-UNIXheads. Has anyone given any serious thought to the optimal development environment for knowbots? Apart from Perl, are there any languages/platforms you'd consider to be specially suitable for developing knowbots? And if so, what are their salient characteristics?
-- Charlie -------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Mon Jun 6 14:31:59 1994 Replied: Tue, 07 Jun 1994 11:54:16 +0100 Replied: /CN=robots/@nexor.co.uk Replied: Michael.Mauldin@NL.CS.CMU.EDU Return-Path: Delivery-Date: Tue, 7 Jun 1994 09:30:07 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 14:31:59 +0100 Date: Mon, 6 Jun 1994 14:31:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:174170:940606133201] Content-Identifier: Description o... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 14:31:59 +0100; Alternate-Recipient: Allowed From: Michael.Mauldin@NL.CS.CMU.EDU Message-ID: <"17413 Mon Jun 6 14:31:48 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Description of the Lycos WW searcher at CMU Status: RO Content-Length: 5944 The Lycos project at Carnegie Mellon is in the early stage; we have a Web explorer in operation, and our indexer will come on-line later this month. We will use the SCOUT indexer which has an HTTP gateway (a set Sample database of the Tipster corpus from Wall Street Journal is available intermittently from http://fuzine.mt.cs.cmu.edu/scout/home.html). Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs. It uses a random search, keeps its record of URLs visited in a Perl assoc list stored in DBM (thanks to Charlie Stross for the tip that Gnu DBM doesn't have arbitrary limits!). It searches HTTP, FTP, and GOPHER sites, ignoring TELNET, MAILTO, and WAIS.
Lycos uses a data reduction scheme to reduce the stored information about each document:

    Title
    Headings and Subheadings
    100 most "weighty" words (using Tf*IDf, term frequency times inverse document frequency)
    First 20 lines
    Size in bytes
    Number of words

Lycos keeps a word frequency count as it runs...it has read over 25 million words. A list of the most frequent words found after searching 6.3 million words is available off the Lycos home page.

So far, Lycos has run for less than a month:

    URLs found:    313,468
    URLs fetched:  41,391 (35,382 successful)
    HTTP servers:  3,138

Citation counting (number of "parents" by URL): this is the first 50 URLs sorted by number of documents that reference that URL. What I did not do was to count only references from different sites (I'm sure that 99% of the refs to http://gdbwww.gdb.org/omim come from the Genome Database server itself).

------------------------------------------------------------------------
    1703 http://gdbwww.gdb.org/omim/
    1578 http://cossack.cosmic.uga.edu/keywords.html
     692 ftp://ftp.network.com/IPSEC/rfcindex4.html
     421 ftp://ftp.network.com/IPSEC/rfcindex3.html
     322 ftp://ftp.network.com/IPSEC/rfcauthor.html
     319 ftp://ftp.network.com/IPSEC/rfcindex5.html
     234 ftp://ftp.network.com/IPSEC/rfcindex2.html
     202 ftp://ftp.network.com/IPSEC/rfcindex1.html
     177 http://info.cern.ch/hypertext/WWW/TheProject.html
     166 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
     135 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
     133 http://www.cs.columbia.edu/~radev/
     133 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
     118 http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
     108 http://www.mcs.anl.gov/home/gropp/
     107 http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
     105 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
     101 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
      85 http://cui_www.unige.ch/w3catalog
      84 http://wings.buffalo.edu/world
      82 http://sass577.endo.sandia.gov/SEACAS/CUBIT/Developers/
      80 http://cui_www.unige.ch/OSG/MultimediaInfo/mmsurvey/
      79 http://www.nta.no/telektronikk/4.93.dir/
      76 http://asp.esam.nwu.edu/chris/dce_prodlist.html
      76 http://hypatia.gsfc.nasa.gov/NASA_homepage.html
      76 http://info.cern.ch/hypertext/DataSources/WWW/Servers.html
      75 http://www.ncsa.uiuc.edu/demoweb/demo.html
      75 http://www.rtd.com/people/rawn/
      74 ftp://ftp.network.com/IPSEC/rfcindex0.html
      74 http://tns-www.lcs.mit.edu/cgi-bin/value-added/sports/register.sos.texas.gov/texreg/
      73 http://rs560.cl.msu.edu/weather/getmegif.html
      71 http://rs560.cl.msu.edu/weather/interactive.html
      70 http://rs560.cl.msu.edu/weather/textindex.html
      70 http://rs560.cl.msu.edu/~henrich/
      70 http://www.seas.upenn.edu/~mengwong/
      68 http://info.cern.ch/hypertext/DataSources/WWW/Geographical.html
      68 http://rs560.cl.msu.edu/weather/uscmp.gif
      66 http://rs560.cl.msu.edu/weather/uscmp.mpg
      66 http://www.cso.uiuc.edu/~kline/cvk.html
      65 ftp://cs.nott.ac.uk/pub/sat-images/
      65 http://rs560.cl.msu.edu/weather/goes7ir.mpg
      65 http://rs560.cl.msu.edu/weather/worldir.mpg
      65 http://www.hmc.edu/~irilyth/diplomacy/
      64 gopher://burrow.cl.msu.edu/00/news/weather/lan
      64 gopher://ssec.wisc.edu
      64 http://rs560.cl.msu.edu/weather/6panel.mpg
      64 http://rs560.cl.msu.edu/weather/d2.jpg
      64 http://rs560.cl.msu.edu/weather/gmsvis.mpg
      63 http://cui_www.unige.ch/meta-index.html
      63 http://rd13doc.cern.ch/public/doc/Rd13StatusReport.html
------------------------------------------------------------------------

The Lycos philosophy is to keep a finite model of the web that enables subsequent searches to proceed more rapidly. The idea is to prune the "tree" of documents and to represent the clipped ends with a summary of the documents found under that node. The 100 most important words lists from several documents can be combined to produce a list of the 100 most important words in the set of documents.
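As a rough illustration of the word-list idea, the sketch below picks the k most "weighty" words of one document by a Tf*IDf score and then combines several such lists into one list for a set of documents. It is a guess at the general shape only, not Lycos code: the message does not give the exact weighting formula or the rule for combining lists, so the logarithmic IDF and the summing of weights used here are assumptions, as are the function names.

```python
import math
from collections import Counter


def top_words(doc_words, doc_freq, n_docs, k=100):
    """Return the k highest-weighted words of one document as {word: weight}.

    doc_words -- list of words in the document
    doc_freq  -- mapping word -> number of documents containing it
    n_docs    -- total number of documents seen so far
    Weight is tf * log(n_docs / df), a common TF*IDF variant (assumed).
    """
    tf = Counter(doc_words)
    weights = {w: c * math.log(n_docs / doc_freq[w]) for w, c in tf.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def merge_top_words(lists, k=100):
    """Combine per-document word lists into one list of at most k words.

    Assumed rule: add up the weights each word received and keep the
    k largest totals.
    """
    total = Counter()
    for weights in lists:
        total.update(weights)  # Counter.update adds values per key
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])
```

Because the merged result has the same shape as a single document's list, a pruned subtree can be summarised by the same fixed-size record as a leaf document.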
Alternative fixed representations of documents or document sets include the vector models such as Dumais at BellCore and Gallant & Caid at Hecht-Nielsen Corp. The number 100 was chosen arbitrarily, so we will need to investigate to find whether that number is too high or too low. I also subscribe to the dream of a single format and indexing scheme that each server runs on its own data, but given the current state of the community I believe it is premature to settle on a single format. Various information retrieval schemes depend on wildly different kinds of data. We should try out more ideas and evaluate them carefully and only then should we try to settle on a single format. Resources: I have agreed to share my code for research and educational users. Should I make a requirement that recipients of the code post to this mailing list so we can keep track of its proliferation? I already have promised code to two people. I will make lists, statistics, reports, and the index server accessible off the Lycos home page as they become available. --Michael L. Mauldin Carnegie Mellon University Center for Machine Translation 5000 Forbes Avenue Pittsburgh, PA 15213-3890 fuzzy@cmu.edu From /CN=robots-errors/@nexor.co.uk Tue Jun 7 11:55:08 1994 Return-Path: Delivery-Date: Tue, 7 Jun 1994 11:56:17 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 7 Jun 1994 11:55:08 +0100 Date: Tue, 7 Jun 1994 11:55:08 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273170:940607105511] Content-Identifier: Data formats ...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Jun 1994 11:55:08 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"27301 Tue Jun 7 11:54:46 1994"@nexor.co.uk> To: Michael.Mauldin@NL.CS.CMU.EDU Cc: /CN=robots/@nexor.co.uk In-Reply-To: <"17413 Mon Jun 6 14:31:48 1994"@nexor.co.uk> Subject: Data formats (was Re: Description of the Lycos WW searcher at CMU) Status: RO Content-Length: 2295 > I also subscribe to the dream of a single format and indexing scheme > that each server runs on its own data, That is one step further than what I proposed; I was talking about a single format for the database the robot uses to store its information in locally. This should be achievable before suggesting any Web-wide solution. > but given the current state of the community I believe it is > premature to settle on a single format. Various information > retrieval schemes depend on wildly different kinds of data. We > should try out more ideas and evaluate them carefully and only then > should we try to settle on a single format. Sure, we don't know exactly what is required yet, and identifying all our present and future requirements is probably an impossible task anyway. I believe in a gradual approach. The fact remains that almost all robots keep the same set of core information: which URL's were visited, which are going to be visited. Probably when URL's were retrieved, what Last-modified time was, what headings/keywords are, which URL's are referenced in which documents, etc. It should be possible to extract these common elements from the internal robot database, in a standard format. Maybe the term "data exchange format" is more applicable than "database format". I'd love to suggest something more concrete myself, but I can't use personal first-hand experience as I don't have a robot. What concrete data formats do people use? > I have agreed to share my code for research and educational users.
> Should I make a requirement that recipients of the code post to this > mailing list so we can keep track of its proliferation? I already > have promised code to two people. Handing out robot code is exactly what needn't be required if the data gathered by a robot was accessible in a mungeable format. If everybody who'd like to do some analysis of Web data were running their own robot, we'd be wasting a lot of bandwidth. > I will make lists, statistics, reports, and the index server > accessible off the Lycos home page as they become available. Great, keep us posted. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From charless@sco.COM Tue Jun 7 14:59:24 1994 Return-Path: Delivery-Date: Tue, 7 Jun 1994 14:59:57 +0100 Received: from relay1.UU.NET by lancaster.nexor.co.uk with SMTP (XTPP); Tue, 7 Jun 1994 14:59:24 +0100 Received: from sco.sco.COM by relay1.UU.NET with SMTP (rama) id QQwtgp28331; Tue, 7 Jun 1994 09:59:00 -0400 Received: from scol.london.sco.COM by sco.sco.COM id aa06700; Tue, 7 Jun 94 7:05:39 PDT Received: from ruddles.london.sco.com by scol.sco.COM id ab07727; Tue, 7 Jun 94 14:57:54 BST From: Charlie Stross To: Michael.Mauldin@NL.CS.CMU.EDU, m.koster@nexor.co.uk Subject: re: Data formats Cc: /CN=robots/@nexor.co.uk X-Mailer: SCO Portfolio 2.0 Date: Tue, 7 Jun 1994 14:53:43 +0100 (BST) Message-ID: <9406071455.aa14751@ruddles.sco.com> Status: RO Content-Length: 8203 Martijn Koster writes ... >> I also subscribe to the dream of a single format and indexing scheme >> that each server runs on its own data, >That is one step further than what I proposed; I was talking about a >single format for the database the robot uses to store its information >in locally. This should be achievable before suggesting any Web-wide >solution.
: >I'd love to suggest something more concrete myself, but I can't use >personal first-hand experience as I don't have a robot. What concrete >data formats do people use? I don't use this yet, but it intrigues me as a possible future route, for reasons that should be obvious ... Here's the readme file from GlimpseHTTP 1.0, released earlier this week: --------------------------- cut here ---------------------------------- NAME GlimpseHTTP WHAT IS GLIMPSE Glimpse (which stands for GLobal IMPlicit SEarch) is an indexing and query system that allows you to search through lots of files in many (possibly nested) directories very quickly. Glimpseindex, which you run by saying glimpseindex builds a very small index (2-5% of the text). With it, glimpse can search through all the files in these directories much the same way as grep, except that you don't have to specify file names. Glimpse supports most of agrep's options (agrep is our powerful version of grep, and it is part of glimpse) including approximate matching (e.g., finding misspelled words), Boolean queries, and even some limited forms of regular expressions. DESCRIPTION GlimpseHTTP is a collection of tools that allows you to incorporate glimpse in WWW documents. With it, you can provide general search capabilities to any user without incurring too much space overhead. Furthermore, these tools allow you to integrate search with browsing. If you have several nested directories which the user may browse, you can include the glimpse interface in each document such that only the relevant directories will be included in the search. More details are given below. The current version of GlimpseHTTP was tested under httpd 1.2 HTML server from NCSA and Glimpse currently works on many Unix platforms. To search and browse the information any HTML browser can be used (this includes NCSA Mosaic for X-Windows, MS-Windows and Macintosh, Lynx and other browsers. 
For maximum convenience your browser should support forms, although minimal functionality can be achieved with any browser). Since GlimpseHTTP uses Glimpse, this provides some unique features - A very small index (3-5% of the total text). - Reasonably fast search. - Search for approximate match allowing errors. In addition, GlimpseHTTP provides you with the following capabilities: - You can use a combination of browsing and searching: first, you locate the directory where the relevant information can be stored, then you can use search to locate specific files. - The result of the search is a nicely formatted hypertext with hyperlinks to matching documents. - Following the hyperlink leads you not only to a particular file, but also to the exact place where the match occurred. - Hyperlinks in the documents are converted on the fly to actual hyperlinks, which you can follow immediately. This makes the GlimpseHTTP particularly suitable for searching meta-information (Internet directories etc.). - Similar tools are provided for archiving and searching USENET newsgroups. You can maintain the archive of news articles and allow people to search your archive using the same interface. Features supported include kill-file for articles and fast search for particular posters. Since news archiver uses NNTP interface, you can archive news articles from remote news servers. (Browse and search for news is yet to be implemented: browsing in this case means selection of pertinent newsgroup(s), currently supported is only the search within one newsgroup at a time) Among the possible applications of GlimpseHTTP we envision: - FTP sites with search possibilities; - news archiving sites; - any search application which should be accessed over local or global network where searching for approximate match and/or saving of disk space for indices is an issue. GlimpseHTTP components: 1. aglimpse - "Archive Glimpse" - a tool for searching file hierarchies indexed for Glimpse.
aglimpse is a CGI-compliant program which performs the search and formats the output as HTML document with hyperlinks to the matches. 2. Administrative tools which facilitate maintaining and indexing of Glimpse archives. One of the programs is the HTML indexer which prepares hypertext indices for each searchable directory - this supports the concept of combined browsing and searching. 3. GlimpseNews - a collection of tools for archiving and searching newsgroups archives. SEE ALSO http://glimpse.cs.arizona.edu:1994/glimpsehttp.html - GlimpseHTTP home page. http://glimpse.cs.arizona.edu:1994 - Glimpse developers home page. README.install - directions on installing GlimpseHTTP on your server. README.amgr - description of Archive Manager. README.indexing - description of HTML indexer. AUTHORS Paul Klark (GlimpseHTTP) Udi Manber, Sun Wu, and Burra Gopal (Glimpse) University of Arizona, Department of Computer Science To be put on glimpse mailing list, send mail to glimpse-request@cs.arizona.edu -------------------------- CUT HERE ------------------------------ If you're still reading, what I'm thinking is: A URL is effectively a "word"; any given document has a unique URL. Indeed, an inverted-text index of URLs is a fairly sensible way to keep track of large maps of the web, where the same URLs may be replicated frequently. GlimpseHTTP is a really convenient alternative to WAIS or Z39.50 retrieval systems for providing access to stored text -- which is what HTML indexers need to do. What's got me interested is this: Imagine a web traversing 'bot that does a breadth-first search of the web. Every time it retrieves an HTML document, it indexes it using glimpse. However, rather than storing a pointer to the disk block where the file resides, as glimpse does when it's indexing text files, the knowbot stores a pointer to the URL of the file.
(URLs are stored in a separate table, so that only an index pointer -- say, a 24-bit integer -- is required to represent a URL in the actual text database.) When you do a search for some words, the glimpse system finds the best match, then dereferences the stored URLs to retrieve the documents containing those words, then does an agrep search on it for context. This schema gives you a good compromise between index size and storage space. A 24-bit integer -- enough to index 16 million URLs -- is of the same order of magnitude as the index pointers that glimpse already provides for pointing to blocks: it would be a bit slower, but the problem is not insuperable. So in return for, say, 50Mb of disk space devoted to an index, you could have a complete inverted-text database referring to 500-1000 Mb of HTML files on the web; these would be available for retrieval with a single URL lookup. Now layer a client-server lookup mechanism like Alibi -- the UberNet system -- on top of the glimpse/knowbot combo, and you have a mechanism for propagating queries between index servers. A properly designed system could answer queries on a huge information domain without doing any off-site lookups, or (if the information is not found locally) forward the query to other servers. The result would be something like Veronica, only with full free-text search capacity over the whole of WebSpace.
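The separate URL table can be pictured with a small sketch: each URL is stored once and the word index holds only small integers that point into that table. This is a toy illustration in Python and not how glimpse itself is built (glimpse keeps a compact index and finishes the search with agrep); the class and method names are invented for the example, and the 24-bit limit is simply the figure suggested above.

```python
class UrlTable:
    """Stores each URL once and hands out small integer ids for it."""

    MAX_IDS = 1 << 24  # a 24-bit pointer addresses 16,777,216 URLs

    def __init__(self):
        self.urls = []  # id -> URL
        self.ids = {}   # URL -> id

    def intern(self, url):
        """Return the id for url, adding it to the table if it is new."""
        if url not in self.ids:
            if len(self.urls) >= self.MAX_IDS:
                raise OverflowError("URL table full for 24-bit ids")
            self.ids[url] = len(self.urls)
            self.urls.append(url)
        return self.ids[url]

    def url(self, uid):
        return self.urls[uid]


class InvertedIndex:
    """Maps each word to the set of URL ids of documents containing it."""

    def __init__(self):
        self.table = UrlTable()
        self.postings = {}  # word -> set of URL ids

    def add_document(self, url, text):
        uid = self.table.intern(url)
        for word in set(text.lower().split()):
            self.postings.setdefault(word, set()).add(uid)

    def search(self, *words):
        """Return URLs of documents that contain every given word."""
        sets = [self.postings.get(w.lower(), set()) for w in words]
        if not sets:
            return []
        hits = set.intersection(*sets)
        return [self.table.url(uid) for uid in sorted(hits)]
```

A robot would call add_document for each page it fetches and then fetch the URLs returned by search to show the matching context, which is the step left to agrep in the scheme described above.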
-- Charlie -------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Tue Jun 7 15:00:04 1994 Return-Path: Delivery-Date: Tue, 7 Jun 1994 15:00:43 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 7 Jun 1994 15:00:04 +0100 Date: Tue, 7 Jun 1994 15:00:04 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294320:940607140005] Content-Identifier: ALIBI release... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Jun 1994 15:00:04 +0100; Alternate-Recipient: Allowed From: Charlie Stross Message-ID: <9406071455.aa14752@ruddles.sco.com> To: /CN=robots/@nexor.co.uk Subject: ALIBI release readme X-Mailer: SCO Portfolio 2.0 Status: RO Content-Length: 10614 Please accept my apologies if you've already read this. It's the readme for Alibi, a new resource retrieval system written at NIST, and released earlier this week; I think it's relevant to this list ... --------------------------- CUT HERE ----------------------------- Alibi/Unetd README (C) 1994 David Flater, dave@case50.ncsl.nist.gov VERSION: BETA 001 Quick Summary ------------- Alibi is a new resource discovery and information retrieval system for the Internet. The acronym stands for Adaptive Location of Internetworked Bases of Information. Alibi provides a query interface that allows users to retrieve information with keyword queries, without contacting remote servers or navigating. The resource discovery is fully automated, and the retrieval is truly location independent. Using Alibi ----------- The source to the client is called alibi.c. It's in a subdirectory called clients and is also available separately. 
The client program alibi can talk with any Unetd anywhere, but you should talk to a local daemon if you have one and let the system itself worry about remote access. Alibi requires as its one and only command line argument the name of the machine running the Unetd. If all is well, you will get a prompt asking for a query. Typing 'help' or anything else that is not a well-formed query will produce a brief help screen. A basic query is a group of keywords in parenthesis, e.g. (cache software). The special command 'more' (no parens) will retrieve something else like what you just got. More complex options are described in the help. 'quit' gets you out of the client. The client can be suspended and brought back without interfering with the processing of the query by the information system; only the final delivery will be delayed. System Description (Read Before Installing Unetd) ------------------------------------------------- The Ubernet is the information network used by Alibi. Alibi is the name of the entire system, including the simple client that allows users to retrieve information. Alibi is neither a navigational system like WWW nor a resource catalog like Archie. It is a fully distributed, fully automatic resource discovery and information retrieval system with a query-based user interface. The client (called alibi) contacts a Unetd, submits queries on behalf of the user, and processes replies from the Unetd. Unetd maintains Internet connections with other Unetds at other sites, and it may also communicate with mediators / resource managers at the local site to retrieve information. When a Unetd receives a query, it either generates a response using its local resources or forwards the query to another Unetd. Alibi can handle just about any kind of information. 
Currently available resources include the MS-DOS subtree of wuarchive (via NFS), the SEC's EDGAR database (via FTP), a geographical database for Virginia, and the following "demo-sized" information bases: a collection of sound files; a group of images at NASA Goddard Space Flight Center; several Usenet newsgroups; the Alibi FTP directory; and a source code reuse library. You do not need to have an information base to run Unetd, but it would be very nice if you would contribute what you have to the Alibi information network. It is preferable to run a Unetd at the site having the information base than to make Unetd access the data remotely, since Unetd handles distributed information retrieval much better than anything else. Installing Unetd ---------------- You do not need root privileges to run Unetd, but root should add it to rc.local to keep it up through power failures. As a last resort, a utility called cron_rc has been included in the utils subdirectory to restart the daemons after power failures without needing any privileges except access to crontab. Several makefiles are provided. The default makefile assumes that you have an ANSI C compiler, finds the maximum level of optimization it supports, and compiles the daemon. knrmakefile assumes that you have a K&R C compiler and the ansi2knr utility. makefile.sun.cc is a hacked version of knrmakefile that bypasses the compiler-finding script, which is necessary, for example, if your system administrator installed gcc (the preferred compiler) in a bogus way that makes it so that you can't actually compile anything. alibi.h contains a small number of #defines that can be altered to set the working directory of the daemon and to bypass C library functions that are missing on your machine. By default, Unetd does a 'cd FQDN' (filling in the FQDN of the local machine) when it is started. 
In that directory you need to put a file called bozo.txt (you can use the one provided in the source directory) and a file called peers that lists the FQDN's of other hosts running Unetds with which you want to connect. When bringing up Unetd for the first time, forget about the peers file and just run Unetd as an isolated daemon to make sure it's working. When you do eventually choose other Unetds to connect to, choose some small number of sites that are geographically close. Three is a good number; six is okay; ten is getting excessive. You won't gain anything from creating too many links except lots of overhead. Keep in mind that other sites can put YOU in their peers file without telling you (just like you were about to do to someone else) and make you have more links than you thought. Unetd writes logging information to stdout and stderr, so redirect them to a file. Among the first information written is the FQDN of the local machine, the PID of the daemon, and so on. If the FQDN is wrong, you must put a file called FQDN in the directory from which Unetd is started containing the correct FQDN. FQDN is the ONLY file that is read from the initial directory before 'cd FQDN' happens. If your Unetd announces itself with an incorrect FQDN, other daemons will "bozo" it repeatedly (this will be noted in the log file) and you will not be able to talk to other sites. The default FQDN will be correct if your system is correctly configured. Unetd also writes periodic statistics on things like average response time. Some of these statistics have known bugs, such as the fact that you can register more delivered responses than accepted queries. This logging information currently accumulates at a rate that will produce 100k of log data in a few days. You can truncate the log by sending a HUP signal to Unetd. (This will not kill the daemon) After the testing period is over I intend to greatly reduce the amount of logging that is done. 
Verbose logging is also enabled by default for the cache decision function because I haven't collected enough information to tune it yet. You might want to disable that by undoing the #define debugcache in alibi.h. Queries from users are logged since I want to know what somebody typed that crashed Unetd when it happens. It just logs the queries, not the identity of the users who entered them. Providing Information Bases --------------------------- To provide an information base you need to install a mediator. If you do this wrong, you can degrade the performance of the entire system. A mediator is a separate program and process that creates two named pipes in the directory used by Unetd. Unetd opens those pipes and talks with the mediator using a simple protocol. Unetd sends subqueries (keyword queries with no Boolean logic) to the mediator, and the mediator returns OIDs (Object IDentifiers) of matching data objects. Unetd might then ask the mediator to retrieve a data object or send another subquery. The reasons that incorrectly installed mediators are a danger are as follows: -- Unetd trusts mediators to provide intelligent classifications for data objects that mesh with the generally accepted class hierarchy of the Ubernet. If you start creating lots of bogus data classes, the bogosity will propagate into the adaptive query classification heuristics used by Unetds all over the place and degrade performance. -- Unetd trusts mediators not to give stupid answers to good queries. If a mediator says that a data object matches a query, it is assumed that the degree of relevance is fairly high and that every keyword in the subquery was found to relate. A mediator MUST NOT simply find the closest thing in the database regardless of the magnitude of its irrelevance! If no data are relevant, a null response is expected so that some other mediator will be given a chance. -- Unetd trusts mediators not to act in a Byzantine manner designed to crash the system. 
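The matching rule in the list above (every keyword of the subquery must relate to the object, and a null response is returned when nothing qualifies) might look roughly like this. It is a sketch in Python under stated assumptions, not code from the Alibi distribution: the function name `answer_subquery`, the dictionary of index terms, and the string OIDs are all hypothetical, and the named-pipe protocol between Unetd and the mediator is not modelled because the readme does not describe it in detail.

```python
def answer_subquery(keywords, objects):
    """Return the OIDs of objects whose index terms contain every keyword.

    objects maps an OID (a string here, by assumption) to a set of
    lowercase index terms.  An empty list plays the role of the
    "null response": nothing is returned merely for being closest.
    """
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return []
    # Subset test: all wanted keywords must appear among the object's terms.
    return sorted(oid for oid, terms in objects.items() if wanted <= terms)
```

The design point is the subset test: a mediator that ranked objects by partial overlap and always returned the top one would be giving exactly the kind of answer the readme warns against.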
Some examples of mediators are provided in a subdirectory called example_resources. rblobs.c is the most frequent starting point for building resources. rblobs.c will turn a file system subtree into an information base using index files that you must provide. A slight variation on rblobs.c was used to provide the MS-DOS subtree of wuarchive. rnntp.c shows how you can overhaul rblobs.c to let Unetd retrieve information from diverse information sources, and r_c_sources.c shows how automatic indexing can be employed. Getting Sources and Further Reading ----------------------------------- Alibi sources and miscellaneous Alibi-related papers are available for anonymous FTP on speckle.ncsl.nist.gov under the directory called flater. Of course, you can also get them through Alibi. Licensing and All That Jazz --------------------------- Everything that is shipped in the Alibi/Unetd package is (C) 1994 David Flater, but permission is granted for free copying. The sources may be modified, reused, or rewritten provided that fair credit to David Flater is given where appropriate, to the extent that is appropriate for the level of reuse. The right to use this software is granted to the public; the right to misuse it is not. Misuse of this software on an open network may degrade the performance of the entire information system and violate the rights of other users. Such misuse is expressly prohibited, and all rights that you have been granted to this software may be revoked in the event of such misuse. No warranties of any kind are made with respect to this package. The author disclaims any and all responsibility for anything bad that happens as a result of the use or misuse of this software. 
-------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Mon Jun 13 14:06:48 1994 Return-Path: Delivery-Date: Mon, 13 Jun 1994 14:07:27 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 13 Jun 1994 14:06:48 +0100 Date: Mon, 13 Jun 1994 14:06:48 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050560:940613130650] Content-Identifier: new paper Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 13 Jun 1994 14:06:48 +0100; Alternate-Recipient: Allowed From: Charlie Stross Message-ID: <9406131403.aa24556@ruddles.sco.com> To: /CN=robots/@nexor.co.uk Cc: charless@sco.COM Subject: new paper X-Mailer: SCO Portfolio 2.0 Status: RO Content-Length: 394 I've just written a first-draft informal paper discussing knowbots. It's on the web: http://gemma.demon.co.uk:8001/~charlie/websearch.html Comments? -- Charlie -------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Tue Jun 14 08:44:58 1994 Replied: Tue, 14 Jun 1994 13:51:37 +0100 Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Tue, 14 Jun 1994 08:45:38 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 14 Jun 1994 08:44:58 +0100 Date: Tue, 14 Jun 1994 08:44:58 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:128690:940614074459] Content-Identifier: libwww-perl: ... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 14 Jun 1994 08:44:58 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406140044.aa03471@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Cc: oscar@cui.unige.ch, grimes@raison.mro.dec.com, shelden@fatty.law.cornell.edu Subject: libwww-perl: A generic WWW interface library for perl tools Status: RO Content-Length: 850 Hello all, After some prompting from Martijn Koster and Oscar Nierstrasz at WWW94, I decided to rewrite the core of MOMspider so that it can serve as a generic library for WWW clients written in Perl. So far it includes support for all of HTTP and also local file requests. I am looking for more contributions to support the many other protocols and also to provide better HTML libraries. The distribution site and much more information about the libraries can be found at and also at Please take a look and tell me what you think. ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Tue Jun 14 13:31:03 1994 Return-Path: Delivery-Date: Tue, 14 Jun 1994 13:32:27 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 14 Jun 1994 13:31:03 +0100 Date: Tue, 14 Jun 1994 13:31:03 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:176910:940614123106] Content-Identifier: Re: libwww-pe... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 14 Jun 1994 13:31:03 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9406141228.AA18336@fuzine.mt.cs.cmu.edu> To: "Roy T. 
Fielding" Cc: /CN=robots/@nexor.co.uk Subject: Re: libwww-perl: A generic WWW interface library for perl tools Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 815 I have a C program to fetch URL's based on CERN's libwww that is available on the Web. The value of using libwww is that it works with HTTP, Gopher, FTP, and other protocols. My robot uses this to fetch URLs, but does the text processing in Perl. I also have a Perl subroutine that implements the RobotsNot Wanted function using Martijn's standard. It caches the rights file to prevent multiple accesses. Check out http://fuzine.mt.cs.cmu.edu/mlm/scoutget.html http://fuzine.mt.cs.cmu.edu/mlm/rnw.html Each contains a short description, code, and a sample test run. A question: Does anybody know a good way to randomly select an entry from a Perl associative list without looping through the whole array using 'each'? --Michael L. Mauldin fuzzy@cmu.edu http://fuzine.mt.cs.cmu.edu/mlm/home.html From /CN=robots-errors/@nexor.co.uk Wed Jun 15 14:58:51 1994 Replied: Wed, 15 Jun 1994 17:44:24 +0100 Replied: /CN=robots/@nexor.co.uk Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Wed, 15 Jun 1994 14:59:34 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 14:58:51 +0100 Date: Wed, 15 Jun 1994 14:58:51 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:025190:940615135853] Content-Identifier: Proposed name... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 14:58:51 +0100; Alternate-Recipient: Allowed From: "Roy T. 
Fielding" Message-ID: <9406150657.aa16431@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Subject: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 955 Hello all, I was just editing my MOMspider paper for final submission in the WWW94 proceedings (what a pain!) and noticed that I have several references to the name /RobotsNotWanted.txt in the text. I would like to change the name before it gets written in stone (i.e. before I hand over copyright to Elsevier). I propose that the name be: /spiders.txt Reasons: 1) It fits within the 8.3 filename restrictions for PCs 2) It is easy to remember and hard to mistake (i.e. no mixed case) 3) It is more web-ish than /robots.txt 4) It does not imply that all robots are excluded (/norobots.txt) So, what's the general consensus? I need to have a decision within the next 24 hours in order to get my paper done on time ;-) ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Wed Jun 15 15:14:59 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 15:15:21 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 15:14:59 +0100 Date: Wed, 15 Jun 1994 15:14:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:026740:940615141500] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 15:14:59 +0100; Alternate-Recipient: Allowed From: Guido.van.Rossum@cwi.nl Message-ID: <9406151407.AA09156=guido@voorn.cwi.nl> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406150657.aa16431@paris.ics.uci.edu> References: <9406150657.aa16431@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt X-Organization: CWI (Centrum voor Wiskunde en Informatica) X-Address: P.O. 
Box 94079, 1090 GB Amsterdam, The Netherlands X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax) Status: RO Content-Length: 245 I vote for /robots.txt. Seems more neutral (after all the general term for web crawlers seems to be robots, not spiders). --Guido van Rossum, CWI, Amsterdam URL: From /CN=robots-errors/@nexor.co.uk Wed Jun 15 17:01:23 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 17:01:49 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 17:01:23 +0100 Date: Wed, 15 Jun 1994 17:01:23 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039080:940615160125] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 17:01:23 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9406151559.AA24432@fuzine.mt.cs.cmu.edu> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 295 I am in favor of the new name, if only because this is the chance to put it out on paper, which is hard to change, and I have no major objections to this name. There are few enough RNW files out there that we can contact all known such servers by email... --Michael L. Mauldin fuzzy@cmu.edu From /CN=robots-errors/@nexor.co.uk Wed Jun 15 17:44:43 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 17:45:00 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 17:44:43 +0100 Date: Wed, 15 Jun 1994 17:44:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:048580:940615164444] Content-Identifier: Re: Proposed ... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 17:44:43 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"4855 Wed Jun 15 17:44:30 1994"@nexor.co.uk> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406150657.aa16431@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1425 > I propose that the name be: /spiders.txt > > Reasons: 1) It fits within the 8.3 filename restrictions for PCs > 2) It is easy to remember and hard to mistake (i.e. no mixed case) Agree. The reason for a far-out name was a smaller chance of a name collision, but the PCs are a problem. > 4) It does not imply that all robots are excluded (/norobots.txt) Agree. > 3) It is more web-ish than /robots.txt This is hardly a convincing argument. I'd prefer /robots.txt because it seems a broader term which can include other automated processes, such as mirrors. But if my vote results in a hung decision I'll happily change. > So, what's the general consensus? I need to have a decision within > the next 24 hours in order to get my paper done on time ;-) Yup, this is a good occasion to fix that outstanding issue. So, what have we got: /robots.txt:2 /spiders.txt:1 So Roy, as you've got the clock, let us know which name it is to be. Another issue with the robots.txt spec: is there any problem with allowing for shell-like "#" comment lines? This has been suggested by two other people, and I'd like to add it when I add the new name. I'd also like any other comments on the proposal... 
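For robot authors wondering what "#" comment lines would cost to support, a reader along these lines would probably do. This is a sketch in Python, not part of the proposal: the function name `parse_records` is hypothetical, it assumes a comment is a whole line whose first non-blank character is "#" (trailing comments are not handled), and the field names in the sample are only illustrative since the field syntax is still under discussion on the list.

```python
def parse_records(text):
    """Return (field, value) pairs, skipping blank and '#' comment lines."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or empty line: ignored
        field, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line; ignored in this sketch
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


sample = """\
# keep all robots out of the scratch area
Robot: *
Disallow: /tmp/
"""

print(parse_records(sample))
```

Field names are lowercased here so that a robot need not care about the capitalisation a site administrator happens to use; that too is an assumption rather than anything agreed in the thread.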
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:04:38 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:05:01 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:04:38 +0100 Date: Wed, 15 Jun 1994 19:04:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:054230:940615180442] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:04:38 +0100; Alternate-Recipient: Allowed From: John.R.R.Leavitt@NL.CS.CMU.EDU Message-ID: <"5406 Wed Jun 15 19:04:17 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1236 If anyone gets this twice, I apologize... I got some nasty bounce mail when I submitted it before, so I am trying again. I would prefer /robots.txt (or even better something like robots.lmt (limit), since txt implies human-readable text to me). My main preference for this is that my robots are named after ants, not spiders (since they will cooperate when they are done (someday...)). Also, there is the world wide web worm and the webcrawler, none of which seem to use the spider metaphor. Just my $0.02. -John. --------------------------------jrrl@cs.cmu.edu------------------------------- John R. R. 
Leavitt "Even through the darkest phase Research Programmer Be it thick or thin Center for Machine Translation Always someone marches brave Carnegie Mellon University Here beneath my skin" Editor, Omphalos Magazine k.d.lang, "Constant Craving" ------------------------------------------------------------------------------ Reading: Little, Big by John Crowley Remaking History by Kim Stanley Robinson ------------------------------------------------------------------------------ From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:17:41 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:18:12 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:17:41 +0100 Date: Wed, 15 Jun 1994 19:17:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055110:940615181742] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:17:41 +0100; Alternate-Recipient: Allowed From: "Tronche Ch. le comique" Message-ID: <9406151820.AA01347@indy1.lri.fr> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Original-Received: from indy1.lri.fr by lri.lri.fr, Wed, 15 Jun 1994 20:14:58 +0200 PP-warning: Illegal Received field on preceding line Original-Received: by indy1.lri.fr, Wed, 15 Jun 94 20:20:46 +0200 PP-warning: Illegal Received field on preceding line X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ I would prefer /robots.txt (or even better something like robots.lmt (limit), > since txt implies human-readable text to me). The file _is_ human-readable, in some sense. Just $0.02 more. 
+--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:28:10 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:28:43 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:28:10 +0100 Date: Wed, 15 Jun 1994 19:28:10 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055740:940615182811] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:28:10 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9406151828.AA24805@fuzine.mt.cs.cmu.edu> To: /CN=robots/@nexor.co.uk Cc: fuzzy@CMU.EDU Subject: Re: Proposed name change for /RobotsNotWanted.txt Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 1057 Okay, let me modify my earlier vote. I am still in favor of doing the name change NOW and picking a DOS compatible name. How about agents.pol (For agents policy). This satisfies a number of criteria 1. It is neutral, it does not imply that agents are good or bad. 2. "agent" is a general accepted term for what spiders, worms, ants and robots do. 3. 
the .pol extension does not seem to imply human readability Let me also second (or vote for) the suggestion to add comments to the spec, with '#' being a perfectly acceptable comment introduction character. Finally, let's drop the notion that an empty agents.pol file has a meaning...given the diversity of server responses to a nonexistent file, let's force someone to use the exclusion language to deny access to everyone:

    Robot: *
    Disallow: /

should be the accepted way to turn off remote agents. We might as well change the "Robot:" to "Agent:", and then, we'll even be consistent with the CERN WWW spec (it is a User-Agent, after all). --Michael Mauldin From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:43:32 1994 Replied: Thu, 16 Jun 1994 09:37:30 +0100 Replied: /CN=robots/@nexor.co.uk Replied: John.R.R.Leavitt@NL.CS.CMU.EDU Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:44:06 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:43:32 +0100 Date: Wed, 15 Jun 1994 19:43:32 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:056630:940615184333] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:43:32 +0100; Alternate-Recipient: Allowed From: John.R.R.Leavitt@NL.CS.CMU.EDU Message-ID: <"5660 Wed Jun 15 19:43:27 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1537 "Tronche Ch. le comique" writes: >John (John.R.R.Leavitt@NL.CS.CMU.EDU) writes: > >> I would prefer /robots.txt (or even better something like robots.lmt (limit), >> since txt implies human-readable text to me). > >The file _is_ human-readable, in some sense. True. But then, to the right people, so are .ps files, .c files, and even strange things like .dvi and .o files. 
Around here, .perl and .lisp files are considered human readable for the most part. What I meant, was that .txt seems to suggest non-computer-readable data (meaning not designed for computer readability, since I'm sure a computer could read anything I could). In the end, the extension really doesn't matter all that much. :^) a couple more cents (if we keep going, we can all chip in on a soda! :^) -John. --------------------------------jrrl@cs.cmu.edu------------------------------- John R. R. Leavitt "Even through the darkest phase Research Programmer Be it thick or thin Center for Machine Translation Always someone marches brave Carnegie Mellon University Here beneath my skin" Editor, Omphalos Magazine k.d.lang, "Constant Craving" ------------------------------------------------------------------------------ Reading: Little, Big by John Crowley Remaking History by Kim Stanley Robinson ------------------------------------------------------------------------------ From /CN=robots-errors/@nexor.co.uk Wed Jun 15 21:13:19 1994 Replied: Thu, 16 Jun 1994 09:38:53 +0100 Replied: /CN=robots/@nexor.co.uk Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Wed, 15 Jun 1994 21:13:41 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 21:13:19 +0100 Date: Wed, 15 Jun 1994 21:13:19 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:062430:940615201320] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 21:13:19 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406151307.aa07835@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk In-Reply-To: <"5406 Wed Jun 15 19:04:17 1994"@nexor.co.uk> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1923 Hmmm...not a whole lot of consensus out there. 
Acronyms such as "racl.txt" are too hard to remember. I think the extension needs to reflect the content-type, not its purpose. Specifically, it is not fair to ask people to define a new type just for this file. On the other hand, we could always call it "robots.pl" and require the format to be in Perl4. ;-) Yes, the comment syntax should be "all lines starting with # and all empty lines are ignored". I would also like to add an "Expires: " entry, e.g. Expires: daily (means don't check me again until tomorrow) Expires: weekly ( " " " " " for 7 days) Expires: monthly ( " " " " " for 30 days) Expires: never (means never check me again) Expires: 27 Jun 1994 (means don't check again until after the given date) Just my NZ half-penny ... ======================================================================= Okay, the voting so far, counting my own (I think): RTF = your's truly Y = Yes GvR = Guido van Rossum N = No MLM = Michael L. Mauldin O = Okay, maybe, don't care, ... MAK = Martijn Koster JRL = John R. R. Leavitt CT = Christophe Tronche spiders.txt robots.txt robots.lmt racl.txt agents.pol agents.txt avoidURL.txt ----------- ---------- ---------- -------- ---------- ---------- ------------ RTF Y O N N N O Y GvR N Y MLM O O Y MAK N Y JRL N Y Y CT N Y and a grand total of USD $0.04 + FF $0.02 + NZD $0.005 ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Thu Jun 16 09:39:36 1994 Return-Path: Delivery-Date: Thu, 16 Jun 1994 09:41:05 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 16 Jun 1994 09:39:36 +0100 Date: Thu, 16 Jun 1994 09:39:36 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:109360:940616083939] Content-Identifier: Re: Proposed ... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 09:39:36 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"10917 Thu Jun 16 09:39:14 1994"@nexor.co.uk> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406151307.aa07835@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 301 > I would also like to add an "Expires: " entry, e.g. Why not rely on the HTTP Expires? That's what it's for... -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Jun 16 09:37:50 1994 Return-Path: Delivery-Date: Thu, 16 Jun 1994 09:38:09 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 16 Jun 1994 09:37:50 +0100 Date: Thu, 16 Jun 1994 09:37:50 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:108770:940616083751] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 09:37:50 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"10874 Thu Jun 16 09:37:43 1994"@nexor.co.uk> To: John.R.R.Leavitt@NL.CS.CMU.EDU Cc: /CN=robots/@nexor.co.uk In-Reply-To: <"5660 Wed Jun 15 19:43:27 1994"@nexor.co.uk> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 2155 Michael Mauldin wrote: > 3. the .pol extension does not seem to imply > human readability ... > In the end, the extension really doesn't matter all that much. I've had problems in the past with ALIWEB's /site.idx, where servers (in these cases CERN and the NT one) didn't recognise the extension and made it application/binary or something. 
This can be a bit annoying if the client uses an Accept line with
text/plain and text/html. So I guess officially we'd like a separate
MIME type for this and not worry about extensions at all, but in
practice using .txt saves you hassle.

> 2. "agent" is a generally accepted term for what
> spiders, worms, ants and robots do.

Well, it's close to user-agent, i.e. any client. This is a larger
category than the automated robots these policy lines are directed at;
I don't mind manual browsers going through all these places I want to
hide from robots. So this may not be appropriate.

On my way home last night I remembered that a while back someone
suggested "/robotsp.txt", with the P for policy. This is even better
than /robots.txt, as it describes the contents of the file well, and
has less chance of name collision. But then, I wouldn't want to upset
Roy's chart :-)

> Let me also second (or vote for) the suggestion to
> add comments to the spec, with '#' being a perfectly
> acceptable comment introduction character.

Good. Anybody object?

> Finally, let's drop the notion that an empty agents.pol
> file has a meaning...given the diversity of server responses
> to a nonexistent file, let's force someone to use the
> exclusion language to deny access to everyone:

OK, it was a bit obscure.

> should be the accepted way to turn off remote agents.
> We might as well change the "Robot:" to "Agent:", and
> then, we'll even be consistent with the CERN WWW spec
> (it is a User-Agent, after all).

"User-agent:" then? Mmm, I think I like the sound of that.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jun 16 12:40:49 1994
Replied: Fri, 17 Jun 1994 14:07:21 +0100
Replied: "Roy T. Fielding"
Replied: Thu, 16 Jun 1994 12:44:18 +0100
Replied: "Roy T.
Fielding" Return-Path: Delivery-Date: Thu, 16 Jun 1994 12:41:47 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 16 Jun 1994 12:40:49 +0100 Date: Thu, 16 Jun 1994 12:40:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:134480:940616114051] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 12:40:49 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406160440.aa00301@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 2604 ------- Forwarded Message From: Peter Beebee To: fielding@simplon.ICS.UCI.EDU In-reply-to: "Roy T. Fielding"'s message of Wed, 15 Jun 1994 13:13:19 -0700 <9406151307.aa07835@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt Reply-to: beebee@parc.xerox.com Message-Id: <94Jun15.164028pdt.2695@persica.parc.xerox.com> Date: Wed, 15 Jun 1994 16:40:23 PDT Yet more American $$ But first, an introduction: Hello everybody, my name is Peter Beebee. I'm an undergraduate at MIT currently working at Xerox PARC. One of my recent projects is to implement an experimental web browser which operates through existing WWW clients but provides more natural searching options. For this project I am writing (in PERL) a robot currently identified as SG-Scout. The purpose of this robot is to collect the information needed for the searching algorithms I will be using. Actually, the first version of SG-Scout is already written (thanks to the help of a couple of you); I've gotten it to run inside Xerox, but I've had problems with our firewall when I've tried to access remote servers. I do (and plan to continue to) comply with the proposed standard of exclusion. 
As for the name problem, I vote for something like "robots.cnf" or "robots.cfg" (configure) over "robots.txt". This way we could avoid creating our own extension for one file, but we would at the same time reduce the chances of collision. The RobotsNotWanted.txt file is more of a configuration file than a text file... -- Peter ------- End of Forwarded Message And my reply is: That sounds a lot like fish-search -- you should talk to Reiner Post. The libwww-perl code includes the ability to use a proxy server. See And there is no defined mime-type for config files, so .cnf and .cfg would be no better than .lmt in that regard. Of course, we could always define one and make it a standard, say text/config cfg but that would still be somewhat annoying to server maintainers. Oh, and never mind about the Expires thing -- I agree with Martijn that we should use the (painfully obvious) existing mechanism. However, I do not think that "robotsp.txt" more accurately reflects the purpose of the file -- it sounds like robot's pee (which is not quite what we had in mind ;-) ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Fri Jun 17 00:59:37 1994 Replied: Fri, 17 Jun 1994 09:17:21 +0100 Replied: /CN=robots/@nexor.co.uk Replied: beebee@parc.xerox.com Return-Path: Delivery-Date: Fri, 17 Jun 1994 01:00:36 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 00:59:37 +0100 Date: Fri, 17 Jun 1994 00:59:37 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:195960:940616235940] Content-Identifier: Evolving Stan... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 00:59:37 +0100; Alternate-Recipient: Allowed From: Peter Beebee Message-ID: <94Jun16.165832pdt.2695@persica.parc.xerox.com> To: /CN=robots/@nexor.co.uk Subject: Evolving Standard Reply-To: beebee@parc.xerox.com Status: RO Content-Length: 169 Ok.. so how is the standard emerging out of all this turmoil? "robots.txt"? empty file = all robots permitted? '#' = comment character? no "Expires" lines? - Peter From /CN=robots-errors/@nexor.co.uk Fri Jun 17 10:54:41 1994 Replied: Fri, 17 Jun 1994 14:05:53 +0100 Replied: /CN=robots/@nexor.co.uk Replied: beebee@parc.xerox.com Return-Path: Delivery-Date: Fri, 17 Jun 1994 10:56:08 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 10:54:41 +0100 Date: Fri, 17 Jun 1994 10:54:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:247490:940617095444] Content-Identifier: Re: Evolving ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 10:54:41 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"23074 Fri Jun 17 09:17:28 1994"@nexor.co.uk> To: beebee@parc.xerox.com Cc: /CN=robots/@nexor.co.uk In-Reply-To: <94Jun16.165832pdt.2695@persica.parc.xerox.com> Subject: Re: Evolving Standard Status: RO Content-Length: 405 > Ok.. so how is the standard emerging out of all this turmoil? > > "robots.txt"? > empty file = all robots permitted? > '#' = comment character? > no "Expires" lines? Yes. I'll be changing the document accordingly. 
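The points agreed above (lines starting with '#' and empty lines are ignored, and an empty file permits all robots) can be illustrated with a minimal Python sketch. The function names are hypothetical, and treating a file that contains only comments and blank lines as "empty" is an assumption, not something the thread states.

```python
def significant_lines(text):
    """Return the lines of a /robots.txt body that carry meaning.

    Lines starting with '#' and empty lines are ignored, following the
    comment syntax proposed in this thread.
    """
    kept = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line: skip it
        kept.append(line)
    return kept


def all_robots_permitted(text):
    """An empty file means all robots are permitted.

    Assumption: a file holding only comments or blank lines counts as empty.
    """
    return not significant_lines(text)
```

For example, `all_robots_permitted("")` is true, while a file containing `User-agent: *` and `Disallow: /` still has two significant lines to interpret.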
-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 14:06:28 1994
Return-Path:
Delivery-Date: Fri, 17 Jun 1994 14:07:27 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 14:06:28 +0100
Date: Fri, 17 Jun 1994 14:06:28 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273620:940617130630]
Content-Identifier: Re: Evolving ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 14:06:28 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"27354 Fri Jun 17 14:06:01 1994"@nexor.co.uk>
Cc: beebee@parc.xerox.com, /CN=robots/@nexor.co.uk
In-Reply-To: <"23074 Fri Jun 17 09:17:28 1994"@nexor.co.uk>
Subject: Re: Evolving Standard
Status: RO
Content-Length: 392

I wrote:

> Yes. I'll be changing the document accordingly.

I have in fact rewritten it entirely. Please let me know if there's
anything I've missed.

http://web.nexor.co.uk/mak/doc/robots/norobots.html

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 17:41:25 1994
Replied: Fri, 17 Jun 1994 17:43:55 +0100
Replied: "Roy T. Fielding"
Return-Path:
Delivery-Date: Fri, 17 Jun 1994 17:41:59 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 17:41:25 +0100
Date: Fri, 17 Jun 1994 17:41:25 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007700:940617164126]
Content-Identifier: Re: Evolving ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 17:41:25 +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding"
Message-ID: <9406170939.aa07770@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"27354 Fri Jun 17 14:06:01 1994"@nexor.co.uk>
Subject: Re: Evolving Standard
Status: RO
Content-Length: 396

Martijn wrote:

> I have in fact rewritten it entirely. Please let me know if there's
> anything I've missed.
>
> http://web.nexor.co.uk/mak/doc/robots/norobots.html

Oooh, very nice. Looks great,

....Roy Fielding
ICS Grad Student, University of California, Irvine USA
(fielding@ics.uci.edu)
About Roy

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 17:37:38 1994
Return-Path:
Delivery-Date: Fri, 17 Jun 1994 17:38:11 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 17:37:38 +0100
Date: Fri, 17 Jun 1994 17:37:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007030:940617163739]
Content-Identifier: Re: New code ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 17:37:38 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"699 Fri Jun 17 17:37:28 1994"@nexor.co.uk>
To: " (Michael Mauldin)"
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <9406171622.AA09303@fuzine.mt.cs.cmu.edu>
Subject: Re: New code available to implement the latest standard (Perl)
Status: RO
Content-Length: 1174

Michael Mauldin wondered about ordering of the lines in the record.
As it stands the ordering isn't explicitly specified, so that a
User-agent line can follow a Disallow line:

> Disallow: /
> User-agent: GoodRobot
> Disallow:
>
> What does this mean?
The same as

    User-agent: GoodRobot
    Disallow:
    Disallow: /

which is silly, but is to be interpreted to mean "Allow all URLs
except those which start with a slash", which in practice disallows
all URLs.

> Requiring the robot name before the action allows a simple
> way to determine how to proceed
> 1. Find your name (or *)
> 2. Read all Disallow lines and act on them.
>
> Otherwise you force the robot to read the whole file
> to figure out what to do.

I like unspecified ordering because that is how RFC822 is specified,
and this format looks very much like RFC822. I can't imagine parsing
overhead to really be a problem. But if there is a lot of resistance
to it I'll change it. Comments?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Fri Jun 17 18:38:05 1994
Replied: Sun, 19 Jun 1994 15:39:08 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: "Roy T. Fielding"
Return-Path:
Delivery-Date: Fri, 17 Jun 1994 18:38:43 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 18:38:05 +0100
Date: Fri, 17 Jun 1994 18:38:05 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:011430:940617173807]
Content-Identifier: Re: New code ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 18:38:05 +0100;
Alternate-Recipient: Allowed
From: "Roy T. Fielding"
Message-ID: <9406171036.aa11854@paris.ics.uci.edu>
To: /CN=robots/@nexor.co.uk
In-Reply-To: <"699 Fri Jun 17 17:37:28 1994"@nexor.co.uk>
Subject: Re: New code available to implement the latest standard (Perl)
Status: RO
Content-Length: 1010

Martijn wrote:

> Michael Mauldin wondered:
>> Requiring the robot name before the action allows a simple
>> way to determine how to proceed
>> 1.
Find your name (or *) >> 2. Read all Disallow lines and act on them. >> >> Otherwise you force the robot to read the whole file >> to figure out what to do. > > I like unspecified ordering because that is how RFC822 is specified, > and this format looks very much like rfc822. I can't imagine parsing > overhead to really be a problem. But if there is a lot of resistance > to it I'll change it. comments? Nope, it won't work that way. rfc822 parsers combine identical headers into a single, comma-separated list, thus causing any blank Disallow: lines to disappear. I recommend defining it as ordered (it is less confusing to the reader that way). ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Sun Jun 19 15:39:21 1994 Return-Path: Delivery-Date: Sun, 19 Jun 1994 15:39:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 19 Jun 1994 15:39:21 +0100 Date: Sun, 19 Jun 1994 15:39:21 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:102640:940619143922] Content-Identifier: Re: New code ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 19 Jun 1994 15:39:21 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"10260 Sun Jun 19 15:39:14 1994"@nexor.co.uk> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406171036.aa11854@paris.ics.uci.edu> Subject: Re: New code available to implement the latest standard (Perl) Status: RO Content-Length: 508 Roy wrote: > Nope, it won't work that way. rfc822 parsers combine identical headers > into a single, comma-separated list, thus causing any blank Disallow: > lines to disappear. which is consistent with its semantics. > I recommend defining it as ordered (it is less confusing to the reader > that way). alright... 
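The ordered interpretation settled on here (find your name or "*", then read the Disallow lines that follow) might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, a record is assumed to run from one User-agent line to the next, Disallow lines appearing before any User-agent line are ignored, and names are matched case-insensitively. None of these details is fixed by the thread itself.

```python
def disallowed_prefixes(text, robot_name):
    """Collect the Disallow prefixes that apply to robot_name.

    Assumption: each User-agent line opens a new record, and the Disallow
    lines after it belong to that record (ordered interpretation).
    """
    records = []    # (agent, [prefixes]) in file order
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and empty lines are ignored
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            current = (value, [])
            records.append(current)
        elif field == "disallow" and current is not None:
            if value:  # an empty Disallow excludes nothing
                current[1].append(value)
    # Step 1: find your name, falling back to "*".
    for wanted in (robot_name.lower(), "*"):
        for agent, prefixes in records:
            if agent.lower() == wanted:
                return prefixes  # Step 2: these are the lines to act on
    return []


def is_allowed(text, robot_name, path):
    """True if no applicable Disallow prefix matches the start of path."""
    return not any(path.startswith(p) for p in disallowed_prefixes(text, robot_name))
```

With a file that gives GoodRobot an empty Disallow and gives "*" a Disallow of "/", `is_allowed` admits GoodRobot everywhere and refuses any other robot.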
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Sat Jun 18 03:54:36 1994 Return-Path: Delivery-Date: Sat, 18 Jun 1994 03:55:12 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sat, 18 Jun 1994 03:54:36 +0100 Date: Sat, 18 Jun 1994 03:54:36 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:044150:940618025438] Content-Identifier: libwww-perl v... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 18 Jun 1994 03:54:36 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406171953.aa14559@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Cc: oscar@cui.unige.ch, grimes@raison.mro.dec.com, shelden@fatty.law.cornell.edu Subject: libwww-perl version 0.11 Status: RO Content-Length: 1134 Hello all, I made a few bug fixes and upgrades to the libwww-perl in preparation for a general announcement on www-talk. Version 0.11 June 17, 1994 Changed environment variable LIBWWW-PERL to LIBWWW_PERL because some systems can't handle the dash (Charlie Stross). Fixed bug in "get" that caused full pathname to be used as the method (Martijn Koster). Fixed handling of perverse relative URLs (e.g. ../../) in wwwurl'absolute. The distribution site and much more information about the libraries can be found at and also at If you have already picked up a copy, a patch file is available at both locations (patch010to011.txt). After today, I will just announce changes on www-talk (I know how annoying it is to get several copies of the same announcement). 
....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Sun Jun 19 16:19:45 1994 Return-Path: Delivery-Date: Sun, 19 Jun 1994 16:20:09 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 19 Jun 1994 16:19:45 +0100 Date: Sun, 19 Jun 1994 16:19:45 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:104030:940619151947] Content-Identifier: (q)Version(q)... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 19 Jun 1994 16:19:45 +0100; Alternate-Recipient: Allowed From: "Tronche Ch. le comique" Message-ID: <9406191522.AA26207@indy1.lri.fr> To: /CN=robots/@nexor.co.uk Subject: "Version" field in /robots.txt Original-Received: from indy1.lri.fr by lri.lri.fr, Sun, 19 Jun 1994 17:16:43 +0200 PP-warning: Illegal Received field on preceding line Original-Received: by indy1.lri.fr, Sun, 19 Jun 94 17:22:42 +0200 PP-warning: Illegal Received field on preceding line X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ Delivery-Date: Mon, 20 Jun 1994 13:28:25 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 20 Jun 1994 13:27:38 +0100 Date: Mon, 20 Jun 1994 13:27:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:187400:940620122739] C