From /CN=robots-errors/@nexor.co.uk Wed Jun 1 21:17:14 1994
Return-Path: 
Delivery-Date: Wed, 1 Jun 1994 21:17:35 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 1 Jun 1994 21:17:14 +0100
Date: Wed, 1 Jun 1994 21:17:14 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039230:940601201716]
Content-Identifier: WWW robots di...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 1 Jun 1994 21:17:14 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"3912 Wed Jun 1 21:17:00 1994"@nexor.co.uk>
To: Jonathon Fletcher , David Eichmann , Oliver McBryan , Roy Fielding ,
    Brian Pinkerton , Fred Barrie , Matthew Gray , Paul De Bra ,
    Guido van Rossum , "James E. Pitkow" , Andreas Ley ,
    Christophe Tronche , Charlie Stross , L.McLoughlin@doc.imperial.ac.uk,
    Michael L Mauldin
Cc: /CN=robots/@nexor.co.uk
Subject: WWW robots discussion list
Status: RO
Content-Length: 1305

At the WWW'94 Conference the robot authors present expressed an interest in some closer collaboration. I volunteered to set up a mailing list to serve as a platform for these technical discussions. This list is now active.

As you are all developing or administering robots I'd urge you to make use of this facility; together we should be able to reduce the occurrence of problems caused by robots, reduce some of the duplicate effort, and improve the service to users of robot-generated facilities.

If you'd like to subscribe, send a message to robots-request@nexor.co.uk, with the lines

    subscribe
    help
    stop

in the body of the message. The list manager is of course NXDLM, which we market as a product, and is configured to keep an archive of traffic on the list. This archive is accessible from the Web via our experimental gateway reachable from .

To send messages to the list itself use robots@nexor.co.uk.

Next week (allowing people time to register) I'll post a proposed charter to the list, and list some issues I'd like to see discussed.

Looking forward to your contributions,

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 09:37:38 1994
Return-Path: 
Delivery-Date: Mon, 6 Jun 1994 09:38:15 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 09:37:38 +0100
Date: Mon, 6 Jun 1994 09:37:38 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:130800:940606083743]
Content-Identifier: Proposed Char...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:37:38 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"13056 Mon Jun 6 09:37:06 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Proposed Charter
Status: RO
Content-Length: 1401

Welcome to you all...,

Here is the proposed charter for this list, for future reference by new subscribers. It's straightforward, but if anybody would like to see any changes let me know.

--
Proposed charter for robots@nexor.co.uk.

This list is intended as a technical forum for authors, maintainers and administrators of WWW robots. Its aim is to maximise the benefits WWW robots can offer while minimising drawbacks and duplication of effort.
It is intended to address both development and operational aspects of WWW robots. This list is not intended for general discussion of WWW development efforts, or as a first line of support for users of robot facilities.

Postings to this list are informal, and decisions and recommendations formulated here do not constitute any official standards. Postings to this list will be made available publicly through the list-manager's archive, and NEXOR doesn't accept any responsibility for the content of the postings.

Related lists:
  www-talk@info.cern.ch:  technical WWW development discussions
  www-html@info.cern.ch:  HTML-specific development discussions
  www-cache@info.cern.ch: technical discussions on proxies and caching
  comp.infosystems.www.*: WWW discussions

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 09:39:16 1994
Return-Path: 
Delivery-Date: Mon, 6 Jun 1994 09:39:54 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 09:39:16 +0100
Date: Mon, 6 Jun 1994 09:39:16 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:131210:940606083924]
Content-Identifier: Topics
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 09:39:16 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"13118 Mon Jun 6 09:39:07 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Topics
Status: RO
Content-Length: 7634

Here is a long list of topics I'd like to see discussed at some point, in no particular order. I look forward to comments on these topics, other issues, and what the priorities should be.

* public information

- robot profile matrix
  It would be nice to have a matrix of certain attributes of the various robots that exist, that is available to the Web public at large. The list I maintain on could serve as a basis; are there any additions people would like to see?

* sharing of data

- format / access protocol of database
  Most indexing robots generate a database of information, which can then be searched through publicly accessible ISINDEX/Form pages. It would be nice if the actual database was publicly available, or where applicable an access protocol could be made publicly available (e.g. SQL). Others could then run local mirrors of the search engines, write their own search engines, or do analysis of the data.

- distributed data gathering
  If there was a standard database format / access protocol, the data gathering could be distributed over the net, either by separate robots, or by multiple copies of the same robot. Jonathon, you mentioned once you were working on some robot database synchronisation scheme. Did you get anywhere?

* data analysis

As robots traverse the Web, they could do a lot of statistical analysis, either real-time, or on the resulting database. It seems silly that multiple robots go out over the same data, all doing slightly different analysis. It would be really nice to publish:

- a list of servers
  Like Matthew Gray's list, but one that is as up-to-date as the latest robot run, has only got hosts that actually exist, and is smart about multiple DNS names for the same IP address.

- inverse maps
  Robots can create inverse maps, so that I can find out which pages refer to a particular page.
  Until the Referer HTTP field becomes more used this could be very valuable to find bad links. And it'd be nice to know the average number of links to a page; how interlinked is the Web? We could have a most-referenced league table; which is the most popular page in the Web in terms of links?

- general stats
  like average number of visited documents per server (and min & max), total number of documents visited, total number of hosts visited, percentage of links that are bad, percentage of HTML documents that are a tag soup, percentage of documents not changed in x days, etc. etc.

* sharing of operational tips

All robot maintainers hit the same problems at certain sites, and get things like:

- seed documents
  What documents are good to start robots from.

- site exclusion lists
  Which sites explicitly ask not to be visited.

- black hole lists
  Which cgi-scripts create infinitely linked web spaces.

- avoidance lists
  Which data should be avoided (e.g. the UNIX manual gateways).

- robotsnotwanted proposal
  I'd like to get some more discussion on this. As all the robot writers are on this list we should be able to decide on something that can easily be implemented by robots and users. The only outstanding issue is the name of the file; it is too long for DOS-based servers. Is there any problem with changing the filename to robotsp.txt (for robots policy)? (A small checking sketch follows at the end of this message.)

- scheduled runs
  It might be nice to know when which robots are running, just in case people start wondering.

- ALIWEB
  For those sites that have a /site.idx file it might be worth giving the documents referenced in it special consideration.

* sharing of algorithms

All robots have different algorithms for a lot of the same functions. It should be possible to find the best algorithm that all robots can use:

- document selection
  Which documents do you visit? A lot of robots go "n levels deep", which seems pretty arbitrary to me. Doing "n levels from the root document" might make more sense.

- HTML parsing
  This is tricky, with so much bad HTML out there. There must be a "best way" to extract URL's from documents; I am sure that at the moment some robots barf on some documents.

- load balancing
  How do you decide when to query a site so as to balance the load best? It is by now clear the "visit one site at top speed" approach is nasty; what is used now? Round robin? Can time zones be used? How fast do robots run?

- search algorithms
  Once you have a database, what algorithm can one use to search it? At the moment there are Perl scripts, SQL scripts, WAIS databases etc. If there was a standard database format these could be benchmarked.

- error recovery
  Robots should be restartable without having to backtrack. How is this best achieved?

* sharing of code

There is a lot of duplication of effort in the coding and maintenance of robot code. It would be useful if there was one common code base for robots to draw from, implementing the separate algorithms used in robots. I would really like to see a single robot implementation (TUM: The Unified Robot?), that could run cooperatively around the world. Is this me dreaming, or is this something more of you see as beneficial? If so, how can we make this a reality? What language is most suitable (Perl, surely :-)? What design allows the most flexibility and safety?

* HTML/HTTP extensions

It may be that there are things in HTTP/HTML that robots could use but don't at the moment, and it may even be worth extending the protocol to put in facilities aimed at automated tools (e.g. If-Modified-Since).
At WWW'94 one idea was for example to implement a server-side facility to parse an HTML document and return only the links.

* Caching issues

The increased use of caching presents special problems for robots: how does a robot recognise a cached document sitting in the cache data area of a caching server? Should it index them? But caches and robots do similar things: a robot uses its own database as a cache (I hope!), but a caching server could also use that data. This comes back to standardising the database; maybe the structure used by the CERN cache can be used as the format for robot gathering output.

Robots can also be useful for pre-loading a cache, to do mirroring, or to prepare for off-line demos. Maybe robots should have command-line options to facilitate this. Then again, robot code should probably not be handed out freely.

* Testing

The person running a robot should keep close tabs on what it is doing at any one time. What sort of monitoring tools are used to do that? Testing robot modifications is another issue. I have noticed in the past that a robot did the same run several times in a day, which it turned out to do "for testing". Surely tests should be done locally.

Right, I have been waiting to get all these off my chest. I think TUM is the most challenging long-term topic, but in the short term I think the standard database(s) is the most important; it would bring immediate benefit, and a lot of the other issues can follow on from that.

Any comments?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html
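[A minimal Perl sketch of the kind of check the robotsnotwanted item above asks for, assuming a plain-text policy file with one disallowed path prefix per line. The file name robotsp.txt and the layout follow the proposal under discussion, not any settled standard.]

#!/usr/bin/perl
# Sketch of a robot-exclusion check. The file name "robotsp.txt" and the
# one-prefix-per-line layout are hypothetical; the real format was still
# under discussion on this list.
use strict;
use warnings;

# Load disallowed path prefixes, skipping blanks and '#' comments.
sub load_policy {
    my ($path) = @_;
    my @prefixes;
    open my $fh, '<', $path or return ();   # no file means no restrictions
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/#.*//;                   # strip comments
        $line =~ s/^\s+|\s+$//g;            # trim whitespace
        push @prefixes, $line if length $line;
    }
    close $fh;
    return @prefixes;
}

# A URL path is allowed unless it starts with a disallowed prefix.
sub allowed {
    my ($url_path, @prefixes) = @_;
    for my $p (@prefixes) {
        return 0 if index($url_path, $p) == 0;
    }
    return 1;
}

my @policy = load_policy('robotsp.txt');
for my $path ('/docs/intro.html', '/cgi-bin/infinite-tree') {
    printf "%s: %s\n", $path, allowed($path, @policy) ? 'fetch' : 'skip';
}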
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:06:46 1994
Return-Path: 
Delivery-Date: Mon, 6 Jun 1994 10:07:47 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:06:46 +0100
Date: Mon, 6 Jun 1994 10:06:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:136200:940606090649]
Content-Identifier: Inverse Maps
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:06:46 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060905.AA22918@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Inverse Maps
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 770

I am currently working on W4v3.0, and one of the features I have implemented so far is some inverse mapping features. It's yielded some interesting results. Not surprisingly, the most pointed-to sites in the documents examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu. Other highly pointed-to sites include nearnet.gnn.com (:-), www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.

For the initial portion of the implementation, I am only constructing interconnectivity within sites. That is, I keep track of what documents point to site FOO, not what documents point to what documents. Any ideas on a reasonable implementation of the latter? Has anyone else done such interconnectivity mapping?

...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:36:42 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:48:42 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:36:42 +0100
Date: Mon, 6 Jun 1994 10:36:42 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140740:940606093644]
Content-Identifier: re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:36:42 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross
Message-ID: <9406061033.aa01127@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk, mkgray@MIT.EDU
Subject: re: Inverse Maps
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 1799

mkgray@MIT.EDU writes ...

> I am currently working on W4v3.0, and one of the features I have implemented
> so far is some inverse mapping features. It's yielded some interesting
> results. Not surprisingly, the most pointed-to sites in the documents
> examined in a preliminary run were info.cern.ch and www.ncsa.uiuc.edu.
> Other highly pointed-to sites include nearnet.gnn.com (:-),
> www.cis.ohio-state.edu, www.cs.cmu.edu, gopher.vt.edu, and sunsite.unc.edu.
> For the initial portion of the implementation, I am only constructing
> interconnectivity within sites. That is, I keep track of what documents
> point to site FOO, not what documents point to what documents. Any ideas
> on a reasonable implementation of the latter?

One idea I was playing with when I was working on websnarf 2 (which is currently on the shelf) was the idea of using a whacking great .dbm file to store either entire HTML files, indexed on their URL, or a list of URLs extracted from such files. (I ran into a problem in that the standard dbm and Berkeley dbm libraries have a maximum record size of 1024 or 2096 bytes respectively; GNU dbm apparently doesn't have this restriction, but I didn't have time to rebuild my version of Perl with a new library.)

Anyway, the idea is that keeping such a database would reduce the problem of cross-referencing large webs; simply read a record, and for each URL in the record (which contains a list) do a lookup on the database. (The output could then be turned into input for a graph-generating program like AT&T's NEATO.)

-- Charlie
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+
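[A small Perl sketch of the cross-referencing Charlie describes, assuming forward records of URL -> space-separated outgoing URLs in a dbm file; the file name and example URLs are made up.]

#!/usr/bin/perl
# Sketch of the scheme above: a dbm file mapping each URL to the list of
# URLs found in that document, inverted into a referenced-by map. Note
# the record-size limits Charlie mentions: classic (s)dbm caps a record
# at ~1K, so GNU dbm is the safer choice for real pages.
use strict;
use warnings;

my %links;
dbmopen(%links, 'links', 0666) or die "can't open links dbm: $!";

# Example forward records: URL -> space-separated list of outgoing URLs.
$links{'http://a.example/'} = 'http://b.example/ http://c.example/';
$links{'http://b.example/'} = 'http://c.example/';

# Invert: for each (page, outgoing URL) pair, record page as a referrer.
my %referrers;
while (my ($page, $out) = each %links) {
    push @{ $referrers{$_} }, $page for split ' ', $out;
}

# Report the most-referenced URLs -- Matthew's "league table".
for my $url (sort { @{$referrers{$b}} <=> @{$referrers{$a}} } keys %referrers) {
    printf "%4d %s\n", scalar @{ $referrers{$url} }, $url;
}
dbmclose(%links);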
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 18:28:28 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:46:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 18:28:28 +0100
Date: Mon, 6 Jun 1994 18:28:28 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:203320:940606172830]
Content-Identifier: Re: Inverse M...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 18:28:28 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton
Message-ID: <9406061727.AA09398@biotech.washington.edu>
To: mkgray@MIT.EDU
Cc: /CN=robots/@nexor.co.uk
Subject: Re: Inverse Maps
Original-Received: by NeXT.Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Original-Received: by NeXT Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 403

I've done some inverse mapping with the WebCrawler, but not to any great extent. Right now, I just generate the "Top 25" list -- a list of the 25 most frequently referenced sites on the Web (at least, based on the WebCrawler's limited experience). This turns out to work pretty well -- you can see the (predictable) results at http://www.biotech.washington.edu/WebCrawler/Top25.html.

bri

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:14:53 1994
Return-Path: 
Delivery-Date: Mon, 6 Jun 1994 10:15:25 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:14:53 +0100
Date: Mon, 6 Jun 1994 10:14:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:137830:940606091454]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:14:53 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060914.AA22927@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Avoidance Algorithms
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1101

One of the features that I implemented in W4v1.0 was an avoidance algorithm I called 'boredom'. First a brief implementation profile of W4v1.0:

W4v1.0 was written in June of 1993 as a simple depth first search that kept its entire database of where it had been in memory and dumped it to disk when it had exhausted a document tree. Very simple.

So, one issue I was concerned about was infinite trees (this is a bad thing with depth first searches :-), so I added a feature to the Wanderer that allowed it to 'get bored'. Specifically, if it retrieved more than N documents with the same path (except for the last element) and a few other heuristics, it bailed out and found something more interesting to do. For the most part this was very successful.

W4v2.0 was a modification to do breadth first searching, and in that revision 'boredom' got removed, as it was not as useful to the algorithm. I am planning on reimplementing a more advanced version of 'boredom' in W4v3.0, partially based on content parsing.

Suggestions? Comments? Other implementations to avoid large trees?

...Matthew
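[A minimal Perl sketch of the 'boredom' heuristic as Matthew describes it. The limit N and the URLs are invented, and his "few other heuristics" are omitted.]

#!/usr/bin/perl
# Sketch of 'boredom': give up on a branch after fetching more than
# $LIMIT documents whose URLs share the same path minus the last element.
use strict;
use warnings;

my $LIMIT = 25;                              # hypothetical value of N
my %seen_per_branch;

# Returns true when this URL's branch has become boring.
sub bored {
    my ($url) = @_;
    (my $branch = $url) =~ s{/[^/]*$}{};     # drop the last path element
    return ++$seen_per_branch{$branch} > $LIMIT;
}

while (my $url = <DATA>) {                   # stand-in for the crawl frontier
    chomp $url;
    next if bored($url);                     # bail out of an infinite-looking tree
    print "fetch $url\n";
}
__DATA__
http://host.example/docs/a.html
http://host.example/docs/b.html
http://host.example/cgi-bin/tree/1/2/3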
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:26:27 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:48:21 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:26:27 +0100
Date: Mon, 6 Jun 1994 10:26:27 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:138740:940606092628]
Content-Identifier: Database/memo...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:26:27 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060925.AA22934@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Database/memory implementation
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 1128

How have people in general implemented the DB? By the database (DB) I mean the robot's record of where it has been, not necessarily anything it constructs for later consumption.

W4v1.0 implemented a completely in-memory DB. This worked fine when there were 100 sites on the web. It doesn't work any more :-) Plus if the Wanderer crashed, it wouldn't always successfully dump its DB.

W4v2.0 implemented a disk-based DB, which has a number of advantages:
1) It can get as big as it wants and not kill the machine
2) It saves state, so arbitrary crashes don't lose any substantial data

On the other hand, it is somewhat slower, though most of the time is spent waiting for HTTP responses. Currently, it maintains one record of where it has been ('log') and another record of where it plans on going ('dq'), and another set of analogous in-memory lists which regularly get flushed to disk.

Any other more novel implementations out there? I've given a passing thought to trying a hierarchical DB, but I'm not sure it would be useful. Any ideas on how to make an in-memory DB smaller? Or a disk DB faster?

...Matthew

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 10:34:35 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:48:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 10:34:35 +0100
Date: Mon, 6 Jun 1994 10:34:35 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140130:940606093437]
Content-Identifier: Server list
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 10:34:35 +0100;
Alternate-Recipient: Allowed
From: mkgray@MIT.EDU
Message-ID: <9406060934.AA22943@deathtongue.MIT.EDU>
To: /CN=robots/@nexor.co.uk
Subject: Server list
X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html
Status: RO
Content-Length: 977

Once I get W4v3.0 finished, I intend to add a number of the modifications mentioned by Martijn in his initial letter (DNS identification of identical servers, bogus servers eliminated, etc.) Additionally, I would welcome any other lists of servers. I can merge such lists with the comprehensive list. I will continue to maintain the "Comprehensive List of WWW Sites", so anything to make this as up-to-date and accurate as possible would be great. Suggestions on other useful techniques for sorting the comprehensive list would be great too. If you don't know what I'm talking about, or have lost the URL:

http://www.mit.edu:8001/people/mkgray/compre.bydomain.html

So, please do send me any site lists. No desperate need to cross-check with my list, I can do that. Of course, if you want to, that just makes my life easier.

...Matthew

BTW, I'm sending all these messages out separately to keep the topic threads vaguely separate, in case that wasn't apparent.
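[A rough Perl sketch of the disk-backed 'log'/'dq' scheme Matthew describes two messages up, under the assumption that both are plain append-only files replayed on restart; W4's actual on-disk format isn't described.]

#!/usr/bin/perl
# Sketch: an append-only 'log' of URLs already visited and a 'dq' of
# URLs still to visit, so a crash loses at most the unflushed tail.
use strict;
use warnings;
use IO::Handle;

open my $log, '>>', 'log' or die "log: $!";
open my $dq,  '>>', 'dq'  or die "dq: $!";
$log->autoflush(1);        # flush each line as it is written
$dq->autoflush(1);

# On restart, rebuild the visited set by replaying the log.
my %visited;
if (open my $in, '<', 'log') {
    chomp(my @done = <$in>);
    @visited{@done} = ();
    close $in;
}

sub enqueue {
    my ($url) = @_;
    print {$dq} "$url\n" unless exists $visited{$url};
}

sub mark_visited {
    my ($url) = @_;
    $visited{$url} = 1;
    print {$log} "$url\n";   # survives arbitrary crashes
}

enqueue('http://host.example/');       # hypothetical usage
mark_visited('http://host.example/');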
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 11:26:46 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:47:04 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 11:26:46 +0100
Date: Mon, 6 Jun 1994 11:26:46 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:145260:940606102648]
Content-Identifier: Avoidance Alg...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:26:46 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross
Message-ID: <9406061120.aa01408@ruddles.sco.com>
To: mkgray@MIT.EDU, /CN=robots/@nexor.co.uk
Subject: Avoidance Algorithms
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 3168

mkgray@MIT.EDU writes ...

> One of the features that I implemented in W4v1.0 was an avoidance algorithm
> I called 'boredom'. First a brief implementation profile of W4v1.0:
> W4v1.0 was written in June of 1993 as a simple depth first search that kept
> its entire database of where it had been in memory and dumped it to disk when
> it had exhausted a document tree. Very simple.
:
> W4v2.0 was a modification to do breadth first searching, and in that revision
> 'boredom' got removed, as it was not as useful to the algorithm. I am planning
> on reimplementing a more advanced version of 'boredom' in W4v3.0, partially
> based on content parsing.
> Suggestions? Comments? Other implementations to avoid large trees?

Well, my first cut at websnarf was a recursive depth-first probe. This rapidly ran away into the web, and as my bandwidth is limited to my share of a 64K line this seemed like a bad idea. It also had a tendency to dump core due to stack frame overflows.

I went to the bookshelf and was most interested to read the chapter on graph searching in Sedgewick (Algorithms, 2nd edn, can't remember the year). It turns out that you can use a stack to emulate a recursive depth-first traversal, and a queue to emulate a recursive breadth-first traversal, both without the need for recursion. Perl provides a handy data structure -- the list -- and calls to use a given list as either a queue or a stack. I modified websnarf so that it could do both breadth- and depth-first traversals (with the switchover being handled in a small subroutine that decided whether to push or shift URLs onto the list, and pop or unshift them off the list).

Because my nonrecursive implementation created a list of URLs representing the current state of your tree walk, it was then relatively trivial to scan along the list. If you see two occurrences of the same URL in the list, you know there's some danger of getting into a loop, and you can just prune one of them out of the list.

It's also useful to store the "depth" (i.e. number of links away from home) along with each URL in the list. If two pointers to the same URL occur at the same depth, the odds are that they're fairly safe -- just time-consuming. But if one is above the other, there's the possibility of some kind of weird loop occurring.

Finally, one thing I'd do immediately if I was working on websnarf right now [*] would be to ensure that it avoids any URLs that look like search commands or internal document pointers. That way lies madness ...

-- Charlie

[*] websnarf is a personal effort, not a company-sanctioned project. It's on the shelf due to lack of spare time at work. I'm hoping (fingers firmly crossed) to get a grant for next year to do the job professionally, i.e. to spend all my time on it, not just a couple of hours a week. Meantime, I'm spending my time thinking about how to get right all the things I got wrong the first time round ...
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+
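[A short Perl sketch of the non-recursive walk Charlie describes: one list driven as a queue (push/shift) for breadth-first or as a stack (push/pop) for depth-first, with the depth carried alongside each URL and duplicates pruned. extract_urls() is a stub standing in for real fetching and parsing; the start URL is made up.]

#!/usr/bin/perl
# One list, two traversal orders: shift from the front for breadth-first,
# pop from the back for depth-first.
use strict;
use warnings;

my $MODE = 'breadth';                        # or 'depth'
my @list = (['http://start.example/', 0]);   # [url, links away from home]
my %queued;                                  # prune duplicate URLs

sub extract_urls { return () }               # stub: fetch + parse here

while (@list) {
    my $next = $MODE eq 'breadth' ? shift @list : pop @list;
    my ($url, $depth) = @$next;
    print "visit $url (depth $depth)\n";
    for my $found (extract_urls($url)) {
        next if $queued{$found}++;           # same URL twice risks a loop
        push @list, [$found, $depth + 1];
    }
}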
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 11:52:10 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:47:52 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 11:52:10 +0100
Date: Mon, 6 Jun 1994 11:52:10 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:148030:940606105212]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 11:52:10 +0100;
Alternate-Recipient: Allowed
From: Lee McLoughlin
Message-ID: <"swan.doc.i.352:06.05.94.10.50.10"@doc.ic.ac.uk>
To: Charlie Stross , mkgray@MIT.EDU, /CN=robots/@nexor.co.uk
In-Reply-To: 
Subject: Re: Avoidance Algorithms
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
Status: RO
Content-Length: 747

My scanning routine was the usual depth-first search. But this meant that certain sites were scanned before others. Apart from causing sites way down the list to be left out, it also meant that one site would get "soaked". In the end I went for random scanning of all stored URLs, looking for a URL and site that I hadn't gone to recently.

My system is also written in perl, except that I store all the retrieved data in dbm files. Since I have URL and site timers I can now also run multiple scanning processes at the same time.

--
-- Lee McLoughlin. Phone: +44 71 589 5111 X 5085
Dept of Computing, Imperial College, Fax: +44 71 581 8024
180 Queens Gate, London, SW7 2BZ, UK. Email: L.McLoughlin@doc.ic.ac.uk
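[A Perl sketch of the random scan with URL and site timers Lee mentions; the wait interval and data structures are guesses, not his implementation, and his version keeps the stores in dbm files rather than plain hashes.]

#!/usr/bin/perl
# Pick stored URLs at random, skipping any already fetched and any whose
# site was visited too recently.
use strict;
use warnings;
use List::Util qw(shuffle);

my $SITE_WAIT = 60;          # seconds between hits on one site (a guess)
my %url_last;                # URL  -> last visit time
my %site_last;               # site -> last visit time
my @urls = ('http://a.example/x.html', 'http://b.example/y.html');

sub pick_url {
    my $now = time;
    # Examine candidates in random order; take the first "rested" one.
    for my $url (shuffle @urls) {
        my ($site) = $url =~ m{^\w+://([^/]+)};
        next if defined $url_last{$url};                   # been there
        next if ($site_last{$site} // 0) > $now - $SITE_WAIT;
        $url_last{$url} = $site_last{$site} = $now;
        return $url;
    }
    return;                  # nothing eligible right now
}

print pick_url() // 'nothing to do', "\n";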
From /CN=robots-errors/@nexor.co.uk Wed Jun 8 10:09:49 1994
Replied: Wed, 08 Jun 1994 11:02:20 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: " (Paul De Bra)"
Replied: nlc@cs.nott.ac.uk
Return-Path: 
Delivery-Date: Wed, 8 Jun 1994 10:11:38 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 8 Jun 1994 10:09:49 +0100
Date: Wed, 8 Jun 1994 10:09:49 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:090430:940608090957]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 10:09:49 +0100;
Alternate-Recipient: Allowed
From: " (Paul De Bra)"
Message-ID: <199406080910.JAA11791@pcpaul.info.win.tue.nl>
To: /CN=robots/@nexor.co.uk
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 857

Strange to hear that the strategy in W4 changed from depth-first in 1.0 to breadth-first in 2.0. The experiments we ran with the fish-search, both real and simulated, all showed that depth-first is a better navigation algorithm than breadth-first.

We also have boredom, set to 1, meaning that we never retrieve the same url twice. Another thing the fish-search does is to try not to load url's from the same host in succession. (It searches among the first 30 or so url's in its list to find one from another host.)

A feature still missing in the fish-search, which I would like to hear about from others, is the use of ISMAP's. My idea is to try a limited selection of coordinates first, and put a larger selection further back in the "queue" of url's to be tried.

How do other robots find out which url's can be reached by clicking in an ismap?

Paul.

From /CN=robots-errors/@nexor.co.uk Wed Jun 8 11:02:39 1994
Return-Path: 
Delivery-Date: Wed, 8 Jun 1994 11:03:19 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 8 Jun 1994 11:02:39 +0100
Date: Wed, 8 Jun 1994 11:02:39 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:096060:940608100242]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 11:02:39 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"9595 Wed Jun 8 11:02:26 1994"@nexor.co.uk>
To: " (Paul De Bra)"
Cc: /CN=robots/@nexor.co.uk, nlc@computer-science.nottingham.ac.uk
In-Reply-To: <199406080910.JAA11791@pcpaul.info.win.tue.nl>
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 1994

> Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> breadth-first in 2.0.

Not really; most Web server URL spaces are structured hierarchically, growing more specific towards the leaves. So if you start from a server root and you do a breadth-first search for a limited number of documents, you'll get a broader (and therefore, for the purposes of general indexing, better) overview than if you do a depth-first search for a limited number of documents, which can shoot off down one specific area (especially in deep trees). If you use maximum-depth rather than maximum-documents it shouldn't matter much (depending on the structure of the data).

> The experiments we ran with the fish-search, both real
> and simulated, all showed that depth-first is a better navigation algorithm
> than breadth-first.

Can you elaborate on how they were better?

> A feature still missing in the fish-search, which I would like to hear about
> from others, is the use of ISMAP's.
> My idea is to try a limited selection of coordinates first, and put a larger
> selection further back in the "queue" of url's to be tried.
>
> How do other robots find out which url's can be reached by clicking in an
> ismap?

Dave Raggett would refer you to the HTML 3.0 facilities for specifying links on figures within the figure element. I think that is the only way; there are an infinite number of coordinates in an ISMAP, you don't know where they are, and you can't check two locations for equivalence.

Incidentally, I reckon that it is bad HTML if you provide an ismap as sole access to a small set of URL's.

A friend of mine is working on a "click on this festival map to show where you are going to be" service. I'd hate to think what a random ISMAP coordinate-trying robot would do to that. ;-)
-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From debra@info.win.tue.nl Wed Jun 8 12:50:17 1994
Replied: Wed, 08 Jun 1994 13:45:38 +0100
Replied: robots
Replied: debra@info.win.tue.nl (Paul De Bra)
Return-Path: 
Delivery-Date: Wed, 8 Jun 1994 12:50:28 +0100
Received: from svin04.info.win.tue.nl by lancaster.nexor.co.uk with SMTP (XTPP); Wed, 8 Jun 1994 12:50:17 +0100
Received: from pcpaul.info.win.tue.nl by svin04.info.win.tue.nl (8.6.8/1.45) id NAA28716; Wed, 8 Jun 1994 13:50:05 +0200
Received: from localhost by pcpaul.info.win.tue.nl (8.6.4/1.60) id LAA20290; Wed, 8 Jun 1994 11:52:19 GMT
From: debra@info.win.tue.nl (Paul De Bra)
Message-Id: <199406081152.LAA20290@pcpaul.info.win.tue.nl>
Subject: Re: Avoidance Algorithms
To: m.koster@nexor.co.uk (Martijn Koster)
Date: Wed, 8 Jun 1994 13:52:17 +0200 (MET DST)
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199406081002.MAA06363@svin02.info.win.tue.nl> from "Martijn Koster" at Jun 8, 94 11:02:11 am
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: RO
Content-Length: 2835

> > Strange to hear that the strategy in W4 changed from depth-first in 1.0 to
> > breadth-first in 2.0.
>
> Not really; most Web server URL spaces are structured hierarchically,
> growing more specific towards the leaves. So if you start from a
> server root and you do a breadth-first search for a limited number of
> documents, you'll get a broader (and therefore, for the purposes of
> general indexing, better) overview than if you do a depth-first search
> for a limited number of documents, which can shoot off down one
> specific area (especially in deep trees).

I guess our algorithm avoids shooting down into a specific area by trying to find links to other sites first. When you have limited search time you often don't get anywhere with breadth-first navigation, because you don't reach documents that are deep enough to deal with a specific topic. A robot that spends a *lot* of time, visiting very many documents, could work equally well using breadth-first search.

> If you use maximum-depth rather than maximum-documents it shouldn't
> matter much (depending on the structure of the data).

We do use maximum-depth to avoid going too far in a non-relevant direction.

> > The experiments we ran with the fish-search, both real
> > and simulated, all showed that depth-first is a better navigation algorithm
> > than breadth-first.
>
> Can you elaborate on how they were better?

They found more cross-reference links. Which need not mean anything, but considering that cross-reference links may (in the web) be links leading to different sites, this suggests a better chance of penetrating into a larger part of the web. Again, the fact that we search for a limited time is important here.

> ...
> > ismap?
>
> Dave Raggett would refer you to the HTML 3.0 facilities for specifying
> links on figures within the figure element. I think that is the only
> way; there are an infinite number of coordinates in an ISMAP, you
> don't know where they are, and you can't check two locations for
> equivalence.

Nice to hear something will be coming along. There are a finite number of coordinates in an ISMAP, but the number is large. We would never consider trying all possible coordinates, which isn't necessary in any ismap I know.
> Incidentally, I reckon that it is bad HTML if you provide an ismap as
> sole access to a small set of URL's.

Dunno. The course on hypertext which I have on line does it... and databases that deal with mostly graphical information, providing ismaps to zoom in on things and providing information, would only have access through ismaps as well.

> A friend of mine is working on a "click on this festival map to show
> where you are going to be" service. I'd hate to think what a random
> ISMAP coordinate-trying robot would do to that. ;-)

We'll work on it and see how it performs.

From /CN=robots-errors/@nexor.co.uk Wed Jun 8 13:46:24 1994
Return-Path: 
Delivery-Date: Wed, 8 Jun 1994 13:47:20 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 8 Jun 1994 13:46:24 +0100
Date: Wed, 8 Jun 1994 13:46:24 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:119080:940608124627]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 8 Jun 1994 13:46:24 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"11890 Wed Jun 8 13:45:48 1994"@nexor.co.uk>
To: " (Paul De Bra)"
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <199406081152.LAA20290@pcpaul.info.win.tue.nl>
Subject: Re: Avoidance Algorithms
Status: RO
Content-Length: 1998

> There are a finite number of coordinates in an ISMAP,

I was under the impression you could provide fractional coordinates, which would make the theoretical address space infinite, but I'll settle for large :-)

> but the number is large. We would never consider trying all possible
> coordinates, which isn't necessary in any ismap I know.

Sure, my point was that you don't know which to try, or which map onto the "same" document, if that concept applies (think about a click-on-the-world-map-to-get-lat-long ismap server).

> > Incidentally, I reckon that it is bad HTML if you provide an ismap as
> > sole access to a small set of URL's.
>
> Dunno. The course on hypertext which I have on line does it...
> and databases that deal with mostly graphical information, providing ismaps
> to zoom in on things and providing information, would only have access through
> ismaps as well.

And they all rule out non-graphical displays :-( It isn't always possible/useful to provide textual links in addition, but it is quite often.

> > A friend of mine is working on a "click on this festival map to show
> > where you are going to be" service. I'd hate to think what a random
> > ISMAP coordinate-trying robot would do to that. ;-)
>
> We'll work on it and see how it performs.

It's not the performance I'm worried about in this example, but the fact that there is a semantic associated with this action. Imagine this cool hypothetical ismap server that uses a slider to register your appreciation of a page, or even a graphical green(yes), red(no) voting card. If a robot tries some random clicks in these ismaps this could have a nasty effect. Of course all links in the Web have this danger, but especially with graphical user-interface things like maps and forms you expect to be interfacing with a user...
-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html

From /CN=robots-errors/@nexor.co.uk Thu Jun 9 16:06:34 1994
Return-Path: 
Delivery-Date: Thu, 9 Jun 1994 16:13:31 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 9 Jun 1994 16:06:34 +0100
Date: Thu, 9 Jun 1994 16:06:34 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:282050:940609150638]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 9 Jun 1994 16:06:34 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross
Message-ID: <9406091059.aa02900@ruddles.sco.com>
To: debra@info.win.tue.nl, m.koster@nexor.co.uk
Cc: /CN=robots/@nexor.co.uk
Subject: Re: Avoidance Algorithms
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 2508

(Paul De Bra) writes:

> > > The experiments we ran with the fish-search, both real
> > > and simulated, all showed that depth-first is a better navigation algorithm
> > > than breadth-first.
> >
> > Can you elaborate on how they were better?
>
> They found more cross-reference links. Which need not mean anything, but
> considering that cross-reference links may (in the web) be links leading to
> different sites, this suggests a better chance of penetrating into a larger
> part of the web. Again, the fact that we search for a limited time is
> important here.

A minor point may be of interest to you: here at SCO we distinguish between "navigation nodes" and "information nodes" in our formally-constructed web pages. An information node may be linked to other nodes, but its primary function is to store text; in general it has a link to and items for linear browsing. A navigation node, on the other hand, has scads of URLs pointing both to information nodes and to other navigation nodes.

This seems to be a fairly common distinction between web pages, as it naturally falls out of most methods of structuring information in hypertext; and it provides a clue for ways to avoid flooding servers while doing a depth-first search. When grabbing a page and searching it for URLs, count the URLs on the page; if there're three or less, the page is probably an information node and the pointers probably point to adjacent nodes in the same document, while four or more URLs suggest a navigation node (with URLs that could point anywhere on the net). Put URLs from information nodes on one stack, and URLs from navigation nodes on another stack. When selecting the next URL to explore, alternate between the two stacks.

Alternatively, maintain a list of local stacks -- one per server polled -- and work along the list, taking a new URL from each stack in turn, so that no server is ever polled twice in succession.

There are other ways of avoiding flooding a server, but these should be pretty easy to implement and will, most of the time, ensure that a local lookup will be followed by a remote one.

-- Charlie

PS: Anyone else run across:
http://www.biotech.washington.edu/WebCrawler/WebCrawler.html
yet? I'm most impressed ...
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+
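[A Perl sketch of the two-stack selection Charlie outlines; the three-URL threshold comes from his message, everything else (names, example pages) is invented.]

#!/usr/bin/perl
# Pages with few URLs are treated as information nodes, link-heavy pages
# as navigation nodes, and the next URL to explore alternates between the
# two stacks so local and remote lookups interleave.
use strict;
use warnings;

my (@info_stack, @nav_stack);
my $turn = 0;

# Classify a fetched page by its URL count and stack its links.
sub file_links {
    my (@urls) = @_;
    my $stack = @urls <= 3 ? \@info_stack : \@nav_stack;
    push @$stack, @urls;
}

# Alternate between the stacks; fall back to whichever is non-empty.
sub next_url {
    my @order = $turn++ % 2 ? (\@nav_stack, \@info_stack)
                            : (\@info_stack, \@nav_stack);
    for my $stack (@order) {
        return pop @$stack if @$stack;
    }
    return;
}

file_links('http://a.example/next.html');                  # information node
file_links(map { "http://hub.example/$_.html" } 1 .. 6);   # navigation node
while (defined(my $url = next_url())) { print "explore $url\n" }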
From /CN=robots-errors/@nexor.co.uk Thu Jun 9 18:27:13 1994
Return-Path: 
Delivery-Date: Thu, 9 Jun 1994 18:27:57 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 9 Jun 1994 18:27:13 +0100
Date: Thu, 9 Jun 1994 18:27:13 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019820:940609172715]
Content-Identifier: Re: Avoidance...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 9 Jun 1994 18:27:13 +0100;
Alternate-Recipient: Allowed
From: Brian Pinkerton
Message-ID: <9406091726.AA02507@biotech.washington.edu>
To: Charlie Stross
Cc: /CN=robots/@nexor.co.uk
Subject: Re: Avoidance Algorithms
Original-Received: by NeXT.Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Original-Received: by NeXT Mailer (1.100)
PP-warning: Illegal Received field on preceding line
Status: RO
Content-Length: 1024

I like the idea of distinguishing among nodes based on the number of links they have. For the WebCrawler, this would be a good way to reduce the number of nodes that need to be considered when deciding which nodes to visit next.

When it's running in breadth-first mode (and generating an index), the WebCrawler doesn't do any kind of avoidance -- it just visits each server in succession, giving priority to servers that have never been visited before. When the number of known servers is bigger than 100 or so, there's no chance the WebCrawler will get back to a server before a reasonable amount of time has passed.

In its "directed searching" mode, the WebCrawler will avoid a server for some period of time after visiting it once, because there's a good chance its search criteria will want to grab another document from that server. Right now, I think I set that time period to 60 seconds, which roughly corresponds to my intuition of how fast a human would do the same operation.

bri
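[For illustration, a Perl sketch of the policy Brian describes: never-visited servers get priority, and otherwise a server is avoided for a fixed interval after a visit. The helper names are hypothetical, not WebCrawler's.]

#!/usr/bin/perl
use strict;
use warnings;

my $AVOID_FOR = 60;                 # seconds; the 60-second rule above
my %last_hit;                       # host -> time of last request

sub choose_host {
    my (@candidates) = @_;
    my @fresh = grep { !exists $last_hit{$_} } @candidates;
    return $fresh[0] if @fresh;     # never-visited servers go first
    my ($best) = sort { $last_hit{$a} <=> $last_hit{$b} } @candidates;
    return time - $last_hit{$best} >= $AVOID_FOR ? $best : undef;
}

my @hosts = qw(mit.edu cern.ch mit.edu);
while (defined(my $h = choose_host(@hosts))) {
    $last_hit{$h} = time;
    print "visit $h\n";            # stops once every host is "resting"
}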
From /CN=robots-errors/@nexor.co.uk Mon Jun 6 12:01:53 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:48:03 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 12:01:53 +0100
Date: Mon, 6 Jun 1994 12:01:53 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:150550:940606110159]
Content-Identifier: Best environm...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 12:01:53 +0100;
Alternate-Recipient: Allowed
From: Charlie Stross
Message-ID: <9406061158.aa01564@ruddles.sco.com>
To: /CN=robots/@nexor.co.uk
Subject: Best environment for knowbot development?
X-Mailer: SCO Portfolio 2.0
Status: RO
Content-Length: 1029

I notice that an awful lot of knowbots seem to be being developed in Perl. There's at least one written in Python, probably a couple in C ... but there are Perl applications crawling all over the place (or so it feels at times!).

Off the cuff, I'd attribute this to the rich string-manipulation features and accessible TCP/IP sockets provided by Perl, along with its fairly high execution speed (for an interpreted language). On the other hand, Perl is complex and syntactically dense -- a nightmare for non-UNIXheads.

Has anyone given any serious thought to the optimal development environment for knowbots? Apart from Perl, are there any languages/platforms you'd consider to be specially suitable for developing knowbots? And if so, what are their salient characteristics?

-- Charlie
--------------------------------------------------------------------------------
Charlie Stross is charless@sco.com, SCO Technical Publications
GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+

From /CN=robots-errors/@nexor.co.uk Mon Jun 6 14:31:59 1994
Replied: Tue, 07 Jun 1994 11:54:16 +0100
Replied: /CN=robots/@nexor.co.uk
Replied: Michael.Mauldin@NL.CS.CMU.EDU
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 09:30:07 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 6 Jun 1994 14:31:59 +0100
Date: Mon, 6 Jun 1994 14:31:59 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:174170:940606133201]
Content-Identifier: Description o...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 6 Jun 1994 14:31:59 +0100;
Alternate-Recipient: Allowed
From: Michael.Mauldin@NL.CS.CMU.EDU
Message-ID: <"17413 Mon Jun 6 14:31:48 1994"@nexor.co.uk>
To: /CN=robots/@nexor.co.uk
Subject: Description of the Lycos WW searcher at CMU
Status: RO
Content-Length: 5944

The Lycos project at Carnegie Mellon is in its early stages: we have a Web explorer in operation, and our indexer will come on-line later this month. We will use the SCOUT indexer, which has an HTTP gateway (a sample database of the Tipster corpus from the Wall Street Journal is available intermittently from http://fuzine.mt.cs.cmu.edu/scout/home.html).

Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs. It uses a random search, and keeps its record of URLs visited in a Perl assoc list stored in DBM (thanks to Charlie Stross for the tip that GNU DBM doesn't have arbitrary limits!). It searches HTTP, FTP, and GOPHER sites, ignoring TELNET, MAILTO, and WAIS.

Lycos uses a data reduction scheme to reduce the stored information about each document:

    Title
    Headings and subheadings
    100 most "weighty" words (using Tf*IDf, term freq / inverse doc freq)
    First 20 lines
    Size in bytes
    Number of words

Lycos keeps a word frequency count as it runs... it has read over 25 million words. A list of the most frequent words found after searching 6.3 million words is available off the Lycos home page.

So far, Lycos has run for less than a month:

    URLs found:    313,468
    URLs fetched:   41,391 (35,382 successful)
    HTTP servers:    3,138

Citation counting (number of "parents" by URL): this is the first 50 URLs sorted by number of documents that reference that URL. What I did not do was count only references from different sites; I'm sure that 99% of the refs to http://gdbwww.gdb.org/omim come from the Genome Database server itself.
------------------------------------------------------------------------
1703 http://gdbwww.gdb.org/omim/
1578 http://cossack.cosmic.uga.edu/keywords.html
 692 ftp://ftp.network.com/IPSEC/rfcindex4.html
 421 ftp://ftp.network.com/IPSEC/rfcindex3.html
 322 ftp://ftp.network.com/IPSEC/rfcauthor.html
 319 ftp://ftp.network.com/IPSEC/rfcindex5.html
 234 ftp://ftp.network.com/IPSEC/rfcindex2.html
 202 ftp://ftp.network.com/IPSEC/rfcindex1.html
 177 http://info.cern.ch/hypertext/WWW/TheProject.html
 166 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
 135 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
 133 http://www.cs.columbia.edu/~radev/
 133 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
 118 http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
 108 http://www.mcs.anl.gov/home/gropp/
 107 http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
 105 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
 101 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
  85 http://cui_www.unige.ch/w3catalog
  84 http://wings.buffalo.edu/world
  82 http://sass577.endo.sandia.gov/SEACAS/CUBIT/Developers/
  80 http://cui_www.unige.ch/OSG/MultimediaInfo/mmsurvey/
  79 http://www.nta.no/telektronikk/4.93.dir/
  76 http://asp.esam.nwu.edu/chris/dce_prodlist.html
  76 http://hypatia.gsfc.nasa.gov/NASA_homepage.html
  76 http://info.cern.ch/hypertext/DataSources/WWW/Servers.html
  75 http://www.ncsa.uiuc.edu/demoweb/demo.html
  75 http://www.rtd.com/people/rawn/
  74 ftp://ftp.network.com/IPSEC/rfcindex0.html
  74 http://tns-www.lcs.mit.edu/cgi-bin/value-added/sports/register.sos.texas.gov/texreg/
  73 http://rs560.cl.msu.edu/weather/getmegif.html
  71 http://rs560.cl.msu.edu/weather/interactive.html
  70 http://rs560.cl.msu.edu/weather/textindex.html
  70 http://rs560.cl.msu.edu/~henrich/
  70 http://www.seas.upenn.edu/~mengwong/
  68 http://info.cern.ch/hypertext/DataSources/WWW/Geographical.html
  68 http://rs560.cl.msu.edu/weather/uscmp.gif
  66 http://rs560.cl.msu.edu/weather/uscmp.mpg
  66 http://www.cso.uiuc.edu/~kline/cvk.html
  65 ftp://cs.nott.ac.uk/pub/sat-images/
  65 http://rs560.cl.msu.edu/weather/goes7ir.mpg
  65 http://rs560.cl.msu.edu/weather/worldir.mpg
  65 http://www.hmc.edu/~irilyth/diplomacy/
  64 gopher://burrow.cl.msu.edu/00/news/weather/lan
  64 gopher://ssec.wisc.edu
  64 http://rs560.cl.msu.edu/weather/6panel.mpg
  64 http://rs560.cl.msu.edu/weather/d2.jpg
  64 http://rs560.cl.msu.edu/weather/gmsvis.mpg
  63 http://cui_www.unige.ch/meta-index.html
  63 http://rd13doc.cern.ch/public/doc/Rd13StatusReport.html
------------------------------------------------------------------------

The Lycos philosophy is to keep a finite model of the web that enables subsequent searches to proceed more rapidly. The idea is to prune the "tree" of documents and to represent the clipped ends with a summary of the documents found under that node. The 100-most-important-words lists from several documents can be combined to produce a list of the 100 most important words in the set of documents. Alternative fixed representations of documents or document sets include vector models such as those of Dumais at Bellcore and Gallant & Caid at Hecht-Nielsen Corp. The number 100 was chosen arbitrarily, so we will need to investigate to find whether that number is too high or too low.
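[A toy Perl sketch of the Tf*IDf weighting Lycos uses to pick its "weighty" words; the corpus and cut-off are made up (Lycos keeps 100), and Lycos's actual term weighting may differ in detail.]

#!/usr/bin/perl
# Weight each word in a document by its frequency there times the
# log-inverse of how many documents contain it, then keep the top N.
use strict;
use warnings;

my @docs = (
    'the web grows and the web links grow',
    'robots walk the web',
    'perl robots index documents',
);

# Document frequency: in how many documents does each word appear?
my %df;
for my $doc (@docs) {
    my %seen = map { $_ => 1 } split ' ', $doc;
    $df{$_}++ for keys %seen;
}

# Score one document and keep its N weightiest words.
sub weighty_words {
    my ($doc, $n) = @_;
    my %tf;
    $tf{$_}++ for split ' ', $doc;
    my %w = map { $_ => $tf{$_} * log(@docs / $df{$_}) } keys %tf;
    my @top = sort { $w{$b} <=> $w{$a} } keys %w;
    return @top[0 .. ($n - 1 < $#top ? $n - 1 : $#top)];
}

print join(' ', weighty_words($docs[0], 5)), "\n";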
I also subscribe to the dream of a single format and indexing scheme that each server runs on its own data, but given the current state of the community I believe it is premature to settle on a single format. Various information retrieval schemes depend on wildly different kinds of data. We should try out more ideas and evaluate them carefully, and only then should we try to settle on a single format.

Resources: I have agreed to share my code with research and educational users. Should I make a requirement that recipients of the code post to this mailing list so we can keep track of its proliferation? I already have promised code to two people.

I will make lists, statistics, reports, and the index server accessible off the Lycos home page as they become available.

--Michael L. Mauldin
Carnegie Mellon University
Center for Machine Translation
5000 Forbes Avenue
Pittsburgh, PA 15213-3890
fuzzy@cmu.edu

From /CN=robots-errors/@nexor.co.uk Tue Jun 7 11:55:08 1994
Return-Path: 
Delivery-Date: Tue, 7 Jun 1994 11:56:17 +0100
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 7 Jun 1994 11:55:08 +0100
Date: Tue, 7 Jun 1994 11:55:08 +0100
X400-Originator: /CN=robots-errors/@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273170:940607105511]
Content-Identifier: Data formats ...
Priority: Non-Urgent
DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Jun 1994 11:55:08 +0100;
Alternate-Recipient: Allowed
From: Martijn Koster
Message-ID: <"27301 Tue Jun 7 11:54:46 1994"@nexor.co.uk>
To: Michael.Mauldin@NL.CS.CMU.EDU
Cc: /CN=robots/@nexor.co.uk
In-Reply-To: <"17413 Mon Jun 6 14:31:48 1994"@nexor.co.uk>
Subject: Data formats (was Re: Description of the Lycos WW searcher at CMU)
Status: RO
Content-Length: 2295

> I also subscribe to the dream of a single format and indexing scheme
> that each server runs on its own data,

That is one step further than what I proposed; I was talking about a single format for the database the robot uses to store its information locally. This should be achievable before suggesting any Web-wide solution.

> but given the current state of the community I believe it is
> premature to settle on a single format. Various information
> retrieval schemes depend on wildly different kinds of data. We
> should try out more ideas and evaluate them carefully and only then
> should we try to settle on a single format.

Sure, we don't know exactly what is required yet, and identifying all our present and future requirements is probably an impossible task anyway. I believe in a gradual approach.

The fact remains that almost all robots keep the same set of core information: which URL's were visited, which are going to be visited. Probably when URL's were retrieved, what the Last-modified time was, what the headings/keywords are, which URL's are referenced in which documents, etc. It should be possible to extract these common elements from the internal robot database, in a standard format. Maybe the term "data exchange format" is more applicable than "database format".

I'd love to suggest something more concrete myself, but I can't use personal first-hand experience as I don't have a robot. What concrete data formats do people use?

> I have agreed to share my code with research and educational users.
> Should I make a requirement that recipients of the code post to this
> mailing list so we can keep track of its proliferation? I already
> have promised code to two people.
Handing out robot code is exactly what needn't be required if the data gathered by a robot were accessible in a mungeable format. If everybody who'd like to do some analysis of Web data were running their own robot, we'd be wasting a lot of bandwidth.

> I will make lists, statistics, reports, and the index server
> accessible off the Lycos home page as they become available.

Great, keep us posted.

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html
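[One possible shape for the "data exchange format" Martijn asks about: an RFC-822-style record per URL carrying the core fields he lists. A Perl sketch; the field names are invented, nothing here was agreed on the list.]

#!/usr/bin/perl
use strict;
use warnings;

# Write one record: header-like fields, blank line terminates the record.
sub write_record {
    my ($fh, $r) = @_;
    print {$fh} "URL: $r->{url}\n";
    print {$fh} "Visited: $r->{visited}\n";            # time of retrieval
    print {$fh} "Last-Modified: $r->{last_modified}\n"
        if defined $r->{last_modified};
    print {$fh} "Keywords: @{ $r->{keywords} }\n"      if $r->{keywords};
    print {$fh} "References: $_\n" for @{ $r->{references} || [] };
    print {$fh} "\n";
}

write_record(\*STDOUT, {
    url           => 'http://web.nexor.co.uk/mak/mak.html',
    visited       => 'Tue, 7 Jun 1994 11:55:08 +0100',
    last_modified => 'Mon, 6 Jun 1994 09:00:00 +0100',
    keywords      => ['robots', 'WWW'],
    references    => ['http://info.cern.ch/'],
});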
To search and browse the information, any HTML browser can be used (this includes NCSA Mosaic for X-Windows, MS-Windows and Macintosh, Lynx and other browsers. For maximum convenience your browser should support forms, although minimal functionality can be achieved with any browser). Since GlimpseHTTP uses Glimpse, this provides some unique features:
- A very small index (3-5% of the total text).
- Reasonably fast search.
- Search for approximate matches, allowing errors.
In addition, GlimpseHTTP provides you with the following capabilities:
- You can use a combination of browsing and searching: first, you locate the directory where the relevant information is stored, then you can use search to locate specific files.
- The result of the search is a nicely formatted hypertext with hyperlinks to matching documents.
- Following the hyperlink leads you not only to a particular file, but also to the exact place where the match occurred.
- Hyperlinks in the documents are converted on the fly to actual hyperlinks, which you can follow immediately. This makes GlimpseHTTP particularly suitable for searching meta-information (Internet directories etc.).
- Similar tools are provided for archiving and searching USENET newsgroups. You can maintain the archive of news articles and allow people to search your archive using the same interface. Features supported include a kill-file for articles and fast search for particular posters. Since the news archiver uses an NNTP interface, you can archive news articles from remote news servers. (Browsing and searching of news together is yet to be implemented: browsing in this case means selection of the pertinent newsgroup(s); currently only searching within one newsgroup at a time is supported.)
Among the possible applications of GlimpseHTTP we envision:
- FTP sites with search possibilities;
- news archiving sites;
- any search application which should be accessed over a local or global network where searching for approximate matches and/or saving of disk space for indices is an issue.
GlimpseHTTP components:
1. aglimpse - "Archive Glimpse" - a tool for searching file hierarchies indexed for Glimpse. aglimpse is a CGI-compliant program which performs the search and formats the output as an HTML document with hyperlinks to the matches.
2. Administrative tools which facilitate maintaining and indexing of Glimpse archives. One of the programs is the HTML indexer which prepares hypertext indices for each searchable directory - this supports the concept of combined browsing and searching.
3. GlimpseNews - a collection of tools for archiving and searching newsgroup archives.
SEE ALSO
http://glimpse.cs.arizona.edu:1994/glimpsehttp.html - GlimpseHTTP home page.
http://glimpse.cs.arizona.edu:1994 - Glimpse developers home page.
README.install - directions on installing GlimpseHTTP on your server.
README.amgr - description of Archive Manager.
README.indexing - description of HTML indexer.
AUTHORS
Paul Klark (GlimpseHTTP)
Udi Manber, Sun Wu, and Burra Gopal (Glimpse)
University of Arizona, Department of Computer Science
To be put on the glimpse mailing list, send mail to glimpse-request@cs.arizona.edu
-------------------------- CUT HERE ------------------------------
If you're still reading, what I'm thinking is: A URL is effectively a "word"; any given document has a unique URL. Indeed, an inverted-text index of URLs is a fairly sensible way to keep track of large maps of the web, where the same URLs may be replicated frequently.
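(A minimal hypothetical sketch of that indirection in Perl -- ad-hoc names, not code from GlimpseHTTP or from any robot on this list: intern each URL as a small integer once, and let the inverted index carry only those integers.)

#!/usr/bin/perl
# Sketch only: %url_id, @url_by_id and %postings are made-up names;
# get the words of a document however your robot already does.
use strict;
use warnings;

my (%url_id, @url_by_id, %postings);

sub url_to_id {
    my ($url) = @_;
    unless (exists $url_id{$url}) {      # intern the URL once
        push @url_by_id, $url;
        $url_id{$url} = $#url_by_id;     # a small integer stands in for the string
    }
    return $url_id{$url};
}

sub index_document {
    my ($url, @words) = @_;
    my $id = url_to_id($url);
    push @{ $postings{$_} }, $id for @words;   # words point at IDs, not URLs
}

sub lookup {
    my ($word) = @_;
    return map { $url_by_id[$_] } @{ $postings{$word} || [] };
}

index_document('http://web.nexor.co.uk/mak/mak.html', qw(robots mailing list));
print "$_\n" for lookup('robots');

The only point is the indirection: the index carries one small integer per occurrence, while the URL strings live in a single table -- essentially the 24-bit pointer idea developed below.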
GlimpseHTTP is a really convenient alternative to WAIS or Z39.50 retrieval systems for providing access to stored text -- which is what HTML indexers need to do. What's got me interested is this: Imagine a web traversing 'bot that does a breadth-first search of the web. Every time it retrieves an HTML document, it indexes it using glimpse. However, rather than storing a pointer to the disk block where the file resides, as glimpse does when it's indexing text files, the knowbot stores a pointer to the URL of the file. (URLs are stored in a separate table, so that only an index pointer -- say, a 24-bit integer -- is required to represent a URL in the actual text database.) When you do a search for some words, the glimpse system finds the best match, then dereferences the stored URLs to retrieve the documents containing those words, then does an agrep search on them for context. This scheme gives you a good compromise between index size and storage space. A 24-bit integer -- enough to index 16 million URLs -- is of the same order of magnitude as the index pointers that glimpse already provides for pointing to blocks: it would be a bit slower, but the problem is not insuperable. So in return for, say, 50Mb of disk space devoted to an index, you could have a complete inverted-text database referring to 500-1000 Mb of HTML files on the web; these would be available for retrieval with a single URL lookup. Now layer a client-server lookup mechanism like Alibi -- the UberNet system -- on top of the glimpse/knowbot combo, and you have a mechanism for propagating queries between index servers. A properly designed system could answer queries on a huge information domain without doing any off-site lookups, or (if the information is not found locally) forward the query to other servers. The result would be something like Veronica, only with full free-text search capacity over the whole of WebSpace. -- Charlie -------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Tue Jun 7 15:00:04 1994 Return-Path: Delivery-Date: Tue, 7 Jun 1994 15:00:43 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 7 Jun 1994 15:00:04 +0100 Date: Tue, 7 Jun 1994 15:00:04 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294320:940607140005] Content-Identifier: ALIBI release... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Jun 1994 15:00:04 +0100; Alternate-Recipient: Allowed From: Charlie Stross Message-ID: <9406071455.aa14752@ruddles.sco.com> To: /CN=robots/@nexor.co.uk Subject: ALIBI release readme X-Mailer: SCO Portfolio 2.0 Status: RO Content-Length: 10614 Please accept my apologies if you've already read this. It's the readme for Alibi, a new resource retrieval system written at NIST, and released earlier this week; I think it's relevant to this list ... --------------------------- CUT HERE ----------------------------- Alibi/Unetd README (C) 1994 David Flater, dave@case50.ncsl.nist.gov VERSION: BETA 001 Quick Summary ------------- Alibi is a new resource discovery and information retrieval system for the Internet. The acronym stands for Adaptive Location of Internetworked Bases of Information.
Alibi provides a query interface that allows users to retrieve information with keyword queries, without contacting remote servers or navigating. The resource discovery is fully automated, and the retrieval is truly location independent. Using Alibi ----------- The source to the client is called alibi.c. It's in a subdirectory called clients and is also available separately. The client program alibi can talk with any Unetd anywhere, but you should talk to a local daemon if you have one and let the system itself worry about remote access. Alibi requires as its one and only command line argument the name of the machine running the Unetd. If all is well, you will get a prompt asking for a query. Typing 'help' or anything else that is not a well-formed query will produce a brief help screen. A basic query is a group of keywords in parentheses, e.g. (cache software). The special command 'more' (no parens) will retrieve something else like what you just got. More complex options are described in the help. 'quit' gets you out of the client. The client can be suspended and brought back without interfering with the processing of the query by the information system; only the final delivery will be delayed. System Description (Read Before Installing Unetd) ------------------------------------------------- The Ubernet is the information network used by Alibi. Alibi is the name of the entire system, including the simple client that allows users to retrieve information. Alibi is neither a navigational system like WWW nor a resource catalog like Archie. It is a fully distributed, fully automatic resource discovery and information retrieval system with a query-based user interface. The client (called alibi) contacts a Unetd, submits queries on behalf of the user, and processes replies from the Unetd. Unetd maintains Internet connections with other Unetds at other sites, and it may also communicate with mediators / resource managers at the local site to retrieve information. When a Unetd receives a query, it either generates a response using its local resources or forwards the query to another Unetd. Alibi can handle just about any kind of information. Currently available resources include the MS-DOS subtree of wuarchive (via NFS), the SEC's EDGAR database (via FTP), a geographical database for Virginia, and the following "demo-sized" information bases: a collection of sound files; a group of images at NASA Goddard Space Flight Center; several Usenet newsgroups; the Alibi FTP directory; and a source code reuse library. You do not need to have an information base to run Unetd, but it would be very nice if you would contribute what you have to the Alibi information network. It is preferable to run a Unetd at the site having the information base rather than to have Unetd access the data remotely, since Unetd handles distributed information retrieval much better than anything else. Installing Unetd ---------------- You do not need root privileges to run Unetd, but root should add it to rc.local to keep it up through power failures. As a last resort, a utility called cron_rc has been included in the utils subdirectory to restart the daemons after power failures without needing any privileges except access to crontab. Several makefiles are provided. The default makefile assumes that you have an ANSI C compiler, finds the maximum level of optimization it supports, and compiles the daemon. knrmakefile assumes that you have a K&R C compiler and the ansi2knr utility.
makefile.sun.cc is a hacked version of knrmakefile that bypasses the compiler-finding script, which is necessary, for example, if your system administrator installed gcc (the preferred compiler) in a bogus way that makes it so that you can't actually compile anything. alibi.h contains a small number of #defines that can be altered to set the working directory of the daemon and to bypass C library functions that are missing on your machine. By default, Unetd does a 'cd FQDN' (filling in the FQDN of the local machine) when it is started. In that directory you need to put a file called bozo.txt (you can use the one provided in the source directory) and a file called peers that lists the FQDN's of other hosts running Unetds with which you want to connect. When bringing up Unetd for the first time, forget about the peers file and just run Unetd as an isolated daemon to make sure it's working. When you do eventually choose other Unetds to connect to, choose some small number of sites that are geographically close. Three is a good number; six is okay; ten is getting excessive. You won't gain anything from creating too many links except lots of overhead. Keep in mind that other sites can put YOU in their peers file without telling you (just like you were about to do to someone else) and make you have more links than you thought. Unetd writes logging information to stdout and stderr, so redirect them to a file. Among the first information written is the FQDN of the local machine, the PID of the daemon, and so on. If the FQDN is wrong, you must put a file called FQDN in the directory from which Unetd is started containing the correct FQDN. FQDN is the ONLY file that is read from the initial directory before 'cd FQDN' happens. If your Unetd announces itself with an incorrect FQDN, other daemons will "bozo" it repeatedly (this will be noted in the log file) and you will not be able to talk to other sites. The default FQDN will be correct if your system is correctly configured. Unetd also writes periodic statistics on things like average response time. Some of these statistics have known bugs, such as the fact that you can register more delivered responses than accepted queries. This logging information currently accumulates at a rate that will produce 100k of log data in a few days. You can truncate the log by sending a HUP signal to Unetd. (This will not kill the daemon) After the testing period is over I intend to greatly reduce the amount of logging that is done. Verbose logging is also enabled by default for the cache decision function because I haven't collected enough information to tune it yet. You might want to disable that by undoing the #define debugcache in alibi.h. Queries from users are logged since I want to know what somebody typed that crashed Unetd when it happens. It just logs the queries, not the identity of the users who entered them. Providing Information Bases --------------------------- To provide an information base you need to install a mediator. If you do this wrong, you can degrade the performance of the entire system. A mediator is a separate program and process that creates two named pipes in the directory used by Unetd. Unetd opens those pipes and talks with the mediator using a simple protocol. Unetd sends subqueries (keyword queries with no Boolean logic) to the mediator, and the mediator returns OIDs (Object IDentifiers) of matching data objects. Unetd might then ask the mediator to retrieve a data object or send another subquery. 
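(To make the mediator idea concrete, a hypothetical Perl skeleton follows. The actual Unetd/mediator wire protocol is not given in this README, so the pipe names, the one-subquery-per-line framing, and the toy index below are all assumptions for illustration only -- a sketch, not the NIST code.)

#!/usr/bin/perl
# Hypothetical mediator skeleton -- NOT the real Unetd protocol.
use strict;
use warnings;
use IO::Handle;
use POSIX qw(mkfifo);

my ($qpipe, $rpipe) = ('mediator.in', 'mediator.out');  # made-up names
for my $p ($qpipe, $rpipe) {
    next if -p $p;                        # named pipe already exists
    mkfifo($p, 0600) or die "mkfifo $p: $!";
}

# Toy keyword->OID index standing in for a real information base.
my %index = (
    cache    => ['oid-17'],
    software => ['oid-17', 'oid-42'],
);

# Opening a FIFO blocks until the other end attaches; fine for a sketch.
open my $q, '<', $qpipe or die "open $qpipe: $!";   # subqueries from Unetd
open my $r, '>', $rpipe or die "open $rpipe: $!";   # OIDs back to Unetd
$r->autoflush(1);

while (my $subquery = <$q>) {
    chomp $subquery;
    my @kw = split ' ', $subquery;        # subqueries carry no Boolean logic
    my %hits;
    $hits{$_}++ for map { @{ $index{$_} || [] } } @kw;
    # Report only objects matching EVERY keyword; if nothing is truly
    # relevant, say nothing, so some other mediator gets a chance.
    print {$r} "$_\n" for grep { $hits{$_} == @kw } sort keys %hits;
}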
The reasons that incorrectly installed mediators are a danger are as follows:
-- Unetd trusts mediators to provide intelligent classifications for data objects that mesh with the generally accepted class hierarchy of the Ubernet. If you start creating lots of bogus data classes, the bogosity will propagate into the adaptive query classification heuristics used by Unetds all over the place and degrade performance.
-- Unetd trusts mediators not to give stupid answers to good queries. If a mediator says that a data object matches a query, it is assumed that the degree of relevance is fairly high and that every keyword in the subquery was found to relate. A mediator MUST NOT simply find the closest thing in the database regardless of the magnitude of its irrelevance! If no data are relevant, a null response is expected so that some other mediator will be given a chance.
-- Unetd trusts mediators not to act in a Byzantine manner designed to crash the system.
Some examples of mediators are provided in a subdirectory called example_resources. rblobs.c is the most frequent starting point for building resources. rblobs.c will turn a file system subtree into an information base using index files that you must provide. A slight variation on rblobs.c was used to provide the MS-DOS subtree of wuarchive. rnntp.c shows how you can overhaul rblobs.c to let Unetd retrieve information from diverse information sources, and r_c_sources.c shows how automatic indexing can be employed. Getting Sources and Further Reading ----------------------------------- Alibi sources and miscellaneous Alibi-related papers are available for anonymous FTP on speckle.ncsl.nist.gov under the directory called flater. Of course, you can also get them through Alibi. Licensing and All That Jazz --------------------------- Everything that is shipped in the Alibi/Unetd package is (C) 1994 David Flater, but permission is granted for free copying. The sources may be modified, reused, or rewritten provided that fair credit to David Flater is given where appropriate, to the extent that is appropriate for the level of reuse. The right to use this software is granted to the public; the right to misuse it is not. Misuse of this software on an open network may degrade the performance of the entire information system and violate the rights of other users. Such misuse is expressly prohibited, and all rights that you have been granted to this software may be revoked in the event of such misuse. No warranties of any kind are made with respect to this package. The author disclaims any and all responsibility for anything bad that happens as a result of the use or misuse of this software.
-------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Mon Jun 13 14:06:48 1994 Return-Path: Delivery-Date: Mon, 13 Jun 1994 14:07:27 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 13 Jun 1994 14:06:48 +0100 Date: Mon, 13 Jun 1994 14:06:48 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050560:940613130650] Content-Identifier: new paper Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 13 Jun 1994 14:06:48 +0100; Alternate-Recipient: Allowed From: Charlie Stross Message-ID: <9406131403.aa24556@ruddles.sco.com> To: /CN=robots/@nexor.co.uk Cc: charless@sco.COM Subject: new paper X-Mailer: SCO Portfolio 2.0 Status: RO Content-Length: 394 I've just written a first-draft informal paper discussing knowbots. It's on the web: http://gemma.demon.co.uk:8001/~charlie/websearch.html Comments? -- Charlie -------------------------------------------------------------------------------- Charlie Stross is charless@sco.com, SCO Technical Publications GO d-- -p+ c++++(i---) u++ l-(+) *e++ m+ s/+ !n h-(++) f+ g+ w++ t-(---) r-(++) y+ From /CN=robots-errors/@nexor.co.uk Tue Jun 14 08:44:58 1994 Replied: Tue, 14 Jun 1994 13:51:37 +0100 Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Tue, 14 Jun 1994 08:45:38 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 14 Jun 1994 08:44:58 +0100 Date: Tue, 14 Jun 1994 08:44:58 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:128690:940614074459] Content-Identifier: libwww-perl: ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 14 Jun 1994 08:44:58 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406140044.aa03471@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Cc: oscar@cui.unige.ch, grimes@raison.mro.dec.com, shelden@fatty.law.cornell.edu Subject: libwww-perl: A generic WWW interface library for perl tools Status: RO Content-Length: 850 Hello all, After some prompting from Martijn Koster and Oscar Nierstrasz at WWW94, I decided to rewrite the core of MOMspider so that it can serve as a generic library for WWW clients written in Perl. So far it includes support for all of HTTP and also local file requests. I am looking for more contributions to support the many other protocols and also to provide better HTML libraries. The distribution site and much more information about the libraries can be found at and also at Please take a look and tell me what you think. ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Tue Jun 14 13:31:03 1994 Return-Path: Delivery-Date: Tue, 14 Jun 1994 13:32:27 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 14 Jun 1994 13:31:03 +0100 Date: Tue, 14 Jun 1994 13:31:03 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:176910:940614123106] Content-Identifier: Re: libwww-pe... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 14 Jun 1994 13:31:03 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9406141228.AA18336@fuzine.mt.cs.cmu.edu> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk Subject: Re: libwww-perl: A generic WWW interface library for perl tools Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 815 I have a C program to fetch URL's based on CERN's libwww that is available on the Web. The value of using libwww is that it works with HTTP, Gopher, FTP, and other protocols. My robot uses this to fetch URLs, but does the text processing in Perl. I also have a Perl subroutine that implements the RobotsNotWanted function using Martijn's standard. It caches the rights file to prevent multiple accesses. Check out http://fuzine.mt.cs.cmu.edu/mlm/scoutget.html http://fuzine.mt.cs.cmu.edu/mlm/rnw.html Each contains a short description, code, and a sample test run. A question: Does anybody know a good way to randomly select an entry from a Perl associative list without looping through the whole array using 'each'? --Michael L. Mauldin fuzzy@cmu.edu http://fuzine.mt.cs.cmu.edu/mlm/home.html From /CN=robots-errors/@nexor.co.uk Wed Jun 15 14:58:51 1994 Replied: Wed, 15 Jun 1994 17:44:24 +0100 Replied: /CN=robots/@nexor.co.uk Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Wed, 15 Jun 1994 14:59:34 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 14:58:51 +0100 Date: Wed, 15 Jun 1994 14:58:51 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:025190:940615135853] Content-Identifier: Proposed name... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 14:58:51 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406150657.aa16431@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Subject: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 955 Hello all, I was just editing my MOMspider paper for final submission in the WWW94 proceedings (what a pain!) and noticed that I have several references to the name /RobotsNotWanted.txt in the text. I would like to change the name before it gets written in stone (i.e. before I hand over copyright to Elsevier). I propose that the name be: /spiders.txt

Reasons: 1) It fits within the 8.3 filename restrictions for PCs
         2) It is easy to remember and hard to mistake (i.e. no mixed case)
         3) It is more web-ish than /robots.txt
         4) It does not imply that all robots are excluded (/norobots.txt)

So, what's the general consensus? I need to have a decision within the next 24 hours in order to get my paper done on time ;-) ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Wed Jun 15 15:14:59 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 15:15:21 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 15:14:59 +0100 Date: Wed, 15 Jun 1994 15:14:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:026740:940615141500] Content-Identifier: Re: Proposed ...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 15:14:59 +0100; Alternate-Recipient: Allowed From: Guido.van.Rossum@cwi.nl Message-ID: <9406151407.AA09156=guido@voorn.cwi.nl> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406150657.aa16431@paris.ics.uci.edu> References: <9406150657.aa16431@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt X-Organization: CWI (Centrum voor Wiskunde en Informatica) X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax) Status: RO Content-Length: 245 I vote for /robots.txt. Seems more neutral (after all the general term for web crawlers seems to be robots, not spiders). --Guido van Rossum, CWI, Amsterdam URL: From /CN=robots-errors/@nexor.co.uk Wed Jun 15 17:01:23 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 17:01:49 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 17:01:23 +0100 Date: Wed, 15 Jun 1994 17:01:23 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039080:940615160125] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 17:01:23 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9406151559.AA24432@fuzine.mt.cs.cmu.edu> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 295 I am in favor of the new name, if only because this is the chance to put it out on paper, which is hard to change, and I have no major objections to this name. There are few enough RNW files out there that we can contact all known such servers by email... --Michael L. Mauldin fuzzy@cmu.edu From /CN=robots-errors/@nexor.co.uk Wed Jun 15 17:44:43 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 17:45:00 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 17:44:43 +0100 Date: Wed, 15 Jun 1994 17:44:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:048580:940615164444] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 17:44:43 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"4855 Wed Jun 15 17:44:30 1994"@nexor.co.uk> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406150657.aa16431@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1425 > I propose that the name be: /spiders.txt > > Reasons: 1) It fits within the 8.3 filename restrictions for PCs > 2) It is easy to remember and hard to mistake (i.e. no mixed case) Agree. The reason for a far-out name was a smaller chance of a name collision, but the PC's are a problem. > 4) It does not imply that all robots are excluded (/norobots.txt) Agree. > 3) It is more web-ish than /robots.txt This is hardly a convincing argument. I'd prefer /robots.txt because it is seems a broader term which can include other automated processes, such as mirrors. But if my vote results in a hung decision I'll happily change. 
> So, what's the general consensus? I need to have a decision within > the next 24 hours in order to get my paper done on time ;-) Yup, this is a good occasion to fix that outstanding issue. So, what have we got: /robots.txt: 2, /spiders.txt: 1. So Roy, as you've got the clock, let us know which name it is to be. Another issue with the robots.txt spec: is there any problem with allowing for shell-like "#" comment lines? This has been suggested by two other people, and I'd like to add it when I add the new name. I'd also like any other comments on the proposal... -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:04:38 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:05:01 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:04:38 +0100 Date: Wed, 15 Jun 1994 19:04:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:054230:940615180442] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:04:38 +0100; Alternate-Recipient: Allowed From: John.R.R.Leavitt@NL.CS.CMU.EDU Message-ID: <"5406 Wed Jun 15 19:04:17 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1236 If anyone gets this twice, I apologize... I got some nasty bounce mail when I submitted it before, so I am trying again. I would prefer /robots.txt (or even better something like robots.lmt (limit), since txt implies human-readable text to me). My main preference for this is that my robots are named after ants, not spiders (since they will cooperate when they are done (someday...)). Also, there is the world wide web worm and the webcrawler, neither of which seem to use the spider metaphor. Just my $0.02. -John.
--------------------------------jrrl@cs.cmu.edu-------------------------------
John R. R. Leavitt               "Even through the darkest phase
Research Programmer               Be it thick or thin
Center for Machine Translation    Always someone marches brave
Carnegie Mellon University        Here beneath my skin"
Editor, Omphalos Magazine         k.d.lang, "Constant Craving"
------------------------------------------------------------------------------
Reading: Little, Big by John Crowley
         Remaking History by Kim Stanley Robinson
------------------------------------------------------------------------------
From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:17:41 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:18:12 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:17:41 +0100 Date: Wed, 15 Jun 1994 19:17:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055110:940615181742] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:17:41 +0100; Alternate-Recipient: Allowed From: "Tronche Ch.
le comique" Message-ID: <9406151820.AA01347@indy1.lri.fr> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Original-Received: from indy1.lri.fr by lri.lri.fr, Wed, 15 Jun 1994 20:14:58 +0200 PP-warning: Illegal Received field on preceding line Original-Received: by indy1.lri.fr, Wed, 15 Jun 94 20:20:46 +0200 PP-warning: Illegal Received field on preceding line X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ I would prefer /robots.txt (or even better something like robots.lmt (limit), > since txt implies human-readable text to me). The file _is_ human-readable, in some sense. Just $0.02 more. +--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:28:10 1994 Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:28:43 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:28:10 +0100 Date: Wed, 15 Jun 1994 19:28:10 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055740:940615182811] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:28:10 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9406151828.AA24805@fuzine.mt.cs.cmu.edu> To: /CN=robots/@nexor.co.uk Cc: fuzzy@CMU.EDU Subject: Re: Proposed name change for /RobotsNotWanted.txt Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 1057 Okay, let me modify my earlier vote. I am still in favor of doing the name change NOW and picking a DOS compatible name. How about agents.pol (For agents policy). This satisfies a number of criteria 1. It is neutral, it does not imply that agents are good or bad. 2. "agent" is a general accepted term for what spiders, worms, ants and robots do. 3. the .pol extension does not seem to imply human readability Let me also second (or vote for) the suggestion to add comments to the spec, with '#' being a perfectly acceptable comment introduction character. Finally, let's drop the notion that an empty agents.pol file has a meaning...given the diversity of server responses to a non-existant file, let's force someone to use the exclusion language to deny access to every one: Robot: * Disallow: / should be the accepted way to turn off remote agents. We might as well change the "Robot:" to "Agent:", and then, we'll even be consistent with the CERN WWW spec (it is a User-Agent, after all). 
--Michael Mauldin From /CN=robots-errors/@nexor.co.uk Wed Jun 15 19:43:32 1994 Replied: Thu, 16 Jun 1994 09:37:30 +0100 Replied: /CN=robots/@nexor.co.uk Replied: John.R.R.Leavitt@NL.CS.CMU.EDU Return-Path: Delivery-Date: Wed, 15 Jun 1994 19:44:06 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 19:43:32 +0100 Date: Wed, 15 Jun 1994 19:43:32 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:056630:940615184333] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 19:43:32 +0100; Alternate-Recipient: Allowed From: John.R.R.Leavitt@NL.CS.CMU.EDU Message-ID: <"5660 Wed Jun 15 19:43:27 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1537 "Tronche Ch. le comique" writes: >John (John.R.R.Leavitt@NL.CS.CMU.EDU) writes: > >> I would prefer /robots.txt (or even better something like robots.lmt (limit), >> since txt implies human-readable text to me). > >The file _is_ human-readable, in some sense. True. But then, to the right people, so are .ps files, .c files, and even strange things like .dvi and .o files. Around here, .perl and .lisp files are considered human readable for the most part. What I meant was that .txt seems to suggest non-computer-readable data (meaning not designed for computer readability, since I'm sure a computer could read anything I could). In the end, the extension really doesn't matter all that much. :^) a couple more cents (if we keep going, we can all chip in on a soda! :^) -John.
--------------------------------jrrl@cs.cmu.edu-------------------------------
John R. R. Leavitt               "Even through the darkest phase
Research Programmer               Be it thick or thin
Center for Machine Translation    Always someone marches brave
Carnegie Mellon University        Here beneath my skin"
Editor, Omphalos Magazine         k.d.lang, "Constant Craving"
------------------------------------------------------------------------------
Reading: Little, Big by John Crowley
         Remaking History by Kim Stanley Robinson
------------------------------------------------------------------------------
From /CN=robots-errors/@nexor.co.uk Wed Jun 15 21:13:19 1994 Replied: Thu, 16 Jun 1994 09:38:53 +0100 Replied: /CN=robots/@nexor.co.uk Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Wed, 15 Jun 1994 21:13:41 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 15 Jun 1994 21:13:19 +0100 Date: Wed, 15 Jun 1994 21:13:19 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:062430:940615201320] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 15 Jun 1994 21:13:19 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406151307.aa07835@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk In-Reply-To: <"5406 Wed Jun 15 19:04:17 1994"@nexor.co.uk> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 1923 Hmmm...not a whole lot of consensus out there. Acronyms such as "racl.txt" are too hard to remember. I think the extension needs to reflect the content-type, not its purpose. Specifically, it is not fair to ask people to define a new type just for this file.
On the other hand, we could always call it "robots.pl" and require the format to be in Perl4. ;-) Yes, the comment syntax should be "all lines starting with # and all empty lines are ignored". I would also like to add an "Expires: " entry, e.g.

Expires: daily       (means don't check me again until tomorrow)
Expires: weekly      (  "     "     "    "    "    for 7 days)
Expires: monthly     (  "     "     "    "    "    for 30 days)
Expires: never       (means never check me again)
Expires: 27 Jun 1994 (means don't check again until after the given date)

Just my NZ half-penny ...
=======================================================================
Okay, the voting so far, counting my own (I think):

RTF = yours truly           Y = Yes
GvR = Guido van Rossum      N = No
MLM = Michael L. Mauldin    O = Okay, maybe, don't care, ...
MAK = Martijn Koster
JRL = John R. R. Leavitt
CT  = Christophe Tronche

    spiders.txt robots.txt robots.lmt racl.txt agents.pol agents.txt avoidURL.txt
    ----------- ---------- ---------- -------- ---------- ---------- ------------
RTF      Y          O          N         N         N          O           Y
GvR      N          Y
MLM      O          O                              Y
MAK      N          Y
JRL      N          Y          Y
CT       N          Y

and a grand total of USD $0.04 + FF $0.02 + NZD $0.005 ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Thu Jun 16 09:39:36 1994 Return-Path: Delivery-Date: Thu, 16 Jun 1994 09:41:05 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 16 Jun 1994 09:39:36 +0100 Date: Thu, 16 Jun 1994 09:39:36 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:109360:940616083939] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 09:39:36 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"10917 Thu Jun 16 09:39:14 1994"@nexor.co.uk> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406151307.aa07835@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 301 > I would also like to add an "Expires: " entry, e.g. Why not rely on the HTTP Expires? That's what it's for... -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Jun 16 09:37:50 1994 Return-Path: Delivery-Date: Thu, 16 Jun 1994 09:38:09 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 16 Jun 1994 09:37:50 +0100 Date: Thu, 16 Jun 1994 09:37:50 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:108770:940616083751] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 09:37:50 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"10874 Thu Jun 16 09:37:43 1994"@nexor.co.uk> To: John.R.R.Leavitt@NL.CS.CMU.EDU Cc: /CN=robots/@nexor.co.uk In-Reply-To: <"5660 Wed Jun 15 19:43:27 1994"@nexor.co.uk> Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 2155 Michael Mauldin wrote: > 3. the .pol extension does not seem to imply > human readability ... > In the end, the extension really doesn't matter all that much.
I've had problems in the past with ALIWEB's /site.idx, where servers (in these cases CERN and the NT one) didn't recognise the extension and made it application/binary or something. This can be a bit annoying if the client uses an Accept line with text/plain and text/html. So I guess officially we'd like a separate mime type for this and not worry about extensions at all, but in practice using .txt saves you hassle. > 2. "agent" is a generally accepted term for what > spiders, worms, ants and robots do. Well, it's close to user-agent, i.e. any client. This is a larger category than the automated robots these policy lines are directed at; I don't mind a manual browser going through all these places I want to hide from robots. So this may not be appropriate. On my way home last night I remembered that a while back someone suggested "/robotsp.txt", with the P for policy. This is even better than /robots.txt, as it describes the contents of the file well, and has less chance of name collision. But then, I wouldn't want to upset Roy's chart :-) > Let me also second (or vote for) the suggestion to > add comments to the spec, with '#' being a perfectly > acceptable comment introduction character. Good. Anybody object? > Finally, let's drop the notion that an empty agents.pol > file has a meaning...given the diversity of server responses > to a non-existent file, let's force someone to use the > exclusion language to deny access to everyone: OK, it was a bit obscure. > should be the accepted way to turn off remote agents. > We might as well change the "Robot:" to "Agent:", and > then, we'll even be consistent with the CERN WWW spec > (it is a User-Agent, after all). "User-agent:" then? Mmm, I think I like the sound of that. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Jun 16 12:40:49 1994 Replied: Fri, 17 Jun 1994 14:07:21 +0100 Replied: "Roy T. Fielding" Replied: Thu, 16 Jun 1994 12:44:18 +0100 Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Thu, 16 Jun 1994 12:41:47 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 16 Jun 1994 12:40:49 +0100 Date: Thu, 16 Jun 1994 12:40:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:134480:940616114051] Content-Identifier: Re: Proposed ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 16 Jun 1994 12:40:49 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406160440.aa00301@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Subject: Re: Proposed name change for /RobotsNotWanted.txt Status: RO Content-Length: 2604 ------- Forwarded Message From: Peter Beebee To: fielding@simplon.ICS.UCI.EDU In-reply-to: "Roy T. Fielding"'s message of Wed, 15 Jun 1994 13:13:19 -0700 <9406151307.aa07835@paris.ics.uci.edu> Subject: Re: Proposed name change for /RobotsNotWanted.txt Reply-to: beebee@parc.xerox.com Message-Id: <94Jun15.164028pdt.2695@persica.parc.xerox.com> Date: Wed, 15 Jun 1994 16:40:23 PDT Yet more American $$ But first, an introduction: Hello everybody, my name is Peter Beebee. I'm an undergraduate at MIT currently working at Xerox PARC. One of my recent projects is to implement an experimental web browser which operates through existing WWW clients but provides more natural searching options.
For this project I am writing (in PERL) a robot currently identified as SG-Scout. The purpose of this robot is to collect the information needed for the searching algorithms I will be using. Actually, the first version of SG-Scout is already written (thanks to the help of a couple of you); I've gotten it to run inside Xerox, but I've had problems with our firewall when I've tried to access remote servers. I do (and plan to continue to) comply with the proposed standard of exclusion. As for the name problem, I vote for something like "robots.cnf" or "robots.cfg" (configure) over "robots.txt". This way we could avoid creating our own extension for one file, but we would at the same time reduce the chances of collision. The RobotsNotWanted.txt file is more of a configuration file than a text file... -- Peter ------- End of Forwarded Message And my reply is: That sounds a lot like fish-search -- you should talk to Reiner Post. The libwww-perl code includes the ability to use a proxy server. See And there is no defined mime-type for config files, so .cnf and .cfg would be no better than .lmt in that regard. Of course, we could always define one and make it a standard, say text/config cfg but that would still be somewhat annoying to server maintainers. Oh, and never mind about the Expires thing -- I agree with Martijn that we should use the (painfully obvious) existing mechanism. However, I do not think that "robotsp.txt" more accurately reflects the purpose of the file -- it sounds like robot's pee (which is not quite what we had in mind ;-) ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Fri Jun 17 00:59:37 1994 Replied: Fri, 17 Jun 1994 09:17:21 +0100 Replied: /CN=robots/@nexor.co.uk Replied: beebee@parc.xerox.com Return-Path: Delivery-Date: Fri, 17 Jun 1994 01:00:36 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 00:59:37 +0100 Date: Fri, 17 Jun 1994 00:59:37 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:195960:940616235940] Content-Identifier: Evolving Stan... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 00:59:37 +0100; Alternate-Recipient: Allowed From: Peter Beebee Message-ID: <94Jun16.165832pdt.2695@persica.parc.xerox.com> To: /CN=robots/@nexor.co.uk Subject: Evolving Standard Reply-To: beebee@parc.xerox.com Status: RO Content-Length: 169 Ok.. so how is the standard emerging out of all this turmoil? "robots.txt"? empty file = all robots permitted? '#' = comment character? no "Expires" lines? - Peter From /CN=robots-errors/@nexor.co.uk Fri Jun 17 10:54:41 1994 Replied: Fri, 17 Jun 1994 14:05:53 +0100 Replied: /CN=robots/@nexor.co.uk Replied: beebee@parc.xerox.com Return-Path: Delivery-Date: Fri, 17 Jun 1994 10:56:08 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 10:54:41 +0100 Date: Fri, 17 Jun 1994 10:54:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:247490:940617095444] Content-Identifier: Re: Evolving ... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 10:54:41 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"23074 Fri Jun 17 09:17:28 1994"@nexor.co.uk> To: beebee@parc.xerox.com Cc: /CN=robots/@nexor.co.uk In-Reply-To: <94Jun16.165832pdt.2695@persica.parc.xerox.com> Subject: Re: Evolving Standard Status: RO Content-Length: 405 > Ok.. so how is the standard emerging out of all this turmoil? > > "robots.txt"? > empty file = all robots permitted? > '#' = comment character? > no "Expires" lines? Yes. I'll be changing the document accordingly. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jun 17 14:06:28 1994 Return-Path: Delivery-Date: Fri, 17 Jun 1994 14:07:27 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 14:06:28 +0100 Date: Fri, 17 Jun 1994 14:06:28 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273620:940617130630] Content-Identifier: Re: Evolving ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 14:06:28 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"27354 Fri Jun 17 14:06:01 1994"@nexor.co.uk> Cc: beebee@parc.xerox.com, /CN=robots/@nexor.co.uk In-Reply-To: <"23074 Fri Jun 17 09:17:28 1994"@nexor.co.uk> Subject: Re: Evolving Standard Status: RO Content-Length: 392 I wrote: > Yes. I'll be changing the document accordingly. I have in fact rewritten it entirely. Please let me know if there's anything I've missed. http://web.nexor.co.uk/mak/doc/robots/norobots.html -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jun 17 17:41:25 1994 Replied: Fri, 17 Jun 1994 17:43:55 +0100 Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Fri, 17 Jun 1994 17:41:59 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 17:41:25 +0100 Date: Fri, 17 Jun 1994 17:41:25 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007700:940617164126] Content-Identifier: Re: Evolving ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 17:41:25 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406170939.aa07770@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk In-Reply-To: <"27354 Fri Jun 17 14:06:01 1994"@nexor.co.uk> Subject: Re: Evolving Standard Status: RO Content-Length: 396 Martijn wrote: > I have in fact rewritten it entirely. Please let me know if there's > anything I've missed. > > http://web.nexor.co.uk/mak/doc/robots/norobots.html Oooh, very nice. Looks great, ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Fri Jun 17 17:37:38 1994 Return-Path: Delivery-Date: Fri, 17 Jun 1994 17:38:11 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 17:37:38 +0100 Date: Fri, 17 Jun 1994 17:37:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007030:940617163739] Content-Identifier: Re: New code ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 17:37:38 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"699 Fri Jun 17 17:37:28 1994"@nexor.co.uk> To: " (Michael Mauldin)" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406171622.AA09303@fuzine.mt.cs.cmu.edu> Subject: Re: New code available to implement the latest standard (Perl) Status: RO Content-Length: 1174 Michael Mauldin wondered about ordering of the lines in the record. As it stands the ordering isn't explicitly specified, so that a User-agent line can follow a Disallow line:

> Disallow: /
> User-agent: GoodRobot
> Disallow:
>
> What does this mean?

The same as

User-agent: GoodRobot
Disallow:
Disallow: /

which is silly, but is to be interpreted to mean "Allow all URL's except those which start with a slash", which in practice disallows all URL's.

> Requiring the robot name before the action allows a simple
> way to determine how to proceed
> 1. Find your name (or *)
> 2. Read all Disallow lines and act on them.
>
> Otherwise you force the robot to read the whole file
> to figure out what to do.

I like unspecified ordering because that is how RFC822 is specified, and this format looks very much like rfc822. I can't imagine parsing overhead to really be a problem. But if there is a lot of resistance to it I'll change it. comments? -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jun 17 18:38:05 1994 Replied: Sun, 19 Jun 1994 15:39:08 +0100 Replied: /CN=robots/@nexor.co.uk Replied: "Roy T. Fielding" Return-Path: Delivery-Date: Fri, 17 Jun 1994 18:38:43 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 17 Jun 1994 18:38:05 +0100 Date: Fri, 17 Jun 1994 18:38:05 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:011430:940617173807] Content-Identifier: Re: New code ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 17 Jun 1994 18:38:05 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406171036.aa11854@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk In-Reply-To: <"699 Fri Jun 17 17:37:28 1994"@nexor.co.uk> Subject: Re: New code available to implement the latest standard (Perl) Status: RO Content-Length: 1010 Martijn wrote: > Michael Mauldin wondered: >> Requiring the robot name before the action allows a simple >> way to determine how to proceed >> 1. Find your name (or *) >> 2. Read all Disallow lines and act on them. >> >> Otherwise you force the robot to read the whole file >> to figure out what to do. > > I like unspecified ordering because that is how RFC822 is specified, > and this format looks very much like rfc822.
> I can't imagine parsing overhead to really be a problem. But if there
> is a lot of resistance to it I'll change it. comments?

Nope, it won't work that way. rfc822 parsers combine identical headers into a single, comma-separated list, thus causing any blank Disallow: lines to disappear. I recommend defining it as ordered (it is less confusing to the reader that way). ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Sun Jun 19 15:39:21 1994 Return-Path: Delivery-Date: Sun, 19 Jun 1994 15:39:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 19 Jun 1994 15:39:21 +0100 Date: Sun, 19 Jun 1994 15:39:21 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:102640:940619143922] Content-Identifier: Re: New code ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 19 Jun 1994 15:39:21 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"10260 Sun Jun 19 15:39:14 1994"@nexor.co.uk> To: "Roy T. Fielding" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406171036.aa11854@paris.ics.uci.edu> Subject: Re: New code available to implement the latest standard (Perl) Status: RO Content-Length: 508 Roy wrote: > Nope, it won't work that way. rfc822 parsers combine identical headers > into a single, comma-separated list, thus causing any blank Disallow: > lines to disappear. which is consistent with its semantics. > I recommend defining it as ordered (it is less confusing to the reader > that way). alright... -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Sat Jun 18 03:54:36 1994 Return-Path: Delivery-Date: Sat, 18 Jun 1994 03:55:12 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sat, 18 Jun 1994 03:54:36 +0100 Date: Sat, 18 Jun 1994 03:54:36 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:044150:940618025438] Content-Identifier: libwww-perl v... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 18 Jun 1994 03:54:36 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9406171953.aa14559@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Cc: oscar@cui.unige.ch, grimes@raison.mro.dec.com, shelden@fatty.law.cornell.edu Subject: libwww-perl version 0.11 Status: RO Content-Length: 1134 Hello all, I made a few bug fixes and upgrades to the libwww-perl in preparation for a general announcement on www-talk.

Version 0.11  June 17, 1994
  Changed environment variable LIBWWW-PERL to LIBWWW_PERL because some
  systems can't handle the dash (Charlie Stross).
  Fixed bug in "get" that caused full pathname to be used as the method
  (Martijn Koster).
  Fixed handling of perverse relative URLs (e.g. ../../) in wwwurl'absolute.

The distribution site and much more information about the libraries can be found at and also at If you have already picked up a copy, a patch file is available at both locations (patch010to011.txt). After today, I will just announce changes on www-talk (I know how annoying it is to get several copies of the same announcement).
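(Since "perverse" relative URLs come up a lot in robot code, here is the kind of path collapsing involved -- a minimal standalone sketch, not the actual wwwurl'absolute code from libwww-perl; trailing-slash handling is simplified, and dropping a '..' that climbs above the root is just one common choice.)

#!/usr/bin/perl
# Collapse '.' and '..' segments in an absolute URL path.
use strict;
use warnings;

sub collapse_path {
    my ($path) = @_;
    my @out;
    for my $seg (split m{/}, $path, -1) {   # -1 keeps a trailing empty segment
        next if $seg eq '.';
        if ($seg eq '..') {
            pop @out if @out > 1;           # never pop the leading '' (the root)
            next;
        }
        push @out, $seg;
    }
    return join '/', @out;
}

print collapse_path('/spider/../docs/./norobots.html'), "\n";  # /docs/norobots.html
print collapse_path('/../../index.html'), "\n";                # /index.html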
....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy From /CN=robots-errors/@nexor.co.uk Sun Jun 19 16:19:45 1994 Return-Path: Delivery-Date: Sun, 19 Jun 1994 16:20:09 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 19 Jun 1994 16:19:45 +0100 Date: Sun, 19 Jun 1994 16:19:45 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:104030:940619151947] Content-Identifier: (q)Version(q)... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 19 Jun 1994 16:19:45 +0100; Alternate-Recipient: Allowed From: "Tronche Ch. le comique" Message-ID: <9406191522.AA26207@indy1.lri.fr> To: /CN=robots/@nexor.co.uk Subject: "Version" field in /robots.txt Original-Received: from indy1.lri.fr by lri.lri.fr, Sun, 19 Jun 1994 17:16:43 +0200 PP-warning: Illegal Received field on preceding line Original-Received: by indy1.lri.fr, Sun, 19 Jun 94 17:22:42 +0200 PP-warning: Illegal Received field on preceding line X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ Delivery-Date: Mon, 20 Jun 1994 13:28:25 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 20 Jun 1994 13:27:38 +0100 Date: Mon, 20 Jun 1994 13:27:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:187400:940620122739] Content-Identifier: Administrativ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 13:27:38 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Administrativa, and norobots.pl Status: RO Content-Length: 580 As a number of people have recently asked me about the robots mailing list I have put up a Web page with some info, which has little news for you, but you might want to keep for future reference: Jumping on the bandwagon I have also written a /robots.txt parser in Perl, it's on -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From guido@cwi.nl Mon Jun 20 13:53:33 1994 Replied: Mon, 20 Jun 1994 15:15:39 +0100 Replied: Martijn Koster Replied: /DD.Common=robots/@nexor.co.uk Replied: Guido.van.Rossum@cwi.nl Return-Path: Delivery-Date: Mon, 20 Jun 1994 13:53:47 +0100 Received: from charon.cwi.nl by lancaster.nexor.co.uk with SMTP (XTPP); Mon, 20 Jun 1994 13:53:33 +0100 Received: from voorn.cwi.nl by charon.cwi.nl with SMTP id ; Mon, 20 Jun 1994 14:53:15 +0200 Received: by voorn.cwi.nl with SMTP id ; Mon, 20 Jun 94 14:53:14 +0200 Message-Id: <9406201253.AA23764=guido@voorn.cwi.nl> To: Martijn Koster Cc: /DD.Common=robots/@nexor.co.uk Subject: Re: Administrativa, and norobots.pl In-Reply-To: Your message of "Mon, 20 Jun 1994 14:27:38 MDT." <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk> References: <"18731 Mon Jun 20 13:27:15 1994"@nexor.co.uk> From: Guido.van.Rossum@cwi.nl X-Organization: CWI (Centrum voor Wiskunde en Informatica) X-Address: P.O. 
Box 94079, 1090 GB Amsterdam, The Netherlands X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax) Date: Mon, 20 Jun 1994 14:53:13 +0200 Sender: Guido.van.Rossum@cwi.nl Status: RO Content-Length: 1634 Martijn, sorry if any of this has been discussed before: Looking at your recent norobots.html, I noticed that the spec uses "User-agent:" but the examples use "Robot:" ... (I prefer Robot). I personally have some problems interpreting your format description precisely -- the language seems to be open for misinterpretation. (I want to write a robots.txt parser in Python. I can't read Perl so looking at your example parser won't do me much good, and anyway that shouldn't be necessary :-) E.g. records are separated by blank lines. Is this before or after removing comments? (This would make a difference regarding Is there no allowed after the ? (The example suggests there is -- between the value and the #comment.) Also since there's a strict alternation of Robots: and Disallow: lines, why not use the appearance of a Robots: line to signal the end of a record? Then the syntax would be (using my own BNF variant -- hope it's clear): file: endline* record* record: robotsline+ disallowline+ robotsline: 'Robots:' sp* value sp* endline+ disallowline: 'Disallow:' sp* value sp* endline+ sp: SPACE | TAB endline: ['#' comment] (CR | LF | CR LF) value: comment: with the proviso that 'Robots:' and 'Disallow:' should be parsed case insensitive. Parsers could be told to treat unrecognized headers as comments, for future extensions. --Guido van Rossum, CWI, Amsterdam URL:
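As a rough illustration of how small a parser for this grammar would be (modern Python, invented function name; this assumes the robotsline+ disallowline+ reading of Guido's BNF, so a Robots: line following a Disallow: line starts a new record):

    def parse_records(text):
        # Sketch of the grammar above: '#' starts a comment, field names
        # are case insensitive, and a Robots: line after a Disallow: line
        # opens a new record.  Unrecognised fields are ignored, as Guido
        # suggests, to leave room for future extensions.
        records, current = [], None
        for raw in text.splitlines():
            line = raw.split('#', 1)[0].strip()
            if not line:
                continue
            field, sep, value = line.partition(':')
            if not sep:
                continue                      # not a "field: value" line
            field, value = field.strip().lower(), value.strip()
            if field == 'robots':
                if current is None or current['disallow']:
                    current = {'robots': [], 'disallow': []}
                    records.append(current)
                current['robots'].append(value)
            elif field == 'disallow' and current is not None:
                current['disallow'].append(value)
        return records

Note that a reader built this way never needs blank lines at all, which is exactly Guido's point.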
From /CN=robots-errors/@nexor.co.uk Mon Jun 20 15:17:46 1994 Return-Path: Delivery-Date: Mon, 20 Jun 1994 15:19:19 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 20 Jun 1994 15:17:46 +0100 Date: Mon, 20 Jun 1994 15:17:46 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:213810:940620141748] Content-Identifier: Re: Administr... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 15:17:46 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"21338 Mon Jun 20 15:16:12 1994"@nexor.co.uk> To: Guido.van.Rossum@cwi.nl Cc: Martijn Koster , /CN=robots/@nexor.co.uk In-Reply-To: <9406201253.AA23764=guido@voorn.cwi.nl> Subject: Re: Administrativa, and norobots.pl Status: RO Content-Length: 1662 > Martijn, sorry if any of this has been discussed before: > > Looking at your recent norobots.html, I noticed that the spec uses > "User-agent:" but the examples use "Robot:" ... (I prefer Robot). Yes, we just discussed that :-) It was felt User-agent is closer to HTTP. It doesn't really matter what the name is... > I personally have some problems interpreting your format description > precisely. OK, let us know, this thing should be interpretable. > E.g. records are separated by blank lines. Is this before or after > removing comments? (This would make a difference regarding Is there > no allowed after the ? (The example suggests > there is -- between the value and the #comment.) Did your editor eat something there? The records are separated by blank lines after removing comments, but I don't really see the difference. Yes, there is optionalspace allowed after the value, which has no meaning, and is stripped (I've added this explicitly). > Also since there's a strict alternation of Robots: and Disallow: > lines, why not use the appearance of a Robots: line to signal the > end of a record? Then it all becomes one big collection of lines which is difficult to read. With blank lines it is clearer where one record ends and another begins. I was hoping not to need a BNF description, but it looks like one is needed. :-) > Parsers could be told to treat unrecognized headers as comments, for > future extensions.
OK -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From guido@cwi.nl Mon Jun 20 15:54:14 1994 Replied: Mon, 20 Jun 1994 17:25:07 +0100 Replied: /DD.Common=robots/@nexor.co.uk Replied: Guido.van.Rossum@cwi.nl Return-Path: Delivery-Date: Mon, 20 Jun 1994 15:54:30 +0100 Received: from charon.cwi.nl by lancaster.nexor.co.uk with SMTP (XTPP); Mon, 20 Jun 1994 15:54:14 +0100 Received: from voorn.cwi.nl by charon.cwi.nl with SMTP id ; Mon, 20 Jun 1994 16:54:01 +0200 Received: by voorn.cwi.nl with SMTP id ; Mon, 20 Jun 94 16:54:00 +0200 Message-Id: <9406201454.AA24145=guido@voorn.cwi.nl> To: Martijn Koster Cc: /DD.Common=robots/@nexor.co.uk Subject: Re: Administrativa, and norobots.pl In-Reply-To: Your message of "Mon, 20 Jun 1994 15:15:33 MDT." <9406201415.AA29767=m.koster@nexor.co.uk@charon.cwi.nl> References: <9406201415.AA29767=m.koster@nexor.co.uk@charon.cwi.nl> From: Guido.van.Rossum@cwi.nl X-Organization: CWI (Centrum voor Wiskunde en Informatica) X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax) Date: Mon, 20 Jun 1994 16:54:00 +0200 Sender: Guido.van.Rossum@cwi.nl Status: RO Content-Length: 1520 > > Looking at your recent norobots.html, I noticed that the spec uses > > "User-agent:" but the examples use "Robot:" ... (I prefer Robot). > > Yes, we just discussed that :-) It was felt User-agent is closer to > HTTP. It doesn't really matter what the name is... Well, if my vote still counts, I'd rather see Robot -- basically for the same reason I prefer "robots.txt" over anything else: it's easiest to remember. And in some sense a Robot isn't really a user agent at all... > Did your editor eat something there? The records are separated by > blank lines after removing comments, but I don't really see the > difference. Quoting from http://web.nexor.co.uk/mak/doc/robots/norobots.html (is that the official source?): The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive. and further down: Comments can be included in file using UNIX bourne shell conventions: the '#' character is used to indicate that the remainder of the line is a comment. But this doesn't tell me in which order these rules are executed. Anyway, testing for a blank line after removing comments would mean that you can't have a whole-line comment in a record. I prefer requiring a blank line before comment stripping. --Guido van Rossum, CWI, Amsterdam URL: From /CN=robots-errors/@nexor.co.uk Mon Jun 20 17:25:29 1994 Replied: Thu, 23 Jun 1994 10:20:41 +0100 Replied: Martijn Koster Replied: /CN=robots/@nexor.co.uk Replied: Mon, 20 Jun 1994 18:25:04 +0100 Replied: robots Replied: Guido.van.Rossum@cwi.nl Return-Path: Delivery-Date: Mon, 20 Jun 1994 17:26:42 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 20 Jun 1994 17:25:29 +0100 Date: Mon, 20 Jun 1994 17:25:29 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:244100:940620162530] Content-Identifier: Re: Administr...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 17:25:29 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"24407 Mon Jun 20 17:25:15 1994"@nexor.co.uk> To: Guido.van.Rossum@cwi.nl Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406201454.AA24145=guido@voorn.cwi.nl> Subject: Re: Administrativa, and norobots.pl Status: RO Content-Length: 1194 > Well, if my vote still counts, I'd rather see Robot -- basically for > the same reason I prefer "robots.txt" over anything else: it's easiest > to remember. And on some sense a Robot isn't really a user agent at > all... Fine, if you really feel it's that important, then let's vote on that too. The choice is "Robot" vs "User-agent", send votes only to me, not the entire list. On Wednesday 17:00 my time I'll count the votes, and change the spec if required. If there's a tie I'll decide. > Quoting from http://web.nexor.co.uk/mak/doc/robots/norobots.html (is > that the official source?): Well, that's as official as it gets :-) > Anyway, testing for a blank line after removing comments would mean > that you can't have a whole-line comment in a record. I prefer > requiring a blank line before comment stripping. OK, I see the source for the confusion. When I strip a whole-line comment I mean strip the entire line, which is the same as what you're saying. Yes, I'm in favour of that too. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Mon Jun 20 18:25:24 1994 Return-Path: Delivery-Date: Mon, 20 Jun 1994 18:26:02 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 20 Jun 1994 18:25:24 +0100 Date: Mon, 20 Jun 1994 18:25:24 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:249660:940620172526] Content-Identifier: Comments (was... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 20 Jun 1994 18:25:24 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"24963 Mon Jun 20 18:25:10 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Cc: Guido.van.Rossum@cwi.nl In-Reply-To: <"24407 Mon Jun 20 17:25:15 1994"@nexor.co.uk> Subject: Comments (was: Re: Administrativa, and norobots.pl ) Status: RO Content-Length: 841 > OK, I see the source for the confusion. When I strip a whole-line > comment I mean strip the entire line, which is the same as what > you're saying. Yes, I'm in favour of that too. I've changed the page to read:

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination are discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
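A minimal sketch of that rule (modern Python, invented helper name): cut from the '#' to the end of the line together with any space before it, and drop lines that then come out empty without letting them act as record separators.

    import re

    def strip_comments(lines):
        kept = []
        for line in lines:
            stripped = re.sub(r'\s*#.*$', '', line)
            if line.strip() and not stripped.strip():
                continue      # comment-only line: gone entirely, no boundary
            kept.append(stripped)
        return kept           # genuinely blank lines survive as separators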

Is that unambiguous? -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From guido@cwi.nl Tue Jun 21 00:47:34 1994 Replied: Tue, 21 Jun 1994 09:27:08 +0100 Replied: /DD.Common=robots/@nexor.co.uk Replied: Guido.van.Rossum@cwi.nl Return-Path: Delivery-Date: Tue, 21 Jun 1994 00:48:01 +0100 Received: from charon.cwi.nl by lancaster.nexor.co.uk with SMTP (XTPP); Tue, 21 Jun 1994 00:47:34 +0100 Received: from voorn.cwi.nl by charon.cwi.nl with SMTP id ; Tue, 21 Jun 1994 01:47:21 +0200 Received: by voorn.cwi.nl with SMTP id ; Tue, 21 Jun 94 01:47:18 +0200 Message-Id: <9406202347.AA25606=guido@voorn.cwi.nl> To: Martijn Koster Cc: /DD.Common=robots/@nexor.co.uk Subject: norobots.py In-Reply-To: Your message of "Mon, 20 Jun 1994 16:54:00 MDT." From: Guido.van.Rossum@cwi.nl X-Organization: CWI (Centrum voor Wiskunde en Informatica) X-Address: P.O. Box 94079, 1090 GB Amsterdam, The Netherlands X-Phone: +31 20 5924127 (work), +31 20 6225521 (home), +31 20 5924199 (fax) Date: Tue, 21 Jun 1994 01:47:17 +0200 Sender: Guido.van.Rossum@cwi.nl Status: RO Content-Length: 3339 Here's my norobots script in Python. Note that I haven't been able to find the Perl code (I can actually read Perl :-) since Martijn's norobots.html page doesn't seem to have a link to the source code -- or I missed it (using the www linemode browser in an Emacs shell window :-). # norobots.py # Handle /robots.txt files. # # Manages a cache of parsed robots.txt files, indexed by host:port. # # Proposed usage: # # import norobots # if norobots.allowed(url): # ... # # OK to read this url # # Author: Guido van Rossum # Version: 1.0 # Date: 21 June 1994 # XXX Worry about Expires: header later. import urllib import string # Parse a robots.txt file. # Return a list of records, where each record is represented as a # dictionary with keys 'robots' and 'disallow' (and possibly others). # The value for each key is a list of values, with one item for each # corresponding line (leading and trailing whitespace stripped). def parse(fp): records = [] current = {} while 1: line = fp.readline() if not line: break line = string.strip(line) if not line: if current: records.append(current) current = {} continue i = string.find(line, '#') if i >= 0: line = line[:i] line = string.strip(line) i = string.find(line, ':') if i < 0: continue # Ignore bad line key = string.lower(line[:i]) value = string.strip(line[i+1:]) if key in ('robot', 'user-agent'): key = 'robot' value = string.lower(value) if not current.has_key(key): current[key] = [] current[key].append(value) if current: records.append(current) return records # Check whether this robot is allowed to read the given URL. 
DEFAULT_NAME = 'python' cache = {} # Format: {'host:port': {'robot': [name, ...], 'disallow': [path, ...]}, ...} def allowed(url, my_name = DEFAULT_NAME): my_name = string.lower(my_name) # Substring must occur in record spec type, url = urllib.splittype(url) if type != 'http': return 1 # Don't mess with other protocols host, path = urllib.splithost(url) host = string.lower(host) # Hostnames are case insensitive host, port = urllib.splitport(host) if not port: port = '80' key = host + ':' + port # Normalized form if not cache.has_key(key): robots_url = '%s://%s:%s/robots.txt' % (type, host, port) records = [] try: fp = urllib.urlopen(robots_url) records = parse(fp) fp.close() except IOError: pass cache[key] = records records = cache[key] for record in records: if not record.has_key('robot'): continue if not record.has_key('disallow'): continue specs = record['robot'] if '*' in specs or my_name in specs: for prefix in record['disallow']: if path[:len(prefix)] == prefix: return 0 return 1 # Test program def test(): url = 'http://web.nexor.co.uk/mak/doc/robots/robots.html' print url, allowed(url) url = 'http://web.nexor.co.uk/aliweb/data/' print url, allowed(url) url = 'http://www.cwi.nl/' print url, allowed(url) for host, records in cache.items(): print print host for record in records: print for key, values in record.items(): for value in values: print '\t' + key + ':', value if __name__ == '__main__': test() # end norobots.py --Guido van Rossum, CWI, Amsterdam URL: From /CN=robots-errors/@nexor.co.uk Tue Jun 21 09:34:45 1994 Return-Path: Delivery-Date: Tue, 21 Jun 1994 09:35:46 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 21 Jun 1994 09:34:45 +0100 Date: Tue, 21 Jun 1994 09:34:45 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:011860:940621083447] Content-Identifier: Re: norobots.py Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 21 Jun 1994 09:34:45 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"1082 Tue Jun 21 09:27:42 1994"@nexor.co.uk> To: Guido.van.Rossum@cwi.nl Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9406202347.AA25606=guido@voorn.cwi.nl> Subject: Re: norobots.py Status: RO Content-Length: 630 Guido.van.Rossum@cwi.nl wrote: > Note that I haven't been able to find the Perl code ... since > Martijn's norobots.html page doesn't seem to have a link to the > source code Oops, I've been typing "emacs norobots.html&" so often I gave the wrong URL, it is actually in norobots.pl. I have also added a link in the .html. ^^ > (I can actually read Perl :-) Probably a lot better than I can read python :-) -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Jun 23 10:20:57 1994 Return-Path: Delivery-Date: Thu, 23 Jun 1994 10:21:35 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 23 Jun 1994 10:20:57 +0100 Date: Thu, 23 Jun 1994 10:20:57 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:276390:940623092058] Content-Identifier: User-agent vs... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 23 Jun 1994 10:20:57 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"27618 Thu Jun 23 10:20:48 1994"@nexor.co.uk> To: Martijn Koster Cc: /CN=robots/@nexor.co.uk In-Reply-To: <"24407 Mon Jun 20 17:25:15 1994"@nexor.co.uk> Subject: User-agent vs Robot (was Re: Administrativa, and norobots.pl) Status: RO Content-Length: 672 I wrote: > then let's vote on that too. The choice is "Robot" vs "User-agent", > send votes only to me, not the entire list. On Wednesday 17:00 my > time I'll count the votes, and change the spec if required. If > there's a tie I'll decide. Only got 4 votes, so I guess people aren't all that fussed which way: Robot: Peter Beebee, Guido van Rossum User-agent: Michael Mauldin, Roy Fielding There was a tie, I vote for "User-agent" and the status quo, so User-agent it remains. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jun 24 10:57:15 1994 Return-Path: Delivery-Date: Fri, 24 Jun 1994 10:57:56 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 24 Jun 1994 10:57:15 +0100 Date: Fri, 24 Jun 1994 10:57:15 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:138890:940624095717] Content-Identifier: Any more robo... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 24 Jun 1994 10:57:15 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"13879 Fri Jun 24 10:56:43 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Any more robots.txt comments? Status: RO Content-Length: 292 Are there any more things I should change to the robots.txt standard before announcing it to www-talk? -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jun 24 12:39:09 1994 Return-Path: Delivery-Date: Fri, 24 Jun 1994 12:39:56 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 24 Jun 1994 12:39:09 +0100 Date: Fri, 24 Jun 1994 12:39:09 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:161940:940624113912] Content-Identifier: quick questio... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 24 Jun 1994 12:39:09 +0100; Alternate-Recipient: Allowed From: Paul Harrington Message-ID: <18795.9406241138@tamdhu.cs.st-andrews.ac.uk> To: /CN=robots/@nexor.co.uk Cc: afcondon@dsg.cs.tcd.ie Subject: quick question on output from robots Status: RO Content-Length: 984 [ sent originally to Martijn. submitted here at his suggestion ] Hi, just joined the robots list a few hours ago and have been browsing the archive. I am interested in visualisation of web structures. I wrote a very primitive robot last year and used it to automatically (that should really be "semi-automatically" :-) generate navigational maps cf. http://www.dsg.cs.tcd.ie:1969/afc_draft.html I would like to know if there has been any work on output formats and descriptions from the various robots. 
I would like to take robot output and write some filters to generate graphical maps. I have been looking at applying some of Shneiderman's metrics to the generation of visual representations of nodes, subgraphs, etc. Apologies for the direct mail but I don't know whether or not this may have any bearing on the ongoing discussion. pjjH Paul Harrington, phrrngtn@dcs.st-andrews.ac.uk +44 334 63261 Division of Computer Science, St Andrews University, Scotland KY16 9SS From m.koster@nexor.co.uk Fri Jul 8 10:28:29 1994 Return-Path: Delivery-Date: Fri, 8 Jul 1994 10:28:52 +0100 Received: from nexor.co.uk (actually victor.nexor.co.uk) by lancaster.nexor.co.uk with SMTP (PP); Fri, 8 Jul 1994 10:28:29 +0100 To: beebee@parc.xerox.com cc: /CN=robots/@nexor.co.uk cc: m.koster@nexor.co.uk Subject: Re: New Robot: SG-Scout In-reply-to: Your message of "Fri, 08 Jul 1994 01:55:59 PDT." <94Jul8.015610pdt.18822@kolsaas.parc.xerox.com> Date: Fri, 08 Jul 1994 10:27:58 +0100 From: Martijn Koster Status: RO Content-Length: 2335 [Peter, I've cc'ed the robots list on this] Peter Beebee wrote: > I noticed in my searches that only 8 of the 6425 servers I inspected > have a /robots.txt file ... Is there any way we can make server > operators more aware of the standard? ... Perhaps there is another > forum which would reach more server admins... Yes, I'm planning to do that. Periodic postings to c.i.www.providers and updating FAQ's would be a start. A conference presentation or something wouldn't go amiss either... As always, it's the time, the time... > 6425 servers Ehr? That's about twice the number I know exist. I guess there may be some duplicates there. | http://vulcan.nexor.co.uk:8001 | http://web.nexor.co.uk:80 | http://wellington.nexor.co.uk:80 Yup, here we go, web=wellington (grr, who uses wellington :-/) I wonder how many robots don't recognise duplicates... This sounds like a perfect opportunity to start coming up with some common formats for sharing data; every robot at one point uses a list of servers, so we might as well keep a comprehensive list in a common format, so that each robot can convert it to their own format and use it as required. Something like (let's call it WWW-SERVER): | URL: | Host-Name: | Host-Port: | Alias: | IP-Address: | | ... Note that a server on a different port is considered a different server, which happens to share Alias and IP-Address with other records. The shortest referenced URL is an attempt to do better than "somewhere here is a server"; for example, one of my servers (vulcan above) "starts" at a specific path, the home page itself is not referenced (and is empty). The best-guess would be "www.domain" followed by "web.domain" etc, in the hope to get the generic DNS alias, not the hostname itself. I can adapt my Matthew-Gray's-List-parsing script to produce that format, which would get the Lycos and WWW Wanderer hosts as a start, if that is required. Good idea? Bad idea?
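As a sketch of how cheap the producing side would be (modern Python; the field names are the ones proposed above, while the function name and the example values are invented for illustration):

    def www_server_record(url, hostname, port, aliases, addresses):
        # Emit one record of the proposed WWW-SERVER format; repeated
        # Alias/IP-Address lines work like repeated RFC822 headers.
        lines = ['URL: %s' % url,
                 'Host-Name: %s' % hostname,
                 'Host-Port: %d' % port]
        lines.extend('Alias: %s' % a for a in aliases)
        lines.extend('IP-Address: %s' % ip for ip in addresses)
        return '\n'.join(lines) + '\n'

    # Illustrative values only (192.0.2.1 is a documentation address):
    print(www_server_record('http://web.nexor.co.uk/', 'wellington.nexor.co.uk',
                            80, ['web.nexor.co.uk'], ['192.0.2.1']))

Records separated by blank lines would then parse with the same machinery as /robots.txt.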
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jul 8 11:05:09 1994 Replied: Fri, 08 Jul 1994 11:19:26 +0100 Replied: /CN=robots/@nexor.co.uk Replied: beebee@parc.xerox.com Return-Path: Delivery-Date: Fri, 8 Jul 1994 11:06:36 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 8 Jul 1994 11:05:09 +0100 Date: Fri, 8 Jul 1994 11:05:09 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:233720:940708100511] Content-Identifier: Robotics Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 11:05:09 +0100; Alternate-Recipient: Allowed From: Peter Beebee Message-ID: <94Jul8.030415pdt.18822@kolsaas.parc.xerox.com> To: /CN=robots/@nexor.co.uk Subject: Robotics Reply-To: beebee@parc.xerox.com Status: RO Content-Length: 780 Actually, I checked into the situation with the duplicate entries for web.nexor.co.uk. I'm fairly sure that that is a special case. The duplicate only exists because your server was, for some executions of my robot, the root node for my search. I entered the server name manually one way and when the computer subsequently found links to wellington/web they were stored in the database for the other server name. I don't expect there are very many duplicates other than 5 or 6 of these exceptions. I've been using the gethostbyname procedure to find what my man page says is the 'official' name of each host. I believe this name is not the preferred alias. Is there any simple way to find the TRULY appropriate name of a server? (ex: WEB.nexor...) -- Peter From /CN=robots-errors/@nexor.co.uk Fri Jul 8 11:19:39 1994 Return-Path: Delivery-Date: Fri, 8 Jul 1994 11:21:10 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 8 Jul 1994 11:19:39 +0100 Date: Fri, 8 Jul 1994 11:19:39 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:237660:940708101940] Content-Identifier: Re: Robotics Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 11:19:39 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"23764 Fri Jul 8 11:19:34 1994"@nexor.co.uk> To: beebee@parc.xerox.com Cc: /CN=robots/@nexor.co.uk In-Reply-To: <94Jul8.030415pdt.18822@kolsaas.parc.xerox.com> Subject: Re: Robotics Status: RO Content-Length: 1928 > Actually, I checked into the situation with the duplicate entries > for web.nexor.co.uk. I'm fairly sure that that is a special case. > The duplicate only exists because your server was, for some > executions of my robot, the root node for my search. I entered the > server name manually one way and when the computer subsequently > found links to wellington/web they were stored in the database for > the other server name. I don't expect there are very many > duplicates other than 5 or 6 of these exceptions. I don't think it's that unique a case (but as said am eager to find out). Have a look at , especially compre.out. This seems to happen regularly. > I've been using the gethostbyname procedure to find what my man > page says is the 'official' name of each host. I believe this name > is not the preferred alias.
No, and this is how people get "wellington" instead of "web.nexor", which has swapped machines twice already and will probably do so again. (Doesn't anyone else have this problem of being shoved around? :-) > Is there any simple way to find the TRULY appropriate name of a > server? (ex: WEB.nexor...) Well, I did say "best guess" :-) I use: # decide on the "best" name for a www host sub bestname { local(@hosts) = sort(@_); for (@hosts) { return $_ if (/^(www|web)\./); } for (@hosts) { return $_ if (/(www|web)/); } for (@hosts) { return $_ if (/(gopher)/); } for (@hosts) { return $_ if (/(ftp)/); } for (@hosts) { return $_ if (/(veronica)/); } for (@hosts) { return $_ if (/(gate)/); } return $hosts[0]; } and update it when I see something obvious in the logs. Suggestions welcome. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Fri Jul 8 18:00:08 1994 Return-Path: Delivery-Date: Fri, 8 Jul 1994 18:01:11 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 8 Jul 1994 18:00:08 +0100 Date: Fri, 8 Jul 1994 18:00:08 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:296500:940708170009] Content-Identifier: Re: Server Co... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 18:00:08 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9407081659.AA27195@fuzine.mt.cs.cmu.edu> To: " (Mary Morris)" Cc: /CN=robots/@nexor.co.uk Subject: Re: Server Count Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 175 I currently show about 3000 servers in my list http://fuzine.mt.cs.cmu.edu/mlm/servers.html I am updating it right now. --Fuzzy http://fuzine.mt.cs.cmu.edu/mlm/home.html From /CN=robots-errors/@nexor.co.uk Fri Jul 8 22:11:43 1994 Replied: Mon, 11 Jul 1994 10:17:28 +0100 Replied: beebee@parc.xerox.com Return-Path: Delivery-Date: Fri, 8 Jul 1994 22:12:44 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 8 Jul 1994 22:11:43 +0100 Date: Fri, 8 Jul 1994 22:11:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:016250:940708211144] Content-Identifier: Re: Server Co... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 8 Jul 1994 22:11:43 +0100; Alternate-Recipient: Allowed From: Peter Beebee Message-ID: <94Jul8.141047pdt.18822@kolsaas.parc.xerox.com> To: marym@finesse.com, /CN=robots/@nexor.co.uk In-Reply-To: <9407081629.AA01018@thyme.finesse.com> Subject: Re: Server Count Reply-To: beebee@parc.xerox.com Status: RO Content-Length: 845 I recently ran a script on my database to translate the names of the servers I've explored to the appropriate aliases ( ex:wellington.nexor.co.uk to web.nexor.co.uk ). I just updated the html accordingly. The names of the 6412 servers can be found at http://www-swiss.ai.mit.edu/~ptbb/servers.html. If you need a single file containing only all the server names (the html list is, obviously, cluttered with html code, and it's separated into files by domain) I'll gladly compile one and mail it to you... maybe I'll make it available from the main list page...
My robot only follows links with the http scheme, so if my list contains gopher servers (as I'm sure it does...) then they were improperly linked to by someone. Does anyone know a way to find the aliases of a server without using gethostbyname/addr or nslookup? -- Peter From beebee@parc.xerox.com Mon Jul 11 16:18:50 1994 Return-Path: Delivery-Date: Mon, 11 Jul 1994 16:19:06 +0100 Received: from alpha.xerox.com by lancaster.nexor.co.uk with SMTP (XTPP); Mon, 11 Jul 1994 16:18:50 +0100 Received: from skye.parc.xerox.com ([13.1.102.95]) by alpha.xerox.com with SMTP id <14441(8)>; Mon, 11 Jul 1994 08:18:15 PDT Received: by skye.parc.xerox.com id <32262>; Mon, 11 Jul 1994 08:17:59 -0700 From: Peter Beebee To: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk In-reply-to: Martijn Koster's message of Mon, 11 Jul 1994 02:17:23 -0700 <94Jul11.021746pdt.14437(8)@alpha.xerox.com> Subject: Re: Server Count Reply-to: beebee@parc.xerox.com Message-Id: <94Jul11.081759pdt.32262@skye.parc.xerox.com> Date: Mon, 11 Jul 1994 08:17:44 PDT Status: RO Content-Length: 1585 I posted my 'complete' list from my home page... the URL for the list is http://www-swiss.ai.mit.edu/~ptbb/servers.txt . The C calls gethostbyname/addr don't seem to work on any of the three systems I have accounts on (Xerox, MIT, or MIT AI) in that they all yield the results (or lack of results) that I mentioned on Friday. Also, I can't seem to find nslookup on the system here at Xerox. I sorted for duplicate/inappropriate names with the following inadequate code, so there are certainly still some errors in the list. I'll correct everything as soon as I can find another way to determine the 'true' name of a server. (Thanks for the suggestion, Mary. I'm checking it out..) -- Peter code: sub getbestname { ### this is really hurting, but I don't know a better way that works on these machines. local($tmp, $*, $name) = ('', 1, @_); local($_, $name) = ($name, gethostbyname($name)); s/^[^\.]+\.(.+)$/www\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); s/^[^\.]+\.(.+)$/web\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); s/^[^\.]+\.(.+)$/gopher\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); s/^[^\.]+\.(.+)$/ftp\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); s/^[^\.]+\.(.+)$/veronica\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); s/^[^\.]+\.(.+)$/gate\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); s/^[^\.]+\.(.+)$/wais\.$1/; ($tmp) = gethostbyname($_); return($_) if ($tmp eq $name); return($name); }
From /CN=robots-errors/@nexor.co.uk Sun Jul 10 19:47:57 1994 Replied: Thu, 14 Jul 1994 11:18:57 +0100 Replied: " (Mary Morris)" Replied: Mon, 11 Jul 1994 10:13:55 +0100 Replied: " (Mary Morris)" Return-Path: Delivery-Date: Sun, 10 Jul 1994 19:48:48 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 10 Jul 1994 19:47:57 +0100 Date: Sun, 10 Jul 1994 19:47:57 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:144670:940710184759] Content-Identifier: Deduping Serv... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 10 Jul 1994 19:47:57 +0100; Alternate-Recipient: Allowed From: " (Mary Morris)" Message-ID: <9407101844.AA00514@thyme.finesse.com> To: /CN=robots/@nexor.co.uk Subject: Deduping Server Counts X-Sun-Charset: US-ASCII Status: RO Content-Length: 1116 Hi While going through the lists of servers that I picked up thanks to Peter Beebee, Michael Mauldin, and Matthew Gray, I noticed that there are times when the same server is listed twice. The only difference is the port number. Now in the case where the port numbers are say 70 and 80, I would say that the port 70 is gopher and eliminate it immediately. However, there are some cases where the port numbers are: 80, 8000, 8001, and 8008. Does anyone know what is happening in that case? I know that 8000 is a pretty common alternate port for 80. What I am asking here is do you think that all of those ports are feeding the same data? I could think of one scenario where CERN's httpd could be on say 8001 and NCSA's could be on 8000. Should they be considered different here? If we find common urls should we write them off as the same or what? FYI - after merging all of these lists and counting only one port per server *regardless* of port number, I have a count of 6855 servers. My next step in the deduping will be to do an nslookup of each of these servers and compare IP numbers. Comments? Mary Morris
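The nslookup step Mary describes can be sketched by keying each server on (first IP address, port); socket.gethostbyname_ex really does return the canonical name plus the address list, while everything else here -- the function name, the skip-on-failure policy -- is invented for illustration (modern Python):

    import socket

    def dedupe(servers):
        # servers: iterable of (hostname, port) pairs.  Entries that
        # resolve to the same address on the same port collapse into
        # one; whether 80 and 8000 on one host serve the same data is
        # exactly the open question above, so ports are kept distinct.
        seen = {}
        for host, port in servers:
            try:
                canonical, aliases, addrs = socket.gethostbyname_ex(host)
            except socket.error:
                continue          # unresolvable hosts are simply skipped
            seen.setdefault((addrs[0], port), (canonical, port))
        return sorted(seen.values())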
From /CN=robots-errors/@nexor.co.uk Mon Jul 11 10:47:43 1994 Return-Path: Delivery-Date: Mon, 11 Jul 1994 10:48:57 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 11 Jul 1994 10:47:43 +0100 Date: Mon, 11 Jul 1994 10:47:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:201030:940711094744] Content-Identifier: Another rapid... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 11 Jul 1994 10:47:43 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"20093 Mon Jul 11 10:47:31 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Another rapid fire attack Status: RO Content-Length: 2430 Paul Ginsparg last week reported another rapid fire attack on his server. I have included several excerpts from his message, FYI. > this one had all the classic hallmarks indicated in your > http://web.nexor.co.uk/mak/doc/robots/robots.html > > a) no means of determining who was running robot (finger and rusers > gave no info on remote machine) > b) the robot was making parallel rather than sequential requests > (I determined this by trapping the $REMOTE_HOST with sleeping processes > and saw multiple ones spawned before any timeouts) > c) after the robot had access cut off, it continued to make rapid-fire > requests (in this case it was cut off after 465 requests, and made > another 920 requests (total of 1385 in a 3 hour period) > before being stopped by a helpful sysadmin at the site in question > d) the person running the robot was neither monitoring nor available. This was another attempt to mirror his server. From the robot operator: : [apology deleted] : We were preparing the cache data for Interop to demonstrate how : wonderful the WWW system is, because the place will not have the : bandwidth enough to make realtime connections. And your host was, : unfortunate for you, one of the most fascinating server to : demonstrate the charm of WWW. : The fault was caused by our technical mistake and I have to say that : we were too careless about automated downloading, although I think : caching and mirroring can be a good technology if they have had done : in good manner and with deep thought and matured method. This confirms my belief that the Web would benefit from a well-written and maintained caching program. Only today someone posted a request for such a program to c.i.w.p. In this case the response was a number of bulk mailings to the administrator of the remote site to attract attention. This had the desired effect, although I think it is unfortunate this sort of action is required. > I plan to add an explicit warning on my front page re automated > downloads (ROBOTS BEWARE) -- I'm now set up to detect automatically > the above conditions and mail the log of requests from that site > (typically a 100kb file) to the sysadmin at the site in question in > response to each subsequent request (I can spare the > bandwidth). when everyone has form-parsing clients the problem will > ameliorate since I can hide the database behind an explicit POST or > two.
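The detection set-up Ginsparg describes amounts to a sliding window per client host. A rough server-side sketch (modern Python; the window and limit are invented thresholds, not anything from his message):

    import time
    from collections import defaultdict, deque

    WINDOW = 60.0   # seconds; illustrative threshold
    LIMIT = 30      # requests per window; illustrative threshold
    history = defaultdict(deque)

    def rapid_fire(remote_host, now=None):
        # Record one request and report whether remote_host has now
        # exceeded LIMIT requests within the last WINDOW seconds.
        now = time.time() if now is None else now
        q = history[remote_host]
        q.append(now)
        while q and q[0] < now - WINDOW:
            q.popleft()             # forget requests outside the window
        return len(q) > LIMIT

Parallel requests, Ginsparg's point (b), show up in such a window as a burst far beyond anything a human reader could click.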
From /CN=robots-errors/@nexor.co.uk Thu Jul 14 16:52:00 1994 Return-Path: Delivery-Date: Thu, 14 Jul 1994 16:52:56 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 14 Jul 1994 16:52:00 +0100 Date: Thu, 14 Jul 1994 16:52:00 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:187660:940714155203] Content-Identifier: Re: Represent... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 14 Jul 1994 16:52:00 +0100; Alternate-Recipient: Allowed From: Paul Harrington Message-ID: <1069.9407141551@tamdhu.cs.st-andrews.ac.uk> To: Charlie Stross Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9407141527.aa23834@ruddles.sco.com> Subject: Re: Representing big webs visually Status: RO Content-Length: 1878 Charlie> Has anyone got any ideas about good methods of visually Charlie> representing large webs? Lots of them! ... but most are untested. :-) A colleague of mine in Trinity College in Dublin and I developed a simple web-walker which -- like yourself -- output a description in neato format and generated an image map. Originally, we intended to use this as a 'goto' mechanism for home pages: they are very useful for getting a handle on the structure of a local web and how it is connected to other webs. Our visual encoding techniques are _very_ primitive but were encouraging enough to make us want to do more work on the area. However, being the lazy people that we are, we are waiting for the emergence of some kind of web description interchange format from the spider writers. Charlie> Am I reinventing the wheel here, or is this terra incognita? I think it is mostly terra incognita. There are a few people who have contacted me over the last year who may be working on it still. I can put you in touch with them if you mail me. I am busy trying to write papers at the moment so treat all of the stuff below as 'aspirational'. We want to take up a lot of the formal work that has been done on Hypertext systems (see the CACM special issue from some time last year for a good pointer to surveys) and experiment with applying it to web structures. Other systems that may be of some use for resource characterisation are freemont and indie. I have a prejudice that says that visualisation of an object is no good unless there is some input from the server(s) which serve the object, e.g. frequency of access, frequency of remote access, age, frequency of change. I am hoping that one of the most useful applications of Web visualisation will be the representation of resource location queries. How is that for a "pie in the sky" statement?
pjjH
From /CN=robots-errors/@nexor.co.uk Mon Jul 18 21:13:24 1994 Return-Path: Delivery-Date: Mon, 18 Jul 1994 21:14:04 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 18 Jul 1994 21:13:24 +0100 Date: Mon, 18 Jul 1994 21:13:24 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:089410:940718201325] Content-Identifier: Other Web Ser... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 18 Jul 1994 21:13:24 +0100; Alternate-Recipient: Allowed From: " (Mary Morris)" Message-ID: <9407181923.AA02834@thyme.finesse.com> To: /CN=robots/@nexor.co.uk Subject: Other Web Servers X-Sun-Charset: US-ASCII Status: RO Content-Length: 229 Hi I went through the Net-Happenings mail list and found ~280 servers that aren't on anyone's lists. I have a list of http pointers that I can send to anyone who wants to take a robot out and register these guys. Mary Morris From /CN=robots-errors/@nexor.co.uk Wed Jul 20 11:33:23 1994 Return-Path: Delivery-Date: Wed, 20 Jul 1994 11:33:50 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 20 Jul 1994 11:33:23 +0100 Date: Wed, 20 Jul 1994 11:33:23 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:100520:940720103327] Content-Identifier: IP multicast ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 20 Jul 1994 11:33:23 +0100; Alternate-Recipient: Allowed From: Paul Harrington Message-ID: <21786.9407201033@tamdhu.cs.st-andrews.ac.uk> To: /CN=robots/@nexor.co.uk Subject: IP multicast for robot 'map' interchange? Status: RO Content-Length: 499 I read an interesting paper a week or so ago entitled "Drinking from the Firehose: Multicast USENET News" by Kurt Lidl ('usenix-muse.ps'). Has any work been done on using a similar mechanism for propagating robot information? Or page updates? Would it be possible to reverse the way that some basic robots operate, i.e. change them to be passive monitoring agents that listen for updates/{classification information} which would be transmitted by http servers and/or local robots? comments?
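The receiving half of such a passive agent is genuinely small. A sketch (modern Python; the group address, the port, and the idea that one datagram carries one page-update announcement are all invented here -- the interchange format is exactly what would have to be agreed first):

    import socket
    import struct

    GROUP, PORT = '239.192.0.1', 8123      # hypothetical group and port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', PORT))
    # Join the multicast group on the default interface.
    mreq = struct.pack('4sl', socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, sender = sock.recvfrom(65535)
        # One announcement per datagram; parsing is left open.
        print(sender, data[:200])

The hard part is not the plumbing but agreeing on what the announcements say, which loops back to the shared-format discussion.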
From /CN=robots-errors/@nexor.co.uk Wed Jul 20 16:41:52 1994 Return-Path: Delivery-Date: Wed, 20 Jul 1994 16:42:52 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 20 Jul 1994 16:41:52 +0100 Date: Wed, 20 Jul 1994 16:41:52 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:154700:940720154155] Content-Identifier: Status update... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 20 Jul 1994 16:41:52 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" Message-ID: <9407201517.AA00648@fuzine.mt.cs.cmu.edu> To: /CN=robots/@nexor.co.uk Cc: fuzzy@CMU.EDU Subject: Status update: Lycos search engine now on-line Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 1247
Lycos now supports searches of its database of WWW documents. Please access the search page from the Lycos Home Page, http://fuzine.mt.cs.cmu.edu/mlm/lycos-home.html, because once this service becomes popular I will probably move the index and search server to another computer. The very first anchor in the Lycos Home Page is the SEARCH. Lycos' database was collected during June, and contains summaries of 54,000 documents (about 41 megabytes of summaries). Current plans are:
1. Add the 250,000 documents for which I have only descriptions to the search database.
2. Resume Lycos exploration. Lycos has not been fetching documents since June.
3. Experiment with best-first search.
4. Release a copy of the PURSUIT search engine for educational and research use. PURSUIT provides HTML-formatted search results from (almost) arbitrary text files. It is suitable for running via CGI from httpd. If you would like to be a beta tester for PURSUIT, please send email to fuzzy@cmu.edu
-- Michael L. Mauldin, Carnegie Mellon Univ., fuzzy@cmu.edu, http://fuzine.mt.cs.cmu.edu/mlm/home.html ("How big is the Web? You may think it's a long way to the chemist's, but that's peanuts compared to the Web, listen...")
From /CN=robots-errors/@nexor.co.uk Thu Jul 21 15:21:58 1994 Return-Path: Delivery-Date: Thu, 21 Jul 1994 15:23:20 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 21 Jul 1994 15:21:58 +0100 Date: Thu, 21 Jul 1994 15:21:58 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:287150:940721142203] Content-Identifier: ANL/MCS/SIGGR... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 21 Jul 1994 15:21:58 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"28708 Thu Jul 21 15:21:44 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: ANL/MCS/SIGGRAPH/VROOM Walker Status: RO Content-Length: 637
And another robot enters (or should I say runs over) the net... I noticed them in my logs today. From the active.html:
| ANL/MCS/SIGGRAPH/VROOM Walker
| Owner/Maintainer unknown.
| Identification: sets User-agent to ANL/MCS/SIGGRAPH/VROOM Walker, and From to olson.anl.gov
| Another rapid-fire robot that doesn't use the robot exclusion protocol. Depressing.
Anybody got any more details/pointers, or know people at olson.anl.gov?
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html
From /CN=robots-errors/@nexor.co.uk Mon Aug 8 18:11:04 1994 Return-Path: Delivery-Date: Mon, 8 Aug 1994 18:12:12 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 8 Aug 1994 18:11:04 +0100 Date: Mon, 8 Aug 1994 18:11:04 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:167850:940808171105] Content-Identifier: Re: ANNOUNCE:... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 8 Aug 1994 18:11:04 +0100; Alternate-Recipient: Allowed From: " (David Eichmann)" Message-ID: <9408081712.AA25100@rbse.jsc.nasa.gov> To: /CN=robots/@nexor.co.uk Subject: Re: ANNOUNCE: A WWWWorm for the mac! X-Sender: eichmann@192.88.42.10 MIME-version: 1.0 Content-type: text/plain; charset="us-ascii" Content-transfer-encoding: 7BIT Status: RO Content-Length: 2577
Oh goodie! Now everyone can be the proud owner of a web-wide process. This hasn't touched my site... yet. - Dave
>Received: by rbse.jsc.nasa.gov (4.1/SMI-4.0:RAL-041790) id AA24710; Mon, 8 Aug 94 11:24:41 CDT >Message-Id: <9408081624.AA24710@rbse.jsc.nasa.gov> >From: lemieuse@ERE.UMontreal.CA (Lemieux Sebastien) >Date: Sun, 7 Aug 1994 01:17:37 GMT >Newsgroup: comp.infosystems.www.providers/2778 >Subject: ANNOUNCE: A WWWWorm for the mac! >Apparently-To: eichmann
>Hi WebSurfers,
> I've just finished programming a WWW Worm for the Mac. Inspired by the famous WWWWorm that is currently accessible through the net, I programmed it to be used by people wishing to "index" the web, i.e. get URLs that provide information on a specific topic.
>What it does: it systematically scans the WWW and evaluates each page according to a keyword search. If the evaluation is successful, then the URL of the page is kept.
>What are the results and interests: if you are making up WWW pages for a specific topic, one of your heavier jobs will be to accumulate enough URLs to interesting sites to make your page worthwhile for the people that will be using it. By setting the keywords properly, the worm will do the dirty job for you and automatically write you a valid HTML document that can be used to test the URLs found.
> In about 8 hours of searching, it produced a 359-entry HTML document pointing toward most of the molecular biology pages in the world!
>The worm is currently all written in French (I'm French-speaking!), so I don't think it is ready for distribution. Basically, I'm calling for comments on the project and want to know how many people are interested in such a project.
>Any comments, collaborations, proposals or suggestions are welcome.
>The Web's gonna get scanned!
>-- >| Sebastien Lemieux, dept. biol. | rootPGPCryptoDESEscrowCIAHackDSSRIPEM >| lemieuse@alize.ERE.UMontreal.CA | NSASkipJackFBIKerberosRSACapstoneNIST >| PGP public key on finger. ------------- AnonymousMailClipperChip >| http://alize.ere.umontreal.ca:8001/~lemieuse/ | UFC-FastCryptCrackPassWd
----------- David Eichmann Asst. Prof. / RBSE Director of R & D Software Engineering Program Phone: (713) 283-3875 University of Houston - Clear Lake fax : (713) 283-3810 Box 113, 2700 Bay Area Blvd.
Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html
From m.koster@nexor.co.uk Tue Aug 9 13:56:17 1994 Return-Path: Delivery-Date: Tue, 9 Aug 1994 14:01:07 +0100 Received: from nexor.co.uk (actually victor.nexor.co.uk) by lancaster.nexor.co.uk with SMTP (PP); Tue, 9 Aug 1994 13:56:17 +0100 To: d-garaffa@ski.mskcc.org (Dave Garaffa) cc: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk Subject: Re: ANNOUNCE: A WWWWorm for the mac! In-reply-to: Your message of "Tue, 09 Aug 1994 08:34:49 CDT." Date: Tue, 09 Aug 1994 13:55:59 +0100 From: Martijn Koster Status: RO Content-Length: 2247
d-garaffa@ski.mskcc.org (Dave Garaffa) wrote:
> Martijn, > I do see your point but we still have the problem of how to best find the > information that serves our users...
Of course. This problem has always existed, even in pre-electronic times.
> Okay so the idea of having 1000's of robots riding that web ad infinitum is > not a good thing. However who should say that NEXOR *CAN* run a robot and > Memorial Sloan-Kettering Cancer Center *CANT* ( I don't even know if you > do run one, this is just an example ) Why should one group have the > indexes local and someone else not?
[ Just for the record, I don't run a robot at NEXOR. ] I don't make judgements on who can or cannot run a robot. But I don't want robots to roam the web looking for _one particular query_. If they build databases of info which you can later search for queries, great (well, maybe not great, but fine). I do try to dissuade people from giving robots away -- that is adding to the problem, not solving it.
> How about this?? If an institution is going to run a worm then they must > do the following... > 1] Make their database searchable to all web-surfers > 2] Make their *DATA* ftpable to all server maintainers so we can index the > data in a way we think is best. > Its not too much to ask and
Yes, absolutely spot on. This I have been campaigning for for ages. You will find most robot authors quite willing to share their data -- it's just that nobody asks. I personally would like to see a simple standard data format for this exchange -- it'd make it all so much easier. However, not having a robot of my own (nor the time/intention to write and maintain one), there is little I can do to speed up that progress.
> it would save me the trouble of operating my own worm...
Sure. It'd save you time and effort, and it'd save net bandwidth and server load.
> Any thoughts??
Communicate with the authors of robots, and see if together there is something that can be done. This is part of what the robots mailing list is meant for. Cheers,
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html
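To make the "simple standard data format" wish concrete -- this is purely a hypothetical illustration, not a format anyone on the list had agreed on -- an exchange record could be as small as one tab-separated line per document (Python):

    # Hypothetical interchange record: URL, last-modified date, title.
    # The field choice and order are invented for illustration only.
    records = [
        ("http://web.nexor.co.uk/mak/mak.html",
         "Tue, 09 Aug 1994 13:55:59 GMT",
         "Martijn Koster"),
    ]
    for url, modified, title in records:
        print(url, modified, title, sep="\t")

A robot's whole database dumped in a shape like that could be mirrored by ftp and re-indexed locally, which is all that point 2] above asks for.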
From mak@nexor.co.uk Tue Aug 9 09:04:52 1994 To: lemieuse@ERE.UMontreal.CA, lemieuse@alize.ERE.UMontreal.Ca cc: robots Subject: Your WWW robot for the web Date: Tue, 09 Aug 1994 09:04:52 +0100 From: Martijn Koster Status: RO Content-Length: 1507
I was alerted to the fact that you are writing a Macintosh robot. Are you aware of the material on ? The main things it contains are a list of robots, guidelines for robot writers, a robot exclusion standard, and a mailing-list (to which I have cc'ed this message) I have added your robot to the list of robots, but would appreciate some further details, especially how to identify it (what does it use for User-agent?). I also strongly urge you to comply with the guidelines and the standard for robot exclusion -- not doing so will give people no control of what your robot does to their server, resources, time and effort, and they will get rather upset.
You may find it useful to join the robots list -- you may find other people interested in robots, many of whom run robots, who can help you in your requirements, without having to resort to scanning the web yourself. Finally I'd like to urge you not to distribute your robot -- it is just too easy to be abused by people. If only 100 people would regularly run your robot that alone would give a noticeable overhead to a number of resources. You write:
> The Web's gonna get scanned!
Sure, but it is very important to be real careful about it... Regards,
-- Martijn Koster (webmaster for web.nexor.co.uk) __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html
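For reference, the exclusion standard urged above works through a plain-text /robots.txt file on the server, made up of User-agent and Disallow records. The paths below are made-up examples; a server wanting to turn all robots away from its CGI and temporary areas might publish:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/

A robot that identifies itself with a distinctive User-agent can also be singled out in such a record, which is one more reason for the identification details requested above.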
From @cs.st-andrews.ac.uk:phrrngtn@cs.st-andrews.ac.uk Wed Aug 10 10:03:25 1994 Return-Path: <@cs.st-andrews.ac.uk:phrrngtn@cs.st-andrews.ac.uk> Delivery-Date: Wed, 10 Aug 1994 10:03:36 +0100 Received: from cs.st-andrews.ac.uk by lancaster.nexor.co.uk via JANET with NIFTP (XTPP) id <12218-0@lancaster.nexor.co.uk>; Wed, 10 Aug 1994 10:03:25 +0100 Message-Id: <6300.9408100903@tamdhu.cs.st-andrews.ac.uk> Received: from jameson by tamdhu.cs.st-andrews.ac.uk; Wed, 10 Aug 94 10:03:53 BST To: Martijn Koster Cc: /CN=robots/@nexor.co.uk Subject: Output from spiders. Was Re: ANNOUNCE: A WWWWorm for the mac! In-Reply-To: Your message of "Tue, 09 Aug 1994 14:01:39 BST." <"25831 Tue Aug 9 14:01:07 1994"@nexor.co.uk> Date: Wed, 10 Aug 1994 10:03:52 +0100 From: Paul Harrington Status: RO Content-Length: 643
>> How about this?? If an institution is going to run a worm then they must >> do the following... >> 1] Make their database searchable to all web-surfers >> 2] Make their *DATA* ftpable to all server maintainers so we can index the >> data in a way we think is best. >> Its not too much to ask and
Martijn> Yes, Absolutely spot on. This I have been campaigning for for Martijn> ages. You will find most robot authors quite willing to share their Martijn> data -- it's just that nobody asks.
Ok, I'm asking! Can anyone give me pointers to some spider output, together with a description of the format of that output?
From /CN=robots-errors/@nexor.co.uk Wed Aug 10 12:33:21 1994 Return-Path: Delivery-Date: Wed, 10 Aug 1994 12:34:57 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 10 Aug 1994 12:33:21 +0100 Date: Wed, 10 Aug 1994 12:33:21 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:149290:940810113325] Content-Identifier: MOMspider is ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 10 Aug 1994 12:33:21 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" Message-ID: <9408100430.aa21665@paris.ics.uci.edu> To: libwww-perl@ics.uci.edu, /CN=robots/@nexor.co.uk Cc: vg@dcs.edinburgh.ac.uk, bbehlen@soda.csua.berkeley.edu, sugino@bart.sps.mot.com, shelden@fatty.law.cornell.edu, grimes@raison.mro.dec.com, mtaipale@dxcern.cern.ch, mvanheyn@cs.indiana.edu, altis@ibeam.jf.intel.com, casey@ptsun00.cern.ch, jlh@linus.mitre.org, dmk@allegra.att.com, Gary.Adams@east.sun.com Subject: MOMspider is now publicly available Status: RO Content-Length: 1839
Hello all, This is just going out to people who I know have been waiting for this release for much-too-long-a-time. I have done my best to make the software as robust as possible and have tested it under a variety of conditions. Note that this is not an alpha or beta release -- the software is robust enough to be marketable (even though I am just giving it away). The only fault that still remains is that the documentation is rather paltry. That should be fixed in a future patch. MOMspider is a bit unlike other Web spiders/robots in that it is not engaged in resource discovery and thus generates very little load on remote servers. However, please let me know immediately if anyone starts running it in an irresponsible manner.
Please do not rebroadcast this message to the high-profile mailing lists and especially not to netnews. If everything looks like it is going fine, I will send a general announcement to www-talk and, later, to comp.infosystems.www.providers. Please do let me know if you have any problems installing/testing it or you find the documentation incomprehensible or lacking some important bit. The earlier I can catch any problems, the more likely they will be solved before the great unwashed masses get their hands on it. ;-) MOMspider can be retrieved via HTTP from http://www.ics.uci.edu/WebSoft/MOMspider/ or by anonymous ftp from ftp://liege.ics.uci.edu/pub/arcadia/MOMspider/ See the file INSTALL.txt for notes on how to install the program and other info. In particular, note that I would like to receive a copy of the initial test results from sites that are using MOMspider. ....Roy Fielding ICS Grad Student, University of California, Irvine USA (fielding@ics.uci.edu) About Roy
From /CN=robots-errors/@nexor.co.uk Sun Aug 14 23:34:41 1994 Replied: Mon, 22 Aug 1994 10:06:53 +0100 Replied: /CN=robots/@nexor.co.uk Replied: chakl Return-Path: Delivery-Date: Sun, 14 Aug 1994 23:35:28 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 14 Aug 1994 23:34:41 +0100 Date: Sun, 14 Aug 1994 23:34:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:218920:940814223442] Content-Identifier: new robot pre... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 14 Aug 1994 23:34:41 +0100; Alternate-Recipient: Allowed From: chakl Message-ID: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE> To: /CN=robots/@nexor.co.uk Subject: new robot pre-announce X-Mailer: ELM [version 2.4 PL5] Content-Type: text Status: RO Content-Length: 2171
Hi, I have also written a robot that has just made its first steps out of my local system (but staying in the local area). I still find bugs in the handling of "non-standard" URLs, and I won't let the robot go out of area until these are fixed. Consider this a pre-announce :-)
Purpose
-------
The robot is intended to retrieve subgraphs of the Web, saving each document to local disk. Files ending in ".html" are parsed for further URLs, which are then retrieved recursively. The robot will NOT follow links to other servers than the initial one, and will NOT follow links to documents below the directory of the initial URL. It will also ignore links it can't handle. Its main purpose is to retrieve complete hyperdocuments that are spread over several files on the remote server (typically 3-100 files). I'm running a non-networked system with dialup IP and a local httpd (Linux 1.0, NCSA httpd 1.1, term). I find it useful to have local copies of documents with "long-term value" (e.g. the httpd docs) rather than having to establish a dialup connection each time.
Further Plans
-------------
This program is rather a test of the robot mechanics routines I have written, limiting the possible Web load to a small number of documents. In the long term, I'd like to couple these 'mechanics' to an AI system. I'm a student working in Distributed AI, hacking an experimental LISP-based multiagent testbed for fine food (read: money ;-) I imagine a society of cooperating intelligent agents in the Web domain, each agent being an expert in some particular area. So if my-agent gets a request from another agent (possibly human :-), it might know that your-agent is specialized in this area and would contact your-agent directly using some inter-agent language and knowledge-exchange format.
Misc
----
I'm aware of the material on M. Koster's robot page. The robot follows the Guidelines and the Exclusion Standard. Written in Perl, based on libwww 0.12 (I wasn't aware of 0.30 then). Many thanks to the authors. Comments welcome.
ciao, chakl chakl is Olaf Schreck, FU Berlin information science, student chakl@fu-berlin.de olafabbe@w250zrz.zrz.tu-berlin.de
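A sketch of the scoping test such a subtree mirror needs -- same server only, staying within the subtree of the starting URL. This is one reading of the rule (the announcement says "below", which Martijn queries in a later message), and the code is illustrative Python, not chakl's Perl:

    from urllib.parse import urlparse
    import posixpath

    def in_scope(start_url, candidate_url):
        """Should a recursive mirror started at start_url follow candidate_url?"""
        start, cand = urlparse(start_url), urlparse(candidate_url)
        if cand.scheme != "http" or cand.netloc != start.netloc:
            return False  # never leave the initial server
        # stay within the directory subtree of the initial URL
        start_dir = posixpath.dirname(start.path) or "/"
        prefix = start_dir.rstrip("/") + "/"
        return posixpath.normpath(cand.path).startswith(prefix)

    # in_scope("http://host/docs/index.html", "http://host/docs/a/b.html") -> True
    # in_scope("http://host/docs/index.html", "http://other/docs/b.html")  -> False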
From /CN=robots-errors/@nexor.co.uk Mon Aug 15 16:15:43 1994 Replied: Mon, 22 Aug 1994 10:08:10 +0100 Replied: /CN=robots/@nexor.co.uk Replied: Billy Barron Replied: " (chakl)" Return-Path: Delivery-Date: Mon, 15 Aug 1994 16:17:23 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 15 Aug 1994 16:15:43 +0100 Date: Mon, 15 Aug 1994 16:15:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:006700:940815151545] Content-Identifier: Re: new robot... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 15 Aug 1994 16:15:43 +0100; Alternate-Recipient: Allowed From: Billy Barron Message-ID: <94Aug15.101514cdt.14417@utdallas.edu> To: " (chakl)" Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE> Subject: Re: new robot pre-announce X-WWW-Page: http://www.utdallas.edu/acc/billy.html X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 564
In reply to chakl's message: >I imagine a society of cooperating intelligent agents in the Web domain, >each agent being an expert in some particular area. So if my-agent gets >a request from another agent (possibly human :-), it might know that >your-agent is specialized in this area and would contact your-agent >directly using some inter-agent language and knowledge-exchange format.
Look at http://rd.cs.colorado.edu/harvest/. Such a system is under development. -- Billy Barron, Network Services Manager, Univ of Texas at Dallas billy@utdallas.edu
From /CN=robots-errors/@nexor.co.uk Mon Aug 22 10:09:59 1994 Return-Path: Delivery-Date: Mon, 22 Aug 1994 10:12:43 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 22 Aug 1994 10:09:59 +0100 Date: Mon, 22 Aug 1994 10:09:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:178740:940822091000] Content-Identifier: Re: new robot... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 22 Aug 1994 10:09:59 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"17869 Mon Aug 22 10:09:47 1994"@nexor.co.uk> To: Billy Barron Cc: " (chakl)" , /CN=robots/@nexor.co.uk In-Reply-To: <94Aug15.101514cdt.14417@utdallas.edu> Subject: Re: new robot pre-announce Status: RO Content-Length: 361
Billy Barron wrote: > Look at http://rd.cs.colorado.edu/harvest/. Such a system is under > development.
Definitely required reading, especially the main paper on Harvest. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html
From /CN=robots-errors/@nexor.co.uk Mon Aug 22 10:07:13 1994 Return-Path: Delivery-Date: Mon, 22 Aug 1994 10:08:53 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 22 Aug 1994 10:07:13 +0100 Date: Mon, 22 Aug 1994 10:07:13 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:177850:940822090715] Content-Identifier: Re: new robot...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 22 Aug 1994 10:07:13 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"17781 Mon Aug 22 10:06:58 1994"@nexor.co.uk> To: chakl Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE> Subject: Re: new robot pre-announce Status: RO Content-Length: 558
> Consider this a pre-announce :-)
For the active list: Has it got a name? What does it use as User-Agent? Where is it run from? Has it got a page with details? Have you got a personal page?
> [it] will NOT follow links to documents below the directory of the > initial URL.
Ehr... while I always like restrictions, I don't understand the reasoning behind this one? -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html
From /CN=robots-errors/@nexor.co.uk Sun Sep 18 19:06:49 1994 Return-Path: Delivery-Date: Sun, 18 Sep 1994 19:07:39 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 18 Sep 1994 19:06:49 +0100 Date: Sun, 18 Sep 1994 19:06:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:088110:940918180650] Content-Identifier: where are rob... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 18 Sep 1994 19:06:49 +0100; Alternate-Recipient: Allowed From: James Binkley Message-ID: <9409181803.AA18218@cs.pdx.edu> To: chakl Cc: /CN=robots/@nexor.co.uk, jrb@cs.pdx.edu In-Reply-To: <9408142234.AA04543@w250zrz.zrz.TU-Berlin.DE> Subject: where are robots going anyway? (was Robot pre-announce) Status: RO Content-Length: 4197
I've been working with a couple of students at Portland State University for a number of years (very slowly...) on an internet info-retrieval system. If you want to read about it, check out the "rama" TR paper via my home page: http://www.cs.pdx.edu/~jrb (the ASCII version is more up to date, but there are later developments, of course). The basic idea for rama is that you have a server that takes asynchronous queries from a number of users on a daily basis, searches some number of remote information bases (typically USENET news), and returns results via email. The original query mechanism was just pattern matching, but it is about to be upgraded to a combination of pattern matching via "agrep" and relevance feedback and "NOT" (probably the most important feature :->). The system is basically used for searching local USENET news. The server is capable of being extended to search other information spheres (e.g., we have an ftp "web-walker" but it hasn't been used much). Anyway, I've been puzzling over how it might be extended to work with the WWW. Right now there isn't a client interface, but something simple with forms could be cobbled together. Back-end work is more puzzling to me. (Yes, I know about Harvest, although I've got to do some reading and some thinking there yet.) Our search paradigm up until now has been something like this:
- you can search a "remote" directory (for news, read: a newsgroup);
- you can immediately throw out anything that isn't NEW (typically we only search for new stuff since yesterday);
- we search on subject, body, or both (all), and on "map", which is a fuzzy notion that currently means "give me some idea about the terrain I am searching".
Subject for news means the subject line;
for ftp it means a filename. NNTP can have the server search since a given time -- which is extremely useful.
In reference to Martijn's previous comment about why searching one HTML doc would be useful: it depends on what you are doing, I would think. If you are running something like Lycos, you are walking a subset of the web/world and trying to build an index for it, so naturally you want a "wide-area" walker. On the other hand, I might just want to search the NCSA "what's new" page to see what's new in it (e.g. assume they add a few new items and don't roll the entire content over every time they change it...). I think the question here is: "is there one web-walker model or many?" And what are those models? (And where the heck are web-walkers evolving to anyway?)
My other question is: assume rama, what should a model web-walker be? My inclination is to think that rama should have something like what chakl suggests: a limited web-walker that sticks only to one site, and to HTML docs at that site only. It uses HEAD to determine the date, but unfortunately still has to "walk the directory" (that html doc) to get lower-level URLs (although searching the non-URL contents can be skipped). It can optionally not look farther afield than a given home page if a user wants. (ls, not ls -R)
What is useful about the rama server is that queries by a lot of folks can be centralized through it, and query optimization (and caching) can take place. (There is not much done there now, but it is eminently possible and should be done RSN.) Right now, I'm not thrilled about the notion of agents communicating with other agents, since I have visions of security problems and, more importantly, Mickey Mouse and all those brooms (:-> by which I mean scalability). I can see something like a centralized Lycos system that accepts remote-machine "distributed queries". That seems less of a jump.
Regarding rama, I've wondered if some sort of merger with something like the CERN proxy server would be useful. The proxy server caches what the local "affinity" group uses. Certainly a cache like that is useful. Maybe a model where remote users walk webs with results cached locally makes more sense than trying to walk a LOT of webs at once and indexing all the results. Just a thought. Certainly a lot of possibilities :->. Comments very welcome.
Jim Binkley cs, Portland State University jrb@cs.pdx.edu
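A minimal sketch of the HEAD-based date check mentioned above, i.e. deciding whether a document is NEW since a cutoff without transferring its body. The host, path, and one-day cutoff are illustrative only (Python):

    import http.client
    from email.utils import parsedate_to_datetime
    from datetime import datetime, timedelta, timezone

    def modified_since(host, path, cutoff):
        """HEAD the document and report whether its Last-Modified
        header is newer than `cutoff` (an aware datetime)."""
        conn = http.client.HTTPConnection(host)
        conn.request("HEAD", path)      # headers only, no body transferred
        resp = conn.getresponse()
        stamp = resp.getheader("Last-Modified")
        conn.close()
        if stamp is None:
            return True                 # no date given: assume it may be new
        return parsedate_to_datetime(stamp) > cutoff

    # "throw out anything that isn't NEW since yesterday":
    yesterday = datetime.now(timezone.utc) - timedelta(days=1)
    print(modified_since("www.example.com", "/whats-new.html", yesterday))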
From /CN=robots-errors/@nexor.co.uk Mon Aug 15 22:35:42 1994 Replied: Mon, 22 Aug 1994 10:23:38 +0100 Replied: Matthew K Gray Return-Path: Delivery-Date: Mon, 15 Aug 1994 22:36:32 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 15 Aug 1994 22:35:42 +0100 Date: Mon, 15 Aug 1994 22:35:42 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:042460:940815213544] Content-Identifier: Wandex, the W... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 15 Aug 1994 22:35:42 +0100; Alternate-Recipient: Allowed From: Matthew K Gray Message-ID: <9408152135.AA12434@deathtongue.MIT.EDU> To: /CN=robots/@nexor.co.uk Subject: Wandex, the World Wide Web Wanderer Index X-Url: http://www.mit.edu:8001/people/mkgray/mkgray.html Status: RO Content-Length: 1041
I've finally set up access to the indexes generated by W4 (the World Wide Web Wanderer). It indexes 13000 documents from more than 5000 different sites. Please don't distribute the URL in public forums (such as c.i.www.*) just yet, but tell whoever you want and use it. In addition to the full web search, it allows one to search a number of other documents and their children. These indices contain more data than the full web index. All of the other indices are in the process of being generated right as I compose this message and may take a while to finish, so don't expect them to be complete yet. It currently appears that the web is growing faster than I can update the comprehensive list, but I'll keep updating it (eventually). Suggestions for other document trees to index are welcome, but I may ignore them :-) ...Matthew
Relevant URLs: Wandex http://www.mit.edu:8001/cgi/wandex Comprehensive List http://www.mit.edu:8001/people/mkgray/compre3.html Me http://www.mit.edu:8001/people/mkgray/mkgray.html
From /CN=robots-errors/@nexor.co.uk Tue Aug 30 08:49:51 1994 Return-Path: Delivery-Date: Tue, 30 Aug 1994 08:51:10 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 30 Aug 1994 08:49:51 +0100 Date: Tue, 30 Aug 1994 08:49:51 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:298170:940830074954] Content-Identifier: The latest on... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 30 Aug 1994 08:49:51 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"29809 Tue Aug 30 08:49:33 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: The latest on the Mac HC robot... Status: RO Content-Length: 3166
Found this in my mailbox; it may be of interest to the list, as this robot was discussed here before...
------- Forwarded Message
Return-Path: Delivery-Date: Mon, 29 Aug 1994 00:05:27 +0100 Received: from condor.CC.UMontreal.CA by lancaster.nexor.co.uk with SMTP (XTPP); Mon, 29 Aug 1994 00:05:15 +0100 Received: from eole.ERE.UMontreal.CA by condor.CC.UMontreal.CA with SMTP id AA05758 (5.65c/IDA-1.4.4 for m.koster@nexor.co.uk); Sun, 28 Aug 1994 19:03:40 -0400 Received: from alize.ERE.UMontreal.CA by eole.ERE.UMontreal.CA (940406.SGI/5.17) id AA09306; Sun, 28 Aug 94 19:03:39 -0400 Received: by alize.ERE.UMontreal.CA (940406.SGI/5.17) id AA17708; Sun, 28 Aug 94 19:03:38 -0400 Message-Id: <9408282303.AA17708@alize.ERE.UMontreal.CA> Subject: [announce] Mac WWW Worm From: lemieuse@ERE.UMontreal.CA (Mac WWW Worm) Date: Sun, 28 Aug 1994 19:03:37 -0400 (EDT) Reply-To: lemieuse@ERE.UMontreal.CA (Sebastien Lemieux) To: m.koster@nexor.co.uk X-Mailer: fastmail [version 2.4 PL21]
First, apologies to my French colleagues for this English answer; I just didn't want to write it twice...
Here are my present thoughts about that:
1- Due to the net traffic that would be produced by such an easy-to-use 'bot, I decided early on that it should _never_ be widely released.
2- My Mac WWW worm was an engine designed to search for specific topics. It was downloading lots of pages, but kept information about only a small portion of them. This way a lot of net resources are wasted. So, if you were hoping to get such a tool, you should consider using one of the publicly accessible WWW databases instead.
3- Everyone running a bot without letting other people access the data is _wasting_ resources, and should not be permitted to do that...
Anyone interested in the subject of WWW robots should consider reading the following document: http://web.nexor.co.uk/mak/doc/robots/robots.html Before flaming me for not releasing the 'bot, read everything you can find under that URL.
Besides that, the MacWWW worm program still contains lots of neat HyperCard script that can easily be recycled for any internet-based material... I would be happy to share all this material with any other HC-minded people. Be aware that building a net program is no small thing. Even if HC makes it really easy, you should always keep in mind that the internet is a _public_ network. Don't waste others' resources... Anyway, thanks for your interest.
| Sebastien Lemieux, dept. biol. || rootPGPCryptoDESEscrowCIAHackDSSRIPEM | lemieuse@alize.ERE.UMontreal.CA || NSASkipJackFBIKerberosRSACapstoneNIST | PGP public key on finger. || AnonymousMailUFC-FastCryptCrackPassWd http://alize.ere.umontreal.ca:8001/~lemieuse/
- ---------------------------------------------------------------------- This message was reposted by the TCL reposter. For info: lemieuse@ere.umontreal.ca - ----------------------------------------------------------------------
------- End of Forwarded Message
From clv2m@server.cs.virginia.edu Wed Sep 7 19:37:41 1994 Return-Path: Delivery-Date: Wed, 7 Sep 1994 19:37:55 +0100 Received: from virginia.edu (actually host uvaarpa.Virginia.EDU) by lancaster.nexor.co.uk with SMTP (XTPP); Wed, 7 Sep 1994 19:37:41 +0100 Received: from server.cs.virginia.edu by uvaarpa.virginia.edu id aa21729; 7 Sep 94 14:37 EDT Received: from mamba.cs.Virginia.EDU (mamba-fo.cs.Virginia.EDU) by uvacs.cs.virginia.edu (4.1/5.1.UVA) id AA27582; Wed, 7 Sep 94 14:36:55 EDT Posted-Date: Wed, 7 Sep 1994 14:36:33 +0500 Return-Path: Received: by mamba.cs.Virginia.EDU (5.0/SMI-2.0) id AA17297; Wed, 7 Sep 1994 14:36:33 +0500 Date: Wed, 7 Sep 1994 14:36:33 +0500 From: Charles Viles Message-Id: <9409071836.AA17297@mamba.cs.Virginia.EDU> To: hogeveen@fys.ruu.nl, ejk@ux2.cso.uiuc.edu, phillips.cs.ubc.ca@uvacs.cs.virginia.edu, sfsh@rome.classics.lsa.umich.edu, benw@chemistry.leeds.ac.uk, bob@num-alg-grp.co.uk, mike@arl.mil, mln@blearg.larc.nasa.gov, warnock@hypatia.gsfc.nasa.gov, m.koster@nexor.co.uk Cc: /CN=robots/@nexor.co.uk Subject: Latency measurements: TR Available Reply-To: clv2m@uvacs.cs.virginia.edu Status: RO Content-Length: 2248
You are receiving this because you communicated with me at some time during the "TESTCOMMAND" experiment or might otherwise be interested in the paper. The following paper is now available via ftp/WWW. Viles, Charles L. and James C. French. "Availability and Latency of World Wide Web Information Servers", Technical Report CS-94-36, Department of Computer Science, University of Virginia. The report is available in postscript form at ftp://uvacs.cs.virginia.edu/pub/techreports/CS-94-36.ps.Z
Abstract: During a 90 day period in 1994, we measured the availability and connection latency of HTTP (hypertext transfer protocol) information servers. These measurements were made from an Eastern United States site. The list of servers included 192 servers from Europe and 321 servers from North America. Our measurements indicate that on average, 4.6% of North American servers and 5.9% of European servers were unavailable from the measurement site on any given day. As seen from the measurement site, the day-to-day variation in availability was much greater for the European servers than for the North American servers.
The measurements also show a wide variation in availability for individual information servers. For example, more than 80% of all North American servers were available at least 95% of the time, but 5% of the servers were available less than 80% of the time. The pattern of unavailability suggests a strong correlation between unavailability and geographic location. Median connection latency from the measurement site was in the 0.2 - 0.5s range to other North American sites and the 0.4 - 2.5s range to European sites, depending upon the day of the week. Latencies were much more variable to Europe than to North America. The magnitude of the latencies suggests the addition of an MGET method to HTTP to help alleviate large TCP set-up times associated with the retrieval of web pages with embedded images. The data show that 97% and 99% of all successful connections from the measurement site to Europe and North America respectively were made within the first 10 s. This suggests the establishment of client-side time-out intervals much shorter than those used for normal TCP connection establishment.
From /CN=robots-errors/@nexor.co.uk Mon Sep 12 15:59:07 1994 Return-Path: Delivery-Date: Mon, 12 Sep 1994 16:02:26 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 12 Sep 1994 15:59:07 +0100 Date: Mon, 12 Sep 1994 15:59:07 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:060560:940912145919] Content-Identifier: WWW Worm for ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 12 Sep 1994 15:59:07 +0100; Alternate-Recipient: Allowed From: Billy Barron Message-ID: <94Sep12.095839cdt.14462@utdallas.edu> To: /CN=robots/@nexor.co.uk Subject: WWW Worm for Tcl? (fwd) X-WWW-Page: http://www.utdallas.edu/acc/billy.html X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 506
I caught wind of this from a friend:
>Newsgroups: comp.lang.tcl >From: msh@mserv1.dl.ac.uk (M.S. Smith) >Date: 9 Sep 1994 08:15:21 GMT >Organization: Daresbury Laboratory, UK >Subject: WWW Worm for Tcl?
>Hello, I'm about to embark on writing a WWW worm using Tcl. What I >would like to know is if one already exists in Tcl, or if anyone has >already done any groundwork on one?
>Any help will be appreciated, and I would prefer it if you could mail >me directly. Thanks!
> Mark msh@dl.ac.uk
From /CN=robots-errors/@nexor.co.uk Fri Oct 14 07:45:13 1994 Return-Path: Delivery-Date: Fri, 14 Oct 1994 07:46:02 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 14 Oct 1994 07:45:13 +0100 Date: Fri, 14 Oct 1994 07:45:13 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:254370:941014064514] Content-Identifier: Anybody know ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Oct 1994 07:45:13 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"25426 Fri Oct 14 07:41:26 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Anybody know either of these two robots?
Status: RO Content-Length: 2492
------- Forwarded Message
Replied: Thu, 13 Oct 1994 09:49:41 +0100 Replied: lamontg@u.washington.edu Return-Path: Delivery-Date: Wed, 12 Oct 1994 23:25:58 +0100 Received: from saul5.u.washington.edu by lancaster.nexor.co.uk with SMTP (XTPP); Wed, 12 Oct 1994 23:25:38 +0100 Received: by saul5.u.washington.edu (5.65+UW94.4/UW-NDC Revision: 2.30 ) id AA29545; Wed, 12 Oct 94 15:25:26 -0700 Date: Wed, 12 Oct 94 15:25:26 -0700 Message-Id: <9410122225.AA29545@saul5.u.washington.edu> X-Sender: lamontg@saul5.u.washington.edu To: m.koster@nexor.co.uk X-Url: X-Mailer: Lynx, Version 2.3 BETA X-Personal_Name: Lamont Granquist From: lamontg@u.washington.edu Subject: Putative New Webcrawlers
I run http://stein1.u.washington.edu:2012/ at the University of Washington in Seattle, which is not an 'official' UW server -- I'm just a student. I've got a stats page on this at http://stein1.u.washington.edu:2012/admin/wwwstats.html, which was useful in identifying two putative new webcrawlers that I don't see on your list. The information from my logfile indicates:
broo.tele.nokia.fi on 11 Oct 94 between 4:02 am and 4:49 am (47 mins) accessed 367 documents (I believe that this is the entire site) and downloaded 3 megabytes in 47 minutes.
lanczos.maths.tcd.ie on 4 Oct 94 between 7:43 am and 8:43 am (exactly 1 hour) made 109 requests, and then on 10 Oct 94 between 4:22 am and 4:50 am made 93 requests, for a cumulative total of 1.6 megs of files. I also believe that this putative bot isn't following links which are on forms pages, while the Finnish putative bot is following links which are on forms pages (although I think this might be over-cautious behavior on the part of the lanczos bot, simply ending the search on any interactive page... not sure...). Of course it may come back and hit me again. There was very little duplication between the two passes.
Doesn't bug me much, because 3 megs is just a *blip* across the university ethernet, and it was done at reasonably decent times of day... I don't have any ID fields off of these bots -- I don't know how to configure my server to tell me that. I'm running CERN httpd, and if you could tell me how I could modify it to generate reports on bots who are nice and ID themselves, I'd appreciate it (although if you've got this info on your WWW site and I just haven't seen it, don't waste your time telling me more than just the URL...)
------- End of Forwarded Message
From /CN=robots-errors/@nexor.co.uk Fri Oct 14 18:10:07 1994 Return-Path: Delivery-Date: Fri, 14 Oct 1994 18:16:24 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 14 Oct 1994 18:10:07 +0100 Date: Fri, 14 Oct 1994 18:10:07 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:077960:941014171008] Content-Identifier: Re: What if w... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Oct 1994 18:10:07 +0100; Alternate-Recipient: Allowed From: bcutter@pdn.paradyne.com Message-ID: <9410141550.AA01494@paradyne.com> To: uunet!robots Subject: Re: What if we offered a local spider? Reply-To: bcutter@paradyne.att.com X-Sun-Charset: US-ASCII Status: RO Content-Length: 2626
> > The robots discussion that I prompted with my indexing offer gave me an idea.
> > If we built a free spider that operated only via the file system, which > > would build an index mapped to URL-space,
> I suggested this to at least one robot author a while ago in the > context of URL checking (Hi Roy :-), but there are a number of > problems: CGI-script generated pages are excluded, access > authorisation is ignored, and you need to parse server config files to > look at URL mappings.
Roy Fielding's MOMspider would definitely be the best way to do this, because it does walk the web.. However, if you definitely want the bot to walk the filesystem, and you're willing to live with broken URLs due to URL->directory remapping in the server (like /icons in NCSA), and no CGI output.. Back before MOMspider, I implemented a crude link checker called "checkweb" which walked the filesystem checking the integrity of links; as a side effect it also created a map which listed all the page links to other pages... When run with a flag, it produced an experimental (and crude) table of contents.. I've long since abandoned it, but if you must parse by the filesystem rather than the web, you may want to start there: http://www.stuff.com/~bcutter/home/programs/checkweb.html
> > then offered to serve those indexes from here, would people use it?
> Well, by just making the file available on a well-known place anybody > can use a locally-generated map. Ehr, /ls-R.txt?
It would be nice to provide in a flat file a list of all files, a la ls-lR, so rather than doing multiple HEADs against a site, I can pull down the single file and get my last-modified dates and sizes from there.. However, there may be some issues of security/privacy... Most web sites put an "index.html" (or "Welcome.html") file in place so you can't browse the directory structures, and in effect use security through obscurity to force people to access only those pages which are linked. Providing a master ls-lR file would provide a way for me to find out what pages exist in the filesystem, regardless of links. (I'd like to do this on hooho.ncsa.uiuc.edu, which has a number of interesting html pages, most of which are not listed off the home page.)
If we can solve this problem, it would be nice to also regularly generate a file showing the relationship of document links, so robots won't have to walk the web to find this out... (some could prune their search looking just at the files - and those that just walk rather than index won't need to retrieve the HTML files)
-Brooks bcutter@paradyne.att.com
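A sketch of generating the flat "a la ls-lR" file described above: walk the server's document tree and emit one line per file with its URL path, size, and last-modified date. The document root and the exact line format are made up for illustration (Python), and the caveats already noted (remapped URLs, CGI output, pages hidden on purpose) all still apply:

    import os
    import time

    DOCROOT = "/usr/local/httpd/htdocs"   # hypothetical server document root

    # Emit one line per served file: URL path <TAB> size <TAB> last-modified.
    # The line format is invented here; nothing standard is implied.
    for dirpath, _, filenames in os.walk(DOCROOT):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            url_path = "/" + os.path.relpath(full, DOCROOT)
            stamp = time.strftime("%a, %d %b %Y %H:%M:%S GMT",
                                  time.gmtime(st.st_mtime))
            print(url_path, st.st_size, stamp, sep="\t")

A robot could then fetch this one file instead of issuing a HEAD per document, which is exactly the saving Brooks is after.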
From narnett@verity.com Sat Oct 15 15:13:27 1994 Return-Path: Delivery-Date: Sat, 15 Oct 1994 15:13:42 +0100 Received: from verity.com (actually host unknown-143-5.verity.com) by lancaster.nexor.co.uk with SMTP (XTPP); Sat, 15 Oct 1994 15:13:27 +0100 Received: from nasty.verity.com (nasty.verity.com [192.187.143.63]) by verity.com (8.6.6.Beta9/8.6.6.Beta9) with SMTP id HAA15198; Sat, 15 Oct 1994 07:15:38 -0700 Received: from [192.187.143.12] (portanick) by nasty.verity.com (4.1/SMI-4.1) id AA18229; Sat, 15 Oct 94 07:11:47 PDT Message-Id: <9410151411.AA18229@nasty.verity.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 15 Oct 1994 10:14:32 -0500 To: bcutter@paradyne.att.com, m.koster@nexor.co.uk From: narnett@verity.com (Nick Arnett) Subject: Re: What if we offered a local spider? Cc: /CN=robots/@nexor.co.uk Status: RO Content-Length: 1354
At 11:46 AM 10/14/94 -0400, bcutter@pdn.paradyne.com wrote:
>However, if you definitely want the bot to walk the filesystem, and >you're willing to live with broken URLs due to URL->directory >remapping in the server (like /icons in NCSA), and no CGI output..
I think our main goal would be to design a spider that would be very useful for indexing one site, but couldn't be used (easily, anyway) to index remote sites. So we'd obviously rather not end up with broken URLs, etc. It seems that we'd want some combination of each... There's probably no way to avoid using HTTP for part of the indexing, but perhaps its utility could be limited.
>It would be nice to provide in a flat file a list of all files, a la ls-lR, >so rather than doing multiple HEADs against a site, I can pull down the >single file and get my last-modified dates and sizes from there..
Yes, absolutely. I'm considering a feature for our server that would do this *and*, given a date as a parameter, would generate a file that only contained the meta-data for the files that have changed since that date. That would greatly simplify updates. I can't remember if Harvest has something like that, but I have the Harvest paper with me (I'm on the road). Your other suggestions would add a lot of utility to the indexing spider at a very low cost of programming, I suspect. Nick
From /CN=robots-errors/@nexor.co.uk Thu Oct 20 16:53:43 1994 Return-Path: Delivery-Date: Thu, 20 Oct 1994 16:55:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 20 Oct 1994 16:53:43 +0100 Date: Thu, 20 Oct 1994 16:53:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294220:941020155345] Content-Identifier: Web Navigatio... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Oct 1994 16:53:43 +0100; Alternate-Recipient: Allowed From: Bowden Wise Message-ID: <199410201551.AA02397@cs.rpi.edu> To: /CN=robots/@nexor.co.uk Subject: Web Navigational Aids X-Mailer: MH 6.7.1/exmh version 1.5beta 8/10/94 Status: RO Content-Length: 1633
Hi, I am interested in maintaining a map of a user's current document for navigational purposes. Does anyone know of any techniques used to visualize the hypertext structure of the Web? I would like to incorporate these ideas into a Web robot first, and then add them to a browser as a navigational aid. Does anyone know of any existing source code that shows how to set up a search? I prefer C or C++ code.
You can think of the map as a graph with the start document as the initial node. As links are found, they are added to the graph. A breadth-first search of such a graph will visit nodes level by level, visiting all links from the initial node before proceeding to the next level. A depth-first search would visit deeper into the graph first. Obviously, the search must be limited, because you could traverse links forever. One could stop the search once a link is followed that goes off the current document's server, or when the same node is visited again (in the case of a circular path), and by only searching to a certain level from an initial node.
I would like to write a Web robot to build such a graph given a starting page. I prefer to write my code in C or C++. I would be grateful for any pointers to algorithms or code for doing similar searches. Where can I find a minimal robot written in C or C++ that uses the www library?
-------------------------------------------------------------------- - G. Bowden Wise Computer Science Department Internet: wiseb@cs.rpi.edu Rensselaer Polytechnic Institute Troy, NY 12180
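A sketch of the bounded breadth-first traversal described above -- in Python rather than the requested C/C++, purely to keep it short. fetch_links stands in for real retrieval and link extraction; the visited set and depth limit implement two of the stopping rules Bowden lists:

    from collections import deque

    def crawl_map(start_url, fetch_links, max_depth=2):
        """Breadth-first traversal of the link graph from start_url.
        fetch_links(url) must return the URLs linked from that page.
        Returns the map as {url: [linked urls]}."""
        graph = {}
        seen = {start_url}                 # revisit check stops circular paths
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            links = fetch_links(url)
            graph[url] = links
            if depth == max_depth:
                continue                   # level limit: go no deeper
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return graph

A same-server test like the scoping sketch earlier in the archive could be applied inside the loop to implement the "goes off the current document's server" cutoff as well.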
Where can I find a minimal robot written in C or C++ that uses the www library? -------------------------------------------------------------------- - G. Bowden Wise Computer Science Department Internet: wiseb@cs.rpi.edu Rensselaer Polytechnic Institute Troy, NY 12180 From /CN=robots-errors/@nexor.co.uk Fri Oct 21 21:40:26 1994 Return-Path: Delivery-Date: Fri, 21 Oct 1994 21:42:22 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 21 Oct 1994 21:40:26 +0100 Date: Fri, 21 Oct 1994 21:40:26 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:113970:941021204027] Content-Identifier: Re: Web Navig... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 21 Oct 1994 21:40:26 +0100; Alternate-Recipient: Allowed From: " (David Eichmann)" Message-ID: <9410212042.AA11639@rbse.jsc.nasa.gov> To: /CN=robots/@nexor.co.uk Subject: Re: Web Navigational Aids X-Sender: eichmann@192.88.42.10 MIME-version: 1.0 Content-type: text/plain; charset="iso-8859-1" Content-transfer-encoding: quoted-printable Status: RO Content-Length: 1080 At 10:55 AM 10/20/94 +0000, Bowden Wise wrote: ... >I am interested in maintaining a map of a user's current document for >navigational purposes. Does anyone know of any techniques used to >visualize the hypertext structure of the Web? > >I would like to incorporate these ideas into a Web robot first, and then >add them to a browser as a navigational aid. Does anyone know of any >existing source code that shows how to set up a search? I prefer C or C++ >code. This concept was demo'ed at the 2nd WWW Conference in Chicago on Tuesday by Peter Dömel - "Webmap - A Graphical Hypertext Navigation Tool." Peter's email address is doemel@informatik.uni-frankfurt.de. - Dave ----------- David Eichmann Asst. Prof. / RBSE Director of R & D Software Engineering Program Phone: (713) 283-3875 University of Houston - Clear Lake fax : (713) 283-3810 Box 113, 2700 Bay Area Blvd. Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html From /CN=robots-errors/@nexor.co.uk Fri Oct 21 06:01:47 1994 Return-Path: Delivery-Date: Fri, 21 Oct 1994 06:03:02 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 21 Oct 1994 06:01:47 +0100 Date: Fri, 21 Oct 1994 06:01:47 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:074790:941021050149] Content-Identifier: Robots in Int... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 21 Oct 1994 06:01:47 +0100; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"7476 Fri Oct 21 06:01:37 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Robots in Internet Courses? Status: RO Content-Length: 394 I've heard rumours that among the many new Internet courses at the various universities, some have "write a robot" assignments/projects. Can anyone substantiate these rumours? A worrying thought really...
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Nov 3 14:26:57 1994 Return-Path: Delivery-Date: Thu, 3 Nov 1994 14:29:21 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 3 Nov 1994 14:26:57 +0000 Date: Thu, 3 Nov 1994 14:26:57 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:026870:941103142658] Content-Identifier: Mechanical Sp... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Nov 1994 14:26:57 +0000; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"2682 Thu Nov 3 14:26:37 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Mechanical Spider Image Status: RO Content-Length: 509 Hardly a real robot question, but does anyone know where I could get an image of that small spider robot done at MIT? I have seen it in print somewhere but cannot possibly remember, and I browsed the various robotics departments a while back without success. It would look better on the robot page than the robot arm :-) -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Nov 3 14:42:33 1994 Return-Path: Delivery-Date: Thu, 3 Nov 1994 14:45:53 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 3 Nov 1994 14:42:33 +0000 Date: Thu, 3 Nov 1994 14:42:33 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:030640:941103144235] Content-Identifier: code implemen... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Nov 1994 14:42:33 +0000; Alternate-Recipient: Allowed From: Mike Schwartz Message-ID: <199411031438.HAA23590@latour.cs.colorado.edu> To: /CN=robots/@nexor.co.uk Subject: code implementing robots.txt "Disallow" mechanism? Status: RO Content-Length: 135 Does anyone have some C code implementing the robots.txt "Disallow" mechanism? Thanks - Mike Schwartz Univ. of Colorado - Boulder From /CN=robots-errors/@nexor.co.uk Tue Nov 15 08:09:53 1994 Return-Path: Delivery-Date: Tue, 15 Nov 1994 08:18:21 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 15 Nov 1994 08:09:53 +0000 Date: Tue, 15 Nov 1994 08:09:53 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:078320:941115080954] Content-Identifier: Storing un-ch... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 15 Nov 1994 08:09:53 +0000; Alternate-Recipient: Allowed From: Martijn Koster Message-ID: <"7828 Tue Nov 15 08:09:44 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Cc: Rob Hartill , Jack Applin , warrenw@hp10cux8.nsr.hp.com Subject: Storing un-checked links Status: RO Content-Length: 1297 Hi all, Robert Hartill brought a problem to my attention where a robot stored an invalid link to server A, found on a host B, without checking it was still valid, or that it wasn't governed by /robots.txt. 
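A minimal sketch in C of the /robots.txt "Disallow" check Mike Schwartz asks for above, simplified in one respect: it honours every Disallow line regardless of which User-agent record it sits in, which is more conservative than the exclusion protocol requires. It is not taken from any existing robot; the matching is the plain leftmost-prefix comparison on partial URLs that the exclusion protocol describes.

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>   /* strncasecmp() (POSIX) */
    #include <ctype.h>

    /* Return 1 if `path' is excluded by the robots.txt body `txt'.
       Simplification: every Disallow line is honoured, whichever
       User-agent record it belongs to. */
    int disallowed(const char *txt, const char *path)
    {
        const char *line = txt;

        while (*line) {
            const char *nl  = strchr(line, '\n');
            const char *end = nl ? nl : line + strlen(line);

            if (end - line > 9 && strncasecmp(line, "Disallow:", 9) == 0) {
                const char *v  = line + 9;
                const char *ve = end;
                while (v < ve && isspace((unsigned char)*v))     v++;
                while (ve > v && isspace((unsigned char)ve[-1])) ve--;
                /* an empty Disallow value excludes nothing */
                if (ve > v && strncmp(path, v, (size_t)(ve - v)) == 0)
                    return 1;   /* path starts with the partial URL */
            }
            line = nl ? nl + 1 : end;
        }
        return 0;
    }

    int main(void)
    {
        const char *robots =
            "User-agent: *\n"
            "Disallow: /cgi-bin/\n"
            "Disallow: /tmp/\n";

        printf("%d\n", disallowed(robots, "/cgi-bin/vote"));   /* 1 */
        printf("%d\n", disallowed(robots, "/welcome.html"));   /* 0 */
        return 0;
    }

A production robot would additionally select the record whose User-agent line matches its own name, falling back to the * record, and cache the parsed file per server.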
I'm considering adding a note to guidelines.html that specifies that URLs shouldn't be listed unless their existence has been validated by an explicit retrieval (which of course falls under the /robots.txt restrictions). This is especially a good idea when the links are found on remote servers. This is vaguely related to an issue Jack Applin and Warren Waldo mailed me about a while ago: does the robots.txt "disallow" mean "don't retrieve this tree", or "don't list a link to this tree"? There is no distinction in norobots.html between these cases; in my mind robots.txt covers the first case, and the second case shouldn't happen. These extra retrievals reduce the harvest of links per document from many to one, but increase the quality of the robot's output. I guess at the very least non-checked links should be marked as such. Any thoughts? How do robots work in these cases currently? -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From narnett@verity.com Thu Nov 17 22:38:21 1994 Replied: Thu, 17 Nov 1994 22:50:46 +0000 Replied: Martijn Koster Replied: /DD.Common=robots/@nexor.co.uk Replied: narnett@verity.com (Nick Arnett) Return-Path: Delivery-Date: Thu, 17 Nov 1994 22:39:01 +0000 Received: from verity.com (actually host unknown-143-5.verity.com) by lancaster.nexor.co.uk with SMTP (XTPP); Thu, 17 Nov 1994 22:38:21 +0000 Received: from nasty.verity.com (nasty.verity.com [192.187.143.63]) by verity.com (8.6.6.Beta9/8.6.6.Beta9) with SMTP id OAA18655; Thu, 17 Nov 1994 14:40:57 -0800 Received: from by nasty.verity.com (4.1/SMI-4.1) id AB05203; Thu, 17 Nov 94 14:36:34 PST Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 17 Nov 1994 14:39:23 -0800 To: Martijn Koster , /DD.Common=robots/@nexor.co.uk From: narnett@verity.com (Nick Arnett) Subject: Re: Storing un-checked links Status: RO Content-Length: 671 At 8:01 AM 11/17/94, Martijn Koster wrote: >Hi all, > >Robert Hartill brought a problem to my attention where a robot stored >an invalid link to server A, found on a host B, without checking it >was still valid, or that it wasn't governed by /robots.txt. The browser should be sending referrer information. I must admit that I don't know the mechanism, but apparently there's a means for the client to tell "host B" in essence "host A sent me." Can anyone offer details? I only know this in theory at the moment. I dislike the idea of a robot that has to validate every link. That's a lot of overhead for a piece of information whose life is quite limited. Nick
From m.koster@nexor.co.uk Thu Nov 17 22:50:45 1994 Return-Path: Delivery-Date: Thu, 17 Nov 1994 22:50:52 +0000 Received: from nexor.co.uk (actually host victor.nexor.co.uk) by lancaster.nexor.co.uk with SMTP (PP); Thu, 17 Nov 1994 22:50:45 +0000 To: narnett@verity.com (Nick Arnett) cc: Martijn Koster , /DD.Common=robots/@nexor.co.uk Subject: Re: Storing un-checked links In-reply-to: Your message of "Thu, 17 Nov 1994 14:39:23 PST." Date: Thu, 17 Nov 1994 22:50:41 +0000 From: Martijn Koster Status: RO Content-Length: 1982 > At 8:01 AM 11/17/94, Martijn Koster wrote: > >Hi all, > > > >Robert Hartill brought a problem to my attention where a robot stored > >an invalid link to server A, found on a host B, without checking it > >was still valid, or that it wasn't governed by /robots.txt. > > The browser should be sending referrer information. I must admit that I > don't know the mechanism, but apparently there's a means for the client to > tell "host B" in essence "host A sent me." There is, but that doesn't help here. The problem is that the dead link is never retrieved by the robot, so it doesn't know it is dead, and it can't tell anyone if it wanted to. The only place where Referer comes in is when a client uses the robot-served dead link, in which case server A is told "Robot database xxx sent me", which doesn't help much. There might be scope for a new HTTP method DEADLINK or something, where a client explicitly goes back to the server that served the document containing the dead (or moved) link, and notifies it of this fact. Of course it should only do this on a positive refusal (when it gets a "not found"), not just on any failure (when it can't get to the host). But somehow I don't expect that to get into HTTP anytime soon. > Can anyone offer details? I only know this in theory at the moment. Referer is useful, but not with a third party. Oh, and lots of clients lie about Referer too :-) > I dislike the idea of a robot that has to validate every link. That's a > lot of overhead for a piece of information whose life is quite limited. That depends: it may not be that limited, and it may go for months without being refreshed. I suppose you could even do the validation based on some usage pattern: "this URL has been found during the last n searches, let's make sure it exists."
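To make the speculative idea above concrete: such a notification might look something like the request below. This method does not exist in HTTP, and the path, header usage and address are invented purely for illustration; it is the shape of the idea, not a proposal.

    DEADLINK /people/staff.html HTTP/1.0
    Link: <http://serverA/old/page.html>
    From: robot-maintainer@some.site

Here the client tells server B that its page /people/staff.html carries a link to http://serverA/old/page.html which drew a positive "404 Not Found" from server A.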
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From mcbryan@cs.colorado.edu Thu Nov 17 23:37:47 1994 Return-Path: Delivery-Date: Thu, 17 Nov 1994 23:38:06 +0000 Received: from piper.cs.colorado.edu by lancaster.nexor.co.uk with SMTP (XTPP); Thu, 17 Nov 1994 23:37:47 +0000 Received: from [198.11.16.30] (mac3bryan.cs.colorado.edu [198.11.16.30]) by piper.cs.colorado.edu (8.6.9/8.6.9) with SMTP id QAA28066; Thu, 17 Nov 1994 16:37:33 -0700 Date: Thu, 17 Nov 1994 16:37:33 -0700 X-Sender: mcbr@piper.cs.colorado.edu Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: Martijn Koster , " (Nick Arnett)" From: mcbryan@cs.colorado.edu (Oliver A. McBryan) Subject: Re: Storing un-checked links Cc: Martijn Koster , /DD.Common=robots/@nexor.co.uk Status: RO Content-Length: 466 The Worm records every link it cannot reach. It also records every host it cannot reach. Finally, it records the date of each attempted access. Those data are used later to decide whether and when to revisit a link, and what to output to the publicly accessible archive. Oliver McBryan; mcbryan@cs.colorado.edu; 303-6650544; Fax 303-4922844 Dept of Computer Science, University of Colorado, Boulder, CO 80309. WWW: http://www.cs.colorado.edu/home/mcbryan/Home.html From mlm@FUZINE.MT.CS.CMU.EDU Thu Nov 17 23:44:09 1994 Return-Path: Delivery-Date: Thu, 17 Nov 1994 23:46:29 +0000 Received: from fuzine.mt.cs.cmu.edu by lancaster.nexor.co.uk with SMTP (XTPP); Thu, 17 Nov 1994 23:44:09 +0000 Received: by fuzine.mt.cs.cmu.edu (NeXT-1.0 (From Sendmail 5.52)/NeXT-0.9) id AA05964; Thu, 17 Nov 94 18:42:50 EST Date: Thu, 17 Nov 94 18:42:50 EST From: mlm@FUZINE.MT.CS.CMU.EDU (Michael Mauldin) Message-Id: <9411172342.AA05964@fuzine.mt.cs.cmu.edu> Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line To: Martijn Koster Subject: Re: Storing un-checked links Cc: /DD.Common=robots/@nexor.CO.UK, Rob Hartill , Jack Applin , warrenw@hp10cux8.nsr.HP.com Status: RO Content-Length: 1200 Lycos is the robot in question, and I have already been in touch with Rob Hartill about the answer. One feature of Lycos is that it collects all the descriptions of a page and brings them together in a single retrieval, even if Lycos has not retrieved the document in question. Informal polls of my users indicate that the increased data coverage of this approach is extremely useful, but it does mean that the link is not known to be valid. One thing he suggested that I do like is not to store these indirect documents if the robots.txt file for that server would disallow the robot from accessing that file. This is not nearly as expensive as checking the existence of every URL, because there are only 9-10 thousand HTTP servers, and Lycos at least caches the robots.txt file for each of them. That means that Lycos can (in principle, and soon in reality) check the publication status of a URL before including it in the database. I would not be willing to go farther and state that a robot must verify the existence of a URL before revealing the URL pointer. That reduces the usefulness of my robot far too much. --Dr. Michael L. Mauldin http://lycos.cs.cmu.edu/ From /CN=robots-errors/@nexor.co.uk Tue Nov 15 18:47:17 1994 Replied: Wed, 16 Nov 1994 08:24:29 +0000 Replied: "David M.
Chess" Return-Path: Delivery-Date: Tue, 15 Nov 1994 18:48:52 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 15 Nov 1994 18:47:17 +0000 Date: Tue, 15 Nov 1994 18:47:17 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:194730:941115184718] Content-Identifier: Hello(b) Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 15 Nov 1994 18:47:17 +0000; Alternate-Recipient: Allowed From: "David M. Chess" Message-ID: <"19470 Tue Nov 15 18:47:08 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Hello! Status: RO Content-Length: 1627 I've just joined the list, and thought I'd introduce myself very briefly and ask some stupid questions. I'm David M. Chess, and I work in the High-Integrity Computing Lab at IBM's T. J. Watson Research Center in Westchester County, New York, USA. We work on computer viruses and related replicating things; we are the R&D staff for IBM AntiVirus, and we do research into virus-like problems on current and future highly-distributed systems. We're currently starting to look into the Web and electronic commerce. Just for exercise, I wrote a web-walker in REXX for OS/2, and while thinking about the various positive and negative aspects of the beasts, I stumbled across this list. - Should I send in a semi-formal description of my OS/2 REXX robot, for the list? I don't imagine it's ever been used outside the IBM Internal web so far, but it might be eventually. If so, just what information should I send, and to whom? - There are some reasonably obvious-looking things one could do to http to make at least some kinds of robots more efficient. For instance, if one could say "give me just the text and all <a> tags in this document", some robots could avoid reading all the text of an html document just to find those things. Is this a good place to toss around that sort of idea, or would one of the general www lists be better? - -- - David M. Chess | "Master, how may I comprehend the One?" High Integrity Computing Lab | "Have you finished your coding?" "Yes." IBM Watson Research | "Then go and compile!" -- Hacker Koan From /CN=robots-errors/@nexor.co.uk Wed Nov 16 08:24:41 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 16 Nov 1994 08:26:29 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 16 Nov 1994 08:24:41 +0000 Date: Wed, 16 Nov 1994 08:24:41 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273850:941116082442] Content-Identifier: Re: Hello(b) Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 16 Nov 1994 08:24:41 +0000; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"27382 Wed Nov 16 08:24:35 1994"@nexor.co.uk> To: "David M. Chess" <chess@watson.ibm.com> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <"19470 Tue Nov 15 18:47:08 1994"@nexor.co.uk> Subject: Re: Hello! Status: RO Content-Length: 2352 > We're currently starting > to look into the Web and electronic commerce. Just for exercise, > I wrote a web-walker in REXX for OS/2, and while thinking about > the various positive and negative aspects of the beasts, I > stumbled across this list. > > - Should I send in a semi-formal description of my > OS/2 REXX robot, for the list? 
I don't imagine it's > ever been used outside the IBM Internal web so far, > but it might be eventually. If so, just what information > should I send, and to whom? Have a look at http://web.nexor.co.uk/mak/doc/robots/robots.html; there is a list of robots there, with basic information to enable system admins to recognise them visiting, and it serves as a central list with pointers to Web robot material. I would suggest putting a full description of your robot up in the web, and then giving me a short summary, with a pointer to more info. > - There are some reasonably obvious-looking things one > could do to http to make at least some kinds of robots > more efficient. For instance, if one could say "give > me just the <title> text and all <a> tags in this document", > some robots could avoid reading all the text of an > html document just to find those things. Is this a > good place to toss around that sort of idea, or would > one of the general www lists be better? This is probably a good place to hash out any requirements and the issues, before proposing any formal protocol changes to www-talk. As for the "only title and links" idea, this is generally not thought to be sufficient for indexing purposes (see e.g. the WebWalker paper at the Chicago conference), and obviously not enough for content-checking purposes. That leaves "dead link" checking and simple statistics, and it is my opinion that no robot should only do those. To implement it would increase server complexity, and it wouldn't work for CGI generated pages (unless the server does even more work). So I don't think it'd be worth the effort. The obvious way to make robots more efficient is for them to share results, maybe by acting as a Harvest gatherer. Regards and welcome to the club, -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Nov 17 22:39:15 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 17 Nov 1994 22:43:24 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 17 Nov 1994 22:39:15 +0000 Date: Thu, 17 Nov 1994 22:39:15 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:288810:941117223916] Content-Identifier: Re: Hello(b) Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 22:39:15 +0000; Alternate-Recipient: Allowed From: " (Nick Arnett)" <narnett@verity.com> Message-ID: <aaf1896715021004c275@[192.187.143.12]> To: "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk Subject: Re: Hello! Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 1644 At 8:03 AM 11/17/94, David M. Chess wrote: > - Should I send in a semi-formal description of my > OS/2 REXX robot, for the list? I don't imagine it's > ever been used outside the IBM Internal web so far, > but it might be eventually. If so, just what information > should I send, and to whom? I, for one, am always interested in design comparisons. Our spider is still fairly crude, but getting smarter every day. > - There are some reasonably obvious-looking things one > could do to http to make at least some kinds of robots > more efficient.
For instance, if one could say "give > me just the <title> text and all <a> tags in this document", > some robots could avoid reading all the text of an > html document just to find those things. Is this a > good place to toss around that sort of idea, or would > one of the general www lists be better? I think this list is the ideal place to discuss such ideas. I'm afraid that the more general the list, the more likely you'll get flamed just for having written a spider at all. Makes it hard to have a rational discussion. Do you have any empirical evidence that getting the title and anchors would yield useful indexes? It sounds like a good idea, but my guess is that it's not going to work well, since those pieces wouldn't add up to an abstract of a typical document. A standard name for an abstract would be great, especially if it were in the header so that the HEAD command in HTTP would retrieve it along with other meta-information. We're supporting the META element in the HTML 2.0 RFC for encoding custom document attributes. Nick From /CN=robots-errors/@nexor.co.uk Thu Nov 17 23:04:35 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 17 Nov 1994 23:05:46 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 17 Nov 1994 23:04:35 +0000 Date: Thu, 17 Nov 1994 23:04:35 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:293220:941117230436] Content-Identifier: Re: Hello(b) Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 23:04:35 +0000; Alternate-Recipient: Allowed From: " (Oliver A. McBryan)" <mcbryan@cs.colorado.edu> Message-ID: <aaf12df809021004bf6f@[198.11.16.30]> To: " (Nick Arnett)" <narnett@verity.com>, "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk Subject: Re: Hello! X-Sender: mcbr@piper.cs.colorado.edu Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 391 The WWWW - World Wide Web Worm - retrieves only Title and anchors (including their associated text or icons). It has been quite useful - usage is at 1/3 million per month last time I checked. Oliver McBryan; mcbryan@cs.colorado.edu; 303-6650544; Fax 303-4922844 Dept of Computer Science, University of Colorado, Boulder, CO 80309. WWW: http://www.cs.colorado.edu/home/mcbryan/Home.html From /CN=robots-errors/@nexor.co.uk Thu Nov 17 23:13:53 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 17 Nov 1994 23:15:39 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 17 Nov 1994 23:13:53 +0000 Date: Thu, 17 Nov 1994 23:13:53 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:295000:941117231401] Content-Identifier: Re: Hello(b) Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 23:13:53 +0000; Alternate-Recipient: Allowed From: " (Nick Arnett)" <narnett@verity.com> Message-ID: <aaf1935d190210041989@[192.187.143.12]> To: " (Oliver A. McBryan)" <mcbryan@cs.colorado.edu>, "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk Subject: Re: Hello! Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 346 At 3:01 PM 11/17/94, Oliver A. McBryan wrote: >The WWWW - World Wide Web Worm - retrieves only Title and anchors >(including their associated text or icons).
It has been quite useful - >usage is at 1/3 million per month last time I checked. Forgive my glibness, but I think that's a measure of its usability, not its usefulness... ;-) Nick From /CN=robots-errors/@nexor.co.uk Thu Nov 17 23:40:50 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 17 Nov 1994 23:44:00 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 17 Nov 1994 23:40:50 +0000 Date: Thu, 17 Nov 1994 23:40:50 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:299200:941117234059] Content-Identifier: Re: Hello(b) Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 17 Nov 1994 23:40:50 +0000; Alternate-Recipient: Allowed From: " (Oliver A. McBryan)" <mcbryan@cs.colorado.edu> Message-ID: <aaf1341a0c021004304a@[198.11.16.30]> To: " (Nick Arnett)" <narnett@verity.com>, "David M. Chess" <chess@watson.ibm.com>, /CN=robots/@nexor.co.uk Subject: Re: Hello! X-Sender: mcbr@piper.cs.colorado.edu Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 829 At 11:13 PM 11/17/94, Nick Arnett wrote: >At 3:01 PM 11/17/94, Oliver A. McBryan wrote: >>The WWWW - World Wide Web Worm - retrieves only Title and anchors >>(including their associated text or icons). It has been quite useful - >>usage is at 1/3 million per month last time I checked. > >Forgive my glibness, but I think that's a measure of its usability, not its >usefulness... ;-) > >Nick Agreed. However, an awful lot of accesses are repeats from the same machine, suggesting usefulness as well. However, I'll be the first to agree that full-text search is needed. I just was not interested in providing that service myself. Oliver McBryan; mcbryan@cs.colorado.edu; 303-6650544; Fax 303-4922844 Dept of Computer Science, University of Colorado, Boulder, CO 80309. WWW: http://www.cs.colorado.edu/home/mcbryan/Home.html From /CN=robots-errors/@nexor.co.uk Fri Nov 18 22:51:18 1994 Replied: Mon, 21 Nov 1994 08:38:00 +0000 Replied: " (Brian Pinkerton)" <bp@cs.washington.edu> Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 18 Nov 1994 22:53:05 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 18 Nov 1994 22:51:18 +0000 Date: Fri, 18 Nov 1994 22:51:18 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:206000:941118225119] Content-Identifier: Web-wide Inde... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 18 Nov 1994 22:51:18 +0000; Alternate-Recipient: Allowed From: " (Brian Pinkerton)" <bp@cs.washington.edu> Message-ID: <199411182250.OAA22744@june.cs.washington.edu> To: /CN=robots/@nexor.co.uk Subject: Web-wide Indexing Workshop Return-Path: <bp@cs.washington.edu> X-Sender: bp@fishtail.biotech.washington.edu (Unverified) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 1140 I am trying to gauge interest in a web-wide indexing workshop at WWW95 in Darmstadt next April. The motivation for this workshop is to find ways that we can share indexing and Web-structure information. The idea is not to change the way any of you offer indexes, but rather to find ways to cooperate on finding the information that goes into building them.
I think we could focus on two specific goals: 1) take a hard look at some of the existing protocols for sharing information, and see what we can use and what we need to build, and 2) come up with an operational plan for putting these tools to use on a production basis. The workshop environment is the ideal place to do this work: we will bring together people with lots of experience building and running Web-wide indexes, and hopefully some who are experts in information retrieval. We will keep the numbers small, so we can actually make some progress. If you're interested, or if you would like to come but can't make Darmstadt, send me mail. I will collect comments, see if this will be worth doing, and submit a proposal to the conference organizers if it is. bri From /CN=robots-errors/@nexor.co.uk Sat Nov 19 03:47:09 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Sat, 19 Nov 1994 03:49:36 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sat, 19 Nov 1994 03:47:09 +0000 Date: Sat, 19 Nov 1994 03:47:09 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:234010:941119034711] Content-Identifier: Re: Web-wide ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 19 Nov 1994 03:47:09 +0000; Alternate-Recipient: Allowed From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov> Message-ID: <9411190348.AB25045@rbse.jsc.nasa.gov> To: /CN=robots/@nexor.co.uk Subject: Re: Web-wide Indexing Workshop X-Sender: eichmann@192.88.42.10 MIME-version: 1.0 Content-type: text/plain; charset="us-ascii" Content-transfer-encoding: 7BIT Status: RO Content-Length: 2078 At 4:52 PM 11/18/94 +0000, (Brian Pinkerton) wrote: >I am trying to gauge the interest for a web-wide indexing workshop at WWW95 >in Darmstadt next April. I believe that the time for some initial consensus is upon us, if for no other reason than that it can act as a driver for additional research beyond the simple application of indexing schemes out of the IR community to Web documents. > >The motivation for this workshop is to find ways that we can share indexing >and Web-structure information. The idea is not to change the way any of >you offer indexes, but rather to find ways to cooperate on finding the >information that goes into building them. There are some heavy-duty projects working on deep semantics approaches to this kind of thing (e.g., the ARPA Knowledge Sharing project) as well as some fairly pragmatic collaborations on sharing metadata (e.g., the DoD Reuse Interoperability Group (RIG)). The last I heard, Joe Nieten was heading up an AIAA task force to create a standard for repository interoperation that would also be relevant. There's also an IEEE working group on metadata with their own mailing list and workshop series. The trick here will be to build an exchange structure and protocol that is extensible enough to support both a fairly limited skeleton (i.e., a URL graph) and something as sophisticated as a semantic net. The Harvest group is part-way down this path with broker/broker interaction. The distributed database community has worked similar problems in the areas of heterogeneous and federated databases. - Dave p.s. Hmmm... I guess that the above is a 'yes' vote for a workshop. ----------- David Eichmann Asst. Prof.
/ RBSE Director of R & D Web: http://ricis.cl.uh.edu/eichmann/ Software Engineering Program Phone: (713) 283-3875 University of Houston - Clear Lake fax: (713) 283-3810 Box 113, 2700 Bay Area Blvd. Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html From /CN=robots-errors/@nexor.co.uk Tue Dec 27 16:05:47 1994 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 27 Dec 1994 16:07:16 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 27 Dec 1994 16:05:47 +0000 Date: Tue, 27 Dec 1994 16:05:47 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:201110:941227160549] Content-Identifier: Another page ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 27 Dec 1994 16:05:47 +0000; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"20101 Tue Dec 27 16:05:27 1994"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Another page to avoid Status: RO Content-Length: 291 http://csclub.uwaterloo.ca/u/zblaxell/useless.html contains deliberately broken links. How ridiculous. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Tue Jan 3 02:40:46 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 3 Jan 1995 02:42:20 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 3 Jan 1995 02:40:46 +0000 Date: Tue, 3 Jan 1995 02:40:46 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:162490:950103024047] Content-Identifier: Forms-based e... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 3 Jan 1995 02:40:46 +0000; Alternate-Recipient: Allowed From: " (Nick Arnett)" <narnett@verity.com> Message-ID: <ab2e5aa20602100451eb@[192.187.143.12]> To: /CN=robots/@nexor.co.uk Subject: Forms-based editor for robots.txt? Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 323 Has anyone taken a stab at an HTML forms-based editor for robots.txt? We plan to make use of it for our indexing agents, so we'd like to make it easy for administrators to modify it. In keeping with an overall strategy of using CGI-based admin tools, we'd like to have something along those lines for robots.txt. Nick From /CN=robots-errors/@nexor.co.uk Tue Jan 3 19:09:23 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 3 Jan 1995 19:11:07 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 3 Jan 1995 19:09:23 +0000 Date: Tue, 3 Jan 1995 19:09:23 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:052620:950103190924] Content-Identifier: Another page ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 3 Jan 1995 19:09:23 +0000; Alternate-Recipient: Allowed From: "David M. 
Chess" <chess@watson.ibm.com> Message-ID: <"5243 Tue Jan 3 19:09:05 1995"@nexor.co.uk> To: m.koster@nexor.co.uk, /CN=robots/@nexor.co.uk Subject: Another page to avoid Status: RO Content-Length: 125 Hey, if the Web didn't have all sorts of odd things in it, think how much harder stress-testing robots would be! *8) DC From /CN=robots-errors/@nexor.co.uk Sun Feb 5 10:49:06 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Sun, 5 Feb 1995 10:50:37 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 5 Feb 1995 10:49:06 +0000 Date: Sun, 5 Feb 1995 10:49:06 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:201910:950205104908] Content-Identifier: bad robot? Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 5 Feb 1995 10:49:06 +0000; Alternate-Recipient: Allowed From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU> Message-ID: <9502050205.aa11376@paris.ics.uci.edu> To: /CN=robots/@nexor.co.uk Subject: bad robot? Status: RO Content-Length: 533 Hi all, I've been getting bad requests like the following at my site: nirvana.rns.com - - [31/Jan/1995:19:14:41 -0800] "GET /WebSoft/wwwstat/" ADD_DATE="786312654" LAST_VISIT="786312648 HTTP/1.0" 404 - It looks like a robot, but is not on the list and I don't know anyone from rns.com. Does anyone know who it is? ......Roy Fielding ICS Grad Student, University of California, Irvine USA <fielding@ics.uci.edu> <URL:http://www.ics.uci.edu/dir/grad/Software/fielding> From /CN=robots-errors/@nexor.co.uk Wed Feb 22 18:55:16 1995 Replied: Mon, 27 Mar 1995 15:38:11 +0100 Replied: m.koster Replied: bp@haole.cs.washington.edu Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 22 Feb 1995 18:57:22 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 22 Feb 1995 18:55:16 +0000 Date: Wed, 22 Feb 1995 18:55:16 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:174850:950222185517] Content-Identifier: WWW95 Indexin... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 22 Feb 1995 18:55:16 +0000; Alternate-Recipient: Allowed From: Brian Pinkerton <bp@haole.cs.washington.edu> Message-ID: <9502221853.AA08285@haole.cs.washington.edu> To: /CN=robots/@nexor.co.uk Subject: WWW95 Indexing Workshop Reply-To: bp@haole.cs.washington.edu Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Original-Received: by NeXT.Mailer (1.118.2) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 1155 There will be an indexing workshop at the upcoming WWW conference in Darmstadt. The intent of the workshop is to get together those people who are actively involved in building or providing resource discovery systems for the Web. My intent is that we will discuss the state of the art, talk about some of its current limitations, and focus on some proposed solutions. It would be great to come out of this with a real plan to share data, or at least with the understanding that will enable us to do that in the future. Since the workshop is short, we hope to spend most of our time in discussion. 
For more information on the workshop, see the description at http://www.igd.fhg.de/www/www95/workshops/work-a.html General conference information is available at http://www.igd.fhg.de/www/www95/general.html If you would like to participate in the workshop, send either Bipin or me a short statement of why you'd like to be there. This statement doesn't have to be long -- 1 page will be fine. Deadlines: March 1: Workshop submission deadline March 15: early-bird conference registration deadline Brian Pinkerton University of Washington From /CN=robots-errors/@nexor.co.uk Tue Mar 7 13:57:53 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 7 Mar 1995 14:00:49 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 7 Mar 1995 13:57:53 +0000 Date: Tue, 7 Mar 1995 13:57:53 +0000 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:273570:950307135755] Content-Identifier: CFP: Intellig... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 7 Mar 1995 13:57:53 +0000; Alternate-Recipient: Allowed From: Michael Wooldridge <M.Wooldridge@doc.manchester-metropolitan-university.ac.uk> Message-ID: <9503071353.AA13738@patsy.doc.aca.mmu.ac.uk> To: /CN=robots/@nexor.co.uk Subject: CFP: Intelligent Agents and the Next Information Revolution X-Mailer: ELM [version 2.3 PL3] Status: RO Content-Length: 5133 INTELLIGENT AGENTS AND THE NEXT INFORMATION REVOLUTION A One-Day Meeting of the International CKBS-SIG Manchester, UK --- Tuesday May 9th, 1995 http://www.doc.mmu.ac.uk/STAFF/mike/ckbs95.html * Call for Participation * Call for Participation * INTRODUCTION It is by now a cliche that the widespread use of distributed information services will radically alter the way in which both organisations and individuals work. There are many indicators of this coming information revolution. The growth of network technology in commercial organisations, the routine use of email within academia, and the astonishing extent of interest in the World Wide Web are three obvious examples. Yet while the enormous potential presented by distributed information services is widely recognised, the software required to fully realise this potential is not yet available. There are several reasons for this, but among the most important is that current software paradigms simply do not lend themselves to developing the kind of applications required. In order to build computer systems that must operate in large, open, distributed, highly heterogeneous environments, we must make use of entirely new software technologies. The concept of an intelligent agent, which can operate autonomously and rationally on behalf of some user in such complex environments, is increasingly promoted as the foundation upon which to construct such a technology. The purpose of this meeting is to bring together researchers and practitioners interested in realising and exploiting this important emerging technology. MEETING STRUCTURE The day-long meeting will consist of: an introductory overview of the area and issues; keynote presentations from influential researchers; long presentations describing major applications, projects, and research results; and short presentations describing ongoing work. The emphasis throughout the day will be on informality, discussion, and informed speculation.
HOW TO PARTICIPATE If you would like to give a presentation, then email one of the organisers (below) enclosing your full contact details and a short (one paragraph) summary of your intended presentation. Topics of interest include, but are by no means limited to, the following: network agents ** WWW agents ** agent-based information systems ** agents in decision support ** agents for resource location ** distributed information services ** software agents/softbots ** knowbots ** authentication and security issues in cooperative systems ** agent communication languages ** KQML and KIF ** cooperative information retrieval and management ** shared ontologies ** the electronic marketplace ** information management and filtering agents ** knowledge sharing The deadline for presentation proposals is Friday March 17th. Presenters will have the opportunity to publish their work in the CKBS-SIG 1995 proceedings volume. If you would like to attend without giving a presentation, then please register by simply emailing one of the organisers, enclosing your full contact and affiliation details. All are welcome. No charge will be made for attendance. Please do *not* turn up without registering. WHEN AND WHERE The meeting will be held on Tuesday May 9th, 1995, in the Department of Computing at Manchester Metropolitan University. The Department is located in the centre of Manchester, an industrial city in the north-west of England. Manchester has excellent public transport links (with hourly trains to London), and is served by an international airport with scheduled flights to all major European centres. Contact the organisers for more information. Full details (including precise location and schedule) are available via the meeting WWW page (see above), and will be provided upon registration. ORGANISERS Michael Wooldridge and Michael Fisher Department of Computing, Manchester Metropolitan University Chester Street, Manchester M1 5GD, United Kingdom email {M.Wooldridge, M.Fisher}@doc.mmu.ac.uk tel (+44 1 61) 247 {1531, 1488} fax (+44 1 61) 247 1483 ABOUT THE CKBS-SIG The Special Interest Group (SIG) on Cooperating Knowledge Based Systems (CKBS) was established in 1990, and is funded by the DTI to provide a focus for UK activities in this area through the organisation of regular meetings. In addition to the meeting described above, three other CKBS-SIG events are planned for 1995: ** The University of Newcastle Upon Tyne is holding a one and a half day meeting in June (to be arranged by Andrew Blyth). ** Loughborough University is holding a one day meeting in September (to be arranged by Rachel Jones). ** Glasgow Caledonian University is holding a one day round table discussion in December (to be arranged by Cherif Branki). -- Michael Wooldridge | email M.Wooldridge@doc.mmu.ac.uk Department of Computing | http://www.doc.mmu.ac.uk/STAFF/mikew.html Manchester Metropolitan University | tel (+44 161) 247 1531 Chester St., Manchester M1 5GD, UK | fax (+44 161) 247 1483 From /CN=robots-errors/@nexor.co.uk Thu Apr 13 23:38:16 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 13 Apr 1995 23:40:42 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 13 Apr 1995 23:38:16 +0100 Date: Thu, 13 Apr 1995 23:38:16 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:078500:950413223817] Content-Identifier: Want specific... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 13 Apr 1995 23:38:16 +0100; Alternate-Recipient: Allowed From: " (Bob Carter)" <carter@cs.bu.edu> Message-ID: <199504132236.SAA18398@csb.bu.edu> To: /CN=robots/@nexor.co.uk Subject: Want specific size files on various WWW servers X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Status: RO Content-Length: 422 I am interested in finding objects of specific sizes (1k, 5k ...) on various WWW servers. I have started writing a script to gather this info, but having found this list it seems that one of the other robots may well have generated a log I can use instead. So, if such a list of HTTP documents and their sizes for some set of WWW servers exists, please let me know before I duplicate the effort. Thanks, Bob Carter From /CN=robots-errors/@nexor.co.uk Fri Apr 14 09:57:49 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 14 Apr 1995 10:01:30 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 14 Apr 1995 09:57:49 +0100 Date: Fri, 14 Apr 1995 09:57:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:160690:950414085755] Content-Identifier: Re: (fwd) Har... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Apr 1995 09:57:49 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU> Message-ID: <9504140155.aa09734@paris.ics.uci.edu> To: Partl <partl@hp01.boku.ac.at> Cc: /CN=robots/@nexor.co.uk, hgsupport@iicm.tu-graz.ac.at In-Reply-To: <"2235 Wed Apr 12 11:01:44 1995"@nexor.co.uk> Subject: Re: (fwd) Harvesters/Spiders/Crawlers/Lycos vs. Hyper-G Status: RO Content-Length: 1372 > Hyper-G has a lot of very good features: Hypermedia in hierarchical > structures, keyword searches and fulltext searches over sub-hierarchies, > multi-lingual documents, bi-directional links... > The WWW interface of Hyper-G ("wwwmaster") introduces the necessary > "state" of a user's session (the preferred language etc.) into the > "stateless" HTTP protocol by using dynamic URLs of the form > http://host:port/session-id/object-id Well, now that's a dumb idea -- the state doesn't belong in the URL. > ... > Since search engines / spiders / harvesters are most essential for > finding information in the vast worldwide web, this problem must be > solved as fast as possible! I suggest that the maintainers of all > search engines and spiders and harvesters like Lycos, WebCrawler, > WWW-Worm, Aliweb, Yahoo etc. etc. contact the maintainers of the > Hyper-G server at > hgsupport@iicm.tu-graz.ac.at > and agree on a solution to this problem. The solution is simple -- don't include the state in the URL. It doesn't belong there and will break far more than just spiders. Global history lists and hotlists will also fail to work properly. ....Roy T.
Fielding Department of ICS, University of California, Irvine USA <fielding@ics.uci.edu> <URL:http://www.ics.uci.edu/dir/grad/Software/fielding> From /CN=robots-errors/@nexor.co.uk Mon Apr 17 16:52:35 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 17 Apr 1995 16:54:40 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 17 Apr 1995 16:52:35 +0100 Date: Mon, 17 Apr 1995 16:52:35 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:236170:950417155242] Content-Identifier: YA mirroring ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 17 Apr 1995 16:52:35 +0100; Alternate-Recipient: Allowed From: Cyril Slobin <slobin@feast.fe.msk.ru> Message-ID: <PK4vealWw7@feast.fe.msk.ru> To: /CN=robots/@nexor.co.uk Subject: YA mirroring robot Organization: Institute for Commercial Engineering X-Mailer: Mail/@ [v2.28 FreeBSD] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: RO Content-Length: 1632 Lines: 46 Hello, World! Yes, I'm writing Yet Another Mirroring Robot. User-Agent: w3steal/2.1 libwww_perl/0.40 (Version may change of course) Let me try to justify myself: # 10. You will visit the list of known robots before writing a new one. # Look for one you can use or modify if necessary before writing a I have explored three or four of them and wasn't satisfied. My first attempt was to modify htget/w3mir, but after fixing some bugs in it and finding more I changed my mind. libwww_perl seems much more reliable and much more understandable than the scg libs. Really, I'm going to rewrite w3mir using libwww_perl and make it a bit smarter. # 8. Post a message to comp.infosystems.www.providers and send mail to # robots@nexor.co.uk announcing your intentions to write a robot. Here it is. # 6. Make sure you set informative headers like 'User-Agent' with # the name and version of your robot and 'From' with your email See above. # 3. [Shameless plug] Use wwwbot.pl in Roy Fielding's excellent libwww-perl # package, because it implements the latest Robot exclusion protocol Yes, it's a REALLY excellent library. # Of course, we have No doubts that you will joyfully provide this # information to everyone on the net for free, since most of the Sources will be put on the Net when debugged. I will announce them here. PS - Sorry for my terrible English. I hope my perl writings are more readable :-) -- Cyril Slobin <slobin@fe.msk.ru> | `And what is the use of a book,' thought <http://www.fe.msk.ru/~slobin/> | Alice `without pictures or conversation?'
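For reference, the headers that guideline 6 asks for look like this on the wire in an HTTP/1.0 request; the User-Agent value is the one Cyril gives above, and the From address is the one in his signature:

    GET /robots.txt HTTP/1.0
    User-Agent: w3steal/2.1 libwww_perl/0.40
    From: slobin@fe.msk.ru

A server maintainer who sees the robot in their logs can then tell at a glance what it is, which version is running, and whom to mail if it misbehaves.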
From /CN=robots-errors/@nexor.co.uk Mon Apr 17 17:43:14 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 17 Apr 1995 17:52:21 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 17 Apr 1995 17:43:14 +0100 Date: Mon, 17 Apr 1995 17:43:14 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:243830:950417164315] Content-Identifier: Simple robots Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 17 Apr 1995 17:43:14 +0100; Alternate-Recipient: Allowed From: " (Nancy Lehrer)" <nlehrer@isx.com> Message-ID: <9504171642.AA11917@isx.com> To: /CN=robots/@nexor.co.uk Subject: Simple robots Status: RO Content-Length: 351 I'm looking for a simple robot that I can modify to take a set of URLs, traverse and index these pages, and create a new set of URLs which includes the first set's links. Language preference would be C++/C, Tcl and possibly Perl. Certainly, there is lots of code out there. Any recommendations? Thanks, Nancy Lehrer ISX Corporation nlehrer@isx.com From /CN=robots-errors/@nexor.co.uk Tue Apr 18 06:52:48 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 18 Apr 1995 06:56:42 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 18 Apr 1995 06:52:48 +0100 Date: Tue, 18 Apr 1995 06:52:48 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:039250:950418055252] Content-Identifier: Re: Simple ro... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 18 Apr 1995 06:52:48 +0100; Alternate-Recipient: Allowed From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU> Message-ID: <9504172250.aa05277@paris.ics.uci.edu> To: " (Nancy Lehrer)" <nlehrer@isx.com> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9504171642.AA11917@isx.com> Subject: Re: Simple robots Status: RO Content-Length: 552 > I'm looking for a simple robot that I can modify to take a set of > URLs, traverse and index these pages, and create a new set of URLs > which includes the first set's links. Language preference would be > C++/C, Tcl and possibly Perl. That is one of the side-effects of MOMspider. http://www.ics.uci.edu/WebSoft/MOMspider/ ....Roy T. Fielding Department of ICS, University of California, Irvine USA <fielding@ics.uci.edu> <URL:http://www.ics.uci.edu/dir/grad/Software/fielding> From /CN=robots-errors/@nexor.co.uk Wed Apr 19 18:17:25 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 19 Apr 1995 18:20:51 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 19 Apr 1995 18:17:25 +0100 Date: Wed, 19 Apr 1995 18:17:25 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:128530:950419171727] Content-Identifier: Basic clueles... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 19 Apr 1995 18:17:25 +0100; Alternate-Recipient: Allowed From: "David M. Chess" <chess@watson.ibm.com> Message-ID: <"12851 Wed Apr 19 18:17:11 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Basic clueless questions Status: RO Content-Length: 2175 I looked over the Robots and /robots.txt stuff a while back, and didn't quite understand it at the time.
I've now let it percolate around a bit in my brain, and since I still don't quite understand it, I thought I'd expose my basic cluelessness here by asking if anyone can explain it to me. As I see it, there are three basic problems that a particular spider/robot/wwworm can cause: 1) They can overload some server, or set of servers, by doing more or more frequent accesses than a human with a normal client would, 2) They can mess up things like votes, by following "Click here if you love artichokes" links without understanding what they're doing, 3) They can increase the general web load to no good purpose, by repeatedly asking for lots of information, much of which is thrown away (of course, human surfers can be guilty of this, too!). The solution to (1) would seem to be to have some rough guidelines for how often and how much a spider should hit a given server, and I see y'all have done that, and that's good. The solution to (2) would seem to be some sort of marker in the <A> tag that means "follow this link only if you really understand it, and aren't some robot or something". Sort of like <a href="votes/artichokes/love" robots=NO> or whatever. Has there been any talk of that? The solution to (3) is at least partly to make robots efficient, not run too often, make results public, and the other good things that have been suggested here. An HTTP verb meaning "give me only the stuff between the following tags in the following document" (and/or a more powerful server-side-search-and-extract protocol) would also help. The /robots.txt idea doesn't seem designed to solve (1), (2), or (3), and I'm somewhat unclear on what problems it really is designed to solve. Why would I want to exclude certain web-crawlers from certain subtrees of my server, any more than I'd want to exclude (say) users of certain browsers, or people with certain middle names? - -- - David M. Chess / Invest for the Nanotech Era: High Integrity Computing Lab / Buy Atoms! IBM Watson Research From /CN=robots-errors/@nexor.co.uk Thu Apr 20 04:42:33 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 20 Apr 1995 04:46:44 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 20 Apr 1995 04:42:33 +0100 Date: Thu, 20 Apr 1995 04:42:33 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:206300:950420034236] Content-Identifier: Re: Basic clu... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 04:42:33 +0100; Alternate-Recipient: Allowed From: JamesB@werple.mira.net.au Message-ID: <208a2303.83d61-JamesB@ArtWorks.mira.net.au> To: /CN=robots/@nexor.co.uk References: <"12851 Wed Apr 19 18:17:11 1995"@nexor.co.uk>, <chess@watson.ibm.com> Subject: Re: Basic clueless questions Reply-To: JamesB@werple.mira.net.au X-Mailer: //\\miga Electronic Mail (AmiElm 5.42) Organization: Melbourne ArtWorks Status: RO Content-Length: 1673 Hi David, [...] > As I see it, there are three basic problems that a particular > spider/robot/wwworm can cause: > > 1) They can overload some server, or set of servers, by > doing more or more frequent accesses than a human with > a normal client would, > > 2) They can mess up things like votes, by following "Click > here if you love artichokes" links without understanding > what they're doing, Mostly this can be solved by searching for a '?' 
character in the URL. If it contains one, it's a query of some sort and isn't much use to the robot. Pretty crude, but in most circumstances a reasonable restriction. [...] > The /robots.txt idea doesn't seem designed to solve (1), (2), > or (3), and I'm somewhat unclear on what problems it really > is designed to solve. Why would I want to exclude certain > web-crawlers from certain subtrees of my server, any more than > I'd want to exclude (say) users of certain browsers, or people > with certain middle names? I must admit now that you've described it like that it doesn't seem to make sense. All it will do is give the site maintainer control over what NICE robots do when run by NICE people. It seems to me that NICE people are not the problem. (NICE == thoughtful, responsible, don't arbitrarily waste bandwidth...). On the good side it doesn't require modifying any 'standards' and therefore it doesn't require modifying any software (other than those naughty robots). Anybody else? james -- James Burton | EMail: JamesB@werple.mira.net.au | Latrobe University WWW : http://www.cs.latrobe.edu.au/~burton/ | Melbourne, Australia From /CN=robots-errors/@nexor.co.uk Thu Apr 20 12:49:03 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 20 Apr 1995 12:56:57 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 20 Apr 1995 12:49:03 +0100 Date: Thu, 20 Apr 1995 12:49:03 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:296250:950420114923] Content-Identifier: Re: Basic clu... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 12:49:03 +0100; Alternate-Recipient: Allowed From: Gorm Haug Eriksen <g.h.eriksen@usit.uio.no> Message-ID: <"mons.uio.n.416:20.03.95.11.32.15"@mons.uio.no> To: JamesB@werple.mira.net.au Cc: /CN=robots/@nexor.co.uk In-Reply-To: <208a2303.83d61-JamesB@ArtWorks.mira.net.au> Subject: Re: Basic clueless questions X-Mailer: exmh version 1.5.3 12/28/94 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 1192 JamesB@werple.mira.net.au said on the mailing list robots@..uk: -> I must admit now that you've described it like that it doesn't -> seem to make sense. All it will do is give the site maintainer control -> over what NICE robots do when run by NICE people. It seems to me that -> NICE people are not the problem. (NICE == thoughtful, responsible, don't -> arbitrarily waste bandwidth...). On the good side it doesn't require -> modifying any 'standards' and therefore it doesn't require modifying -> any software (other than those naughty robots). -> -> Anybody else? -> james This isn't primarily a robot problem, but it is getting to be one: we have seen a couple of commercial products out there that use WWW-wanderers/robots to gather and index information on the Web. This information is presented to users, who need to pay for it. I think we will see much more of this in the future, because people don't want to share the information their wanderers have collected. What do you people think about this? Can we do anything to help? Perhaps make a sort of ultimate index that will be saved to disk in a nice format, and let everyone mirror it?
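To make the '?' heuristic James suggests near the top of this exchange concrete, here is a minimal sketch (in Python, purely for illustration); the function name and the extra /cgi-bin/ check are assumptions of this sketch, not anything proposed on the list:

    # Crude query-URL filter as discussed above: treat any URL containing
    # a '?' as the result of an ISINDEX/form query and skip it.
    # The /cgi-bin/ check is an extra assumption, not part of the posting.
    def looks_like_query(url):
        return '?' in url or '/cgi-bin/' in url

    links = ['http://host/docs/a.html',
             'http://host/cgi-bin/vote?artichokes=love']
    to_visit = [u for u in links if not looks_like_query(u)]
    # to_visit == ['http://host/docs/a.html']

As the posters note, this is crude: it also skips perfectly indexable pages that merely happen to live behind a '?', so it is a politeness heuristic, not a guarantee.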
Gorm Haug Eriksen USIT / UiO / Norway From /CN=robots-errors/@nexor.co.uk Thu Apr 20 13:23:22 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 20 Apr 1995 13:38:57 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 20 Apr 1995 13:23:22 +0100 Date: Thu, 20 Apr 1995 13:23:22 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:007560:950420122335] Content-Identifier: Re: Basic clu... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 13:23:22 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"636 Thu Apr 20 13:17:46 1995"@nexor.co.uk> To: Gorm Haug Eriksen <g.h.eriksen@usit.uio.no> Cc: JamesB@werple.mira.net.au, /CN=robots/@nexor.co.uk In-Reply-To: <"mons.uio.n.416:20.03.95.11.32.15"@mons.uio.no> Subject: Re: Basic clueless questions Status: RO Content-Length: 1600 In message <"mons.uio.n.416:20.03.95.11.32.15"@mons.uio.no>, Gorm Haug Eriksen writes: > We have seen a couple of commercial products out there that use > WWW-wanderers/robots to gather and index information on the Web. This > information is presented to users, who need to pay for it. I think > we will see much more of this in the future, because people don't > want to share the information their wanderers have collected. > What do you people think about this? I personally think this is inevitable. There was a big debate about this during the Web Conference last week. Should these guys be paying me for allowing them to use my data as content, or should I pay them for including me in their service? Endless fun, no solution :-) > Can we do anything to help? Perhaps make a sort of ultimate index > that will be saved to disk in a nice format, and let everyone mirror > it? This was discussed also. The basic problem is that all these indexers really want the full content of your pages, so they can differentiate themselves from others by designing super-duper html-parsing/linguistic-analysing/AI-based summarisers etc. This kind of limits you to sharing only the URL structure with update information. Robots are now starting to do that, but with today's servers it's difficult for maintainers to do this. You'd need something wonderful like the system described by Pitkow & Jones in "Towards an intelligent publishing environment" at WWW'95.
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 06:03:52 +0100; Alternate-Recipient: Allowed From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov> Message-ID: <9504200505.AA11072@rbse.jsc.nasa.gov> To: /CN=robots/@nexor.co.uk Subject: Re: Basic clueless questions X-Sender: eichmann@192.88.42.10 MIME-version: 1.0 Content-type: text/plain; charset="us-ascii" Content-transfer-encoding: 7BIT Status: RO Content-Length: 3021 At 4:42 AM 4/20/95 +0100, JamesB@werple.mira.net.au wrote: ... >> As I see it, there are three basic problems that a particular >> spider/robot/wwworm can cause: >> >> 1) They can overload some server, or set of servers, by >> doing more or more frequent accesses than a human with >> a normal client would, >> >> 2) They can mess up things like votes, by following "Click >> here if you love artichokes" links without understanding >> what they're doing, > >Mostly this can be solved by searching for a '?' character in the URL. >If it contains one, it's a query of some sort and isn't much use to >the robot. Pretty crude, but in most circumstances a reasonable restriction. At this point in the growth/popularity of the Web, it would take a pretty persistent spider to generate the load that the Web community, by mass inquisitiveness, can accomplish. In real practice, a well-run spider doesn't come close to the bandwidth consumption that idle browsing generates - particularly when automatic image loading is the norm. Active servers aren't really going to notice spider activity unless they go looking for it (at least for the spiders that are currently active). I seem to remember Martijn observing in Darmstadt that we'd outgrown the instability and recklessness of our youth (well maybe not *exactly* that wording, but hey - it's late in my timezone...) > >[...] >> The /robots.txt idea doesn't seem designed to solve (1), (2), >> or (3), and I'm somewhat unclear on what problems it really >> is designed to solve. Why would I want to exclude certain >> web-crawlers from certain subtrees of my server, any more than >> I'd want to exclude (say) users of certain browsers, or people >> with certain middle names? > >I must admit now that you've described it like that it doesn't >seem to make sense. All it will do is give the site maintainer control >over what NICE robots do when run by NICE people. It seems to me that >NICE people are not the problem. (NICE == thoughtful, responsible, don't >arbitrarily waste bandwidth...). On the good side it doesn't require >modifying any 'standards' and therefore it doesn't require modifying >any software (other than those naughty robots). The file is intended *precisely* to allow well-behaved agents to discern what to avoid, with the clear recognition that given an open and unauthenticated (in general) Web, there is no stopping rogues. Definition by agent is useful because not all agents are created equally, and some are more robust about interesting "gravitational wells" than others... - Dave ----------- David Eichmann Asst. Prof. / RBSE Director of R & D Web: http://ricis.cl.uh.edu/eichmann/ Software Engineering Program Phone: (713) 283-3875 University of Houston - Clear Lake fax: (713) 283-3869 Box 113, 2700 Bay Area Blvd.
Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html From /CN=robots-errors/@nexor.co.uk Thu Apr 20 09:29:36 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 20 Apr 1995 09:32:59 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 20 Apr 1995 09:29:36 +0100 Date: Thu, 20 Apr 1995 09:29:36 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:244960:950420082939] Content-Identifier: Re: Basic clu... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 20 Apr 1995 09:29:36 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"24493 Thu Apr 20 09:29:14 1995"@nexor.co.uk> To: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9504200505.AA11072@rbse.jsc.nasa.gov> Subject: Re: Basic clueless questions Status: RO Content-Length: 2799 In message <9504200505.AA11072@rbse.jsc.nasa.gov>, " (David Eichmann)" writes: > At this point in the growth/popularity of the Web, it would take a pretty > persistent spider to generate the load that the Web community, by mass > inquisitiveness, can accomplish. In real practice, a well-run spider doesn't > come close to the bandwidth consumption that idle browsing generates - > particularly when automatic image loading is the norm. This seems true for the major spiders currently running. Spiders that traverse the entire tree and visit regularly (say once every three days) would of course be noticed. > Active servers aren't really going to notice spider activity unless > they go looking for it (at least for the spiders that are currently > active). I seem to remember Martijn observing in Darmstadt that > we'd outgrown the instability and recklessness of our youth (well > maybe not *exactly* that wording, but hey - it's late in my > timezone...) I don't think I was quite that eloquent :-) > The file is intended *precisely* to allow well-behaved agents to > discern what to avoid, with the clear recognition that given > an open and unauthenticated (in general) Web, there is no stopping > rogues. Definition by agent is useful because not all agents are > created equally, and some are more robust about interesting > "gravitational wells" than others... To illustrate, take my server. We have some company marketing stuff, some public services, and some company support stuff. Before /robots.txt I had spiders happily index all the subfiles of my Mac archive, and nothing else. This might make my company look like a Mac house, which we're not. I also had spiders happily index all our bug reports, and nothing else. This isn't so good for our image either :-) With robots.txt I can control that better. Together with /site.idx this gives me a reasonable way of informing robots what they ought to index on my server, with very little cost on the administrator's part and on the robot's part. There are other cases, especially in large archive servers, where the subtree exclusion is really useful. As for: >> 1) They can overload some server... >> 2) They can mess up things like votes... >> 3) They can increase the general web load... To fix 1, not just for robots but also for general use, servers should have load measurements, and return "Too Busy"...
For 2, well, people who use GET links for non-idempotent actions deserve what they get; they should use POST. Spiders that POST should be shot :-) There isn't much anyone can enforce against 3 other than peer pressure and charging for bandwidth. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Sat Apr 22 16:43:09 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Sat, 22 Apr 1995 16:45:35 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sat, 22 Apr 1995 16:43:09 +0100 Date: Sat, 22 Apr 1995 16:43:09 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:144710:950422154311] Content-Identifier: The indexing ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 22 Apr 1995 16:43:09 +0100; Alternate-Recipient: Allowed From: " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA> Message-ID: <9504221541.AA11731@alize.ERE.UMontreal.CA> To: /CN=robots/@nexor.co.uk Subject: The indexing problem... X-Mailer: ELM [version 2.4 PL21] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 1221 Hi, Here is my suggestion for solving the problem of indexing the web. It should simplify the work to be done by agents and minimize load on the net... It should also provide much more consistent results. The HTML should be enhanced with a 'KeyWord' tag that should appear inside the <HEAD> of the document, something like: <HEAD> <KEYWORD list="biology,computing,protein"> </HEAD> These keywords would be chosen by the author of the page, who is the best person to tell what his page is about! Also, to avoid transferring the whole page, the servers should be able to answer a special query (EXPLORE? or PROBE?) that would return two lists: - The list of keywords extracted from the KEYWORD tag in the HEAD. - And a list of all the 'href' items of the <A> tags. It could also send back some info about the pages (length, number of images, author, last date of modification...). Are those changes really complicated to implement? -- | Sebastien Lemieux, dept. biol. || Look behind the wave of changes | lemieuse@alize.ERE.UMontreal.CA || Feel the future taking shape | PGP public key on finger. || I can see the world to come (KJ) http://alize.ere.umontreal.ca/~lemieuse/ From /CN=robots-errors/@nexor.co.uk Mon Apr 24 05:42:38 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 24 Apr 1995 05:45:38 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 24 Apr 1995 05:42:38 +0100 Date: Mon, 24 Apr 1995 05:42:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:004760:950424044247] Content-Identifier: Re: The index... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Apr 1995 05:42:38 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" <mlm@FUZINE.MT.CS.CMU.EDU> Message-ID: <9504240441.AA11998@fuzine.mt.cs.cmu.edu> To: " (Lemieux Sebastien)" <lemieuse@ere.UMONTREAL.CA> Cc: /CN=robots/@nexor.co.uk Subject: Re: The indexing problem...
Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 202 I completely disagree. The author is the last person who should index the document. Lycos' philosophy is that a single agent do the indexing, thus assuring a consistent level of mediocrity. --Fuzzy From /CN=robots-errors/@nexor.co.uk Mon Apr 24 11:05:26 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 24 Apr 1995 11:10:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 24 Apr 1995 11:05:26 +0100 Date: Mon, 24 Apr 1995 11:05:26 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050880:950424100528] Content-Identifier: Re: The index... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Apr 1995 11:05:26 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"5050 Mon Apr 24 11:04:00 1995"@nexor.co.uk> To: " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9504221541.AA11731@alize.ERE.UMontreal.CA> Subject: Re: The indexing problem... Status: RO Content-Length: 1756 In message <9504221541.AA11731@alize.ERE.UMontreal.CA>, " (Lemieux Sebastien)" writes: > The HTML should be enhanced with a 'KeyWord' tag that should appear >inside the <HEAD> of the document, something like: > ><HEAD> ><KEYWORD list="biology,computing,protein"> ></HEAD> > > These keywords would be chosen by the author of the page, >who is the best person to tell what his page is about! You'll be pleased to know you can already do this with HTML, using the META tag. As it happens it's not all that great, as the choices of keywords are not always easy, and people don't do it anyway :-) > Also, to avoid transferring the whole page, the servers should be >able to answer a special query (EXPLORE? or PROBE?) that would >return two lists: > > - The list of keywords extracted from the KEYWORD tag in the HEAD. > - And a list of all the 'href' items of the <A> tags. If you want this on a per-document basis you should probably do it with a content-type: text/urls-only. This has been discussed before, and it appears robots want the whole document anyway... > It could also send back some info about the pages (length, number of >images, author, last date of modification...). Yes, and the <H?> so you know the structure of the document, and the first paragraphs, so you have quick introductions, and ... :-) You pretty quickly end up shipping the whole document. >Are those changes really complicated to implement? Changing servers is more an issue than simply coding it up: you need to provide code for _all_ servers, and then get the new servers deployed. That's not so easy.
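As a concrete illustration of what Martijn describes (pulling the META-supplied keywords and the anchor hrefs out of a fetched page, roughly the two lists the proposed EXPLORE?/PROBE? query would return), here is a minimal sketch using Python's standard library; the class name and the sample page are assumptions of this sketch, and it is modern Python, not anything that existed in 1995:

    # Sketch: extract META keywords and anchor hrefs from an HTML page.
    from html.parser import HTMLParser

    class ProbeParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.keywords = []
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'meta' and (attrs.get('name') or '').lower() == 'keywords':
                content = attrs.get('content') or ''
                self.keywords += [k.strip() for k in content.split(',') if k.strip()]
            elif tag == 'a' and attrs.get('href'):
                self.hrefs.append(attrs['href'])

    page = ('<head><meta name="keywords" content="biology,computing,protein">'
            '</head><body><a href="/papers/fold.html">folding</a></body>')
    p = ProbeParser()
    p.feed(page)
    print(p.keywords)   # ['biology', 'computing', 'protein']
    print(p.hrefs)      # ['/papers/fold.html']

Note that this only works against the document itself; the point made above stands, namely that serving these lists without shipping the whole page would require changes to every server.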
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Mon Apr 24 17:08:07 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 24 Apr 1995 17:12:45 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 24 Apr 1995 17:08:07 +0100 Date: Mon, 24 Apr 1995 17:08:07 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:140220:950424160809] Content-Identifier: Re: The index... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Apr 1995 17:08:07 +0100; Alternate-Recipient: Allowed From: " (Nick Arnett)" <narnett@Verity.COM> Message-ID: <abc17b8103021004e1c9@[192.187.143.12]> To: Martijn Koster <m.koster@nexor.co.uk>, " (Lemieux Sebastien)" <lemieuse@ERE.UMontreal.CA> Cc: /CN=robots/@nexor.co.uk Subject: Re: The indexing problem... X-Sender: narnett@hawaii.verity.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 598 At 3:05 AM 4/24/95, Martijn Koster wrote: >As it happens it's not all that great, as the choices of keywords >are not always easy, and people don't do it anyway :-) It's also becoming clear that search engines with knowledgebases can do a better job of assigning keywords than humans can (or will, anyway). What's more, such a gizmo can go back and assign new keywords to old articles, which is impractical for humans, given the volume of documents being produced. Even more to the point, keyword searching doesn't provide nearly the accuracy that is possible with full-text searching. Nick From /CN=robots-errors/@nexor.co.uk Sun Apr 30 19:38:23 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Sun, 30 Apr 1995 19:40:21 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 30 Apr 1995 19:38:23 +0100 Date: Sun, 30 Apr 1995 19:38:23 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:103490:950430183824] Content-Identifier: Re: The index... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 30 Apr 1995 19:38:23 +0100; Alternate-Recipient: Allowed From: " (Tim Bray)" <tbray@opentext.com> Message-ID: <m0s5cvs-0001syC@giant.mindlink.net> To: /CN=robots/@nexor.co.uk Subject: Re: The indexing problem... X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Version 2.0.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 1211 Nick Arnett @ Verity writes: >It's also becoming clear that search engines with knowledgebases can do a >better job of assigning keywords than humans can (or will, anyway). Just to be fair, it should be noted that lots of people disagree with this assertion. What is pretty clear from IR research is that search engines can cluster text objects by degree of similarity. The ability to assign human-credible keywords is a *VERY* strong claim and one that I haven't seen running code to support. There are vendors whose marketing literature makes precisely this claim, although to be fair to Verity, I don't think they're one of them. >Even more to the point, keyword searching doesn't provide nearly the >accuracy that is possible with full-text searching. I disagree completely. I think that subject keywords, assigned by unhurried, professional, intelligent humans, with subject matter expertise, will support a much greater degree of search accuracy - and I'm a full-text search vendor! Unfortunately, there is too much material and not enough people to do such keywording, so we full-text vendors are the only alternative for retrieval in many situations. Cheers, Tim Bray, Open Text Corporation From /CN=robots-errors/@nexor.co.uk Tue May 2 21:10:40 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 2 May 1995 21:12:45 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 2 May 1995 21:10:40 +0100 Date: Tue, 2 May 1995 21:10:40 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:018650:950502201042] Content-Identifier: unsubsrci(008... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 2 May 1995 21:10:40 +0100; Alternate-Recipient: Allowed From: "Kilaman........." <ktutu@meceng.coe.neu.edu> Message-ID: <199505022010.QAA01821@splinter.coe.neu.edu> To: /CN=robots/@nexor.co.uk Subject: unsubsrci
Status: RO Content-Length: 12 unsubscribe From /CN=robots-errors/@nexor.co.uk Tue May 2 22:43:31 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 2 May 1995 22:48:14 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 2 May 1995 22:43:31 +0100 Date: Tue, 2 May 1995 22:43:31 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:030220:950502214332] Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 2 May 1995 22:43:31 +0100; Alternate-Recipient: Allowed From: " (Drew Dupont)" <dsdupont@indiana.edu> Message-ID: <v01510100abcc57a40525@[129.79.18.32]> To: /CN=robots/@nexor.co.uk X-Sender: dsdupont@silver.ucs.indiana.edu Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 14 unsubscribe From /CN=robots-errors/@nexor.co.uk Wed May 3 04:47:49 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 3 May 1995 04:52:52 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 3 May 1995 04:47:49 +0100 Date: Wed, 3 May 1995 04:47:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:080330:950503034752] Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 3 May 1995 04:47:49 +0100; Alternate-Recipient: Allowed From: JERATCLIFF@vaxsar.vassar.edu Message-ID: <01HQ1NSLMKJ6002WU0@vassar.edu> To: /CN=robots/@nexor.co.uk X-Envelope-to: /CN=robots/@nexor.co.uk X-VMS-To: IN%"/CN=robots/@nexor.co.uk" MIME-version: 1.0 Content-transfer-encoding: 7BIT Status: RO Content-Length: 12 UNSUBSCRIBE From /CN=robots-errors/@nexor.co.uk Wed May 3 05:11:01 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 3 May 1995 05:17:32 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 3 May 1995 05:11:01 +0100 Date: Wed, 3 May 1995 05:11:01 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:086500:950503041103] Content-Identifier: The (036)50 p... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 3 May 1995 05:11:01 +0100; Alternate-Recipient: Allowed From: Atif Ahmad Khan <aak2@Ra.MsState.Edu> Message-ID: <199505030408.XAA21676@Ra.MsState.Edu> To: /CN=robots/@nexor.co.uk Subject: The $50 prize - results X-Mailer: ELM [version 2.4 PL22] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 815 Sorry folks, for taking so long to get back. Had to drive out of town for the weekend. Here are the results of the $50 Prize contest. I received a total of 13 solutions from people all around the world. I still say, Wow ! First entry was by : Victor A. Parada vparada@inf.utfsm.cl Victor's script works fine and therefore Victor has won the prize. I have included his name after getting his permission. Some people also suggested the "lynx -source" solution that is simple yet works great. I am now also looking to be able to submit a "form" using a script. For example is there a way to submit data using a script or any other automated means to a simple form at : http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/example-1.html ? Thanx a million. 
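For reference, the kind of scripted form submission Atif is asking about comes down to URL-encoding the field names and values and issuing the request yourself. A minimal sketch in Python (the field names here are invented for illustration, and which variant to use depends on the METHOD attribute of the form in question):

    # Sketch: submit data to an HTML form from a script (e.g. under cron).
    # Read the real field names out of the form's <INPUT> tags first.
    import urllib.parse, urllib.request

    url = 'http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/example-1.html'
    fields = urllib.parse.urlencode({'name': 'Victor', 'comment': 'hello'})

    # GET-style form: fields go on the URL as a query string.
    reply = urllib.request.urlopen(url + '?' + fields)

    # POST-style form: the same fields go in the request body instead.
    # reply = urllib.request.urlopen(url, data=fields.encode('ascii'))

    print(reply.read(200))

The "lynx -source" trick mentioned above covers the plain-fetch case; the sketch shows what it takes once form fields enter the picture.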
Atif Khan aak2@ra.msstate.edu From /CN=robots-errors/@nexor.co.uk Wed May 3 16:11:03 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 3 May 1995 16:15:41 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 3 May 1995 16:11:03 +0100 Date: Wed, 3 May 1995 16:11:03 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:050920:950503151108] Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 3 May 1995 16:11:03 +0100; Alternate-Recipient: Allowed From: William Glantz <wglantz@VNET.IBM.COM> Message-ID: <kjdthAs91H4yBiXY52@rchland.ibm.com> To: /CN=robots/@nexor.co.uk Reply-To: William Glantz <wglantz@VNET.IBM.COM> Status: RO Content-Length: 112 unsubscribe . Wm Glantz [An Andrew ToolKit view (a footnote) was included here, but could not be displayed.] From /CN=robots-errors/@nexor.co.uk Thu May 4 17:02:44 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 4 May 1995 17:09:52 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 4 May 1995 17:02:44 +0100 Date: Thu, 4 May 1995 17:02:44 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019070:950504160246] Content-Identifier: list Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 17:02:44 +0100; Alternate-Recipient: Allowed From: KHANCOCK@acc.rwu.edu Message-ID: <7210061645.ccSSN652@acc.rwu.edu> To: /CN=robots/@nexor.co.uk Subject: list Status: RO Content-Length: 13 unsubscribe From /CN=robots-errors/@nexor.co.uk Thu May 4 17:26:50 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 4 May 1995 17:36:04 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 4 May 1995 17:26:50 +0100 Date: Thu, 4 May 1995 17:26:50 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:024500:950504162651] Content-Identifier: Administrativ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 17:26:50 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"2384 Thu May 4 17:25:31 1995"@nexor.co.uk> To: KHANCOCK@acc.rwu.edu Cc: /CN=robots/@nexor.co.uk In-Reply-To: <7210061645.ccSSN652@acc.rwu.edu> Subject: Administrativa (was Re: list ) Status: RO Content-Length: 974 In message <7210061645.ccSSN652@acc.rwu.edu>, KHANCOCK@acc.rwu.edu writes: > unsubscribe Please stop sending unsubscribe messages to the list. As with thousands of other mailing lists, you send subscribe and unsubscribe requests to the '-request' address, i.e. robots-request@nexor.co.uk. If you send it to the list, the message will be distributed to all the hundreds of subscribed people. With this particular list server, you just put the words "unsubscribe", "stop" on separate lines in the body of the message, and it all happens automatically. If you have problems, mail the list owner (me) direct. I usually respond within minutes of receiving them... Yes, I know this is going to the whole list too, I am just trying to prevent other people also making that mistake. No, my list manager can't filter them out, and I can't switch list managers.
-- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu May 4 19:35:43 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 4 May 1995 19:42:55 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 4 May 1995 19:35:43 +0100 Date: Thu, 4 May 1995 19:35:43 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:049980:950504183553] Content-Identifier: ditto... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 19:35:43 +0100; Alternate-Recipient: Allowed From: wlam@MIT.EDU Message-ID: <9505041637.AA06666@m33-222-3.MIT.EDU> To: /CN=robots/@nexor.co.uk Subject: ditto... Status: RO Content-Length: 12 unsubscribe From /CN=robots-errors/@nexor.co.uk Thu May 4 21:59:44 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 4 May 1995 22:06:14 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 4 May 1995 21:59:44 +0100 Date: Thu, 4 May 1995 21:59:44 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:075070:950504205946] Content-Identifier: Another (036)... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 21:59:44 +0100; Alternate-Recipient: Allowed From: Atif Ahmad Khan <aak2@Ra.MsState.Edu> Message-ID: <199505042058.PAA06365@Ra.MsState.Edu> To: /CN=robots/@nexor.co.uk Subject: Another $50 challenge ! X-Mailer: ELM [version 2.4 PL22] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 399 If you can send me a script/program/utility that can be run through cron and can submit some data to the following form and get me the results, you win $50. http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/example-1.html I will try to monitor incoming mail after every hour and will send a message to the mailing list as soon as we have a winner. Atif Khan aak2@ra.msstate.edu From /CN=robots-errors/@nexor.co.uk Thu May 4 22:11:45 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 4 May 1995 22:15:46 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 4 May 1995 22:11:45 +0100 Date: Thu, 4 May 1995 22:11:45 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:077630:950504211147] Content-Identifier: unsubscribe Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 4 May 1995 22:11:45 +0100; Alternate-Recipient: Allowed From: LJLJ@aol.com Message-ID: <950504171046_106926701@aol.com> To: /CN=robots/@nexor.co.uk Subject: unsubscribe Status: RO Content-Length: 12 unsubscribe From /CN=robots-errors/@nexor.co.uk Fri May 5 04:39:49 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 5 May 1995 04:53:26 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 5 May 1995 04:39:49 +0100 Date: Fri, 5 May 1995 04:39:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:142920:950505034020] Content-Identifier: Rash of unsub... 
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 5 May 1995 04:39:49 +0100; Alternate-Recipient: Allowed From: " (Dave Balderstone)" <balderd@crocus.sasknet.sk.ca> Message-ID: <v02110102abcf594fd0a7@[142.165.5.138]> To: /CN=robots/@nexor.co.uk Subject: Rash of unsubscribes X-Sender: balderd@mailhost.sasknet.sk.ca Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Status: RO Content-Length: 921 I have not sent any "unsubscribe" messages to the list, but my attempts at convincing the listserver to allow me to unsubscribe have been futile. I am closing this account in 24 hours, so messages will simply bounce back. C'est la vie. Goodbye. I'm commenting on this just to say that as an experienced user, this listserver appears to have some serious problems. It is bouncing my *repeated* requests to unsubscribe (says I am not a member) yet keeps sending me messages. I can understand others' frustrations. Dave Balderstone (new address: balderstone@producer.com) Dave Balderstone, Manager Business Analysis | balderd@producer.com Western Producer Publications | -------------------- 2310 Millar Ave, Saskatoon, Canada S7K 2C4 | OR Voice 306-665-3545, Fax 306-665-9614 | 75211.3630@compuserve.com -------------------------------------------------------------------------- From /CN=robots-errors/@nexor.co.uk Fri May 5 07:52:59 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 5 May 1995 07:59:09 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 5 May 1995 07:52:59 +0100 Date: Fri, 5 May 1995 07:52:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:159710:950505065300] Content-Identifier: Results of th... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 5 May 1995 07:52:59 +0100; Alternate-Recipient: Allowed From: Atif Ahmad Khan <aak2@Ra.MsState.Edu> Message-ID: <199505050606.BAA11080@Ra.MsState.Edu> To: /CN=robots/@nexor.co.uk Subject: Results of the 2nd challenge ! X-Mailer: ELM [version 2.4 PL22] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 191 I received only 1 solution and that too was from Victor A. Parada (vparada@inf.utfsm.cl). It works wonderfully. I'll be back soon with a third challenge :) Atif Khan aak2@ra.msstate.edu From /CN=robots-errors/@nexor.co.uk Fri May 5 09:57:59 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 5 May 1995 10:02:17 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 5 May 1995 09:57:59 +0100 Date: Fri, 5 May 1995 09:57:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:191960:950505085800] Content-Identifier: Amusing robot... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 5 May 1995 09:57:59 +0100; Alternate-Recipient: Allowed From: m.koster@nexor.co.uk Message-ID: <"19193 Fri May 5 09:57:40 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Amusing robots.txt post - comp.infosystems.www.providers #17113 Status: RO Content-Length: 2225 For those who missed it: In article <3o8po7$g7l@news.cerf.net>, paulp@nic.cerf.net (Paul Phillips) writes: |> Found on comp.security.unix. Especially note his last line. 
|> |> -PSP |> |> From dave@maths.newcastle.edu.au Wed May 3 13:38:38 PDT 1995 |> Newsgroups: comp.security.unix |> Subject: Re: httpd security |> Message-ID: <3o1doa$hog@seagoon.newcastle.edu.au> |> From: dave@maths.newcastle.edu.au (David M. Williams) |> Date: Mon, 1 May 95 02:32:58 BST |> Organization: The University of Newcastle |> Lines: 41 |> |> kevin kohn (ess2426) wrote: |> : I've noticed several httpd request coming in requesting this file: |> |> : robots.txt |> |> : Does anyone know if this is a possible file inserted or created by a |> : hack attempt? The first time is saw it, I really didn't pay attention. |> : But I've had atleast 10 requests from about 4 different sources for this |> : file. |> |> Yes - I would like to know the answer to this question also. I have had |> hundreds of requests for this file on my httpd server. Eventually I created |> the following file robots.txt ... |> |> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |> So many people seem to want to access the file "robots.txt" that I thought |> I'd create one for people to look at. |> |> Get a life pal - you'll never make a hacker - and I'll always be a step ahead |> of you. |> |> Regards, |> System Manager |> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |> |> Since I did this the number of requests has been dramatically reduced! |> |> Regards, |> David Williams |> |> -- |> .---. .----------- |> / \ __ / ------ |> / / \( )/ ----- David M. Williams |> ////// ' \/ ` --- System Manager |> //// / // : : --- Computing Services |> // / / /` '-- University of Newcastle |> // //..\\ dave@maths.newcastle.edu.au |> ----UU----UU------------------------------------------------- |> '//||\\` |> ''`` -- Martijn __________ Internet: m.koster@nexor.co.uk WWW: http://web.nexor.co.uk/users/mak/mak.html From /CN=robots-errors/@nexor.co.uk Sat May 6 02:51:39 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Sat, 6 May 1995 02:53:50 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sat, 6 May 1995 02:51:39 +0100 Date: Sat, 6 May 1995 02:51:39 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:051140:950506015141] Content-Identifier: Re: Amusing r... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sat, 6 May 1995 02:51:39 +0100; Alternate-Recipient: Allowed From: " (James Burton)" <JamesB@werple.mira.net.au> Message-ID: <209f218e.222e0-JamesB@ArtWorks.mira.net.au> To: /CN=robots/@nexor.co.uk In-Reply-To: <"19193 Fri May 5 09:57:40 1995"@nexor.co.uk> Subject: Re: Amusing robots.txt post - comp.infosystems.www.providers #17113 Reply-To: JamesB@werple.mira.net.au X-Mailer: //\\miga Electronic Mail (AmiElm 5.42) Organization: Melbourne ArtWorks Status: RO Content-Length: 1756 > |> kevin kohn (ess2426) wrote: > |> : I've noticed several httpd request coming in requesting this file: > |> > |> : robots.txt > |> > |> : Does anyone know if this is a possible file inserted or created by a > |> : hack attempt? The first time is saw it, I really didn't pay attention. > |> : But I've had atleast 10 requests from about 4 different sources for this > |> : file. > |> > |> Yes - I would like to know the answer to this question also. I have had > |> hundreds of requests for this file on my httpd server. Eventually I created > |> the following file robots.txt ... 
> |> > |> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > |> So many people seem to want to access the file "robots.txt" that I thought > |> I'd create one for people to look at. > |> > |> Get a life pal - you'll never make a hacker - and I'll always be a step ahead > |> of you. > |> > |> Regards, > |> System Manager > |> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > |> > |> Since I did this the number of requests has been dramatically reduced! You guys may laugh. But I did almost exactly this when I first set up my HTTPd (at a place I worked at until Jan95). I kept getting these requests for a strangely named non-existent file. I too wondered whether somebody was trying to exploit a 'well known' security hole. I eventually created the file with a message asking why the hell these people were trying to access the file. Strangely enough I never got a reply :-) James -- James Burton | EMail: JamesB@werple.mira.net.au | Latrobe University WWW : http://www.cs.latrobe.edu.au/~burton/ | Melbourne, Australia From /CN=robots-errors/@nexor.co.uk Tue May 9 11:53:59 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 9 May 1995 11:57:46 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 9 May 1995 11:53:59 +0100 Date: Tue, 9 May 1995 11:53:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:019920:950509105400] Content-Identifier: any comments ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 9 May 1995 11:53:59 +0100; Alternate-Recipient: Allowed From: " (Reinier Post)" <reinpost@win.tue.nl> Message-ID: <199505091053.MAA12536@wsinis10.win.tue.nl> To: /CN=robots/@nexor.co.uk Subject: any comments on AUtoWinNet? Reply-To: reinpost@win.tue.nl X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Status: RO Content-Length: 707 Just thought you might be interested in hearing about a new shareware tool for MS-Windows: AutoWinNet; it will graze the Internet while you're playing tennis. Quote: 'Now you can have unlimited access to TERABYTES of Internet files without forcing you to sit in front of your computer for hours, upload files, hammer at FTP sites that are busy, get updates to your favorite programs automatically - direct from their support site.' I found this at http://www.computek.net:80/physics/ and there is no mention of it in the list of robots, http://web.nexor.co.uk/mak/doc/robots/active.html Any comments? (I'm hoping for another of Martijn's outbursts ;-) -- Reinier Post reinpost@win.tue.nl From /CN=robots-errors/@nexor.co.uk Tue May 9 18:33:57 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 9 May 1995 18:37:02 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 9 May 1995 18:33:57 +0100 Date: Tue, 9 May 1995 18:33:57 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:112980:950509173358] Content-Identifier: Re: any comme...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 9 May 1995 18:33:57 +0100; Alternate-Recipient: Allowed From: Billy Barron <billy@utdallas.edu> Message-ID: <199505091732.MAA29086@utdallas.edu> To: reinpost@win.tue.nl Cc: /CN=robots/@nexor.co.uk In-Reply-To: <199505091053.MAA12536@wsinis10.win.tue.nl> Subject: Re: any comments on AUtoWinNet? X-WWW-Page: http://www.utdallas.edu/acc/billy.html X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 842 In reply to /CN=robots-errors/@nexor.co.uk's message: > >Just thought you might be interested in hearing about a new shareware tool >for MS-Windows: AutoWinNet; it will graze the Internet while you're playing >tennis. Quote: 'Now you can have unlimited access to TERABYTES of Internet >files without forcing you to sit in front of your computer for hours, upload >files, hammer at FTP sites that are busy, get updates to your favorite >programs automatically - direct from their support site.' > >I found this at > > http://www.computek.net:80/physics/ > >Any comments? (I'm hoping for another of Martijn's outbursts ;-) > Right in my backyard too (oh boy!). It looks no more dangerous than FTP Mirror (if it was easier to use). I think all it does is grab files at off hours. I don't see anything in it that looks like a robot. Billy From /CN=robots-errors/@nexor.co.uk Wed May 10 12:20:55 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 10 May 1995 12:26:39 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 10 May 1995 12:20:55 +0100 Date: Wed, 10 May 1995 12:20:55 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:015750:950510112057] Content-Identifier: Re: any comme... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 10 May 1995 12:20:55 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"1551 Wed May 10 12:19:54 1995"@nexor.co.uk> To: Billy Barron <billy@utdallas.edu> Cc: reinpost@win.tue.nl, /CN=robots/@nexor.co.uk In-Reply-To: <199505091732.MAA29086@utdallas.edu> Subject: Re: any comments on AUtoWinNet? Status: RO Content-Length: 643 In message <199505091732.MAA29086@utdallas.edu>, Billy Barron writes [about AutoWinNet]: > Right in my backyard too (oh boy!). It looks no more dangerous than > FTP Mirror (if it was easier to use). I think all it does is grab > files at off hours. I don't see anything in it that looks > like a robot. I couldn't actually get it to work (it complains about not being registered when trying WWW access, and ignores my FTP requests :-). Anyway, it looks like it only does single files/documents. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu May 11 16:39:20 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 11 May 1995 16:43:23 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 11 May 1995 16:39:20 +0100 Date: Thu, 11 May 1995 16:39:20 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:296490:950511153921] Content-Identifier: Please let me...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 11 May 1995 16:39:20 +0100; Alternate-Recipient: Allowed From: Judy Feder <Judy_Feder@cq.com> Message-ID: <9505111727.AA3827@worldcom-18.worldcom.com> To: /CN=robots/@nexor.co.uk Subject: Please let me know how I can subscribe to this group Mime-Version: 1.0 Content-Type: Text/Plain Status: RO Content-Length: 75 Thanks! Judith Feder Director of Market Relations ConQuest Software, Inc. From /CN=robots-errors/@nexor.co.uk Tue May 23 09:22:17 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 23 May 1995 09:28:23 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 23 May 1995 09:22:17 +0100 Date: Tue, 23 May 1995 09:22:17 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:277250:950523082220] Content-Identifier: url mirror Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 23 May 1995 09:22:17 +0100; Alternate-Recipient: Allowed From: Loic Dachary <loic@afp.com> Message-ID: <199505230812.EAA12359@pinot.par.afp.com> To: /CN=robots/@nexor.co.uk Subject: url mirror Status: RO Content-Length: 1628 Hi, I'm writing a robot to maintain my emacs-w3 cache directory in synch with the net. The emacs-w3 cache is a file tree that maps the url name space on a file name space. Here is a short example that can give you an idea. w3cache-1/loic/http/org/w3/www/hypertext/DataSources/ w3cache-1/loic/http/org/w3/www/hypertext/DataSources/bySubject/ w3cache-1/loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html w3cache-1/loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html.hdr w3cache-1/loic/http/org/w3/www/Overview.html.hdr w3cache-1/loic/http/org/cnidr/ w3meta/loic/http/org/w3/www/hypertext/DataSources/ w3meta/loic/http/org/w3/www/hypertext/DataSources/bySubject/ w3meta/loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html.header w3meta/loic/http/org/w3/www/Overview.html.error w3meta/loic/http/org/cnidr/ To prevent obsolescence of this tree I need a robot that will check if each file still mimics a valid URL or not. I've written a small perl script that does the job, using the libwww-perl library. I'm writing the robot exclusion code now. Please keep me informed if you know about a robot that would be able to map the URL name space into the file name space, gracefully handling the errors (redirection, time out etc.). Ideally I'd like a file system that would transparently do this for me. I'd be able to use the thousand tools I have to manipulate files instead of the few that know about URLs. *sigh* Thanks, Loic P.S. I will be activating my robot every week or every two weeks. I guess that it will do an average of 500 requests each time it is run. 
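The URL-to-file mapping Loic shows above (scheme, then the host components reversed, then the path) is mechanical to reproduce; here is a minimal sketch in Python, with the layout and the default leaf name taken as assumptions from his example listing:

    # Sketch: map a URL onto the emacs-w3 cache layout shown above,
    # i.e. <user>/<scheme>/<reversed host components>/<path>.
    # Ports, query strings and character escaping are ignored for brevity.
    import os
    from urllib.parse import urlparse

    def url_to_cache_path(url, user='loic'):
        parts = urlparse(url)
        host = parts.hostname.split('.')    # www.w3.org -> ['www', 'w3', 'org']
        host.reverse()                      # -> ['org', 'w3', 'www']
        path = parts.path.lstrip('/') or 'Overview.html'   # assumed default leaf
        return os.path.join(user, parts.scheme, *host, path)

    print(url_to_cache_path('http://www.w3.org/hypertext/DataSources/bySubject/Libraries.html'))
    # loic/http/org/w3/www/hypertext/DataSources/bySubject/Libraries.html

Prepend the cache root (w3cache-1 in the listing above) to get the on-disk path; checking whether each cached file still corresponds to a valid URL is then just the inverse walk over the tree.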
From /CN=robots-errors/@nexor.co.uk Tue May 23 09:54:09 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 23 May 1995 11:50:34 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 23 May 1995 09:54:09 +0100 Date: Tue, 23 May 1995 09:54:09 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:284990:950523085413] Content-Identifier: Re: url mirror Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 23 May 1995 09:54:09 +0100; Alternate-Recipient: Allowed From: " (Reinier Post)" <reinpost@win.tue.nl> Message-ID: <199505230840.KAA05341@wsinis10.win.tue.nl> To: " (Loic Dachary)" <loic@afp.com> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <199505230812.EAA12359@pinot.par.afp.com> Subject: Re: url mirror Reply-To: reinpost@win.tue.nl X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Status: RO Content-Length: 234 You (Loic Dachary) write: > Hi, > > I'm writing a robot to maintain my emacs-w3 cache directory in >synch with the net. Why don't you use CERN httpd's proxy cache and some existing robot? -- Reinier Post reinpost@win.tue.nl From /CN=robots-errors/@nexor.co.uk Fri May 26 00:22:55 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 26 May 1995 00:25:38 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 26 May 1995 00:22:55 +0100 Date: Fri, 26 May 1995 00:22:55 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:182460:950525232257] Content-Identifier: Indexing non-... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 26 May 1995 00:22:55 +0100; Alternate-Recipient: Allowed From: " (Tim Bray)" <tbray@opentext.com> Message-ID: <m0sElGL-0001wpC@giant.mindlink.net> To: /CN=robots/@nexor.co.uk Cc: lauren@sqwest.bc.ca Subject: Indexing non-ASCII characters in Web pages X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Version 2.0.3 Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Status: RO Content-Length: 1948 One of the problems that comes up in building/maintaining our Web Index (or any other) is the proper handling of non-ASCII characters. Right now, such characters are de facto stored in a fairly random assortment of 8-bit ISO-Latin codes, HTML/SGML entity references, and then there's a certain amount of JIS for Japanese, which may be New-JIS, Shift-JIS, or EUC-JIS (some of which are easy to handle, others of which conflict with ISO-Latin, sigh...). Anyhow, it seems that at indexing time, when you've copied in someone else's page, you ought to convert all characters to some canonical form before indexing them, to make the lives of people searching your index easier. One would like this form to be: (a) compact (b) a stable vendor-neutral standard (c) easy to type in (d) workable in combination with popular browser technology The most standard way would be to use entities for everything, which conflicts with (a) and (c), and the current repertoire is not comprehensive, and if they can't agree on basic HTML extensions, why should we expect them to agree on hard/political stuff like char entities? The most compact way would be to just use ISO Latin, and char.
entities for those *really* hard characters (you'll always need some if only for >, &, and so on). As for which form best satisfies (d), all bets are off; I'd like to lock the vendors in a room and not provide a toilet until they'd agreed on how someone should type the French word for summer (&eacute;t&eacute;, été, take your pick) into (a) an HTML editor and (b) an HTML form, and what the client sends up the pipe when said form is GET/POSTed. Unicode may be the way to go, but for now they ain't no real software and furthermore a lot of people in Japan are against it. Don't suppose anyone out there happened to solve this one in his/her spare time... Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From /CN=robots-errors/@nexor.co.uk Tue May 30 08:46:59 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 30 May 1995 08:47:14 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 30 May 1995 08:46:59 +0100 Date: Tue, 30 May 1995 08:46:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:204740:950530074704] Content-Identifier: Admin Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 30 May 1995 08:46:59 +0100; Alternate-Recipient: Allowed From: World Wide Web <www@nexor.co.uk> Message-ID: <"20467 Tue May 30 08:46:41 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Admin Status: RO Content-Length: 134 I'm switching this list to moderated for now, to prevent all these subscription requests etc. If you have any problems let me know. From /CN=robots-errors/@nexor.co.uk Fri Jun 9 10:48:31 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 9 Jun 1995 10:48:51 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 9 Jun 1995 10:48:31 +0100 Date: Fri, 9 Jun 1995 10:48:31 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: m.koster@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294190:950609094833] Content-Identifier: HTTP library Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 9 Jun 1995 10:48:31 +0100; Alternate-Recipient: Allowed From: Luc Ihli <Luc.Ihli@cginn.cgs.fr> Message-ID: <"29345 Fri Jun 9 10:47:22 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: HTTP library Status: RO Content-Length: 392 Can someone tell where I could find a PERL or C HTTP library. I would like to have a function to issue HTTP requests to HTTP servers. It's urgent. Thanks.
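Short of a full library, the request itself is small enough to do by hand; as a sketch of what such a library function boils down to, here is a minimal HTTP/1.0 GET over a raw socket, in Python for illustration:

    # Sketch: the core of an "issue an HTTP request" helper, speaking
    # HTTP/1.0 as servers of the day did. Returns status line, headers
    # and body as raw bytes; no redirects, no error handling.
    import socket

    def http_get(host, path='/', port=80):
        s = socket.create_connection((host, port))
        s.sendall(('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)).encode('ascii'))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:            # server closes the connection when done
                break
            chunks.append(data)
        s.close()
        return b''.join(chunks)

    print(http_get('example.com')[:120])

For Perl, the libwww-perl library mentioned earlier in this archive wraps exactly this kind of function.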
From /CN=robots-errors/@nexor.co.uk Fri Jun 9 14:40:59 1995
Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 9 Jun 1995 14:41:13 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 9 Jun 1995 14:40:59 +0100 Date: Fri, 9 Jun 1995 14:40:59 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: m.koster@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:051970:950609134100] Content-Identifier: Libraries Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 9 Jun 1995 14:40:59 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"5165 Fri Jun 9 14:39:50 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Libraries Status: RO Content-Length: 1192

------- Forwarded Message Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 9 Jun 1995 10:48:51 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 9 Jun 1995 10:48:31 +0100 Date: Fri, 9 Jun 1995 10:48:31 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: m.koster@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:294190:950609094833] Content-Identifier: HTTP library Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 9 Jun 1995 10:48:31 +0100; Alternate-Recipient: Allowed From: Luc Ihli <Luc.Ihli@cginn.cgs.fr> Message-ID: <"29345 Fri Jun 9 10:47:22 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: HTTP library

Can someone tell me where I could find a Perl or C HTTP library? I would like to have a function to issue HTTP requests to HTTP servers. It's urgent. Thanks.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Luc Ihli Luc.Ihli@cginn.cgs.fr Cap Gemini Innovation Tel: +33 76.76.47.37 7 Chemin du Vieux Chene Fax: +33 76.76.47.48 ZIRST 4206 38942 Meylan CEDEX FRANCE ------- End of Forwarded Message
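Tim Bray's canonicalization problem above also maps directly onto code. A minimal sketch, assuming ISO-8859-1 for raw 8-bit bytes and using today's Python standard library (html, unicodedata, and Unicode normalization itself all postdate this discussion): fold Latin-1 bytes and SGML entity references to a single form before the text reaches the index.

    import html
    import unicodedata

    def canonicalize(text):
        # Accept raw 8-bit bytes (assumed ISO-8859-1) or a decoded string,
        # expand entity references, and normalize to NFC so that every
        # spelling of a character indexes identically.
        if isinstance(text, bytes):
            text = text.decode("iso-8859-1")
        text = html.unescape(text)  # &eacute; -> é, &amp; -> &, ...
        return unicodedata.normalize("NFC", text)

    # Both spellings of the French word for summer collapse to one form.
    assert canonicalize(b"\xe9t\xe9") == canonicalize("&eacute;t&eacute;")

The JIS variants Tim mentions need a charset-detection step in front of this, which is the genuinely hard part he is sighing about.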
From /CN=robots-errors/@nexor.co.uk Mon Jun 12 09:53:01 1995
Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 12 Jun 1995 09:55:20 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 12 Jun 1995 09:53:01 +0100 Date: Mon, 12 Jun 1995 09:53:01 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:077150:950612085303] Content-Identifier: Robot Announce Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 12 Jun 1995 09:53:01 +0100; Alternate-Recipient: Allowed From: " (Yoshihiko HAYASHI)" <hayashi@nttnly.isl.ntt.jp> Message-ID: <199506120855.RAA11047@nttnly.isl.ntt.jp> To:
/CN=robots/@nexor.co.uk Cc: titan-admin@nttnly.isl.ntt.jp Subject: Robot Announce Return-Path: <hayashi@nttnly.isl.ntt.jp> Status: RO Content-Length: 609

Greetings. We are going to run a robot program named TITAN/0.1 in order to collect text (html/plain) files from the WWW space. Our primary purpose is to develop an advanced method for analyzing and indexing the documents on the WWW. The robot is based on a perl library, libwww-perl (http://www.ics.uci.edu/WebSoft/libwww-perl/), which is being developed by the Arcadia Project at UCI. If the robot, by chance, visits your site, please let him in. Comments and suggestions should go to "titan-admin@nttnly.isl.ntt.jp". Thank you. -- Yoshihiko HAYASHI NTT Information and Communication Systems Laboratories

From /CN=robots-errors/@nexor.co.uk Wed Jun 14 02:05:57 1995
Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 14 Jun 1995 02:11:58 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 14 Jun 1995 02:05:57 +0100 Date: Wed, 14 Jun 1995 02:05:57 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:158720:950614010600] Content-Identifier: how many glob... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 14 Jun 1995 02:05:57 +0100; Alternate-Recipient: Allowed From: " (Paul Francis)" <francis@cactus.slab.ntt.jp> Message-ID: <9506140103.AA26385@cactus.slab.ntt.jp> To: /CN=robots/@nexor.co.uk Subject: how many global robots do we need? Status: RO Content-Length: 449

Hi, Like a number of you out there, I would also like to get a list of as many of the world's URLs as I can. But, since a number of robots are already out collecting this stuff, it seems silly to add yet another. So, what I'm wondering is: 1. Is there anybody out there with a large list of URLs willing to share it? 2. If so, is anybody set up to send out periodic updates of their lists (what's new, what's obsolete)? Thanks, PF

From /CN=robots-errors/@nexor.co.uk Wed Jun 14 05:23:49 1995
Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 14 Jun 1995 05:26:16 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 14 Jun 1995 05:23:49 +0100 Date: Wed, 14 Jun 1995 05:23:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:190970:950614042351] Content-Identifier: Re: how many ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 14 Jun 1995 05:23:49 +0100; Alternate-Recipient: Allowed From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov> Message-ID: <9506140422.AA16671@rbse.jsc.nasa.gov> To: /CN=robots/@nexor.co.uk Subject: Re: how many global robots do we need? X-Sender: eichmann@192.88.42.10 MIME-version: 1.0 Content-type: text/plain; charset="us-ascii" Content-transfer-encoding: 7BIT Status: RO Content-Length: 1576

At 2:05 AM 6/14/95 +0100, (Paul Francis) wrote: >Like a number of you out there, I would also like to get a >list of as many of the world's URLs as I can. But, since >a number of robots are already out collecting this stuff, >it seems silly to add yet another. So, what I'm wondering >is: > >1. Is there anybody out there with a large list of URLs > willing to share it? Our current list is rather dated. We have version two of the RBSE Spider just about ready for prime-time - we're currently porting to Postgres95 to see if we can achieve better stability of the database engine. When we get the port done, we'll be building up an incrementally updateable index. >2. If so, is anybody set up to send out periodic updates > of their lists (what's new, what's obsolete)? One of the features of version 2 will be an interface allowing folks to execute raw queries - supporting pretty much anything you wish to interrogate the database about. More details when we go public with the interface. This will include the ability to query as of any given time/date in the life of the Spider, using Postgres' temporal features. - Dave ----------- David Eichmann Asst. Prof. / RBSE Director of R & D Web: http://ricis.cl.uh.edu/eichmann/ Software Engineering Program Phone: (713) 283-3875 University of Houston - Clear Lake fax: (713) 283-3869 Box 113, 2700 Bay Area Blvd. Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html
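Paul's second question — periodic what's-new / what's-obsolete updates — comes down to set arithmetic over successive URL snapshots. A sketch under the assumption of plain one-URL-per-line snapshot files from two robot runs (the file names below are made up):

    def diff_url_lists(old_path, new_path):
        # Load two one-URL-per-line snapshots and report additions
        # (new since the old snapshot) and obsoletions (gone from it).
        with open(old_path) as f:
            old = {line.strip() for line in f if line.strip()}
        with open(new_path) as f:
            new = {line.strip() for line in f if line.strip()}
        return sorted(new - old), sorted(old - new)

    # Hypothetical snapshot files from two runs of the same robot.
    added, removed = diff_url_lists("urls-0601.txt", "urls-0614.txt")
    print(len(added), "new URLs;", len(removed), "obsolete URLs")

Publishing just these two deltas, rather than the full list, is exactly the "what's new, what's obsolete" feed Paul is asking for.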
From /CN=robots-errors/@nexor.co.uk Wed Jun 14 09:01:27 1995
Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 14 Jun 1995 09:05:15 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 14 Jun 1995 09:01:27 +0100 Date: Wed, 14 Jun 1995 09:01:27 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:219190:950614080131] Content-Identifier: Re: how many ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 14 Jun 1995 09:01:27 +0100; Alternate-Recipient: Allowed From: " (Paul Francis)" <francis@cactus.slab.ntt.jp> Message-ID: <9506140759.AA03235@cactus.slab.ntt.jp> To: /CN=robots/@nexor.co.uk, eichmann@rbse.jsc.nasa.gov Subject: Re: how many global robots do we need?
Status: RO Content-Length: 706

> > One of the features of version 2 will be an interface allowing folks to > execute raw queries - supporting pretty much anything you wish to > interrogate the database about. More details when we go public with the > interface. This will include the ability to query as of any given > time/date in the life of the Spider, using Postgres' temporal features. > So, someone will be able to make a query of the sort: "get me all URLs entered into the database since Wed Jun 14 13:12:53 1995"? (understanding, of course, that you may not support natural language query :-) When will such a thing be available, and how many URLs do you expect it to contain at steady state? Thanks, PF

From /CN=robots-errors/@nexor.co.uk Fri Jun 16 04:31:53 1995
Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 16 Jun 1995 04:41:31 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 16 Jun 1995 04:31:53 +0100 Date: Fri, 16 Jun 1995 04:31:53 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:074660:950616033203] Content-Identifier: Re: how many ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 16 Jun 1995 04:31:53 +0100; Alternate-Recipient: Allowed From: " (David Eichmann)" <eichmann@rbse.jsc.nasa.gov> Message-ID: <9506160329.AB18212@rbse.jsc.nasa.gov> To: /CN=robots/@nexor.co.uk Subject: Re: how many global robots do we need? X-Sender: eichmann@192.88.42.10 MIME-version: 1.0 Content-type: text/plain; charset="us-ascii" Content-transfer-encoding: 7BIT Status: RO Content-Length: 1796

At 4:59 PM 6/14/95 +0900, Paul Francis wrote: >> >> One of the features of version 2 will be an interface allowing folks to >> execute raw queries - supporting pretty much anything you wish to >> interrogate the database about. More details when we go public with the >> interface. This will include the ability to query as of any given >> time/date in the life of the Spider, using Postgres' temporal features. >> > >So, someone will be able to make a query of the sort: > > "get me all URLs entered into the database since Wed Jun 14 13:12:53 1995"? > >(understanding, of course, that you may not support natural >language query :-) Yes, as well as "get me all URLs in the database that have changed since Wed Jun 14 13:12:53 1995" among other things. >When will such a thing be available, and how many URLs do you >expect it to contain at steady state? We'll be going public as soon as we get our database engine stable. Postgres 4.2 has been proving problematic for us. Postgres95 appears to have resolved our problems, but uses a SQL variant instead of TQUEL as its programming language. We haven't decided just how wide a net to throw. We're planning on attempting to generate coverage of one or more conceptual areas (e.g. computer science / software engineering) rather than making a run at Lycos' total URL count. - Dave ----------- David Eichmann Asst. Prof. / RBSE Director of R & D Web: http://ricis.cl.uh.edu/eichmann/ Software Engineering Program, Box 113 Phone: (713) 283-3875 University of Houston - Clear Lake fax: (713) 283-3869 2700 Bay Area Blvd. Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html
Email: eichmann@rbse.jsc.nasa.gov Houston, TX 77058 or: eichmann@cl.uh.edu RBSE on the Web: http://rbse.jsc.nasa.gov/eichmann/rbse.html From /CN=robots-errors/@nexor.co.uk Fri Jun 16 22:59:54 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 16 Jun 1995 23:04:20 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 16 Jun 1995 22:59:54 +0100 Date: Fri, 16 Jun 1995 22:59:54 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: yu@CS.UCLA.EDU, talman@sics.se, mats@medialab.se, fred@nvg.unit.no, christen.krogh@filosofi.uio.no, Svein.Parnas@bibhs.no, kruijff@cs.utwente.nl, www@win.tue.nl, floris.wiesman@MI.rulimburg.nl, fb@di.unito.it, andrea@dei.unipd.it, sadun@hp2.sm.dsi.unimi.it, musella@hp2.sm.dsi.unimi.it, esinto@nettuno.csr.unibo.it, altini@ssi06.ssi.it, coc0175@iperbole.bologna.it, Sean.McGrath@UCG.ie, rich%dcre.leeds.ac.uk%scs.leeds.ac.uk@mail-relay.ja.net, Z.Wang@cs.ucl.ac.uk, M.Levy@cs.ucl.ac.uk, sac@compsci.stirling.ac.uk, phrrngtn%cs.st-andrews.ac.uk@cs.st-andrews.ac.uk, R.J.Hollom@ecs.southampton.ac.uk, A.J.Jackson@qmw.ac.uk, viral@dcs.qmw.ac.uk, marques@vax.oxford.ac.uk, s.j.masterton@open.ac.uk, username%cerfnet.com@nsfnet-relay.ac.uk, murrayb%hume.icis.qut.edu.au@nsfnet-relay.ac.uk, line%ginko.cecer.army.mil@nsfnet-relay.ac.uk, kurt_indermaur%intuit.com@nsfnet-relay.ac.uk, isamu%hopf.dnai.com@nsfnet-relay.ac.uk, gorme%ifi.uio.no@nsfnet-relay.ac.uk, bstarr%liege.ICS.UCI.EDU@nsfnet-relay.ac.uk, oklee@computer-science.nottingham.ac.uk, darren.sanders@northumbria.ac.uk, mlvanbie@undergrad.math.uwaterloo.ca, MAGERB@wl.aecl.ca, lemieuse@ERE.UMontreal.CA, lehtin@ee.ualberta.ca, js@bison.lif.icnet.uk, joel@ai.iit.nrc.ca, jfg <jfg%atl.hp.com@hpl.hewlett-packard.co.uk>, howardk@cyberstore.ca, hachemi@ims.uwindsor.ca, gaines@fsc.cpsc.ucalgary.ca, COWANP@sci.istc.ca, alex@alexu.demon.co.uk, twalker%labatt.com@mail.uunet.ca, M.Wooldridge@doc.manchester-metropolitan-university.ac.uk, brownc@computer-science.manchester.ac.uk, csc177@central1.lancaster.ac.uk, lmjm@doc.imperial.ac.uk, kk5@doc.imperial.ac.uk, CL0094P@pegasus.huddersfield.ac.uk, ibs@dcs.edinburgh.ac.uk, gjmooney@de-montfort.ac.uk, David.Halls@computer-lab.cambridge.ac.uk, omr1000@central-unix-service.cambridge.ac.uk, a.rodrigues%geography.bbk.ac.uk%ccs.bbk.ac.uk%mail1.ccs.bbk.ac.uk@mailh.ccs.bbk.ac.uk , zandy@cedar.buffalo.edu, wyliamh@sco.COM, wozz@chewy.wookie.net, wlee@chaph.usc.edu, wfocbt@ix.netcom.com, weber@eit.COM, warlock@bnr.ca, wagner@panix.com, vparada@inf.utfsm.cl, verweb@verity.com, username@cerfnet.com, ulla@stupi.se, twoods@blue.weeg.uiowa.edu, tushar@watson.ibm.com, tsmith@cmp.com, treadwell@maine.com, tong@numbat.cs.rmit.edu.au, tomd@bga.com, tomasic@almaden.ibm.com, tom@jax-inter.net, tom@iiidns.iii.org.tw, tom@amnh.org, tobor@psycco.msae.wisc.edu, tmorling@rtm.com, tmaslen@verity.com, tm3k+@andrew.cmu.edu, tkoyt@theseas.ntua.gr, tjc4@cornell.edu, TIM.COOK@DOS.US-STATE.GOV, tim@funbox.demon.co.uk, tierno.bah@his.com, thrift@osage.csc.ti.com, thomas.schuhberger@plus.at, thom@nickel.ucs.indiana.edu, therion@fiat.gslis.utexas.edu, tgargiulo@VNET.IBM.COM, telbert@dcs118.dcsnod.uwf.edu, tbray@opentext.com, tamaki@nttlsl.ntt.jp, taluskie@utpapa.ph.utexas.edu, swneal@alpha.ncsc.mil, suther@suntid.bnl.gov, support@jfdi.demon.co.uk, sugai@okilab.oki.co.jp, ssutarwala@cc1.dttus.com, ssandke@Verity.COM, srinivas@cs.ualberta.ca, slobin@feast.fe.msk.ru, skipmoon@chattanooga.net, skg@shiva.pls.com, SHUGHES@jpl-pds.jpl.nasa.gov, shen@ISI.EDU, 
shah2170@css1s0.engr.ccny.cuny.edu, sggill@neumann.uwaterloo.ca, sfcallic@magneto.csc.ncsu.edu, seb1@gate.net, schwartz@latour.cs.colorado.edu, sbfrs@attila.scob.alaska.edu, sato@soum.co.jp, satishk@paul.rutgers.edu, sanyal@deathstar.cris.com, samy@ru.cs.gmr.com, sallystan@aol.com, s9010727@arcadia.cs.rmit.edu.au, rustw@utw.com, rudi@knoware.nl, rtong@verity.com, rsmith@proteus.arc.nasa.gov, Rshearer@cris.com, rs_butner@ccmail.pnl.gov, rross@supernet.net, root@sabalan.uplift.fr, ronz@rio.myra.com, ROMANBP@delphi.com, roland@technet.sg, roger@hazelton.demon.co.uk, rob@iconics.com, rmelo@ncsa.uiuc.edu, rlh@conan.ids.net, rlandon@scruznet.com, rick@vivo.com, rhb@hotsand.att.com, rgh@slc.unisys.com, reveman@arafat.ENET.dec.com, reter@mail.b-2.de.contrib.net, reader@server.blueline.com, rcalder@cfara1.harvard.edu, rc_stratton@ccmail.pnl.gov, R.Sosic@cit.gu.edu.au, pypodima@athena.auth.gr, pvp@intgp1.att.com, pschwar@world.std.com, prs@netcom.com, pp002051@interramp.com, pp001495@interramp.com, pope@qds.com, pmaugust@teleport.com, plato@xs4all.nl, pkowalsk@pipeline.com, philba@microsoft.com, phil@hiccup.demon.co.uk, PED8C@pandora.cc.uottawa.ca, pdn!pdn.paradyne.com!bcutter@uunet.uu.net, page@cod.nosc.mil, ono@n105.is.tokushima-u.ac.jp, omy@San-Jose.ate.slb.com, ohata@sdl.hitachi.co.jp, Oberst@world.std.com, norton@ypn.com, nlehrer@isx.com, nitz@erg.sri.com, nigel@eecs.umich.edu, neutron@swttools.fc.hp.com, narnett@verity.com, naoki@open.rd.nttdata.jp, nadeem@macmillan.com, mvvaut@ib.com, mowens@advtech.uswest.com, mouche@metronet.com, moriya@st.rim.or.jp, mkgray@MIT.EDU, mjoly@insa.insa-lyon.fr, miyata@musys.com, mikeo@world.std.com, mikeg@innovation.com, Michael.Mauldin@NL.CS.CMU.EDU, mhorne@grover.lab.eds.co.nz, mholling@castaway.cc.uwf.edu, meyerg@viper.tcs-inc.com, medlar@ua.com, mdmays@server.iadfw.net, mdg1@hidec01.engr.uark.edu, mcbryan@cs.colorado.edu, marym@Finesse.COM, marty@Ahip.Getty.EDU, martinb@ix.netcom.com, Mark_M_Lee@ccm.ch.intel.com, mark_ferguson@msmgate.mrg.uswest.com, mark@darwin.sfbr.org, mario.brassard@ift.ulaval.ca, mak@bsjcube.bsj.com, lwarne01@ccsf.cc.ca.us, luciw@starwave.com, loofbour@news.cis.ohio-state.edu, lomax@smi.med.pitt.edu, loic@afp.com, logan@cs.cornell.edu, lists@konishiki.stanford.edu, lgg@cs.brown.edu, lentz@annie.astro.nwu.edu, leefi@microsoft.com, LBURKE@gco5.pb.gov.bc.ca, kvale@ivy.physics.mcmaster.ca, kumarv@apple.com, KUL@FSC.CA, kstamper@cybernetics.net, konop@techunix.technion.ac.il, kojo@ccs.mt.nec.co.jp, kljackso@cerfnet.com, king@hobart.cs.umass.edu, kimoto@people.flab.fujitsu.co.jp, Kelly_Carney@stortek.com, kcoffee@panix.com, kball@Novell.COM, jvisagie@active.co.za, Judy_Feder@cq.com, jswift@timber.infohwy.com, jsteer@BitScout.com, jshakes@cs.washington.edu, jrb@cs.pdx.edu, joshuack@cae.uwm.edu, Jonathan@spokes.demon.co.uk, jonathan@nwnet.net, John.R.R.Leavitt@nl.cs.cmu.edu, joe@clyde.larc.nasa.gov, jmeritt@smtpinet.aspensys.com, jlh@linus.mitre.org, jjj@crasun.cra.com, jjiang@ttl.pactel.com, jjc+@pitt.edu, jherder@southwind.net, Jfeder@aol.com, jessea@trcinc.com, jeff_sylvia@quickmail.truevision.com, jbb@conan.itc.virginia.edu, jar@iapp201.mcom.com, jamesb@werple.mira.net.au, jamesb@optical.fiber.net, jallan@schoolnet.carleton.ca, Ivan_Lindenfeld@pcmailgw.ml.com, its.com!scott@mcs.com, itaru@ulis.ac.jp, ipgroup@clark.net, ihli@cginn.cgs.fr, ic58@jove.acs.unt.edu, hurleyj@netcom.com, hseuping@cs.utexas.edu, hschilling@lerc.nasa.gov, HotShield@aol.com, horsager@wln.com, hornlo@okra.millsaps.edu, hewett@cs.utexas.edu, helper@law.uark.edu, 
hardy@powell.cs.colorado.edu, hajime@st.rim.or.jp, habermann@dow.com, gsk@edgtuws1.qed.qld.gov.au, gregors@edo032pc.pipe.nova.ca, gregg@fly2.berkeley.edu, grayson@char.vnet.net, grahamh@mail.mpx.com.au, gr@pogo.ccd.bnl.gov, gpl53044@uxa.cso.uiuc.edu, gjv@io.org, gfowler@wilkins.iaims.bcm.tmc.edu, garth@pisces.systems.sa.gov.au, frode@toaster.SFSU.EDU, FRITZ@Gems.VCU.EDU, frank@manua.gsfc.nasa.gov, francis@cactus.slab.ntt.jp, flash@cyber.net, finnerty@CapAccess.org, fielding@avron.ICS.UCI.EDU, fcc@agent.com, fbra@sunyit.edu, Ewan@paranoia.demon.co.uk, ew974@nextsun.ins.cwru.edu, essicm@uf4725p02.washingtondc.NCR.COM, eshyjka@dc.isx.com, eoh@hacom.nl, emery@squawfish.fsr.com, eichmann@rbse.jsc.nasa.gov, efinet@insist.com, edjlali@cs.UMD.EDU, dwjurkat@mailbox.syr.edu, duffy@csn.org, duclos@iad.ift.ulaval.ca, dsylvest@clark.net, DSINGH@apsc.com, dolphin@comeng.chungnam.ac.kr, dl@hplyot.obspm.fr, dhart@titanic.cs.umass.edu, detter@databank.com, dcornwal@mail.utexas.edu, dbakin@sybase.com, DaxMan@ix.netcom.com, david.pattarini@fi.gs.com, david@police.tas.gov.au, daveg@fultech.com, Curtisb@caladan.chattanooga.net, CSTEPHEN@us.oracle.com, cohmer@lamar.ColoState.EDU, clv2m@oak.cs.virginia.edu, close@cs.ukans.edu, Chung.Kang.Tsen@ozark.edrc.cmu.edu, chs@Verity.COM, chrisf@pipex.net, chris@cerl.gatech.edu, choy@cs.usask.ca, chess@watson.ibm.com, charless@sco.COM, charles@sw19.demon.co.uk, chang@cs.umd.edu, chang@cam.nist.gov, carter@cs.bu.edu, carro@cs.bu.edu, caadalin@mtu.edu, bynum@CS.WM.EDU, burkhart@tis.andersen.com, BUDDHI@umiami.ir.miami.edu, brycer@priscilla.ultima.org, brinskel@slowpoke.genie.uottawa.ca, bp@cs.washington.edu, bonnie@dev.prodigy.com, bonini@panix.com, bmthomas@ix.netcom.com, billy@utdallas.edu, billbe@chi.shl.com, bicknell@ussenterprise.async.vt.edu, bens@microsoft.com, benjy@benjy.cc.vt.edu, beebee@parc.xerox.com, beck@amb1.ccalmr.ogi.edu, bdtaylor@alias.com, bcutter@gate.net, barrie@scs.unr.edu, barrett@almaden.ibm.com, bal@martigny.ai.mit.edu, atc@ornl.gov, ARGRAY@rivendell.otago.ac.nz, andy@andy.net, andras@is.co.za, amonge@cs.ucsd.edu, amohesky@itg.ti.com, amills@rmplc.co.uk, allsop@swttools.fc.hp.com, allmedia@world.std.com, allied@biddeford.com, allain@waiter.ira.rl.af.mil, Alex_Franz@A.NL.CS.CMU.EDU, ALEWONTIN@bentley.edu, ahovig@uclink3.berkeley.edu, AGOUNTIS@hop.qgraph.com, adriaan@eb.com, ac41@solo.pipex.com, aak2@Ra.MsState.Edu, a-mikebi@microsoft.com, 100627.2502@compuserve.com, 0004103477@mcimail.com, igirisujin%mberry.demon.co.uk@punt2.demon.co.uk, jims%globalvillag.com@pmail.globalvillag.com, NANDU%anest4.anest.ufl.edu@nervm.nerdc.ufl.edu, PBOSTROM%VIENNA.LEGENT.COM@LEGENT.COM, BenefieM%buchananpo.mpc.af.mil@dbm1.mpc.af.mil, tronche@lri.fr, Norbert.Glaser@loria.fr, taipale@rotol.fi, jta@mofile.fi, ptk@akumiitti.fi, GONZALO@cicei.ulpgc.es, jh@icl.dk, adreyer@uni-paderborn.de, doemel@informatik.uni-frankfurt.de, nwoh@software-ag.de, ralf@egd.igd.fhg.de, stegmann@rzo2.sari.fh-wuerzburg.de, Andreas.Ley@rz.uni-karlsruhe.de, olafabbe@w250zrz.zrz.tu-berlin.d400.de, pannier@cs.tu-berlin.d400.de, bene@cs.tu-berlin.d400.de, jpellizz@sp055.cern.ch, goatcher@dxcern.cern.ch, casey@ptsun00.cern.ch, pam@sunbim.be, Nicolas.GEUSKENS@DG4.cec.be, m.koster@nexor.co.uk, /CN=robots-archive/@nexor.co.uk X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:289590:950616215955] Content-Identifier: ANNOUNCE: Web... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 16 Jun 1995 22:59:54 +0100; Alternate-Recipient: Allowed From: "Victor Parada G." 
<vparada@inf.utfsm.cl> Message-ID: <199506162158.AA19839@camahueto.inf.utfsm.cl> To: /CN=robots/@nexor.co.uk Subject: ANNOUNCE: WebCopy 0.97b now available. Status: RO Content-Length: 735 Hola mundo. I've just released a new version of WebCopy (0.97b), a command-line HTTP file retriever with recursive retrieval. It includes some new features: - better and more flexible code - proxy support - POST method The new on-line documentation is at the same location as ever: <URL:http://www.inf.utfsm.cl/~vparada/webcopy.html> I'd like some feedback about it, so that I can release version 1.0 as soon as possible. Bye... ++Vitoco -- Lic. Victor A. Parada __ __ Universidad Tecnica Ingenieria Civil en Informatica o-''))_____\\ Federico Santa Maria, mailto:vparada@inf.utfsm.cl "--__/ * * * ) Valparaiso, CHILE. http://www.inf.utfsm.cl/~vparada/ c_c__/-c____/ +56 32 626364 x431 :-) From /CN=robots-errors/@nexor.co.uk Fri Jun 23 15:42:02 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 23 Jun 1995 15:47:49 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 23 Jun 1995 15:42:02 +0100 Date: Fri, 23 Jun 1995 15:42:02 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:028450:950623144204] Content-Identifier: weblayers Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 23 Jun 1995 15:42:02 +0100; Alternate-Recipient: Allowed From: Loic Dachary <loic@afp.com> Message-ID: <199506231435.QAA29867@pinot.par.afp.com> To: /CN=robots/@nexor.co.uk Cc: hugues@afp.com, thy@univ-paris8.fr, zull@coplanet.fr, ps@shiraz.uplift.fr, jcc@france.sun.com Subject: weblayers Status: RO Content-Length: 1335 I sent a mail dated Tue May 23 10:12:26 +0200 1995 announcing that a robot to maintain an emacs-w3 cache directory in sync with the net was under construction. I have finally released it under the name weblayers. Here is an entry, in the style of http://web.nexor.co.uk/mak/doc/robots/active.html, that describes it: <hr> <h2>weblayers</h2> <a href="http://www.univ-paris8.fr/~loic/weblayers/">weblayers</a> is maintained by <a href="http://www.univ-paris8.fr/~loic/">Loic Dachary</a> <a href="mailto:loic@afp.com"><loic@afp.com></a>. <p> Its purpose is to validate, cache and maintain links. <p> The HTTP <code>User-agent</code> field is set to 'weblayers/0.0'. <p> The <a href="http://web.nexor.co.uk/users/mak/doc/robots/norobots.html"> Proposed Standard for Robot Exclusion</a> is supported. <p> It is a standalone program written in <a href="http://web.nexor.co.uk/public/perl/perl.html">Perl 5</a>. <p> It is designed to maintain the cache generated by the emacs <a href="http://www.cs.indiana.edu/elisp/w3/docs.html">w3 mode</a> (N*tscape replacement) and to support annotated documents (keep them in sync with the original document via diff/patch). <p> This information was last updated on Fri Jun 23 16:30:42 FRE 1995. <hr>
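The exclusion check mentioned in the entry above is simple to implement. Here is a minimal sketch in Perl 5 (the language weblayers itself is written in) of fetching a server's /robots.txt and honouring its Disallow rules -- this is not weblayers' actual code, the subroutine names are invented for illustration, and the LWP::Simple module is assumed to be available:

use LWP::Simple qw(get);

# Fetch /robots.txt once per server and collect the Disallow path
# prefixes that apply to the named robot (or to the wildcard "*").
# This is a simplified reading of the Proposed Standard for Robot
# Exclusion; blank-line record handling is omitted for brevity.
sub disallowed_paths {
    my ($host, $agent) = @_;
    my $txt = get("http://$host/robots.txt");
    return () unless defined $txt;       # no robots.txt: nothing excluded
    my (@paths, $applies);
    foreach my $line (split(/\n/, $txt)) {
        $line =~ s/#.*$//;               # strip comments
        if ($line =~ /^User-agent:\s*(\S+)/i) {
            $applies = ($1 eq '*' || lc($1) eq lc($agent));
        } elsif ($applies && $line =~ /^Disallow:\s*(\S+)/i) {
            push(@paths, $1);
        }
    }
    return @paths;
}

# A path may be fetched unless it begins with a disallowed prefix.
sub may_fetch {
    my ($path, @disallowed) = @_;
    foreach my $prefix (@disallowed) {
        return 0 if index($path, $prefix) == 0;
    }
    return 1;
}

Matching is by simple prefix on the path, so "Disallow: /private" covers both /private/ and /private.html.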
Cheers, Loic From /CN=robots-errors/@nexor.co.uk Fri Jun 23 18:16:49 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 23 Jun 1995 18:20:04 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 23 Jun 1995 18:16:49 +0100 Date: Fri, 23 Jun 1995 18:16:49 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:067900:950623171653] Content-Identifier: Re: weblayers Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 23 Jun 1995 18:16:49 +0100; Alternate-Recipient: Allowed From: Brian Joseph Starr <bstarr@monet.ICS.UCI.EDU> Message-ID: <9506231011.aa01907@paris.ics.uci.edu> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <199506231435.QAA29867@pinot.par.afp.com> Subject: Re: weblayers Status: RO Content-Length: 107 I wonder how I can get off this mailing list? I've tried unsubscribe, but it doesn't seem to work. Brian From /CN=robots-errors/@nexor.co.uk Sun Jun 25 13:54:23 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Sun, 25 Jun 1995 13:57:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Sun, 25 Jun 1995 13:54:23 +0100 Date: Sun, 25 Jun 1995 13:54:23 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:247000:950625125424] Content-Identifier: How to unsubs... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Sun, 25 Jun 1995 13:54:23 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"24698 Sun Jun 25 13:54:03 1995"@nexor.co.uk> To: Brian Joseph Starr <bstarr@monet.ICS.UCI.EDU> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9506231011.aa01907@paris.ics.uci.edu> Subject: How to unsubscribe (was Re: weblayers ) Status: RO Content-Length: 923 In message <9506231011.aa01907@paris.ics.uci.edu>, Brian Joseph Starr writes: > I wonder how I can get off this mailing list? Before anyone else asks: http://web.nexor.co.uk/users/mak/doc/robots/mailing-list.html: To unsubscribe, DO NOT send an unsubscribe message to robots@nexor.co.uk, but send a message to robots-request@nexor.co.uk with the words "unsubscribe", "stop" on separate lines in the body. If you have problems, mail the list owner <m.koster@nexor.co.uk>. > I've tried unsubscribe, but it doesn't seem to work. It doesn't work for you because you subscribed as bstarr@liege.ICS.UCI.EDU, not bstarr@monet.ICS.UCI.EDU, and apart from that you got stung by a mail routing problem... > Brian I have unsubscribed you. RSN these issues will get fixed. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Tue Jul 4 10:07:50 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 4 Jul 1995 10:14:27 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 4 Jul 1995 10:07:50 +0100 Date: Tue, 4 Jul 1995 10:07:50 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:065790:950704090803] Content-Identifier: web searches ...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 10:07:50 +0100; Alternate-Recipient: Allowed From: James Lick <jlick@shoreside.com> Message-ID: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com> To: /CN=robots/@nexor.co.uk Cc: wwwstaff@qrd.org, hawkeye@tcp.com Subject: web searches index under wrong hostname MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Content-Length: 4025 Hello Robot Gurus (plus cc folks), I've perused the standard sources for information on the problem I have and haven't come up with any solutions as yet. I am hoping that through this forum I may discover the information I need to fix this or, if that is not possible, to help bring about changes in the search engines that would make it possible. Basically, I have a web server which is on a machine known under several names. Due to this, various web search robots have indexed pages under pretty much all of the various names, so that some pages show under one host name, some under another, and some even under several hostnames. In general this is ugly, but not of too much concern. Most of the hostnames it is known by always point to the same host at all times. E.g., tcp.com and www.tcp.com and venice.tcp.com are always 128.95.44.29. However, I also do a mirror of the Queer Resources Directory, which has a hostname, www.qrd.org, which cycles randomly through the various servers the QRD is mirrored on. One time you may get tcp.com's address, other times the server in Israel, or the one in San Francisco. Now the big problem is that the web search engines are going and finding a server "www.qrd.org" which at the moment points to tcp.com, and go on to index all the pages on there, including all the pages which are not part of the QRD, but are other archives or personal pages, etc. Later on, someone executes a search for one of our non-QRD pages and finds a reference with a URL pointing to www.qrd.org, which they follow, and due to the luck of the draw they get the server in Iowa, which has never heard of that page. Needless to say, my users who are not in the QRD section are getting quite irked that they are being indexed under a name that only works maybe 10% of the time. I've tried out some things to see if I could make the server give out information that forces the client to pick up what I consider to be the "correct" hostname in the URL, which is "http://www.tcp.com/...". Unfortunately this does not seem to be possible in the current scheme of things. My first attempt was adding in "URI:" and "Location:" fields in the meta-headers. No go, the clients don't care about these unless they get a 3xx type response, i.e. a redirect or moved, etc. OK, so why not just send a redirect? Oops, can't do that. The part we care about changing is the host part, and that has been stripped off by the time the HTTP server gets it, so it doesn't know whether you used a "correct" hostname or not. You can't just do a global redirect either, since this will just loop. (Fortunately the client is smart enough to abort this loop.) The only progress I've made at all is finding the HTML BASE tag. Unfortunately it is only half a solution, and I'm not even sure if the web searchers interpret it correctly or at all. What the BASE tag does is specify the base URL to use when interpreting any relative URLs in the page.
For example, you can put in: <BASE HREF="http://www.tcp.com/"> Then, if there is a link further down the page such as <A HREF="~jlick/">Foo!</a>, clicking on it would load "http://www.tcp.com/~jlick/" no matter what hostname was originally used to get there. As mentioned, this is only a halfway solution, since it only affects links made from that page instead of the page itself. Another possibility is to split into two servers, using the cloned network interface method or separate machines (roughly as sketched below). Unfortunately in this case it is not possible, since this server is a guest on another's network, and getting another IP address is not possible at this time. Another possibility is to get robot admins to purge their server lists of "floating hosts" such as www.qrd.org and mirror only the actual hosts. This might be the best short-term solution, but I'm not clear on the feasibility of this. Thanks for the consideration of this problem and I look forward to any responses. --- James Lick -- jlick@tcp.com -- http://www.tcp.com/~jlick/ for more info ---
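The "two servers" option comes down to giving each hostname its own IP address, since an HTTP request of this era carries no hostname the server could dispatch on. A hypothetical sketch of such a split, written as an Apache-style virtual host configuration (the second address and the DocumentRoot paths are invented for illustration; 128.95.44.29 is the tcp.com address quoted above):

# One IP address per hostname: the server picks the document tree
# by the address the request arrived on, not by any name in the request.
<VirtualHost 128.95.44.29>
ServerName www.tcp.com
DocumentRoot /home/WWW
</VirtualHost>

<VirtualHost 192.0.2.17>
ServerName www.qrd.org
DocumentRoot /home/WWW/QRD
</VirtualHost>

With one address per name, a request arriving on the www.qrd.org address can only ever see QRD documents, so a robot cannot index a non-QRD page under that name.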
This is something I've been wondering about, and something that should probably be suggested for HTTP/1.1: It'd be nice if a preferred URL could be sent along, ie have a Redirect within a server, without an extra round trip (not unlike the If-modified-since). > Another possibility is to split to two servers, using the cloned > network interface method or seperate machines. I think that's the only technical option at the moment. > Another possibility is to get robot admins to purge their server > lists of "floating hosts" such as www.qrd.org and mirror only the actual > hosts. This might be the best short-term solution but I'm not clear of > the feasibility of this. As another short-term solution one could extend the /robots.txt to include a full URL. Then you could say: URL-Disallow: http://www.qrd.org/private/ URL-Disallow: http://www.tcp.com/qrd/ Of course you'd need some more logic to ensure that these rules are only applied to the IP address the /robots.txt came from, to prevent Microsoft disallowing Apple etc :-) I guess that needs to go on the wish-list. > Thanks for the consideration of this problem and I look forward to > any responses. An optimistic note for the future: I believe passing a full URL including the access and netloc parts is on the wishlist for HTTP/1.1 This would allow a server to be more precise about what URL's it serves and denies. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Tue Jul 4 14:27:09 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 4 Jul 1995 14:32:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 4 Jul 1995 14:27:09 +0100 Date: Tue, 4 Jul 1995 14:27:09 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:119720:950704132711] Content-Identifier: Re: web searc... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 14:27:09 +0100; Alternate-Recipient: Allowed From: " (Reinier Post)" <reinpost@win.tue.nl> Message-ID: <199507041325.PAA02474@wsinis10.win.tue.nl> To: " (Martijn Koster)" <m.koster@nexor.co.uk> Cc: jlick@shoreside.com, /CN=robots/@nexor.co.uk, wwwstaff@qrd.org, hawkeye@tcp.com In-Reply-To: <"8700 Tue Jul 4 11:44:48 1995"@nexor.co.uk> Subject: Re: web searches index under wrong hostname Reply-To: reinpost@win.tue.nl X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Status: RO Content-Length: 2056 You (Martijn Koster) write: > >In message <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>, J >ames Lick writes: > >> Basically, I have a web server which is on a machine known under >> several names. Due to this, various web search robots have indexed >> pages under pretty much all of the various names, so that some pages >> show under one host name, some under another, and some even under >> several hostnames. This scheme is nice, but it further corrodes the notion of a URL as a persistent (non-unique) identifier for a document. If a document was fetched successfully under a given URL, in my opinion it must be accessible there forever, unless the document itself expires. (The same problem arises with caching script results, and is technically solved with the Expires: header.) 
You fail to comply with this criterion, so I would prefer to regard it as an implementation failure on your side. If you don't want people or robots to retrieve certain documents under certain URLs, then don't serve them. So the preferred solution, in my opinion, is to teach the server to disallow requests for documents based on the server name used by the client. However, if I understand correctly, there is no way to extract this information from the connection; the client would need to send it explicitly. It already sends this information for the previous document, in the REFERER header; you need the information for the current document. The solution is a new header; if REFERER headers are set on redirects, you might find a working solution using redirects and the REFERER header, but it would not work with most clients. A better solution, in my opinion: abandon the scheme, and set up www.qrd.org to serve redirections only. It's slower, but you'll get the control you need. You can even set up multiple hosts to serve the redirections. -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From /CN=robots-errors/@nexor.co.uk Tue Jul 4 17:02:47 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 4 Jul 1995 17:06:20 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 4 Jul 1995 17:02:47 +0100 Date: Tue, 4 Jul 1995 17:02:47 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:150110:950704160249] Content-Identifier: Re: web searc... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 17:02:47 +0100; Alternate-Recipient: Allowed From: " (Tim Bray)" <tbray@opentext.com> Message-ID: <m0sTANi-0001lzC@giant.mindlink.net> To: /CN=robots/@nexor.co.uk Subject: Re: web searches index under wrong hostname X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Version 2.0.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 559 James Lick points out that it's hard for robots/indexers to work around duplicates. IP address aliases are one problem, but symlinks and foxy/modified httpd servers and so on all make it impossible in principle to do this. However, doesn't the BASE element provide a place to hang a solution to the problem? What we, the robot-floggers and indexers of the world, need to do is get on our high horse and shriek in the conferences and newsgroups and at the editor vendors and get them to use it. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From /CN=robots-errors/@nexor.co.uk Tue Jul 4 17:23:10 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 4 Jul 1995 17:32:02 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 4 Jul 1995 17:23:10 +0100 Date: Tue, 4 Jul 1995 17:23:10 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:153430:950704162312] Content-Identifier: Re: web searc...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 4 Jul 1995 17:23:10 +0100; Alternate-Recipient: Allowed From: Brian Pinkerton <bp@webcrawler.com> Message-ID: <9507041621.AA02352@webcrawler.com> To: James Lick <jlick@shoreside.com> Cc: /CN=robots/@nexor.co.uk, wwwstaff@qrd.org, hawkeye@tcp.com References: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com> Subject: Re: web searches index under wrong hostname Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Original-Received: by NeXT.Mailer (1.118.2) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 938 Martijn and Reinier are right -- there's currently no perfect solution to this problem. I'll second Martijn's wish for the ability (though not the requirement) to include full URLs in an HTTP request. There's just no good way to get URLs right all the time because they are issued from the beholder's perspective: if it works, it works! The most common request for a change to the WebCrawler index is a change of the hostname part of the URL: in the case where multiple names map to a single IP address, the WebCrawler is certain to get half the URLs wrong because it identifies servers by unique IP address. The best current solution I know is to make sure that if a URL works, it will always work, and to use virtual hosts (aka APB patches) where more than one hostname per physical server is desired. Virtual hosts are supported by Apache (see http://www.apache.org/), and can be hacked into NCSA httpd. bri From /CN=robots-errors/@nexor.co.uk Wed Jul 5 04:53:41 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 5 Jul 1995 04:57:07 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 5 Jul 1995 04:53:41 +0100 Date: Wed, 5 Jul 1995 04:53:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:235090:950705035342] Content-Identifier: Re: web searc... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 5 Jul 1995 04:53:41 +0100; Alternate-Recipient: Allowed From: " (James Burton)" <james@Snark.apana.org.au> Message-ID: <20ee5915.be6e0-james@Snark.apana.org.au> To: /CN=robots/@nexor.co.uk References: <Pine.SOL.3.91.950704011318.3264A-100000@mgm-grand.shoreside.com>, <jlick@shoreside.com> Subject: Re: web searches index under wrong hostname X-Mailer: //\\miga Electronic Mail (AmiElm 5.42) MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: quoted-printable Organization: Melbourne ArtWorks Status: RO Content-Length: 1926 > Hello Robot Gurus (plus cc folks), > > Basically, I have a web server which is on a machine known under > several names. Due to this, various web search robots have indexed pages > under pretty much all of the various names, so that some pages show under > one host name, some under another, and some even under several hostnames. > > In general this is ugly, but not of too much concern. Most of the > hostnames it is known by always point to the same host at all times. > E.g., tcp.com and www.tcp.com and venice.tcp.com are always 128.95.44.29. > However, I also do a mirror of the Queer Resources Directory, which has a > hostname, www.qrd.org, which cycles randomly through the various servers the > QRD is mirrored on. One time you may get tcp.com's address, other times > the server in Israel, or the one in San Francisco. [...]
> Thanks for the consideration of this problem and I look forward > to any responses. > > --- James Lick -- jlick@tcp.com -- http://www.tcp.com/~jlick/ for more info --- Call me thick, but why doesn't the following work? On all the possible servers of www.qrd.org, configure exactly which URLs are to be accepted. E.g. on the CERN server I can do (in /etc/httpd.conf) Pass /httpd-internal-icons/* /icons/* Pass /* /home/WWW/* Pass http:* Pass ftp:* Pass gopher:* Pass wais:* Pass news:* Now if I were to (on www.tcp.com) change this to Pass http://www.tcp.com/* /home/WWW/* Pass http://www.qrd.org/* /home/WWW/QRD/* and nothing else, then the indexing robot would never find any wrong URLs unless somebody has put a nasty absolute URL in a link. James -- James Burton | EMail: James@Snark.apana.org.au | Latrobe University WWW : http://www.cs.latrobe.edu.au/~burton/ | Melbourne, Australia From /CN=robots-errors/@nexor.co.uk Wed Jul 5 08:57:26 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 5 Jul 1995 09:06:47 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 5 Jul 1995 08:57:26 +0100 Date: Wed, 5 Jul 1995 08:57:26 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:262710:950705075728] Content-Identifier: Re: web searc... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 5 Jul 1995 08:57:26 +0100; Alternate-Recipient: Allowed From: " (Reinier Post)" <reinpost@win.tue.nl> Message-ID: <199507050756.JAA04177@wsinis10.win.tue.nl> To: james@Snark.apana.org.au Cc: /CN=robots/@nexor.co.uk In-Reply-To: <20ee5915.be6e0-james@Snark.apana.org.au> Subject: Re: web searches index under wrong hostname Reply-To: reinpost@win.tue.nl X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Status: RO Content-Length: 357 You (/CN=robots-errors/@nexor.co.uk) write: >Now if I were to (on www.tcp.com) change this to > >Pass http://www.tcp.com/* /home/WWW/* >Pass http://www.qrd.org/* /home/WWW/QRD/* Is this allowed at all? The problem is, the server has no way of knowing by what name it was called, www.tcp.com or www.qrd.org. -- Reinier Post reinpost@win.tue.nl From /CN=robots-errors/@nexor.co.uk Wed Jul 5 09:19:19 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 5 Jul 1995 09:32:30 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 5 Jul 1995 09:19:19 +0100 Date: Wed, 5 Jul 1995 09:19:19 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:267500:950705081931] Content-Identifier: Re: web searc... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 5 Jul 1995 09:19:19 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"26729 Wed Jul 5 09:18:16 1995"@nexor.co.uk> To: reinpost@win.tue.nl Cc: james@Snark.apana.org.au, /CN=robots/@nexor.co.uk In-Reply-To: <199507050756.JAA04177@wsinis10.win.tue.nl> Subject: Re: web searches index under wrong hostname Status: RO Content-Length: 706 In message <199507050756.JAA04177@wsinis10.win.tue.nl>, " (Reinier Post)" writes: > >Pass http://www.tcp.com/* /home/WWW/* > >Pass http://www.qrd.org/* /home/WWW/QRD/* > > Is this allowed at all?
I think the above configuration is probably for the proxy side of the CERN server; a proxy gets a full target URL, complete with access scheme and hostname, which you can configure access control on. > The problem is, the server has no way of knowing > by what name it was called, www.tcp.com or www.qrd.org. Indeed, unless they're separate IP addresses. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html From /CN=robots-errors/@nexor.co.uk Thu Jul 6 16:14:41 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 6 Jul 1995 16:18:54 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 6 Jul 1995 16:14:41 +0100 Date: Thu, 6 Jul 1995 16:14:41 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:297060:950706151443] Content-Identifier: How big is th... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 6 Jul 1995 16:14:41 +0100; Alternate-Recipient: Allowed From: Josef Pellizzari <jpellizz@afsmail.cern.ch> Message-ID: <Pine.A32.3.91.950706165934.23703D-100000@sp066.cern.ch> To: /CN=robots/@nexor.co.uk Subject: How big is the Web? X-Sender: jpellizz@sp066.cern.ch Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Content-Length: 678 I have three simple questions, to which only the Masters of the Robots may have the answers: * I need rough estimates of the current number of servers and URLs. * In case someone has the data that document the growth of the Web ready at hand, I would be happy to receive them. * Does anyone have an idea when a robot last traversed the whole Web? Thanks for your help! Josef ---------------------------------------------------------------- Josef PELLIZZARI tel : +41 22 767 9627 CN Division 31 2-013 fax : +41 22 767 7155 CERN mail: Josef.Pellizzari@cern.ch CH-1211 Geneve 23 ---------------------------------------------------------------- From /CN=robots-errors/@nexor.co.uk Thu Jul 6 16:37:06 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 6 Jul 1995 16:43:07 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 6 Jul 1995 16:37:06 +0100 Date: Thu, 6 Jul 1995 16:37:06 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:005060:950706153708] Content-Identifier: Re: How big i... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 6 Jul 1995 16:37:06 +0100; Alternate-Recipient: Allowed From: Chris Eborn <chris@dcs.kingston.ac.uk> Message-ID: <199507061535.QAA25785@kite.dcs.king.ac.uk> To: /CN=robots/@nexor.co.uk Subject: Re: How big is the Web? Status: RO Content-Length: 736 > > I have three simple questions, to which only the Masters of the Robots > may have the answers: > > * I need rough estimates of the current number of > servers and URLs. According to Lycos - around 4 million URLs on (at least) 23,550 HTTP servers (see the pages mentioned below). > > * In case someone has the data that document the growth of the Web > ready at hand, I would be happy to receive them. > > * Does anyone have an idea when a robot last traversed the > whole Web? > > Thanks for your help! > > Josef The Lycos project has produced some estimates of the size of the Web.
Try: http://lycos.cs.cmu.edu/lycos-websize.html or more generally: http://lycos.cs.cmu.edu/lycos-websize.html -chris From /CN=robots-errors/@nexor.co.uk Thu Jul 6 20:06:44 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 6 Jul 1995 20:09:40 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 6 Jul 1995 20:06:44 +0100 Date: Thu, 6 Jul 1995 20:06:44 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:024820:950706190645] Content-Identifier: Re: How big i... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 6 Jul 1995 20:06:44 +0100; Alternate-Recipient: Allowed From: Matthew Gray <mkgray@netgen.com> Message-ID: <Pine.OSF.3.91.950706145934.4743B-100000@thoth.netgen.com> To: Josef Pellizzari <jpellizz@afsmail.cern.ch> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <Pine.A32.3.91.950706165934.23703D-100000@sp066.cern.ch> Subject: Re: How big is the Web? Organization: net.Genesis Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Content-Length: 1052 > * I need rough estimates of the current number of > servers and URLs. More than 23,000 servers. > * In case someone has the data that document the growth of the Web > ready at hand, I would be happy to receive them. I have monthly data on the number of servers since June of 1993(*) and will announce it to the list when I make all the figures available. Below are figures for every 6 months. Feel free to redistribute these figures, but please keep the attribution of "Matthew Gray <mkgray@netgen.com> of net.Genesis Corp" with any graphs or representations of the data. Matthew Gray --------------------------------- voice: (617) 577-9800 net.Genesis fax: (617) 577-9850 56 Rogers St. mkgray@netgen.com Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray Growth of the web, number of sites over time: 6/93 130; 12/93 623; 6/94 2738; 12/94 10022; 6/95 23517. (*) Based on the results of my Wanderer, the first wandering web robot. From /CN=robots-errors/@nexor.co.uk Fri Jul 7 08:55:38 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 7 Jul 1995 09:00:46 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 7 Jul 1995 08:55:38 +0100 Date: Fri, 7 Jul 1995 08:55:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:194810:950707075539] Content-Identifier: Re: How big i... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 7 Jul 1995 08:55:38 +0100; Alternate-Recipient: Allowed From: Brian Pinkerton <bp@webcrawler.com> Message-ID: <9507070755.AA08058@webcrawler.com> To: /CN=robots/@nexor.co.uk References: <Pine.A32.3.91.950706165934.23703D-100000@sp066.cern.ch> Subject: Re: How big is the Web? Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Original-Received: by NeXT.Mailer (1.118.2) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 477 Hmmm. Those estimates [Lycos, net.Genesis] of the number of Web servers seem pretty low. Check out http://webcrawler.com/WebCrawler/WebSize.html for the WebCrawler's data. Our latest number is just shy of 39,000 unique HTTP servers (by IP address).
We can't really compete with Lycos on the number of total URLs, but if you take their figure for the average number of URLs per server and multiply by our 39K number, then you get something around 7M URLs. bri From /CN=robots-errors/@nexor.co.uk Fri Jul 7 13:27:07 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 7 Jul 1995 13:35:08 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 7 Jul 1995 13:27:07 +0100 Date: Fri, 7 Jul 1995 13:27:07 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:260960:950707122709] Content-Identifier: Re: How big i... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 7 Jul 1995 13:27:07 +0100; Alternate-Recipient: Allowed From: Billy Barron <billy@utdallas.edu> Message-ID: <199507071224.HAA29290@utdallas.edu> To: " (Brian Pinkerton)" <bp@webcrawler.com> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <9507070755.AA08058@webcrawler.com> Subject: Re: How big is the Web? X-WWW-Page: http://www.utdallas.edu/acc/billy.html X-Mailer: ELM [version 2.4 PL24] Content-Type: text Status: RO Content-Length: 1003 In reply to Brian Pinkerton's message: > >Hmmm. Those estimates [Lycos, net.Genesis] of the number of Web servers >seem pretty low. Check out > > http://webcrawler.com/WebCrawler/WebSize.html > >for the WebCrawler's data. Our latest number is just shy of 39,000 unique >HTTP servers (by IP address). We can't really compete with Lycos on the >number of total URLs, but if you take their figure for the average number of >URLs per server and multiply by our 39K number, then you get something >around 7M URLs. > IP address is not really an accurate measurement. First, some large sites (e.g. NCSA) use DNS shuffle records, so the IP address may change on repeated queries. Second, machines with multiple Ethernets may show up more than once. Apache is making this situation worse too. Even using the FQDN (eliminating aliases) suffers from this problem, and I don't see a good solution to it. -- Billy Barron, Network Services Manager, Univ of Texas at Dallas billy@utdallas.edu From /CN=robots-errors/@nexor.co.uk Fri Jul 7 13:54:36 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 7 Jul 1995 14:33:15 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 7 Jul 1995 13:54:36 +0100 Date: Fri, 7 Jul 1995 13:54:36 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:268020:950707125445] Content-Identifier: Lycos Answer:... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 7 Jul 1995 13:54:36 +0100; Alternate-Recipient: Allowed From: " (Michael Mauldin)" <mlm@fuzine.mt.cs.cmu.edu> Message-ID: <9507071248.AA19441@fuzine.mt.cs.cmu.edu> To: Brian Pinkerton <bp@webcrawler.com> Cc: /CN=robots/@nexor.co.uk Subject: Lycos Answer: 6.9 million URLs, 57k servers Original-Received: by NeXT Mailer (1.63) PP-warning: Illegal Received field on preceding line Status: RO Content-Length: 232 But that's a lower bound. Also, since we don't track servers by IP address, there may well be only 39k machines hosting those 57k servers. Our current count is 4.49 million URLs located and 1.015 million URLs downloaded. --Fuzzy
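The gap between these counts is the hostname/IP ambiguity itself. A hypothetical Perl sketch (not any of these robots' actual code) of the IP-based collapsing that WebCrawler-style counting implies, using hostnames taken from earlier in this archive; Billy Barron's caveats about round-robin DNS and multi-homed machines apply unchanged:

use Socket qw(inet_ntoa);

# Collapse a list of server hostnames by resolved IP address.
my @hosts = qw(www.tcp.com tcp.com venice.tcp.com www.qrd.org);
my %by_ip;
foreach my $host (@hosts) {
    my $packed = gethostbyname($host);   # undef if the lookup fails
    next unless defined $packed;
    push(@{ $by_ip{inet_ntoa($packed)} }, $host);
}
printf("%d names resolve to %d unique IP addresses\n",
       scalar(@hosts), scalar(keys %by_ip));

The three tcp.com names collapse to one address, but www.qrd.org counts against whichever mirror the lookup happens to return, which is why neither an IP-based nor a name-based census is exact.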
From /CN=robots-errors/@nexor.co.uk Thu Jul 13 13:14:03 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 13 Jul 1995 13:17:29 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 13 Jul 1995 13:14:03 +0100 Date: Thu, 13 Jul 1995 13:14:03 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:145380:950713121406] Content-Identifier: unsubscribe Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 13 Jul 1995 13:14:03 +0100; Alternate-Recipient: Allowed From: Chung.Kang.Tsen@OZARK.EDRC.CMU.EDU Message-ID: <"14528 Thu Jul 13 13:13:23 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: unsubscribe Status: RO Content-Length: 12 unsubscribe From /CN=robots-errors/@nexor.co.uk Thu Jul 13 19:13:02 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 13 Jul 1995 19:16:39 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 13 Jul 1995 19:13:02 +0100 Date: Thu, 13 Jul 1995 19:13:02 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:219710:950713181305] Content-Identifier: unsubscribe Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 13 Jul 1995 19:13:02 +0100; Alternate-Recipient: Allowed From: Peter A Schwartz <pschwar@world.std.com> Message-ID: <Pine.3.89.9507131025.A20327-0100000@world.std.com> To: /CN=robots/@nexor.co.uk Subject: unsubscribe Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Content-Length: 12 unsubscribe From /CN=robots-errors/@nexor.co.uk Fri Jul 14 18:10:10 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 14 Jul 1995 18:13:53 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 14 Jul 1995 18:10:10 +0100 Date: Fri, 14 Jul 1995 18:10:10 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:167900:950714171014] Content-Identifier: Opportunities... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 14 Jul 1995 18:10:10 +0100; Alternate-Recipient: Allowed From: John.R.R.Leavitt@NL.CS.CMU.EDU Message-ID: <"16777 Fri Jul 14 18:09:37 1995"@nexor.co.uk> To: /CN=robots/@nexor.co.uk Subject: Opportunities at Lycos, Inc. Status: RO Content-Length: 1574 [I hope this doesn't offend anyone. If it does, I apologize. Hi, Martijn!] Employment Opportunities Lycos, Inc. builds, licenses and serves the catalog of the Internet. We are currently seeking individuals for a variety of positions in an exciting but demanding startup environment: Technical Positions (Pittsburgh, PA) Highly skilled computer scientists and MIS professionals. Skills sought include experience programming at the networking and operating-system level (Solaris, Windows NT, OSF/1, SunOS, Windows 3.1, MacOS, ...), database management, Web Services (HTTP, HTML, CGI, Perl), software engineering and performance analysis. Technical positions are in the Pittsburgh area. Electronic resumes may be mailed to jobs@www.lycos.com. Physical resumes may be sent to: Lycos, Inc.
c/o Center for Machine Translation Carnegie Mellon University 4910 Forbes Avenue Cyert Hall 2nd floor Pittsburgh, PA 15213-3890 Sales and Marketing (Boston, MA) We have a variety of assignments available for net-savvy marketing professionals. Electronic resumes may be mailed to bdavis@www.lycos.com. Physical resumes may be sent to: Lycos, Inc. 187 Ballardvale St. Suite B110 Wilmington, MA 01887-7000 Technical Positions Sales and Marketing John R. R. Leavitt | Director, Product Development | Lycos, Inc. 412 268 7282 | jrrl@lycos.com | http://thule.mt.cs.cmu.edu:8001/jrrl/ Editor: Omphalos | Member: Pittsburgh Worldwrights Reading: All My Sins Remembered by Joe Haldeman From /CN=robots-errors/@nexor.co.uk Mon Jul 24 01:42:42 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Mon, 24 Jul 1995 01:46:56 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 24 Jul 1995 01:42:42 +0100 Date: Mon, 24 Jul 1995 01:42:42 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:136730:950724004244] Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Mon, 24 Jul 1995 01:42:42 +0100; Alternate-Recipient: Allowed From: " (Nick Arnett)" <narnett@Verity.COM> Message-ID: <ac3898f101021004f2a4@[192.187.143.12]> To: /CN=robots/@nexor.co.uk X-Sender: narnett@hawaii.verity.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 5452 FYI -- Here's the press release that we'll send out tomorrow morning regarding our spider. As you'll see, it not only follows "robots.txt"; by default it requires the presence of a "robots.txt" file. We thought this was a safer way to go in a commercial product. As it also says below (and I think this is particularly relevant to the people on this list), it's designed to index only one site at a time (and by default, only documents "below" the starting point in the directory tree), so that it won't go wandering off to discover new resources. That's not the intent of this product. It's designed for companies that have a number of servers for which they want to build a single index. I'll be happy to answer whatever questions people on this list might have. Nick VERITY ANNOUNCES REMOTE INDEXING CAPABILITIES FOR TOPIC(r) WEBSEARCHER Introducing the First Commercial Web "spider" MOUNTAIN VIEW, Calif. -- July 24, 1995 -- Verity, Inc., the leading developer of search and retrieval software for the enterprise and the Internet, today announced that it is enhancing its Topic WebSearcher product with remote indexing capabilities. This new Remote Web Indexer will be the first commercially available indexing robot, or "spider", for the Web, allowing customers to build a searchable full-text index of any Web site via the Internet. "This new indexer makes it easy for our customers to work with distributed information," said Philippe Courtot, Verity's chairman and CEO. "The value of the Web is multiplied by giving users the power to search information across the organization as well as external sources." Topic WebSearcher is a sophisticated search and retrieval tool that incorporates Verity's powerful Topic technology for concept-based search and relevancy-ranked results.
It is designed to work with Web servers via a gateway application, giving access to documents from the Web, file systems, databases and other repositories. It supports multiple formats, including the Web's HyperText Markup Language (HTML), Adobe Acrobat indexes (supported natively) and more than 50 standard word-processing formats. The new remote Web indexer follows the widely used "robots.txt" exclusion file convention, which allows Web server administrators to restrict or deny access to their servers. "Topic Remote Web Indexer makes it easy for us to build and maintain the tools that our researchers and others will use to find research collaborators, sponsors and technology licensees," said Jay Creutz, program manager at SAIC. "Verity's search tools are a powerful complement to the Web's browsing capabilities." SAIC, a large systems integrator, is using Topic WebSearcher and Topic Remote Web Indexer to create a powerful search facility for UC-ACCESS, an on-line system of databases that includes information about technologies, researchers and data from the nine campuses of the UC system and the UC Office of the President. "Our organization struggles with the labor-intensive task of collecting and organizing all the information we are responsible for so that anyone can find it whenever they require it," said Brent Allsop of the Support Technology Center at Hewlett Packard. "This is the tool that will make this task easy and automatic." Key features of the Remote Web Indexer include: * Automatic indexing of HyperText Markup Language and text files. * Built-in capture of fielded and zone information, such as titles, headlines, Web page modification dates, and Uniform Resource Locators, for more precise searching. * Observance of "Safeguard" default behaviors to ensure only authorized sites are indexed: it will only index if a "robots.txt" file is present, and it will not jump hypertext links between servers. Topic Remote Web Indexer is in beta test now at more than 10 customer sites. Topic WebSearcher version 1.1, which will include the Remote Web Indexer, is scheduled for release in August 1995 and is priced at $9,995. For a demonstration of the Topic search engine and databases built with Topic Remote Web Indexer, see Verity's Web Publishers Virtual Library at http://www.verity.com/library.html. Verity, Inc., headquartered in Mountain View, CA, develops and markets the Topic family of information retrieval tools and applications for publishing and disseminating information across the enterprise, the Internet and CD-ROM. The company's products and services are used by more than 650 corporations and organizations worldwide as well as by hundreds of development partners. Verity's Topic search engine is the engine of choice for Adobe Systems, Lotus Development Corporation, Netscape Communications, Quarterdeck Corporation, Frame Technology Corporation, Saros Corporation, PC Docs, Odesta Systems Corporation, Documentum, Inc., Restrac, and many other software developers. ### For more information contact Verity at info@verity.com, on the World Wide Web at http://www.verity.com/ or by calling 415/960-7600. Verity and TOPIC are registered trademarks of Verity, Inc. in the United States and other countries. Verity, Inc. is not related to the International Stock Exchange of the United Kingdom and the Republic of Ireland Limited, which provides computerized information under the name Topic. All other trademarks are the property of their respective holders. -- Verity Inc. narnett@verity.com <URL:http://www.verity.com/> (415) 960-7660
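Nick's "Safeguard" defaults reduce to two checks made before any fetch: the site must publish a /robots.txt at all, and a candidate link must stay on the starting host, "below" the starting path. A hedged sketch of that gate in Perl with libwww-perl (the helper names are illustrative, not Verity's code):

  #!/usr/bin/perl
  # Sketch of the "Safeguard" gate described in the press release:
  # index a site only if it serves /robots.txt, and never follow a
  # link that leaves the starting host or climbs above the start path.
  # Illustrative only; not Verity's implementation.
  use strict;
  use warnings;
  use LWP::UserAgent;
  use URI;

  my $ua = LWP::UserAgent->new(agent => 'example-indexer/0.1');

  sub site_allows_indexing {        # no robots.txt => refuse to index
      my ($base) = @_;
      return $ua->get(URI->new_abs('/robots.txt', $base))->is_success;
  }

  sub within_scope {                # stay "below" the starting point
      my ($start, $link) = @_;
      my $s = URI->new($start);
      my $c = URI->new_abs($link, $start);
      return 0 unless ($c->scheme || '') eq 'http';
      return 0 unless $c->host eq $s->host;       # no server-hopping
      return index($c->path, $s->path) == 0;      # path-prefix check
  }

Note the conservative default: an absent robots.txt means "do not index", the opposite of the usual reading of the exclusion convention, which is what makes this policy notable for a commercial product.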
From /CN=robots-errors/@nexor.co.uk Fri Jul 28 14:51:53 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 28 Jul 1995 14:56:17 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 28 Jul 1995 14:51:53 +0100 Date: Fri, 28 Jul 1995 14:51:53 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:124010:950728135155] Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 28 Jul 1995 14:51:53 +0100; Alternate-Recipient: Allowed From: Ivan_Lindenfeld@pcmailgw.ml.com Message-ID: <9506288069.AA806950406@pcmailgw.ml.com> To: /CN=robots/@nexor.co.uk Encoding: 3 Text Return-Receipt-To: Ivan_Lindenfeld@pcmailgw.ml.com Status: RO Content-Length: 30 unsubscribe
From /CN=robots-errors/@nexor.co.uk Wed Aug 2 14:41:27 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 2 Aug 1995 14:47:20 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 2 Aug 1995 14:41:27 +0100 Date: Wed, 2 Aug 1995 14:41:27 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:266080:950802134130] Content-Identifier: reindexing pa... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 2 Aug 1995 14:41:27 +0100; Alternate-Recipient: Allowed From: ".. G. Edward Johnson" <lorax@speckle.ncsl.nist.gov> Message-ID: <Pine.3.89.9508020937.B6175-0100000@speckle> To: /CN=robots/@nexor.co.uk Subject: reindexing pages. Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Content-Length: 392 I am interested to know what strategies web databases use for updating their indexes. If a page has changed and is reindexed, will it match both what used to be on the page and what is now on it, or just what is currently on the page? Also, on a related note, is there a way to remove a page from the index (if, for instance, it no longer exists or has moved)? Thanks. Edward.
From /CN=robots-errors/@nexor.co.uk Wed Aug 2 15:53:38 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 2 Aug 1995 16:03:24 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 2 Aug 1995 15:53:38 +0100 Date: Wed, 2 Aug 1995 15:53:38 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:287850:950802145353] Content-Identifier: Re: reindexin... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 2 Aug 1995 15:53:38 +0100; Alternate-Recipient: Allowed From: " (Tim Bray)" <tbray@opentext.com> Message-ID: <m0sdf7n-0003CUC@giant.mindlink.net> To: /CN=robots/@nexor.co.uk Subject: Re: reindexing pages. X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Version 2.0.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Status: RO Content-Length: 650
>I am interested to know what strategies web databases use for updating
>their indexes. If a page has changed and is reindexed, will it match both
>what used to be on the page and what is now on it, or just what is
>currently on the page? Also, on a related note, is there a way to remove
>a page from the index (if, for instance, it no longer exists or has moved)?
Ours (Open Text http://www.opentext.com:8080), when it detects that a page has changed, indexes only the new version. The various indexes all have their own methods/tools for requesting deletion or update. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
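The policy Tim describes (drop a page's old postings before adding new ones, so queries match only the current text) is simple to state in code. A toy in-memory sketch in Perl, not any particular engine's implementation; remove_page() also covers Edward's second question, pages that have vanished or moved:

  #!/usr/bin/perl
  # Toy inverted index illustrating "index only the new version":
  # on reindex, first delete every posting recorded for a URL, then
  # add postings for the current text.
  use strict;
  use warnings;

  my %postings;   # word => { url => 1 }
  my %words_for;  # url  => [ words last indexed for that url ]

  sub remove_page {               # e.g. on a 404 or a redirect
      my ($url) = @_;
      delete $postings{$_}{$url} for @{ $words_for{$url} || [] };
      delete $words_for{$url};
  }

  sub reindex_page {
      my ($url, $text) = @_;
      remove_page($url);          # forget the old version entirely
      my @words = map { lc } $text =~ /(\w+)/g;
      $postings{$_}{$url} = 1 for @words;
      $words_for{$url} = \@words;
  }

  reindex_page('http://example.com/a.html', 'old words');
  reindex_page('http://example.com/a.html', 'new text');
  print exists $postings{'old'}{'http://example.com/a.html'}
      ? "matches stale text\n" : "matches current text only\n";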
From /CN=robots-errors/@nexor.co.uk Wed Aug 2 18:23:45 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Wed, 2 Aug 1995 18:27:52 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Wed, 2 Aug 1995 18:23:45 +0100 Date: Wed, 2 Aug 1995 18:23:45 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:033570:950802172351] Content-Identifier: I need a robot Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Wed, 2 Aug 1995 18:23:45 +0100; Alternate-Recipient: Allowed From: " (Ryan Waldron)" <rew@CrystalData.COM> Message-ID: <m0sdhQp-000hAiC@cdsgw.CrystalData.COM> To: /CN=robots/@nexor.co.uk Subject: I need a robot X-Mailer: ELM [version 2.4 PL23] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Status: RO Content-Length: 1948 Hi, all. I have a client (well, actually a client's client, but it's effectively mine here) that wants a searchable index of the contents of about 10 sites. These sites are not big, they are all on a similar topic, and the 'bot doesn't need to wander off the sites following links to other hosts and documents. In short, this robot needs to grab a relatively small number of documents (compared to normal resource-discovery robots) and then make a searchable index of certain key words and phrases of interest to my client. They have been very specific that they want their own search facility, and are even willing to pay for us to set up a dedicated machine to do it, if need be. So I can't very well tell them, "Just go look at Harvest." They don't want the whole Web searched, just very specific searches from just these few sites. Now, I've never written a robot, though I've written lots of little utilities to grab this and that. I've read the robot exclusion standards, I've looked at the big robots' sites, and so on. I've done what I could to educate myself on how this works. I've grabbed MOMspider and libwww and started hacking away. But I'm having difficulty getting it to do exactly what I need, so I'm asking here, in the hope that someone can help me: is there anywhere a robot that already does this, for which I could get the source? I'd a whole lot rather use code written by someone who knows *exactly* what they're doing than risk my silly robot getting loose and making people mad at me. If not, I will do my utmost to make a very well-behaved robot out of the MOMspider and libwww stuff I have. -- Ryan Waldron ||| http://www.traveller.com/~rew ||| rew@traveller.com The Software Tailors (205) 232-2706 "Software that fits" Consulting & Contract programming Unix / Windows / DOS C / C++ / XVT / OWL / MFC / E-Mail / News / WWW / HTML
From /CN=robots-errors/@nexor.co.uk Thu Aug 3 09:06:06 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 3 Aug 1995 09:09:14 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 3 Aug 1995 09:06:06 +0100 Date: Thu, 3 Aug 1995 09:06:06 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:224390:950803080608] Content-Identifier: Re: I need a ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Aug 1995 09:06:06 +0100; Alternate-Recipient: Allowed From: Martijn Koster <m.koster@nexor.co.uk> Message-ID: <"22432 Thu Aug 3 09:05:41 1995"@nexor.co.uk> To: " (Ryan Waldron)" <rew@CrystalData.COM> Cc: /CN=robots/@nexor.co.uk In-Reply-To: <m0sdhQp-000hAiC@cdsgw.CrystalData.COM> Subject: Re: I need a robot Status: RO Content-Length: 1013 In message <m0sdhQp-000hAiC@cdsgw.CrystalData.COM>, " (Ryan Waldron)" writes:
> In short, this robot needs to grab a relatively small number of
> documents (compared to normal resource-discovery robots) and then make
> a searchable index of certain key words and phrases of interest to my
> client. They have been very specific that they want their own search
> facility, and are even willing to pay for us to set up a dedicated
> machine to do it, if need be. So I can't very well tell them, "Just go
> look at Harvest." They don't want the whole Web searched, just very
> specific searches from just these few sites.
Maybe I don't quite understand your requirement, but why can you not use Harvest for this purpose? You can configure it to stay within bounds, and once you have the data in SOIF format you can pretty much do anything you want with it. -- Martijn __________ Internet: m.koster@nexor.co.uk X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M WWW: http://web.nexor.co.uk/mak/mak.html
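Harvest aside, the shape of the constrained robot Ryan describes is small enough to sketch. A rough Perl outline under stated assumptions (LWP and URI from libwww-perl; index_page() is a hypothetical hook for whatever indexer is plugged in; regex link extraction is crude but serviceable for plain HTML of this vintage). It starts from a fixed list of sites, refuses to leave each starting host, and pauses between requests:

  #!/usr/bin/perl
  # Sketch of a deliberately constrained robot: crawl only a hand-picked
  # list of sites, never follow a link to another host, pause between
  # requests, and hand each fetched page to an indexer. Illustrative only.
  use strict;
  use warnings;
  use LWP::UserAgent;
  use URI;

  my $ua = LWP::UserAgent->new(agent => 'site-indexer/0.1');
  my %seen;

  for my $root (@ARGV) {                    # e.g. the ten client sites
      my $home = URI->new($root)->host;
      my @queue = ($root);
      while (defined(my $url = shift @queue)) {
          next if $seen{$url}++;
          sleep 1;                          # be polite to the server
          my $res = $ua->get($url);
          next unless $res->is_success;
          my $html = $res->decoded_content;
          index_page($url, $html);          # plug in your indexer here
          while ($html =~ /href\s*=\s*"([^"#]+)"/gi) {
              my $abs = URI->new_abs($1, $url);
              next unless ($abs->scheme || '') eq 'http';
              push @queue, $abs->as_string if $abs->host eq $home;
          }
      }
  }

  sub index_page { my ($url) = @_; print "indexed $url\n" }

A production version would also honour /robots.txt and cap the page count per site; the point here is only that the "stay within bounds" constraint is a one-line host comparison.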
From /CN=robots-errors/@nexor.co.uk Thu Aug 3 15:38:15 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Thu, 3 Aug 1995 15:45:15 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Thu, 3 Aug 1995 15:38:15 +0100 Date: Thu, 3 Aug 1995 15:38:15 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:295630:950803143816] Content-Identifier: Re: I need a ... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Thu, 3 Aug 1995 15:38:15 +0100; Alternate-Recipient: Allowed From: Tim Jung <tjung@i1.net> Message-ID: <Pine.BSD/.3.91.950803093113.638C-100000@mail1.i1.net> To: /CN=robots/@nexor.co.uk In-Reply-To: <"22432 Thu Aug 3 09:05:41 1995"@nexor.co.uk> Subject: Re: I need a robot MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Content-Length: 2006 On Thu, 3 Aug 1995, Martijn Koster wrote:
> In message <m0sdhQp-000hAiC@cdsgw.CrystalData.COM>, " (Ryan Waldron)" writes:
> > In short, this robot needs to grab a relatively small number of
> > documents (compared to normal resource-discovery robots) and then
> > make a searchable index of certain key words and phrases of interest
> > to my client. They have been very specific that they want their own
> > search facility, and are even willing to pay for us to set up a
> > dedicated machine to do it, if need be. So I can't very well tell
> > them, "Just go look at Harvest." They don't want the whole Web
> > searched, just very specific searches from just these few sites.
>
> Maybe I don't quite understand your requirement, but why can you not
> use Harvest for this purpose? You can configure it to stay within
> bounds, and once you have the data in SOIF format you can pretty much
> do anything you want with it.
Yes, this is true. I already sent him private email saying the same thing. Harvest is actually quite a nice package. It is one of the few indexing and search engines on the net that lets you define narrow search/retrieve parameters, so you are not trying to index the whole Internet but rather a small, well-defined section of it that you are interested in. It is also unusual in that you don't need to maintain a whole set of robots; you can instead share them with your neighbors, reducing the traffic load on the net. It also allows you to share your index databases, not just your robot information, so a site that didn't want to maintain its own set of indexing routines and qualifiers could simply get them from another site. All in all, I think this package will do more to reduce traffic on the net than anything else to come out in a long time.
From /CN=robots-errors/@nexor.co.uk Tue Aug 8 16:26:42 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 8 Aug 1995 16:30:33 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 8 Aug 1995 16:26:42 +0100 Date: Tue, 8 Aug 1995 16:26:42 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:205940:950808152643] Content-Identifier: Harvester for... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 8 Aug 1995 16:26:42 +0100; Alternate-Recipient: Allowed From: " (Bonnie Scott)" <bonnie@dev.prodigy.com> Message-ID: <199508081517.LAA31319@tinman.dev.prodigy.com> To: /CN=robots/@nexor.co.uk Subject: Harvester for AIX? X-Sender: bonnie@tinman.dev.prodigy.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Mailer: <Windows Eudora Version 2.0.2> Status: RO Content-Length: 345 I read at http://harvest.cs.colorado.edu/harvest/FAQ.html#platforms that someone had ported Harvester to AIX 3.2. Does anyone know where to get this or whom to write to? Do you think it would run under AIX 4.1? Also, what do you all think of NASA's MORE (extension of their RBSE project)? Thank you, Bonnie Scott Prodigy Services Company
From /CN=robots-errors/@nexor.co.uk Tue Aug 8 17:39:21 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 8 Aug 1995 17:45:57 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 8 Aug 1995 17:39:21 +0100 Date: Tue, 8 Aug 1995 17:39:21 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:223300:950808163923] Content-Identifier: Re: Harvester...
Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 8 Aug 1995 17:39:21 +0100; Alternate-Recipient: Allowed From: " (Mike Schwartz)" <mfs@cse.ogi.edu> Message-ID: <m0sfrVq-00000eC@latour.cse.ogi.edu> To: bonnie@dev.prodigy.com Cc: /CN=robots/@nexor.co.uk Subject: Re: Harvester for AIX? Status: RO Content-Length: 667
> Date: Tue, 8 Aug 1995 16:26:42 +0100
> From: " (Bonnie Scott)" <bonnie@dev.prodigy.com>
> To: /CN=robots/@nexor.co.uk
> Subject: Harvester for AIX?
>
> I read at
>
> http://harvest.cs.colorado.edu/harvest/FAQ.html#platforms
>
> that someone had ported Harvester to AIX 3.2. Does anyone know where to get
> this or whom to write to? Do you think it would run under AIX 4.1?
Bonnie, See ftp://ftp.cs.colorado.edu/pub/distribs/harvest/contrib/AIX-binaries/ More generally, if you would like technical support for Harvest, please see http://harvest.cs.colorado.edu/harvest/support-policy.html Finally, please note that it's "Harvest", not "Harvester" - Mike
From /CN=robots-errors/@nexor.co.uk Tue Aug 8 21:03:33 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Tue, 8 Aug 1995 21:06:55 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Tue, 8 Aug 1995 21:03:33 +0100 Date: Tue, 8 Aug 1995 21:03:33 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:252930:950808200334] Content-Identifier: Harvest on So... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Tue, 8 Aug 1995 21:03:33 +0100; Alternate-Recipient: Allowed From: " (Kelly Carney)" <kcarney@magellan.teq.stortek.com> Message-ID: <9508082001.AA04406@gomer.teq.stortek.com> To: /CN=robots/@nexor.co.uk Cc: Kelly_Carney@stortek.com Subject: Harvest on Solaris? X-Sun-Charset: US-ASCII Status: RO Content-Length: 629 While waiting for the new newsgroup for Harvest to come online (comp.infosystems.harvest), I thought I'd ask this forum a question... Is anyone RUNNING Harvest under Solaris? I've gotten binary as well as source distributions to execute under Solaris 2.4, but I've never been able to make it work correctly. It seems to get confused while looking through links to directories and eventually locks up my host. I know there is a new beta ready (1.3), but before going to the trouble of building it, I'd appreciate hearing if ANYONE has older versions working under Solaris 2.4. Much obliged, Kelly
From /CN=robots-errors/@nexor.co.uk Fri Aug 18 18:18:12 1995 Return-Path: </CN=robots-errors/@nexor.co.uk> Delivery-Date: Fri, 18 Aug 1995 18:21:30 +0100 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Fri, 18 Aug 1995 18:18:12 +0100 Date: Fri, 18 Aug 1995 18:18:12 +0100 X400-Originator: /CN=robots-errors/@nexor.co.uk X400-Recipients: non-disclosure:; X400-MTS-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:055900:950818171813] Content-Identifier: Prototype web... Priority: Non-Urgent DL-Expansion-History: /CN=robots/@nexor.co.uk ; Fri, 18 Aug 1995 18:18:12 +0100; Alternate-Recipient: Allowed From: Razzakul Haider Chowdhury <a94385@cs.ait.ac.th> Message-ID: <Pine.SUN.3.91.950819000052.14768A-100000@cs4.cs.ait.ac.th> To: /CN=robots/@nexor.co.uk Subject: Prototype web robot MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1237 As part of my thesis for M.Engg. in Computer Science, I have implemented a prototype web robot.
Using Perl and its rich libraries, the robot generates an index of HTML documents on web servers. The robot exclusion protocol is not implemented yet; it will be added before the robot is given a test run on the open net. A form-based search interface is available which provides a boolean (OR only) keyword search facility over the Html Index (HI) testbed; try keywords such as energy, environment and power sector. The URL is: http://www.cs.ait.ac.th/~a94385/pa.html Razzakul Haider Chowdhury
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
E-mail: a94385@cs.ait.ac.th Home Page: http://www.cs.ait.ac.th/~a94385/index.html
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Mailing Address:    | FAX Address:
Mail Box-171,       | Dormitory- R106B
AIT,                | Fax# (66-2) 524-2126 & 516-1418
G.P.O. Box-2754,    | Tel: (66-2) 524-5980
Bangkok-10501,      | "    524-6170 (8:00 to 12:00 pm BK)
THAILAND            | "    524-6171
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
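Since the prototype already uses Perl, the missing exclusion-protocol support is nearly free: libwww-perl ships WWW::RobotRules, which parses a /robots.txt file and answers per-URL queries. A minimal sketch, assuming a current libwww-perl; the host and page are taken from the message above purely as an example:

  #!/usr/bin/perl
  # Sketch: robot exclusion support via WWW::RobotRules (libwww-perl).
  # Fetch /robots.txt once per host, parse it, then ask before each GET.
  use strict;
  use warnings;
  use LWP::Simple qw(get);
  use WWW::RobotRules;

  my $rules = WWW::RobotRules->new('prototype-robot/0.1');

  my $robots_url = 'http://www.cs.ait.ac.th/robots.txt';
  my $txt = get($robots_url);
  $rules->parse($robots_url, $txt) if defined $txt;  # absent file => allow all

  for my $url ('http://www.cs.ait.ac.th/~a94385/pa.html') {
      print $rules->allowed($url) ? "may fetch $url\n" : "excluded: $url\n";
  }

Under the exclusion convention an absent or empty robots.txt permits everything, so the parse step is simply skipped when the fetch fails; contrast this with the stricter Verity default earlier in the archive.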