From owner-robots Thu Oct 12 14:39:19 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20349; Thu, 12 Oct 95 14:39:19 -0700 Message-Id: <9510122139.AA20341@webcrawler.com> To: robots Subject: The robots mailing list at WebCrawler From: Martijn Koster Date: Thu, 12 Oct 1995 14:39:19 -0700 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Welcome to our new home... This mailing list is now open for traffic. For details see: http://info.webcrawler.com/mailing-lists/robots/info.html -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Thu Oct 12 16:09:58 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25602; Thu, 12 Oct 95 16:09:58 -0700 Message-Id: Date: Thu, 12 Oct 95 16:09 PDT X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray Subject: Something that would be handy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com It might be nice to enhance robots.txt to include a hint as to how long the file ought to be cached by a Robot driver. People who don't understand why probably ought to ignore this message. People who do might want to suggest (a) reasons why this is a silly idea, (b) a syntax/method for doing it, or (c) any implementation difficulties that could ensue. 
My suggestion, expressed in the form of perl code that could be used to implement it:

    if (/^\s*CacheHint:\s+(\d+)\s*([dhm])\s*$/) {
        $SecondsToCache = $1;
        if ($2 eq 'd') { $SecondsToCache *= 60*60*24; }
        elsif ($2 eq 'h') { $SecondsToCache *= 60*60; }
        else { $SecondsToCache *= 60; }
    }

Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Fri Oct 13 18:03:54 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29927; Fri, 13 Oct 95 18:03:54 -0700 Message-Id: Date: Sat, 14 Oct 95 11:07:39 0000 From: James Organization: Tourist Radio Pty Ltd X-Mailer: Mozilla 1.1N (Macintosh; I; 68K) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Site Announcement X-Url: http://info.webcrawler.com/mailing-lists/robots/info.html Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com We wish to advise those with a robot seeking facility that we have two sites at http://www.com.au/aaa and http://www.world.net/touristradio We would be grateful if you would ask your robots to visit and announce our sites where possible. If this is bad net etiquette, we apologise; there are huge backlogs with manual services. James From owner-robots Mon Oct 16 08:25:16 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00957; Mon, 16 Oct 95 08:25:16 -0700 Message-Id: <9510161525.AA00951@webcrawler.com> To: robots Subject: Re: Site Announcement In-Reply-To: Your message of "Sat, 14 Oct 1995 11:07:39." Date: Mon, 16 Oct 1995 08:25:16 -0700 From: Martijn Koster Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, You've asked me to add a link.
The best way to get a link added to the WebCrawler is to submit it at http://www.webcrawler.com/WebCrawler/SubmitURLS.html Regards, -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Mon Oct 16 18:36:43 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29862; Mon, 16 Oct 95 18:36:43 -0700 Message-Id: Date: 16 Oct 1995 18:40:48 -0800 From: "Roger Dearnaley" Subject: How do I let spiders in? To: " " X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Is there any way currently supported of providing spiders access to our (soon to be launched) username & password authenticated site? (Of course if a customer follows a link generated by this spider search, they will be asked for authentication, but when they can't provide it we will redirect them to a Registration page.) The security on our site is not meant to be high: it is there primarily so that the forms CGI scripts have a unique user name to figure out who is doing what. Thus for our site we would probably be happy to just place a user name and password in robots.txt, or some similar low-security solution. However, I can see that for other sites this might not be acceptable, so spider maintainers might want to consider adding fields for the username and password to use to their 'Please index this URL' submission forms. Then, ideally, it should be possible to submit these forms securely.
--Roger Dearnaley (roger_dearnaley@intouchgroup.com) From owner-robots Wed Oct 18 08:32:24 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12938; Wed, 18 Oct 95 08:32:24 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 08:31:05 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Unfriendly robot at 205.177.10.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com One of my Web servers (http://asearch.mccmedia.com/ last night was attacked by a very unfriendly robot that requested many documents per second. This robot was originating from 205.177.10.2. I've tried to resolve that IP address, but I'm unable thus far. However, a traceroute shows that a cais.net router was the last hop before the domain in which the offending robot lives, so I sent an e-mail to the postmaster there, hoping that he or she will know whose host that is and will forward it (assuming that whoever owns this thing is a CAIS customer). Has anyone else encountered this one? It doesn't identify itself at all. Nick From owner-robots Wed Oct 18 08:58:47 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14082; Wed, 18 Oct 95 08:58:47 -0700 Message-Id: Date: Wed, 18 Oct 95 08:58 PDT X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray Subject: Re: Unfriendly robot at 205.177.10.2 Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 08:31 18/10/95 -0700, Nick Arnett wrote: >One of my Web servers (http://asearch.mccmedia.com/ last night was attacked >by a very unfriendly robot that requested many documents per second. This >robot was originating from 205.177.10.2. That resolves to 'murph.cais.net' - no idea who they are, never heard of 'em. 
- Tim From owner-robots Wed Oct 18 09:06:44 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14459; Wed, 18 Oct 95 09:06:44 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 09:05:20 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: CORRECTION -- Re: Unfriendly robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Whoops -- I pasted the wrong IP address into this message. The unfriendly robot was at 205.252.60.50. Nick From owner-robots Wed Oct 18 09:32:08 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15587; Wed, 18 Oct 95 09:32:08 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 09:30:32 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.177.10.2 Cc: tbray@opentext.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 8:58 AM 10/18/95, Tim Bray wrote: >That resolves to 'murph.cais.net' - no idea who they are, never heard >of 'em. As you may have seen in my correction, that was a mistake on my part. I copied that from the traceroute -- it's the last router before the address space in which the misbehaving robot lives. It is Capitol Area Internet Service and under the assumption that the owner of the robot is one of their customers, I sent a message to the CAIS postmaster. The correct address of the owner of the robot is 205.252.60.50, which won't resolve. Tight security, apparently. Ironically. 
Nick From owner-robots Wed Oct 18 09:43:26 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16066; Wed, 18 Oct 95 09:43:26 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510181643.RAA22167@wsinis11.win.tue.nl> Subject: Re: Unfriendly robot at 205.177.10.2 To: robots@webcrawler.com Date: Wed, 18 Oct 1995 17:42:55 +0100 (MET) In-Reply-To: from "Nick Arnett" at Oct 18, 95 08:31:05 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 921 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Nick Arnett) write: > >One of my Web servers (http://asearch.mccmedia.com/ last night was attacked >by a very unfriendly robot that requested many documents per second. This >robot was originating from 205.177.10.2. I've tried to resolve that IP >address, but I'm unable thus far. However, a traceroute shows that a >cais.net router was the last hop before the domain in which the offending >robot lives, so I sent an e-mail to the postmaster there, hoping that he or >she will know whose host that is and will forward it (assuming that whoever >owns this thing is a CAIS customer). Here you are:

    % host 205.177.10.2
    Name: murph.cais.net
    Address: 205.177.10.2
    Aliases:

>Has anyone else encountered this one? It doesn't identify itself at all. No accesses here from 205.177.10.2 or cais.net. >Nick -- Reinier Post reinpost@win.tue.nl a.k.a. me From owner-robots Wed Oct 18 11:32:15 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21768; Wed, 18 Oct 95 11:32:15 -0700 Message-Id: <9510181831.AA06646@ai.iit.nrc.ca> Date: Wed, 18 Oct 95 14:31:39 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Looking for a spider Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear spider developers. My name is Alain Desilets.
I am a researcher in the Interactive Information Group of the National Research Council of Canada. We are a small group (6 people) developing tools for interactive access to information. Our technological angle on this problem is AI-based approaches, in particular Machine Learning and Agents. You can find more about our work at http://ai.iit.nrc.ca/II_public/. In order to test our methods we need to acquire a large corpus of full HTML files from the Web. We plan to use a spider for that task. We are aware of the controversy surrounding the creation of new spiders and therefore do not plan to develop one. That would not only be a duplication of effort but would also introduce a new, possibly buggy spider into Koster's already vast list of Web critters. Instead, we would like to use a publicly available, well-behaved and proven spider. Is there such a spider available for serious research purposes? Or maybe the corpus we need already exists? Is there a CD-ROM or .zip file that would give us the whole of the web in full HTML? Thanks for your help.
Alain Desilets Institute for Information Technology National Research Council of Canada Building M-50 Montreal Road Ottawa (Ont) K1A 0R6 e-mail: alain@ai.iit.nrc.ca Tel: (613) 990-2813 Fax: (613) 952-7151 From owner-robots Wed Oct 18 12:28:54 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23934; Wed, 18 Oct 95 12:28:54 -0700 Date: Wed, 18 Oct 1995 15:34:04 -0400 Message-Id: <199510181934.PAA12177@maple.sover.net> X-Sender: Leigh.D.Dupee@neinfo.net X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Leigh.D.Dupee@neinfo.net (Leigh DeForest Dupee) Subject: Re: Unfriendly robot at 205.177.10.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com

    Query:All records (ALL):2.10.177.205.in-addr.arpa
    Authoritative Answer
    2.10.177.205.in-addr.arpa PTR murph.cais.net
    10.177.205.in-addr.arpa NS cais.com
    cais.com A 199.0.216.4
    Complete: 2.10.177.205.in-addr.arpa

    Query:All records (ALL):murph.cais.net
    Authoritative Answer
    Name does not exist
    Complete:NO_DATA murph.cais.net

Best I can come up with! >One of my Web servers (http://asearch.mccmedia.com/ last night was attacked >by a very unfriendly robot that requested many documents per second. This >robot was originating from 205.177.10.2. I've tried to resolve that IP >address, but I'm unable thus far. However, a traceroute shows that a >cais.net router was the last hop before the domain in which the offending >robot lives, so I sent an e-mail to the postmaster there, hoping that he or >she will know whose host that is and will forward it (assuming that whoever >owns this thing is a CAIS customer). > >Has anyone else encountered this one? It doesn't identify itself at all.
> >Nick > > > --------------------------------------------------------------- Leigh DeForest Dupee Help Me Learn, Inc., Administrator for NEInfo.Net South Stream Road RR3 Box 4203, Bennington, VT 05201 (802) 447-2905 --------------------------------------------------------------- From owner-robots Wed Oct 18 12:49:50 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24697; Wed, 18 Oct 95 12:49:50 -0700 Message-Id: <9510181951.AA08164@pluto.sybgate.sybase.com> X-Sender: dbakin@pluto X-Mailer: Windows Eudora Version 2.1.1 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 12:49:14 -0700 To: robots@webcrawler.com From: David Bakin Subject: Is it a robot or a link-updater? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com As the subject implies, I'm curious if there is a difference, in the impact on the serving site, between a true robot and someone running an automatic link updater? Can they even be told apart by the serving site? -- Dave -- Dave Bakin How much work would a work flow flow if a #include 415-872-1543 x5018 work flow could flow work? From owner-robots Wed Oct 18 13:16:38 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25902; Wed, 18 Oct 95 13:16:38 -0700 From: amonge@cs.ucsd.edu (Alvaro Monge) Message-Id: <9510182013.AA10642@dino> Subject: Re: Looking for a spider To: robots@webcrawler.com Date: Wed, 18 Oct 1995 13:13:55 -0700 (PDT) In-Reply-To: <9510181831.AA06646@ai.iit.nrc.ca> from "Alain Desilets" at Oct 18, 95 02:31:39 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1865 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A colleague of mine and I are also doing research which is AI based and are in need of a large corpus for our use. 
We would like to use anything that is already available which keeps the structure of the real WWW and does not take anything away. This is in order to create realistic experiments of our approaches. Thanks in advance for any pointers, --Alvaro Computer science and engineering department University of California, San Diego > > Dear spider developpers. > > > My name is Alain Desilets. I am a researcher in the Interactive > Information Group of the National Research Council of Canada. > > We are a small group (6 people) developing tools for interactive > access to information. Our technological angle on this problem is AI > based approaches, in particular Machine Learning and Agents. You can > find more about our work at http://ai.iit.nrc.ca/II_public/. > > In order to test our methods we need to acquire a large corpus of > full HTML files from the Web. We plan to use a spider for that task. > > We are aware of the controversy surrounding the creation of new > spiders and therefore do not plan to develop one. That > would not only be a duplication of effort but would also introduce a > new, possibly buggy spider in Koster's already vast list of Web > critters. Instead, we would like to use a publically available, well > behaved and proven spider. > > Is there such spider available for serious research purpose? > > Or maybe the corpus we need already exists? Is there a CD-ROM or .zip > file that would give us the whole of the web in full HTML? > > > Thanks for your help. 
> > Alain Desilets > > Institute for Information Technology > National Research Concil of Canada > Building M-50 > Montreal Road > Ottawa (Ont) > K1A 0R6 > > e-mail: alain@ai.iit.nrc.ca > Tel: (613) 990-2813 > Fax: (613) 952-7151 > > From owner-robots Wed Oct 18 14:13:35 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28102; Wed, 18 Oct 95 14:13:35 -0700 Message-Id: Date: 18 Oct 1995 15:13:44 -0700 From: "Xiaodong Zhang" Subject: Re: Looking for a spider To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reply to: RE>>Looking for a spider 7/24/95 - Frontier Technologies licenses Lycos Internet Catalog software MEQUON, WIS. (July 24) BUSINESS WIRE -July 24, 1995--Frontier Technologies Corp. today announced it has signed an agreement to license the Lycos(TM) Internet Catalog. The Lycos Catalog has been incorporated into Frontier Technologies" new SuperHighway Access product, called SuperHighway Access CyberSearch(TM), which allows users to perform a Lycos search offline via CD-ROM, connecting to the Internet only once relevant Internet resources have been identified. The Lycos technology was developed at Carnegie Mellon University, and was recently transferred to Lycos Inc., a newly-created subsidiary of CMG Information Services Inc. Lycos is a software system which contains a robot that searches the World Wide Web and catalogs the documents it finds. It also includes an information search engine that helps users access information quickly and easily when they type in key words or topics. The Lycos exploration robot locates new and changed documents and builds abstracts, which consist of title, headings, subheadings, 100 most significant words and the first 20 lines of the document. The catalog is continually updated by the Lycos exploration agent. 
Frontier will receive regular updates from Lycos Inc., allowing it to produce monthly issues of SuperHighway Access CyberSearch. "It's now widely understood that one of the primary barriers to users" productivity on the Internet is finding information," said Dennis Freeman, Frontier Technologies" marketing director. "That's why Internet search services like Lycos are among the Internet's most popular sites." "Lycos Inc. is pleased to partner with Frontier as they contribute to our continued position as the most widely used and most comprehensive catalog product on the Web," said Bob Davis, CEO of Lycos Inc. The product, now shipping, consists of a 608-megabyte subset of the Lycos catalog, indexing about half a million web pages, integrated with Frontier's multi-session, multi-protocol Internet browser software. The product is shipped on CD-ROM and is available through Frontier's reseller channel. The CD will be updated monthly (bi-monthly initially) Frontier is offering the first issue of CyberSearch at $14.95. A charter subscription for 6 issues is priced at $6.75 per month. Subscribers should call 1-800/879-0075 (+1-414/571-0190 outside the U.S.) or access Frontier's web server, http://www.frontiertech.com for further information. Lycos Inc., with offices in Wilmington, Mass. and Pittsburgh, Penn., is the newly formed corporation based upon technology developed at Carnegie Mellon University. Frontier Technologies Corp., based in Mequon, is a leading supplier of TCP/IP and Internet-based products that make businesses more competitive in a global market. CONTACT: Frontier Technologies Corp., Mequon Nicole Rogers, 414/241-4555 x293 or Lycos Inc. Mike Olfe, 508/657-5050 x3124 ------------------------------ Date: 10/18/95 3:01 PM To: Zhang, Xiaodong From: robots@webcrawler.com A colleague of mine and I are also doing research which is AI based and are in need of a large corpus for our use. 
We would like to use anything that is already available which keeps the structure of the real WWW and does not take anything away. This is in order to create realistic experiments of our approaches. Thanks in advance for any pointers, --Alvaro Computer science and engineering department University of California, San Diego > > Dear spider developpers. > > > My name is Alain Desilets. I am a researcher in the Interactive > Information Group of the National Research Council of Canada. > > We are a small group (6 people) developing tools for interactive > access to information. Our technological angle on this problem is AI > based approaches, in particular Machine Learning and Agents. You can > find more about our work at http://ai.iit.nrc.ca/II_public/. > > In order to test our methods we need to acquire a large corpus of > full HTML files from the Web. We plan to use a spider for that task. > > We are aware of the controversy surrounding the creation of new > spiders and therefore do not plan to develop one. That > would not only be a duplication of effort but would also introduce a > new, possibly buggy spider in Koster's already vast list of Web > critters. Instead, we would like to use a publically available, well > behaved and proven spider. > > Is there such spider available for serious research purpose? > > Or maybe the corpus we need already exists? Is there a CD-ROM or .zip > file that would give us the whole of the web in full HTML? > > > Thanks for your help. 
> > Alain Desilets > > Institute for Information Technology > National Research Concil of Canada > Building M-50 > Montreal Road > Ottawa (Ont) > K1A 0R6 > > e-mail: alain@ai.iit.nrc.ca > Tel: (613) 990-2813 > Fax: (613) 952-7151 > > ------------------ RFC822 Header Follows ------------------ Received: by zazu.softshell.com with SMTP;18 Oct 1995 14:59:25 -0700 Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25902; Wed, 18 Oct 95 13:16:38 -0700 From: amonge@cs.ucsd.edu (Alvaro Monge) Message-Id: <9510182013.AA10642@dino> Subject: Re: Looking for a spider To: robots@webcrawler.com Date: Wed, 18 Oct 1995 13:13:55 -0700 (PDT) In-Reply-To: <9510181831.AA06646@ai.iit.nrc.ca> from "Alain Desilets" at Oct 18, 95 02:31:39 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1865 Sender: owner-robots@webcrawler.com Precedence: bulk Reply-To: robots@webcrawler.com From owner-robots Wed Oct 18 14:55:47 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29718; Wed, 18 Oct 95 14:55:47 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 14:54:22 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Unfriendly robot owner identified! Cc: aleonard@well.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Got 'em. Using whois, I found that the IP address belongs to Library Corp. in Virginia. They're the providers of the "NlightN" search service at: http://www.nlightn.com/ Anybody know anything about their robot? I know that they've licensed the Lycos data. Their background information says, "NlightN, a division of The Library Corporation, was formed to develop and market a Universal Index to the world's electronically stored information." I guess their robot has to work fast to build a universal index... 
;-) Nick From owner-robots Wed Oct 18 15:19:02 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01014; Wed, 18 Oct 95 15:19:02 -0700 Date: Wed, 18 Oct 1995 15:18:53 -0700 (PDT) From: Andrew Leonard Subject: Re: Unfriendly robot owner identified! To: robots@webcrawler.com In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, all. I'm a reporter for Wired working on a story about bots, and I'm personally following up on this NlightN robot episode. I've put a call into their Reston VA headquarters asking to talk to someone about their search robot, and I'll keep the list posted on whatever I find out. Andrew Leonard Wired Magazine > Got 'em. > > Using whois, I found that the IP address belongs to Library Corp. in > Virginia. They're the providers of the "NlightN" search service at: > > http://www.nlightn.com/ > > Anybody know anything about their robot? I know that they've licensed the > Lycos data. > > Their background information says, "NlightN, a division of The Library > Corporation, was formed to develop and market a Universal Index to the > world's electronically stored information." > > I guess their robot has to work fast to build a universal index... 
;-) > > Nick > > > From owner-robots Wed Oct 18 15:38:57 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02009; Wed, 18 Oct 95 15:38:57 -0700 From: amonge@cs.ucsd.edu (Alvaro Monge) Message-Id: <9510182200.AA11857@dino> Subject: Re: Looking for a spider To: robots@webcrawler.com Date: Wed, 18 Oct 1995 15:00:01 -0700 (PDT) In-Reply-To: from "Xiaodong Zhang" at Oct 18, 95 03:13:44 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 555 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Unfortunately, I cannot use most robots that I know of because they DO NOT SAVE the entire document, or its hierarchical structure. Lycos for example: > The Lycos exploration robot locates new and changed documents and > builds abstracts, which consist of title, headings, subheadings, > 100 most significant words and the first 20 lines of the document. For my research, this is not that useful. I need the entire document, as it appears at the source -- not as saved by some robot, because I want to follow the links within the document. --Alvaro From owner-robots Wed Oct 18 16:19:20 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04137; Wed, 18 Oct 95 16:19:20 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 16:18:02 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Really fast searching Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com It's a bit off-topic, but I can't resist sharing something that one of our sharp-eyed engineers found in a certain company's information page about their search service: > By transparently linking hundreds of data sources, ******* has > created the world's largest integrated index, already comprised of > more than 100 gigabytes and growing daily. 
A proprietary database > engine provides immediate response time and actually increases speed > as the size of the index grows. We need this algorithm, our engineer says. It starts off with immediate responses, then gets faster. Wowza! ("A meeting on time travel will be held last week.") Nick From owner-robots Thu Oct 19 06:29:53 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16844; Thu, 19 Oct 95 06:29:53 -0700 Message-Id: <9510191329.AA12490@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 09:29:15 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear Alvaro, Thanks for responding. I'll let you know if I find something. I'm interested to know more about your work. Do you have a Web page on it? Thanks Alain From owner-robots Thu Oct 19 06:32:09 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17037; Thu, 19 Oct 95 06:32:09 -0700 Message-Id: <9510191331.AA12583@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 09:31:31 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear Zhang, Thank you for the info. Unfortunately, I am in the same position as Alvaro Monge. I need the original HTML files, as opposed to some condensed version of them produced by a robot. Alain From owner-robots Thu Oct 19 06:39:50 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17600; Thu, 19 Oct 95 06:39:50 -0700 Message-Id: <9510191339.AA12691@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 09:39:13 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Sorry! Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Sorry about the previous messages. I intended to send them directly to the people concerned but they somehow got sent to this list.
- Alain From owner-robots Thu Oct 19 07:53:29 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22425; Thu, 19 Oct 95 07:53:29 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510191453.PAA06141@wswiop11.win.tue.nl> Subject: Re: Unfriendly robot at 205.177.10.2 To: robots@webcrawler.com Date: Thu, 19 Oct 1995 15:53:11 +0100 (MET) Cc: tbray@opentext.com In-Reply-To: from "Nick Arnett" at Oct 18, 95 09:30:32 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 989 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The correct address of the owner of the robot is 205.252.60.50, which won't >resolve. Tight security, apparently. Ironically. Well, on our site (www.win.tue.nl), it's causing no problems at all:

    % grep '205\.252' /usr/www/logs/cern_access.log
    205.252.60.50 - - [13/Oct/1995:12:30:13 +0100] "GET / HTTP/1.0" 302 381
    205.252.60.50 - - [13/Oct/1995:20:58:55 +0100] "GET / HTTP/1.0" 302 381
    % wc /usr/www/logs/cern_access.log
    206422 2062250 22193056 /usr/www/logs/cern_access.log

That is, out of the last 206,422 requests, 2 were from this site. Lycos wants to index as many documents on a site as it can find. This robot has only made two requests, and it didn't even retrieve our home page (/ is redirected to /win/, which is the actual home page). Perhaps it doesn't follow redirections. >Nick -- Reinier Post reinpost@win.tue.nl a.k.a.
me From owner-robots Thu Oct 19 07:57:03 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22755; Thu, 19 Oct 95 07:57:03 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510191456.PAA06159@wswiop11.win.tue.nl> Subject: Re: Looking for a spider To: robots@webcrawler.com Date: Thu, 19 Oct 1995 15:56:40 +0100 (MET) In-Reply-To: <9510182200.AA11857@dino> from "Alvaro Monge" at Oct 18, 95 03:00:01 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1038 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Alvaro Monge) write: >Unfortunately, I cannot use most robots that I know of because they >DO NOT SAVE the entire document, or its hierarchical structure. > >Lycos for example: > >> The Lycos exploration robot locates new and changed documents and >> builds abstracts, which consist of title, headings, subheadings, >> 100 most significant words and the first 20 lines of the document. > >For my research, this is not that useful. I need the entire document, >as it appears at the source -- not as saved by some robot, because I >want to follow the links within the document. Lycos follows the links of documents; that's how robots work. The summaries are built for indexing purposes. You can't save the full text of all documents because of the disk space requirements (perhaps OpenText can?) and because of legal considerations. >--Alvaro -- Reinier Post reinpost@win.tue.nl a.k.a.
me

From owner-robots Thu Oct 19 08:44:31 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26046; Thu, 19 Oct 95 08:44:31 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 19 Oct 1995 08:41:05 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.252.60.50
Cc: tbray@opentext.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 7:53 AM 10/19/95, Reinier Post wrote:
>>The correct address of the owner of the robot is 205.252.60.50, which won't
>>resolve. Tight security, apparently. Ironically.
>
>Well, on our site (www.win.tue.nl), it's causing no problems at all

In my e-mail to NlightN, I said that I assume it was unintentional. I
can't imagine that anyone would purposely request documents at the rate
they were hitting us. Of course, there's no way to know if that was the
robot or a human-controlled browser hitting your site from the same host...

Thanks!

Nick

From owner-robots Thu Oct 19 09:10:34 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28065; Thu, 19 Oct 95 09:10:34 -0700
Message-Id: <9510191609.AA14728@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 12:09:49 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

In response to Alvaro's message,

>
> >
> >> The Lycos exploration robot locates new and changed documents and
> >> builds abstracts, which consist of title, headings, subheadings,
> >> 100 most significant words and the first 20 lines of the document.
> >
> >For my research, this is not that useful. I need the entire document,
> >as it appears at the source -- not as saved by some robot, because I
> >want to follow the links within the document.
Reinier Post writes:
>
> Lycos follows the links of documents; that's how robots work.
> The summaries are built for indexing purposes. You can't save
> the full text of all documents because of the disk space requirements
> (perhaps OpenText can?) and because of legal considerations.
>

Like Alvaro, I find that no robot-generated index of the whole web is
sufficient for my purpose. My group is working on developing new tools
that can process the web and "summarise" it in some novel way. For
example:

- New and hopefully better keyword extraction algorithms
- Automatic generation of hierarchical indexes a la Yahoo
- Merging of small indexes into bigger ones
- etc...

In order to test these new approaches, we need the full HTML, not an
index of it.

- Alain

From owner-robots Thu Oct 19 09:18:30 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28618; Thu, 19 Oct 95 09:18:30 -0700
Date: Fri, 20 Oct 1995 02:18:16 +1000
From: Murray Bent
Message-Id: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au>
To: robots@webcrawler.com
Subject: re: Lycos unfriendly robot
Content-Length: 439
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

According to Reinier Post:
>Lycos wants to index as many documents on a site it can find. This
>robot has only made two requests, and it didn't even retrieve our home page
>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't
>follow redirections.

>>Nick

>--
>Reinier Post reinpost@win.tue.nl

That may be fine if you have shares in Lycos or something. Do you?
mj From owner-robots Thu Oct 19 11:01:14 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05721; Thu, 19 Oct 95 11:01:14 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510191801.TAA19705@wsinis02.win.tue.nl> Subject: Re: Lycos unfriendly robot To: robots@webcrawler.com Date: Thu, 19 Oct 1995 19:01:00 +0100 (MET) In-Reply-To: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au> from "Murray Bent" at Oct 20, 95 02:18:16 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 918 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Murray Bent) write: > > >According to Reinier Post: >>Lycos wants to index as many documents on a site it can find. This >>robot has only made two requests, and it didn't even retrieve our home page >>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't >>follow redirections. > >>>Nick > >>-- >>Reinier Post reinpost@win.tue.nl > >That may be fine if you have shares in Lycos or something. Do you? I don't follow your logic. *What* is fine if I have shares in Lycos? The fact that this visit was made by something that doesn't follow redirections, and therefore is unlikely to be a Lycos robot? >mj For some reason you seem to bear a grudge against Lycos. If my posting did anything to tear open any old wounds, I apologise. -- Reinier Post reinpost@win.tue.nl a.k.a. 
me From owner-robots Sat Oct 21 07:17:11 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06960; Sat, 21 Oct 95 07:17:11 -0700 Date: Sat, 21 Oct 1995 07:17:03 -0700 (PDT) From: Andrew Leonard Subject: Re: Unfriendly robot at 205.252.60.50 To: robots@webcrawler.com Cc: robots@webcrawler.com, tbray@opentext.com In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I contacted NlightN, and their CEO said that their most junior hire was testing a new robot. They were apparently unaware of the robot exclusion protocol but plan to mend their ways. Andrew Leonard Wired Magazine On Thu, 19 Oct 1995, Nick Arnett wrote: > At 7:53 AM 10/19/95, Reinier Post wrote: > >>The correct address of the owner of the robot is 205.252.60.50, which won't > >>resolve. Tight security, apparently. Ironically. > > > >Well, on our site (www.win.tue.nl), it's causing no problems at all > > In my e-mail to NlightN, I said that I assume it was unintentional. I > can't imagine that anyone would purposely request documents at the rate > they were hitting us. Of course, there's no way to know if that was the > robot or a human-controlled browser hitting your site from the same host... > > Thanks! > > Nick > > > From owner-robots Sat Oct 21 11:21:18 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23944; Sat, 21 Oct 95 11:21:18 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 21 Oct 1995 10:35:40 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Cc: robots@webcrawler.com, tbray@opentext.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 7:17 AM 10/21/95, Andrew Leonard wrote: >I contacted NlightN, and their CEO said that their most junior hire was >testing a new robot. 
They were apparently unaware of the robot exclusion >protocol but plan to mend their ways. I haven't heard from them, but our server/spider product manager received a telephone apology. I can't resist pointing out the irony of a search services company that apparently failed to find some critical information about robots on the Internet. On the other hand, we've probably done equally silly things. I hope they'll add a user-agent field, at least. Nick From owner-robots Sat Oct 21 17:47:17 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20668; Sat, 21 Oct 95 17:47:17 -0700 Message-Id: From: kimba@snog.it.com.au (Kim Davies) Subject: Re: Unfriendly robot at 205.252.60.50 To: robots@webcrawler.com Date: Sun, 22 Oct 1995 08:46:39 +0800 (WST) In-Reply-To: from "Nick Arnett" at Oct 21, 95 10:35:40 am X-Mailer: ELM [version 2.4 PL24 PGP2] Content-Type: text Content-Length: 554 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, > >I contacted NlightN, and their CEO said that their most junior hire was > >testing a new robot. They were apparently unaware of the robot exclusion > >protocol but plan to mend their ways. > > I haven't heard from them, but our server/spider product manager received a > telephone apology. Has someone invited them to join this list? If they discussed what they were doing it might be better for all concerned.. 
catchya, -- Kim Davies | "Belief is the death of intelligence" -Snog kimba@it.com.au | http://www.it.com.au/~kimba/ From owner-robots Sun Oct 22 13:14:28 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01215; Sun, 22 Oct 95 13:14:28 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 22 Oct 1995 13:13:12 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 5:46 PM 10/21/95, Kim Davies wrote: >Hi, > >> >I contacted NlightN, and their CEO said that their most junior hire was >> >testing a new robot. They were apparently unaware of the robot exclusion >> >protocol but plan to mend their ways. >> >> I haven't heard from them, but our server/spider product manager received a >> telephone apology. > >Has someone invited them to join this list? If they discussed what they >were doing it might be better for all concerned.. I directed them to the robots pages on www.webcrawler.com, which should lead them to this list. What am I thinking -- the server that they were hammering with their robot includes recent messages from this list (at http://asearch.mccmedia.com/robots/). I suppose that means they might have looked... Nick From owner-robots Mon Oct 23 07:50:14 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03859; Mon, 23 Oct 95 07:50:14 -0700 Date: Mon, 23 Oct 95 10:50:03 EDT From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler) Message-Id: <9510231450.AA10394@pixel.cps.msu.edu> To: robots@webcrawler.com Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Alain Desilets writes: > In order to test our methods we need to acquire a large corpus of > full HTML files from the Web. We plan to use a spider for that task. 
> and Alvaro Monge writes: > A colleague of mine and I are also doing research which is AI based > and are in need of a large corpus for our use. We would like to use > anything that is already available which keeps the structure of the > real WWW and does not take anything away. This is in order to create > realistic experiments of our approaches. > We are also doing research on AI based approaches to processing the web, and toward the goal of having a test bed of the web, we have a text-only copy of a subset of the web (currently about 650 meg) which we have been calling "the proving grounds". It is not possible to get a complete snapshot of the web at any given time, but without images and audio, we can at least have a large, known, subset. It's also to our collective advantage to all be working from the same subset. It is our intention to make the proving grounds available to the public, hopefully within the next two weeks. We used a spider which was a modified htmlgobble, which takes a URL and follows all the links, copying all the documents it finds except image, audio, and video files. The urls inside the documents have been modified so that everything points to the local copy, enabling a spider (or human browser) to traverse the database locally. Before we go public, I have a few questions: (1) We currently don't copy audio, video, image files and instead create a file by the same name with a single character identifying it as video, image, or audio. Would an empty file suffice? Is there another identification scheme that would be more useful? (2) We currently copy postscript, but are considering treating them as we do image files. They take a LOT of space, and are of no utility for the kind of analysis that we want to do. Would it be more useful to keep the postscript, or treat it as we do images (which would then allow us to use the space for a larger web subset)? I appreciate any feedback and I'll announce to the list when it's ready for public use. 
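
The mirroring scheme described above (copy documents, repoint their URLs at the local copy, and leave a one-character placeholder for media files) might be sketched roughly as follows. This is only an illustration, not the modified htmlgobble: the function names, the `mirror` directory, the extension table, and the marker characters `I`, `A`, and `V` are all assumptions, since the message does not say which characters are used.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

# Hypothetical marker characters; the original scheme's actual characters are not stated.
MEDIA_MARKERS = {
    ".gif": "I", ".jpg": "I", ".jpeg": "I",   # image
    ".au": "A", ".wav": "A",                  # audio
    ".mpg": "V", ".mov": "V",                 # video
}


def local_path(url, root="mirror"):
    """Map a URL to a file path under the local mirror directory."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # assumed name for directory URLs
    return posixpath.join(root, parts.netloc, path.lstrip("/"))


def placeholder_for(url):
    """Return the one-character marker for a media URL, or None for documents."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    return MEDIA_MARKERS.get(ext)


def rewrite_link(base_url, href, root="mirror"):
    """Resolve href against its page and return a relative path to the local copy."""
    target = local_path(urljoin(base_url, href), root)
    here = posixpath.dirname(local_path(base_url, root))
    return posixpath.relpath(target, here)


if __name__ == "__main__":
    page = "http://a.example/docs/page.html"
    print(local_path(page))
    print(rewrite_link(page, "../img/logo.gif"))
    print(placeholder_for("http://a.example/img/logo.gif"))
```

Writing relative paths keeps the copy browsable wherever it is unpacked, which seems to match the goal of letting a spider or a human traverse the database locally.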
Marilyn Wulfekuhler Intelligent Systems Lab, Michigan State University From owner-robots Mon Oct 23 15:27:34 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05597; Mon, 23 Oct 95 15:27:34 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 23 Oct 1995 15:26:16 -0700 To: Andrew Daviel , robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Proposed URLs that robots should search Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 1:51 PM 10/23/95, Andrew Daviel wrote: >With my other hat on (admin@vancouver-webpages.com), I'm >trying to build a database of URLs and other information for businesses >on the Net. I can't quite contain the urge to say, "Isn't everyone?" >Some database registration robots (I believe) search submitted URLs for >keywords, doing some natural language processing to discard modifiers and >prepositions. However, the trend to graphics-dominated homepages makes >such efforts of dubious utility. I wouldn't be so quick to jump to that conclusion. I have seen few, if any, business sites that don't offer text-only versions of their key pages. Also, I'm utterly certain that a good relevancy-ranking engine will do a better job at assigning categories than will an uncontrolled set of people, especially when those people are out to maximize hits, rather than to maximize relevancy. Having said all of that, I'd like to agree that we need some additional information for robots. Could we start simply by having a standard way to set forth the name of the site? An icon for the site would be really nice. It's very frustrating to build a search results list and have no definitive way of describing the site on which the documents reside! Next, I'd like to have the means to name groups of documents (Press releases, product descriptions, as examples of typical business groupings). 
We guess at these from directory names, but that's very haphazard. The
secondary naming problem is more difficult because there are many-to-many
relationships involved.

>In the spirit of /robots.txt, I would like to propose a set of files that
>robots would be encouraged to visit:
>
>/robots.htm - an HTML list of links that robots are encouraged to traverse

What does "encouraged" mean? How is it different from (not (robots.txt))?

Why HTML?

>/descript.txt - a text file describing what the site (or directory) is
> all about

Agreed.

>/keywords.txt - a text file with comma-delimited keywords relevant to the
> site (or directory)

Disagree greatly. This opens a giant can of worms. Keywords are never
enough, often confusing and difficult to maintain.

>/linecard.txt - for commercial sites, a text file with comma-delimited
> line items (brands) manufactured or stocked

This will drown in details.

>/sitedata.txt - a text file similar to the InterNIC submissions forms,
> with publicly-available site data such as
>
>Organization: organisation name
>Type: commercial/non/profit/educational etc.
>Admin: email of administration
>Webmaster: email of Web administration
>Postal: postal address
>ZIP: ZIP/postcode
>Country:
>Position: Lat/Long
>etc.

Yes to some of this at least. But there's an assumption that there's a
one-to-one relationship between the server and these field data. Often
there isn't, and no scheme that fails to deal with that is going to succeed.

I'm ready to adapt one of my prototype robots to parse this data for our
engine, so here's one hand up for "Yes, I'll implement it." I'm just doing
research, but my research does fall in front of our engineers at some point.

By the way, today, Verity announced that NetManage and Purveyor have signed
up to use our search engine. They join Netscape, Quarterdeck and a few
others.

Nick

P.S. I've replied to the new list server address at webcrawler.com, rather
than the Nexor address.
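
For illustration only, a robot might read the proposed /sitedata.txt along these lines. Nothing here is a settled format: the one `Field: value` pair per line layout is inferred from the field list quoted above, and the function name `parse_sitedata`, the `#` comment convention, and the sample values are assumptions. Repeated fields are collected into lists, as one possible way around the one-to-one assumption raised in the reply.

```python
from collections import defaultdict


def parse_sitedata(text):
    """Parse 'Field: value' lines into {field: [values]}, keeping repeated fields."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # '#' comments are an assumption
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Field: value' line; ignore it
        fields[name.strip().lower()].append(value.strip())
    return dict(fields)


# Hypothetical sample data, not taken from any real site.
SAMPLE = """\
Organization: Example Org
Type: commercial
Webmaster: webmaster@west.example
Webmaster: webmaster@east.example
Country: CA
Position: 49D14.7N 123D13.6W
"""

if __name__ == "__main__":
    for name, values in parse_sitedata(SAMPLE).items():
        print(name, values)
```

Splitting on the first colon only means values that themselves contain colons, such as URLs, survive intact.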
From owner-robots Mon Oct 23 16:31:22 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10352; Mon, 23 Oct 95 16:31:22 -0700
Message-Id: <9510232331.AA10338@webcrawler.com>
To: robots
Cc: Andrew Daviel
Subject: Re: Proposed URLs that robots should search
In-Reply-To: Your message of "Mon, 23 Oct 1995 15:26:16 PDT."
Date: Mon, 23 Oct 1995 16:31:17 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

In message , Nick Arnett writes:

> Also, I'm utterly certain that a good relevancy-ranking engine will do a
> better job at assigning categories than will an uncontrolled set of people,
> especially when those people are out to maximize hits, rather than to
> maximize relevancy.

Yeah, isn't that fun... :-/
Maybe we should have a shared spammer blacklist :-)

> [want the name of the site]
> [groups of documents]

> >In the spirit of /robots.txt, I would like to propose a set of files that
> >robots would be encouraged to visit:
> >
> >/robots.htm - an HTML list of links that robots are encouraged to traverse
>
> What does "encouraged" mean? How is it different from (not (robots.txt))?

Because a robot may not want to traverse the whole site, and would
prefer to get "sensible" pages.

> Why HTML?

Yeah, bad news.

> [/keywords]
> Disagree greatly. This opens a giant can of worms. Keywords are never
> enough, often confusing and difficult to maintain.

Hmmm... yes, but it's not necessarily worse than straight HTML text,
which is the alternative.

> >/linecard.txt - for commercial sites, a text file with comma-delimited
> > line items (brands) manufactured or stocked
>
> This will drown in details.

Yup.

> >/sitedata.txt - a text file similar to the InterNIC submissions forms,
> > with publicly-available site data such as
> >
> Yes to some of this at least. But there's an assumption that there's a
> one-to-one relationship between the server and these field data.
Often,
> there isn't and no scheme that fails to deal with that is going to succeed.

Well, I hate to repeat myself, but ALIWEB's /site.idx will give you
all of the above (OK, not the icon, but you could add that). It doesn't
seem to scale too well to large sites that want to describe every single
page or resource on their server, but that's not the goal here...

Note also that nobody is stopping you from pulling just the URLs from a
site.idx, and doing your standard robot summarising on that...

-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Mon Oct 23 17:06:25 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12787; Mon, 23 Oct 95 17:06:25 -0700
Message-Id:
From: kimba@snog.it.com.au (Kim Davies)
Subject: Re: Proposed URLs that robots should search
To: andrew@andrew.triumf.ca (Andrew Daviel)
Date: Tue, 24 Oct 1995 08:03:58 +0800 (WST)
Cc: robots@webcrawler.com
In-Reply-To: from "Andrew Daviel" at Oct 23, 95 09:51:17 pm
X-Mailer: ELM [version 2.4 PL24 PGP2]
Content-Type: text
Content-Length: 1378
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi,

> /robots.htm - an HTML list of links that robots are encouraged to traverse

A plain text file would be much better suited, similar to the existing
robots.txt - reading in plain text and adding it to the stack of URLs
to be processed is sure to be more effective than sending the HTML to
the robot reasoning engine to parse.

> [snip]
>
> Organization: organisation name
> Type: commercial/non/profit/educational etc.
> Admin: email of administration
> Webmaster: email of Web administration
> Postal: postal address
> ZIP: ZIP/postcode
> Country:
> Position: Lat/Long
> etc.

How are you going to get a system administrator to implement all these
files? How many of the system administrators you know even know about
robots.txt?
Assuming you want a large chunk of sites to adopt these details, I'd
propose it be implemented into the HTTP protocol somehow. An "ADMIN"
request, for example, could request the above details from the site,
just as "/admin" on IRC grabs the admin details of a server from the
lines in its configuration. If a space was made in a server's
configuration or makefile for these details, web administrators would
be far more likely to fill them in.

catchya,
--
Kim Davies      | "Belief is the death of intelligence" -Snog
kimba@it.com.au | http://www.it.com.au/~kimba/

From owner-robots Tue Oct 24 02:48:24 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14601; Tue, 24 Oct 95 02:48:24 -0700
Date: Tue, 24 Oct 1995 02:48:19 -0700 (PDT)
From: Andrew Daviel
To: robots@webcrawler.com
Subject: Re: Proposed URLs that robots should search
In-Reply-To:
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Let's see if I can reply to everyone without getting in a tangle ... :)=

>>I'm trying to build a database of URLs for business...
>I can't quite contain the urge to say, "Isn't everyone?"

Know any good ones? Nothing jumped out at me from CUSI, or Submit-It, etc.

>I have seen few .. business sites that don't offer text-only versions

I seem to keep seeing sites that say "Works best with Netscape 1.2 - get it!"

>Could we start .. standard way to set forth the name of the site?

Having it in the <TITLE> of the document root is quite common, but you get
"BloggCo Home Page", "Welcome to BloggCo", and sometimes
"Welcome to B L O G G C O". I've tried looking for non-dictionary words
with some success.

>>/linecard.txt - for commercial sites, a text file with comma-delimited
>> line items (brands) manufactured or stocked
>This will drown in details.
>Yup.

This was a suggestion from a professional buyer.
Sure, collecting these for the whole world would get out of control, but with a small enough scope it might be manageable. The buyers look up brand names in a huge 12-volume book to find distributors or manufacturers. Finding who stocks Motorcraft in Tipperary can't produce that many records. >Well, I hate to repeat myself, but ALIWEB's /site.idx will give you .. Didn't know about it. Looks like what I was thinking of. I see it has keywords ( >..Disagree greatly. This opens a giant can ... ) > >/robots.htm - an HTML list of links > Why HTML? A simplistic idea. I figured that if existing robots are written to traverse HTML, then giving them an HTML file to start from would be fairly easy. Re. site.idx, is this a fairly open-ended list of fields? I had in mind some fields relevant to larger businesses, like Sales-Email, Info-Email, Tech-Email, Sales-FaxBack, etc. etc. for voice, fax, email where some places may have separate hotlines for hardware, software, licenses, etc. How to handle this for big concerns that have one website and hundreds of regional offices is another problem. I find the Lat/Long format in IAFA a bit strange; I use the "standard" navigational format from navigation books, GPS and Loran, etc. eg. 49D14.7N 123D13.6W, except that as there isn't a degree symbol in ASCII I've used "D", which makes it similar to the NMEA0182 format. The current NMEA0183 standard for navigation equipment would use something like: $LCGLL,4001.74,N,07409.43,W for 40 degrees 1.74 minutes North, 74 degrees 9.43 minutes West. Anyway, it's just bits and easy enough to convert. >How are you going to get a system administrator to implement all these >files? Well, one might assume that a good many HTML authors and Webmasters read comp.infosystems.author.html, or whatever it's called. Or one could just send them all mail ... 50,000 returned mail messages wouldn't make too much of a dent in my disk ... :)= >I'd propose it be implemented into the HTTP protocol .. 
I'd think it might take a while for everyone to update their servers - say, at least 2 years... Andrew Daviel email: advax@triumf.ca From owner-robots Wed Oct 25 15:49:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05021; Wed, 25 Oct 95 15:49:09 -0700 Date: Thu, 26 Oct 1995 08:48:57 +1000 From: Murray Bent <murrayb@icis.qut.edu.au> Message-Id: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: lycos patents Content-Length: 134 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com To add insult to injury, Lycos are patenting spiders and robots. Anyone care to comment on what Lycos Inc. is up to these days? mj From owner-robots Wed Oct 25 15:56:03 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05454; Wed, 25 Oct 95 15:56:03 -0700 Message-Id: <9510252256.AA05447@webcrawler.com> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Scott Stephenson <scott> Date: Wed, 25 Oct 95 15:55:18 -0700 To: robots Subject: Re: lycos patents References: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, What, Lycos is trying to patent spiders and robots. Got any more information on this?!? How can this be possible, as it is certainly not technology that they developed. ss From owner-robots Wed Oct 25 15:58:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05629; Wed, 25 Oct 95 15:58:36 -0700 Message-Id: <9510252258.AA05583@webcrawler.com> To: robots Cc: Murray Bent <murrayb@icis.qut.edu.au> Subject: Re: lycos patents In-Reply-To: Your message of "Thu, 26 Oct 1995 08:48:57 +1000." 
<199510252248.IAA09980@wittgenstein.icis.qut.edu.au>
Date: Wed, 25 Oct 1995 15:58:13 -0700
From: Martijn Koster <mak@beach.webcrawler.com>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

In message <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>, Murray Bent writes:
> To add insult to injury, Lycos are patenting spiders and robots.

Can you elaborate? Where did you hear this, where can we find out more?

-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Wed Oct 25 16:09:34 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06328; Wed, 25 Oct 95 16:09:34 -0700
Date: Wed, 25 Oct 1995 19:08:47 -0400 (EDT)
From: Matthew Gray <mkgray@Netgen.COM>
X-Sender: mkgray@bokonon
To: robots@webcrawler.com
Subject: Re: lycos patents
In-Reply-To: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>
Message-Id: <Pine.SOL.3.91.951025190537.13893C-100000@bokonon>
Organization: net.Genesis
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

> To add insult to injury, Lycos are patenting spiders and robots.

I assume he is referring to the comment:

> We have a patent pending on our spider technology, which makes it
> possible for us to both keep up with the exponential growth of the
> Internet, and still find the most popular sites.

which appears in the FAQ at http://lycos-tmp1.psc.edu/reference/faq.html

I hope when they refer to "our spider technology", they are referring to
something genuinely unique. If not there are a great many cases for
prior art, notably my Wanderer which (while no longer the best) was the
first one around in spring of '93.

I agree that some comment or clarification from Lycos would be good.

Matthew Gray --------------------------------- voice: (617) 577-9800
net.Genesis                                      fax: (617) 577-9850
56 Rogers St.
mkgray@netgen.com Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray From owner-robots Wed Oct 25 16:19:27 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06783; Wed, 25 Oct 95 16:19:27 -0700 Date: Thu, 26 Oct 1995 09:16:39 +1000 From: Murray Bent <murrayb@icis.qut.edu.au> Message-Id: <199510252316.JAA10010@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: re: Lycos patents Content-Length: 570 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com reference: > From: "Alison O'Balle" <a.oballe@mail.utexas.edu> (Alison O'Balle) > Subject: Catalog of the Internet > To: Multiple recipients of list <web4lib@library.berkeley.edu> [...] > A representative from Lycos made a presentation on campus Thursday morning > in which he said a number of interesting things about the future of the > internet, cataloging,and other topics. [Interesting facts and figures deleted] > They are patenting web spiders and robots. This was glossed over, but the > lycos guy said the patent process was going well for them so far. From owner-robots Wed Oct 25 16:22:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06913; Wed, 25 Oct 95 16:22:14 -0700 Message-Id: <9510252322.AA06904@webcrawler.com> To: fuzzy@cmu.edu Cc: robots Subject: Patents? From: Martijn Koster <m.koster@webcrawler.com> Date: Wed, 25 Oct 1995 16:22:18 -0700 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi Fuzzy, I can't see you in the list of subscribers to the robots list, (to which this is cc'ed) so maybe you missed a message regarding patents there. In http://www.lycos.com/reference/faq.html one reads: > We have a patent pending on our spider technology, which makes it > possible for us to both keep up with the exponential growth of the > Internet, and still find the most popular sites. Can you give any further details, either on the technical nature or the patent application? 
-- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Oct 25 16:45:53 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08081; Wed, 25 Oct 95 16:45:53 -0700 Message-Id: <n1397482621.64443@mail.intouchgroup.com> Date: 25 Oct 1995 16:47:13 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: Re: lycos patents To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > What, Lycos is trying to patent spiders and robots. Got any more > information on this?!? How can this be possible, as it is certainly > not technology that they developed. If this is so, then some interested parties should let the Patent Office (or whatever the corresponding US body is called) know this. Particularly given what a terrible job they have been doing judging software and algorithm patents recently, it's a bad idea to just assume that the Patent Office will get it right. --Roger Dearnaley From owner-robots Wed Oct 25 19:19:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14858; Wed, 25 Oct 95 19:19:25 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510260219.DAA02026@wsinis02.win.tue.nl> Subject: Re: lycos patents To: robots@webcrawler.com Date: Thu, 26 Oct 1995 03:19:08 +0100 (MET) In-Reply-To: <Pine.SOL.3.91.951025190537.13893C-100000@bokonon> from "Matthew Gray" at Oct 25, 95 07:08:47 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1094 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Lycos's patents: >I hope when they refer to "our spider technology", they are referring to >something genuinely unique. 
If not there are a great many cases for >prior art, notably my Wanderer which (while no longer the best) was the >first one around in spring of '93. Mmm ... I think I first saw JumpStation in January '93. http://js.stir.ac.uk/jsbin/js Simple spiders existed before; I used one in November '92 to fill a proxy cache and fake a live Internet connection for a demo, but it wasn't used for indexing purposes. >I agree that some comment or clarification from Lycos would be good. The author has been seen to post to this list, before it moved. I should think the summaries may be patentable; in fact this thought first occurred to me when I saw his short talk on Lycos at WWW'95 in Darmstadt, in the workshop on Web indexing. But I haven't heard from Lycos since. There may be some unusual tricks in running the spiders as well. If XOR-ing bitmaps can be patented, why can't a bunch of details in spider technology? -- Reinier Post reinpost@win.tue.nl From owner-robots Tue Oct 31 06:58:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03475; Tue, 31 Oct 95 06:58:02 -0800 From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella) Message-Id: <9510311459.AA13828@anti.tesi.dsi.unimi.it> Subject: meta tag implementation To: robots@webcrawler.com (Mailing list su robot) Date: Tue, 31 Oct 1995 15:59:26 +0100 (MET) Organization: Dept. of Computer Science, Milan, Italy. X-Mailer: ELM [version 2.4 PL23alpha2] Content-Type: text Content-Length: 772 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi to everybody! I would like to know what do you think about a possible implementation of the meta http-equiv tag on an http-server. 
I'm working in this direction to build a complete system to catalogue WWW docs, but I think that the biggest problem is that there isn't any http-server that handles this meta tag (maybe only the WN server). Thanx Davide +--------------------------------------------------+ |Davide Musella | |e-Mail musella@dsi.unimi.it Dept. of | |Phone number +39.(0)2.4390821 Computer Science | |Address: Via Montevideo, 25 University of | | 20144 Milano ITALY Milan, Italy | |http://www.dsi.unimi.it/Users/Tesi/musella | +--------------------------------------------------+ From owner-robots Thu Nov 2 09:30:07 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15340; Thu, 2 Nov 95 09:30:07 -0800 Message-Id: <YkaDzD200YUxASM0sm@andrew.cmu.edu> Date: Thu, 2 Nov 1995 12:28:47 -0500 (EST) From: "Jeffrey C. Chen" <jc7k+@andrew.cmu.edu> To: robots@webcrawler.com (Mailing list su robot) Subject: Re: meta tag implementation Cc: In-Reply-To: <9510311459.AA13828@anti.tesi.dsi.unimi.it> References: <9510311459.AA13828@anti.tesi.dsi.unimi.it> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi everybody! I am an MS student at CMU. I am working on a software tool for collecting full system traces on the Alpha. The tool will also gather statistics by using the on-chip hardware event counters. I am interested in using a web server and a client as my test workload. It would be interesting to identify performance bottlenecks in a web server as it runs over a period of time servicing requests. Does anyone have a simple robot that I can use to exercise a web server?
Thanks, Jeff From owner-robots Thu Nov 2 10:40:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20410; Thu, 2 Nov 95 10:40:02 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199511021835.UAA17200@krisse.www.fi> Subject: Simple load robot To: robots@webcrawler.com Date: Thu, 2 Nov 1995 20:35:19 +0200 (EET) In-Reply-To: <YkaDzD200YUxASM0sm@andrew.cmu.edu> from "Jeffrey C. Chen" at Nov 2, 95 12:28:47 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 412 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Does anyone have a simple robot that I can use to exercise a web > server? Would this do the job, maybe run multiple times in parallel? (Please replace the URLs..)

#!/bin/sh
# Fetch a fixed list of pages over and over to keep the server busy.
while true
do
  for i in \
    http://www.fi/ \
    http://www.fi/search.html \
    http://www.fi/index/ \
    http://www.fi/~jaakko/ \
    http://www.fi/sss/ \
    http://www.fi/www/ \
    http://www.fi/links.html
  do
    lynx -source "$i" > /dev/null
  done
done

From owner-robots Mon Nov 6 22:44:28 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15194; Mon, 6 Nov 95 22:44:28 -0800 Date: Tue, 7 Nov 1995 00:43:47 -0600 Message-Id: <9511070643.AA120822@nic.smsu.edu> X-Sender: kdf274s@nic.smsu.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Keith Fischer <kfischer@mail.win.org> Subject: Preliminary robot.faq (Please Send Questions or Comments) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Archive-name: robot.faq Posting-Frequency: variable Last-modified: Nov. 6, 1995 This article is a description and primer for World Wide Web robots and spiders. The following topics are addressed: 1) DEFINING ROBOTS AND SPIDERS 1.1) What is a ROBOT? 1.2) What is a SPIDER? 1.3) What is a search engine? 1.4) How many ROBOTS are there?
1.5) What can be achieved by using ROBOTS? 1.6) What harm can a ROBOT do? 2) THE THEORY BEHIND A ROBOT 2.1) Who can write one? 2.2) How is one written? 2.3) What is the Proposed Standard for Robot Exclusion? 2.4) What are the potential problems? 2.5) How do I use proper Etiquette? 3) THE REALITY OF THE WEB 3.1) Can I visit the entire web? 1) DEFINING ROBOTS AND SPIDERS 1.1) What is a ROBOT? A Robot is a program that traverses the World Wide Web, gathering some sort of information from each site it visits. This journey is accomplished by visiting a web page and then recursively visiting all or some of its linked pages. 1.2) What is a SPIDER? Spiders are synonymous with Robots, as are Wanderers. These names, however, have some misleading implications. For instance, many people think that a spider or wanderer leaves the home site to work its magic, when in reality it never leaves. The Spider rather just acts as a sophisticated web browser, automatically retrieving documents and/or images until it is told to stop. I prefer the term Robot and will continue using it throughout this document. 1.3) What is a search engine? A search engine is not a robot. However, some search engines rely heavily on robots. A search engine is nothing more than a glorified index. It searches the index, which resides on the host's computer, and returns the result. A common misconception is that a search engine like Lycos or Yahoo actively searches the web upon request. This is not true; all activity by the robot is done ahead of time. 1.4) How many ROBOTS are there? There are about 30 in existence. Martijn Koster maintains a list at: http://info.webcrawler.com/mak/projects/robots/active.html 1.5) What can be achieved by using ROBOTS? The possibilities are endless. Once you visit a page, you have free run of the html. You can retrieve files or the html itself. Most robots retrieve pieces of the html document. This is then used to build an index, which is later used by a search engine.
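To make the "build an index, search it later" idea in 1.5 concrete, here is a minimal sketch in Python of an inverted index: a table from each word to the set of pages containing it. It is only an illustration, not how any particular robot or search engine works. The names TextExtractor, index_page and search are hypothetical, and the pages are assumed to have been fetched already.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def index_page(index, url, html):
    """Record every word found in `html` under `url` (done ahead of time)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    for word in re.findall(r"[a-z0-9]+", text):
        index[word].add(url)


def search(index, query):
    """Return the URLs containing every word of `query` (done on request)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    index = defaultdict(set)
    index_page(index, "http://a.example/", "<title>Robot FAQ</title>")
    print(search(index, "robot faq"))
```

Note that search() only reads the table; nothing is fetched from the web at query time, which is the point 1.3 makes about search engines.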
1.6) What harm can a ROBOT do? The robot can do no harm per se, but it can anger a lot of people. If your robot acts irresponsibly it can fall into a black hole (a link that dynamically makes new links), or worse, it can get stuck in a loop. Both of these are certain to wreak havoc on a server. The goal in web traversal is to never be on one server for too long. The solution to the problem of bad html, or rather your robot's handling of bad html, is to stay online. Simply put, never leave your robot unattended. 2) THE THEORY BEHIND A ROBOT 2.1) Who can write one? Anyone can write a robot provided that they have web access. But, a word to the wise: tell your system administrators, because they WILL feel the system drain and they WILL hear many complaints concerning your activities. And just because the possibility exists doesn't mean you should take on this task half-cocked. Before even thinking about coding a robot: do your research, have an intended goal, and read the following: The Proposed Standard for Robot Exclusion located at: http://info.webcrawler.com/mak/projects/robots/norobots.html The Guidelines for Robot Writers located at: http://info.webcrawler.com/mak/projects/robots/guidelines.html Ethical Web Agent located at: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html 2.2) How is one written? A Robot is nothing more than an executable program. It can be in the form of a script or a binary file. It makes a connection to a web server and requests that a document be sent, much the same way a web browser works. The difference is in the automation provided by the robot. 2.3) What is the Proposed Standard for Robot Exclusion? Martijn Koster explains the reason for a robot exclusion standard with the following: "In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g.
certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)." The form the robot exclusion standard takes is given in more detail at: The Proposed Standard for Robot Exclusion located at: http://info.webcrawler.com/mak/projects/robots/norobots.html 2.4) What are the potential problems? The potential problems can't be listed. The list would be far too big and unpredictable. The very nature of the World Wide Web is diversity, and this very diversity makes robot writing both important and increasingly difficult. There is no one right html. Documents can be written in many ways and in many formats. My suggestion is to get the spec sheet for html and practice, practice, practice, making your robot robust. 2.5) How do I use proper Etiquette? Etiquette is a very touchy subject. Many people stand in opposition to your newly written robot. They don't like the idea that their server will be overrun with seemingly pointless requests. The solution is simple: first, give them the results. Or rather, put up for public consumption the results of your searches. This is the concept of giving back to the community that provided for you. Not to mention, if a person can use your results, the robot's requests may seem to have more merit. Another form of etiquette is slow requests. You've heard the term rapid fire. This means quick requests (a request every second or so); basically put, this brings a server to its figurative knees. The solution is to limit your requests to any given server to one every minute (some say one every five minutes).
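The two rules discussed in 2.3 and 2.5 (honour the site's /robots.txt, and leave at least a minute between requests to the same server) can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a reference implementation: PoliteScheduler and allowed_by_robots are hypothetical names, the 60-second default simply mirrors the figure in the text, and the robots.txt parsing relies on the standard library's urllib.robotparser.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=60.0, clock=time.monotonic):
        self.min_delay = min_delay   # seconds between hits on one host
        self.clock = clock           # injectable so the logic can be tested
        self.next_allowed = {}       # host -> earliest permitted time

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched (0.0 if allowed now)."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url):
        """Note that a request for `url` is being sent right now."""
        host = urlsplit(url).netloc.lower()
        self.next_allowed[host] = self.clock() + self.min_delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check `url` against the text of the site's /robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A robot would call allowed_by_robots() before queueing a URL, sleep for wait_time(url) seconds before fetching it, and call record_request(url) as the request goes out. Keeping the delay per host lets the robot stay busy on other servers while one server rests.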
More information about etiquette is located at: The Guidelines for Robot Writers located at: http://info.webcrawler.com/mak/projects/robots/guidelines.html Ethical Web Agents located at: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html 3) THE REALITY OF THE WEB 3.1) Can I visit the entire web? No. So don't try. Gauge your goals in reasonable amounts. ______________________________________________________________ I disclaim everything. The contents of this article might be totally inaccurate, inappropriate, misguided, or otherwise perverse - except for my name (you can probably trust me on that). Copyright (c) 1995 by Keith D. Fischer, all rights reserved. This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may not be distributed for financial gain. This FAQ may not be included in commercial collections or compilations without express permission from the author. ____________________________________________________________ Keith D. Fischer - kfischer@mail.win.org or kfischer@science.smsu.edu Keith D. Fischer kfischer@mail.win.org kdf274s@nic.smsu.edu "Misery loves company" By Anonymous "Today is a good day to die." By Crazy Horse "To be or not to be ..." Hamlet -- William Shakespeare From owner-robots Tue Nov 7 02:37:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27042; Tue, 7 Nov 95 02:37:01 -0800 Date: Tue, 7 Nov 95 10:32:55 GMT Message-Id: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <S.Nisbet@DOC.MMU.AC.UK> Subject: Re: meta tag implementation Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 12:28 PM 11/2/95 -0500, you wrote: >Hi everybody!
> >I am a MS student at CMU. I am working on a software tool for >collecting full system traces on the Alpha. The tool will also gather >statistics by using the on-chip hardware event counters. I am >interested in using a web server and a client as my test workload. It >would be interesting to identify performance bottlenecks in a web server >as it runs over a period of time servicing requests. Does anyone have a >simple robot that I can use to exercise a web server? > >Thanks, >Jeff > > Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor robots that have nothing to do with PERL, could you let me know. I tride asking the same question you asked, but got no replies. From owner-robots Tue Nov 7 04:05:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02122; Tue, 7 Nov 95 04:05:00 -0800 From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella) Message-Id: <9511071205.AA13152@anti.tesi.dsi.unimi.it> Subject: Re: meta tag implementation To: robots@webcrawler.com Date: Tue, 7 Nov 1995 13:05:21 +0100 (MET) In-Reply-To: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> from "Steve Nisbet" at Nov 7, 95 10:32:55 am Organization: Dept. of Computer Science, Milan, Italy. X-Mailer: ELM [version 2.4 PL23alpha2] Content-Type: text Content-Length: 251 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor > robots that have nothing to do with PERL, could you let me know. I tride > asking the same question you asked, but got no replies. No replies until now....sigh!!! 
Davide From owner-robots Tue Nov 7 06:17:49 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08070; Tue, 7 Nov 95 06:17:49 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199511071417.OAA06656@wsinis11.win.tue.nl> Subject: Re: meta tag implementation To: robots@webcrawler.com Date: Tue, 7 Nov 1995 15:17:26 +0100 (MET) In-Reply-To: <9511071205.AA13152@anti.tesi.dsi.unimi.it> from "Davide Musella" at Nov 7, 95 01:05:21 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Davide Musella) write: > >> Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor >> robots that have nothing to do with PERL, could you let me know. I tride >> asking the same question you asked, but got no replies. > >No replies until now....sigh!!! You might use Lynx (2.4.FM); it has a -traverse switch now. It is experimental, and I don't think it supports the RES (Robot Exclusion Standard) yet. We have a simple robot written in C, but it doesn't follow the RES either. What's your reason for staying away from Perl? >Davide -- Reinier Post reinpost@win.tue.nl From owner-robots Tue Nov 7 06:54:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09686; Tue, 7 Nov 95 06:54:36 -0800 Date: Tue, 7 Nov 95 14:41:39 GMT Message-Id: <9511071441.AA11827@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <S.Nisbet@DOC.MMU.AC.UK> Subject: Re: meta tag implementation Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi Davide, thanks very much for the info. I stay away from Perl here because it was badly set up and I have to reinstall it.
So it's more of a grudge :) Other than that I think it's a good thing. I will do as you suggest. All the best in your endeavours. From owner-robots Tue Nov 7 07:11:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10944; Tue, 7 Nov 95 07:11:12 -0800 Message-Id: <m0tCpg2-0003LMC@giant.mindlink.net> Date: Tue, 7 Nov 95 07:11 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some >sort of information from each site it visits. This journey is accomplished >by visiting a web page and then recursively visiting all or some of its >linked pages. True but misleading; there are much better strategies for covering the web than this kind of direct recursion. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Wed Nov 8 01:30:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21486; Wed, 8 Nov 95 01:30:52 -0800 Date: Wed, 8 Nov 1995 03:30:45 -0600 Message-Id: <9511080930.AA35454@nic.smsu.edu> X-Sender: kdf274s@nic.smsu.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Keith Fischer <kfischer@mail.win.org> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>1.1) What is a ROBOT? >> >> A Robot is a program that traverses the World Wide Web, gathering some >>sort of information from each site it visits.
This journey is accomplished >>by visiting a web page and then recursively visiting all or some of its >>linked pages. > >True but misleading; there are much better strategies for covering >the web than this kind of direct recursion. > > >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) 1.1) What is a ROBOT? A Robot is a program that traverses the World Wide Web, gathering some sort of information from each site it visits. This journey is accomplished by visiting a web page and then visiting some or all of its linked pages. The method one follows, whether it's recursive or some sort of fuzzy logic, determines the effectiveness of the search. How is the above? If you like, this will be the new 1.1. Also, could you please elaborate on better strategies? (I'm assuming you are talking about the fuzzy logic that Yahoo and Lycos use.) Keith kfischer@mail.win.org kdf274s@nic.smsu.edu Keith D. Fischer kfischer@mail.win.org kdf274s@nic.smsu.edu "Misery loves company" By Anonymous "Today is a good day to die." By Crazy Horse "To be or not to be ..." Hamlet -- William Shakespeare From owner-robots Wed Nov 8 05:45:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03365; Wed, 8 Nov 95 05:45:00 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199511081344.NAA17571@wsinis02.win.tue.nl> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) To: robots@webcrawler.com Date: Wed, 8 Nov 1995 14:44:43 +0100 (MET) In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Keith Fischer) write: >1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some >sort of information from each site it visits.
This journey is accomplished >by visiting a web page and then visiting some or all of its linked pages. >The method one follows whether it's recursive or some sort of fuzzy logic >determines the effectiveness of the search. We have a robot which does 'fuzzy' searching, for which your description is appropriate. But in general, the document collection process (= robot) and the search process executed in response to a user query (on the resulting collection) are completely separate. Besides, searching the contents of document collections is not the only purpose of robots; robots can be used to check the validity of hyperlinks, for example. Your description is accurate, as applied to the robot process itself, but it may be confusing. A minor quibble: robots must use some heuristics in determining which links to follow. All robots are 'recursive', and most of them cut off the process in a more or less arbitrary way, which could be called 'fuzzy'. There is no either/or decision here. -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From owner-robots Wed Nov 8 08:38:48 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13242; Wed, 8 Nov 95 08:38:48 -0800 Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) From: YUWONO BUDI <yuwono@uxmail.ust.hk> To: robots@webcrawler.com Date: Thu, 9 Nov 1995 00:37:33 +0800 (HKT) In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 1603 Message-Id: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > >>1.1) What is a ROBOT? > >> > >> A Robot is a program that traverses the World Wide Web, gathering some > >>sort of information from each site it visits.
This journey is accomplished > >>by visiting a web page and then recursively visiting all or some of its > >>linked pages. > > > >True but misleading; there are much better strategies for covering > >the web than this kind of direct recursion. > > > > > >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) > > > 1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some > sort of information from each site it visits. This journey is accomplished > by visiting a web page and then visiting some or all of its linked pages. > The method one follows whether it's recursive or some sort of fuzzy logic > determines the effectiveness of the search. I am not sure I understand what the original comment is getting at. But it seems to me that the word "recursive" is somewhat overloaded. To those with a CS background, a "recursive" visit implies a "depth first" tree traversal. Most robot implementations that I'm aware of use "breadth first" traversals. Among the reasons is that you would want to be able to limit the depth your robot digs into. Whether depth limitation is more useful than breadth limitation is another issue, IMHO. One thing is for sure: stopping the robot after it reaches a certain depth is much simpler than deciding which links to follow/ignore. I don't know what would be the more general term in place of "recursively"; "sequentially" perhaps? -Budi.
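The breadth-first traversal with a depth cut-off described in the message above can be sketched in a few lines of Python. This is only an illustration of the technique, not the code of any robot on the list: crawl_breadth_first is a hypothetical name, and get_links stands in for the "fetch the page and extract its links" step, which a real robot would have to supply.

```python
from collections import deque


def crawl_breadth_first(start_url, get_links, max_depth):
    """Visit pages level by level, never going deeper than `max_depth`.

    `get_links(url)` is a caller-supplied function returning the URLs
    linked from `url`; in a real robot it would fetch and parse the page.
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()      # FIFO queue gives breadth-first order
        order.append(url)
        if depth == max_depth:
            continue                      # depth limit: do not expand further
        for link in get_links(url):
            if link not in seen:          # each page is queued at most once
                seen.add(link)
                queue.append((link, depth + 1))
    return order


if __name__ == "__main__":
    # A toy link graph standing in for real pages.
    graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": ["e"], "e": []}
    print(crawl_breadth_first("a", lambda u: graph.get(u, []), 2))
```

Swapping popleft() for pop() turns the queue into a stack and the traversal into depth-first, which is why the depth limit is so cheap to add in the breadth-first form: the depth travels with each queued URL and no recursion is involved.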
From owner-robots Thu Nov 9 08:53:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12795; Thu, 9 Nov 95 08:53:37 -0800 Resent-Message-Id: <9511091653.AA12783@webcrawler.com> Resent-From: mak@beach.webcrawler.com Resent-To: robots Resent-Date: Thu, 9 Nov 1995 16:53:32 Date: Wed, 8 Nov 95 10:08:51 -0800 From: <owner-robots> Message-Id: <9511081808.AA19321@webcrawler.com> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >From tbray@opentext.com Wed Nov 8 10:08:46 1995 Return-Path: <tbray@opentext.com> Received: from giant.mindlink.net by webcrawler.com (NX5.67f2/NX3.0M) id AA19311; Wed, 8 Nov 95 10:08:46 -0800 Received: from Default by giant.mindlink.net with smtp (Smail3.1.28.1 #5) id m0tDEv9-000343C; Wed, 8 Nov 95 10:08 PST Message-Id: <m0tDEv9-000343C@giant.mindlink.net> Date: Wed, 8 Nov 95 10:08 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Cc: robots@webcrawler.com We're wasting too much time on this. All I meant to say was that the original language strongly suggested that robots use the following algorithm:

sub RetrievePage(url)
    text = HttpGet(url)
    foreach sub_url in text
        RetrievePage(sub_url)

Whereas lots of robots don't. Obviously it is recursive in that you do pull urls out of pages and eventually follow them, but it doesn't feel recursive. The 'fuzzy' stuff is a complete red herring - except for the special case of 'fuzzy logic' (not what's being done here) the word 'fuzzy' in the information retrieval context is a marketing term without semantic content.
Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Fri Nov 17 09:12:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21835; Fri, 17 Nov 95 09:12:34 -0800 Date: Fri, 17 Nov 1995 09:24:00 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Bad robot: WebHopper bounch! Owner: peter@cartes.hut.fi In-Reply-To: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk> Message-Id: <Pine.LNX.3.91.951117085518.25864A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was checking my stats and this showed up with 1838 hits on the 9th of November. It tried to completely explore an infinite virtual space in one run, with an average time between hits of 4.3 seconds. Its parser has to be broken, because it was exploring a space defined by a ?cookie=number (used for shopping basket session tracking) but failing to preserve the '=' between calls (generating 'cookienumber' instead of 'cookie=number'), causing a new cookie to be assigned to every request. It went into an infinite loop over the same five base pages as it tried to do a depth first search of the site - for a little over two hours. Argh. Anyone else hit by this rather broken robot?
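Two common defences against the kind of "infinite virtual space" described above can be sketched as follows in Python. This is a hedged illustration and makes no claim about how WebHopper or any other robot works. The names SESSION_PARAMS, canonical_url and VisitLog are hypothetical, the parameter name "cookie" is taken from the message, and the cap of 500 pages per host is an arbitrary assumption. Neither defence repairs a broken URL parser; the per-host cap only bounds the damage when something like that goes wrong.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that only carry per-visit session state.
SESSION_PARAMS = {"cookie"}


def canonical_url(url):
    """Return `url` with session parameters and the fragment removed."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(query), ""))


class VisitLog:
    """Remembers canonical URLs and caps the number of pages per host."""

    def __init__(self, max_per_host=500):
        self.max_per_host = max_per_host
        self.seen = set()
        self.per_host = {}

    def should_visit(self, url):
        """True the first time a page is offered, until the host cap is hit."""
        key = canonical_url(url)
        host = urlsplit(key).netloc
        if key in self.seen or self.per_host.get(host, 0) >= self.max_per_host:
            return False
        self.seen.add(key)
        self.per_host[host] = self.per_host.get(host, 0) + 1
        return True
```

With the session parameter stripped, the same five base pages map to five canonical URLs no matter how many cookies the server hands out, so the robot would stop after visiting each once.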
-- Benjamin Franz, Webmaster, Net Images From owner-robots Thu Nov 23 12:44:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03420; Thu, 23 Nov 95 12:44:36 -0800 Date: Thu, 23 Nov 1995 12:42:51 -0800 (PST) From: Andrew Daviel <andrew@andrew.triumf.ca> To: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk Cc: Daniel Terrer <Daniel.Terrer@sophia.inria.fr> Subject: wwwbot.pl problem Message-Id: <Pine.LNX.3.91.951123111508.16547A-100000@andrew.triumf.ca> Mime-Version: 1.0 Content-Type: text/PLAIN; charset="US-ASCII" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com (I sent a request to libwww-perl-request just before my last message to the list, so I might not be on yet. Please Cc any replies to me.) I was having trouble with wwwbot from the libwww-perl-0.40 library. I continued to work on the problem after posting to the perl list. It seems that botcache is not well enough defined, so that a site with User-Agent: * Disallow: / would kill subsequent GETs to a site that was previously in the cache. I have made a patch which adds the address to the cache, and fixes a couple of other odd cases, such as where the address is not fully defined when working within a domain, and where there are host names such as ypsun, ypsun2 etc. which would become confused with the path count.
See ftp://andrew.triumf.ca/pub/wwwbot.patch Andrew Daviel email: advax@triumf.ca TRIUMF voice: 604-222-7376 4004 Wesbrook Mall fax: 604-222-7307 Vancouver BC http://andrew.triumf.ca/~andrew Canada V6T 2A3 49D14.7N 123D13.6W From owner-robots Thu Nov 23 23:45:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07952; Thu, 23 Nov 95 23:45:39 -0800 Date: Fri, 24 Nov 95 16:45:28 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511240745.AA03918@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: yet another robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com For all its worth, we have implemented a robot in order to (surprise surprise) gather web resources to build a (distributed) search database. The robot is called Yobot, and http://rodem.slab.ntt.jp:8080/home/robot-e.html tells you who to complain to if Yobot misbehaves. Thanks, PF From owner-robots Fri Nov 24 13:51:35 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17245; Fri, 24 Nov 95 13:51:35 -0800 Date: Sat, 25 Nov 1995 07:53:43 +1000 (EST) From: David Eagles <eaglesd@planets.com.au> To: robots@webcrawler.com Subject: yet another robot, volume 2 In-Reply-To: <9511240745.AA03918@cactus.slab.ntt.jp> Message-Id: <Pine.LNX.3.91.951125075027.1078A-100000@earth.planets.com.au> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com We, too, have developed a robot to provide Web resource search facilities to Australia and the South Pacific. The crawler engine will only follow links to designated domains, and the search engine allows individual selection of the search domain for queries. Named after a famous Australian spider, the FunnelWeb, the service is available at http://funnelweb.net.au Enjoy. 
Regards, David Eagles From owner-robots Fri Nov 24 15:20:08 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22501; Fri, 24 Nov 95 15:20:08 -0800 Date: Sat, 25 Nov 95 09:29:44 +1100 (EST) Message-Id: <v01530506acdc92c54ddb@[192.190.215.44]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: yet another robot, volume 2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >We, too, have developed a robot to provide Web resource search facilities >to Australia and the South Pacific. The crawler engine will only follow >links to designated domains, and the search engine allows individual >selection of the search domain for queries. > >Named after a famous Australian spider, the FunnelWeb, the service is >available at http://funnelweb.net.au > >Enjoy. >David we tried it out the other day.We lodged AAA and Tourist Radio(2 Sites) Great VISION Keith Ashton >Regards, >David Eagles AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa From owner-robots Fri Nov 24 16:13:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25777; Fri, 24 Nov 95 16:13:17 -0800 Date: Sat, 25 Nov 95 11:13:05 +1100 (EST) Message-Id: <v01530507acdcab771ad7@[192.190.215.44]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: yet another robot, volume 2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>We, too, have developed a robot to provide Web resource search facilities >>to Australia and the South Pacific. The crawler engine will only follow >>links to designated domains, and the search engine allows individual >>selection of the search domain for queries. 
>> >>Named after a famous Australian spider, the FunnelWeb, the service is >>available at http://funnelweb.net.au >> > >>Enjoy. >>David we tried it out the other day.We lodged AAA and Tourist Radio(2 Sites) >Great VISION > >Keith Ashton > > > ____________________________________________________________________________ ___________ David, We just got an Email back from you but there was no content Keith ____________________________________________________________________________ ____________ > > > >>Regards, >>David Eagles > >AAA Australia Announce Archive / Tourist Radio >Home of the Australian Cool Site of the Day ! >http://www.com.au/aaa AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa From owner-robots Sat Nov 25 06:21:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05034; Sat, 25 Nov 95 06:21:14 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199511251419.XAA02550@ktmp.kaist.ac.kr> Subject: Q: Cooperation of robots To: robots@webcrawler.com (Robot Mailing list) Date: Sat, 25 Nov 1995 23:19:12 +0900 (KST) X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 378 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I am newbie to this mailing-list. If I do some mistake, plz reply to me. I have one question : Is there some effort for robots to do gathering informations in cooperative work style? That is, Sharing informations gathered by the other kind of robots with some communication between robots like the that of intelligent agents in Intelligent Agent area. 
- Byung-Gyu Chang From owner-robots Sat Nov 25 10:19:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15907; Sat, 25 Nov 95 10:19:10 -0800 Date: Sat, 25 Nov 1995 13:19:03 -0500 Message-Id: <199511251819.NAA27702@moe.infi.net> X-Sender: magi@infi.net (Unverified) X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Michael Goldberg <magi@infi.net> Subject: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A am developing sites for numerous large associations. I want to provide a service to the members by which they can choose from selected topics..say mortgage interest rates..and a robot goes out and searches selected sites and provides either by e-mail a formated "newsletter" or return a "newsletter" in html. Any suggestions? <<< Media Access Group>>> Local Access to electronic marketing Triad member- Network Hampton Roads 2101 Parks Ave. Suite 606 Virginia Beach, VA 23451 804-422-4481 From owner-robots Sat Nov 25 15:22:58 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01362; Sat, 25 Nov 95 15:22:58 -0800 Date: Sun, 26 Nov 1995 09:24:39 +1000 (EST) From: David Eagles <eaglesd@planets.com.au> To: robots@webcrawler.com Subject: Re: Q: Cooperation of robots In-Reply-To: <199511251419.XAA02550@ktmp.kaist.ac.kr> Message-Id: <Pine.LNX.3.91.951126091817.2816A-100000@earth.planets.com.au> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Sat, 25 Nov 1995, Byung-Gyu Chang wrote: > Hi, I am newbie to this mailing-list. If I do > some mistake, plz reply to me. > > > I have one question : > > Is there some effort for robots to do gathering > informations in cooperative work style? 
> That is, Sharing informations gathered by the other kind of > robots with some communication between robots like > the that of intelligent agents in Intelligent Agent area. > > - Byung-Gyu Chang > I'm not sure if there is any official cooperation going on, but I'm currently enhancing my web crawler (http://funnelweb.net.au) to include support for this type of operation. Basically, here's what I'm planning: The current web crawler, based in Australia, limits its searching and collection to countries in the South Pacific. I'm planning to enhance this such that any URLs found (during the crawling process) for non-South Pacific countries will be forwarded to the web crawler responsible for that domain (as determined by a simple config file - maybe an automated registration process in the future). Similarly, the search engine will allow ANY individual country(s) to be searched (as is the case now for only South Pacific countries), and will fork the request off to the appropriate engine. Is this the type of info you were after? Regards, David Eagles From owner-robots Sun Nov 26 09:10:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17874; Sun, 26 Nov 95 09:10:54 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130502acde4cefcae6@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 26 Nov 1995 09:10:32 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Q: Cooperation of robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 11:19 PM 11/25/95, Byung-Gyu Chang wrote: >Is there some effort for robots to do gathering >informations in cooperative work style? >That is, Sharing informations gathered by the other kind of >robots with some communication between robots like >the that of intelligent agents in Intelligent Agent area.
There are various efforts, but the most significant one is probably the Harvest project at the University of Colorado. I can't remember their URL at the moment, but I know we have a link to it from: http://www.verity.com/customers.html Nick From owner-robots Sun Nov 26 16:57:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11355; Sun, 26 Nov 95 16:57:32 -0800 Date: Mon, 27 Nov 95 09:57:15 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511270057.AA12772@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > A am developing sites for numerous large associations. I want to provide a > service to the members by which they can choose from selected topics..say > mortgage interest rates..and a robot goes out and searches selected sites > and provides either by e-mail a formated "newsletter" or return a > "newsletter" in html. > > Any suggestions? A number of people are working towards the ability to search selected sites, though I haven't heard of anyone trying to put the results in a newsletter format. Harvest allows the user to custom build his own database, which is then locally accessed at search time. (http://harvest.cs.colorado.edu/) MetaCrawler, Silk, IBMinfoMarket, and no doubt many others query multiple pre-configured search databases at search time. (http://metacrawler.cs.washington.edu:8080/home.html http://services.bunyip.com:8000/products/silk/silk.html http://www.infomkt.ibm.com/about.htm) I'm looking forward to the day when two of these "meta" search services point to each other and create an infinite search loop.... PF ps. If you're going to the WWW conference in Boston, I'll be chairing a BOF on distributed searching.
Please see http://rodem.slab.ntt.jp:8080/paulStuff/ From owner-robots Sun Nov 26 18:28:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16185; Sun, 26 Nov 95 18:28:42 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130509acded1bfff27@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 26 Nov 1995 18:28:33 -0800 To: robots@webcrawler.com, owner-robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: BOUNCE robots: Admin request Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:08 AM 11/8/95, <owner-robots@webcrawler.com> wrote: >Whereas lots of robots don't. Obviously it is recursive in that you >do pull urls out of pages and eventually follow them, but it doesn't >feel recursive. The 'fuzzy' stuff is a complete red herring - except >for the special case of 'fuzzy logic' (not what's being done here) the >word 'fuzzy' in the information retrieval context is a marketing term >without semantic content. Minor point -- let's not assume that no one on the list is using fuzzy logic to decide which links to follow. After all, some of us have search engines that use fuzzy logic operators. I'm fascinated by using evidential reasoning to build agents that explore. Nick From owner-robots Sun Nov 26 19:43:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20091; Sun, 26 Nov 95 19:43:06 -0800 Date: Mon, 27 Nov 95 12:42:56 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511270342.AA14195@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Q: Cooperation of robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is there some effort for robots to do gathering > informations in cooperative work style? 
> That is, Sharing informations gathered by the other kind of > robots with some communication between robots like > the that of intelligent agents in Intelligent Agent area. > I haven't seen anything, but I only pay so much attention to this list. I know that one problem is that many robots run to support profit- (or planned profit-) based services, so don't want to share their info. What do you see as the advantage to sharing information? It is offhand not clear to me that much is to be gained by it. For instance, given that each robot-running organization usually has their own way of processing the resources they find, then they have to go out and retrieve the resources in any event. Thus, not much may be saved by sharing information.... PF From owner-robots Mon Nov 27 01:14:04 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06867; Mon, 27 Nov 95 01:14:04 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199511270913.LAA29177@krisse.www.fi> Subject: Re: Q: Cooperation of robots To: robots@webcrawler.com Date: Mon, 27 Nov 1995 11:13:46 +0200 (EET) In-Reply-To: <9511270342.AA14195@cactus.slab.ntt.jp> from "Paul Francis" at Nov 27, 95 12:42:56 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1744 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com francis@cactus.slab.ntt.jp (Paul Francis): > I haven't seen anything, but I only pay so much > attention to this list. I know that one problem is > that many robots run to support profit- (or planned > profit-) based services, so don't want to share their > info. We at http://www.fi/ have a good coverage of the www-resources of Finland. You are right, we are clearly not willing to share our information base with other search engines in Finland (there is another one). 
On the other hand, it might be possible to share the database with some or all of the international search engines as a promotion. We would not lose any market here in Finland, because our site would always be the fastest way for Finnish customers to search. > What do you see as the advantage to sharing information? > It is offhand not clear to me that much is to be gained > by it. For instance, given that each robot-running > organization usually has their own way of processing > the resources they find, then they have to go out and > retrieve the resources in any event. Thus, not much > may be saved by sharing information.... If the two cooperating parties agree on a common set of information to store about each individual page, both could modify their robots to comply with it. Possibly even just a compressed .tar.gz archive of the pages would do. In any case, it saves bandwidth on international connections and annoys the servers less. I do not believe that our current database would suit anybody else's needs, but maybe the next time we collect all the pages we could also fetch all the information someone else needs. Feel free to contact me at Jaakko.Hyvatti@www.fi if you are interested. We cover almost all of Finland. From owner-robots Mon Nov 27 08:27:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04759; Mon, 27 Nov 95 08:27:10 -0800 Message-Id: <9511271626.AA04714@webcrawler.com> Original-Received: from research by ns Pp-Warning: Illegal Received field on preceding line X-Mailer: exmh version 1.6.4 10/10/95 From: Fred Douglis <douglis@research.att.com> To: Andrew Daviel <andrew@andrew.triumf.ca> Cc: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk, Daniel Terrer <Daniel.Terrer@sophia.inria.fr> Subject: Re: wwwbot.pl problem In-Reply-To: Your message of "Thu, 23 Nov 1995 12:42:51 PST."
<Pine.LNX.3.91.951123111508.16547A-100000@andrew.triumf.ca> X-Face: *lvs`^NFil<?gI%c@~W[5*dWZ5;4-8#&S`1t,Ey&5R5z7nLBE)TKc?44|-sPxDy<i[jb[s Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com XQu4i;It_f~o>3, KN{Fk?$+k063Tiv(F~;?02MoaTUP/:+;eeHIOHWf_Ob-s*iTugCX^)YVicQB<1: {??RaMPnky^1nA7'2!$REBJNc=skHq:poE<ObzL*~*M-w$9Vxx`Lv>ZcirD$]R#_f8~qT,O[Vc)x, G bKn>8, <X)r, rKv|oipe=j/;e0%f/j:#/bRy('D]"f|zB3 X-Uri: http://www.research.att.com/orgs/ssr/people/douglis Date: Mon, 27 Nov 1995 11:15:47 -0500 Sender: douglis@pelican.research.att.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" I reported this bug a few months ago and I thought a patch had been installed in the distribution. Roy? -- Fred Douglis MIME accepted douglis@research.att.com AT&T Bell Laboratories 908 582-3633 (office) 600 Mountain Ave., Rm. 2B-105 908 582-3063 (fax) Murray Hill, NJ 07974 http://www.research.att.com/orgs/ssr/people/douglis/ From owner-robots Mon Nov 27 12:29:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17259; Mon, 27 Nov 95 12:29:54 -0800 Message-Id: <199511272029.PAA14228@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: harvest Date: Mon, 27 Nov 1995 15:29:38 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com there's been some mention of harvest.. the URL is http://harvest.cs.colorado.edu/ this provides a ton of infrastructure for implementing robots on top of, in the form of gatherers and or brokers. harvest sites cooperate so that once (with caching) a set of data (ftp, http, gopher, wais, etc.) has been "harvested" (or gathered), the global harvest database can reuse the gathered info without re-harvesting (re-gathering) from the target data site. this is "responsible"* robots that dont load up data sites with redundant automated downloading and cooperative robots, via brokering. 
* or ethical: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html see http://harvest.cs.colorado.edu/harvest/technical.html for more. for a linear robot cooperation, harvest provides Summary Object Interchange Format (SOIF), http://harvest.cs.colorado.edu/Harvest/brokers/soifhelp.html arbitrary extensions to SOIF are on the object, object-attribute model. for nonlinear robot cooperation or interaction, brokers can be defined arbitrarily. i'm presently working on an associative AI which i had developed as a standalone program, but am stripping my lame gathering and brokering code for the sophistication of harvest. -john From owner-robots Mon Nov 27 14:39:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18292; Mon, 27 Nov 95 14:39:00 -0800 Date: Mon, 27 Nov 95 15:55:32 EST From: Jason_Murray_at_FCRD@cclink.tfn.com Message-Id: <9510278175.AA817518051@cclink.tfn.com> To: robots@webcrawler.com Subject: Re: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Give me a call (617) 345-2465 or send email (netsoft@aol.com). We are in process of creating just such an agent. Jason Murray DataMarket 306 Union St Rockland MA 02370 Fax 617-871-5816 From owner-robots Mon Nov 27 14:58:48 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18458; Mon, 27 Nov 95 14:58:48 -0800 Message-Id: <30BA6C06.444C@infi.net> Date: Mon, 27 Nov 1995 17:55:18 -0800 From: Michael Goldberg <magi@infi.net> Organization: Media Access Group X-Mailer: Mozilla 2.0b2a (Windows; I; 16bit) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: harvest References: <199511272029.PAA14228@lexington.cs.columbia.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Received your email through the robots listserv... I need an application built for a site I am developing... 
THe application allows users of the site to tailor a specified areas of interest,...say mortgages.. and search specific WWW sites and retrieve the information eith by email or a formatted newsletter. Can Harvest do this? From owner-robots Mon Nov 27 16:38:38 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19231; Mon, 27 Nov 95 16:38:38 -0800 Message-Id: <199511280038.TAA14968@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: mortgages with: Re: harvest In-Reply-To: Your message of "Mon, 27 Nov 1995 17:55:18 PST." <30BA6C06.444C@infi.net> Date: Mon, 27 Nov 1995 19:38:34 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Received your email through the robots listserv... > I need an application built for a site I am developing... > THe application allows users of the site to tailor a specified > areas of interest,...say mortgages.. and search specific WWW sites this is the kind of thing that harvest provides for. basically, in "tailoring" information dynamically (as opposed to going to a static menu system) your user is faced with (recursively) traversing an association graph. the user wants to see data with mortgage numbers. associativity is the service we are providing. better associativity, however, classes data, eg, via SOIF, so that the user has more coherent domains to search through than "every document with numeric strings and the string 'mortgage'". presently, SOIF provides for arbitrary degrees of data classification which is a strong solution for most applications, and generally an optimal solution for applications involving fairly regular data formats, eg, reports or forms. harvest provides for sites to cooperate or interoperate efficiently for applications such as these since no one site could ever have space to replicate the entire internet, or even a significant associative slice of it, in providing a monolithic internet database. 
basically the talent of harvest in linear interoperability, via SOIF, is providing the architecture for this recursively infinite association graph traversal in most forms of data, especially business data. > and retrieve the information eith by email or a formatted newsletter. > Can Harvest do this? certainly you could put an email or such interface on the system, but your users would probably be happier with something more responsive and flexible like a web interface. an interactive interface provides the opportunity for refining data collection, for discovering new sources of data, etc. -john From owner-robots Mon Nov 27 19:52:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26675; Mon, 27 Nov 95 19:52:36 -0800 Date: Mon, 27 Nov 1995 22:52:30 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199511280352.WAA24695@dolphin.automatrix.com> To: robots@webcrawler.com Subject: How frequently should I check /robots.txt? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm working on a specialized robot to identify Web sites with concert itineraries (by scoring the contents of the file against expected patterns). I will announce it here when I begin exercising it outside my local network. I'm a bit confused about how often I should update my local copy of a site's /robots.txt file. Clearly I shouldn't check it with each access, since that would double the number of accesses my robot would make to a site. I saw nothing in my server's access logs that would suggest that any of the robots that visit our site ever perform a HEAD request for /robots.txt (indicating they were checking for a Last-modified header). So how about it? How often should /robots.txt be checked? 
Thx, Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Mon Nov 27 20:31:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00165; Mon, 27 Nov 95 20:31:52 -0800 Date: Mon, 27 Nov 1995 23:27:08 -0600 (CST) From: gil cosson <gil@rusty.waterworks.com> To: robots@webcrawler.com Cc: robots@webcrawler.com Subject: Re: How frequently should I check /robots.txt? In-Reply-To: <199511280352.WAA24695@dolphin.automatrix.com> Message-Id: <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How about adding an entry to the robots.txt file that specifies how frequently the robots.txt file should be checked? gil. ========================================================================== "Everybody can be great because anybody can serve. You don't have to have a college degree to serve. You don't have to make your subject and verb agree to serve. You don't have to know the second theory of Thermo Dynamics and physics to serve. You only need a heart full of grace. A soul generated by love." Martin Luther King Jr. On Mon, 27 Nov 1995, Skip Montanaro wrote: > > I'm working on a specialized robot to identify Web sites with concert > itineraries (by scoring the contents of the file against expected patterns). > I will announce it here when I begin exercising it outside my local network. > > I'm a bit confused about how often I should update my local copy of a site's > /robots.txt file. Clearly I shouldn't check it with each access, since that > would double the number of accesses my robot would make to a site. 
> > I saw nothing in my server's access logs that would suggest that any of the > robots that visit our site ever perform a HEAD request for /robots.txt > (indicating they were checking for a Last-modified header). > > So how about it? How often should /robots.txt be checked? > > Thx, > > Skip Montanaro skip@calendar.com (518)372-5583 > Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com > Internet Conference Calendar: http://www.calendar.com/conferences/ > >>> ZLDF: http://www.netresponse.com/zldf <<< > From owner-robots Mon Nov 27 23:22:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03527; Mon, 27 Nov 95 23:22:57 -0800 Message-Id: <9511280722.AA03518@webcrawler.com> To: robots@webcrawler.com Subject: Re: How frequently should I check /robots.txt? In-Reply-To: Your message of "Mon, 27 Nov 1995 23:27:08 CST." <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com> Date: Mon, 27 Nov 1995 23:22:54 -0800 From: Martijn Koster <mak@surfski.webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com>, gil cosson writes: > How about adding an entry to the robots.txt file that specifies how > frequently the robots.txt file should be checked? Hmm.. and then how often do you check if the checking frequency has changed? :-) Seriously though, I don't think there'd be a lot of benefit; as an admin you tend not to know when you'll make the next change. From an HTTP point of view robots could be smart, and look at the Expires header. Deciding how often to check for the /robots.txt depends highly on how you run your robot: how many runs per week, how many documents when, etc. I'd say a week is a reasonable time. If your robot supports end-user submissions you could of course be clever about people submitting their /robots.txt URL; that would give them more influence.
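The caching policy suggested above (honor an Expires header when the server sends one, otherwise re-fetch after roughly a week) could be sketched as follows. This is only an illustration under stated assumptions: the names RobotsCacheEntry, expiry_time, needs_refresh and DEFAULT_TTL are hypothetical and do not come from any robot discussed on this list, and the one-week default is simply the figure mentioned in the message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumed default: re-check /robots.txt after about a week.
DEFAULT_TTL = timedelta(days=7)


@dataclass
class RobotsCacheEntry:
    """Hypothetical record of one cached /robots.txt fetch."""
    fetched_at: datetime                  # timezone-aware fetch time
    expires_header: Optional[str] = None  # raw Expires header value, if any


def expiry_time(entry: RobotsCacheEntry) -> datetime:
    """Return the moment after which the cached copy counts as stale."""
    if entry.expires_header:
        try:
            expires = parsedate_to_datetime(entry.expires_header)
            if expires.tzinfo is None:
                # Treat a date without a zone as UTC (an assumption).
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
        except (TypeError, ValueError):
            # Unparsable header: fall through to the default lifetime.
            pass
    return entry.fetched_at + DEFAULT_TTL


def needs_refresh(entry: RobotsCacheEntry, now: datetime) -> bool:
    """True when the robot should fetch /robots.txt again."""
    return now >= expiry_time(entry)
```

A robot would call needs_refresh() once per site before starting a run, rather than before every request, so the check adds at most one extra fetch per site per run.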
-- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Nov 29 18:16:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11717; Wed, 29 Nov 95 18:16:32 -0800 Message-Id: <9511300215.AA04718@grasshopper.ucsd.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Date: Wed, 29 Nov 95 18:15:27 -0800 To: robots@webcrawler.com Subject: McKinley Spider hit us hard Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A spider from magellan.mckinley.com hit us hard today and did a deep recursive search of our web tree. Not very friendly, but their spider did check /robots.txt which indicates that they may have successfully implemented the robot exclusion protocol. Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html here is their internic info if anyone else wants to complain to them: The McKinley Group (MCKINLEY-DOM) 85 Liberty Ship Way Suite 201 Sausalito, CA 94965 Domain Name: MCKINLEY.COM Administrative Contact, Technical Contact, Zone Contact: Cohen, Alexander J. (ASC2) xcohen@MCKINLEY.COM 415-331-1884 FAX Record last updated on 21-Sep-95. Record created on 14-Jul-94. 
Domain servers in listed order:

   NS1.NOC.NETCOM.NET           204.31.1.1
   NS2.NOC.NETCOM.NET           204.31.1.2

From owner-robots Thu Nov 30 10:45:43 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12365; Thu, 30 Nov 95 10:45:43 -0800
Date: Thu, 30 Nov 1995 13:43:58 -0500
From: alain@ai.iit.nrc.ca (Alain Desilets)
Message-Id: <9511301843.AA28288@ksl1000.iit.nrc.ca>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
X-Sun-Charset: US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Dear Marilyn,

Just thought I'd check out the status of your robot testbed. My ListSeeker software (http://ai.iit.nrc.ca/II_public/WebView/ListSeeker.html) is now ready for testing. So if your robot testbed is ready for public use, I am prepared to try it out.

Sincerely,

Alain Desilets
Institute for Information Technology
National Research Council of Canada
Building M-50
Montreal Road
Ottawa (Ont)
K1A 0R6

e-mail: alain@ai.iit.nrc.ca
Tel: (613) 990-2813
Fax: (613) 952-7151

From owner-robots Thu Nov 30 12:30:51 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18231; Thu, 30 Nov 95 12:30:51 -0800
Date: Thu, 30 Nov 1995 21:29:30 +0100 (MET)
From: Karoly Negyesi <chx@cs.elte.hu>
X-Sender: chx@turan
To: robots@webcrawler.com
Subject: Small robot needed
Message-Id: <Pine.SV4.3.91.951130212824.4490A@turan>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi!

I need a very small robot that downloads a given URL (most probably an HTML page) and everything directly referenced (HREFs, LINKs, SRCs).

Thanks,

   ___    ___     Charlie Negyesi chx@cs.elte.hu    ___    ___
 {~._.~} {~._.~}  (+361) 203-5962 (7pm-9pm)       {~._.~} {~._.~}
 _( Y )_  ( * )   Hungary, Budapest                ( * )  _( Y )_
(:_~*~_:) ()~*~() H-1462, P.o.box 503             ()~*~() (:_~*~_:)
 (_)-(_) (_)-(_)  May the Bear be with you!
(_)-(_) (_)-(_)

From owner-robots Thu Nov 30 13:15:32 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20570; Thu, 30 Nov 95 13:15:32 -0800
Date: Thu, 30 Nov 1995 16:15:21 -0500
From: Skip Montanaro <skip@automatrix.com>
Message-Id: <199511302115.QAA04958@dolphin.automatrix.com>
To: robots@webcrawler.com
Subject: New robot turned loose on an unsuspecting public... and a DNS question
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

No, it's not really another Godzilla movie. I started running the Musi-Cal Robot today. It has the following properties:

 1. Understands (and obeys!) the robots.txt protocol.
 2. Doesn't revisit the same server more than once every 10 minutes.
 3. Doesn't revisit the same URL more than once per month.
 4. Only groks HTTP URLs at the moment.
 5. Announces itself in requests as "Musi-Cal-Robot/0.1".
 6. Gives my email ("skip@calendar.com") in the From: field of the request.
 7. It's looking for music-related sites, so you may never see it.
 8. The HTML parser I'm using is rather slow, which helps avoid network congestion.
 9. You should only ever see it running from dolphin.automatrix.com, a machine connected via 28.8k modem - again, a fine network/server congestion avoidance tool.
10. It randomizes its list of outstanding URLs after every pass through the list to minimize beating up a single server.

If there's anything I've forgotten to do (like announce it somewhere on Usenet) or any parameter needs obvious tweaking, let me know.

I have been struggling with DNS resolution and was wondering if people could give me some feedback. Ideally, I want to make sure I treat all aliases for a server as the same server, so I was attempting to execute

    gethostbyaddr(gethostbyname('www.wherever.com'))

but that seemed terribly slow and tcpdump traces suggested that it would get stuck banging on the same server. Then I tried just the gethostbyname(), but that wasn't much better.
For now, I just accept what I have for a host name and map a couple places I know that do round-robin DNS back into the canonical name.

What do other robot writers do about name resolution? Feedback appreciated.

Thanks,

Skip Montanaro skip@calendar.com (518)372-5583
Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com
Internet Conference Calendar: http://www.calendar.com/conferences/
>>> ZLDF: http://www.netresponse.com/zldf <<<

From owner-robots Thu Nov 30 17:40:52 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09699; Thu, 30 Nov 95 17:40:52 -0800
Message-Id: <199512010140.RAA28005@fiji.verity.com>
X-Authentication-Warning: fiji.verity.com: Host localhost.verity.com didn't use HELO protocol
To: skip@calendar.com
Cc: robots@webcrawler.com
Subject: Re: New robot turned loose on an unsuspecting public... and a DNS question
In-Reply-To: Your message of "Thu, 30 Nov 1995 16:15:21 EST." <199511302115.QAA04958@dolphin.automatrix.com>
Date: Thu, 30 Nov 1995 17:40:32 -0800
From: Thomas Maslen <tmaslen@Verity.COM>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

> What do other robot writers do about name resolution?

In our case... cache the results of lookups so that we only do the gethostbyname("foo") once for any particular "foo". This still gives pretty evil behaviour on, say, a page of links to cool places where almost every link points to a different host, but the average behaviour is much better than not caching.

Also, if you're looking for a canonical representation for hosts so that you can test "is this host the same as that one?", I'd suggest that you _not_ try matching the hostnames: rather, do the gethostbyaddr() and then look for an intersection in the sets of IP addresses (but be prepared to rewrite the code next year to deal with IPv6 addresses!). In other words, the canonical representation for a host should be the set of IP addresses, not the hostname strings.
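The two suggestions in this reply (look each name up only once, and compare hosts by their sets of IP addresses rather than by hostname string) could be sketched roughly as follows. This is only an illustration in Python, not Verity's code: the names `resolve_addresses`, `same_host` and `Resolver` are hypothetical, the lookup uses the standard `socket.gethostbyname_ex` name-to-address call, and it handles IPv4 only.

```python
import socket
from functools import lru_cache
from typing import Callable, FrozenSet

# Hypothetical type alias: any function mapping a hostname to its address set.
Resolver = Callable[[str], FrozenSet[str]]


@lru_cache(maxsize=None)
def _lookup(hostname: str) -> FrozenSet[str]:
    """Resolve one lowercased name; lru_cache means one real lookup per name."""
    try:
        _canonical, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        # Failed lookups are treated as "no addresses" (and are cached too).
        return frozenset()
    return frozenset(addresses)


def resolve_addresses(hostname: str) -> FrozenSet[str]:
    """Return the cached set of IPv4 addresses for a hostname."""
    return _lookup(hostname.lower())


def same_host(a: str, b: str, resolve: Resolver = resolve_addresses) -> bool:
    """Two names count as the same host if their address sets intersect."""
    return bool(resolve(a) & resolve(b))
```

The `resolve` parameter is there so the comparison can be exercised with a canned table instead of live DNS. As the message itself warns, a page of links to many distinct hosts still costs one lookup per host; the cache only helps with repeats.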
Thomas

tmaslen@verity.com        My opinions, not Verity's

From owner-robots Fri Dec 1 08:24:24 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00727; Fri, 1 Dec 95 08:24:24 -0800
Date: Fri, 1 Dec 95 10:33:28 EST
From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler)
Message-Id: <9512011533.AA14431@pixel.cps.msu.edu>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi,

Sorry to say, we had a disk problem and lost the original data. In the meantime, we have ordered a new (9 gig) disk, and also uncovered some more bugs in htmlgobble, and are trying to get things back. The known bugs are fixed, but the word on the new disk is still "any day now".

You've been patient so far: sorry I didn't let you know the status earlier.

I'll try to keep you informed, and when we have stuff (even before I announce it to the list), I'll let you know.

Thanks for your patience,

Marilyn

From owner-robots Fri Dec 1 08:59:15 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03564; Fri, 1 Dec 95 08:59:15 -0800
Date: Fri, 1 Dec 1995 17:20:47 +0200 (EET)
From: Cristian Ionitoiu <cristi@cs.utt.ro>
X-Sender: cristi@tempus5
To: robots@webcrawler.com
Subject: inquiry about robots
Message-Id: <Pine.SUN.3.91.951201171701.5311A-100000@tempus5>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi to everybody,

I'm quite new on the list, and I'm interested in Internet navigating robots. I would like to know whether there is any robot that offers an API for the programmer, or any publicly available robot together with its sources. I would prefer a non-Perl implementation.

Thank you in advance for all your information!
--Cristian

==============================================================================
CRISTIAN IONITOIU - Computer Science Department, "Politehnica" University of
teaching            Timisoara.
assistant           Email: cristi@utt.ro, cristi@ns.utt.ro, cristi@cs.utt.ro
                    WWW: http://www.utt.ro/~cristi
                    Office: Bdul. Vasile Parvan No. 2, 1900 Timisoara, Romania
                    Private: O.P. 5, C.P. 641, 1900 Timisoara, Romania
                    Fax&Phone: (office): +40 56 192 049
______________________________________________________________________________
Science is what happens when preconception meets verification.
==============================================================================

From owner-robots Fri Dec 1 09:26:06 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06317; Fri, 1 Dec 95 09:26:06 -0800
Date: Fri, 1 Dec 1995 12:24:16 -0500
From: alain@ai.iit.nrc.ca (Alain Desilets)
Message-Id: <9512011724.AA00940@ksl1000.iit.nrc.ca>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
X-Sun-Charset: US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

> Hi,
>
> Sorry to say, we had a disk problem and lost the original data.

That's a bummer...

> In the
> meantime, we have ordered a new (9 gig) disk, and also uncovered some more
> bugs in htmlgobble, and are trying to get things back. The known bugs are
> fixed, but the word on the new disk is still "any day now".
>
> You've been patient so far: sorry I didn't let you know the status earlier.
>
> I'll try to keep you informed, and when we have stuff (even before I announce
> it to the list), I'll let you know.
>

Don't worry about me. We have some data here that I can use to test my approach on a small scale, and I am talking to some other people about getting about 1G of additional data. Your data would be a good addition to that (the more data the better).

Good luck with your work and let me know how it goes.
Alain

From owner-robots Fri Dec 1 09:41:33 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07351; Fri, 1 Dec 95 09:41:33 -0800
Date: Fri, 1 Dec 1995 09:40:26 -0800
Message-Id: <199512011740.JAA05988@ix13.ix.netcom.com>
From: wessman@ix.netcom.com (Gene Essman )
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

You wrote:
>
>
>Hi,
>
>Sorry to say, we had a disk problem and lost the original data. In the
(snip)

Sorry to seem so ignorant, but I have just been hanging around the Internet a short time. In that time, I have wondered about the whole "robot/spider" thing and have a couple of questions. Perhaps someone could take the time to help me out.

Are robots for sale or can one "hire" someone who has one to do some work, or how does that whole thing work.

Thanks,
Gene Essman

From owner-robots Fri Dec 1 10:28:36 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10173; Fri, 1 Dec 95 10:28:36 -0800
X-Sender: narnett@hawaii.verity.com
Message-Id: <v02130501ace4f418b338@[192.187.143.12]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 1 Dec 1995 10:28:27 -0800
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Looking for a spider
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 9:40 AM 12/1/95, Gene Essman wrote:

>Are robots for sale or can one "hire" someone who has one to do some
>work, or how does that whole thing work.

Verity offers a couple of variations of its Web robot, but they are designed specifically to build Verity search indexes, not as general-purpose robots. The only generally available robot-ish code that I know about is the Harvest Gatherer code. Its primary purpose is to index the server on which it is running, but it's a fairly small step to make it do the same over the wire.
I think there's a widespread reluctance to push robots hard in the commercial space, since marketing success would fairly quickly breed failure -- having lots of robots doing redundant work would be a huge inefficiency. Nick From owner-robots Fri Dec 1 17:22:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03387; Fri, 1 Dec 95 17:22:39 -0800 Message-Id: <m0tLgeS-0004gSC@rsoft.rsoft.bc.ca> Date: Sat, 2 Dec 1995 00:12:58 +0000 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: Re: Looking for a spider To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50a)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Not to make this a sales pitch but if you need a real specialized spider for commercial work then we can build one for you that interfaces with ObjectStore a Object Database and any other applications you might have around. Ted Sullivan ---------- From: robots To: robots Subject: Re: Looking for a spider Date: Friday, December 01, 1995 9:40AM You wrote: > > >Hi, > >Sorry to say, we had a disk problem and lost the original data. In the (snip) Sorry to seem so ignorant, but I have just been hanging around the Internet a short time. In that time, I have wondered about the whole "robot/spider" thing and have a couple of questions. Perhaps someone could take the time to help me out. Are robots for sale or can one "hire" someone who has one to do some work, or how does that whole thing work. 
Thanks, Gene Essman From owner-robots Fri Dec 1 19:54:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03896; Fri, 1 Dec 95 19:54:36 -0800 Date: Fri, 1 Dec 1995 20:52:46 -0700 Message-Id: <199512020352.UAA24347@web.azstarnet.com> X-Sender: drose@azstarnet.com X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: drose@AZStarNet.com Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ted: I very much need a specialized spider. Could you let me know something about your capabilities? Assume that I want to research *everything* on the web about, say, stamp collecting (not) on an historical and contemporary basis, how would your spider work? I look forward to hearing from you. -David M. Rose > >Not to make this a sales pitch but if you need a real specialized spider for >commercial work then we can build one for you that interfaces with >ObjectStore a Object Database and any other applications you might have >around. > >Ted Sullivan > ---------- From owner-robots Fri Dec 1 20:47:03 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04060; Fri, 1 Dec 95 20:47:03 -0800 Message-Id: <30BF86FE.183@mcc.tamu.edu> Date: Fri, 01 Dec 1995 22:51:42 +0000 From: Lance Ogletree <Lance.Ogletree@mcc.tamu.edu> X-Mailer: Mozilla 2.0b3 (Macintosh; I; PPC) Mime-Version: 1.0 To: robots@webcrawler.com Subject: MacPower Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Interested in Power Macintosh Computers? Stop by a site on the web. MacPower!!!!!!!! 
http://mccnet.tamu.edu/MacPower/MacPower.html From owner-robots Sat Dec 2 08:11:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05472; Sat, 2 Dec 95 08:11:51 -0800 Message-Id: <m0tLuX6-0004oqC@rsoft.rsoft.bc.ca> Date: Sat, 2 Dec 1995 05:09:58 +0000 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: Re: Looking for a spider To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50a)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Could you send me your e-mail address to tsullivan@snowymtn.com so we can have this discussion offline the robots mailing list. I am sure the other would appreciate it. Ted ---------- From: robots To: robots Subject: Re: Looking for a spider Date: Friday, December 01, 1995 7:52PM Ted: I very much need a specialized spider. Could you let me know something about your capabilities? Assume that I want to research *everything* on the web about, say, stamp collecting (not) on an historical and contemporary basis, how would your spider work? I look forward to hearing from you. -David M. Rose > >Not to make this a sales pitch but if you need a real specialized spider for >commercial work then we can build one for you that interfaces with >ObjectStore a Object Database and any other applications you might have >around. 
> >Ted Sullivan > ---------- From i.bromwich@nexor.co.uk Mon Dec 4 02:24:00 1995 Return-Path: <i.bromwich@nexor.co.uk> Received: from lancaster.nexor.co.uk by webcrawler.com (NX5.67f2/NX3.0M) id AA00398; Mon, 4 Dec 95 02:24:00 -0800 X400-Received: by /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 4 Dec 1995 10:23:23 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 4 Dec 1995 10:23:23 +0000 Date: Mon, 4 Dec 1995 10:23:23 +0000 X400-Originator: i.bromwich@nexor.co.uk X400-Recipients: non-disclosure:; X400-Mts-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:166150:951204102333] Content-Identifier: XT-MS Message Priority: Non-Urgent From: "i.bromwich" <i.bromwich@nexor.co.uk> Message-Id: <"-2131556092-16615-00001 951204102333Z*/I=i/S=bromwich/O=NEXOR/PRMD=NEXOR/ADMD= /C=GB/"@MHS> To: robots-archive <robots-archive@webcrawler.com> Reply-To: mak <mak@webcrawler.com> X-Mua-Version: XT-MUA 1.4 (dornier) of Tue Aug 22 03:03:53 BST 1995 // martijn, can't think of any other way to get these to you easily. Get in // touch if you need more help get stop From owner-robots Mon Dec 4 04:36:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00646; Mon, 4 Dec 95 04:36:17 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199512041236.OAA16470@krisse.www.fi> Subject: Re: MacPower To: robots@webcrawler.com Date: Mon, 4 Dec 1995 14:36:07 +0200 (EET) In-Reply-To: <30BF86FE.183@mcc.tamu.edu> from "Lance Ogletree" at Dec 1, 95 10:51:42 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 174 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Interested in Power Macintosh Computers? > Stop by a site on the web. > MacPower!!!!!!!! > http://mccnet.tamu.edu/MacPower/MacPower.html No, I am not very interested. 
From owner-robots Mon Dec 4 04:47:22 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00689; Mon, 4 Dec 95 04:47:22 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199512041247.OAA16694@krisse.www.fi> Subject: Re: MacPower (an apology, I am very sorry) To: robots@webcrawler.com Date: Mon, 4 Dec 1995 14:47:14 +0200 (EET) In-Reply-To: <199512041236.OAA16470@krisse.www.fi> from "Jaakko Hyvatti" at Dec 4, 95 02:36:07 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 148 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > http://mccnet.tamu.edu/MacPower/MacPower.html > > No, I am not very interested. I am very sorry this reply to the spam got into the list. From owner-robots Tue Dec 5 12:57:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19484; Tue, 5 Dec 95 12:57:01 -0800 From: Michael Van Biesbrouck <mlvanbie@undergrad.math.uwaterloo.ca> Message-Id: <199512052056.PAA24672@mobius07.math.uwaterloo.ca> Subject: Re: McKinley Spider hit us hard To: robots@webcrawler.com Date: Tue, 5 Dec 1995 15:56:33 -0500 (EST) In-Reply-To: <9511300215.AA04718@grasshopper.ucsd.edu> from "Christopher Penrose" at Nov 29, 95 06:15:27 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1110 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > A spider from magellan.mckinley.com hit us hard today and did a > deep recursive search of our web tree. Not very friendly, but their > spider did check /robots.txt which indicates that they may have > successfully implemented the robot exclusion protocol. 
>
>
> Christopher Penrose
> penrose@ucsd.edu
> http://www-crca.ucsd.edu/TajMahal/after.html
>
> here is their internic info if anyone else wants to complain to them:

The spider in question is Wobot/1.00; the correct person to bother with complaints is cedeno@mckinley.com.

They visited a site that I watch over on 21 Nov and did nothing after reading /robots.txt. The robots.txt is somewhat long, but not very restrictive. However, it seems to have gone ballistic today on another machine. As a result I will be complaining. In this case it came from radar.mckinley.com.

I suggest that other people check their logs and complain if necessary.

--
"You're obviously on drugs,           Michael Van Biesbrouck
 but not the right ones."             ACM East Central Winning Team
       -- bwross about mlvanbie       http://csclub.uwaterloo.ca/u/mlvanbie/

From owner-robots Tue Dec 5 22:02:14 1995
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25735; Tue, 5 Dec 95 22:02:14 -0800
Date: Tue, 5 Dec 1995 22:02:01 -0800
X-Sender: julian @best.com
Message-Id: <v01530501acea810e4f32@[206.86.2.106]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: julian@ugorilla.com (Julian Gorodsky)
Subject: Re: Returned mail: Service unavailableHELP HELP!
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

>The original message was received at Mon, 4 Dec 1995 20:35:54 -0800
>from julian.vip.best.com [206.86.2.106]
>
> ----- The following addresses had delivery problems -----
><Majordomo-Owner@webcrawler.com> (unrecoverable error)
>
> ----- Transcript of session follows -----
>... while talking to surfski.webcrawler.com.:
>>>> RCPT To:<Majordomo-Owner@webcrawler.com>
><<< 554 <Majordomo-Owner@webcrawler.com>... 550 User unknown
>554 <Majordomo-Owner@webcrawler.com>...
Service unavailable > > ----- Original message follows ----- > >Content-Type: message/rfc822 > >Return-Path: julian@ugorilla.com >Received: from [206.86.2.106] (julian.vip.best.com [206.86.2.106]) by >blob.best.net (8.6.12/8.6.5) with SMTP id UAA10780 for ><Majordomo-Owner@webcrawler.com>; Mon, 4 Dec 1995 20:35:54 -0800 >Date: Mon, 4 Dec 1995 20:35:54 -0800 >X-Sender: julian @best.com >Message-Id: <v01530500ace91a991809@[206.86.2.106]> >Mime-Version: 1.0 >Content-Type: text/plain; charset="us-ascii" >To: Majordomo-Owner@webcrawler.com >From: julian@ugorilla.com (Julian Gorodsky) >Subject: Re: Majordomo results > >>-- >> >>>>>> unsubscribe julian@best.com >>**** unsubscribe: 'julian@best.com' is not a member of list 'robots'. >>>>>> >>>>>> julian@ugorilla.com >>**** Command 'julian@ugorilla.com' not recognized. >>>>>> A Renaissance Project >>**** Command 'a' not recognized. >>>>>> >>>>>> >>**** Help for Majordomo: >> >>This is Brent Chapman's "Majordomo" mailing list manager, version 1.93. >> >>In the description below items contained in []'s are optional. When >>providing the item, do not include the []'s around it. >> >>It understands the following commands: >> >> subscribe [<list>] [<address>] >> Subscribe yourself (or <address> if specified) to the named <list>. >> >> unsubscribe [<list>] [<address>] >> Unsubscribe yourself (or <address> if specified) from the named >><list>. >> >> get [<list>] <filename> >> Get a file related to <list>. >> >> index [<list>] >> Return an index of files you can "get" for <list>. >> >> which [<address>] >> Find out which lists you (or <address> if specified) are on. >> >> who [<list>] >> Find out who is on the named <list>. >> >> info [<list>] >> Retrieve the general introductory information for the named <list>. >> >> lists >> Show the lists served by this Majordomo server. >> >> help >> Retrieve this message. >> >> end >> Stop processing commands (useful if your mailer adds a signature). 
>> >>Commands should be sent in the body of an email message to >>"Majordomo"or to "<list>-request". >> >>The <list> parameter is only optional if the message is sent to an address >>of the form "<list>-request". >> >> >>Commands in the "Subject:" line NOT processed. >> >>If you have any questions or problems, please contact >>"Majordomo-Owner". > >You have a subscriber named julianrz@best.com >Perhaps there's some confusion. > >julian@ugorilla.com >A Renaissance Project julian@ugorilla.com A Renaissance Project From owner-robots Tue Dec 5 22:02:50 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25781; Tue, 5 Dec 95 22:02:50 -0800 Date: Tue, 5 Dec 1995 22:02:39 -0800 X-Sender: julian @best.com Message-Id: <v01530500acea810c4ec1@[206.86.2.106]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: julian@ugorilla.com (Julian Gorodsky) Subject: Re: Returned mail: Service unavailableHELP AGAIN HELP AGAIN! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The original message was received at Mon, 4 Dec 1995 20:38:42 -0800 >from julian.vip.best.com [206.86.2.106] > > ----- The following addresses had delivery problems ----- ><Majordomo-Owner@webcrawler.com> (unrecoverable error) > > ----- Transcript of session follows ----- >... while talking to surfski.webcrawler.com.: >>>> RCPT To:<Majordomo-Owner@webcrawler.com> ><<< 554 <Majordomo-Owner@webcrawler.com>... 550 User unknown >554 <Majordomo-Owner@webcrawler.com>... 
Service unavailable > > ----- Original message follows ----- > >Content-Type: message/rfc822 > >Return-Path: julian@ugorilla.com >Received: from [206.86.2.106] (julian.vip.best.com [206.86.2.106]) by >blob.best.net (8.6.12/8.6.5) with SMTP id UAA12321 for ><Majordomo-Owner@webcrawler.com>; Mon, 4 Dec 1995 20:38:42 -0800 >Date: Mon, 4 Dec 1995 20:38:42 -0800 >X-Sender: julian @best.com >Message-Id: <v01530501ace91bcc6033@[206.86.2.106]> >Mime-Version: 1.0 >Content-Type: text/plain; charset="us-ascii" >To: Majordomo-Owner@webcrawler.com >From: julian@ugorilla.com (Julian Gorodsky) >Subject: Re: Welcome to robots > >>-- >> >>Welcome to the robots mailing list! >> >>If you ever want to remove yourself from this mailing list, >>send the following command in email to >>"robots-request": >> >> unsubscribe >> >>Or you can send mail to "Majordomo" with the following command >>in the body of your email message: >> >> unsubscribe robots Julian Rozentur <julianrz@best.com> >> >>Here's the general information for the list you've >>subscribed to, in case you don't already have it: >> >> >>This information is also available on the World-Wide Web in >>http://info.webcrawler.com/mailing-lists/robots/info.html >> >>CHARTER >> >>The robots@webcrawler.com mailing-list is intended as a technical >>forum for authors, maintainers and administrators of WWW robots. Its >>aim is to maximise the benefits WWW robots can offer while minimising >>drawbacks and duplication of effort. It is intended to address both >>development and operational aspects of WWW robots. >> >>This list is not intended for general discussion of WWW development >>efforts, or as a first line of support for users of robot facilities. >> >>Postings to this list are informal, and decisions and recommendations >>formulated here do not constitute any official standards. Postings to >>this list will be made available publicly through a mailing list >>archive. 
The administrator of this list nor his company accept any >>responsibility for the content of the postings. >> >>SUBSCRIPTION DETAILS >> >>To subscribe to this list, send a mail message to >>robots-request@webcrawler.com, with the word subscribe on the first >>line of the body. >> >>To unsubscribe to this list, send a mail message to >>robots-request@webcrawler.com, with the word unsubscribe on the first >>line of the body. >> >>Should this fail or should you otherwise need human assistance, send a >>message to owner-robots@webcrawler.com. >> >>To send message to all subscribers on the list itself, mail >>robots@webcrawler.com. >> >>THE ARCHIVE >> >>Messages to this list are archived. The preferred way of accessing the >>archived messages is using the Robots Mailing List Archive provided by >>Hypermail, on http://info.webcrawler.com/mailing-lists/robots/archive/ >> >>Behind the scenes this list is currently managed by Majordomo, an >>automated mailing list manager written in Perl. Majordomo also allows >>acces to archived messages; send mail to robots-request@webcrawler.com >>with the word help in the body to find out how. >> >> >>-- The Robots Mailing List Administrator <owner-robots@webcrawler.com> > >This is the original "warning" that a deluge of someone else's email would >arrive in my box > Not OK. > >julian@ugorilla.com >A Renaissance Project julian@ugorilla.com A Renaissance Project From owner-robots Tue Dec 5 22:43:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28361; Tue, 5 Dec 95 22:43:34 -0800 Message-Id: <199512060643.PAA01955@yamato.mtl.t.u-tokyo.ac.jp> To: robots@webcrawler.com Subject: Indexing two-byte text Date: Wed, 06 Dec 1995 15:43:23 +0900 From: Harry Munir Behrens <behrens@mtl.t.u-tokyo.ac.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello there, here at the Univ. 
of Tokyo we are currently installing Harvest and were wondering if anybody has experience with the problems encountered when indexing Japanese text. (no word boundaries, two-byte code etc.) I would be very grateful for any help pointing me to an international version of agrep/glimpse or something similar. Cheers, Harry Behrens PhD. candidate Dept. of Electrical Engineering Univ. of Tokyo behrens@mtl.t.u-tokyo.ac.jp From owner-robots Tue Dec 5 23:30:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01387; Tue, 5 Dec 95 23:30:12 -0800 Message-Id: <199512060730.CAA17199@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Indexing two-byte text In-Reply-To: Your message of "Wed, 06 Dec 1995 15:43:23 +0900." <199512060643.PAA01955@yamato.mtl.t.u-tokyo.ac.jp> Date: Wed, 06 Dec 1995 02:30:07 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com maybe people in unicode would have one (agrep), like maybe the folks at http://plan9.att.com/ > here at the Univ. of Tokyo we are currently installing Harvest and were > wondering if anybody has experience with the problems encountered > when indexing Japanese text. (no word boundaries, two-byte code etc.) > I would be very grateful for any help pointing me to an international > version of agrep/glimpse or something similar. From owner-robots Tue Dec 5 23:47:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02152; Tue, 5 Dec 95 23:47:15 -0800 Date: Wed, 6 Dec 95 16:46:55 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512060746.AA15834@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > here at the Univ. of Tokyo we are currently installing Harvest and were > wondering if anybody has experience with the problems encountered > when indexing Japanese text. 
(no word boundaries, two-byte code etc.) > I would be very grateful for any help pointing me to an international > version of agrep/glimpse or something similar. > We are doing a multi-lingual navigation project (called Ingrid) that involves indexing Japanese text. We use JUMAN to extract japanese text (because it is public domain---it actually doesn't do such a good job), and some home grown perl stuff to filter out garbage, weight terms, and do stemming. But, for searching, we are for now doing exact string matching only. I suggest you ask this question on the comp.infosystems.harvest and also on the winter (web internationalization) mailing list at winter@dorado.crpht.lu. (please see http://dorado.crpht.lu:80/~carrasco/winter/ for the winter web page). I think there may be some mule tools for international grep like things, but I'm not absolutely sure about it... PF From owner-robots Wed Dec 6 18:19:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04473; Wed, 6 Dec 95 18:19:17 -0800 Message-Id: <v02130503acebfafcb272@[202.243.51.210]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 7 Dec 1995 11:18:57 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 4:46 PM 12/6/95, Paul Francis wrote: >We are doing a multi-lingual navigation project >(called Ingrid) that involves indexing Japanese >text. We use JUMAN to extract japanese text >(because it is public domain---it actually doesn't >do such a good job), and some home grown perl >stuff to filter out garbage, weight terms, and >do stemming. Is there publicly available code to handle stemming for Japanese, or is there a description of the algorithm involved anywhere (in English or in Japanese)? And what sort of garbage remains after using JUMAN? 
--Mark From owner-robots Wed Dec 6 18:48:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06782; Wed, 6 Dec 95 18:48:42 -0800 Date: Thu, 7 Dec 95 11:48:24 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512070248.AA19999@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is there publicly available code to handle stemming for Japanese, or is > there a description of the algorithm involved anywhere (in English or in > Japanese)? Our Japanese "publisher" code will be made publicly available after 1) it is in decent shape, and 2) we get approval from management to release it (don't worry, we *WILL* get approval, one way or another :-). As for stemming. After making a weak attempt at finding out what other people are doing, we couldn't find anything about Japanese stemming. I think this may be because, since a dictionary is necessary simply to parse out the individual words, algorithmic stemming isn't really necessary. The stems are already in the dictionary. I wanted to minimize dependence on a dictionary, though, so we put our heads together and decided that effective stemming for Japanese simply requires removing any kana that appears after a kanji in a single "term". In other words, the kanji is the stem, in all cases. If the term has no kanji, then we don't stem at all. Though surely this simple algorithm must break for some cases, in our limited experience so far, we haven't found any problems. > > And what sort of garbage remains after using JUMAN? > JUMAN doesn't remove any text per se, just tries to separate out the individual terms. So, in general, text has all kinds of junk in it that isn't a valid term, including numbers, various symbols such as stars, circles, X's, etc. So, we try to filter as much of that out as we can without removing any valid stuff. 
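As a rough illustration of the two steps described above (the kana-stripping stem rule and the junk filter), here is a minimal Python sketch. It is not the Ingrid code. The function names are hypothetical, the Unicode ranges are a simplification, and the reading of the rule (once a kanji has been seen in a term, every later kana is dropped) is an assumption about what the post means.

```python
def is_kanji(ch):
    # CJK Unified Ideographs block, plus the iteration mark U+3005.
    return "\u4e00" <= ch <= "\u9fff" or ch == "\u3005"


def is_kana(ch):
    # Hiragana and katakana blocks.
    return "\u3040" <= ch <= "\u30ff"


def stem_term(term):
    """Drop kana that follow a kanji; leave kanji-free terms untouched."""
    if not any(is_kanji(ch) for ch in term):
        return term
    kept = []
    seen_kanji = False
    for ch in term:
        if is_kanji(ch):
            seen_kanji = True
        if seen_kanji and is_kana(ch):
            continue  # kana after a kanji is treated as inflection
        kept.append(ch)
    return "".join(kept)


def keep_term(term):
    """Assumed junk filter: keep a term only if it has at least one letter."""
    return any(ch.isalpha() for ch in term)


if __name__ == "__main__":
    for term in ["食べる", "読み込む", "インターネット", "★", "1995"]:
        print(term, stem_term(term), keep_term(term))
```

With this reading, a verb such as 食べる reduces to 食, while an all-katakana loanword is left alone, which matches the "no kanji, no stemming" case in the post.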
As for JUMAN's term isolation ability, it suffers from a small dictionary. For example "intaanetto" (in romaji, "internet" in English) is broken into "intaa" and "netto", because JUMAN doesn't have "intaanetto" in its dictionary. I believe we'll be able to fix most of these by doing simple phrase detection. That is, if we see that "intaa" is always or very often followed by "netto", we can assume that they constitute a single phrase (or, in the no-white-space case, a single term). We will implement phrase detection next, and expect to have it by late January. PF ps. By the way, our Japanese publisher will be a single component of a multi-lingual publisher that will have language detection built in. We are doing Japanese and English, but expect to add others as they are done. pps. I really don't think this thread is so interesting to the robot list people. Maybe we should take it off-line. From owner-robots Wed Dec 6 23:29:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22374; Wed, 6 Dec 95 23:29:02 -0800 Message-Id: <199512070730.JAA08451@dns2.netvision.net.il> X-Sender: smadja@dns2.netvision.net.il X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 07 Dec 1995 09:27:25 -0500 To: robots@webcrawler.com From: Frank Smadja <smadja@netvision.net.il> Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am interested in this thread. Please keep it online or keep me posted. Thanks At 11:48 AM 12/7/95 JST, you wrote: >> >> Is there publicly available code to handle stemming for Japanese, or is >> there a description of the algorithm involved anywhere (in English or in >> Japanese)? > >Our Japanese "publisher" code will be made publicly >available after 1) it is in decent shape, and 2) we >get approval from management to release it (don't >worry, we *WILL* get approval, one way or another :-). > >As for stemming. 
After making a weak attempt at finding >out what other people are doing, we couldn't find >anything about Japanese stemming. I think this may be >because, since a dictionary is necessary simply to >parse out the individual words, algorithmic stemming >isn't really necessary. The stems are already in the >dictionary. > >I wanted to minimize dependence on a dictionary, though, >so we put our heads together and decided that effective >stemming for Japanese simply requires removing any kana >that appears after a kanji in a single "term". In other >words, the kanji is the stem, in all cases. If the term >has no kanji, then we don't stem at all. > >Though surely this simple algorithm must break for some >cases, in our limited experience so far, we haven't found >any problems. > >> >> And what sort of garbage remains after using JUMAN? >> > >JUMAN doesn't remove any text per se, just tries to separate >out the individual terms. So, in general, text has all >kinds of junk in it that isn't a valid term, including >numbers, various symbols such as stars, circles, X's, etc. >So, we try to filter as much of that out as we can without >removing any valid stuff. > >As for JUMAN's term isolation ability, it suffers from a >small dictionary. For example "intaanetto" (in romaji, >"internet" in English) is broken into "intaa" and "netto", >because JUMAN doesn't have "intaanetto" in its dictionary. >I believe we'll be able to fix most of these by doing >simple phrase detection. That is, if we see that "intaa" >is always or very often followed by "netto", we can assume >that they constitute a single phrase (or, in the no-white-space >case, a single term). We will implement phrase detection >next, and expect to have it by late January. > >PF > >ps. By the way, our Japanese publisher will be a single >component of a multi-lingual publisher that will have >language detection built in. We are doing Japanese and >English, but expect to add others as they are done. > >pps. 
I really don't think this thread is so interesting >to the robot list people. Maybe we should take it off-line. > > From owner-robots Wed Dec 6 23:42:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23372; Wed, 6 Dec 95 23:42:25 -0800 Date: Thu, 7 Dec 95 16:42:14 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512070742.AA21981@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I am interested in this thread. Please keep it online or keep me posted. > I have heard that there are only 4 numbers in computer science...0, 1, 2, and many. Thus, it seems that many people are interested in this thread.... :-) I'm more than happy to keep it online. PF From owner-robots Thu Dec 7 05:15:16 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08376; Thu, 7 Dec 95 05:15:16 -0800 Message-Id: <v02130503acec96754345@[202.243.51.214]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 7 Dec 1995 22:15:48 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >> I am interested in this thread. Please keep it online or keep me posted. > >I have heard that there are only 4 numbers in >computer science...0, 1, 2, and many. > >Thus, it seems that many people are interested >in this thread.... :-) > >I'm more than happy to keep it online. > >PF Susumu Shimizu has started a Japanese language robots mailing list if anyone is interested: w3-search@rodem.slab.ntt.jp You can contact him at shimizu@rodem.slab.ntt.jp to join. The charter members are those of us who attended his BOF at the recent Japan WWW Conference '95 in Kobe.
--Mark From owner-robots Thu Dec 7 07:30:26 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16064; Thu, 7 Dec 95 07:30:26 -0800 Message-Id: <199512071526.AAA08232@luxion.mtl.t.u-tokyo.ac.jp> To: robots@webcrawler.com Subject: Indexing two-byte text Date: Fri, 08 Dec 1995 00:26:27 +0900 From: Harry Munir Behrens <behrens@mtl.t.u-tokyo.ac.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi guys, terrific echo, thanks to all that were interested and helpful. I have asked around some more in the university circus and we have arrived at the following project plan: We are putting in place a three-phase system based on JUMAN (for now) and an existing dictionary based rule-based system. In the first phase the system scans the text looking for two- and four-kanji components that the dictionary knows. These are singled out as "sure hits" and are stemmed where appropriate. In the second phase we run JUMAN over the resulting text. The third phase is going to be very similar to the first, but will be only for verification purposes; meaning that if JUMAN generates terms the dictionary doesn't know about, error messages are output. The fourth stage is manual editing of these error messages :-( If there's anybody out there who is interested in more detailed info please get in touch on : behrens@mtl.t.u-tokyo.ac.jp I'm happy for any comments, suggestions etc. Harry Behrens PhD. candidate Dept. of Electrical Engineering Univ.
of Tokyo behrens@mtl.t.u-tokyo.ac.jp From owner-robots Thu Dec 7 12:44:47 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03647; Thu, 7 Dec 95 12:44:47 -0800 Message-Id: <v02130504aced024e3b85@[202.243.51.208]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 8 Dec 1995 05:45:10 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 12:26 AM 12/8/95, Harry Munir Behrens wrote: >We are putting in place a three-phase system based on JUMAN >(for now) and an existing dictionary based rule-based system. Is the "existing dictionary-based rule system" different from juman? Is juman not a dictionary-based rule system? --Mark From owner-robots Thu Dec 7 12:52:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03988; Thu, 7 Dec 95 12:52:19 -0800 Date: Thu, 7 Dec 1995 14:52:09 -0500 (EST) From: Randall Hill <rlh@conan.ids.net> To: robots@webcrawler.com Subject: Either a spider or a hacker? ww2.allcon.com Message-Id: <Pine.SUN.3.90.951207143241.24599B-100000@conan.ids.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi all, I'm setting up a new site and am getting persistent requests from ww2.allcon.com for a single file, home.shtml, that is under development and is not linked to anything. the default for my server is index.html which has NOT been requested. 
Any one seen them before TIA, -randy hill From owner-robots Thu Dec 7 23:51:21 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09790; Thu, 7 Dec 95 23:51:21 -0800 Message-Id: <v02130509aced92cda9fe@[202.243.51.212]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 8 Dec 1995 16:51:06 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com francis@cactus.slab.ntt.jp (Paul Francis) wrote: >Our Japanese "publisher" code will be made publicly >available after 1) it is in decent shape, and 2) we >get approval from management to release it (don't >worry, we *WILL* get approval, one way or another :-). You may get approval, but I assume that it couldn't be freely used for commercial purposes? >As for stemming. After making a weak attempt at finding >out what other people are doing, we couldn't find >anything about Japanese stemming. I think this may be >because, since a dictionary is necessary simply to >parse out the individual words, algorithmic stemming >isn't really necessary. The stems are already in the >dictionary. >I wanted to minimize dependence on a dictionary, though, >so we put our heads together and decided that effective >stemming for Japanese simply requires removing any kana >that appears after a kanji in a single "term". In other >words, the kanji is the stem, in all cases. If the term >has no kanji, then we don't stem at all. > >Though surely this simple algorithm must break for some >cases, in our limited experience so far, we haven't found >any problems. I don't think perfection is necessary here anyway to produce a useful system. But couldn't you just swap out the dictionary for a better dictionary? I just got a copy of juman, though, and although I just glanced at the files, it seemed like the dictionary was broken up by parts of speech. 
But most new coinages in a language tend to be nouns, I would think. This could be a business opportunity for someone--just like software companies in the U.S. buy their spell checkers from specialized companies, someone could develop and market a morphological root dictionary for Japanese. >As for JUMAN's term isolation ability, it suffers from a >small dictionary. For example "intaanetto" (in romaji, >"internet" in English) is broken into "intaa" and "netto", >because JUMAN doesn't have "intaanetto" in its dictionary. >I believe we'll be able to fix most of these by doing >simple phrase detection. That is, if we see that "intaa" >is always or very often followed by "netto", we can assume >that they constitute a single phrase (or, in the no-white-space >case, a single term). We will implement phrase detection >next, and expect to have it by late January. Ha! A programmer's solution. It seems like just upping the dictionary is more straightforward. ;-) >ps. By the way, our Japanese publisher will be a single >component of a multi-lingual publisher that will have >language detection built in. We are doing Japanese and >English, but expect to add others as they are done. I'm not sure what you mean by a "publisher"--I'm not sure what this does. Is this different from Ingrid? --Mark From owner-robots Fri Dec 8 00:25:22 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11463; Fri, 8 Dec 95 00:25:22 -0800 Date: Fri, 8 Dec 95 17:25:12 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512080825.AA29356@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > You may get approval, but I assume that it couldn't be freely used for > commercial purposes? You are right. I suppose you could license it, but I hardly think it would be worth it. A good programmer could throw it together in a week....
> > I don't think perfection is necessary here anyway to produce a useful > system. But couldn't you just swap out the dictionary for a better > dictionary? I just got a copy of juman, though, and although I just glanced One major problem is that all the better dictionaries we know are commercial, so it broke our requirement for freely usable code. Second, I think using a dictionary is a never-ending battle. Each specialization has its own terms and require their own dictionary. Further, language evolves fast, especially in fast-moving fields. I don't want the headache of always trying to maintain the dictionary. > >case, a single term). We will implement phrase detection > >next, and expect to have it by late January. > > Ha! A programmer's solution. It seems like just upping the dictionary is > more straightforward. ;-) You're a manager, eh? :-) But, we need phrase detection in any event. So, I hope it handles the term isolation part as well. > > I'm not sure what you mean by a "publisher"--I'm not sure what this does. > Is this different from Ingrid? > "Publisher" is the (rather poor) term we use for the component of Ingrid that takes a resource, automatically pulls out key terms, generates some other info about the resource (size, type, title, etc.) and gives it to the component of Ingrid that inserts it into the navigation topology.
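The phrase detection idea discussed in this thread (treat "intaa" + "netto" as one term when the first is always or very often followed by the second) can be sketched roughly as below. This is only an illustration in Python under stated assumptions, not the planned Ingrid implementation: the function names, the `min_count` and `min_ratio` thresholds, and the input format (one list of already-segmented terms per document) are all hypothetical.

```python
from collections import Counter


def find_phrases(docs, min_count=2, min_ratio=0.8):
    """Return the set of (a, b) pairs where a is followed by b often enough."""
    occurrences = Counter()  # how often each term appears at all
    pairs = Counter()        # how often term a is directly followed by term b
    for terms in docs:
        occurrences.update(terms)
        pairs.update(zip(terms, terms[1:]))
    return {
        (a, b)
        for (a, b), n in pairs.items()
        if n >= min_count and n / occurrences[a] >= min_ratio
    }


def merge_phrases(terms, phrases):
    """Join adjacent terms that were detected as a phrase."""
    merged = []
    i = 0
    while i < len(terms):
        if i + 1 < len(terms) and (terms[i], terms[i + 1]) in phrases:
            merged.append(terms[i] + terms[i + 1])
            i += 2
        else:
            merged.append(terms[i])
            i += 1
    return merged


if __name__ == "__main__":
    docs = [
        ["intaa", "netto", "no", "kensaku"],
        ["intaa", "netto", "robotto"],
        ["netto", "kensaku"],
    ]
    phrases = find_phrases(docs)
    print(merge_phrases(docs[0], phrases))
```

The ratio is taken against all occurrences of the first term, so a term that also appears alone many times does not get glued to its occasional neighbour.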
PF From owner-robots Fri Dec 8 02:19:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16967; Fri, 8 Dec 95 02:19:34 -0800 Message-Id: <v02130517acedc10886ff@[202.243.51.212]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 8 Dec 1995 19:19:15 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 5:25 PM 12/8/95, Paul Francis wrote: >One major problem is that all the better dictionaries we know >are commercial, so it broke our requirement for freely usable >code. Second, I think using a dictionary is a never-ending >battle. Each specialization has its own terms and require their >own dictionary. Further, language evolves fast, especially in >fast-moving fields. I don't want the headache of always trying >to maintain the dictionary. But by that time you'll be off on another project, and it'll be someone else's headache. ;-) >> I'm not sure what you mean by a "publisher"--I'm not sure what this does. >> Is this different from Ingrid? > >"Publisher" is the (rather poor) term we use for the component >of Ingrid that takes a resource, automatically pulls out key >terms, generates some other info about the resource (size, type, >title, etc.) and gives it to the component of Ingrid that inserts >it into the navigation topology. Is this navigation topology the part that you intend to patent? --Mark From owner-robots Fri Dec 8 07:08:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02625; Fri, 8 Dec 95 07:08:42 -0800 Date: Fri, 8 Dec 95 18:16:44 EST From: smadja@netvision.net.il Subject: RE: Indexing two-byte text To: robots@webcrawler.com X-Mailer: Chameleon ARM_55, TCP/IP for Windows, NetManage Inc. 
Message-Id: <Chameleon.951208181716.smadja@Haifa.netvision.net.il> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How can we get JUMAN ? Is it freeware, shareware, commercial? Thanks ------------------------------------- Name: Frank Smadja E-mail: smadja@netvision.net.il Date: 12/08/95 Time: 18:16:44 ------------------------------------- From owner-robots Fri Dec 8 20:16:18 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20477; Fri, 8 Dec 95 20:16:18 -0800 Message-Id: <v02130505aceebd0aeb88@[202.243.51.214]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 9 Dec 1995 13:16:43 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: RE: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >How can we get JUMAN ? >Is it freeware, shareware, commercial? > >Thanks > >------------------------------------- >Name: Frank Smadja >E-mail: smadja@netvision.net.il >Date: 12/08/95 >Time: 18:16:44 >------------------------------------- You can find software like this by using Archie, and picking a Japanese Archie server. Look for the most recently uploaded version. Juman, for instance, is on the Sony Computer Science Labs FTP server, among others. I think it's freeware or public domain, and it's written by the Nara Institute of Something-or-Other. 
--Mark From owner-robots Sun Dec 10 17:33:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17521; Sun, 10 Dec 95 17:33:15 -0800 Message-Id: <199512110132.KAA25906@azalea.kawasaki.flab.fujitsu.co.jp> To: robots@webcrawler.com Subject: RE: Indexing two-byte text In-Reply-To: Your message of "Sat, 9 Dec 1995 13:16:43 +0900" References: <v02130505aceebd0aeb88@[202.243.51.214]> X-Mailer: Mew beta version 0.91 on Emacs 19.28.1, Mule 2.2 Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Date: Mon, 11 Dec 1995 10:30:03 +0900 From: Noboru Iwayama <iwayama@flab.fujitsu.co.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark> I think it's freeware or public domain, and it's written by the Nara Mark> Institute of Something-or-Other. You can get JUMAN from ftp://pr.aist-nara.ac.jp/pub/nlp/tools/juman/ Noboru I From owner-robots Tue Dec 12 20:28:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20743; Tue, 12 Dec 95 20:28:25 -0800 From: ecarp@tssun5.dsccc.com Date: Tue, 12 Dec 1995 22:25:31 -0600 Message-Id: <9512130425.AA27447@tssun5.> To: robots@webcrawler.com Subject: Freely available robot code in C available? X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com DSC Communications is a multinational company located in Plano, TX with offices all over the world. We have lots of technically-savvy people in the company, but not a lot of information on what different divisions are doing within the company, especially in regards to web activities. Since the division that I work for is Information Services, we feel that we would like to get a handle on who is running web servers in the company, what they have on them, and what they are being used for. The idea is to eliminate duplication of effort (two or more departments put up servers, each with similar information), and provide consistent information to our internal departments. 
I myself have been running a server (both internally and externally) for over a year, and have many years of CS experience, so I feel that the task of collecting information on who is doing what wouldn't be an overwhelming task. It is felt that the best way of collecting the information needed would be to either write some sort of web collection program from scratch or obtain a freely-available one from the net and modify it for our needs. I have read the proposed FAQ and all of the etiquette documents, and the plan of attack is to write or obtain a robot that would scan HTML text only, signaling the server that we can handle only text (avoiding the overhead of having to download images only to discard them), then build an Oracle database composed of URLs and text which could be searchable via an SQL query. Comments or sample source code on doing such a task, or pointers to freely-available code, would be greatly appreciated. If no such code is available, pointers on writing such a beast would also be appreciated. One more note: if I haven't made it clear already, the robot would, under no circumstances, be allowed to search outside the DSC domain, and we have no direct access to the outside world except through our firewall (which will only filter selected packets from selected sites, and the internal web server isn't on the list). This is intended to be an 'internal use only' project, and so would not be used to generate revenue, nor would it be allowed to roam the net at large. The other restriction on the server is that it must be written in C. ANSI C is not a requirement. Any help or comments would be greatly appreciated. Thanks in advance... -- Ed Carp, Senior Operations Analyst, DSC Communications Please note that I do not speak for DSC Communications, nor are any statements made herein meant to be taken as a position, official or otherwise, of DSC Communications.
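The plan outlined in this post (fetch HTML only, tell the server so, never leave the internal domain) might look roughly like the sketch below. It is written in Python for brevity rather than the C the post asks for, and it is only an outline: `ALLOWED_SUFFIX`, the function names and the page limit are hypothetical, robots.txt handling is left out, and storing the results in a database is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen

ALLOWED_SUFFIX = "intranet.example"  # hypothetical internal domain


def in_scope(url, suffix=ALLOWED_SUFFIX):
    """True only for http(s) URLs whose host is inside the allowed domain."""
    parts = urlparse(url)
    host = parts.hostname or ""
    return parts.scheme in ("http", "https") and (
        host == suffix or host.endswith("." + suffix)
    )


class LinkParser(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute = urljoin(self.base, value)
                    self.links.append(urldefrag(absolute).url)


def extract_links(base, html):
    parser = LinkParser(base)
    parser.feed(html)
    return parser.links


def fetch_html(url):
    """Ask for text/html only; return None for errors or other content types."""
    request = Request(url, headers={"Accept": "text/html"})
    try:
        with urlopen(request, timeout=10) as response:
            if response.headers.get_content_type() != "text/html":
                return None
            charset = response.headers.get_content_charset() or "latin-1"
            return response.read().decode(charset, errors="replace")
    except OSError:
        return None


def crawl(start, limit=100):
    """Breadth-first walk that never queues a URL outside the allowed domain."""
    seen, queue, pages = {start}, [start], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        html = fetch_html(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if in_scope(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `Accept` header only states a preference, so the sketch also checks the returned content type before reading the body. The `pages` mapping of URL to text is the part that would be loaded into whatever database is used for searching.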
From owner-robots Tue Dec 12 21:52:55 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25270; Tue, 12 Dec 95 21:52:55 -0800 Message-Id: <9512130553.AA03554@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Tue, 12 Dec 95 23:54:40 -0600 To: robots@webcrawler.com Subject: Freely available robot code in C available? References: <9512130425.AA27447@tssun5.> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com If you get any good C code, would you please send it along to me? I am a CS student and am trying to write an indexing robot. I have already written a cheesy performance robot, which you can find on Yahoo. It is called bomb. It is very simple and plain jane. From owner-robots Tue Dec 12 22:21:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26942; Tue, 12 Dec 95 22:21:57 -0800 Message-Id: <v02130500acf420454c14@[202.243.51.222]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 13 Dec 1995 15:21:33 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Freely available robot code in C available? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Since the division that I work for is Information Services, we feel that we >would like to get a handle on who is running web servers in the company, >what they have on them, and what they are being used for. The idea is to >eliminate duplication of effort (two or more departments put up servers, each >with similar information), and provide consistent information to our internal >departments. Ed: Why don't you just buy a turnkey package from Open Text, Architext, or one of the other companies selling this sort of thing rather than make it from scratch?
--Mark From owner-robots Wed Dec 13 05:45:49 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11573; Wed, 13 Dec 95 05:45:49 -0800 Date: Wed, 13 Dec 95 08:42:23 EST From: "Jim Meritt" <jmeritt@smtpinet.aspensys.com> Message-Id: <9511138188.AA818873047@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Harvest question Content-Length: 563 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com First, is someone aware of a Harvest list? Next, my problem. I've gotten Harvest-1.4 patch level 1 onto a Sun Sparcstation 20 running Solaris 2.3. Watching the logs during gathering shows that it appears to be Gathering, but when I try the broker, I don't get errors on the broker screen, just "no hits" and in the broker.out log I get "GL_do_query_inline: connect: Connection refused". What is it trying to connect to, and does anyone have a suggestion on how to get this working? Jim Meritt From owner-robots Wed Dec 13 07:47:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14957; Wed, 13 Dec 95 07:47:09 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130501acf4a5481384@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 13 Dec 1995 07:48:15 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Freely available robot code in C available? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:25 PM 12/12/95, ecarp@tssun5.dsccc.com wrote: >... then build an Oracle database >composed of URLs and text which could be searchable via an SQL query. Aside from the question of why you want to build your own, rather than buying an off-the-shelf solution (we have one, too) -- why Oracle? A text search engine will give much better performance and have many more text-oriented features than Oracle or another RDBMS.
Search engines are a kind of database, of course, but one that is oriented toward text, rather than fielded data (which some of them also support). Nick From owner-robots Wed Dec 13 13:18:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01989; Wed, 13 Dec 95 13:18:37 -0800 Message-Id: <9512132114.AA21778@tssun5.> Comments: Authenticated sender is <ecarp@tssun5.dsccc.com> From: "Edwin Carp" <ecarp@tssun5.dsccc.com> Organization: DSC Communications To: narnett@Verity.COM (Nick Arnett), robots@webcrawler.com Date: Wed, 13 Dec 1995 15:15:17 +0000 Subject: Re: Freely available robot code in C available? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Date: Wed, 13 Dec 1995 07:48:15 -0800 > To: robots@webcrawler.com > From: narnett@Verity.COM (Nick Arnett) > Subject: Re: Freely available robot code in C available? > Reply-to: robots@webcrawler.com > At 10:25 PM 12/12/95, ecarp@tssun5.dsccc.com wrote: > >... then build an Oracle database > >composed of URLs and text which could be searchable via an SQL > >query. > > Aside from the question of why you want to build your own, rather > than buying an off-the-shelf solution (we have one, too) -- why > Oracle? A text search engine will give much better performance and > have many more text-oriented features than Oracle or another RDBMS. > Search engines are a kind of database, of course, but one that is > oriented toward text, rather than fielded data (which some of them > also support). The problem with an off-the-shelf solution is that most of them are not flexible enough for our needs. Also, we are tied to a product that does not allow us to make any changes unless we go back to the vendor. Customizations are likely to be expensive, and this project is being done on a literal shoestring, using existing hardware and home-grown software.
Oracle, because that's what we have in-house, and we have lots and lots of reporting and search tools for it. From owner-robots Wed Dec 13 17:06:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16103; Wed, 13 Dec 95 17:06:12 -0800 Message-Id: <v02130500acf5287dd3c9@[202.243.51.222]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 14 Dec 1995 10:05:47 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Harvest question Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > First, is someone aware of a Harvest list? > > Next, my problem. I've gotten Harvest-1.4 patch level 1 onto a Sun > Sparcstation 20 running Solaris 2.3. Watching the logs during > gathering shows that it appears to be Gathering, but when I try the > broker, I don't get errors on the broker screen, just "no hits" and in > the broker.out log I get "GL_do_query_inline: connect: Connection > refused". What is it trying to connect to, and does anyone have a > suggestion on how to get this working?
> > Jim Meritt There's a full-blown newsgroup, comp.infosystems.harvest --Mark From owner-robots Thu Dec 14 02:40:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14819; Thu, 14 Dec 95 02:40:19 -0800 Date: Thu, 14 Dec 1995 10:34:13 GMT From: cs0sst@isis.sunderland.ac.uk (Simon.Stobart) Message-Id: <9512141034.AA19413@osiris.sund.ac.uk> To: robots@webcrawler.com Subject: Announcement and Help Requested X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com New Robot Announcement ~~~~~~~~~~~~~~~~~~~~~~ Name: IncyWincy Home: University of Sunderland, UK Implementation Language: C++ Supports Robot Exclusion standard: Yes Purpose: Various research projects Status: This robot has not yet been released outside of Sunderland Authors: Simon Stobart, Reg Arthington Help Requested ~~~~~~~~~~~~~~ The User-Agent, From and Referer HTTP fields are not currently set to anything. Obviously, I would like them to contain useful information. So, how do you send this information to the web server? The values which I wish to set these fields to are: User-Agent: IncyWincy V?.? From: simon.stobart@sunderland.ac.uk Many Thanks |------------------------------------+-------------------------------------| | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | | Lecturer in Computing, | Voice: (+44) 091 515 2783 | | School of Computing | Fax: (+44) 091 515 2781 | | & Information Systems, + ------------------------------------| | University of Sunderland, SR1 3SD, | 007: Balls Q? | | England. | Q: Bolas 007! 
| |------------------------------------|-------------------------------------| From owner-robots Thu Dec 14 05:45:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23662; Thu, 14 Dec 95 05:45:40 -0800 Date: Thu, 14 Dec 95 08:47:11 EST From: "Jim Meritt" <jmeritt@smtpinet.aspensys.com> Message-Id: <9511148189.AA818959719@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Re[2]: Harvest question Content-Length: 399 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I know about the newsgroup - which is why I asked about a mailing list... ______________________________ Reply Separator _________________________________ Subject: Re: Harvest question Author: robots@webcrawler.com at SMTPINET Date: 12/13/95 8:26 PM > First, is someone aware of a Harvest list? > Jim Meritt There's a full-blown newsgroup, comp.infosystems.harvest --Mark From owner-robots Thu Dec 14 07:53:50 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00520; Thu, 14 Dec 95 07:53:50 -0800 From: mschrimsher@twics.com Message-Id: <v02130503acf5f7cc13e2@[202.243.51.222]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 00:54:17 +0900 To: robots@webcrawler.com Subject: Robot on the Rampage Cc: w3-search@rodem.slab.ntt.jp, infotalk@square.brl.ntt.jp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Can anyone identify the following robot: 206.214.202.44 It went through my site (a 600-page web directory service) grabbing several pages a second, despite the fact that I prohibit robots in my robots.txt file. 
--Mark From owner-robots Thu Dec 14 09:37:21 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06123; Thu, 14 Dec 95 09:37:21 -0800 Date: Fri, 15 Dec 1995 02:37:23 +0900 From: shimizu@rodem.slab.ntt.jp (Susumu Shimizu) Message-Id: <199512141737.CAA24695@rodem.slab.ntt.jp> To: mschrimsher@twics.com Cc: robots@webcrawler.com, w3-search@rodem.slab.ntt.jp, infotalk@square.brl.ntt.jp In-Reply-To: <v02130503acf5f7cc13e2@[202.243.51.222]> (mschrimsher@twics.com) Subject: Re: Robot on the Rampage Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark, here you are. Name: magellan.mckinley.com Address: 206.214.202.44 -- shim From owner-robots Thu Dec 14 10:14:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07660; Thu, 14 Dec 95 10:14:54 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199512141814.TAA01403@wsinis10.win.tue.nl> Subject: Re: Robot on the Rampage To: robots@webcrawler.com Date: Thu, 14 Dec 1995 19:14:53 +0100 (MET) In-Reply-To: <v02130503acf5f7cc13e2@[202.243.51.222]> from "mschrimsher@twics.com" at Dec 15, 95 00:54:17 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 244 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (mschrimsher@twics.com) write: > >Can anyone identify the following robot: > > 206.214.202.44 % host 206.214.202.44 Name: magellan.mckinley.com Try http://www.mckinley.com/ to find out more about their search service. 
-- Reinier From owner-robots Thu Dec 14 11:34:46 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11581; Thu, 14 Dec 95 11:34:46 -0800 Date: Thu, 14 Dec 1995 14:32:43 -0600 (CST) From: Cees Hek <hekc@phoenix.cis.mcmaster.ca> To: robots@webcrawler.com Subject: Checking Log files In-Reply-To: <v02130503acf5f7cc13e2@[202.243.51.222]> Message-Id: <Pine.LNX.3.91.951214142420.2308A-100000@phoenix.cis.mcmaster.ca> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anyone have a small script that will parse a log file (NCSA 1.3 common log format) and check for "nasty" robots? I don't have a /robots.txt file on the server, since we welcome anyone to index our site, but I would like to keep track of any robots that are hammering the system. Currently our log file grows at about half a Meg a day, and I don't have time to go through it myself. Any help would be appreciated. Cees Hek Computing & Information Services Email: hekc@mcmaster.ca McMaster University Hamilton, Ontario, Canada From owner-robots Thu Dec 14 13:08:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15607; Thu, 14 Dec 95 13:08:37 -0800 Message-Id: <9512142109.AA05041@marys.smumn.edu> Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Thu, 14 Dec 95 15:11:15 -0600 To: robots@webcrawler.com Subject: Re: Checking Log files References: <Pine.LNX.3.91.951214142420.2308A-100000@phoenix.cis.mcmaster.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Why not set up a cron job to grep all accesses to robots.txt out of the log file? From owner-robots Thu Dec 14 18:12:20 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00533; Thu, 14 Dec 95 18:12:20 -0800 Message-Id: 
<v02130503acf687e1f79d@[202.237.148.40]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 11:11:57 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Checking Log files Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Does anyone have a small script that will parse a log file (NCSA 1.3 >common log format) and check for "nasty" robots. I don't have a >/robots.txt file on the server, since we welcome anyone to index our >site, but I would like to keep track of any robots that are hammering the >system. > >Currently our log file grows at about a half a Meg a day, and I don't have >time to go through it myself. Any help would be appreciated > > >Cees Hek >Computing & Information Services Email: hekc@mcmaster.ca >McMaster University >Hamilton, Ontario, Canada You can make a robots.txt file that permits all accesses, and then check the log for requests for that file. But it won't catch robots that don't check for robots.txt. --Mark From owner-robots Thu Dec 14 18:46:45 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02442; Thu, 14 Dec 95 18:46:45 -0800 Message-Id: <n1393155260.72132@mail.intouchgroup.com> Date: 14 Dec 1995 18:50:11 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
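A minimal sketch of the log check discussed in the "Checking Log files" messages above, written in Python. It assumes the NCSA common log format (host, ident, user, bracketed timestamp, quoted request line, status, size) and combines the two ideas raised on the list: counting requests per host to spot anything hammering the server, and noting which hosts fetched /robots.txt. The function names and the threshold of 30 requests per minute are illustrative assumptions, not anything proposed in the thread.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"')


def parse_line(line):
    """Return (host, timestamp, path) for one log line, or None if it does not parse."""
    match = CLF.match(line)
    if not match:
        return None
    host, stamp, request = match.groups()
    try:
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    parts = request.split()
    path = parts[1] if len(parts) >= 2 else ""
    return host, when, path


def find_heavy_hosts(lines, max_per_minute=30):
    """Return ({host: peak requests in one minute}, {hosts that fetched /robots.txt}).

    max_per_minute is an assumed threshold; tune it to the site's normal traffic.
    """
    hits = defaultdict(Counter)  # host -> {minute -> request count}
    robots_fetchers = set()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        host, when, path = parsed
        hits[host][when.strftime("%Y-%m-%d %H:%M")] += 1
        if path == "/robots.txt":
            robots_fetchers.add(host)
    heavy = {}
    for host, minutes in hits.items():
        peak = max(minutes.values())
        if peak > max_per_minute:
            heavy[host] = peak
    return heavy, robots_fetchers
```

Run from cron, this could be fed the day's log with something like `find_heavy_hosts(open("access_log"))`, where the file name is hypothetical. As Mark notes, the /robots.txt set only identifies robots that ask for the file; the per-minute count is what catches the ones that do not.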
From owner-robots Thu Dec 14 19:17:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04222; Thu, 14 Dec 95 19:17:42 -0800 Message-Id: <n1393153402.82326@mail.intouchgroup.com> Date: 14 Dec 1995 19:20:30 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 19:43:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05763; Thu, 14 Dec 95 19:43:43 -0800 Message-Id: <n1393151841.75020@mail.intouchgroup.com> Date: 14 Dec 1995 19:46:33 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 20:13:46 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07395; Thu, 14 Dec 95 20:13:46 -0800 Message-Id: <n1393150039.85103@mail.intouchgroup.com> Date: 14 Dec 1995 20:16:52 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
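On Simon Stobart's question earlier in this thread about how to send User-Agent, From and Referer to a web server: in HTTP/1.0 these are ordinary header lines that follow the request line, one per line, with a blank line ending the header block. The sketch below builds such a request as text in Python; IncyWincy itself is written in C++, so this only illustrates what goes over the wire. The function name and the values in the usage example are hypothetical.

```python
def build_request(path, user_agent, from_address, referer=None):
    """Return the text of an HTTP/1.0 GET request that identifies the robot."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"User-Agent: {user_agent}",   # robot name and version
        f"From: {from_address}",       # e-mail address of the robot's operator
    ]
    if referer:
        lines.append(f"Referer: {referer}")  # page on which the link was found
    # Lines end in CRLF; an empty line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_request("/index.html", "ExampleRobot/0.1", "operator@example.org"))
```

The resulting string is what the robot would write to the socket after connecting to the server, whatever language the robot is implemented in.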
From owner-robots Thu Dec 14 20:43:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08988; Thu, 14 Dec 95 20:43:51 -0800 Message-Id: <n1393148232.94882@mail.intouchgroup.com> Date: 14 Dec 1995 20:46:54 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [5]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [5]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 22:23:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14421; Thu, 14 Dec 95 22:23:54 -0800 Message-Id: <v0213050aacf6c4753632@[202.237.148.34]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:23:30 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: [5]RE>Checking Log files Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Date: 14 Dec 1995 20:46:54 -0800 >From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> >Subject: [5]RE>Checking Log files >To: robots@webcrawler.com >Sender: owner-robots@webcrawler.com >Precedence: bulk >Reply-To: robots@webcrawler.com > > [5]RE>Checking Log files 12/14/95 > >Thanks for you message. > >I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail >before I get back. Is there any way to stop Roger's infinite loop? January 5 is a long way off. --Mark From owner-robots Fri Dec 15 02:08:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25729; Fri, 15 Dec 95 02:08:57 -0800 Message-Id: <9512151006.AA25636@webcrawler.com> X-Mailer: exmh version 1.5 11/22/94 To: robots@webcrawler.com Subject: Re: [5]RE>Checking Log files In-Reply-To: Your message of "Fri, 15 Dec 95 15:23:30 +0900." 
<v0213050aacf6c4753632@[202.237.148.34]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 95 09:55:02 +0000 From: M.Levy@cs.ucl.ac.uk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Is there any way to stop Roger's infinite loop? January 5 is a long way off. > > --Mark > > Maybe it's worth mailing the system administrator at twics.com From owner-robots Fri Dec 15 02:55:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28164; Fri, 15 Dec 95 02:55:10 -0800 Message-Id: <n1393125956.34566@mail.intouchgroup.com> Date: 15 Dec 1995 02:58:18 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>[5]RE>Checking Log files 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 03:55:58 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01483; Fri, 15 Dec 95 03:55:58 -0800 Message-Id: <n1393122315.53763@mail.intouchgroup.com> Date: 15 Dec 1995 03:58:38 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 08:12:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03110; Fri, 15 Dec 95 08:12:39 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199512151254.VAA10969@ktmp.kaist.ac.kr> Subject: Wobot? 
To: robots@webcrawler.com (Robot Mailing list) Date: Fri, 15 Dec 1995 21:54:17 +0900 (KST) X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 504 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anyone know about Wobot from magellan.mckinley.com? They identify themselves as "Wobot" in the User-Agent field. Martijn Koster writes in the "List of Robots" HTML page: -- McKinley Robot It's unclear who administers this, but a number of people have complained about rapid-fire hits from magellan.mckinley.com. There have been no replies to direct complaints. Not very nice... -- Yeah, okay. I know what Wobot is now. My question is: is there any method to exclude *only* Wobot in my access list? -chitos From owner-robots Fri Dec 15 08:46:56 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06966; Fri, 15 Dec 95 08:46:56 -0800 Date: Fri, 15 Dec 1995 11:46:48 -0500 (EST) From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> To: Robots mailing list <robots@webcrawler.com> Subject: Announcing NaecSpyr, a new. . . robot? Message-Id: <Pine.SGI.3.91.951215000641.16482A-100000@umbc10.umbc.edu> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com NaecSpyr is an agent that checks if URLs have changed. In purpose, it is similar to URL Minder, w3new, and Web Watch; in implementation, it takes a slightly different approach, running centrally on a server (like URL Minder) but providing a web interface for each user. NaecSpyr may not be a robot according to the definition in the robots homepage (it doesn't scan HTML for new URLs), but it complies with the robot protocol anyway. ;> See <http://www.gl.umbc.edu/~mabzug1/NaecSpyr> for a little (*very* little) more info. Mordechai T. 
Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu So many bytes, so few CPS. From owner-robots Fri Dec 15 09:03:29 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08911; Fri, 15 Dec 95 09:03:29 -0800 Message-Id: <n1393103858.61698@mail.intouchgroup.com> Date: 15 Dec 1995 09:06:06 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:08:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09443; Fri, 15 Dec 95 09:08:25 -0800 Message-Id: <n1393103559.81023@mail.intouchgroup.com> Date: 15 Dec 1995 09:11:21 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:23:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10895; Fri, 15 Dec 95 09:23:32 -0800 Message-Id: <n1393102654.36226@mail.intouchgroup.com> Date: 15 Dec 1995 09:26:58 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]Announcing NaecSpyr, a new. . . robot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
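On chitos's question above about excluding *only* Wobot: under the robot exclusion standard, a /robots.txt record naming that robot's User-agent with "Disallow: /", followed by a catch-all record with an empty Disallow, shuts out the named robot and leaves the others unrestricted. The sketch below checks such a file with Python's standard urllib.robotparser. It assumes the robot really does identify itself as "Wobot", as reported in the message, and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Candidate /robots.txt: one record for the unwanted robot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Wobot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets the given User-agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed("Wobot", "http://www.example.org/index.html"))
    print(allowed("SomeOtherRobot", "http://www.example.org/index.html"))
```

This only works for robots that read and obey the file. Mark's report earlier in the thread suggests the robot at magellan.mckinley.com may not, in which case a server-side restriction by host would be the remaining option.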
From owner-robots Fri Dec 15 09:55:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14150; Fri, 15 Dec 95 09:55:54 -0800 Message-Id: <n1393100724.52255@mail.intouchgroup.com> Date: 15 Dec 1995 09:58:31 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]Announcing NaecSpyr, a n 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:55:56 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14154; Fri, 15 Dec 95 09:55:56 -0800 Message-Id: <n1393100712.52210@mail.intouchgroup.com> Date: 15 Dec 1995 09:59:09 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:06:31 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15105; Fri, 15 Dec 95 10:06:31 -0800 Message-Id: <n1393100072.90931@mail.intouchgroup.com> Date: 15 Dec 1995 10:09:01 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
From owner-robots Fri Dec 15 10:14:24 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15864; Fri, 15 Dec 95 10:14:24 -0800 Date: Fri, 15 Dec 1995 10:14:19 -0800 From: gordon@BASISinc.com (Gordon Bainbridge) Message-Id: <9512151814.AA01071@outland.BASISinc.com> To: robots@webcrawler.com Subject: Re: [3]RE>[5]RE>Checking Log fi X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ----- Begin Included Message ----- From owner-robots@webcrawler.com Fri Dec 15 09:58 PST 1995 Date: 15 Dec 1995 09:06:06 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Reply-To: robots@webcrawler.com [3]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. ----- End Included Message ----- Please, please, please, if you check your mail, take care of this. I'm receiving this message from you constantly. I'll be away for a week, and fear that your messages will completely overflow my mailbox. DO SOMETHING!!! From owner-robots Fri Dec 15 10:20:18 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16063; Fri, 15 Dec 95 10:20:18 -0800 From: micah@fsu.fsufay.edu (Micah A. Williams) Message-Id: <199512151819.NAA05590@fsu.fsufay.edu> Subject: Re: [2]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Date: Fri, 15 Dec 95 13:19:50 EST In-Reply-To: <n1393122315.53763@mail.intouchgroup.com>; from "Roger Dearnaley" at Dec 15, 95 3:58 am X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In the words of Roger Dearnaley, > > [2]RE>[5]RE>Checking Log fi 12/15/95 > > Thanks for you message. > > I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail > before I get back. 
> I think it's obvious that some sort of recursion or infinite loop is happening at the site this mail is originating from: mail.intouchgroup.com. All the duplicate messages have different mail-spooler IDs from this site, so the mail is being queued over and over again for some reason. (Perhaps it is being submitted repeatedly by a mail client)... Oh well...I guess we could get a lot of messages between now and Jan 5th :-) Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? Thanks. -Micah -- ==================================================================== Micah A. Williams micah@fsu.uncfsu.edu Computer Science tndf20c@prodigy.com Fayetteville State University http://fsu.uncfsu.edu/~micah Bjork WebPage: http://fsu.uncfsu.edu/~micah/bjork.html ==================================================================== From owner-robots Fri Dec 15 10:32:18 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17084; Fri, 15 Dec 95 10:32:18 -0800 Message-Id: <n1393098526.84465@mail.intouchgroup.com> Date: 15 Dec 1995 10:35:10 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
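The loop Micah describes above can be broken on the responder's side. Every message from this list carries "Precedence: bulk", and a common convention is for auto-responders to stay silent for bulk or list mail, to ignore their own messages, and to answer each sender at most once. The Python sketch below shows those three checks; the function name, its arguments, and the set used to remember senders are assumptions made for illustration and say nothing about how the Mail*Link responder actually works.

```python
from email import message_from_string
from email.utils import parseaddr

# Precedence values that conventionally mark mailing-list or mass mail.
BULK_PRECEDENCE = {"bulk", "list", "junk"}


def should_auto_reply(raw_message, own_address, already_replied):
    """Decide whether a vacation responder should answer raw_message.

    already_replied is a set of sender addresses that have had a reply;
    it is updated in place when this function returns True.
    """
    msg = message_from_string(raw_message)

    # 1. Never answer list traffic.
    precedence = (msg.get("Precedence") or "").strip().lower()
    if precedence in BULK_PRECEDENCE:
        return False

    # 2. Never answer our own messages or messages with no usable sender.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if not sender or sender == own_address.lower():
        return False

    # 3. Answer each correspondent only once.
    if sender in already_replied:
        return False
    already_replied.add(sender)
    return True
```

Any one of the three checks would have stopped the flood on this list: the first because the list marks its mail as bulk, the second and third because the replies come back from the same address each time.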
From owner-robots Fri Dec 15 10:43:44 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18068; Fri, 15 Dec 95 10:43:44 -0800 Message-Id: <n1393097842.22896@mail.intouchgroup.com> Date: 15 Dec 1995 10:46:37 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>[3]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>[3]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:43:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18065; Fri, 15 Dec 95 10:43:43 -0800 Message-Id: <n1393097842.22938@mail.intouchgroup.com> Date: 15 Dec 1995 10:45:57 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]Announcing NaecSpyr, a n 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:48:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18534; Fri, 15 Dec 95 10:48:40 -0800 Message-Id: <n1393097544.41865@mail.intouchgroup.com> Date: 15 Dec 1995 10:51:21 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [5]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [5]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
From owner-robots Fri Dec 15 10:49:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18645; Fri, 15 Dec 95 10:49:51 -0800 From: Vince Taluskie <vince@psa.pencom.com> Message-Id: <199512151849.MAA10429@psa.pencom.com> Subject: Contact for Intouchgroup.com To: robots@webcrawler.com Date: Fri, 15 Dec 1995 12:49:41 -0600 (CST) In-Reply-To: <n1393100712.52210@mail.intouchgroup.com> from "Roger Dearnaley" at Dec 15, 95 09:59:09 am X-Mailer: ELM [version 2.4 PL24] Content-Type: text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com WHOIS shows the following as an administrative contact at the site: Hunter, Kurt (KH258) kurt_hunter@INTOUCHGROUP.COM 415-974-5000 I phoned Kurt and left voicemail for him about this user asking him to disable the auto-responder on the account.... Cheers, Vince -- ___ ____ __ | _ \/ __/| \ Vince Taluskie, at Fidelity Investments Boston, MA | _/\__ \| \ \ Pencom Systems Administration Phone: 617-563-8349 |_| /___/|_|__\ vince@pencom.com Pager: 800-253-5353, #182-6317 -------------------------------------------------------------------------- "We are smart, we make things go" From owner-robots Fri Dec 15 10:59:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19483; Fri, 15 Dec 95 10:59:39 -0800 Message-Id: <n1393096884.81309@mail.intouchgroup.com> Date: 15 Dec 1995 11:02:20 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>[2]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>[2]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:12:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20619; Fri, 15 Dec 95 11:12:00 -0800 From: micah@fsu.fsufay.edu (Micah A. 
Williams) Message-Id: <199512151911.OAA06554@fsu.fsufay.edu> Subject: Re: [3]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Date: Fri, 15 Dec 95 14:11:30 EST In-Reply-To: <n1393103858.61698@mail.intouchgroup.com>; from "Roger Dearnaley" at Dec 15, 95 9:06 am X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com -- ==================================================================== Micah A. Williams micah@fsu.uncfsu.edu Computer Science tndf20c@prodigy.com Fayetteville State University http://fsu.uncfsu.edu/~micah Bjork WebPage: http://fsu.uncfsu.edu/~micah/bjork.html ==================================================================== From owner-robots Fri Dec 15 11:14:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20891; Fri, 15 Dec 95 11:14:51 -0800 Message-Id: <n1393095974.37243@mail.intouchgroup.com> Date: 15 Dec 1995 11:18:46 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:22:20 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21568; Fri, 15 Dec 95 11:22:20 -0800 Date: Fri, 15 Dec 1995 11:22:29 -0800 From: gordon@BASISinc.com (Gordon Bainbridge) Message-Id: <9512151922.AA01083@outland.BASISinc.com> To: robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ----- Begin Included Message ----- From owner-robots@webcrawler.com Fri Dec 15 10:59 PST 1995 From: micah@fsu.fsufay.edu (Micah A. 
Williams) Subject: Re: [2]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Date: Fri, 15 Dec 95 13:19:50 EST Reply-To: robots@webcrawler.com In the words of Roger Dearnaley, > > [2]RE>[5]RE>Checking Log fi 12/15/95 > > Thanks for you message. > > I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail > before I get back. > Oh well...I guess we could get a lot of messages between now and Jan 5th :-) Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? Thanks. -Micah ----- End Included Message ----- I've already tried it. My mail was returned with the message "Unknown Quicktime Receipient(s)". Not only that, but my message has been returned to me TWICE. Does anyone else have any ideas? I'll be gone for a week, and don't want my mail box cluttered with hundreds of e-mails from good ol' Roger. I guess I could unsubscribe until January 5, but I'd rather not do it if there's an alternative. -Gordon Bainbridge BASIS Inc Emeryville, CA From owner-robots Fri Dec 15 11:25:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21899; Fri, 15 Dec 95 11:25:52 -0800 Message-Id: <n1393095317.74876@mail.intouchgroup.com> Date: 15 Dec 1995 11:28:52 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]Announcing NaecSpyr, a n 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
From owner-robots Fri Dec 15 11:25:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21900; Fri, 15 Dec 95 11:25:52 -0800 Message-Id: <n1393095317.74946@mail.intouchgroup.com> Date: 15 Dec 1995 11:28:25 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>[3]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[3]RE>[5]RE>Checking 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:32:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22520; Fri, 15 Dec 95 11:32:00 -0800 Date: Fri, 15 Dec 1995 14:31:50 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199512151931.OAA10572@dolphin.automatrix.com> To: robots@webcrawler.com Cc: postmaster@mail.intouchgroup.com, owner-robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi In-Reply-To: <199512151819.NAA05590@fsu.fsufay.edu> References: <n1393122315.53763@mail.intouchgroup.com> <199512151819.NAA05590@fsu.fsufay.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? I already sent postmaster@mail.intouchgroup.com a note about the problem. No response yet. (Dear postmaster: For what it's worth, this fellow's mailbot has spewed, oh I don't know, maybe 30 messages back at the robots mailing list. I suspect any other lists he's on are similarly affected.) Perhaps the robots list owner could remove this fellow from the list for now... 
Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Fri Dec 15 11:32:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22578; Fri, 15 Dec 95 11:32:34 -0800 From: micah@fsu.fsufay.edu (Micah A. Williams) Message-Id: <199512151932.OAA06841@fsu.fsufay.edu> Subject: Dearnaley Auto Reply Cannon? To: robots@webcrawler.com Date: Fri, 15 Dec 95 14:32:05 EST X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm sure many of you have figured this out already, but I think maybe Mr. Dearnaley is running some kind of automated reply program from his account. Any mail that is sent to his inbox gets an auto reply with the body of the mail being the subject followed by ... "I'm gone 'til Jan 5, etc..". The recursion is occurring because he is a member of the very list he is sending auto-replies to. So not only is he receiving and auto-replying to his own replies over and over again, he is also starting new recursion threads with any new message sent to the list. (Actually, this is kinda cool..I like recursion problems..but I'm sure the list maintainer and everybody else dislikes a mailbox full of Re:Re:Re: messages) .. The solution: (As Bonnie Scott pointed out to me) Temporarily remove Roger Dearnaley from the list. Sorry if I wasted bandwidth with this, but I just got a sudden realization of how this was all unfolding. -Micah -- ==================================================================== Micah A.
Williams micah@fsu.uncfsu.edu Computer Science tndf20c@prodigy.com Fayetteville State University http://fsu.uncfsu.edu/~micah Bjork WebPage: http://fsu.uncfsu.edu/~micah/bjork.html ==================================================================== From owner-robots Fri Dec 15 11:35:33 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22843; Fri, 15 Dec 95 11:35:33 -0800 Message-Id: <9512151935.AA05873@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Fri, 15 Dec 95 13:36:13 -0600 To: robots@webcrawler.com References: <199512151819.NAA05590@fsu.fsufay.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How hard would this be.. Would the admin please take him off the darn list please From owner-robots Fri Dec 15 11:36:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22904; Fri, 15 Dec 95 11:36:09 -0800 Message-Id: <n1393094695.13294@mail.intouchgroup.com> Date: 15 Dec 1995 11:38:24 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>[2]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[2]RE>[5]RE>Checking 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
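The loop Micah describes above comes from a responder that answers list traffic as if it were personal mail. Every message in this archive carries "Sender: owner-robots" and "Precedence: bulk", so a responder could use those headers to stay quiet. The following is only a minimal sketch of such a check, not how Mail*Link or any particular package works. The function name should_auto_reply is hypothetical, and treating the "junk" and "list" precedence values and "-request" senders the same way as "bulk" is an assumption based on common practice rather than anything stated in the thread.

```python
from email import message_from_string

# Precedence values treated as "do not auto-reply". Only "bulk" appears in
# this list's headers; "junk" and "list" are assumed here by convention.
SKIP_PRECEDENCE = {"bulk", "junk", "list"}


def should_auto_reply(raw_message: str) -> bool:
    """Return True only if a vacation reply looks safe to send."""
    msg = message_from_string(raw_message)

    # Mailing-list traffic normally announces itself via Precedence.
    precedence = (msg.get("Precedence") or "").strip().lower()
    if precedence in SKIP_PRECEDENCE:
        return False

    # A Sender such as "owner-robots" marks list-managed mail even when
    # the Precedence header is missing.
    sender = (msg.get("Sender") or "").strip("<> ").lower()
    local_part = sender.split("@")[0]
    if local_part.startswith("owner-") or local_part.endswith("-request"):
        return False

    return True
```

A responder would call should_auto_reply on each incoming message before sending anything. A fuller version would also remember which addresses it has already answered, so that a single correspondent gets at most one notice.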
From owner-robots Fri Dec 15 11:42:31 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23546; Fri, 15 Dec 95 11:42:31 -0800 Message-Id: <n1393094316.33627@mail.intouchgroup.com> Date: 15 Dec 1995 11:43:55 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]Contact for Intouchgroup To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]Contact for Intouchgroup.com 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:42:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23568; Fri, 15 Dec 95 11:42:40 -0800 Message-Id: <n1393094310.33444@mail.intouchgroup.com> Date: 15 Dec 1995 11:45:14 -0800 Priority: Urgent From: "Saul Jacobs" <saul_jacobs@mail.intouchgroup.com> Subject: Re: [2]RE>[5]RE>Checking Lo To: robots@webcrawler.com, "Skip Montanaro" <skip@calendar.com> Cc: owner-robots@webcrawler.com, "postmaster@mail.intouchgroup.co" <postmaster@mail.intouchgroup.com> X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com intouch Reply 12/15/95 Subject:RE>>[2]RE>[5]RE>Checking Log fi 11:43 I am the postmaster. I am working on killing our user's forward. The mails should stop within 2 hours. But check out our webite: http://WorldWideMusic.com/ Saul Jacobs Coputer Systems Manger intouch group, inc. -------------------------------------- Date: 12/15/95 11:42 To: Saul Jacobs From: Skip Montanaro Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? I already sent postmaster@mail.intouchgroup.com a note about the problem. No response yet. (Dear postmaster: For what it's worth, this fellow's mailbot has spewed, oh I don't know, maybe 30 messages back at the robots mailing list. 
I suspect any other lists he's on are similarly affected.) Perhaps the robots list owner could remove this fellow from the list for now... Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< ------------------ RFC822 Header Follows ------------------ Received: by mail.intouchgroup.com with SMTP;15 Dec 1995 11:39:17 -0800 Received: (from skip@localhost) by dolphin.automatrix.com (8.6.12/8.6.12) id OAA10572; Fri, 15 Dec 1995 14:31:50 -0500 Date: Fri, 15 Dec 1995 14:31:50 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199512151931.OAA10572@dolphin.automatrix.com> To: robots@webcrawler.com CC: postmaster@mail.intouchgroup.com, owner-robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi In-Reply-To: <199512151819.NAA05590@fsu.fsufay.edu> References: <n1393122315.53763@mail.intouchgroup.com> <199512151819.NAA05590@fsu.fsufay.edu> Reply-To: skip@calendar.com (Skip Montanaro) From owner-robots Fri Dec 15 11:47:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24012; Fri, 15 Dec 95 11:47:06 -0800 Message-Id: <n1393094037.53612@mail.intouchgroup.com> Date: 15 Dec 1995 11:49:43 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [5]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [5]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
From owner-robots Fri Dec 15 12:00:44 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25042; Fri, 15 Dec 95 12:00:44 -0800 Message-Id: <199512152000.PAA12523@tinman.dev.prodigy.com> X-Sender: bonnie@192.203.241.117 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:00:35 -0400 To: robots@webcrawler.com From: bonnie@dev.prodigy.com (Bonnie Scott) Subject: Re: [2]RE>[5]RE>Checking Lo X-Mailer: <Windows Eudora Version 2.0.2> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I am the postmaster. I am working on killing our user's forward. The mails >should stop within 2 hours. > >But check out our webite: http://WorldWideMusic.com/ > >Saul Jacobs >Coputer Systems Manger >intouch group, inc. Thanks Saul, if you're on this list. I had taken matters into my own hands a half hour ago, and told majordomo I was Roger and I told it to "unsubscribe robots." I then did a "who robots," and Roger doesn't appear to be on it anymore. I'll apologize to Roger and his autoresponder myself. :) I thought that the mail client community figured out that autoresponders should reply to "Sender:" and not even bother to answer "Precedence: bulk" messages (both of which are present in this list's headers) back in '93 or so with the big MCI mail snafu. Bonnie Scott Prodigy Services Company (whose mail client ALWAYS replies to sender, even if there's a "Reply-to:". Not my app. <g>) From owner-robots Fri Dec 15 12:06:30 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25532; Fri, 15 Dec 95 12:06:30 -0800 From: ecarp@tssun5.dsccc.com Date: Fri, 15 Dec 1995 14:03:20 -0600 Message-Id: <9512152003.AA10658@tssun5.> To: robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Perhaps the list maintainer can filter out this address until Jan 5. 
- that might be an easier and faster solution. From owner-robots Fri Dec 15 14:38:55 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06614; Fri, 15 Dec 95 14:38:55 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800acf5eebdebc4@[199.221.45.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 16:38:28 -0500 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Announcement and Help Requested Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >New Robot Announcement Can you fill out http://info.webcrawler.com/mak/projects/robots/active.html so I have all the bits I need toadd you to the list? >The user-agent, from and referer http fields are not set to anything >currently. Obviously, I wish these to conatin informative information. So, >how do you send this information to the web server? Ehr, by adding them as headers to the request? >User-Agent: IncyWincy V?.? Check the HTTP spec, it suggests forms like IncyWincy/1.1 -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Fri Dec 15 15:44:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10472; Fri, 15 Dec 95 15:44:12 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130504acf7b860b790@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:45:18 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Wobot? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 9:54 PM 12/15/95, Byung-Gyu Chang wrote: >Did anyone know about Wobot from magellan.mckinley.com ? >They represents them "Wobot" in User-Agent field. 
> >Martijn Koster write in "List of Robots" html page : >-- >McKinley Robot > >It's unclear who administers this, but a number of people have complained >about rapid-fire hits from magellan.mckinley.com. There have been no replies >to direct complaints. Not very nice... I've forwarded some of the complaints here to the head of development at McKinley. When I hear back, I'll post to the list. If all else fails, I do have his home phone number... ;-) Often, this sort of thing is an isolated test gone haywire. Nick From owner-robots Fri Dec 15 19:43:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25112; Fri, 15 Dec 95 19:43:52 -0800 X-Sender: mak@surfski.webcrawler.com (Unverified) Message-Id: <v02140808acf7b2ecfb47@[199.221.45.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:25:39 -0800 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Vacation wars Cc: bonnie@dev.prodigy.com (Bonnie Scott) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <199512152000.PAA12523@tinman.dev.prodigy.com>, Bonnie Scott writes: > Thanks Saul, if you're on this list. I had taken matters into my own hands a > half hour ago, and told majordomo I was Roger and I told it to "unsubscribe > robots." I then did a "who robots," and Roger doesn't appear to be on it > anymore. I'll apologize to Roger and his autoresponder myself. :) Ah, that explains why I couldn't find him. :-) Thanks; sometimes I can catch these in time, but this time I had to be bleeped away from my day off :-/ > I thought that the mail client community figured out that autoresponders > should reply to "Sender:" and not even bother to answer "Precedence: bulk" > messages (both of which are present in this list's headers) back in '93 or > so with the big MCI mail snafu. Quite. 
Not sure quite what "Mail*Link SMTP-QM 3.0.2" is, but with lots of gatewaying and simplistic PC packages nowadays this does happen every once in a while. For anyone thinking about using autoresponders on UNIX, check out mailagent, which goes to all sorts of lengths to prevent such loops (but let me declare auto-responder mail off-topic for this group). I guess it is time to modify majordomo to filter out vacation messages... Sorry for the inconvenience caused, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Sat Dec 16 05:48:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26270; Sat, 16 Dec 95 05:48:06 -0800 Date: Sat, 16 Dec 1995 14:47:55 +0100 (MET) From: Bjorn-Olav Strand <bjorn-ol@ifi.uio.no> To: robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi In-Reply-To: <199512151819.NAA05590@fsu.fsufay.edu> Message-Id: <Pine.SUN.3.91.951216144542.3188A-100000@beli.ifi.uio.no> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Fri, 15 Dec 1995, Micah A. Williams wrote: > > I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail > > before I get back. > Oh well...I guess we could get a lot of messages between > now and Jan 5th :-) He sends a reply to all the mail he gets, saying that he is on vacation. But when the mail is from robots@webcrawler.com he will send it to that address, and then get the message back, and then reply to it again... There are 2 solutions. 1. Take him off the list. 2. Talk to his postmaster. ----- XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX Bjorn-Olav Strand . Nedre Berglia 56 . 1353 BAERUMS VERK . NORWAY (+47) 967 68 054 . bolav@pobox.com .
http://www.pobox.com/~bolav/ XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX From owner-robots Sat Dec 16 12:28:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19283; Sat, 16 Dec 95 12:28:19 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510103acf8dc452db8@[205.240.23.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 16 Dec 1995 12:28:38 -0800 To: robots@webcrawler.com From: david@olympic.net (David Henderson) Subject: Re: [2]RE>[5]RE>Checking Lo Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>I am the postmaster. I am working on killing our user's forward. The mails >>should stop within 2 hours. >> >>But check out our webite: http://WorldWideMusic.com/ >> >>Saul Jacobs >>Coputer Systems Manger >>intouch group, inc. > >Thanks Saul, if you're on this list. I had taken matters into my own hands a >half hour ago, and told majordomo I was Roger and I told it to "unsubscribe >robots." I then did a "who robots," and Roger doesn't appear to be on it >anymore. I'll apologize to Roger and his autoresponder myself. :) > >I thought that the mail client community figured out that autoresponders >should reply to "Sender:" and not even bother to answer "Precedence: bulk" >messages (both of which are present in this list's headers) back in '93 or >so with the big MCI mail snafu. > >Bonnie Scott >Prodigy Services Company >(whose mail client ALWAYS replies to sender, even if there's a "Reply-to:". > Not my app. 
<g>) congratulation bonnie, ______________________________________________________________________ David Henderson QUICKimage Homepage Development and Marketing HOME PH/FAX: 360-377-2182 WORK PH: 206-443-1430 WORK FAX: 206-443-5670 From owner-robots Sat Dec 16 12:48:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20374; Sat, 16 Dec 95 12:48:25 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130501acf8e09606e0@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 16 Dec 1995 12:49:27 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Vacation wars Cc: m.koster@webcrawler.com (Martijn Koster) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 3:25 PM 12/15/95, Martijn Koster wrote: >... Not sure quite what "Mail*Link SMTP-QM 3.0.2" is... FYI, it's the StarNine QuickMail-SMTP gateway package. It gateways Internet mail to a Macintosh QuickMail server. (StarNine, which publishes the Mac Web server, WebStar, recently was acquired by Quarterdeck.) Nick From owner-robots Sun Dec 17 18:53:29 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20783; Sun, 17 Dec 95 18:53:29 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510100acfa871416e8@[205.240.23.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 17 Dec 1995 18:53:55 -0800 To: robots@webcrawler.com From: david@olympic.net (David Henderson) Subject: New Robot??? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I discovered this robot hitting my server. From: hyrax.bio.indiana.edu. User-Agent: WebSCANNER libwww-perl/0.20 I have a very limited knowledge about robots so far. Is this a known robot? 
______________________________________________________________________ David Henderson Webmaster QUICKimage HOME PH/FAX: 360-377-2182 WORK PH: 206-443-1430 WORK FAX: 206-443-5670 From owner-robots Mon Dec 18 04:23:53 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16169; Mon, 18 Dec 95 04:23:53 -0800 From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9512181220.AA03749@kronos> Subject: Re: Announcement and Help Requested To: robots@webcrawler.com Date: Mon, 18 Dec 1995 12:20:21 +0000 (GMT) In-Reply-To: <9512141034.AA19413@osiris.sund.ac.uk> from "Simon.Stobart" at Dec 14, 95 10:34:13 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jeremy.Ellman@mari.co.uk > > New Robot Announcement > ~~~~~~~~~~~~~~~~~~~~~~ > Name: IncyWincy > Home: University of Sunderland, UK > Implementation Language: C++ > Supports Robot Exclusion standard: Yes > Purpose: Various research projects > Status: This robot has not yet been released outside of Sunderland > Authors: Simon Stobart, Reg Arthington > > Help Requested > ~~~~~~~~~~~~~~ > The user-agent, from and referer http fields are not set to anything currently. Obviously, I wish these to conatin informative information. So, how do you send this information to the web server? > > The values which I wish to set these fields to are: > > User-Agent: IncyWincy V?.? > From: simon.stobart@sunderland.ac.uk > > Many Thanks > > |------------------------------------+-------------------------------------| > | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | > | Lecturer in Computing, | Voice: (+44) 091 515 2783 | > | School of Computing | Fax: (+44) 091 515 2781 | > | & Information Systems, + ------------------------------------| > | University of Sunderland, SR1 3SD, | 007: Balls Q? | > | England. | Q: Bolas 007! 
| > |------------------------------------|-------------------------------------| > From owner-robots Mon Dec 18 06:46:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22346; Mon, 18 Dec 95 06:46:15 -0800 Date: Mon, 18 Dec 1995 14:45:51 GMT From: cs0sst@isis.sunderland.ac.uk (Simon.Stobart) Message-Id: <9512181445.AA10893@osiris.sund.ac.uk> To: robots@webcrawler.com Subject: Re: Announcement and Help Requested X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I got this from you - but there is no message. Simon ----- Begin Included Message ----- From owner-robots@webcrawler.com Mon Dec 18 14:34 GMT 1995 From: jeremy@mari.co.uk (Jeremy.Ellman) Subject: Re: Announcement and Help Requested Date: Mon, 18 Dec 1995 12:20:21 +0000 (GMT) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Jeremy.Ellman@mari.co.uk > > New Robot Announcement > ~~~~~~~~~~~~~~~~~~~~~~ > Name: IncyWincy > Home: University of Sunderland, UK > Implementation Language: C++ > Supports Robot Exclusion standard: Yes > Purpose: Various research projects > Status: This robot has not yet been released outside of Sunderland > Authors: Simon Stobart, Reg Arthington > > Help Requested > ~~~~~~~~~~~~~~ > The user-agent, from and referer http fields are not set to anything currently. Obviously, I wish these to conatin informative information. So, how do you send this information to the web server? > > The values which I wish to set these fields to are: > > User-Agent: IncyWincy V?.? > From: simon.stobart@sunderland.ac.uk > > Many Thanks > > |------------------------------------+-------------------------------------| > | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | > | Lecturer in Computing, | Voice: (+44) 091 515 2783 | > | School of Computing | Fax: (+44) 091 515 2781 | > | & Information Systems, + ------------------------------------| > | University of Sunderland, SR1 3SD, | 007: Balls Q? | > | England. 
| Q: Bolas 007! | > |------------------------------------|-------------------------------------| > ----- End Included Message ----- From owner-robots Mon Dec 18 07:44:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25834; Mon, 18 Dec 95 07:44:40 -0800 Date: Mon, 18 Dec 1995 10:42:39 -0600 (CST) From: Cees Hek <hekc@phoenix.cis.mcmaster.ca> To: robots@webcrawler.com Subject: Re: Checking Log files In-Reply-To: <v02130503acf687e1f79d@[202.237.148.40]> Message-Id: <Pine.LNX.3.91.951218102344.6222A-100000@phoenix.cis.mcmaster.ca> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Now that things have calmed down a bit on this list..... :-) What I was looking for was something that actually did some analysis on the log file, like a log statistics package but one that was geared toward robots. It would check to see if the robots are actually following the standard for Robot exclusion. It could check if there are multiple accesses to the server and how far apart they are, how many times in a month the robot returns, and how often the robots.txt file is accessed to name a few. If nothing like this has been written, I may just write it myself (if I can find some free time). I would welcome any suggestions as to what a program like this should contain. For now though I guess I will have to live with grepping the log file.... Cees Hek Computing & Information Services Email: hekc@mcmaster.ca McMaster University Hamilton, Ontario, Canada On Fri, 15 Dec 1995, Mark Schrimsher wrote: > You can make a robots.txt file that permits all accesses, and then check > the log for requests for that file. But it won't catch robots that don't > check for robots.txt. 
> > --Mark From owner-robots Mon Dec 18 14:59:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19060; Mon, 18 Dec 95 14:59:15 -0800 Message-Id: <9512182259.AA19051@webcrawler.com> To: robots Subject: test; please ignore From: Martijn Koster <m.koster@webcrawler.com> Date: Mon, 18 Dec 1995 14:59:15 -0800 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[2]RE>[5]RE>Checking 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. Do ingore this message; it's not true, and merely a test-case for a new bounce rule in majordomo, designed to prevent at least some "Roger Dearnaley" problems. The fact you saw this message indicates the initial easy fix didn't work, so it's back to the drawing board :-( Don't worry, further testing will take place on a specific test list... -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Mon Dec 18 19:07:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24551; Mon, 18 Dec 95 19:07:09 -0800 Message-Id: <v02130503acfbdc9fea77@[202.237.148.35]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 19 Dec 1995 12:06:43 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: test; please ignore Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > [2]RE>[2]RE>[5]RE>Checking 12/15/95 > >Thanks for you message. > >I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail >before I get back. > >Do ingore this message; it's not true, and merely a test-case for >a new bounce rule in majordomo, designed to prevent at least some >"Roger Dearnaley" problems. 
The fact you saw this message indicates >the initial easy fix didn't work, so it's back to the drawing board :-( >Don't worry, further testing will take place on a specific test list... > >-- Martijn >__________ >Email: m.koster@webcrawler.com >WWW: http://info.webcrawler.com/mak/mak.html Martijn: How can I subscribe to the test list. ;-) --<arl From owner-robots Thu Dec 21 07:50:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11270; Thu, 21 Dec 95 07:50:52 -0800 Date: Thu, 21 Dec 1995 07:57:20 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510101acfec33d79dd@[204.182.64.25]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: david@quickimage.com (David Henderson) Subject: Re: test; please ignore Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com test _____________________________________________________________ David Henderson - Webmaster - QUICKimage _____ HOME PH/FAX: 360-377-2182 / \ WORK PH: 206-443-1430 @ 0 0 @ WORK FAX: 206-443-5670 | \_/ | Check out my newest creation "MeatPower" \_____/ at 'http://www.qinet.com/meat' From owner-robots Fri Dec 22 07:31:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07619; Fri, 22 Dec 95 07:31:42 -0800 Message-Id: <199512221530.HAA09354@sparty.surf.com> Date: Thu, 21 Dec 95 19:29:36 -0800 From: Murray Bent <murrayb@surf.com> X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Unfriendly Lycos , again ... X-Url: http://www.whitehouse.gov/White_House/Publications/html/Publications.html Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Is anyone getting requests from an anonymous robot, presumably, from the lycos domain presumably (cmu.edu), as follows .. 
bragi.cc.cmu.edu - - [15/Dec/1995:11:42:50 -0800] "GET /" 200 3039
bragi.cc.cmu.edu - - [15/Dec/1995:22:30:21 -0800] "GET /" 200 3039
bragi.cc.cmu.edu - - [17/Dec/1995:21:08:52 -0800] "GET /" 200 3039
bragi.cc.cmu.edu - - [22/Dec/1995:06:51:27 -0800] "GET /" 200 3413
Nothing appears in the agents log. mj From owner-robots Sat Dec 23 05:57:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08742; Sat, 23 Dec 95 05:57:14 -0800 Message-Id: <01BAD181.37E12BC0@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: Inter-robot Comms Port Date: Sat, 23 Dec 1995 21:54:20 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Has anyone thought about applying for a TCP port number dedicated for intercommunications between various robots, as well as the additional protocol for exchange of info. As many of you will have seen, and some I have spoken to, I have developed a web crawler (http://funnelweb.net.au) which performs searches/indexing for the South Pacific countries (based in Australia), selectable by individual country. I have received a lot of queries in regard to others using the code for various projects in other countries (and even internal corporate networks). As a result, I'm currently implementing a distributed searching/indexing facility. The best approach I can think of is to have a dedicated port which can be used by remote agents to either conduct searches of another country's data or to register a URL for processing and indexing by that agent for the remote database (hope that makes sense). Any comments on this would be GREATLY appreciated.
Regards, David Eagles PlaNET Consulting Pty Limited Brisbane Australia From owner-robots Tue Dec 26 12:42:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04648; Tue, 26 Dec 95 12:42:19 -0800 Message-Id: <199512262042.PAA25736@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Inter-robot Comms Port In-Reply-To: Your message of "Sat, 23 Dec 1995 21:54:20." <01BAD181.37E12BC0@pluto.planets.com.au> Date: Tue, 26 Dec 1995 15:42:08 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com this is precisely what harvest http://www.cs.colorado.edu/harvest provides for, with an information model which is widely applicable. this software provides a very nice place to build such systems on top of, imo. for some of the related TRs see.. http://harvest.cs.colorado.edu/harvest/user-manual-1.1/node73.html -john > Has anyone thought about applying for a TCP port number dedicated for > intercommunications between various robots, as well as the additional > protocol for exchange of info. As many of you will have seen, and some > I have spoken to, I have developed a web crawler > (http://funnelweb.net.au) which performs searches/indexing for the South > Pacific countries (based in Australia), selectable by individual > country. I have received a lot of queries in regard to others using the > code for various projects in other countries (and even internal > corporate networks). As a result, I'm currently implementing a > distributed searching/indexing facility. The best approach I can think > of is to have a dedicated port which can be used by remote agents to > either conduct searches of another country's data or to register a URL > for processing and indexing by that agent for the remote database (hope > that makes sense). > > Any comments on this would be GREATLY appreciated.
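The proposal quoted above names two operations a remote agent might perform: searching another country's data, and registering a URL for indexing. As a purely illustrative sketch, one way to picture such an exchange is a one-line text request. Nothing here comes from the thread beyond those two operations: the command words SEARCH and REGISTER, the region label, the RobotRequest type and the parse_request function are all hypothetical, and no port number or wire format was ever specified in the discussion.

```python
from dataclasses import dataclass


@dataclass
class RobotRequest:
    command: str   # "SEARCH" or "REGISTER" (hypothetical command words)
    region: str    # hypothetical region label, e.g. "au"
    argument: str  # query text for SEARCH, URL for REGISTER


def parse_request(line: str) -> RobotRequest:
    """Parse one request line of the assumed form '<COMMAND> <region> <argument>'."""
    parts = line.strip().split(" ", 2)
    if len(parts) != 3:
        raise ValueError("expected '<COMMAND> <region> <argument>'")

    command, region, argument = parts
    command = command.upper()
    if command not in ("SEARCH", "REGISTER"):
        raise ValueError(f"unknown command: {command}")

    # Registration only makes sense for something that looks like a URL.
    if command == "REGISTER" and not argument.startswith("http://"):
        raise ValueError("REGISTER needs an http:// URL")

    return RobotRequest(command, region.lower(), argument)
```

A receiving agent would read one line from the connection, call parse_request, and then either run the query against its local index or queue the URL for crawling. Harvest, mentioned in the reply above, has its own information model and is not what this sketch shows.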
From owner-robots Tue Dec 26 14:42:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09548; Tue, 26 Dec 95 14:42:43 -0800 Message-Id: <199512261040.CAA07443@www2> Date: Tue, 26 Dec 95 02:40:33 -0800 From: Super-User <root@www2> X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Inter-robot Comms Port X-Url: http://www.niyp.com/cgi/nyp_narrow.cgi Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >this is precisely what harvest http://www.cs.colorado.edu/harvest provides Don't let this stop you trying to build a better one, though! From owner-robots Tue Dec 26 16:47:38 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14539; Tue, 26 Dec 95 16:47:38 -0800 Date: Wed, 27 Dec 1995 01:45:49 +0100 (GMT+0100) From: Carlos Baquero <cbm@di.uminho.pt> To: Super-User <root@www2.webcrawler.com> Cc: robots@webcrawler.com Subject: Re: Inter-robot Comms Port In-Reply-To: <199512261040.CAA07443@www2> Message-Id: <Pine.LNX.3.91.951227013834.154C-100000@poe.di.uminho.pt> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 565 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Tue, 26 Dec 1995, Super-User wrote: > >this is precisely what harvest http://www.cs.colorado.edu/harvest provides > > Don't let this stop you trying to build a better one, though! > Yes. Specially if its a common interface for the interchange of information among robot databases. But profit might interfere with such a project. 
Carlos Baquero PhD Student, Distributed Systems Fax +351 (53) 612954 University of Minho, Portugal Voice +351 (53) 604475 cbm@di.uminho.pt http://shiva.di.uminho.pt/~cbm From owner-robots Thu Dec 28 13:38:21 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09788; Thu, 28 Dec 95 13:38:21 -0800 To: robots@webcrawler.com Subject: Re: Unfriendly Lycos , again ... X-Url: http://www.miranova.com/%7Esteve/ References: <199512221530.HAA09354@sparty.surf.com> From: steve@miranova.com (Steven L. Baur) Date: 28 Dec 1995 13:36:07 -0800 In-Reply-To: Murray Bent's message of 21 Dec 1995 19:29:36 -0800 Message-Id: <m2u42keurc.fsf@diana.miranova.com> Organization: Miranova Systems, Inc. Lines: 8 X-Mailer: September Gnus v0.26/XEmacs 19.13 Mime-Version: 1.0 (generated by tm-edit 7.38) Content-Type: text/plain; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Is anyone getting requests from an anonymous robot, presumably, > from the lycos domain presumably (cmu.edu), as follows .. BRAGI.CC.CMU.EDU - - [16/Dec/1995:19:18:55 +0800] "GET /" 200 2427 One request since August doesn't seem unfriendly to me. -- steve@miranova.com baur From owner-robots Thu Dec 28 17:47:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21949; Thu, 28 Dec 95 17:47:01 -0800 Message-Id: <01BAD5E2.CD1A7BA0@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: Inter-robot Communications - Part II Date: Fri, 29 Dec 1995 11:42:56 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Well, I never expected to receive such a favourable response about a standard port/protocol for communication between robots.
Although the work being done by Harvest was mentioned several times, the great majority of people who replied thought the Harvest system was too complicated now, and I believe it also lacks some useful features (and it's not on a standardised port yet).

I'm going away for a couple of weeks, but I'll put some thought into it during that time. Any comments, requests, ideas for any aspect would be greatly appreciated (after all, that's how the Internet was built). When I return I'll set up a part of my WWW server dedicated to this project (think I'll call it Project Asimov - seems appropriate for a global robot communications system).

The key features I have thought of so far are listed below, so you can comment on these also (i.e. tell me if I'm being too stupid/ambitious/etc.)

1. Dedicated port approved as an Internet standard port number. (What does this require?)

2. Protocol (similar to FTP I think) which allows remote agents to exchange URLs, perform searches and get the results in a standard format, database mirroring(?), etc. The idea behind this is that if Robot A finds a URL handled by another remote Robot (such as by domain name, keywords(?), etc.), then it can inform the remote robot of its existence. Similarly, if a user wants to search for something which happens to be handled by the remote server, a standard data format will be returned which can then be presented in any format.

3. A method of correlating Robots with specialties (what the robot is for). An approach similar to DNS may come in handy here - limited functionality could be obtained by using a "hosts" type file (called "robots"?), while large scale, transparent functionality would probably require a centralised site which would maintain a list of all known robots and their specialties. Remote robots would download the list (or search parts of it) as required. This could probably be another protocol command on the port above.

4.
A standard set of data, plus some way to extend it for implementation specific users. I use the following fields in FunnelWeb:

    URL
    Title (from <TITLE>)
    Headings (from <Hx>)
    Link Descriptions (from <A HREF="">...</A>)
    Keywords (from user entry)
    Body Text (from all other non-HTML text)
    Document Size (from Content-Length: server field)
    Last-Modified Date (from Last-Modified: server field)
    Time-To-Live (server dependent)

This also highlights one MAJOR consideration - These fields are generally only useful to HTML robots. Something needs to be considered to handle any input format, including FTP, WAIS and GOPHER.

Well, this is now MUCH longer than I first intended it to be. Sorry to have wasted your time and bandwidth. Hope you all had a great Christmas and have a Happy New Year. Regards, David
From owner-robots Fri Dec 29 06:06:20 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22493; Fri, 29 Dec 95 06:06:20 -0800 Message-Id: <9512291705.AA3706@wscnotes.hammer.net> To: robots <robots@webcrawler.com> From: "Christopher J. Tomasello/WSC" <Christopher_J.._Tomasello@hammer.net> Date: 29 Dec 95 9:04:46 EDT Subject: unknown robot Mime-Version: 1.0 Content-Type: Text/Plain Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Anyone with information on this robot please respond to the group or preferably to me ctomasello@hammer.net There is a robot hitting our web server on a regular basis. It hits every file on the server at a very rapid rate (many requests per second). The curious thing about this is that the robot is using our IP/domain name to gain access. So in the log files it looks like one of our internal servers is hitting the site. Also, all the requests this robot makes are returning 404 errors. I have heard rumors that the Alta Vista spider is doing this kind of spoofing - but I have also heard that it is not. Any information would be greatly appreciated.
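One way to put numbers on a report like the one above is to scan the access log for hosts that issue many requests within the same second and for hosts whose requests mostly return 404. The following is only a rough sketch, assuming the server writes Common Log Format lines like the ones quoted earlier on this list; the function name and the sample address 10.0.0.5 are invented for illustration.

```python
import re
from collections import Counter

# Common Log Format, as in the log excerpts quoted on this list, e.g.
#   bragi.cc.cmu.edu - - [15/Dec/1995:11:42:50 -0800] "GET /" 200 3039
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)


def summarize(lines):
    """Count requests per host, 404s per host, and requests per (host, second)."""
    hits, not_found, per_second = Counter(), Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # ignore lines that are not in the assumed format
        host = match.group("host")
        hits[host] += 1
        if match.group("status") == "404":
            not_found[host] += 1
        # The timestamp has one-second resolution, so this key is "per second".
        per_second[(host, match.group("when"))] += 1
    return hits, not_found, per_second


# Hypothetical sample: one ordinary request and a burst of 404s from one host.
sample = [
    'bragi.cc.cmu.edu - - [15/Dec/1995:11:42:50 -0800] "GET /" 200 3039',
    '10.0.0.5 - - [29/Dec/1995:09:00:01 -0500] "GET /a.html" 404 -',
    '10.0.0.5 - - [29/Dec/1995:09:00:01 -0500] "GET /b.html" 404 -',
    '10.0.0.5 - - [29/Dec/1995:09:00:01 -0500] "GET /c.html" 404 -',
]

hits, not_found, per_second = summarize(sample)
for (host, when), count in per_second.most_common(3):
    print(host, when, count, "requests in that second")
```

A host that tops both the per-second and the 404 counts is a reasonable first suspect, though the log alone cannot show whether the address was spoofed.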
From owner-robots Fri Dec 29 08:26:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29356; Fri, 29 Dec 95 08:26:19 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad09a89121ec@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 29 Dec 1995 08:26:27 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Inter-robot Communications - Part II Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 4:42 AM 12/29/95, David Eagles wrote: > Well, I never expected to receive such a favourable response > about a standard port/protocol for communication between robots. Cool. Many robot authors among the respondents? >The key features I have thought of so far as listed below, > so you can comment on these also (ie. tell m,e if I'm being too > stupid/ambitious/etc) > >1. Dedicated port approved as an Internet standard port number. > (What does this require?) Not sure, but there's no point until there is an RFC which specifies what the protocol on that port is doing, so concentrate on that first. >2. Protocol (similar to FTP I think) which allows remote agents > to exchange URL's, perform searchs and get the results in a standard > format, database mirroring(?), etc. Why on earth like FTP? FTP is reasonably complex and inefficient. If we're talking about the web, use HTTP! We know how/that that works, we have many implementations, and it's reasonably OK. It's at least as efficient as FTP for this kind of thing, and when HTTP/NG comes along you can just plug that in. This allows you to concentrate on just the data format; so just invent a new Media type: text/foo. > The idea behind this is that if Robot A finds a URL handled by another > remote Robot (such as by domain name, keywords(?), etc), then it can > inform the remote robot of it's existance.
This would be easily deployable if you use HTTP: POST a form or PUT a file using a client library such as libwww-perl, and handle it in a CGI script. Hey, we'll just link it to our submit form :-) This has been discussed before actually, we never got time to make it go anywhere... > Similarly, if a user wants to search for something which happens to be > handled by the remote server, a standard data format will be returned > which can them be presented in any format. Distributed interactive searching is more complicated than that though... what do you do when there are three thousand of these servers around? It is also complicated because these days you don't want all results; there are too many of them. But massaging the results using whatever selection and relevance feedback is very robot-specific, because everyone uses different kinds of search engines. This sounds to me like a problem for which there is no good and easy answer. However, it'd be nice to come up with a way to efficiently hoover other robots for URL's; this could be done with a mechanism such as you describe. What we can learn from Harvest is that these issues can be separated into different processes, making it all a bit more flexible and clear. >3. A method of correlating Robots with specialties (what the robot is for). > An approach similar to DNS may come in handy here - > limited functionality could be obtained by using a "hosts" type file > (called "robots" ?), while large scale, transparent functionality would > probably require a centralised site which would maintain a list of all > know robots and their specialties. Remote robots would download the > list( or search parts of it) as required. This could probably be > another protocol command on the port above. The words "scalable" and "centralised site" don't mix :-) Hmmm... expressing "what the robot is for" is probably very difficult. Meta information categorization is always a nightmare. What classification to use? >4.
A standard set of data, plus some way to extend it for implementation > specific users. I use the following fields in FunelWeb > URL > Title (from <TITLE>) > Headings (from <Hx>) > Link Descriptions (from <A HREF="">...</A>) > Keywords (from user entry) > Body Text (from all other non-HTML text) > Document Size (from Content-Length: server field) > Last-Modified Date (from Last-Modified: server field) > Time-To-Live (server dependant) Hmm, this is where it gets tricky. URL, Title, and Keywords are obvious. Content-length and Last-Modified Date sound good, but do under-represent the HTTP server response; what about Content-language and other variants? Headings, link descriptions, and body text: hmmm. Which headers, how are they ordered? same for links? What is "body text" in the company of HTML tables etc? What about losing info from in-line images and HEAD elements? What about frames? This is a slippery slope; why not simply send the entire document content compressed? (As efficient, and gives complete freedom.) Also check out the URC work, sounds like some potential overlap here. (Damn, I'm starting to sound like Dan Connolly :-) > This also highlights one MAJOR consideration - These fields are generally > only useful to HTML robots. Something needs to be considered to handle > any input format, including FTP, WAIS and GOPHER. Even if you ignore that for now you'd be scoring... I'll share a different idea I had about this stuff (oe, or do I need to patent it first these days?). If in the distributed gathering part we start sending URL's, HTTP response headers, and complete content, doesn't the word "caching proxy" spring to mind? I need to think more about this, but it sounds to me that if we had an efficient way of updating and pre-loading distributed caches using a between-cache protocol, we'd be killing two birds with one stone: better caching performance than the current 30%, and complete freedom to do whatever you want with the content for robot purposes...
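The idea above of shipping the URL, the HTTP response headers, and the complete compressed document, rather than a fixed set of extracted fields, might look roughly like the sketch below. It is not any robot's actual wire format: the length-prefixed layout, the JSON metadata, and the names pack and unpack are all assumptions made for illustration.

```python
import json
import zlib


def pack(url, headers, body):
    """Bundle a URL, its response headers, and the zlib-compressed body.

    Hypothetical layout: 4-byte big-endian metadata length, then the
    metadata as JSON, then the compressed document content.
    """
    meta = json.dumps({"url": url, "headers": headers}).encode("utf-8")
    return len(meta).to_bytes(4, "big") + meta + zlib.compress(body)


def unpack(blob):
    """Reverse pack(): return (url, headers, body)."""
    meta_len = int.from_bytes(blob[:4], "big")
    meta = json.loads(blob[4:4 + meta_len].decode("utf-8"))
    body = zlib.decompress(blob[4 + meta_len:])
    return meta["url"], meta["headers"], body


# Hypothetical document; repetitive HTML compresses well.
document = b"<TITLE>Example</TITLE><H1>Example</H1>" * 40
blob = pack(
    "http://www.example.com/",
    {"Content-Type": "text/html", "Last-Modified": "Fri, 29 Dec 1995 08:26:27 GMT"},
    document,
)
print(len(document), "bytes of content packed into", len(blob), "bytes")
```

Because the receiver gets the whole document back, it can extract headings, links, or body text by its own rules, which is the freedom the message argues for. The resulting bytes could be sent as the body of an HTTP POST under whatever media type the participants agree on.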
Happy New Year, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Fri Dec 29 08:50:27 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00753; Fri, 29 Dec 95 08:50:27 -0800 Message-Id: <199512291646.OAA01900@desterro.edugraf.ufsc.br> X-Sender: fernando@edugraf.ufsc.br Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 29 Dec 1995 14:43:48 -0400 To: robots@webcrawler.com From: fernando@edugraf.ufsc.br (Luiz Fernando) Subject: Re: unknown robot X-Mailer: <PC Eudora Version 1.4> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 09:04 AM 12/29/95 EDT, robots@webcrawler.com wrote: >I have heard rumors that the Alta Vista spider is doing this kind of spoofing >but I have also heard that it is not. Any information would be greatly >appreciated. btw, I would like to receive also any info on the AltaVista's inner workings, pse fernando ---------------------------------------- fernando@hipernet.ufsc.br http://www.hiperNet.ufsc.br From owner-robots Fri Dec 29 09:47:16 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02998; Fri, 29 Dec 95 09:47:16 -0800 From: <monier@pa.dec.com> Message-Id: <9512291743.AA05536@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: unknown robot In-Reply-To: Your message of "29 Dec 95 09:04:46 EDT." <9512291705.AA3706@wscnotes.hammer.net> Date: Fri, 29 Dec 95 09:43:17 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Gang, I'm the father of Scooter, the robot behind Alta Vista. I can garantee you that the robot does not do anything funny: no IP spoofing or other arguable behavior. It is usually run from scooter.pa-x.dec.com, sometimes for short tests from inside the Digital firewall. It sets the following fields: User-Agent: Scooter/1.0 scooter@pa.dec.com From: scooter@pa.dec.com and it is registered at Martijn's site. 
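As a small illustration of the two identifying fields just listed, a robot could put them on every request it sends. This is only a sketch of the request text, not Scooter's code; the function name, the target host www.example.com, and the optional Host line are assumptions.

```python
def build_request(path, host, agent, contact):
    """Return the text of an HTTP/1.0 GET request that identifies the robot."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",         # optional in HTTP/1.0, included here as an assumption
        f"User-Agent: {agent}",  # names the robot and its version
        f"From: {contact}",      # e-mail address of the person responsible
        "",
        "",
    ]
    return "\r\n".join(lines)


request = build_request(
    "/robots.txt",
    "www.example.com",  # hypothetical target host
    "Scooter/1.0 scooter@pa.dec.com",
    "scooter@pa.dec.com",
)
print(request)
```

With both fields set, a server administrator who sees unusual traffic can tell from the agent log which robot it was and whom to contact.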
It's anything but a stealth robot. So please help squash this rumor. I'll run this message by our network gurus, they might think of ways of catching the bad guys. Cheers, --Louis From owner-robots Fri Dec 29 12:10:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08527; Fri, 29 Dec 95 12:10:00 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199512292009.PAA24281@umbc8.umbc.edu> Subject: Re: Inter-robot Communications - Part II To: robots@webcrawler.com Date: Fri, 29 Dec 1995 15:09:51 -0500 (EST) In-Reply-To: <01BAD5E2.CD1A7BA0@pluto.planets.com.au> from "David Eagles" at Dec 29, 95 11:42:56 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 2707 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "DE" == David Eagles spake thusly: DE> DE> Well, I never expected to receive such a favourable response about a DE> standard port/protocol for communication between robots. Although the DE> work being done by Harvest was mentioned several times, the great DE> majority of people who replied thought the Harvest system was too DE> complicated now, and I believe it also lacks some useful features (and DE> it's not on a standardised port yet). DE> DE> I'm going away for a couple of weeks, but I'll put some thought into it DE> during that time. Any comments, requests, ideas for any aspect would be DE> greatly appreciated (after all ,that's how the Internet was built). DE> When I return I'll setup a part of my WWW server dedicated to this DE> project (think I'll call it Project Asimov - seems appropriate for a DE> global robot communications system). DE> DE> The key features I have thought of so far as listed below, so you can DE> comment on these also (ie. tell m,e if I'm being too DE> stupid/ambitious/etc) DE> DE> 1. Dedicated port approved as an Internet standard port number.
(What DE> does this require?) DE> 2. Protocol (similar to FTP I think) which allows remote agents to DE> exchange URL's, perform searchs and get the results in a standard DE> format, database mirroring(?), etc. The idea behind this is that if DE> Robot A finds a URL handled by another remote Robot (such as by domain DE> name, keywords(?), etc), then it can inform the remote robot of it's DE> existance. Similarly, if a user wants to search for something which DE> happens to be handled by the remote server, a standard data format will DE> be returned which can them be presented in any format. DE> 3. A method of correlating Robots with specialties (what the robot is DE> for). An approach similar to DNS may come in handy here - limited DE> functionality could be obtained by using a "hosts" type file (called DE> "robots" ?), while large scale, transparent functionality would probably DE> require a centralised site which would maintain a list of all know DE> robots and their specialties. Remote robots would download the list( or DE> search parts of it) as required. This could probably be another DE> protocol command on the port above. [snip] Before doing anything new on robot/agent communication, you may wish to look into some of the already in-place efforts, e.g. KQML and KIF. Check out <http://www.cs.umbc.edu/kse>. They don't do everything you want, but they do provide a framework. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu If you believe in telekinesis, raise my hand.
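The "hosts"-type file floated in item 3 of the quoted proposal could be prototyped in a few lines. The sketch below assumes a hypothetical file format (one robot host per line, followed by the domain suffixes it handles) and routes a URL's host to the robot with the longest matching suffix; the mapping of au and nz to funnelweb.net.au follows the FunnelWeb description on this list, while robot.example.org and the function names are invented.

```python
def parse_registry(text):
    """Parse lines of the hypothetical form: <robot-host> <suffix> [<suffix> ...]"""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        robot, *suffixes = line.split()
        for suffix in suffixes:
            table[suffix.lower()] = robot
    return table


def robot_for(url_host, table):
    """Return the robot handling the longest matching domain suffix, or None."""
    labels = url_host.lower().split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])  # tried from longest to shortest
        if suffix in table:
            return table[suffix]
    return None


REGISTRY = """
# robot host          domain suffixes handled
funnelweb.net.au      au nz
robot.example.org     pt     # hypothetical entry
"""

table = parse_registry(REGISTRY)
print(robot_for("www.planets.com.au", table))
print(robot_for("shiva.di.uminho.pt", table))
print(robot_for("info.webcrawler.com", table))
```

A flat file like this only covers specialties that can be expressed as domain suffixes; keyword or subject specialties would need a richer description, which is where the classification concerns raised in this thread come in.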
From owner-robots Fri Dec 29 15:10:16 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17154; Fri, 29 Dec 95 15:10:16 -0800 Message-Id: <01BAD696.20CAA5A0@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: Inter-robot Communications - Part II Date: Sat, 30 Dec 1995 09:06:36 +-1100 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BAD696.20DB6E80" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'll share a different idea I had about this stuff (oe, or do I need to patent it first these days?). If in the distributed gathering part we start sending URL's, HTTP response headers, and complete content, doesn't the word "caching proxy" spring to mind? I need to think more about this, but it sounds to me that if we had an efficient way of updating and pre-loading distributed caches using a between-cache protocol, we'd be killing two birds with one stone: better caching performance than the current 30%, and complete freedom to do whatever you want with the content for robot purposes... I'd actually had the same idea, but admittedly hadn't thought of taking it as far as a distributed cache situation. I currently generate my database using the cache files from a fairly large Australian ISP. Just tar all the appropriate files I require (which is easy in the case of FunnelWeb because it uses domain names to determine if the data is relevant to it or not - I just tar *.au, *.nz, etc), then download them and let FunnelWeb go crazy. Accessing the local filesystem makes the initial data gathering very fast obviously, and I can then re-visit each host in the database and try a more thorough traversal of the site.
Now, if we had a distributed cache mechanism, I wouldn't need to grab their cache file anymore - the robot itself could either access the cache files directly, or talk to the local cache handler using the between-cache protocol. The storage format of the CERN proxy-cache is quite convenient for file access by robots (except it should compress the data - I haven't looked at it lately, so if it does now please ignore the last comment). Unfortunately, the same problems come up as I described in the last message. It is a waste of bandwidth, time and storage to completely duplicate entire caches. The ideal way would be to have some selection criteria, but what? Later all, David
From owner-robots Fri Dec 29 17:01:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22417; Fri, 29 Dec 95 17:01:25 -0800 To: robots@webcrawler.com Cc: John_R_R_Leavitt@NL.CS.CMU.EDU Subject: Re: Unfriendly Lycos , again ... In-Reply-To: Your message of "28 Dec 95 13:36:07 PST." <m2u42keurc.fsf@diana.miranova.com> Date: Fri, 29 Dec 95 20:00:37 EST Message-Id: <9464.820285237@NL.CS.CMU.EDU> From: John_R_R_Leavitt@NL.CS.CMU.EDU Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com steve@miranova.com (Steven L. Baur) wrote: >> Is anyone getting requests from an anonymous robot, presumably, >> from the lycos domain presumably (cmu.edu), as follows .. > >BRAGI.CC.CMU.EDU - - [16/Dec/1995:19:18:55 +0800] "GET /" 200 2427 > >One request since August doesn't seem unfriendly to me. >-- >steve@miranova.com baur Please be aware that cmu.edu != lycos.com. We had our roots in CMU (and you may note that some of us are still getting/sending mail from there), but our operations are now almost entirely moved over to the lycos.com domain. Those remaining are in the cs.cmu.edu (computer science) subdomain, not cc.cmu.edu (computer club). Also, our spiders have always clearly identified themselves with the User-agent header. -John. John R. R. Leavitt | Director, Product Development | Lycos Inc.
412 268 8259 | jrrl@lycos.com | http://agent2.lycos.com:8001/jrrl/ Reading: Half the Day is Night by Maureen McHugh From owner-robots Sat Dec 30 12:02:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14410; Sat, 30 Dec 95 12:02:15 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad0b372b6628@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 30 Dec 1995 12:02:24 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: RE: Inter-robot Communications - Part II Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 2:06 AM 12/30/95, David Eagles wrote: > Now, if we had a distributed cache mechanism, I wouldn't need to grab > their cache file anymore - the robot itself could either access the > cache files directly, or talk to the local cache handler using the > between-cache protocol. Quite. >The storage format of the CERN proxy-cache is quite convenient for file > access by robots (except it should compress the data - I haven't looked > at it lately, so if it does now please ignore the last comment). Hmmm... I have the feeling the CERN cache is far from ideal these days. > Unfortunately, the same problems come up as I described in the last > message. It is a waste of bandwidth, time and storage to completely > duplicate entire caches. The ideal way would be to have some > selection criteria, but what? You could do all sorts, but for just the caching side you can use the standard caching mechanisms, and base it on popularity etc. For content subject selection you'd have to find some other ways, but at least if you're sitting on a complete cache you have the freedom to choose. Wouldn't it be handy if you could run a java/Safe-perl/whatever selector on the remote cache, so it can choose for itself according to _your_ rules instead of the server? 
:-) Happy New Year all, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Sun Dec 31 08:54:08 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09461; Sun, 31 Dec 95 08:54:08 -0800 Date: Sun, 31 Dec 1995 11:49:24 -0600 (CST) From: gil cosson <gil@rusty.waterworks.com> To: robots@webcrawler.com Subject: please add my site Message-Id: <Pine.LNX.3.91.951231114800.15693A-100000@rusty.waterworks.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Please crawl my site, but don't kill me... I am at http://www.waterworks.com thanks, gil. From owner-robots Mon Jan 1 09:56:04 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08566; Mon, 1 Jan 96 09:56:04 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140804ad0dbf423937@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 1 Jan 1996 09:56:20 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: please add my site Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 11:49 AM 12/31/95, gil cosson wrote: >Please crawl my site, but don't kill me... :-) > I am at http://www.waterworks.com Before anyone else is tempted: this is not an appropriate message for this mailing list; the list is for technical discussions. To invite robots round, check out Submit-it! or submit by hand to the services you want. 
Happy New Year all, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Jan 2 00:40:53 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15807; Tue, 2 Jan 96 00:40:53 -0800 Date: Tue, 2 Jan 96 17:40:33 KST From: dhkim@sarang.kyungsung.ac.kr (Dong-Hyun Kim) Message-Id: <9601020840.AA04808@sarang.kyungsung.ac.kr> To: robots@webcrawler.com Subject: Please Help ME!! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi... my dear Everyone. Happy New Year.. I've run Harvest program. But It can't Search and Gather 2bytes Languages like korean, japaness, etc So.. I'm so blue.. How can I fix that problem... If it can be possible where do I fix? Please help me.... I want search 2bytes Languages~~~~ from... DH http://sarang.kyungsung.ac.kr:8585 From owner-robots Tue Jan 2 12:28:10 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21443; Tue, 2 Jan 96 12:28:10 -0800 Message-Id: <199601022028.PAA18749@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Inter-robot Communications - Part II In-Reply-To: Your message of "Sat, 30 Dec 1995 12:02:24 MST." <v02140800ad0b372b6628@[199.221.45.139]> Date: Tue, 02 Jan 1996 15:28:06 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com hi, i think some of the issues coming up could be resolved with a another approach to the protocol problem. i think that negotiated protocols would be one of the most important aspects of a robot to robot link so that new protocols may be created or old ones extended at will. in the simplest approach, protocols are named and versioned, which ident strings can be used for communication set up. this presumes a basic subset of all robot protocols, which would be stateless, of course, for transferring the version string. 
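The simplest scheme described above, where protocols are named and versioned and the ident strings are exchanged during set-up, could be sketched as follows. The "name/major.minor" ident syntax, the protocol names, and the function names are assumptions made for illustration; the message itself does not fix any of them.

```python
def parse_ident(ident):
    """Split an ident string such as 'urlx/1.1' into (name, (major, minor))."""
    name, _, version = ident.partition("/")
    major, _, minor = version.partition(".")
    return name, (int(major or 0), int(minor or 0))


def negotiate(ours, theirs):
    """For each protocol name, pick the highest version that both sides list."""
    best = {}
    for ident in set(ours) & set(theirs):
        name, version = parse_ident(ident)
        if name not in best or version > best[name][0]:
            best[name] = (version, ident)
    return {name: ident for name, (version, ident) in best.items()}


# Hypothetical ident strings advertised by two robots.
robot_a = ["urlx/1.0", "urlx/1.1", "search/2.0"]
robot_b = ["urlx/1.0", "urlx/1.1", "cache/1.0"]

print(negotiate(robot_a, robot_b))
```

Only the exchange of these strings has to be understood by every robot, which matches the point that the basic subset can stay stateless while anything stateful is left to the negotiated protocols.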
higher level protocols may want to be stateful, as some search engines/web sites are now for narrow-casting, or for caching. (www.sony.com: the magic cookies it puts into its forms expire, or at least appear to "expire", they're having lots of problems lately) more interestingly, a minimal (stateless) inter-robot language would also provide for more sophisticated forms of negotiation. for example, systems with dynamic local indexing, eg, expiring magic cookies, could negotiate an expired context with previous search info which effectively restores the lost context. in such an environment, caching info is predetermined. for example, saving GET strings with search info. this meets the ideas raised by martijn. as mentioned there are lots of ways to do this, which can be negotiated to some degree. downloading a Java or Safe-perl program is one way, and could be a subset of the proposed protocol, since various people would have various ideas on how to do this. for example, what kind of namespace the script enters, or are there a class of scripts with particular init arguments. these things provide for narrow casting techniques which would be valuable for "client robots", intelligent agents. so, ive presented a radical view which would tend to promote imaginations to prefer just downloading scripts to some anarchy of protocol extensions. but no one has to support any extension one doesnt want to. on the otherhand, with more and more Java browsers, it's possible to download Java code (protocols or protocol extensions) into clients, providing a means for such anarchy. so, can we provide a framework for this kind of environment? i think most of it is already provided via HTTP and Java. the merging of these things under a common umbrella just serves to solve concurrency problems like caching and promote robot interoperability, like a more open than opendoc, ie, corba, http://www.cilabs.org/ approach to robots. 
another approach would have negotiated semantics, ie, lay out a protocol for doing everything you ever want to do. a whole new language. i think protocol ident strings are useful as a family identifier, but the string approach would require something more, maybe a concatenation of extension ident strings, for identifying extensions. this is the deterministic perspective. a one-degree less deterministic perspective would have extensions negotiated as methods required, as required. the idea of "robustness in failures" as a general model of error catching and zero or more (degrees of) adaptation in communication can make a mess of intended versus actual semantics if applied to a protocol namespace. however, it's required anyway so the question is only to what degree it is a part of the architecture. von neumann permits it to be a principal component of his self-reproducing automata. this is the nondeterministic perspective. -john From owner-robots Wed Jan 3 06:43:07 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13806; Wed, 3 Jan 96 06:43:07 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199601031357.WAA08007@ktmp.kaist.ac.kr> Subject: Re: Please Help ME!! To: robots@webcrawler.com Date: Wed, 3 Jan 1996 22:57:23 +0900 (KST) In-Reply-To: <9601020840.AA04808@sarang.kyungsung.ac.kr> from "Dong-Hyun Kim" at Jan 2, 96 05:40:33 pm X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 484 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Plz use comp.infosystems.harvest newsgroup. I think that Harvest is not the interest of this mailing-list ... ;) > > Hi... my dear Everyone. > Happy New Year.. > > I've run Harvest program. > But It can't Search and Gather 2bytes Languages like korean, japaness, etc > So.. I'm so blue.. > > How can I fix that problem... If it can be possible where do I fix? 
> Please help me.... > I want search 2bytes Languages~~~~ > > from... DH > > http://sarang.kyungsung.ac.kr:8585 > From owner-robots Fri Jan 5 13:31:54 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12199; Fri, 5 Jan 96 13:31:54 -0800 Message-Id: <n1391273301.8920@mail.intouchgroup.com> Date: 5 Jan 1996 13:34:36 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: Infinite e-mail loop To: " " <robots@webcrawler.com> X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I would like to appologise to everyone on the robots list for the inconvenience which my brain damaged mail software's vacation autoresponder caused about three weeks ago. --Roger Dearnaley From owner-robots Fri Jan 5 16:10:17 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23704; Fri, 5 Jan 96 16:10:17 -0800 Date: Fri, 5 Jan 1996 19:10:05 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601060010.TAA06882@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Infinite e-mail loop In-Reply-To: <n1391273301.8920@mail.intouchgroup.com> References: <n1391273301.8920@mail.intouchgroup.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I would like to appologise to everyone on the robots list for the inconvenience which my brain damaged mail software's vacation autoresponder caused about three weeks ago. No big deal. How many replies would you like? 
:-) Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Fri Jan 5 17:57:45 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29373; Fri, 5 Jan 96 17:57:45 -0800 Date: Sat, 6 Jan 96 09:20:51 +1100 (EST) Message-Id: <v01530504ad13f013b1d5@[192.190.215.47]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Up to date list of Robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear sirs, Madams, We wish to make efficient use of services provided by active Robots. We have a submission service for new URLs called World Announce Archive at: http://www.com.au/aaa/linkform.html Does anyone have an up-to-date list of robots and search engines on the Web? The existing material is patchy. Keith AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa Postal: AAA Australia Announce Archive / Tourist Radio P.O. Box 202, Caringbah 2229 Australia From owner-robots Sun Jan 7 09:35:00 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21583; Sun, 7 Jan 96 09:35:00 -0800 Date: Sat, 6 Jan 1996 09:36:04 -0500 (EST) From: Matthew Gray <mkgray@Netgen.COM> X-Sender: mkgray@fairbanks To: robots@webcrawler.com Subject: Web Robot Message-Id: <Pine.SOL.3.91.960106092927.18700A-100000@fairbanks> Organization: net.Genesis Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Last night we were hit a few hundred times (spaced over many hours) by a robot calling itself 'Web Robot/OTWR:001p116 libwww/2.17' coming from 205.216.146.163.
It did not request /robots.txt but was otherwise perfectly reasonable. This one is not on the list of known robots. On a related note, I've been running a number of robots, with User-Agent's as follows: webTool Wander Mk1 Matthew Gray ---------------------------- voice: (617) 577-9800 x240 net.Genesis fax: (617) 577-9850 68 Rogers St. mkgray@netgen.com Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray From owner-robots Sun Jan 7 13:16:16 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04582; Sun, 7 Jan 96 13:16:16 -0800 Date: Mon, 8 Jan 96 08:15:57 +1100 (EST) Message-Id: <v01530503ad168491c3b4@[192.190.215.50]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: Web Robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Can someone please advise where we can find this list of known robots. James AAA AAA World Announce Archive Home: Australian Cool Site of the Day ! and Daily News. Web: http://www.com.au/aaa Email: radio@mpx.com.au Postal: AAA Australia Announce Archive / Tourist Radio P.O. Box 202, Caringbah 2229 Australia From owner-robots Sun Jan 7 14:28:33 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08532; Sun, 7 Jan 96 14:28:33 -0800 Comments: Authenticated sender is <jakob@cybernet.dk> From: "Jakob Faarvang" <jakob@jubii.dk> Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Sun, 7 Jan 96 23:30:07 +0100 (CET) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: Web Robots Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) X-Info: Evaluation version at mail.cybernet.dk Message-Id: 22300767208411@cybernet.dk X-Info: cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Can someone please advise where we can find this list of known robots. 
http://info.webcrawler.com/mak/projects/robots/active.html - Jakob Faarvang From owner-robots Sun Jan 7 14:53:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09636; Sun, 7 Jan 96 14:53:22 -0800 Date: Sun, 7 Jan 96 2:50:47 CET From: Thomas Stets <stets@stets.bb.bawue.de> Message-Id: <30ef26f7.stets@stets.bb.bawue.de> Subject: Does this count as a robot? To: robots@webcrawler.com X-Mailer: ELM [version 2.3 PL11] for OS/2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am currently writing (actually it's already written and I'm testing) a program to copy a subtree of a server to my machine. The reason behind this is that every so often I find a web site with some interesting information, but I don't have the time (or money - I have to pay for my connection) to study it all. From the first days of accessing the web I wished I could copy pages or complete subtrees to my computer, graphics and all. Well, now I can. :-) OTOH, I don't want to upset anyone with my program. Any comments are appreciated. Here is the basic functionality: - The program starts at a given URL and follows all links that are in the same directory or below. (Starting with http://x/a/b/c/... it would follow /a/b/c/d/... but not /a/b/e/...) (except for IMG graphics) - It will, optionally, follow links to other servers one level deep. - No links with .../cgi-bin/... or ?parameters are followed. - Only http: links are followed. - No document is requested twice. (To prevent loops) - It will identify itself with User-agent: and From: - It will use HEAD requests when refreshing pages. The program was started primarily for my own use, but I might release it as shareware (when I'm sure it's well-behaved). Since it is intended for the consumer market (it is written for OS/2), the users of this program will generally be connected by modem (in my case currently at 14,400 bps), which helps keep the used bandwidth down.
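The link-following rules listed above could be sketched as follows in Python. This is only an illustration under stated assumptions, not the OS/2 program described in the message: the function name `in_scope` is hypothetical, the optional one-level hop to other servers and the IMG exception are left out, and the rule about never requesting a document twice would need a separate set of visited URLs.

```python
from urllib.parse import urlsplit


def in_scope(start_url, link_url):
    """Return True if link_url may be followed under the rules listed above.

    Hypothetical helper: the name and signature are assumptions of this sketch.
    """
    start, link = urlsplit(start_url), urlsplit(link_url)
    # Only http: links are followed.
    if link.scheme != "http":
        return False
    # No links with ?parameters or .../cgi-bin/... are followed.
    if link.query or "/cgi-bin/" in link.path:
        return False
    # Stay on the starting server (the optional hop to other servers is omitted).
    if link.netloc.lower() != start.netloc.lower():
        return False
    # Follow only links in the starting document's directory or below it.
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return link.path.startswith(base_dir)
```

With the example from the message, a start URL of http://x/a/b/c/index.html accepts http://x/a/b/c/d/page.html and rejects anything under /a/b/e/.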
What I'd like to know: - Should this program use /robots.txt? Is it the type of program that robots.txt is supposed to control? It is basically a web browser; the retrieved pages will just be read offline. - How fast should I make my requests? Since this is not a robot in the sense that it visits many different hosts, and since it is not intended to traverse the whole server (after all, I have to store all the data on my PC and I have to pay for the connection), I'd rather not wait too long between requests. My idea is to read single pages in a similar way to how the IBM WebExplorer does it: read the main document and get all the embedded graphics as fast as possible. Then wait some time (some seconds) before making the next request. - How is the general feeling towards copying web pages for non-commercial use? TIA Thomas Stets -- ----------------------------------------------------------------------------- Thomas Stets ! Words shrink things that were Holzgerlingen, Germany ! limitless when they were in your ! head to no more than living size stets@stets.bb.bawue.de ! when they're brought out. CIS: 100265,2101 ! [Stephen King] ----------------------------------------------------------------------------- From owner-robots Mon Jan 8 01:59:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10834; Mon, 8 Jan 96 01:59:29 -0800 Date: Mon, 8 Jan 1996 09:59:44 GMT From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9601080959.AA09596@kronos> To: robots@webcrawler.com Subject: Re: Does this count as a robot? X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I am currently writing (actually it's already written and I'm testing) > a program to copy a subtree of a server to my machine. > Sounds like HTMLGOBBLE. Why not just use that?
I've been trying to fix some bugs in it, but its only real problem is that it does not respect robots.txt. Jeremy Ellman MARI Computer Systems From owner-robots Mon Jan 8 07:57:56 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24873; Mon, 8 Jan 96 07:57:56 -0800 Date: Mon, 8 Jan 1996 08:09:24 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: Does this count as a robot? In-Reply-To: <30ef26f7.stets@stets.bb.bawue.de> Message-Id: <Pine.LNX.3.91.960108080309.2792A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Sun, 7 Jan 1996, Thomas Stets wrote: > Here is the basic functionality: > > - The program starts at a given URL and follows all links that > are in the same directory or below. (Starting with http://x/a/b/c/... > it would follow /a/b/c/d/... but not /a/b/e/...) > (except for IMG graphics) > - It will, optionally, follow links to other servers one level deep. > - No links with .../cgi-bin/... or ?parameters are followed. > - Only http: links are followed. > - No Document is requested twice. (To prevent loops) > - It will identify itself with User-agent: and From: > - It will use HEAD requests when refreshing pages. From your description, it is vulnerable to looping still. Many sites use symbolic links from lower to upper levels. If you try to suck 'everything', you will end up in an infinite recursion. You need a depth limit (no more than X '/' elements in the URL), and probably a total pages limit (no more than Y pages total) to prevent any obscure cases from sucking it down an unexpected rat hole.
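The two limits suggested above might look like this in Python. The values chosen for X and Y and the names `MAX_DEPTH`, `MAX_PAGES`, `url_depth` and `CrawlBudget` are assumptions made for the sketch; nothing in the thread fixes them.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 8      # assumed value for X: most '/'-separated path elements allowed
MAX_PAGES = 1000   # assumed value for Y: most pages fetched in one run


def url_depth(url):
    """Count the non-empty path elements in a URL."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


class CrawlBudget:
    """Tracks the two limits: per-URL depth and total pages fetched."""

    def __init__(self, max_depth=MAX_DEPTH, max_pages=MAX_PAGES):
        self.max_depth = max_depth
        self.max_pages = max_pages
        self.fetched = 0

    def allow(self, url):
        """Return True (and count the page) if the URL is within both limits."""
        if self.fetched >= self.max_pages:
            return False
        if url_depth(url) > self.max_depth:
            return False
        self.fetched += 1
        return True
```

A crawler would call `allow()` just before each request and skip the URL when it returns False, so a symbolic-link loop is cut off by whichever limit is reached first.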
-- Benjamin Franz From owner-robots Mon Jan 8 08:05:05 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25286; Mon, 8 Jan 96 08:05:05 -0800 Message-Id: <199601081603.LAA00699@revere.musc.edu> Comments: Authenticated sender is <lindroth@atrium.musc.edu> From: "John Lindroth" <lindroth@musc.edu> Organization: Medical University of South Carolina To: "Christopher J. Tomasello/WSC" <Christopher_J.._Tomasello@hammer.net>, robots@webcrawler.com Date: Mon, 8 Jan 1996 11:03:54 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: unknown robot Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Have you ruled out the AutoSurf option in the latest Mosaic (for Windows)? It basically does the same thing as a robot, only to create a table of URL links. -John > To: robots <robots@webcrawler.com> > From: "Christopher J. Tomasello/WSC" > <Christopher_J.._Tomasello@hammer.net> > Date: 29 Dec 95 9:04:46 EDT > Subject: unknown robot > Reply-to: robots@webcrawler.com > Anyone with information on this robot please respond to the group or preferably > to me ctomasello@hammer.net > > There is a robot hitting our web server on a regular basis. It hits every file > on the server at a very rapid rate (many requests per second). The curious > thing about this is that the robot is using our IP/domain name to gain access. > So in the log files it looks like one of our internal servers is hitting the > site. Also, all the requests this robot makes are returning 404 errors. > > I have heard rumors that the Alta Vista spider is doing this kind of spoofing - > but I have also heard that it is not. Any information would be greatly > appreciated.
> > > ============================================= John Lindroth Senior Systems Programmer Academic & Research Computing Services Center for Computing & Information Technology Medical University of South Carolina E-Mail: lindroth@musc.edu URL: http://www.musc.edu/~lindroth ============================================= Any opinions expressed are mine, not my employer's. And they may be wrong (gasp!) ============================================= From owner-robots Mon Jan 8 10:11:41 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00846; Mon, 8 Jan 96 10:11:41 -0800 Subject: Re: Does this count as a robot? From: YUWONO BUDI <yuwono@uxmail.ust.hk> To: robots@webcrawler.com Date: Tue, 9 Jan 1996 02:10:06 +0800 (HKT) In-Reply-To: <Pine.LNX.3.91.960108080309.2792A-100000@ns.viet.net> from "Benjamin Franz" at Jan 8, 96 08:09:24 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 1458 Message-Id: <96Jan9.021013hkt.19035-3+186@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > On Sun, 7 Jan 1996, Thomas Stets wrote: > > Here is the basic functionality: > > > > - The program starts at a given URL and follows all links that > > are in the same directory or below. (Starting with http://x/a/b/c/... > > it would follow /a/b/c/d/... but not /a/b/e/...) > > (except for IMG graphics) > > - It will, optionally, follow links to other servers one level deep. > > - No links with .../cgi-bin/... or ?parameters are followed. > > - Only http: links are followed. > > - No Document is requested twice. (To prevent loops) > > - It will identify itself with User-agent: and From: > > - It will use HEAD requests when refreshing pages. > > >From your description, it is vulnerable to looping still. Many sites use > symbolic links from lower to upper levels. If you try to suck > 'everything', you will end up in an infinite recursion. 
You need a depth > limit (no more than X '/' elements in the URL), and probably a total > pages limit (no more than Y pages total) to prevent any obscure cases > from sucking it down an unexpected rat hole. One trick that I use to get around symbolic-link loops is to detect any recurring path segment (a /x/) in a URL. Hopefully, no web author creates a subdirectory with the same name as its parent or grand* parent directory (in which case my robot would think there is a loop and stop there). So far (half a thousand sites we visited), I haven't seen such a case. -Budi. From owner-robots Mon Jan 8 10:42:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01557; Mon, 8 Jan 96 10:42:23 -0800 Message-Id: <9601081842.AA01551@webcrawler.com> To: robots@webcrawler.com Subject: Re: Does this count as a robot? In-Reply-To: Your message of "Tue, 09 Jan 96 02:10:06 +0800." <96Jan9.021013hkt.19035-3+186@uxmail.ust.hk> Date: Mon, 08 Jan 96 18:41:18 +0000 From: M.Levy@cs.ucl.ac.uk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >One trick that I use to get around symbolic-link loops is to >detect any recurring path segment (a /x/) in a URL. Hopefully, >no web author creates a subdirectory with the same name as >its parent or grand* parent directory (in which case my robot would >think there is a loop and stop there). So far (half a thousand >sites we visited), I haven't seen such a case. > >-Budi. er, yes, but if there was such a naming convention then you wouldn't be able to tell the difference between that and a recurring path segment. Would you? 
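The recurring-segment trick discussed above is small enough to sketch in Python. The function name `has_repeated_segment` is hypothetical, and treating the last path element as a file name rather than a directory is an assumption of this sketch. As the reply points out, the check cannot tell a real loop from a site that legitimately reuses a directory name, so it errs on the side of stopping.

```python
from urllib.parse import urlsplit


def has_repeated_segment(url):
    """Return True if any directory name occurs twice in the URL path.

    The final path element is treated as a file name and ignored
    (an assumption of this sketch).
    """
    parts = urlsplit(url).path.split("/")[:-1]   # drop the file name (or the empty tail)
    dirs = [seg for seg in parts if seg]         # ignore empty pieces from '//'
    return len(dirs) != len(set(dirs))
```

A robot using this would simply decline to fetch any URL for which the function returns True.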
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||Micah Levy Department of Computer Science || || University College London || ||Web Page: http://www.cs.ucl.ac.uk/students/M.Levy/ || ||Email: M.Levy@cs.ucl.ac.uk Cestor@delphi.com || || zcacma0@cs.ucl.ac.uk Micah@delphi.com || |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| From owner-robots Mon Jan 8 11:03:53 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01697; Mon, 8 Jan 96 11:03:53 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601081904.UAA01225@wsinis10.win.tue.nl> Subject: avoiding infinite regress for robots To: robots@webcrawler.com Date: Mon, 8 Jan 1996 20:04:19 +0100 (MET) In-Reply-To: <Pine.LNX.3.91.960108080309.2792A-100000@ns.viet.net> from "Benjamin Franz" at Jan 8, 96 08:09:24 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 1154 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Benjamin Franz wrote: >Many sites use >symbolic links from lower to upper levels. If you try to suck >'everything', you will end up in an infinite recursion. You need a depth >limit (no more than X '/' elements in the URL), and probably a total >pages limit (no more than Y pages total) to prevent any obscure cases >from sucking it down an unexpected rat hole. I'm surprised that no spider seems to use the page content to guess whether or not two document trees are equal. For example, one heuristic would be to keep a checksum for every visited page, and to decide that two subtrees are probably equal if their root nodes and children have identical checksums. Do spiders use the content to cut off walks, and if not, is it because alternative techniques are sufficient?
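One way the checksum heuristic above could be tried, as a rough Python sketch: each subtree root gets a signature made of its own checksum and those of its child pages, and a repeated signature marks a probable duplicate. CRC-32 from `zlib` is just one possible checksum, and `page_checksum`, `subtree_signature` and `SubtreeIndex` are names invented here, not taken from any spider mentioned in the thread.

```python
import zlib


def page_checksum(body):
    """CRC-32 of a page body (bytes), as an unsigned 32-bit integer."""
    return zlib.crc32(body) & 0xFFFFFFFF


def subtree_signature(root_body, child_bodies):
    """Checksum of the root page plus the checksums of its children, in link order."""
    return (page_checksum(root_body),
            tuple(page_checksum(body) for body in child_bodies))


class SubtreeIndex:
    """Remembers subtree signatures so that probable duplicates can be skipped."""

    def __init__(self):
        self._seen = {}   # signature -> URL of the first root seen with it

    def check(self, root_url, root_body, child_bodies):
        """Return the earlier root URL with the same signature, or None if new."""
        sig = subtree_signature(root_body, child_bodies)
        earlier = self._seen.get(sig)
        if earlier is None:
            self._seen[sig] = root_url
        return earlier
```

If `check()` returns a URL, the walk below the current root can be cut off, since the same content was already reached through that earlier URL. Checksums can collide, so this remains a guess, as the message itself says.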
Since my own spiders are rather simple-minded (and not widely used), I'd be interested in seeing a more informed opinion on the usefulness of comparing content. >Benjamin Franz -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From owner-robots Mon Jan 8 14:10:10 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02970; Mon, 8 Jan 96 14:10:10 -0800 Message-Id: <01BADE68.EAF50080@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: avoiding infinite regress for robots Date: Tue, 9 Jan 1996 08:03:08 +-1100 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BADE68.EAFE2840" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Benjamin Franz wrote: >Many sites use >symbolic links from lower to upper levels. If you try to suck >'everything', you will end up in an infinite recursion. You need a depth >limit (no more than X '/' elements in the URL), and probably a total >pages limit (no more than Y pages total) to prevent any obscure cases >from sucking it down an unexpected rat hole. > I'm surprised that no spider seems to use the page content to guess whether or not two document trees are equal. For example, one heuristic would be to keep a checksum for every visited page, and to decide that two subtrees are probably equal if their root nodes and children have identical checksums. Do spiders use the content to cut off walks, and if not, is it because alternative techniques are sufficient?
> Since my own spiders are rather simple-minded (and not widely used), I'd be interested in seeing a more informed opinion on the usefulness of comparing content. Yep. This is one of the ways FunnelWeb (the latest version, which I haven't quite released yet) checks for looping. What would be REALLY nice, however, would be if the HTTP spec were extended to include a Filename: field sent by the server for every request. The field would specify the exact filename after all links were resolved and would therefore eliminate a lot of the guesswork, parsing, etc. required by clients, spiders, etc. Hope everyone had a great New Year. Regards, David
From owner-robots Mon Jan 8 16:35:02 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03922; Mon, 8 Jan 96 16:35:02 -0800
Message-Id: <199601090034.TAA10724@honsu.cis.ohio-state.edu> Subject: Re: Does this count as a robot? To: robots@webcrawler.com Date: Mon, 8 Jan 1996 19:34:51 -0500 (EST) In-Reply-To: <9601081842.AA01551@webcrawler.com> from "M.Levy@cs.ucl.ac.uk" at Jan 8, 96 06:41:18 pm From: yuwono@uxmail.ust.hk (YUWONO BUDI) X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 864 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com According to your message: > > >One trick that I use to get around symbolic-link loops is to > >detect any recurring path segment (a /x/) in a URL. Hopefully, > >no web author creates a subdirectory with the same name as > >its parent or grand* parent directory (in which case my robot would > >think there is a loop and stop there). So far (half a thousand > >sites we visited), I haven't seen such a case. > > er, yes, but if there was such a naming convention then you wouldn't be able > to tell the difference between that and a recurring path segment. > Would you? I wouldn't. Then again I need not, because my philosophy on robot behavior is "be as non-aggressive as possible," so my robot would simply give it up. If I were a web site admin, I would appreciate that in a robot. Anyway, this trick was actually a 5-minute-hacking solution. -Budi. From owner-robots Tue Jan 9 01:39:03 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05871; Tue, 9 Jan 96 01:39:03 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199601090938.LAA18356@krisse.www.fi> Subject: Recursing heuristics (Re: Does this..) 
To: robots@webcrawler.com Date: Tue, 9 Jan 1996 11:38:47 +0200 (EET) In-Reply-To: <199601090034.TAA10724@honsu.cis.ohio-state.edu> from "YUWONO BUDI" at Jan 8, 96 07:34:51 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 2009 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I wouldn't. Then again I need not, because my philosophy on robot > behavior is "be as non-aggressive as possible," so my robot would > simply give it up. If I were a web site admin, I would appreciate > that in a robot. > Anyway, this trick was actually a 5-minute-hacking solution. > > -Budi. The robots home pages mention some heuristics that should be used in recursive traversal, but they do not currently include those recently mentioned here, let alone all that are necessary for a modern robot. I think it is time to collect a definitive list of minimum requirements and possible refinements for traversal algorithms. My robot Hämähäkki indexes the *.fi domain (Finland), currently 207147 URLs, and I could list the following rules it follows (expressed as something like regular expressions): - check the recursion depth limit - check with robots.txt, and check whether the path was already fetched and has not expired, whatever rules are used for that. - recurse only '.*/', '.*\.html?' and paths that seem to be just missing the ending '/' and usually cause redirection to an index. This means something like '.*/~?[a-zA-Z0-9]+' that does not match '.*bin.*', '.*cgi.*' or '.*\..*' As you see, I do not use a HEAD check for the type of every link as, for example, the MOMspider does. I might in the future. - drop paths like '.*/cgi-bin/.*', '.*[?=+].*' - drop paths like '.*\.html?/.*' - interpret things like '.*//.*', '.*/\./.*' and '.*/\.\./.*' correctly. These are quite restrictive and might make me miss something, but that's minor and they serve well.
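The path rules above translate fairly directly into Python's `re` module. The sketch below is an assumption-laden reading of the list, not Hämähäkki's actual code: the names `normalize` and `should_recurse` are hypothetical, the patterns are matched against the whole path, and the depth limit, robots.txt check and expiry check are left out.

```python
import posixpath
import re

# Patterns copied from the rules above; each is matched against the whole path.
DROP = [re.compile(p) for p in (r".*/cgi-bin/.*", r".*[?=+].*", r".*\.html?/.*")]
RECURSE = [re.compile(p) for p in (r".*/", r".*\.html?")]
BARE_DIR = re.compile(r".*/~?[a-zA-Z0-9]+")   # looks like a directory missing its '/'
NOT_BARE_DIR = [re.compile(p) for p in (r".*bin.*", r".*cgi.*", r".*\..*")]


def normalize(path):
    """Collapse '//', '/./' and '/../' in a path, keeping a trailing slash."""
    trailing = path.endswith("/")
    clean = posixpath.normpath("/" + path.lstrip("/"))
    if trailing and clean != "/":
        clean += "/"
    return clean


def should_recurse(path):
    """Decide whether a link path is worth fetching and parsing for more links."""
    path = normalize(path)
    if any(rx.fullmatch(path) for rx in DROP):
        return False
    if any(rx.fullmatch(path) for rx in RECURSE):
        return True
    return (BARE_DIR.fullmatch(path) is not None
            and not any(rx.fullmatch(path) for rx in NOT_BARE_DIR))
```

Under this reading, `/docs/`, `/docs/index.html` and `/~user` are followed, while `/cgi-bin/search`, `/find?name=x` and `/pics/logo.gif` are not.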
I am about to add recursion detection with content comparison by CRC shortly, even though it has not been a problem: only one or two sites out of 755 symlinked substantial subtrees, and they were easy to pick out by hand. Otherwise none of the sites hit the recursion limit. Am I missing something important here? Let's collect something useful for the first-time robot-writers. And others. From owner-robots Tue Jan 9 04:18:56 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06291; Tue, 9 Jan 96 04:18:56 -0800 Message-Id: <01BADEDF.A3F92F40@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: Recursion Date: Tue, 9 Jan 1996 22:12:58 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Seems to me like there are quite a few people using CRC-like methods to detect recursion. As I am in the process of trying to work out a means of inter-robot communication, I think it may be useful to use a standard CRC algorithm. This way, communicating robots can more quickly and easily determine which URLs to exchange/reject. Now the tricky part - everyone will have their own technique/algorithm for this, so what will the standard be? Does anyone have a particularly good algorithm they would care to make available? It should produce a value which can be represented on any machine architecture (i.e. doesn't use long longs, etc.). A single or double "long" value may be the simplest, but of course suggestions would be welcomed (I won't suggest using a crypto key generation algorithm, even though I'd like to).
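For the portability concern raised above, one possible convention (an assumption of this sketch, not something agreed on the list) is to take the common CRC-32, force it into an unsigned 32-bit range, and exchange it as eight hexadecimal digits so that no robot needs a type wider than a single long. The function name `portable_crc` is hypothetical.

```python
import zlib


def portable_crc(body):
    """CRC-32 of a page body (bytes) as eight lowercase hex digits."""
    value = zlib.crc32(body) & 0xFFFFFFFF   # force an unsigned 32-bit result
    return format(value, "08x")
```

Exchanging the value as fixed-width text sidesteps byte-order and word-size differences between the machines the robots run on.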
Regards, David From owner-robots Tue Jan 9 07:33:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06807; Tue, 9 Jan 96 07:33:37 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130500ad183995c409@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 9 Jan 1996 07:34:35 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Duplicate docs (was avoiding infinite regress...) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I'm surprised that no spider seems to use the page content to guess whether or >not two document trees are equal. For example, one heuristic would be to keep >a checksum for every visited page, and to decide that two subtrees are probably >equal if their root nodes and children have identical checksums. We've had requests for that behavior, not only due to sym links, but also because there are many copies of the same document within an enterprise network, and even more so when you're indexing large parts of the Internet. (Imagine how many copies of FAQs are out there, for example.) I think there are two main reasons it hasn't happened yet. One is just that it hasn't risen high enough in the priority list, at least for those of us who have commercial spider tools. For the most part, people are still happy just to get a spider *working* in a convenient, maintainable manner. Thus, most haven't even realized that sym links and duplicates are an issue. Second, the problem of duplicates is a slippery slope. It's probably not hard to find 80 or 90 percent of them, but getting the last bunch, which aren't *exact* duplicates, is going to take something quite clever, since brute force will probably be slow, at best.
Nick From owner-robots Tue Jan 9 13:13:02 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11321; Tue, 9 Jan 96 13:13:02 -0800 From: mabzug1@gl.umbc.edu Message-Id: <199601092112.QAA03572@umbc10.umbc.edu> Subject: Re: Recursion To: robots@webcrawler.com Date: Tue, 9 Jan 1996 16:12:37 -0500 (EST) In-Reply-To: <01BADEDF.A3F92F40@pluto.planets.com.au> from "David Eagles" at Jan 9, 96 10:12:58 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1063 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "DE" == David Eagles spake thusly: DE> DE> Seems to me like there are quite a few people using CRC-like methods to = [snip] DE> Now the tricky part - everyone will have they're own technique/algorithm = DE> for this so what will the standard be? Does anyone have a particularly = DE> good algorithm they would care to make available. It should produce a = [snip] DE> simple, but of course suggestions would be welcomed (I won't suggest = DE> using a crypto key generation algorithm, even though I'd like to). Might I suggest the standard 'message digest' algorithm, md5, described in rfc1321? An md5 header line is even (officially) part of HTTP, although I haven't seen too many servers that return it. . . yet. There's a standard C implementation, and Neil Winton even put together a Perl implementation. See <http://www.gl.umbc.edu/~mabzug1/md5/md5.html> for (marginally) more information. -- Mordechai T. 
Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu 1st rule of intelligent tinkering - save all the parts From owner-robots Tue Jan 9 15:46:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15896; Tue, 9 Jan 96 15:46:22 -0800 Message-Id: <01BADF3B.12252A40@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: Recursion Date: Wed, 10 Jan 1996 09:07:27 +-1100 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BADF3B.1246BC00" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ------ =_NextPart_000_01BADF3B.1246BC00 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable ---------- From: mabzug1@gl.umbc.edu[SMTP:mabzug1@gl.umbc.edu] Sent: Wednesday, January 10, 1996 3:12 To: robots@webcrawler.com Subject: Re: Recursion "DE" == David Eagles spake thusly: DE> DE> Seems to me like there are quite a few people using CRC-like methods to = [snip] DE> Now the tricky part - everyone will have they're own technique/algorithm = DE> for this so what will the standard be? Does anyone have a particularly = DE> good algorithm they would care to make available. It should produce a = [snip] DE> simple, but of course suggestions would be welcomed (I won't suggest = DE> using a crypto key generation algorithm, even though I'd like to). Might I suggest the standard 'message digest' algorithm, md5, described in rfc1321? An md5 header line is even (officially) part of HTTP, although I haven't seen too many servers that return it. . . yet. There's a standard C implementation, and Neil Winton even put together a Perl implementation. See <http://www.gl.umbc.edu/~mabzug1/md5/md5.html> for (marginally) more information. I agree totally. This was actually the crypto algorithm I was thinking of but couldn't think of the name.
The fact that HTTP already specifies its use (which I didn't know) doesn't really leave any other logical alternative. Guess this means I've got more work to do now :-( Thanks, David ------ =_NextPart_000_01BADF3B.1246BC00-- From owner-robots Wed Jan 10 06:48:31 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09610; Wed, 10 Jan 96 06:48:31 -0800 Date: Wed, 10 Jan 1996 09:48:22 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601101448.JAA20509@dolphin.automatrix.com> To: robots@webcrawler.com Subject: MD5 in HTTP headers - where? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com It was mentioned yesterday that there is an HTTP response header useful for sending back an MD5 digest.
I just did a quick scan through http://www.w3.org/hypertext/WWW/Protocols/HTTP/HTTP2.html and didn't find any response headers that looked like they were related to use of MD5. Can someone give me a pointer? Thanks, Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Wed Jan 10 07:32:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11794; Wed, 10 Jan 96 07:32:28 -0800 Date: Wed, 10 Jan 1996 10:35:10 -0500 (EST) From: Adam Jack <ajack@corp.micrognosis.com> X-Sender: ajack@becks To: Robots <robots@webcrawler.com> Subject: robots.txt extensions Message-Id: <Pine.SUN.3.91.960110095606.1141B-100000@becks> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello, Since this list started I've only ever seen one suggestion for an extension to robots.txt. That, from Tim Bray, http://info.webcrawler.com/mailing-lists/robots/0001.html seemed sensible enough -- to add expiry information for the robots.txt file itself. No response appears to have been given -- did people not think it worthwhile? Did people think the HTTP response field, Expires, should be used for that? I don't know if this was discussed to death somewhere -- but are people still considering extensions to robots.txt? I'd be interested in any pointers to an archive of such a discussion. If there is a point in discussing additions, please read on -- otherwise bin this mail. MinRequestInterval: X Minimum request interval in seconds (0 = no minimum), with a default, if missing, of 60. This is for those of us lowly enough not to have huge gathering tasks and the luxury ;-) of a backlog of URLs over distributed sites. (I.e. Those of us doing a sequential search exhausting our interest in a site in one slurp.)
Additionally local admins would have more control over wanderers that visit. DefaultIndex: index.html Stating that XXXX/ and XXXX/index.html are identical. You can argue that this is lamely inadequate - or that it makes a saving. I know the bigger issue is recursion. Here I am merely hoping to save those single page recursions. CGIMask: *.cgi Rather than guessing at CGI urls -- why not get the local admin to answer it? I know that the WN server uses a file extension to indicate a CGI script -- not /cgi-bin/. Q: Are CGI scripts universally avoided in advance -- or do robots look at the HTTP flags of results to try to work out whether some content is dynamically generated? Finally -- I never understood why robots.txt was exclusion only. Why does it not have some positive hints added? I.e. you are allowed & welcome to browse XXXX/fred.html. Was this a choice built upon pragmatism -- thinking that this would open a can of worms? Thanks for any feedback, Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html From owner-robots Wed Jan 10 08:12:34 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14262; Wed, 10 Jan 96 08:12:34 -0800 Message-Id: <199601101612.LAA17875@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com Date: Wed, 10 Jan 1996 11:08:47 -0400 Subject: Does anyone else consider this irresponsible? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Altavista, while a marvelous example of what can be done when you throw multiple hundreds of thousands of dollars at the idea of indexing Internet accessible resources, appears to extract data from a host by connecting to EVERY tcp port on the machine.
Each probe appears to look for an HTTP service and if found, walks the tree on that port. Ignoring, for the moment that there are 32,000 available ports to probe and that that many tcp connections would seem to be rather excessive... Does anyone else have a problem with this kind of behavior? While I am cognizant of the use of the robots.txt file, it seems more than a little antisocial to index materials that are, for all intents and purposes, unpublished. I, for one, do not believe that just because I run a server on a port, that that gives anyone permission to index and provide others navigation to the material I serve from that port. Many times, a client needs to have access to the service, in the same manner as a typical user, and imposing passwords on the service is an unacceptable burden. I'm looking for comments on this before I take it to a higher level. Thanks. </rr> From owner-robots Wed Jan 10 08:33:08 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15456; Wed, 10 Jan 96 08:33:08 -0800 Message-Id: <199601101632.LAA22167@northsea.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 1996 11:08:47 -0400." <199601101612.LAA17875@mail.internet.com> Date: Wed, 10 Jan 1996 11:32:33 -0500 From: Stan Norton <norton@northsea.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <199601101612.LAA17875@mail.internet.com>, "Robert Raisch, The Inter net Company" writes: > >Altavista, while a marvelous example of what can be done when >you throw multiple hundreds of thousands of dollars at the idea >of indexing Internet accessible resources, appears to extract >data from a host by connecting to EVERY tcp port on the machine. >... > >I'm looking for comments on this before I take it to a higher >level. > >Thanks. </rr> agreed. absurd behavior. 
Stan -- Stan Norton -- norton@northsea.com From owner-robots Wed Jan 10 08:47:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16272; Wed, 10 Jan 96 08:47:26 -0800 Message-Id: <199601101646.IAA00680@sparty.surf.com> Date: Tue, 09 Jan 96 20:45:01 -0800 From: Super-User <murrayb@surf.com> X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? X-Url: http://www.lombard.com/cgi-bin/PACenter/Graph/graph Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com This search engine seemed to me to use and promote the robots-exclusion protocol. Maybe there should be a "URL delete" facility, since the scooter robot was operating long before the search engine became so accessible. Given the amount of resources devoted to this, I'm sure they could provide a "URL delete" facility! Any other requests?? murrayb@surf.com From owner-robots Wed Jan 10 09:38:49 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19011; Wed, 10 Jan 96 09:38:49 -0800 From: <monier@pa.dec.com> Message-Id: <9601101732.AA21836@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 96 11:32:33 EST." <199601101632.LAA22167@northsea.com> Date: Wed, 10 Jan 96 09:32:44 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Gang, I am the father of Scooter, and for the second time I need to rescue my robot from groundless accusations. Soon I'll have enough material for a book... Scooter is a regular robot: it follows links, and only follows links. It does not guess IP addresses, or try out all possible files names (one of my favorite), or spy on sites to guess the "secret test port", or anything like that. 
In this particular instance, I have to insist that Scooter does not "extract data from a host by connecting to EVERY tcp port". Over 130,000 sites times 32,000 possible ports would amount to a lot of stupid pinging with not much return! The Web is large enough, there is no need to invent new and exotic techniques to access more data. My current estimate BTW is that there are at least 50 million Web pages (text of some sort) publicly available and indexable (not covered by a robots.txt file), so there is really no lack of raw material. Could the next person who feels an urge to speak for the Alta Vista robot please check with me first? Nothing about this project is very secret, all you have to do is ask. Cheers, --Louis From owner-robots Wed Jan 10 10:44:38 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22425; Wed, 10 Jan 96 10:44:38 -0800 Message-Id: <199601101844.NAA21870@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com Date: Wed, 10 Jan 1996 13:40:45 -0400 Subject: Re: Does anyone else consider this irresponsible? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Re: Hoped for URL Delete facility in Altavista 1. There is no economic incentive for AV to provide such a feature. In fact, there is a strong disincentive as this would affect their claims of "N millions of URLs listed." 2. Why should I have to "unlist" something that they (IMHO) never should have harvested in the first place? 
</rr> From owner-robots Wed Jan 10 11:16:30 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23625; Wed, 10 Jan 96 11:16:30 -0800 Date: Wed, 10 Jan 1996 14:16:15 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601101916.OAA21393@dolphin.automatrix.com> To: robots@webcrawler.com Cc: bob@dolphin.automatrix.com, dick@dolphin.automatrix.com Subject: Responsible behavior, Robots vs. humans, URL botany... In-Reply-To: <199601101612.LAA17875@mail.internet.com> References: <199601101612.LAA17875@mail.internet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Robert Raisch writes: ... indexing Internet accessible resources, appears to extract data from a host by connecting to EVERY tcp port on the machine. I suspect it's a case of some fool deciding that the ends justify the means. On the Lycos search page they proudly announce: Lycos indexes 91% of the web! Select that link and you get: Lycos has indexed over 10.75 million pages throughout the world.... What could the Alta Vista folks do to top that? How about: You have access to all 8 billion words found in over 16 million Web pages. One way to get to stuff the Lycos folks couldn't find was to be a little more rapacious (ooh, I like that word - makes me think of Jurrasic Park...). <digression> Not to let the AV folks be the only ones getting jabbed, I'll take advantage of the opportunity to jab Lycos a little. They have a small table on their 91% page: Lycos 91% 10.75 Million Open Text 12% 0.80 Million Infoseek 6% 0.40 Million Yahoo <1% 0.05 Million It is obviously a case of apples and oranges to compare Lycos with Yahoo (I can't comment on the others, although I believe they use robots as well), since Yahoo is a reasonably well-organized human-built index. I tend to be able to find things in Yahoo. 
Lycos, for all the scoring, abstracts, searching options, yadda, yadda, yadda, is still a robot-generated index with all the problems for us mere humans that implies. We tend to like things a bit more structured. I don't normally find poring over a robot's search engine output all that fruitful. I still can't seem to write queries to any of the search engines that provide all that great a "usefulness quotient", even with a degree in Computer Science. If most of what's out there is crap (for the sake of argument, let's just pick a number out of thin air, say, 91%... :-), users of Lycos and the other robot indexes are bound to need real big shovels. On the other hand, presumably the Yahoo folks or the submitters of URLs to Yahoo at least sniff the URLs before deciding whether to add them to the database. In addition, Yahoo tends to index the trunks of URL trees (which I find more useful), not every friggin' leaf and branch. Hypothetical conversation between two botanists on a field trip: Ooh, Bob! look at this oak leaf! It sure is a whole lot different than the one we found on that other tree! Let's remember where we found it! Put that other one back... Has anyone considered adding an option to the various robot search engines that would restrict the depth of URLs returned to a query or at least use the number of components in a URL's path to help score the page? </digression> Sorry for the digression. I'm done venting. Please return to work now. Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Wed Jan 10 11:20:59 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23825; Wed, 10 Jan 96 11:20:59 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601101920.OAA20604@umbc8.umbc.edu> Subject: Re: MD5 in HTTP headers - where? 
To: robots@webcrawler.com Date: Wed, 10 Jan 1996 14:20:46 -0500 (EST) In-Reply-To: <199601101448.JAA20509@dolphin.automatrix.com> from "Skip Montanaro" at Jan 10, 96 09:48:22 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 737 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "SM" == Skip Montanaro spake thusly: SM> SM> SM> It was mentioned yesterday that there is an HTTP response header useful for SM> sending back an MD5 digest. I just did a quick scan through SM> SM> http://www.w3.org/hypertext/WWW/Protocols/HTTP/HTTP2.html SM> SM> and didn't find any response headers that looked like they were related to SM> use of MD5. Can someone give me a pointer? As the person who first made the claim, guess the burden of proof is on me. See the IETF draft, available at: <http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-00.txt>. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu It's hard to RTFM when you can't find the FM. . . From owner-robots Wed Jan 10 11:24:30 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24010; Wed, 10 Jan 96 11:24:30 -0800 Message-Id: <199601101924.LAA29177@scam.XCF.Berkeley.EDU> X-Authentication-Warning: scam.XCF.Berkeley.EDU: Host localhost [127.0.0.1] didn't use HELO protocol To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 1996 11:32:33 EST." 
<199601101632.LAA22167@northsea.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Id: <29174.821301875.1@scam.XCF.Berkeley.EDU> Date: Wed, 10 Jan 1996 11:24:35 -0800 From: Eric Hollander <hh@scam.XCF.Berkeley.EDU> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >In message <199601101612.LAA17875@mail.internet.com>, "Robert Raisch, The Inte r >net Company" writes: >> >>Altavista, while a marvelous example of what can be done when >>you throw multiple hundreds of thousands of dollars at the idea >>of indexing Internet accessible resources, appears to extract >>data from a host by connecting to EVERY tcp port on the machine. >>... > >agreed. absurd behavior. you'll probably find more interesting data if you scan udp ports, anyway. e From owner-robots Wed Jan 10 11:29:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24311; Wed, 10 Jan 96 11:29:28 -0800 From: mnorman@netcom.com (Mark Norman) Message-Id: <199601101928.LAA04725@netcom11.netcom.com> Subject: Re: Does anyone else consider this irresponsible? To: robots@webcrawler.com Date: Wed, 10 Jan 1996 11:28:55 -0800 (PST) In-Reply-To: <9601101732.AA21836@evil-twins.pa.dec.com> from "monier@pa.dec.com" at Jan 10, 96 09:32:44 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 182 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Your reply to the complaint said your robot finds web sites just by following links. But what links does it start with? Thanks, and thanks for participating in this mail list. bye! 
From owner-robots Wed Jan 10 11:52:06 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25506; Wed, 10 Jan 96 11:52:06 -0800 Message-Id: <199601101951.OAA23476@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com, monier@pa.dec.com Date: Wed, 10 Jan 1996 14:48:14 -0400 Subject: Re: Does anyone else consider this irresponsible? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Louis and others, I owe you an apology. It appears that I jumped to an erroneous conclusion when I assumed that Scooter harvested data from every available port. In my defence, it was the only assumption I could reach, based upon the information I had available to me. It seems that, rather than the behavior I suggested, the URLs with which I had a problem were inadvertently exposed to Scooter through the general publishing of an employee's hotlist. Thank you for the clarification and I regret any inconvenience this may have caused. It should be noted that I reached my current state of education regarding this matter via the "link:hostname" mechanism in Altavista. An excellent resource with innovative features. My compliments. </rr> Robert Raisch chief scientist The Internet Company On 10 Jan 96 at 9:32, monier@pa.dec.com wrote: > Gang, > > I am the father of Scooter, and for the second time I need to rescue my robot > from groundless accusations. Soon I'll have enough material for a book... 
From owner-robots Wed Jan 10 12:09:09 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26420; Wed, 10 Jan 96 12:09:09 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad19ba7e82cc@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 10 Jan 1996 12:09:45 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: FAQ again. Cc: kfischer@mail.win.org Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com

Hi all,

I've been getting a lot of robot questions recently, so decided the FAQ time is now :-) I wrote the stuff below, and cross-checked with Keith Fischer's preliminary FAQ of early November last year; I think I have addressed most of the questions he proposed. Pending comments I'll HTML-ise it and add it to the robot pages this week.

Regards,

______________

WWW Robot Frequently Asked Questions

Last updated: 10 January 1996
Maintained by Martijn Koster <m.koster@webcrawler.com>
Location: http://info.webcrawler.com/mak/projects/robots/faq.html

1) About WWW robots
  1.1) What is a WWW robot?
  1.2) What is an agent?
  1.3) What is a search engine?
  1.4) What other kinds of robots are there?
  1.5) Aren't robots bad for the web?
  1.6) Where do I find out more about robots?
2) Indexing robots
  2.1) How does a robot decide where to visit?
  2.2) How does an indexing robot decide what to index?
  2.3) How do I register my page with a robot?
3) For Server Administrators
  3.1) How do I know if I've been visited by a robot?
  3.2) I've been visited by a robot. Now what?
  3.3) A robot is traversing my whole site too fast!
  3.4) How do I keep a robot off my server?
4) Robots exclusion standard
  4.1) Why do I find entries for /robots.txt in my log files?
  4.2) How do I prevent robots scanning my site?
  4.3) Where do I find out how /robots.txt files work?
  4.4) Will the /robots.txt standard be extended?
5) Availability
  5.1) Where can I use a robot?
  5.2) Where can I get a robot?
  5.3) Where can I get the source code for a robot?
  5.4) I'm writing a robot, what do I need to be careful of?
  5.5) I've written a robot, how do I list it?

1) About Web Robots
===================

1.1) What is a WWW robot?
-------------------------

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case, a robot simply visits sites by requesting documents from them.

1.2) What is an agent?
----------------------

The word "agent" has many meanings in computing these days. Specifically:

- Autonomous agents are programs that do travel between sites, deciding themselves when to move and what to do (e.g. General Magic's Telescript). These can only travel between special servers and are currently not widespread in the Internet.

- Intelligent agents are programs that help users with things, such as choosing a product, or guiding a user through form filling, or even helping users find things. These generally have little to do with networking.

- User-agent is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator, Email User-agents like Qualcomm Eudora etc.

1.3) What is a search engine?
-----------------------------

A search engine is a program that searches through some dataset.
In the context of the Web, the term "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.

1.4) What other kinds of robots are there?
------------------------------------------

Robots can be used for a number of purposes:

- Indexing (see section 2)
- HTML validation
- Link validation
- "What's New" monitoring
- Mirroring

See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the list...

1.5) Aren't robots bad for the web?
-----------------------------------

There are a few reasons people believe robots are bad for the Web:

- Certain robot implementations can overload (and in the past have overloaded) networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.

- Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.

- Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.

But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.

So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.

1.6) Where do I find out more about robots?
-------------------------------------------

There is a Web robots home page at: http://info.webcrawler.com/mak/projects/robots/robots.html

While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive collection of information which is maintained by Martijn Koster <m.koster@webcrawler.com>.
Of course the latest version of this FAQ is there. You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.

2) Indexing robots
==================

2.1) How does a robot decide where to visit?
--------------------------------------------

This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.

Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.

Sometimes other sources for URLs are used, such as scans of USENET postings, published mailing list archives etc.

Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.

2.2) How does an indexing robot decide what to index?
-----------------------------------------------------

If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: some robots index the HTML titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags.

We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...

2.3) How do I register my page with a robot?
--------------------------------------------

You guessed it, it depends on the service :-) Most services have a link to a URL submission form on their search page.

Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.
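
As a rough illustration of 2.1 and 2.2, the sketch below starts from a list of seed URLs, pulls the links out of each retrieved page, and queues the ones it has not seen yet. It is a minimal sketch and not how any particular robot works: the names LinkExtractor, extract_links and crawl, and the fetch callable, are all made up for this example, and a real robot would also honour /robots.txt and space out its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments stripped) referenced by the document."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def crawl(seeds, fetch, limit=100):
    """Breadth-first visit from the seed URLs.

    fetch(url) is a hypothetical callable returning the page's HTML as a
    string, or None if the page could not be retrieved.
    """
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:  # queue each URL only once
                seen.add(link)
                queue.append(link)
    return visited
```

Passing fetch in as a parameter keeps the sketch free of network code, so it can be tried against a dictionary of canned pages (for example crawl(seeds, pages.get)) before being pointed at anything real.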
3) For Server Administrators
============================

3.1) How do I know if I've been visited by a robot?
---------------------------------------------------

You can check your server logs for sites that retrieve many documents, especially in a short time.

If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.

Finally, if you notice a site repeatedly checking for the file '/robots.txt', chances are that is a robot too.

3.2) I've been visited by a robot. Now what?
--------------------------------------------

Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.

If you think you have discovered a new robot (i.e. one that is not on the list of active robots at <URL: http://info.webcrawler.com/mak/projects/robots/robots.html>) and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!

3.3) A robot is traversing my whole site too fast!
--------------------------------------------------

This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.

First of all check if it is a problem by checking the load of your server, and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.

However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.

If this happens, there are a few things you should do.
Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try and find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots on <URL: http://info.webcrawler.com/mak/projects/robots/robots.html>. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.

If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.

3.4) How do I keep a robot off my server?
-----------------------------------------

Read the next section...

4) Robots exclusion standard
============================

4.1) Why do I find entries for /robots.txt in my log files?
-----------------------------------------------------------

They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see question 4.3.

If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.

Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)

4.2) How do I prevent robots scanning my site?
----------------------------------------------

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

    User-agent: *
    Disallow: /

but it's easy to be more selective than that, see 4.3.

4.3) Where do I find out how /robots.txt files work?
----------------------------------------------------

You can read the whole standard on the Robot Page <URL: http://info.webcrawler.com/mak/projects/robots/robots.html> but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example (the vertical bar on the left is not part of the contents):

  | # /robots.txt file for http://webcrawler.com/
  | # mail webmaster@webcrawler.com for constructive criticism
  |
  | User-agent: webcrawler
  | Disallow:
  |
  | User-agent: lycra
  | Disallow: /
  |
  | User-agent: *
  | Disallow: /tmp
  | Disallow: /logs

The first two lines, starting with '#', specify a comment.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token; it's not a regular expression.

Two common errors:

- Regular expressions are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp'.
- You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).

4.4) Will the /robots.txt standard be extended?
-----------------------------------------------

Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.

5) Availability
===============

5.1) Where can I use a robot?
-----------------------------

If you mean a search service, check out the various directory pages on the Web, such as Netscape's <URL: http://home.netscape.com/home/internet-directory.html> or try one of the Meta search services such as <URL: http://metasearch.com/>

5.2) Where can I get a robot?
-----------------------------

Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly.

In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.

5.3) Where can I get the source code for a robot?
-------------------------------------------------

See 5.2 -- some may be willing to give out source code.

5.4) I'm writing a robot, what do I need to be careful of?
----------------------------------------------------------

Lots. First read through all the stuff on the robot page http://info.webcrawler.com/mak/projects/robots/robots.html then read the proceedings of past WWW Conferences, and the complete HTTP and HTML spec. Yes; it's a lot of work :-)

5.5) I've written a robot, how do I list it?
---------------------------------------------

Simply fill in http://info.webcrawler.com/mak/projects/robots/form.html and mail the result to Martijn Koster <m.koster@webcrawler.com> with a subject of "Addition to the list of robots".

THE END

-- Martijn
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Wed Jan 10 12:46:18 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28305; Wed, 10 Jan 96 12:46:18 -0800
Date: Wed, 10 Jan 1996 13:59:57 -0600
From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5)
Message-Id: <9601101959.AA21857@tssun5.>
To: robots@webcrawler.com
Subject: Re: Does anyone else consider this irresponsible?
X-Sun-Charset: US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

> From owner-robots@webcrawler.com Wed Jan 10 13:30 CST 1996
> From: <monier@pa.dec.com>
> To: robots@webcrawler.com
> Subject: Re: Does anyone else consider this irresponsible?
> Date: Wed, 10 Jan 96 09:32:44 -0800
> X-Mts: smtp
> Could the next person who feels an urge to speak for the Alta Vista robot please
> check with me first? Nothing about this project is very secret, all you have to
> do is ask.

OK ... um ... how about source? ;)

From owner-robots Wed Jan 10 13:27:50 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00457; Wed, 10 Jan 96 13:27:50 -0800
From: <monier@pa.dec.com>
Message-Id: <9601102121.AA22127@evil-twins.pa.dec.com>
To: robots@webcrawler.com
Subject: Re: Does anyone else consider this irresponsible?
In-Reply-To: Your message of "Wed, 10 Jan 96 13:59:57 CST." <9601101959.AA21857@tssun5.>
Date: Wed, 10 Jan 96 13:21:06 -0800
X-Mts: smtp
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Got a 1GB machine to run it on?

OK, maybe not the source. It's too buggy and I would be embarrassed (;-)).

--Louis

From owner-robots Wed Jan 10 14:04:52 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02364; Wed, 10 Jan 96 14:04:52 -0800
X-Sender: mak@surfski.webcrawler.com
Message-Id: <v02140807ad19d0dfc63f@[199.221.45.139]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 10 Jan 1996 14:05:30 -0700
To: robots@webcrawler.com
From: m.koster@webcrawler.com (Martijn Koster)
Subject: Re: robots.txt extensions
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 10:35 AM 1/10/96, Adam Jack wrote:
>Hello,
>
>Since this list started I've only ever seen one suggestion
>for an extension to robots.txt.
An extension discussion document sounds like an ideal, though belated, New Year's resolution :-)

>to add expiry information for the
>robots.txt file itself. No response appears to have been given
>-- did people not think it worth while? Did people think the
>HTTP response field, Expires, should be used for that?

Yes, and I also don't think it's something widely wanted, and I think it will be confusing to people (who don't understand all the ins and outs anyway). How about a separate 'funny messages in /robots.txt' thread? :-)

The thing about expires is that it is a prediction, and people are not good at making predictions; they want an "I changed it, now update all your robots out there" push scheme. Does submitting a '/robots.txt' manually to robots bump it up in the queue (it does in WebCrawler)? Then you could use submit-it to do the push :-)

>I don't know if this was discussed to death somewhere -- but
>are people still considering extensions to robots.txt? I'd be
>interested in any pointers to an archive of such a discussion.

My thoughts never made it to the list :-)

>If there is point in discussion additions pls read on --
>otherwise bin this mail.

No, by all means. But most of all I want to keep things simple.

>MinRequestInterval: X
>
>   Minimum request interval in seconds, (0=no minimum),
>   with a default, if missing, of 60.
>
>   This is for those of us lowly enough not to have huge
>   gathering tasks and the luxury ;-) of a backlog of URLs
>   over distributed sites. (I.e. Those of us doing a
>   sequential search exhausting our interest in a site in
>   one slurp.) Additionally local admins would have more
>   control over wanderers that visited.

Interesting, I didn't think people still did that :-)

I think 60 is a sensible default, so let's think about why you would change it from that. There seems little point in setting it much higher, because even on the worst platform one request per minute is no problem (unless previous connections are still open).
But who would set it much lower? Only someone who wants to run a robot to their own site, in which case they can control the speed themselves... So is it worth doing it at all?

>DefaultIndex: index.html
>
>   Stating that XXXX/ and XXXX/index.html are identical.
>
>   You can argue that this is lamely inadequate - or that it
>   makes a saving. I know the bigger issue is recursion. Here
>   I am merely hoping to save those single page recursions.

Yes, I do argue that this is lamely inadequate; I too think checksums are the way for this, even if it is post-retrieval; pre-retrieval is always a guess (even if we could have an If-not-md5 HTTP header).

>CGIMask: *.cgi
>
>   Rather than guessing at CGI urls -- why not get the local
>   admin to answer it? I know that the WN server uses a file
>   extension to indicate a CGI script -- not /cgi-bin/.
>
>   Q: Are CGI scripts universally avoided in advance -- or do
>   robots look at the HTTP flags of results to try to work
>   out whether some content is dynamically generated?

I always think you shouldn't make a distinction between dynamically generated output and static output. What you should pay attention to is things like Expires, forms and queries, and outrageous recursion...

>Finally -- I never understood why robots.txt was exclusion only.
>Why does it not have some of positive hints added? I.e. you are
>allowed & welcome to browse XXXX/fred.html. Was this a choice
>built upon pragmatism -- thinking that this would open a can of
>worms?

Ha, finally someone who understands me! :-)) Yes, the can is really opened up when you start allowing keywords and stuff. I did think maybe one or both of a 'Visit' and a 'Meta' header would be a reasonable idea:

'Visit' would allow URLs to be listed for retrieval, and nothing more. So you could do:

  | Disallow: /
  | Visit: /welcome.html
  | Visit: /products.html
  | Visit: /keywords-and-overview-for-robots.html

Which would be kinda cool and simple, but doesn't scale well to many URLs, or to more meta data.
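To make the 'Visit' idea above concrete, here is a minimal Python sketch of how a robot might read such a record. 'Visit' is only a proposal in this thread and not part of the exclusion standard, so everything here is an assumption: the function names `parse_rules` and `may_fetch` are made up for illustration, and so is the rule that an exact 'Visit' match overrides a matching 'Disallow' prefix. User-agent handling is left out.

```python
def parse_rules(text):
    """Collect Disallow prefixes and (proposed) Visit paths from one record."""
    disallow, visit = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue                         # blank or malformed line
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "disallow" and value:    # an empty Disallow excludes nothing
            disallow.append(value)
        elif field == "visit" and value:
            visit.append(value)
    return disallow, visit


def may_fetch(path, disallow, visit):
    """Decide whether a path may be retrieved under the assumed semantics."""
    # Assumption: a path listed under Visit is retrievable even when a
    # Disallow prefix also matches it.
    if path in visit:
        return True
    return not any(path.startswith(prefix) for prefix in disallow)


EXAMPLE = """\
Disallow: /
Visit: /welcome.html
Visit: /products.html
Visit: /keywords-and-overview-for-robots.html
"""

if __name__ == "__main__":
    disallow, visit = parse_rules(EXAMPLE)
    for path in ("/welcome.html", "/internal/notes.html"):
        print(path, may_fetch(path, disallow, visit))
```

Exact matching keeps the sketch close to "listed for retrieval, and nothing more". A prefix match for 'Visit' would be an equally plausible reading.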
'Meta' would specify a link to a separate document, using some TBD format (or formats, using content negotiation to pick text/url-list, text/aliweb, urc/foo or whatever) which further guides content selection and meta data for a site.

Other requests I've had are regular expression support (e.g. for Disallow: *.html3), and allowing multiple paths per disallow line.

What do people think of the above? Any others?

-- Martijn
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Wed Jan 10 14:05:00 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02379; Wed, 10 Jan 96 14:05:00 -0800
X-Sender: mak@surfski.webcrawler.com
Message-Id: <v02140808ad19d8f2ac66@[199.221.45.139]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 10 Jan 1996 14:05:36 -0700
To: robots@webcrawler.com
From: m.koster@webcrawler.com (Martijn Koster)
Subject: Robots / source availability?
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Any robot author could have written:

> maybe not the source. It's too buggy and I would be embarrassed (;-)).

:-)

Who is currently selling/giving away robot binaries and/or source? I'd like to add that info to the robots listed... people ask me all the time.

-- Martijn
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Wed Jan 10 16:18:44 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08903; Wed, 10 Jan 96 16:18:44 -0800
Message-Id: <30F45854.1697@corp.micrognosis.com>
Date: Wed, 10 Jan 1996 19:22:44 -0500
From: Adam Jack <ajack@corp.micrognosis.com>
Organization: CSK/Micrognosis Inc.
X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: Re: robots.txt extensions
References: <v02140807ad19d0dfc63f@[199.221.45.139]>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Martijn Koster wrote:
> > [...] sequential search exhausting our interest in a site in
>
> Interesting, I didn't think people still did that :-)

Martijn -- think lowly, very very lowly ... ;-) People will always start somewhere -- and, as I mentioned in point to point mail, it is us beginners that *all* ought be wary of. Robots.txt seems the first line of defense. A site can make explicit statements in it. Being explicit is a good reason for a MinRequestInterval.

> I think 60 is a sensible default, so let's think about why you would
> change it from that. [...] But who would set it much lower?
>

60 might be sensible to your need -- but what about other's search needs? Consider people like me who get libwwwperl and a spare afternoon and a goal. Robots, Spiders et al. will get more and more prolific and they won't all have long term aims and/or budgets.

In testing, and in practice, I felt myself get tempted to hack down the 60 second default to, say, 30 ... then I read that an 'OK' robot on the active list did once-a-second :-) :-) ...... Soon, rabid thoughts of 60 *micro* seconds came to mind ... However - if any site ever mentioned a preference for, say, 120 seconds - then I'd be happy to oblige.

I think this information is a good addition. It needn't be of use to the thundering giants -- it is the WWW site that benefits.

> >DefaultIndex: index.html
> >
> >   Stating that XXXX/ and XXXX/index.html are identical.
> >
> >   You can argue that this is lamely inadequate - or that it
> >   makes a saving. I know the bigger issue is recursion. Here
> >   I am merely hoping to save those single page recursions.
>
> Yes, I do argue that this is lamely inadequate; I too think checksums
> are the way for this, even if it is post-retrieval; pre-retrieval is
> always a guess (even if we could have an If-not-md5 HTTP header)
>

Again - giants versus the lowly. This misses a saving for those who don't have MD5 capabilities.

Also, as for whether checksums are the answer - that seems odd: so - a robot must cache a whole site of checksums, or load the checksum lists when a site's URL is individually accessed (for those non-sequential giants). All this to see if an URL is the same as one already seen? Is this not a huge processing overhead? Is this mechanism suggested only because existing HTTP servers and header fields would need no change to support it?

Adam
--
+1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html

From owner-robots Wed Jan 10 16:38:34 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09762; Wed, 10 Jan 96 16:38:34 -0800
From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi>
Message-Id: <199601110038.CAA11256@krisse.www.fi>
Subject: Re: robots.txt extensions
To: robots@webcrawler.com
Date: Thu, 11 Jan 1996 02:38:12 +0200 (EET)
In-Reply-To: <Pine.SUN.3.91.960110095606.1141B-100000@becks> from "Adam Jack" at Jan 10, 96 10:35:10 am
X-Mailer: ELM [version 2.4 PL22]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 3148
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Adam:
> If there is point in discussion additions pls read on --
> otherwise bin this mail.

Sure is. In the following I will comment on your proposals from the viewpoint 'Does it solve any problems? Will anyone implement it?', because I think no one will implement any extensions if there is nothing to gain and more specifically no problems to solve with them.

> http://info.webcrawler.com/mailing-lists/robots/0001.html
>
> seemed sensible enough -- to add expiry information for the
> robots.txt file itself. No response appears to have been given

Problem is: someone changes robots.txt while the cached copy is trusted by the robot. Adding expiry info does not stop sysadmins from editing robots.txt before its expiration, so it still has to be retrieved at some sensible interval before expiration if that is set too far away in the future.

Retrieving robots.txt every 100th - 1000th GET, or after a minimum of 8 hours and a maximum of a couple of days, will not increase net traffic and solves the problem better than expiry fields. And because every robot has to handle robots.txt expiration sensibly, no sysadmin sees this as a problem and will not implement the new field.

> MinRequestInterval: X
>
>   Minimum request interval in seconds, (0=no minimum),
>   with a default, if missing, of 60.

There is no problem with request intervals with well-behaved robots, and ill-behaving ones - will they obey it anyway? So there is no problem and it does not even get solved :-) Again nobody will implement this.

> DefaultIndex: index.html
>
>   Stating that XXXX/ and XXXX/index.html are identical.

Checksums are easier and have to be implemented anyway, because most sites will not have this field implemented. And 'cause checksums work, this is unnecessary and no one will use it..

> CGIMask: *.cgi

Hmm. Disallow: with regular expressions would be more generic. But again: how many such cases can be found that this is necessary?

> Finally -- I never understood why robots.txt was exclusion only.
> Why does it not have some of positive hints added? I.e. you are
> allowed & welcome to browse XXXX/fred.html. Was this a choice
> built upon pragmatism -- thinking that this would open a can of
> worms?

I do not believe it is a problem to give robots URLs, they are pretty good at finding them themselves.
Also listing a URL in robots.txt does not bring the robot for a visit - a submission to the robot admin will. On the other hand, lack of exclusion of robots from sites/URLs was a severe problem and was well solved by robots.txt.

Also, while updating the information content of a site, sysadmins and ordinary users surely will forget to update robots.txt. (Directories are more static and therefore the current scheme works.)

I am sorry I sound quite negative.. Actually, the ideas might be pretty good. I do not mean to be rude :-)

I actually have a new idea too:

  Textarchive: /allpages.zip

or

  Textarchive: /publicdocs.tar.gz

(or with any other compressed archive format)

..instructs robots to fetch all there is in a compressed format. Is this a simple enough interface for everyone to accept? Too simple?

From owner-robots Wed Jan 10 18:43:35 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16707; Wed, 10 Jan 96 18:43:35 -0800
From: mnorman@netcom.com (Mark Norman)
Message-Id: <199601110243.SAA20444@netcom11.netcom.com>
Subject: Re: Does anyone else consider...
To: robots@webcrawler.com
Date: Wed, 10 Jan 1996 18:43:09 -0800 (PST)
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 138
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Louis, you said your robot only "follows links" to find web sites, but how do you get the links you give it as a starting point? thanks.

From owner-robots Wed Jan 10 19:58:35 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21787; Wed, 10 Jan 96 19:58:35 -0800
From: <monier@pa.dec.com>
Message-Id: <9601110353.AA22505@evil-twins.pa.dec.com>
To: robots@webcrawler.com
Subject: Re: Does anyone else consider...
In-Reply-To: Your message of "Wed, 10 Jan 96 18:43:09 PST." <199601110243.SAA20444@netcom11.netcom.com>
Date: Wed, 10 Jan 96 19:53:28 -0800
X-Mts: smtp
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

> Louis, you said your robot only "follows links" to find web sites, but
> how do you get the links you give it as a starting point? thanks.

I just started from a few well-known sources, like the NCSA archives, and the Web is sufficiently connected to do the rest. I did not have the guts to give it a single URL, but I bet that it would take quite a bit of work to find a URL that would not connect to the whole Web. Think about how many pages mention Yahoo for example, and how quickly the search will branch after that. And then of course I use any URL people contribute.

-----------------

Since I'm here, and in the interest of saving bandwidth I want to respond to Skip who was missing one important point and calling me a fool (;-)).

Alta Vista uses a fast robot. I ran this robot for a week and got 16M pages. If I had run it for two weeks I would no doubt have 25-30M pages today. Once I restart the robot the index will contain more pages, unless it finds a lot of sites with better /robots.txt in which case it will delete these pages, and I may report a smaller index for a while, which would be fine with me.

Notice that I said a "better" robots.txt, because I would actually enjoy seeing every webmaster put up a good file and save everyone the trouble to fetch, index, and read stuff that was never intended to be indexed. Every chance I get to educate another person, especially a reporter, about the Robots Exclusion Standard, I do it, because it's our only chance so far to improve the quality of what ends up in Web indexes. And of course if webmasters used password protection on ports that are not intended for public usage it would make life somewhat easier: I have answered enough "you have violated my secret test site" messages.
My point is that I don't want to maximize at all cost the number of pages to report: I am interested in finding out how large the Web is, and giving everyone access to its complete index. And while doing this I want to report facts, not engage in a p...ing contest with some outfit who has reportedly indexed 91% of an absolutely unknown and moving figure. Alta Vista is a research project with no place for this kind of creative arithmetic.

--Louis

From owner-robots Wed Jan 10 21:32:44 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28339; Wed, 10 Jan 96 21:32:44 -0800
Message-Id: <30F4A20A.1B91@corp.micrognosis.com>
Date: Thu, 11 Jan 1996 00:37:14 -0500
From: Adam Jack <ajack@corp.micrognosis.com>
Organization: CSK/Micrognosis Inc.
X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: Re: robots.txt extensions
References: <199601110038.CAA11256@krisse.www.fi>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Jaakko Hyvatti wrote:
>
> I am sorry I sound quite negative.. Actually, the ideas might be
> pretty good. I do not mean to be rude :-)
>

Don't worry about that. I appreciate your information. Thanks.

Okay - so zero for 3... How about I just comment on the following:

monier@pa.dec.com wrote:
>
> sites with better /robots.txt [...]
> index, and read stuff that was never intended to be indexed.
>

Our admin knows + cares little for our content. Our content is automatically transferred to our own sub-tree locations each hour. We are in a position to determine what is worthy & what not -- but not in a position to modify /robots.txt.

I am sure our site is not alone in this...
Adam
--
+1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html

From owner-robots Wed Jan 10 22:03:48 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00606; Wed, 10 Jan 96 22:03:48 -0800
Message-Id: <v02130500ad1a569db95e@[202.237.148.20]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 11 Jan 1996 15:02:50 +0900
To: robots@webcrawler.com
From: mschrimsher@twics.com (Mark Schrimsher)
Subject: Re: Does anyone else consider...
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Louis:

>Alta Vista uses a fast robot. I ran this robot for a week and got 16M pages.
>If I had run it for two weeks I would no doubt have 25-30M pages today. Once I

Are you saying that your entire database was obtained in a week?! What do the dates mean in the listings that are returned? -- the date the file was created, changed, or documented by AV? I've been using AV to trace links to our page, and I can use the advanced query and divide things up by time period. When I do this the distribution of pages is over a matter of months, not a week. (By the way, there seems to be a bug in the date feature, since it returns data even when you set it to the future, and there are a couple other odd things.)

Are you able to update the database in real time, or do you have to rebuild it every time you add/revise data?

--Mark

From owner-robots Thu Jan 11 01:44:21 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13378; Thu, 11 Jan 96 01:44:21 -0800
Date: Thu, 11 Jan 1996 09:44 UT
From: MGK@NEWTON.NPL.CO.UK (Martin Kiff)
Message-Id: <0099C391D03F1640.5846@NEWTON.NPL.CO.UK>
To: robots@webcrawler.com
Subject: Re: robots.txt extensions
X-Vms-To: SMTP%"robots@webcrawler.com"
X-Vms-Cc: MGK
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hello all, I've just joined the Robots list, so this is a general 'hello'.

> 'Visit' would allow URLs to be listed for retrieval, and nothing more.
> So you could do:
>
> | Disallow: /
> | Visit: /welcome.html
> | Visit: /products.html
> | Visit: /keywords-and-overview-for-robots.html
>
> Which would be kinda cool and simple, but doesn't scale well to many
> URLs, or to more meta data.

I'd use this for a

  Visit: /changes.html

which contains a 'w3new' type list of all pages (all pages I want indexed) on the server with the most recently modified at the top... crawlers can do what they like with the information but at least it is there.

It might help however to qualify the 'Visit' keyword in some way to say that the information is ordered.

Regards,
Martin Kiff
mgk@newton.npl.co.uk

From owner-robots Thu Jan 11 03:57:11 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20139; Thu, 11 Jan 96 03:57:11 -0800
Date: Thu, 11 Jan 1996 06:56:59 -0500
From: Skip Montanaro <skip@automatrix.com>
Message-Id: <199601111156.GAA27257@dolphin.automatrix.com>
To: robots@webcrawler.com
Subject: Re: Does anyone else consider...
In-Reply-To: <9601110353.AA22505@evil-twins.pa.dec.com>
References: <199601110243.SAA20444@netcom11.netcom.com> <9601110353.AA22505@evil-twins.pa.dec.com>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

    Since I'm here, and in the interest of saving bandwidth I want to
    respond to Skip who was missing one important point and calling me a
    fool (;-)).

My apologies also. I was responding in part to Robert Raisch's message. A foundation built on sand...

Skip Montanaro     | Looking for a place to promote your music venue, new CD
skip@calendar.com  | or next concert tour? Place a focused banner ad in
(518)372-5583      | Musi-Cal! http://www.calendar.com/concerts/

From owner-robots Thu Jan 11 08:02:49 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02821; Thu, 11 Jan 96 08:02:49 -0800
From: <monier@pa.dec.com>
Message-Id: <9601111557.AA23045@evil-twins.pa.dec.com>
To: robots@webcrawler.com
Subject: Re: Does anyone else consider...
In-Reply-To: Your message of "Thu, 11 Jan 96 15:02:50 +0900." <v02130500ad1a569db95e@[202.237.148.20]>
Date: Thu, 11 Jan 96 07:57:21 -0800
X-Mts: smtp
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

The database was obtained in 8 days. The date is last-modified as reported by the server, which is often bogus, but there is nothing I can do, except educate more webmasters (;-)). This should be better documented, we are working on documentation right now.

The database is updated in real time, i.e. while queries come in: the news index for example is constantly in flux since articles come in and expire all the time.

--Louis

From owner-robots Thu Jan 11 09:23:28 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07254; Thu, 11 Jan 96 09:23:28 -0800
Message-Id: <199601111723.JAA23233@meitner.cs.washington.edu>
In-Reply-To: m.koster@webcrawler.com's message of Wed, 10 Jan 1996 14:05:36 -0700
To: robots@webcrawler.com
Subject: Re: Robots / source availability?
References: <v02140808ad19d8f2ac66@[199.221.45.139]>
Date: Thu, 11 Jan 1996 09:23:23 PST
From: Erik Selberg <speed@cs.washington.edu>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Martijn Koster writes:
> Who is currently selling/giving away robot binaries and/or source?
> I'd like to add that info to the robots listed... people ask me all the time.
There's a mini-robot available with the recent release of libwww (v4.0) from w3.org

-Erik

From owner-robots Sat Jan 13 07:06:34 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08811; Sat, 13 Jan 96 07:06:34 -0800
Message-Id: <30F7CB49.3E4A@wsnet.com>
Date: Sat, 13 Jan 1996 09:10:01 -0600
From: Alison Gwin <alison@wsnet.com>
Organization: Coldwell Banker Smith
X-Mailer: Mozilla 2.0b3 (Win95; I)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: (no subject)
X-Url: http://info.webcrawler.com/mailing-lists/robots/info.html
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Could someone point me to a simple application that will scan newsgroups of interest to me and save the email addresses from those newsgroups? I'm sure that such an application exists, but can't find one anywhere. Creating one from scratch seems like such a waste of effort when I know there's probably one out there already. Thanks!

From owner-robots Sat Jan 13 10:58:29 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24036; Sat, 13 Jan 96 10:58:29 -0800
X-Sender: dhender@oly.olympic.net
Message-Id: <v01510101ad1db16bf2ef@[205.240.23.66]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 13 Jan 1996 11:00:17 -0800
To: robots@webcrawler.com
From: david@quickimage.com (David Henderson)
Subject: Robots not Frames savy
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

When will robots support frames????
_____________________________________________________________ David Henderson - Webmaster - QUICKimage HOME PH/FAX: 360-377-2182 WORK PH: 206-443-1430 WORK FAX: 206-443-5670 From owner-robots Sat Jan 13 16:03:31 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11650; Sat, 13 Jan 96 16:03:31 -0800 Message-Id: <v02130501ad1df5a3c7c2@[202.243.51.216]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 14 Jan 1996 09:04:30 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Spam Software Sought Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 9:10 AM 1/13/96, Alison Gwin wrote: >Could someone point me to a simple application that will scan newsgroups >of interest to me and save the email addresses from those newsgroups? >I'm sure that such an application exists, but can't find one anywhere. >Creating one from scratch seems like such a waste of effort when I know >there's probably one out there already. Thanks! Why would you want such a program? You wouldn't be working for Canter and Siegel, would you? From owner-robots Sun Jan 14 00:53:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10100; Sun, 14 Jan 96 00:53:22 -0800 Date: Sun, 14 Jan 96 12:02:20 EST From: smadja@netvision.net.il Subject: Re: Does anyone else consider... To: robots@webcrawler.com X-Mailer: Chameleon ARM_55, TCP/IP for Windows, NetManage Inc. Message-Id: <Chameleon.960114120412.smadja@Haifa.netvision.net.il> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 11 Jan 96 07:57:21 -0800 monier@pa.dec.com wrote: > >The database was obtained in 8 days. The date is last-modified as reported by >the server, which is often bogus, but there is nothing I can do, except educate >more webmasters (;-)).
This should be better documented, we are working on >documentation right now. >The database is updated in real time, i.e. while queries come in: the news index >for example is constantly in flux since articles come in and expire all the time. > > --Louis Louis: What do you mean by the DB is updated real time. Do you rescan the web continuously checking for updates and new pages (with a 1-week cycle), or do you have some other strategy? Thanks ------------------------------------- Name: Frank Smadja E-mail: smadja@netvision.net.il Date: 01/14/96 Time: 12:02:20 ------------------------------------- From owner-robots Sun Jan 14 10:53:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07332; Sun, 14 Jan 96 10:53:23 -0800 From: <monier@pa.dec.com> Message-Id: <9601141846.AA26989@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Horror story Date: Sun, 14 Jan 96 10:46:32 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was taking a look at the Alta Vista database and found out the following fact: about 5% of all sites visited (>100,000 total) have a non-empty /robots.txt file. Horrifying! This suggests that before we add all sorts of improvements to the standard we should try to educate the webmasters: it does not matter whether we have 36 options on how often to refetch the damn file if nobody uses it and robots still fall down holes, wander in test areas, gobble up access logs... Seriously, how about some sort of concerted effort to educate webmasters everywhere that with two minutes of their time they can make everyone's life better: less visits to their site, less junk in indexes (indices?). 
--Louis From owner-robots Sun Jan 14 12:05:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11141; Sun, 14 Jan 96 12:05:44 -0800 Date: Sun, 14 Jan 1996 15:05:31 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601142005.PAA01945@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Re: Horror story In-Reply-To: <9601141846.AA26989@evil-twins.pa.dec.com> References: <9601141846.AA26989@evil-twins.pa.dec.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Seriously, how about some sort of concerted effort to educate webmasters everywhere that with two minutes of their time they can make everyone's life better: less visits to their site, less junk in indexes (indices?). Sounds good in principle. In practice, however, since most people want to tout as many "hits" as possible, they may be less inclined than you might think to squelch robots. Here are a few suggestions: 1. Every site that uses robots (Lycos, Alta Vista, Webcrawler, ...) should have an easily found link to Martijn's norobots page. I know some do already. Others mention robots.txt but don't provide a link. 2. If possible, expose several "load-and-go" annotated robots.txt files (maybe on Martijn's site), each with clear statements of the particular file's goals. I know there are a few on the norobots page, but I doubt there are very many sites with /cyberworld directories. 3. Every robot site that supports URL inputs should mention robots.txt in both the submission form and the submission response page. 4. How about a robots.txt creation Web form? 5. Are there some good non-Web places to get a little publicity? What about WebWeek, Interactive Age, and other Internet rags? Could a short article be written? 6. All the major Web servers should come with a little blurb about robots.txt. 7. How about a little IMG like the Point Communications Top 5% graphic that points to the norobots site?
Anybody with a robots.txt file could display it proudly (we are, after all, a pretty elite group if Louis's message is indicative of reality). It could have a little image and a catchy phrase like: robots.txt - the diaphragm for your Web server -- Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Sun Jan 14 16:05:07 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24212; Sun, 14 Jan 96 16:05:07 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199601150005.CAA02337@krisse.www.fi> Subject: Re: Horror story To: robots@webcrawler.com Date: Mon, 15 Jan 1996 02:04:58 +0200 (EET) In-Reply-To: <199601142005.PAA01945@dolphin.automatrix.com> from "Skip Montanaro" at Jan 14, 96 03:05:31 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 760 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Skip Montanaro: > 6. All the major Web servers should come with a little blurb about > robots.txt. That is the most important thing here! And not only should all server installation instructions have a step called 'Creating /robots.txt' and an example with 'Disallow: /cgi-bin/' with it, any software that creates information or scripts that should not be indexed, like statistics packages, query frontends, database gateways.. should come with /robots.txt and specific instructions! Now it is just a question of who is going to do the real work here, to list all such software with developers contact addresses, and formulate a letter that impresses them to include these instructions into the next release. I believe it might even work.
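A minimal sketch of the kind of example an installation guide could ship alongside a 'Creating /robots.txt' step, checked with Python's standard urllib.robotparser. The file contents beyond 'Disallow: /cgi-bin/', the /stats/ directory, the host www.example.com and the robot name ExampleBot are all illustrative assumptions, not taken from any real server or robot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt: keep all robots out of scripts and statistics.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /stats/
"""


def allowed(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under EXAMPLE_ROBOTS_TXT."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/index.html", "/cgi-bin/search", "/stats/usage.html"):
        print(path, "allowed" if allowed(path) else "disallowed")
```

A well-behaved robot would make a check of this kind before each request; ordinary pages stay reachable while the script and statistics directories are skipped.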
From owner-robots Sun Jan 14 17:50:05 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29935; Sun, 14 Jan 96 17:50:05 -0800 Message-Id: <9601150149.AA29926@webcrawler.com> Date: Sun, 14 Jan 1996 17:38:00 -0800 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: Re: Horror story To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com 3. Every robot site that supports URL inputs should mention robots.txt in both the submission form and the submission response page. How about somebody, say maybe Louis as he would certainly have the resources (I would do it but don't have a budget for that kind of stuff, unless of course somebody came up with a few man weeks) run your little spider again against your complete data set of URL's and while you are looking for links on ONLY the top most page of the site record any mail address that has "webmaster..." or something similar in it. Then send a little message to the webmaster saying in effect that a group of sites that use spiders (mention a few of the big ones that are on this mailing list) have informally got together and in order to properly index your site in the future would like to make a little suggestion.... have noticed that you do not have a robots.txt file.... Then include a short example that the webmaster could cut and paste into the file system with a short tutorial on how to do it. Two things would happen, 1) lots of people would get the message and 2) if you sent out >100,000 e-mail messages one weekend surely some of the trade publications would write up a few articles on our behalf after their webmaster got a message and realized that it was sent to the world. We would not hit everybody, but it would sure hit a lot of the sites that could do the 2 minute piece of work and set it up properly.
Ted Sullivan tsullivan@snowymtn.com From owner-robots Sun Jan 14 19:25:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05452; Sun, 14 Jan 96 19:25:23 -0800 Date: Sun, 14 Jan 1996 19:24:47 -0800 Message-Id: <199601150324.TAA07209@one.mind.net> X-Sender: belisle@mind.net X-Mailer: Windows Eudora Version 1.4.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: belisle@mind.net (Hal Belisle) Subject: Gopher Protocol Question Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi! my name is Hal Belisle and I am new to this list. I am writing a small web client (spider?) that automatically searches specific sites on a bi-weekly basis for information using existing search engines. I am having a hard time with gophers. I can gain access and retrieve files, but I can't seem to give them a valid search string. What exactly do you replace the ? you normally see in web searches with when you query a gopher? Any help or pointers to other sources of information (i.e. a listserve for web clients) would be greatly appreciated. Thanks in advance Hal Belisle From owner-robots Sun Jan 14 22:40:14 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16486; Sun, 14 Jan 96 22:40:14 -0800 Message-Id: <9601150640.AA16476@webcrawler.com> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Brian Pinkerton <bp> Date: Sun, 14 Jan 96 22:40:09 -0800 To: robots@webcrawler.com Subject: Re: Horror story References: <9601150149.AA29926@webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The robots.txt format got some good press in the Nov. issue of WebWeek, and more extensive coverage on the whole robots issue is in the works. Skip is right: a lot of people want as much exposure as possible, and aren't likely to pay attention to ideas that might reduce that exposure! 
On the flip side, offering a way to specify what files on a site are most important *to* index would be seen as a big step forward. Martijn may have more to say on this issue. :) cheers, bri From owner-robots Sun Jan 14 23:10:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17982; Sun, 14 Jan 96 23:10:11 -0800 Message-Id: <199601150709.XAA05451@sparty.surf.com> Date: Sun, 14 Jan 96 11:07:40 -0800 From: Murray Bent <murrayb@surf.com> Organization: Web21 Inc. X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Horror story X-Url: http://home.netscape.com/ Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Any ideas for a back-up plan in case the 'robots.txt' approach does not gain more than 5% of the sites (admittedly representing more than 5% of the cool content). There are lots of other web middleware facilities in the works, and a "deny permission" element or quality of service element is in some of them. If we want smarter robots we need the middleware too. - there are docs on the w3.org site for all manner of proposals, watch out .. some of them may happen! - without these middleware facilities, the number of dumb robots and site slurpers will *proliferate* . Is robots.txt the solution of choice to the 'personalised' robots that will come in the box with Windows 96 say? mj From owner-robots Mon Jan 15 12:06:15 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26938; Mon, 15 Jan 96 12:06:15 -0800 From: "Mordechai T.
Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601152006.PAA24219@umbc10.umbc.edu> Subject: Re: Horror story To: robots@webcrawler.com Date: Mon, 15 Jan 1996 15:06:04 -0500 (EST) In-Reply-To: <199601150709.XAA05451@sparty.surf.com> from "Murray Bent" at Jan 14, 96 11:07:40 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 974 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "MB" == Murray Bent spake thusly: MB> MB> Any ideas for a back-up plan in case the 'robots.txt' approach MB> does not gain more than 5% of the sites (admittedly representing MB> more than 5% of the cool content). MB> I hate to sound heretical, but why is everyone so concerned about this '5%' figure? I'm sure everyone on this list is a sufficiently sophisticated programmer to have developed the sort of complex web systems that robots.txt is supposed to protect, but most web servers probably don't need a robots.txt. Now, 'most' might not be 95% of all web servers, but the problem is still not as bad as it might seem. We don't *need* to inform every 'webmaster' who downloads a server kit and can write HTML; only the hackers. Monier, do you have any guesses on what fraction of servers *should* have a robots.txt? -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Assembly programmers drive stick shifts. 
From owner-robots Mon Jan 15 13:35:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27701; Mon, 15 Jan 96 13:35:23 -0800 Subject: New Robot Announcement From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Mon, 15 Jan 1996 13:38:29 -0800 Message-Id: <1390409387-681890@aktiv.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com New Robot Announcement Name: Duppies (rhymes with puppies) Author: Larry Burke, AKTIV Software Platform: Mac OS (considering Windows NT port) User-Agent: Duppies/1.0 Purpose: Allows website administrator to provide searchable index of their own site as well as other related sites. Has facilities to perform timed updates. Performs several other utility functions. Includes filtering system to limit indexing to files meeting specified criteria. Single program performs robot function, text indexing, and search processing either as a CGI or a stand-alone web server. Important Note: It is our intention to make Duppies available commercially to web administrators. Any comments on this would be welcomed. We feel a large missing part of many web sites is the lack of a site specific index (try finding anything at www.apple.com). Status: Currently being implemented by the Government of British Columbia to index all the official ministry sites. Supports Robot Exclusion standard: Yes, and we have implemented all the other robot niceties we could think of (and read "Internet Agents" by Fah-Chun Cheong). I have been following this list since July or August. For more information visit "http://www.aktiv.com/duppies/duppies.html". -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com phone: 604.383.4195 From owner-robots Mon Jan 15 15:13:27 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28452; Mon, 15 Jan 96 15:13:27 -0800 From: "Mordechai T. 
Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601152313.SAA05450@umbc10.umbc.edu> Subject: Re: New Robot Announcement To: robots@webcrawler.com Date: Mon, 15 Jan 1996 18:13:01 -0500 (EST) In-Reply-To: <1390409387-681890@aktiv.com> from "Larry Burke" at Jan 15, 96 01:38:29 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 999 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "LB" == Larry Burke spake thusly: LB> LB> Name: Duppies (rhymes with puppies) LB> LB> Purpose: Allows website administrator to provide searchable index of LB> their own site as well as other related sites. Has facilities to perform LB> timed updates. Performs several other utility functions. Includes LB> filtering system to limit indexing to files meeting specified criteria. LB> Single program performs robot function, text indexing, and search LB> processing either as a CGI or a stand-alone web server. LB> LB> Important Note: It is our intention to make Duppies available LB> commercially to web administrators. Any comments on this would be LB> welcomed. We feel a large missing part of many web sites is the lack of a LB> site specific index (try finding anything at www.apple.com). Exactly how is this better than Harvest, which is free? -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Naaah, real men don't read docs. 
From owner-robots Mon Jan 15 15:56:55 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28815; Mon, 15 Jan 96 15:56:55 -0800 Subject: Re: New Robot Announcement From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Mon, 15 Jan 1996 15:56:21 -0800 Message-Id: <1390401115-150821641@gco.gov.bc.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Exactly how is this better than Harvest, which is free? Harvest server software currently runs only on UNIX machines. Duppies was designed for the web serving community who either do not know and don't want to know UNIX or do not have a UNIX box available. -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com phone: 604.383.4195 From owner-robots Mon Jan 15 22:09:15 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16221; Mon, 15 Jan 96 22:09:15 -0800 To: robots@webcrawler.com Subject: Re: robots.txt extensions X-Url: http://www.miranova.com/%7Esteve/ References: <199601110038.CAA11256@krisse.www.fi> From: Steven L Baur <steve@miranova.com> Date: 15 Jan 1996 22:06:27 -0800 In-Reply-To: Jaakko Hyvatti's message of 10 Jan 1996 16:38:12 -0800 Message-Id: <m2wx6slllo.fsf@miranova.com> Organization: Miranova Systems, Inc. Lines: 23 X-Mailer: September Gnus v0.26/Emacs 19.30 Mime-Version: 1.0 (generated by tm-edit 7.38) Content-Type: text/plain; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>>>> "Jaakko" == Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> writes: >> Finally -- I never understood why robots.txt was exclusion only. >> Why does it not have some of positive hints added? I.e. you are >> allowed & welcome to browse XXXX/fred.html. Was this a choice >> built upon pragmatism -- thinking that this would open a can of >> worms? I too would like to see something like this. 
Or at least some way of prioritizing pages. Jaakko> I do not believe it is a problem to give robots URLs, they are Jaakko> pretty good at finding them themselves. A little too good sometimes. The problem comes when one has an archive of something like a manual that is under development. I maintain two such archives, and had several robots going through the pages while pieces were being changed actively, while other more static pages were ignored. Regards, -- steve@miranova.com baur From owner-robots Tue Jan 16 01:39:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25968; Tue, 16 Jan 96 01:39:23 -0800 Date: Tue, 16 Jan 1996 09:39:23 GMT From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9601160939.AA05701@kronos> To: robots@webcrawler.com Subject: Re: New Robot Announcement X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > LB> Important Note: It is our intention to make Duppies available > LB> commercially to web administrators. Any comments on this would be > LB> welcomed. We feel a large missing part of many web sites is the lack of a > LB> site specific index (try finding anything at www.apple.com). > > Exactly how is this better than Harvest, which is free? > > -- Or FreeWAIS, SWISH, and a host of others From owner-robots Tue Jan 16 05:54:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08486; Tue, 16 Jan 96 05:54:26 -0800 Date: Tue, 16 Jan 1996 08:54:14 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601161354.IAA05082@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Re: robots.txt extensions In-Reply-To: <m2wx6slllo.fsf@miranova.com> References: <199601110038.CAA11256@krisse.www.fi> <m2wx6slllo.fsf@miranova.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Steven L. Baur writes: A little too good sometimes. 
The problem comes when one has an archive of something like a manual that is under development. I maintain two such archives, and had several robots going through the pages while pieces were being changed actively, while other more static pages were ignored. Hmmm... seems like you need a Disallow: item to keep those pesky robots away from your development tree. I too think a positive hint would be useful. If a robot is well-behaved, it will take some period of time to munch my entire site. I'd like to be able to suggest where it should munch first. Skip Montanaro | Looking for a place to promote your music venue, new CD, skip@calendar.com | festival or next concert tour? Place a focused banner (518)372-5583 | ad in Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Tue Jan 16 09:18:33 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18167; Tue, 16 Jan 96 09:18:33 -0800 Date: Tue, 16 Jan 1996 10:28:59 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601161628.AA06920@tssun5.> To: robots@webcrawler.com Subject: Re: New Robot Announcement X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From owner-robots@webcrawler.com Mon Jan 15 19:28 CST 1996 > Subject: Re: New Robot Announcement > From: Larry Burke <lburke@aktiv.com> > To: <robots@webcrawler.com> > Mime-Version: 1.0 > Date: Mon, 15 Jan 1996 15:56:21 -0800 > > >Exactly how is this better than Harvest, which is free? > > Harvest server software currently runs only on UNIX machines. Duppies was > designed for the web serving community who either do not know and don't > want to know UNIX or do not have a UNIX box available. I was under the impression that most web servers were running on a UNIX box. What else are you going to run a server on? I would argue that NT doesn't have the horsepower, and there aren't a lot of alternatives.
From owner-robots Tue Jan 16 11:18:50 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22816; Tue, 16 Jan 96 11:18:50 -0800 Message-Id: <30FBF9C5.45DD@interworld.com> Date: Tue, 16 Jan 1996 14:17:25 -0500 From: David@interworld.com (David Levine) Organization: InterWorld, Really Cool Stuff Division X-Mailer: Mozilla 2.0b5 (WinNT; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: New Robot Announcement References: <9601161628.AA06920@tssun5.> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ed Carp @ TSSUN5 wrote: > I was under the impression that most web servers were running > on a UNIX box. > What else are you going to run a server on? I would argue > that NT doesn't > have the horsepower, and there aren't a lot of alternatives. NT can be extremely powerful when running on a Dec Alpha. My company provided the software for a server running on such a system which receives approximately 1,000,000 GETs a day. Still pretty fast. David Levine From owner-robots Tue Jan 16 11:42:39 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24167; Tue, 16 Jan 96 11:42:39 -0800 Subject: Re: New Robot Announcement From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Tue, 16 Jan 1996 11:42:17 -0800 Message-Id: <1390329959-155101283@gco.gov.bc.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I was under the impression that most web servers were running on a UNIX box. >What else are you going to run a server on? I would argue that NT doesn't >have the horsepower, and there aren't a lot of alternatives. There are many sites that are run on the Mac OS using either WebStar or MacHTTP. See "http://brad.net/machttp_talk/sites.by.title.html" for a partial list.
I don't mean to be an advocate for the Apple Internet Server products but they are easy to use and plenty powerful for many server applications. And NT products are becoming very respectable as well. Check out "http://www.cc.gatech.edu/gvu/user_surveys/survey-10-1995/graphs/info/which_server.html" if you want some recent statistics on server usage. -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com web: www.aktiv.com phone: 604.383.4195 From owner-robots Tue Jan 16 11:51:08 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24631; Tue, 16 Jan 96 11:51:08 -0800 Date: Tue, 16 Jan 1996 13:05:36 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601161905.AA13813@tssun5.> To: robots@webcrawler.com Subject: Alta Vista searches WHAT?!? X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There has been a concern raised on another list that I belong to, about the privacy implications of robots and such. The specific example was that the Alta Vista web crawler didn't only index linked documents, but any and all documents that it could find at a site! Is this true, and if so, how is it doing it? How does one keep documents private? I sure don't want my personal correspondence sitting out on someone's database just because my home directory happens to be readable!
From owner-robots Tue Jan 16 12:48:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27956; Tue, 16 Jan 96 12:48:21 -0800 Comments: Authenticated sender is <jakob@cybernet.dk> From: "Jakob Faarvang" <jakob@jubii.dk> Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Tue, 16 Jan 96 21:49:19 +0100 (CET) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: New Robot Announcement Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Message-Id: 20491955603076@cybernet.dk X-Info: cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I was under the impression that most web servers were running on a UNIX box. > What else are you going to run a server on? I would argue that NT doesn't > have the horsepower, and there aren't a lot of alternatives. FYI: We run all our stuff on NT. Our web-server currently handles more than 50 virtual domains and more than 50,000 hits per day without complaining. On a 32 mb Pentium 100. But let's not make this an OS war, for heaven's sake. - Jakob Med venlig hilsen Jakob Faarvang Jubii / cybernet.dk -- Jakob Faarvang - jakob@jubii.dk / jakob@cybernet.dk Jubii - hele Danmarks World Wide Web-database http://www.jubii.dk From owner-robots Tue Jan 16 13:24:47 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28843; Tue, 16 Jan 96 13:24:47 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140803ad21b7e9a71c@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 16 Jan 1996 13:25:45 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Alta Vista searches WHAT?!? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 1:05 PM 1/16/96, Ed Carp @ TSSUN5 wrote: >There has been a concern raised on another list that I belong to, about the >privacy implications of robots and such.
The specific example was that the >Alta Vista web crawler didn't only index linked documents, but any and all >documents that it could find at a site! Is this true, and if so, how is it >doing it? What is this Alta-Vista vicious rumour mill stuff on the list recently? :-) If you have a question, ask the robot author, his email address is on the robots page... It would also help if you included the complete referenced article -- I wouldn't be in the least surprised if the person's files in question were in fact reachable from the web, and therefore findable by any browser or robot. >How does one keep documents private? I sure don't want my personal >correspondence sitting out on someone's database just because my home directory >happens to be readable! To protect pages from other people, configure your server to return "access denied" for them... -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Jan 16 13:34:19 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28920; Tue, 16 Jan 96 13:34:19 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140806ad21baf85f1b@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 16 Jan 1996 13:35:17 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: BOUNCE robots: Admin request Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Date: Tue, 16 Jan 96 03:28:31 -0800 From: <owner-robots> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Approved: robbie From s.nisbet@doc.mme.ac.uk Tue Jan 16 03:27:50 1996 Return-Path: <s.nisbet@doc.mme.ac.uk> Received: from ehlana.mmu.ac.uk by webcrawler.com (NX5.67f2/NX3.0M) id AA02048; Tue, 16 Jan 96 03:27:50 -0800 Received: from patsy.doc.aca.mmu.ac.uk by ehlana with SMTP (PP); Tue, 16 Jan 1996 11:26:46 +0100 Received: from
raphael.doc.aca.mmu.ac.uk by patsy.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA20450; Tue, 16 Jan 96 11:26:28 GMT Received: from jd-e114-07.doc.aca.mmu.ac.uk by raphael.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA02693; Tue, 16 Jan 96 11:26:39 GMT Date: Tue, 16 Jan 96 11:26:38 GMT Message-Id: <9601161126.AA02693@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <s.nisbet@doc.mme.ac.uk> Subject: Re: Horror story Maybe I'm being a little touchy, but I take exception to being 'educated'. I subscribe to the robots line and others, because I'm interested and because I know my stuff and want to know where things are going. A great many 'Web Masters' do a lot more than run webs, which in turn require a lot of effort and a lot of separate tasks. I suspect, as Mordechai T. Abzug points out, that the majority of sites don't need a robots.txt and maybe are not that interested in robots. Think about it, they already have a lot on their plates with the admin of their respective webs as it is. Steve Nisbet Web Admin (and other web related stuff!!) Department of Computing and sub webs Manchester Metro Uni.
http://www.doc.mmu.ac.uk/ -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Jan 16 13:50:00 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29112; Tue, 16 Jan 96 13:50:00 -0800 Message-Id: <199601162149.QAA26754@revere.musc.edu> Comments: Authenticated sender is <lindroth@atrium.musc.edu> From: "John Lindroth" <lindroth@musc.edu> Organization: Medical University of South Carolina To: robots@webcrawler.com Date: Tue, 16 Jan 1996 16:49:50 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: New Robot Announcement Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > >Exactly how is this better than Harvest, which is free? > > > > Harvest server software currently runs only on UNIX machines. Duppies was > > designed for the web serving community who either do not know and don't > > want to know UNIX or do not have a UNIX box available. > > I was under the impression that most web servers were running on a UNIX box. > What else are you going to run a server on? I would argue that NT doesn't > have the horsepower, and there aren't a lot of alternatives. Larry's original post stated that the robot would run under the MacOS. While our main server is on a unix workstation, many of our departments run on Macs. And with each department's info distributed on its own mac server, no single system gets a lot of hits. I can't say that I think that the Mac is a great platform to run a server, but I think they have identified a niche market that just might work. 
MHO, -John Lindroth MUSC Web Master ============================================= John Lindroth Senior Systems Programmer Academic & Research Computing Services Center for Computing & Information Technology Medical University of South Carolina E-Mail: lindroth@musc.edu URL: http://www.musc.edu/~lindroth ============================================= Any opinions expressed are mine, not my employer's. And they may be wrong (gasp!) ============================================= From owner-robots Tue Jan 16 14:55:36 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00539; Tue, 16 Jan 96 14:55:36 -0800 From: <monier@pa.dec.com> Message-Id: <9601162247.AA00329@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? In-Reply-To: Your message of "Tue, 16 Jan 96 13:05:36 CST." <9601161905.AA13813@tssun5.> Date: Tue, 16 Jan 96 14:47:51 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hum, one more time. Scooter, the robot behind Alta Vista, follows links, and only follows links. If the "directory browsing" option is enabled on a server, and someone publishes the URL for a directory, then the robot gets back a page of HTML which lists every file as a link, but that is not intentional. And yes, this has led to embarrassing situations, but again, it's not intentional. In the absence of strong conventions about directory names or file extensions it is hard for a robot to exclude anything a priori. I wish it was easier... To keep a document private, list it in /robots.txt, password-protect it, change the protection on the file, or simpler: do not leave it in your Web hierarchy. Can you imagine what happens when someone uses / as web root, exposing for example the password file? It has happened! Remember that what a robot does, anyone with a browser can do: find this private file and then post it to usenet, for example; robots have no magic powers! 
The bottom line is that the usual danger is not aggressive robots, but clueless Web masters. --Louis From owner-robots Tue Jan 16 15:02:51 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00626; Tue, 16 Jan 96 15:02:51 -0800 Message-Id: <30FC2E66.77FC@corp.micrognosis.com> Date: Tue, 16 Jan 1996 18:01:58 -0500 From: Adam Jack <ajack@corp.micrognosis.com> Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? References: <9601161905.AA13813@tssun5.> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ed Carp @ TSSUN5 wrote: > > There has been a concern raised [...] > Alta Vista [..] I think all on the list ought reply for Louis :) :) Adam P.S. Ed, That was raised here a week ago. It was a *mistake* -- Alta Vista doesn't access other than is linked. -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? From owner-robots Tue Jan 16 15:04:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00673; Tue, 16 Jan 96 15:04:11 -0800 Date: Wed, 17 Jan 96 00:00:01 +0100 Message-Id: <9601162300.AA03865@indy2> X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ<V^N`hYM;pD#v&!` _A?V7^R! X-Url: http://www-ihm.lri.fr/~tronche/ From: "Tronche Ch. le pitre" <Christophe.Tronche@lri.fr> To: robots@webcrawler.com In-Reply-To: <9601161905.AA13813@tssun5.> (ecarp@tssun5.dsccc.com) Subject: Re: Alta Vista searches WHAT?!? 
Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > There has been a concern raised on another list that I belong to, about the > privacy implications of robots and such. The specific example was that the > Alta Vista web crawler didn't only index linked documents, but any and all > documents that it could find at a site! Is this true, and if so, how is it > doing it? How does one keep documents private? I sure don't want my personal > correspondence sitting out on someone's database just because my home directory > happens to be readable! Not only your home directory, but also your mail directory. And all of them are readable by anybody at your site (or at least by your HTTP server). Just stay cool. Alta Vista or any other robots will certainly not access data that couldn't be accessed by another mean, this is just a classical security issue. Speaking about privacy, I feel more concerned by being cross-indexed in multiple robots-built databases. For example, we may suppose that, after some years of a career, you've left behind you some data about you in many of the organizations you've worked for. A robot could collect all these data to create a file about you. Of course, none of these infos may be very "sensitive", but, from some kind of "holistic" point of view, their gathering would permit to infer some interesting properties about yourself... May be... By the way, if every data in the world become available on the World Wide Web, such as dictionaries, encyclopedia, personal files, and so on, the Web may become MUCH LARGER than it's now. Have we any evidence the index databases will be able to scale to this extent ? 
+--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From owner-robots Tue Jan 16 17:31:42 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03549; Tue, 16 Jan 96 17:31:42 -0800 Message-Id: <9601170131.AA03543@webcrawler.com> Date: Tue, 16 Jan 1996 17:30:00 -0800 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: RE: Alta Vista searches WHAT?!? To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com If you put your files in a file system area that the Web server has access to then they will get picked up. Robots cannot find things that they cannot see. It comes down to site security, publish what you desire the world to see and hide the rest. Ted Sullivan ---------- From: robots To: robots Subject: Alta Vista searches WHAT?!? Date: Wednesday, January 17, 1996 2:53PM There has been a concern raised on another list that I belong to, about the privacy implications of robots and such. The specific example was that the Alta Vista web crawler didn't only index linked documents, but any and all documents that it could find at a site! Is this true, and if so, how is it doing it? How does one keep documents private? I sure don't want my personal correspondence sitting out on someone's database just because my home directory happens to be readable! 
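A minimal sketch of the point the replies in this thread keep returning to: /robots.txt only asks well-behaved robots to stay away from a path, it does not restrict access to it. The example uses Python's standard urllib.robotparser module. The robots.txt content, the host www.example.com, the paths, and the robot name ExampleBot are assumptions made up for illustration; they are not taken from any message on the list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a polite robot named `agent` may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/mail.txt", "/cgi-bin/search"):
        print(path, allowed("http://www.example.com" + path))
```

The check runs entirely on the robot's side. A robot that skips it, or a person with a browser, can still retrieve /private/mail.txt unless the server itself denies the request, which is why the posters recommend access restrictions (or keeping the file out of the Web hierarchy) for anything that is meant to stay private.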
From owner-robots Tue Jan 16 19:05:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06396; Tue, 16 Jan 96 19:05:29 -0800 Message-Id: <v0213050cad221644aca5@[202.237.148.6]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 17 Jan 1996 12:00:59 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Alta Vista searches WHAT?!? Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 0:00 AM 1/17/96, Tronche Ch. le pitre wrote: >Speaking about privacy, I feel more concerned by being cross-indexed >in multiple robots-built databases. For example, we may suppose that, >after some years of a career, you've left behind you some data about >you in many of the organizations you've worked for. A robot could >collect all these data to create a file about you. Of course, none of >these infos may be very "sensitive", but, from some kind of "holistic" >point of view, their gathering would permit to infer some interesting >properties about yourself... May be... This is really the issue, I think, but the main problem is Usenet archives and mailing list archives, not indexing normal web pages. HTML documents tend to disappear, and presumably they would be eliminated from the robots index eventually. Archives tend to be permanent, and the participants in newsgroups and especially mailing lists are often not aware they're writing for posterity. 
--Mark From owner-robots Wed Jan 17 00:17:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19484; Wed, 17 Jan 96 00:17:37 -0800 Message-Id: <199601170817.JAA05933@storm.certix.fr> Comments: Authenticated sender is <savron@world-net.sct.fr> From: savron@world-net.sct.fr To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:07:02 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: robots.txt , authors of robots , webmasters .... Priority: normal X-Mailer: Pegasus Mail for Windows (v2.10) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A few thoughts about the robots stuff : -- there should be no need to include a line such as : /cgi-bin/ in robots.txt because it should come as a standard of indexer robots The one exception I see is an automated query of search engines . -- Webmasters complaining about robots indexing partially built document trees . So why are they linked to the main tree ??? -- I agree with the proposed 'positive' extension of robots.txt to include 'these pages should score more than the others of my site' -- I don't understand why , if a web site is publicly accessible it shouldn't be indexable and so why there is a need for such a thing as robots.txt . -- Correct me if I'm wrong on this : If webmasters want to reserve access to certain pages to certain specific users they can do it , without needing to passwording it , by giving the pages names to these users and not linking them to the main tree . 
As robots follows the links they find and can't guess ( well , if you don't choose an obvious page name ) ( snoopers sort of robots ) you are pretty safe ( and if you really need it -- setup a password query form ( only a partial tree is reserved ) -- choose another port than 80 and password it too ( in case of a http port scanner sort of robot ) -- Why in the HTTP protocol there is not such an info about the required delay between to successive queries to the same server ( see the webmasters complaining about rapid fire queries from robots ) that the webserver should send in the header of each answer . If anyone wants to comment on this , I will be pleased to hear his opinion Bye Bye From owner-robots Wed Jan 17 00:17:38 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19488; Wed, 17 Jan 96 00:17:38 -0800 Message-Id: <199601170817.JAA05940@storm.certix.fr> Comments: Authenticated sender is <savron@world-net.sct.fr> From: savron@world-net.sct.fr To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:07:03 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Web robots and gopher space -- two separate worlds Priority: normal X-Mailer: Pegasus Mail for Windows (v2.10) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Why web robots doesn't follow gopher links when they step on one ? If anyone wants to comment , especially web robot authors , feel free Thanks a lot From owner-robots Wed Jan 17 02:22:58 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25105; Wed, 17 Jan 96 02:22:58 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601171023.LAA03915@wsinis10.win.tue.nl> Subject: Re: Alta Vista searches WHAT?!? 
To: robots@webcrawler.com Date: Wed, 17 Jan 1996 11:23:27 +0100 (MET) In-Reply-To: <9601161905.AA13813@tssun5.> from "Ed Carp @ TSSUN5" at Jan 16, 96 01:05:36 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 3747 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Ed Carp @ TSSUN5) write: > >There has been a concern raised on another list that I belong to, about the >privacy implications of robots and such. >The specific example was that the >Alta Vista web crawler didn't only index linked documents, but any and all >documents that it could find at a site! Did you also get the messages in which the author explained that this isn't true? >Is this true, and if so, how is it doing it? How does one keep documents >private? I sure don't want my personal correspondence sitting out on >someone's database just because my home directory happens to be readable! I have a big problem with your phrase 'happens to be'. There have been more discussions like this, in which people were quite happy to make a bunch of documents available without restriction, except to indexers. Their main idea was that it is common practice to keep documents 'out of sight' without actually indicating access restrictions explicitly. I think this is plainly wrong. On Unix, if you want to indicate who is allowed access to your files, you use file permissions. If a certain file of mine is world readable, the implication is that I, the author, intentionally allow the rest of the world to read my file. (Here, 'the world' means any user with access to the file system.) I have, occasionally, browsed other people's directories and found stuff that wasn't intended for me to be read; I always assumed a mistake on their part, and decided not to read on, as a matter of courtesy. But the mistake was theirs. The same principle has always been assumed on the Internet, I guess. 
If you serve files off a WWW server without access restrictions, you intend to make them available to the rest of the world. There is no way of knowing the purpose of the accesses you get for your documents: it may be an individual user, a WWW indexer, or a secret program operated by the FBI/Mossad/KGB/whoever to scan for suspect activities. It's the access permissions that specify your intentions, not the existence of explicit references to the files, or the set of users you have told the URLs to your site explicitly, or anything else. In my opinion, it's a mistake to accuse robots of malicious behaviour when all they do is find files that have been made available to them. robots.txt should be regarded as a service to robots, a way of saying: don't bother to index this, the results won't justify the load it will place on the network and on my system. To honour this is a matter of courtesy. If you don't want robots to get access to your documents at all, then set proper access restrictions on the documents themselves. The only problem I see is that 'the world' is not the same for everybody. For example, suppose user A wants all files to be readable for all other users on the system. To user A, 'the world' is all users on the system. User A makes all files world readable. Now suppose that user B runs a WWW server, making all files on the system available to the whole Internet. (User B will think twice before doing this on purpose, but it may be a configuration error.) Suddenly, user A's files have become available to the whole Internet community. Suppose that user C (a WWW indexer) finds user A's files. It is unreasonable for user A to blame C, when B is at fault. Obviously, there must be a way for A to correct the problem, and get the files removed from C's index. This is possible in most WWW indexers. But if A is indignant at the mere fact that C found his files, s/he's barking up the wrong tree. -- Reinier Post reinpost@win.tue.nl a.k.a. 
<A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From owner-robots Wed Jan 17 02:54:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26288; Wed, 17 Jan 96 02:54:11 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601171054.LAA04028@wsinis10.win.tue.nl> Subject: Re: robots.txt , authors of robots , webmasters .... To: robots@webcrawler.com Date: Wed, 17 Jan 1996 11:54:42 +0100 (MET) In-Reply-To: <199601170817.JAA05933@storm.certix.fr> from "savron@world-net.sct.fr" at Jan 17, 96 09:07:02 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 2079 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (savron@world-net.sct.fr) write: > >A few thoughts about the robots stuff : > >-- there should be no need to include a line such as : > /cgi-bin/ > in robots.txt > because it should come as a standard of indexer robots That would be a kludge. It doesn't identify CGI scripts exactly (I do not usually include /cgi-bin/ in references to my CGI scripts) and it is not necessary to exclude CGI scripts categorically (I sometimes serve a set of files through a CGI script). Furthermore, better heuristics exist (e.g. don't follow forms/POST requests). >-- Webmasters complaining about robots indexing partially built >document trees . So why are they linked to the main tree ??? Well, it would help if WWW servers took more pains to send accurate Expires: and Last-modified: headers. >-- I agree with the proposed 'positive' extension of robots.txt to >include 'these pages should score more than the others of my site' Perhaps, but once you're on that road, ALIWEB may be a better approach. >-- I don't understand why , if a web site is publicly accessible it >shouldn't be indexable and so why there is a need for such a thing as >robots.txt . 
Neither do I (see separate message). >-- Correct me if I'm wrong on this : If webmasters want to reserve >access to certain pages to certain specific users they can do it , >without needing to passwording it , by giving the pages names to >these users and not linking them to the main tree . Wrong (see that message): third parties have the right to poke for URLs, IMHO. Access restriction (password-based or otherwise) will do the job. >-- Why in the HTTP protocol there is not such an info about the >required delay between to successive queries to the same server ( see >the webmasters complaining about rapid fire queries from robots ) >that the webserver should send in the header of each answer . There is an HTTP response meaning "please don't return for a while, I'm busy". http://www.w3.org/pub/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html#Code503 -- Reinier Post reinpost@win.tue.nl From owner-robots Wed Jan 17 06:43:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05128; Wed, 17 Jan 96 06:43:28 -0800 Message-Id: <m0tcZ2i-0009mqC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: robots.txt , authors of robots , webmasters .... To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:40:51 -0500 (EST) In-Reply-To: <199601170817.JAA05933@storm.certix.fr> from "savron@world-net.sct.fr" at Jan 17, 96 09:07:02 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 1801 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com savron@world-net.sct.fr wrote: > > A few thoughts about the robots stuff : > > -- there should be no need to include a line such as : > /cgi-bin/ > in robots.txt > because it should come as a standard of indexer robots > > The one exception I see is an automated query of search engines . > > -- Webmasters complaining about robots indexing partially built > document trees . So why are they linked to the main tree ??? 
> > -- I agree with the proposed 'positive' extension of robots.txt to > include 'these pages should score more than the others of my site' > > -- I don't understand why , if a web site is publicly accessible it > shouldn't be indexable and so why there is a need for such a thing as > robots.txt . > > -- Correct me if I'm wrong on this : If webmasters want to reserve > access to certain pages to certain specific users they can do it , > without needing to passwording it , by giving the pages names to > these users and not linking them to the main tree . > As robots follows the links they find and can't guess ( well , if you > don't choose an obvious page name ) ( snoopers sort of robots ) you > are pretty safe ( and if you really need it > > -- setup a password query form ( only a partial tree is reserved ) > -- choose another port than 80 and password it too ( in case of a > http port scanner sort of robot ) > > -- Why in the HTTP protocol there is not such an info about the > required delay between to successive queries to the same server ( see > the webmasters complaining about rapid fire queries from robots ) > that the webserver should send in the header of each answer . 
> > If anyone wants to comment on this , I will be pleased to hear his > opinion > > Bye Bye > > Please take me off of your list -- From owner-robots Wed Jan 17 06:44:03 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05157; Wed, 17 Jan 96 06:44:03 -0800 Message-Id: <m0tcZ3K-0009mqC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: Web robots and gopher space -- two separate worlds To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:41:30 -0500 (EST) In-Reply-To: <199601170817.JAA05940@storm.certix.fr> from "savron@world-net.sct.fr" at Jan 17, 96 09:07:03 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 237 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com savron@world-net.sct.fr wrote: > > Why web robots doesn't follow gopher links when they step on one ? > > If anyone wants to comment , especially web robot authors , feel free > > Thanks a lot > Please take me off of your list -- From owner-robots Wed Jan 17 06:45:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05226; Wed, 17 Jan 96 06:45:26 -0800 Message-Id: <m0tcZ4X-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: Alta Vista searches WHAT?!? To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:42:45 -0500 (EST) In-Reply-To: <199601171023.LAA03915@wsinis10.win.tue.nl> from "Reinier Post" at Jan 17, 96 11:23:27 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 3958 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reinier Post wrote: > > You (Ed Carp @ TSSUN5) write: > > > >There has been a concern raised on another list that I belong to, about the > >privacy implications of robots and such. > > >The specific example was that the > >Alta Vista web crawler didn't only index linked documents, but any and all > >documents that it could find at a site! 
> > Did you also get the messages in which the author explained that > this isn't true? > > >Is this true, and if so, how is it doing it? How does one keep documents > >private? I sure don't want my personal correspondence sitting out on > >someone's database just because my home directory happens to be readable! > > I have a big problem with your phrase 'happens to be'. > > There have been more discussions like this, in which people were quite happy > to make a bunch of documents available without restriction, except to indexers. > Their main idea was that it is common practice to keep documents 'out of > sight' without actually indicating access restrictions explicitly. I think > this is plainly wrong. On Unix, if you want to indicate who is allowed access > to your files, you use file permissions. If a certain file of mine is world > readable, the implication is that I, the author, intentionally allow the rest > of the world to read my file. (Here, 'the world' means any user with access > to the file system.) I have, occasionally, browsed other people's directories > and found stuff that wasn't intended for me to be read; I always assumed a > mistake on their part, and decided not to read on, as a matter of courtesy. > But the mistake was theirs. > > The same principle has always been assumed on the Internet, I guess. > Iif you serve files off a WWW server without access restrictions, > you intend to make them available to the rest of the world. > There is no way of knowing the purpose of the accesses you get for your > documents: it may be an individual user, a WWW indexer, or a secret program > operated by the FBI/Mossad/KGB/whoever to scan for suspect activities. > > It's the access permissions that specify your intentions, not the existence > of explicit references to the files, or the set of users you have told > the URLs to your site explicitly, or anything else. 
> > In my opinion, it's a mistake to accuse robots of malicious behaviour > when all they do is find files that have been made available to them. > > robots.txt should be regarded as a service to robots, a way of saying: > don't bother to index this, the results won't justify the load it will > place on the network and on my system. To honour this is a matter of > courtesy. If you don't want robots to get access to your documents at > all, then set proper access restrictions on the documents themselves. > > The only problem I see is that 'the world' is not the same for everybody. > > For example, suppose user A wants all files to be readable for all > other users on the system. To user A, 'the world' is all users > on the system. User A makes all files world readable. > > Now suppose that user B runs a WWW server, making all files on the system > available to the whole Internet. (User B will think twice before doing this > on purpose, but it may be a configuration error.) Suddenly, user A's files > have become available to the whole Internet community. Suppose that user C > (a WWW indexer) finds user A's files. It is unreasonable for user A to > blame C, when B is at fault. Obviously, there must be a way for A to correct > the problem, and get the files removed from C's index. This is possible in > most WWW indexers. But if A is indignant at the mere fact that C found his > files, s/he's barking up the wrong tree. > > -- > Reinier Post reinpost@win.tue.nl > a.k.a. 
<A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> > Please take me off of your list -- From owner-robots Wed Jan 17 06:47:13 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05366; Wed, 17 Jan 96 06:47:13 -0800 Message-Id: <m0tcZ6N-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: robots.txt , authors of robots , webmasters .... To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:44:39 -0500 (EST) In-Reply-To: <199601171054.LAA04028@wsinis10.win.tue.nl> from "Reinier Post" at Jan 17, 96 11:54:42 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 2246 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reinier Post wrote: > > You (savron@world-net.sct.fr) write: > > > >A few thoughts about the robots stuff : > > > >-- there should be no need to include a line such as : > > /cgi-bin/ > > in robots.txt > > because it should come as a standard of indexer robots > > 
That would be a kludge. It doesn't identify CGI scripts exactly > (I do not usually include /cgi-bin/ in references to my CGI scripts) > and it is not necessary tp exclude CGI scripts categorically > (I sometimes serve a set of files through a CGI script). Furthermore, > netter heuristics exist (eg. don't follow forms/POST requests). > > >-- Webmasters complaining about robots indexing partially built > >document trees . So why are they linked to the main tree ??? > > Well, it would help if WWW servers took more pains to send accurate > Expires: and Last-modified: headers. > > >-- I agree with the proposed 'positive' extension of robots.txt to > >include 'these pages should score more than the others of my site' > > Perhaps, but once you're on that road, ALIWEB may be a better approach. > > >-- I don't understand why , if a web site is publicly accessible it > >shouldn't be indexable and so why there is a need for such a thing as > >robots.txt . > > Neither do I (see separate message). > > >-- Correct me if I'm wrong on this : If webmasters want to reserve > >access to certain pages to certain specific users they can do it , > >without needing to passwording it , by giving the pages names to > >these users and not linking them to the main tree . > > Wrong (see that message): third parties have the right to poke for URLs, IMHO. > Access restriction (password-based or otherwise) will do the job. > > >-- Why in the HTTP protocol there is not such an info about the > >required delay between to successive queries to the same server ( see > >the webmasters complaining about rapid fire queries from robots ) > >that the webserver should send in the header of each answer . > > There is an HTTP response meaning "please don't return for a while, I'm busy". 
> > http://www.w3.org/pub/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html#Code503 > > -- > Reinier Post reinpost@win.tue.nl > Please take me off of your list -- From owner-robots Wed Jan 17 06:48:47 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05454; Wed, 17 Jan 96 06:48:47 -0800 Message-Id: <m0tcZ1Q-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: Alta Vista searches WHAT?!? To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:39:32 -0500 (EST) In-Reply-To: <v0213050cad221644aca5@[202.237.148.6]> from "Mark Schrimsher" at Jan 17, 96 12:00:59 pm X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 1088 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark Schrimsher wrote: > > At 0:00 AM 1/17/96, Tronche Ch. le pitre wrote: > >Speaking about privacy, I feel more concerned by being cross-indexed > >in multiple robots-built databases. For example, we may suppose that, > >after some years of a career, you've left behind you some data about > >you in many of the organizations you've worked for. A robot could > >collect all these data to create a file about you. Of course, none of > >these infos may be very "sensitive", but, from some kind of "holistic" > >point of view, their gathering would permit to infer some interesting > >properties about yourself... May be... > > This is really the issue, I think, but the main problem is Usenet archives > and mailing list archives, not indexing normal web pages. HTML documents > tend to disappear, and presumably they would be eliminated from the robots > index eventually. Archives tend to be permanent, and the participants in > newsgroups and especially mailing lists are often not aware they're writing > for posterity. 
> > --Mark > Please take me off you list of mail> > -- From owner-robots Wed Jan 17 08:22:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10530; Wed, 17 Jan 96 08:22:21 -0800 Date: Wed, 17 Jan 1996 09:35:00 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601171535.AA07837@tssun5.> To: robots@webcrawler.com Subject: Re: New Robot Announcement X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From owner-robots@webcrawler.com Tue Jan 16 17:55 CST 1996 > Date: Tue, 16 Jan 1996 14:17:25 -0500 > From: David@interworld.com (David Levine) > Mime-Version: 1.0 > To: robots@webcrawler.com > Subject: Re: New Robot Announcement > Content-Transfer-Encoding: 7bit > > Ed Carp @ TSSUN5 wrote: > > I was under the impression that most web servers were running > > on a UNIX box. > > What else are you going to run a server on? I would argue > > that NT doesn't > > have the horsepower, and tehre aren't a lot of alternatives. > > > NT can be extremely powerful when running on a Dec Alpha. So can linux ;) From owner-robots Wed Jan 17 10:47:41 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19080; Wed, 17 Jan 96 10:47:41 -0800 Message-Id: <199601171847.KAA05176@mir.cs.washington.edu> In-Reply-To: reinpost@win.tue.nl's message of Wed, 17 Jan 1996 11:23:27 +0100 (MET) To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? References: <199601171023.LAA03915@wsinis10.win.tue.nl> Date: Wed, 17 Jan 1996 10:47:36 PST From: Erik Selberg <speed@cs.washington.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Here's a slightly different tack --- While I think that the /robots.txt is very nice, I don't think it's a worthwhile, or even workable, solution to the Idiot's Security Problem. 
The Idiot's Security Problem: this is when an idiot I puts some private data P on the Web but attempts to keep them private, by having either a subtle link somewhere or none at all. Later, a robot R finds data P and puts it in some database D. Now, the /robots.txt won't do a bit of good here. Why? Because (a) robots don't have to support the robots.txt file, and (b) because the goal is to keep said data _private_ from everyone, not just robots. The problem is that users feel that hiding data is a good solution to security. Robots just publicly announce that security of that form is bogus. The issue people have with robots I think is bogus; what they should be addressing is that there needs to be a better form of protection on the Web, or at least a more intuitive method of setting access control lists than the funky .htaccess file stuff (or at least a better UI!). -Erik From owner-robots Wed Jan 17 10:48:19 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19145; Wed, 17 Jan 96 10:48:19 -0800 Date: Wed, 17 Jan 1996 10:59:43 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... In-Reply-To: <199601171054.LAA04028@wsinis10.win.tue.nl> Message-Id: <Pine.LNX.3.91.960117104601.18185A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Wed, 17 Jan 1996, Reinier Post wrote: > You (savron@world-net.sct.fr) write: > > > >A few thoughts about the robots stuff : > > > >-- there should be no need to include a line such as : > > /cgi-bin/ > > in robots.txt > > because it should come as a standard of indexer robots > > That would be a kludge. 
It doesn't identify CGI scripts exactly > (I do not usually include /cgi-bin/ in references to my CGI scripts) > and it is not necessary to exclude CGI scripts categorically > (I sometimes serve a set of files through a CGI script). Furthermore, > better heuristics exist (e.g. don't follow forms/POST requests). And then you risk falling down rat holes like Usenet archives. I have *over* 100,000 archived Usenet articles online on the Web via my Usenet-Web software. The links are all GET to facilitate bookmarking. Now - I know enough to have a robots.txt file blocking that tree from indexing. But many of the people who have downloaded my software (many hundreds of people) are unlikely to use robots.txt. But since the installation instructions will generally lead people to put the script in /cgi-bin/ - a smart indexer will avoid it because /cgi-bin/ is dangerous in general to index. It is very wise in general to avoid all links that match any of these regexes: \.pl$ \.cgi$ \?.*$ cgi-bin -- Benjamin Franz From owner-robots Wed Jan 17 12:18:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24747; Wed, 17 Jan 96 12:18:26 -0800 Message-Id: <9601172019.AA05400@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Wed, 17 Jan 96 14:20:11 -0600 To: robots@webcrawler.com Subject: Re: New Robot Announcement References: <9601171535.AA07837@tssun5.> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Begin forwarded message: > > Date: Wed, 17 Jan 1996 09:35:00 -0600 > From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) > To: robots@webcrawler.com > Subject: Re: New Robot Announcement > X-Sun-Charset: US-ASCII > Sender: owner-robots@webcrawler.com > Reply-To: robots@webcrawler.com > > > > From owner-robots@webcrawler.com Tue Jan 16 17:55 CST 1996 > > Date: Tue, 16 Jan 1996 14:17:25 -0500 > > From: David@interworld.com (David Levine) > >
Mime-Version: 1.0 > > To: robots@webcrawler.com > > Subject: Re: New Robot Announcement > > Content-Transfer-Encoding: 7bit > > > > Ed Carp @ TSSUN5 wrote: > > > I was under the impression that most web servers were running > > > on a UNIX box. > > > What else are you going to run a server on? I would argue > > > that NT doesn't > > > have the horsepower, and there aren't a lot of alternatives. > > > > > > NT can be extremely powerful when running on a Dec Alpha. > > So can linux ;) > So can a Vic 20.. But that's not the real problem, is it? Hell, any machine can be real powerful if you let it. Half of you would laugh if I told you we were running our Webserver on an Apple Quadra 950 running A/UX - Apple's version of UNIX - but then again it does have 80 megs of RAM, which makes it hum. From owner-robots Wed Jan 17 13:20:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28380; Wed, 17 Jan 96 13:20:25 -0800 Date: Wed, 17 Jan 96 16:19:00 EST From: "Jim Meritt" <jmeritt@smtpinet.aspensys.com> Message-Id: <9600178219.AA821926152@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Re: BOUNCE robots: Admin request Content-Length: 2245 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com So maybe those particular folks shouldn't get on mailing lists talking about it?
Jim ______________________________ Reply Separator _________________________________ Subject: BOUNCE robots: Admin request Author: robots@webcrawler.com at SMTPINET Date: 1/16/96 7:24 PM Date: Tue, 16 Jan 96 03:28:31 -0800 From: <owner-robots> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Approved: robbie From s.nisbet@doc.mme.ac.uk Tue Jan 16 03:27:50 1996 Return-Path: <s.nisbet@doc.mme.ac.uk> Received: from ehlana.mmu.ac.uk by webcrawler.com (NX5.67f2/NX3.0M) id AA02048; Tue, 16 Jan 96 03:27:50 -0800 Received: from patsy.doc.aca.mmu.ac.uk by ehlana with SMTP (PP); Tue, 16 Jan 1996 11:26:46 +0100 Received: from raphael.doc.aca.mmu.ac.uk by patsy.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA20450; Tue, 16 Jan 96 11:26:28 GMT Received: from jd-e114-07.doc.aca.mmu.ac.uk by raphael.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA02693; Tue, 16 Jan 96 11:26:39 GMT Date: Tue, 16 Jan 96 11:26:38 GMT Message-Id: <9601161126.AA02693@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <s.nisbet@doc.mme.ac.uk> Subject: Re: Horror story Maybe I'm being a little touchy, but I take exception to being 'educated'. I subscribe to the robots line and others, because I'm interested and because I know my stuff and want to know where things are going. A great many 'Web Masters' do a lot more than run webs, which in turn require a lot of effort and a lot of separate tasks. I suspect, as Mordechai T. Abzug points out, that the majority of sites don't need a robots.txt and maybe are not that interested in robots. Think about it: they already have a lot on their plates with the admin of their respective webs as it is. Steve Nisbet Web Admin (and other web related stuff!!) Department of Computing and sub webs Manchester Metro Uni.
http://www.doc.mmu.ac.uk/ -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Thu Jan 18 00:24:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09158; Thu, 18 Jan 96 00:24:37 -0800 To: robots@webcrawler.com, w3-search@rodem.slab.ntt.com, NCGUR@uccmvsa.ucop.edu, www-vrml@wired.com Cc: amf@pdp.crl.sony.co.jp Subject: [ANNOUNCE] CFP: AAAI-96 WS on Internet-based Information Systems Date: Thu, 18 Jan 96 17:23:19 +0900 Message-Id: <8724.821953414@orange> From: Alexander Franz <amf@pdp.crl.sony.co.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Call for Papers (brief version) AAAI-96 Workshop on Internet-based Information Systems August 4 or 5, Portland, Oregon The purpose of this workshop is to examine the state of the art, and explore the future, of network-based systems for browsing, searching, and sharing information in text and other forms. The focus will be on interactivity and Artificial Intelligence techniques. We solicit submissions relevant to these areas. Electronic submissions are due by March 18, 1996. For full details, please see the workshop home page: http://www.cs.cmu.edu/~amf/iis96.html From owner-robots Thu Jan 18 06:24:32 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26215; Thu, 18 Jan 96 06:24:32 -0800 Date: Thu, 18 Jan 1996 10:36:33 GMT Message-Id: <199601181036.KAA06151@admin.nj.devry.edu> X-Sender: bsran@admin.nj.devry.edu X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: bsran@admin.nj.devry.edu (Bhupinder S. Sran) Subject: Robot Research Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi. I am looking for materials to help me compare the various search engines on the web as a part of my research as a Ph.D student in Information Management. 
I have got plenty of information from the links provided on the search pages. I would appreciate it if you could point me to any research on the search engines or to a source where I can get more information about how each engine works (e.g., how does it index the documents, how does it rank them, what is the theoretical basis for the search engine, etc.) Bhupinder S. Sran :) :> :-) :> :) :-> :} :] :-) :> :} :> :-) :) :> :) :} :-) :) :> :) :> SMILE! It makes everyone wonder what you are up to :) Bhupinder S. Sran, Professor, CIS Department DeVry Technical Institute, Woodbridge, NJ 07095 Email: bsran@admin.nj.devry.edu Phone: 908-634-3460 Home Page: http://admin.nj.devry.edu/~bsran :) :> :-) :> :) :-> :} :] :-) :> :} :> :-) :) :> :) :} :-) :) :> :) :> From owner-robots Thu Jan 18 08:24:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04100; Thu, 18 Jan 96 08:24:21 -0800 Message-Id: <199601181623.LAA12548@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com Date: Thu, 18 Jan 1996 11:20:15 -0400 Subject: Re: robots.txt , authors of robots , webmasters .... Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On 17 Jan 96 at 11:54, Reinier Post wrote: > Wrong (see that message): third parties have the right to poke for URLs, IMHO. > Access restriction (password-based or otherwise) will do the job. Quite frankly, I am surprised and extremely dismayed at this comment. Previously, I wrongly accused Alta-Vista of indexing pages that I had no interest in having indexed. It turned out that rather than poking each TCP port for an HTTP server, Alta-Vista actually did what every other 'bot does and follows all the links it can find.
I spent some tube-time sleuthing and discovered that the pages were indeed referenced from other, generally accessible pages. I now believe my indignation at the possibility of this port-poking behavior was based on two separate considerations: 1. that the poking of ports would impose an unwelcome burden on my servers, and 2. that there are indeed pages I would not like to publish broadly that are nonetheless available behind ports I don't share with others. Having put the first issue to rest, it is now this second idea that attracts my attention. Where did we get the idea that just because a thing is accessible, that that gives us the moral right to access it, perhaps against the interests of its owner? In another message, Reinier states his belief that if a user makes the mistake of exposing his home directory to the web, that we (as robot owners) can index anything we find there with impunity; that the error is on the part of the web-master and not on the part of the robot's designer. Let me see if I understand Reinier's point and can perhaps state it another way: If I leave my house unlocked, I have given my permission for any and all to come in and read my personal papers. Does this strike anyone else as somewhat absurd? In our enthusiasm to become the cartographers of this new region of the information universe, do we not run the risk of violating the privacy of the indigenous peoples we find there? I believe that this "-WE- are the most comprehensive index of cyberspace" mentality is very dangerous and suggests a kind of information vigilantism that I find personally distasteful. Perhaps what is really needed is a reevaluation of the role of the robots.txt file. If we take the stance, as I believe we should, that the decision to be indexed belongs in the hands of the owner of the data, not in the mechanical claws of wild roving robots, the robots.txt file should become a source of permission, not exclusion from indexing.
And most importantly, that the expectation should be one of privacy, not exposure. In other words, we should not index a web-site if there is no robots.txt file to be retrieved that gives explicit permission to do so. Do any others feel as I do that control over use of my information is my responsibility and mine alone? That the assumption should be not to index a site that has not explicitly given permission to be indexed? (I don't expect much agreement here, to be honest. But I thought I would ask.) It should be noted that there is a fairly strong case to be made that a robot threshing through a non-published web site is an illegal activity under the abuse of computing facilities statute in U.S. law. </rr> From owner-robots Thu Jan 18 08:52:34 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06048; Thu, 18 Jan 96 08:52:34 -0800 Date: Thu, 18 Jan 1996 09:03:56 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... In-Reply-To: <199601181623.LAA12548@mail.internet.com> Message-Id: <Pine.LNX.3.91.960118084915.21806B-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 18 Jan 1996, Robert Raisch, The Internet Company wrote: > > Perhaps what is really needed is a reevaluation of the role of > the robots.txt file. If we take the stance, as I believe we > should, that the decision to be indexed belongs in the hands of > the owner of the data, not in the mechanical claws of wild > roving robots, the robots.txt file should become the a source of > permission not exclusion from indexing. And most importantly, > that the expectation should be one of privacy, not exposure. > > In other words, we should not index a web-site if there is no > robots.txt file to be retrieved that gives explicit permission > to do so. 
If you will review recent messages here you will discover that only about 5% of sites *have* a robots.txt file. This means that using the prescription of 'don't index unless there is a robots.txt file' would result in about one site in twenty being indexed *at best*. Because there is such a low probability of a site with a robots.txt file linking to *another* site with a robots.txt file, the reality would be orders of magnitude worse than that. A robot would have to be exceptionally lucky to find a few hundred sites that way. In other words - it completely destroys the usefulness of robots for resource discovery. It is, and must be, the responsibility of each site to provide its own document security. If you don't want your pages indexed - add access control or *don't put them on the web*. It is *trivial* on most servers to block directory trees from remote access. You could even specifically target the search engines for blocking. If you don't want people reading your material - don't leave it on the table in the reading room of the library (which is what you are doing when you place documents on the WWW with no access control). -- Benjamin Franz From owner-robots Thu Jan 18 10:15:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11613; Thu, 18 Jan 96 10:15:21 -0800 Subject: Re: Re: robots.txt , authors of robots , webmasters .... From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Thu, 18 Jan 1996 10:14:55 -0800 Message-Id: <1390162401-3861802@gco.gov.bc.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >If I leave my house unlocked, I have >given my permission for any and all to come in and read my >personal papers. Does this strike anyone else as somewhat >absurd? It is generally accepted in modern society that one should not enter into someone else's home without permission.
A web server, by its very nature, invites public access. >Do any others feel as I do that control over use of my >information is my responsibility and mine alone? That the >assumption should be not to index a site that has not explicitly >given permission to be indexed? (I don't expect much agreement >here, to be honest. But I thought I would ask.) It is unfortunate that the robots.txt standard supports exclusion and not permission. The HTTP standard should have had indexing permission built right into it such that all servers would support some type of call that tells robots where they are allowed to go. This could have made it necessary for permission and denial to be set up during server configuration. -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com web: www.aktiv.com phone: 604.383.4195 From owner-robots Thu Jan 18 11:21:20 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16047; Thu, 18 Jan 96 11:21:20 -0800 Message-Id: <9601181922.AA06192@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Thu, 18 Jan 96 13:22:37 -0600 To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... References: <199601181623.LAA12548@mail.internet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > > Begin forwarded message: > > > Previously, I wrongly accused Alta-Vista of indexing pages that > I had no interest in having indexed. It turned out that rather > than poking each TCP port for an HTTP server, Alta-Vista > actually did what every other 'bot does and follows all the > links it can find. I spent some tube-time sleuthing and > discovered that the pages were indeed referenced from other, > generally accessible pages. > > I now believe my indignation at the possibility of this > port-poking behavior was based on two separate considerations: > > 1. 
that the poking of ports would impose an unwelcome > burden on my servers, and First, I don't think that many web-robot writers would write a robot so that it probes all ports on a machine; rather, they would add an option to look at other ports at the runner's command, or when a different port is mentioned in a URL. > > Where did we get the idea that just because a thing is > accessible, that that gives us the moral right to access it, > perhaps against the interests of its owner? > > In another message, Reinier states his belief that if a user > makes the mistake of exposing his home directory to the web, > that we (as robot owners) can index anything we find there with > impunity; that the error is on the part of the web-master and > not on the part of the robot's designer. > > Let me see if I understand Reinier's point and can perhaps > state it another way: If I leave my house unlocked, I have > given my permission for any and all to come in and read my > personal papers. Does this strike anyone else as somewhat > absurd? > > In our enthusiasm to become the cartographers of this new > region of the information universe, do we not run the risk of > violating the privacy of the indigenous peoples we find there? > > I believe that this "-WE- are the most comprehensive index of > cyberspace" mentality is very dangerous and suggests a kind of > information vigilantism that I find personally distasteful. > > Perhaps what is really needed is a reevaluation of the role of > the robots.txt file. If we take the stance, as I believe we > should, that the decision to be indexed belongs in the hands of > the owner of the data, not in the mechanical claws of wild > roving robots, the robots.txt file should become a source of > permission, not exclusion from indexing. And most importantly, > that the expectation should be one of privacy, not exposure.
> > In other words, we should not index a web-site if there is no > robots.txt file to be retrieved that gives explicit permission > to do so. > > It should be noted that there is a fairly strong case to be > made that a robot threshing through a non-published web site is > an illegal activity under the abuse of computing facilities > statute in U.S. law. First off, I do think that we, as computer users on Unix systems, expect some level of protection for our documents: if a document was intended to be private, then it should be protected. But in the other case, if users set up a web directory, then they are saying that this information is PUBLIC and anyone who wishes to search it out can freely look at it. They should take the time and trouble to lock it up so that no one but the intended people can see it. True, if I leave my house unlocked I don't want anyone going into it, but that is the risk I take, isn't it? Also, I feel that no one is really out there yet doing an IP sweep of every address, trying to connect to port 80, to find every server they possibly can. Not only would that take forever, it would also put a big burden on their machine and users, and the time would be wasted to find what??? I also feel that it is up to the web robot writers to share some responsibility: to write robots that do not try to go outside the WWW published directories, and that perhaps do not even look into folders that might hold test-related documents. Why look into a folder called test unless there is a freely published HTML document that refers to it? From owner-robots Thu Jan 18 13:13:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23266; Thu, 18 Jan 96 13:13:25 -0800 Date: Thu, 18 Jan 1996 14:27:35 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601182027.AA24864@tssun5.> To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters ....
X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > It should be noted that there is a fairly strong case to be > made that a robot threshing through a non-published web site is > an illegal activity under the abuse of computing facilities > statute in U.S. law. I doubt it. In the first place, if you put up a web server on a well-known port, there isn't a DA in this country that will support a prosecution based on this, even if the site isn't "published". If you don't want the site accessed on that port, it's *your* responsibility to protect it. That's why we have login and password programs - if you don't have a modicum of protection on your site, the courts will take a very dim view of you trying to get someone nailed. Doing probes on other ports ("twisting the knobs", as it's called) to see "what's out there" is generally considered to be an unfriendly act, though. I think it's patently absurd to suggest that robots by default have no right to access your pages - if you don't want anyone looking at your pages, why put up a web site, if not just for your own self-aggrandizement? From owner-robots Thu Jan 18 13:44:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25465; Thu, 18 Jan 96 13:44:28 -0800 Date: Thu, 18 Jan 1996 22:37:34 +0100 (GMT+0100) From: Carlos Baquero <cbm@di.uminho.pt> To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... In-Reply-To: <199601181623.LAA12548@mail.internet.com> Message-Id: <Pine.LNX.3.91.960118222336.97C-100000@poe.di.uminho.pt> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1203 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 18 Jan 1996, Robert Raisch, The Internet Company wrote: > Do any others feel as I do that control over use of my > information is my responsibility and mine alone?
That the > assumption should be not to index a site that has not explicitly > given permission to be indexed? (I don't expect much agreement > here, to be honest. But I thought I would ask.) > I have some sympathy for your argument but I cannot agree. Suppose that the mass media needed explicit authorization to publish info or photographs of public activities. There wouldn't be too much info for the public, I guess. And that would be very bad ... I think that there is a legal notion of public and private places. It would be invasive to publish a photo of me inside my house, taken through the window, but once I get into the street I am aware that a photo of me can appear in a newspaper. I do think that unprotected places published on the web are public places by default. Carlos Baquero Distributed Systems Fax +351 (53) 612954 University of Minho, Portugal Voice +351 (53) 604475 cbm@di.uminho.pt http://shiva.di.uminho.pt/~cbm From owner-robots Thu Jan 18 15:54:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04434; Thu, 18 Jan 96 15:54:44 -0800 Message-Id: <30FEDED0.5526@corp.micrognosis.com> Date: Thu, 18 Jan 1996 18:59:12 -0500 From: Adam Jack <ajack@corp.micrognosis.com> Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... References: <Pine.LNX.3.91.960118084915.21806B-100000@ns.viet.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Benjamin Franz wrote: > > On Thu, 18 Jan 1996, Robert Raisch, The Internet Company wrote: > > > In other words, we should not index a web-site if there is no > > robots.txt file to be retrieved that gives explicit permission > > to do so. > > In other words - it completely destroys the usefulness of robots for > resource discovery.
>

I wonder whether it would, instead, be the single best mechanism for mass education. What if a number of the major robots proclaimed that this was to be the case as of XXXXX date? No longer would a site be accessed (for either update or review) -- if it didn't have a robots.txt.

I doubt these robots would lose out -- since they, undoubtedly, have not completed a full WWW index. They'd still have plenty of information -- it would probably be increased quality also.... People who want their data accessed would conform. Editing a robots.txt is no more difficult an activity than using submit-it.

Adam
--
+1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html
ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.???

From owner-robots Thu Jan 18 17:06:48 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07788; Thu, 18 Jan 96 17:06:48 -0800
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199601190107.CAA13141@wsinis10.win.tue.nl>
Subject: Re: robots.txt , authors of robots , webmasters ....
To: robots@webcrawler.com
Date: Fri, 19 Jan 1996 02:07:26 +0100 (MET)
In-Reply-To: <199601181623.LAA12548@mail.internet.com> from "Robert Raisch, The Internet Company" at Jan 18, 96 11:20:15 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Content-Length: 3313
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

You (Robert Raisch, The Internet Company) write:

>Where did we get the idea that just because a thing is
>accessible, that that gives us the moral right to access it,
>perhaps against the interests of its owner?

I stated my source: the Unix environment. You are misrepresenting what I said. There is no such moral right; the normal obligation to notify people of likely mistakes remains. If I leave my personal papers on the shelves in a public library, you have the moral right to open them.
You do not have the moral right to take advantage of what is obviously a mistake, once you discover it.

BTW, the attitude that a robot is just like a human user probably stems from the Unix environment, too. Under Unix, people routinely scan the whole filesystem in order to search for information. The Internet is very Unix-minded; rather typical is AFS, a single global world-wide Unix-like file system where everyone is supposed to hook up their own files.

>In another message, Reinier states his belief that if a user
>makes the mistake of exposing his home directory to the web,
>that we (as robot owners) can index anything we find there with
>impunity;

Yes; provided that the usual amount of care and politeness is observed, and under the moral obligation to correct mistakes once they are reported. You left that out in your summary.

>that the error is on the part of the web-master and
>not on the part of the robot's designer.

That's what I think.

>Let me see if I understand Reinier's point and can perhaps
>state it another way: If I leave my house unlocked, I have
>given my permission for any and all to come in and read my
>personal papers. Does this strike anyone else as somewhat
>absurd?

Yes. I regard a WWW site like a public exhibit (maybe in someone's backyard). Not as a person's private home.

>In our enthusiasm to become the cartographers of this new
>region of the information universe, do we not run the risk of
>violating the privacy of the indigenous peoples we find there?

Indigenous people gain access to the Internet, not the other way round. (Except through malicious attacks and sloppy Webmasters.)

>I believe that this "-WE- are the most comprehensive index of
>cyberspace" mentality is very dangerous and suggests a kind of
>information vigilantism that I find personally distasteful.

Many people hold your views, many hold mine. I don't think there's an easy solution. (Like you, I grew nervous when I found out how much Altavista knows about me.)
Wouldn't it be possible for robots to generate email to the Webmaster if no robots.txt was found, offering an example robots.txt file and a pointer to relevant documentation? The robot might still start its indexing process, provided that the Webmaster has a way to undo the results.

>It should be noted that there is a fairly strong case to be
>made that a robot threshing through a non-published web site is
>an illegal activity under the abuse of computing facilities
>statute in U.S. law.

Not so in the Netherlands, where entering a computer is only illegal if a lock (protection) was broken to gain access.

--
Reinier Post    reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>

From owner-robots Thu Jan 18 17:11:31 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08127; Thu, 18 Jan 96 17:11:31 -0800
X-Sender: narnett@hawaii.verity.com
Message-Id: <v02130506ad249da4e234@[192.187.143.12]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 18 Jan 1996 17:12:29 -0800
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: robots.txt , authors of robots , webmasters ....
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

>Where did we get the idea that just because a thing is
>accessible, that that gives us the moral right to access it,
>perhaps against the interests of its owner?

There's a difference between making something accessible with the intention of sharing it, as is the case in putting it on the Web without security, and allowing it to be accessible without the intention of sharing it. The moral argument is less clear when you dig a bit deeper into the publisher's intentions, which may not include support for automated access that would consume untoward resources.
>In our enthusiasm to become the cartographers of this new
>region of the information universe, do we not run the risk of
>violating the privacy of the indigenous peoples we find there?

The privacy argument is a difficult one to reconcile with the other watch-word of the Internet, freedom. We could talk at great length about that, but a robots list isn't the place, I think.

>In other words, we should not index a web-site if there is no
>robots.txt file to be retrieved that gives explicit permission
>to do so.

We thought about and discussed this approach at some length when we got close to releasing the 1.0 version of our spider. Our pre-release version had basically no restrictions on it except that it wouldn't follow links from one server to another; it was designed to index just one site at a time.

We even considered a scheme in which we'd look for robots.txt, and if it wasn't present, generate an e-mail to the webmaster, suggesting that one should be in place, with pointers to references. After X days, if we still didn't find a robots.txt, we'd consider silence to be consent to index anything the robot finds.

However, clearer heads prevailed, I think, and we left things as they were. The fundamental reason that we scrapped the idea was that it was just too complex. Too many things could go wrong, it added a lot of administrative overhead, etc.

Let's remember that the marketplace usually eventually solves these problems. Robot defenses can and will be built. In fact, we discovered early on that inet-d is a pretty good defense, since it limits the number of connections. Our first design of the robot was based on the typical limits of inet-d.

I suspect that robot designers' time would be better spent on reaching consensus on distributed systems that will make the whole wretched mess more efficient by combining pull and push methods of building indexes. There is going to be a marketplace for the meta-information that robots are generating.
The sooner that robot developers agree on standards along the lines of Harvest (but simpler, perhaps), the sooner that trade in meta-information can begin to mature... and the less likely that one big player will set the standards by sheer size. For example, what if Microsoft announced a robot standard tomorrow...?

Nick

From owner-robots Thu Jan 18 18:59:12 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11507; Thu, 18 Jan 96 18:59:12 -0800
Date: 18 Jan 96 16:04:51 EST
From: John Lammers <JLAMMERS@CSI.compuserve.com>
To: Robots List <ROBOTS@webcrawler.com>
Subject: re: privacy, courtesy, protection
Message-Id: <CSI_6188-3773@CompuServe.COM>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Robert Raisch asks:

>>Do any others feel as I do that control over use of my
>>information is my responsibility and mine alone?

Yes...so exercise that control and restrict access to the material that you don't want read. Control over the security of your site and your data is your responsibility and yours alone. I'm not saying that robots should TRY to invade your privacy, but your comparison of your web site and your house is a little off, I think.

>>If I leave my house unlocked, I have given my permission for any and
>>all to come in and read my personal papers. Does this strike anyone
>>else as somewhat absurd?

I think it's more analogous to leaving your office unlocked in an office building accessible to the public. You don't expect someone to sit down and read all your stuff, but then again, you don't necessarily expect that no one will notice what you have lying around your office.

Robots are accustomed to going where they can. The robots.txt file is as much or more for the robot's benefit as the site's. I tend to agree with an earlier contributor that many sites don't have a robots.txt, have no need for one, and can't be expected to have one. If all these sites are excluded from indexes....
Besides, if you're relying on robots faithfully abiding by whatever you have in robots.txt for your security scheme, you're only keeping out the robots (and human browsers) who don't want your private data. Anyone that wants it can get it, if you don't protect it. I'm not advocating that, I'm just saying that's the case.

Like it or not, putting info on the Web is publishing. Lack of advertising doesn't mean something hasn't been published. The failure of a chapter to appear in the table of contents doesn't mean it's not in the book. Again, I don't WANT your privacy invaded, but if you put your stuff in a public place and don't restrict access to it....

-- John Lammers

From owner-robots Thu Jan 18 19:02:37 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11559; Thu, 18 Jan 96 19:02:37 -0800
Message-Id: <199601190302.VAA06301@sam.neosoft.com>
X-Mailer: Post Road Mailer (Green Edition Ver 1.03a)
To: robots@webcrawler.com
From: Edward Stangler <mred@neosoft.com>
Date: Thu, 18 Jan 1996 20:59:44 CST
Subject: Re: Alta Vista searches WHAT?!?
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

** Reply to note from Erik Selberg <speed@cs.washington.edu> 01/17/96 10:47am PST

> Now, the /robots.txt won't do a bit of good here. Why? Because (a)
> robots don't have to support the robots.txt file, and (b) because the
> goal is to keep said data _private_ from everyone, not just
> robots. The problem is that users feel that hiding data is a good
> solution to security. Robots just publicly announce that security of
> that form is bogus. The issue people have with robots I think is
> bogus; what they should be addressing is that there needs to be a
> better form of protection on the Web, or at least a more intuitive
> method of setting access control lists than the funky .htaccess file
> stuff (or at least a better UI!).

What if you're using ROBOTS.TXT to exclude CGI's which don't appear in /cgi-bin?
What if the CGI's--or any data types unknown to the robot--are indistinguishable from directory pathnames or acceptable data types except if (a) it is excluded with something like ROBOTS.TXT or (b) the robot spends considerable time and resources to analyze it?

-Ed-
mred@neosoft.com
http://www.neosoft.com/~mred
1:106/1076 - 30:30/0 - 85:842/105
74620,2333

From owner-robots Fri Jan 19 02:33:11 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13465; Fri, 19 Jan 96 02:33:11 -0800
Date: Fri, 19 Jan 1996 10:32 UT
From: MGK@NEWTON.NPL.CO.UK (Martin Kiff)
Message-Id: <0099C9E1F38A92E0.992E@NEWTON.NPL.CO.UK>
To: robots@webcrawler.com
Subject: Server name in /robots.txt
X-Vms-To: SMTP%"robots@webcrawler.com"
X-Vms-Cc: MGK
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Looking up your site in the indexes is indeed educational... I have found the same pages appearing under multiple domain names - the canonical DNS name, various CNAME equivalents and the raw IP address *despite* having a

  <BASE HREF="http://xxx.xxx.xxx.xxx/xxx.html">

giving a 'preferred URL' in the header. Obviously indexers don't (or some indexers don't) recognise this and just build on incorrect, but currently working, links from other pages.

Would it be an option to include the preferred site name in the /robots.txt file? Couldn't enforce anything of course but would act as a reminder to the robots.

Regards,
Martin Kiff
mgk@newton.npl.co.uk

From owner-robots Fri Jan 19 05:12:10 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14219; Fri, 19 Jan 96 05:12:10 -0800
Date: Fri, 19 Jan 1996 08:11:55 -0500
From: AJAJR@aol.com
Message-Id: <960119081146_201004653@mail04.mail.aol.com>
To: robots@webcrawler.com
Subject: Polite Request #2 to be Removed form List
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Sir:

Thank you very much for including me on this list.
At this time I would like to politely request for the second time that my name now be removed. Thank you in advance for your kind consideration which is much appreciated.

From owner-robots Fri Jan 19 10:00:15 1996
Return-Path: <owner-robots>
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15378; Fri, 19 Jan 96 10:00:15 -0800
Message-Id: <m0tdL6g-0003DtC@giant.mindlink.net>
Date: Fri, 19 Jan 96 10:00 PST
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray <tbray@opentext.com>
Subject: Re: Server name in /robots.txt
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

>I have found
>the same pages appearing under multiple domain names - the canonical DNS
>name, various CNAME equivalents and the raw IP address *despite* having a
>
> <BASE HREF="http://xxx.xxx.xxx.xxx/xxx.html">
>
>giving a 'preferred URL' in the header. Obviously indexers don't
>(or some indexers don't) recognise this and just build on incorrect,
>but currently working, links from other pages.

Yes, well, reading a variety of specs carefully makes it clear that HTML does *not* at the current time provide a mechanism for specifying the "canonical name" of the current page. Having noticed this [several tens of thousands of times] during the construction of the Open Text Index, I tried rattling cages over in the HTML Working Group, and discovered a complete lack of consensus; some people feel that this is an appropriate use of <BASE>, as did I; others, including people who *really* know HTML, think

  <META HTTP-EQUIV="URI" CONTENT="http://xxx.xxx.xxx/xxx.html">

is more appropriate. I tried to get them to make up their minds, but couldn't generate sufficient interest. I don't care [nor would any other robot flogger, I think] which