From owner-robots Thu Oct 12 14:39:19 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20349; Thu, 12 Oct 95 14:39:19 -0700
Message-Id: <9510122139.AA20341@webcrawler.com>
To: robots
Subject: The robots mailing list at WebCrawler
From: Martijn Koster
Date: Thu, 12 Oct 1995 14:39:19 -0700
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Welcome to our new home...

This mailing list is now open for traffic. For details see: http://info.webcrawler.com/mailing-lists/robots/info.html

-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Thu Oct 12 16:09:58 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25602; Thu, 12 Oct 95 16:09:58 -0700
Message-Id:
Date: Thu, 12 Oct 95 16:09 PDT
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Something that would be handy
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

It might be nice to enhance robots.txt to include a hint as to how long the file ought to be cached by a Robot driver.

People who don't understand why probably ought to ignore this message. People who do might want to suggest (a) reasons why this is a silly idea, (b) a syntax/method for doing it, or (c) any implementation difficulties that could ensue.
My suggestion, expressed in the form of perl code that could be used to implement it:

    # Match lines such as "CacheHint: 7 d": a count followed by a unit (d, h, or m).
    if (/^\s*CacheHint:\s+(\d+)\s*([dhm])\s*$/) {
        $SecondsToCache = $1;
        if ($2 eq 'd') {
            $SecondsToCache *= 60*60*24;    # days
        } elsif ($2 eq 'h') {
            $SecondsToCache *= 60*60;       # hours
        } else {
            $SecondsToCache *= 60;          # minutes
        }
    }

Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)

From owner-robots Fri Oct 13 18:03:54 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29927; Fri, 13 Oct 95 18:03:54 -0700
Message-Id:
Date: Sat, 14 Oct 95 11:07:39 0000
From: James
Organization: Tourist Radio Pty Ltd
X-Mailer: Mozilla 1.1N (Macintosh; I; 68K)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: Site Announcement
X-Url: http://info.webcrawler.com/mailing-lists/robots/info.html
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

We wish to advise those with a robot seeking facility that we have two sites at

http://www.com.au/aaa and http://www.world.net/touristradio

We would be grateful if you would ask your robots to visit and announce our sites where possible. If this is bad netiquette, we apologise; there are huge backlogs with manual services.

James

From owner-robots Mon Oct 16 08:25:16 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00957; Mon, 16 Oct 95 08:25:16 -0700
Message-Id: <9510161525.AA00951@webcrawler.com>
To: robots
Subject: Re: Site Announcement
In-Reply-To: Your message of "Sat, 14 Oct 1995 11:07:39."
Date: Mon, 16 Oct 1995 08:25:16 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi,

You've asked me to add a link.
The best way to get a link added to the WebCrawler is to submit it at http://www.webcrawler.com/WebCrawler/SubmitURLS.html

Regards,

-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Mon Oct 16 18:36:43 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29862; Mon, 16 Oct 95 18:36:43 -0700
Message-Id:
Date: 16 Oct 1995 18:40:48 -0800
From: "Roger Dearnaley"
Subject: How do I let spiders in?
To: " "
X-Mailer: Mail*Link SMTP-QM 3.0.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Is there any way currently supported of providing spiders access to our (soon to be launched) username & password authenticated site? (Of course if a customer followed a link generated by this spider search, they will be asked for authentication, but when they can't provide it we will redirect them to a Registration page.)

The security on our site is not meant to be high: it is there primarily so that the forms CGI scripts have a unique user name to figure out who is doing what. Thus for our site we would probably be happy to just place a user name and password in robots.txt, or some similar low-security solution.

However, I can see that for other sites this might not be acceptable, so spider maintainers might want to consider adding fields for the username and password to use to their 'Please index this URL' submission forms. Then, ideally, it should be possible to submit these forms securely.
--Roger Dearnaley (roger_dearnaley@intouchgroup.com)

From owner-robots Wed Oct 18 08:32:24 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12938; Wed, 18 Oct 95 08:32:24 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 08:31:05 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Unfriendly robot at 205.177.10.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

One of my Web servers (http://asearch.mccmedia.com/) last night was attacked by a very unfriendly robot that requested many documents per second. This robot was originating from 205.177.10.2. I've tried to resolve that IP address, but I'm unable thus far. However, a traceroute shows that a cais.net router was the last hop before the domain in which the offending robot lives, so I sent an e-mail to the postmaster there, hoping that he or she will know whose host that is and will forward it (assuming that whoever owns this thing is a CAIS customer).

Has anyone else encountered this one? It doesn't identify itself at all.

Nick

From owner-robots Wed Oct 18 08:58:47 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14082; Wed, 18 Oct 95 08:58:47 -0700
Message-Id:
Date: Wed, 18 Oct 95 08:58 PDT
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Re: Unfriendly robot at 205.177.10.2
Cc: robots@webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 08:31 18/10/95 -0700, Nick Arnett wrote:
>One of my Web servers (http://asearch.mccmedia.com/) last night was attacked
>by a very unfriendly robot that requested many documents per second. This
>robot was originating from 205.177.10.2.

That resolves to 'murph.cais.net' - no idea who they are, never heard of 'em.
- Tim

From owner-robots Wed Oct 18 09:06:44 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14459; Wed, 18 Oct 95 09:06:44 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 09:05:20 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: CORRECTION -- Re: Unfriendly robot
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Whoops -- I pasted the wrong IP address into this message. The unfriendly robot was at 205.252.60.50.

Nick

From owner-robots Wed Oct 18 09:32:08 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15587; Wed, 18 Oct 95 09:32:08 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 09:30:32 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.177.10.2
Cc: tbray@opentext.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 8:58 AM 10/18/95, Tim Bray wrote:
>That resolves to 'murph.cais.net' - no idea who they are, never heard
>of 'em.

As you may have seen in my correction, that was a mistake on my part. I copied that from the traceroute -- it's the last router before the address space in which the misbehaving robot lives. It is Capitol Area Internet Service and under the assumption that the owner of the robot is one of their customers, I sent a message to the CAIS postmaster.

The correct address of the owner of the robot is 205.252.60.50, which won't resolve. Tight security, apparently. Ironically.
Nick

From owner-robots Wed Oct 18 09:43:26 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16066; Wed, 18 Oct 95 09:43:26 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510181643.RAA22167@wsinis11.win.tue.nl>
Subject: Re: Unfriendly robot at 205.177.10.2
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 17:42:55 +0100 (MET)
In-Reply-To: from "Nick Arnett" at Oct 18, 95 08:31:05 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 921
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

You (Nick Arnett) write:
>
>One of my Web servers (http://asearch.mccmedia.com/) last night was attacked
>by a very unfriendly robot that requested many documents per second. This
>robot was originating from 205.177.10.2. I've tried to resolve that IP
>address, but I'm unable thus far. However, a traceroute shows that a
>cais.net router was the last hop before the domain in which the offending
>robot lives, so I sent an e-mail to the postmaster there, hoping that he or
>she will know whose host that is and will forward it (assuming that whoever
>owns this thing is a CAIS customer).

Here you are:

    % host 205.177.10.2
    Name: murph.cais.net
    Address: 205.177.10.2
    Aliases:

>Has anyone else encountered this one? It doesn't identify itself at all.

No accesses here from 205.177.10.2 or cais.net.

>Nick

--
Reinier Post reinpost@win.tue.nl a.k.a. me

From owner-robots Wed Oct 18 11:32:15 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21768; Wed, 18 Oct 95 11:32:15 -0700
Message-Id: <9510181831.AA06646@ai.iit.nrc.ca>
Date: Wed, 18 Oct 95 14:31:39 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Dear spider developers.

My name is Alain Desilets.
I am a researcher in the Interactive Information Group of the National Research Council of Canada.

We are a small group (6 people) developing tools for interactive access to information. Our technological angle on this problem is AI based approaches, in particular Machine Learning and Agents. You can find more about our work at http://ai.iit.nrc.ca/II_public/.

In order to test our methods we need to acquire a large corpus of full HTML files from the Web. We plan to use a spider for that task.

We are aware of the controversy surrounding the creation of new spiders and therefore do not plan to develop one. That would not only be a duplication of effort but would also introduce a new, possibly buggy spider in Koster's already vast list of Web critters. Instead, we would like to use a publicly available, well behaved and proven spider.

Is there such a spider available for serious research purposes?

Or maybe the corpus we need already exists? Is there a CD-ROM or .zip file that would give us the whole of the web in full HTML?

Thanks for your help.
Alain Desilets

Institute for Information Technology
National Research Council of Canada
Building M-50
Montreal Road
Ottawa (Ont)
K1A 0R6

e-mail: alain@ai.iit.nrc.ca
Tel: (613) 990-2813
Fax: (613) 952-7151

From owner-robots Wed Oct 18 12:28:54 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23934; Wed, 18 Oct 95 12:28:54 -0700
Date: Wed, 18 Oct 1995 15:34:04 -0400
Message-Id: <199510181934.PAA12177@maple.sover.net>
X-Sender: Leigh.D.Dupee@neinfo.net
X-Mailer: Windows Eudora Version 1.4.4
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Leigh.D.Dupee@neinfo.net (Leigh DeForest Dupee)
Subject: Re: Unfriendly robot at 205.177.10.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Query:All records (ALL):2.10.177.205.in-addr.arpa
Authoritative Answer
2.10.177.205.in-addr.arpa PTR murph.cais.net
10.177.205.in-addr.arpa NS cais.com
cais.com A 199.0.216.4
Complete: 2.10.177.205.in-addr.arpa

Query:All records (ALL):murph.cais.net
Authoritative Answer
Name does not exist
Complete:NO_DATA murph.cais.net

Best I can come up with!

>One of my Web servers (http://asearch.mccmedia.com/) last night was attacked
>by a very unfriendly robot that requested many documents per second. This
>robot was originating from 205.177.10.2. I've tried to resolve that IP
>address, but I'm unable thus far. However, a traceroute shows that a
>cais.net router was the last hop before the domain in which the offending
>robot lives, so I sent an e-mail to the postmaster there, hoping that he or
>she will know whose host that is and will forward it (assuming that whoever
>owns this thing is a CAIS customer).
>
>Has anyone else encountered this one? It doesn't identify itself at all.
>
>Nick
>
>
>
---------------------------------------------------------------
Leigh DeForest Dupee
Help Me Learn, Inc., Administrator for NEInfo.Net
South Stream Road RR3 Box 4203, Bennington, VT 05201
(802) 447-2905
---------------------------------------------------------------

From owner-robots Wed Oct 18 12:49:50 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24697; Wed, 18 Oct 95 12:49:50 -0700
Message-Id: <9510181951.AA08164@pluto.sybgate.sybase.com>
X-Sender: dbakin@pluto
X-Mailer: Windows Eudora Version 2.1.1
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 12:49:14 -0700
To: robots@webcrawler.com
From: David Bakin
Subject: Is it a robot or a link-updater?
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

As the subject implies, I'm curious if there is a difference, in the impact on the serving site, between a true robot and someone running an automatic link updater? Can they even be told apart by the serving site?

-- Dave
--
Dave Bakin                     How much work would a work flow flow if a
#include 415-872-1543 x5018    work flow could flow work?

From owner-robots Wed Oct 18 13:16:38 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25902; Wed, 18 Oct 95 13:16:38 -0700
From: amonge@cs.ucsd.edu (Alvaro Monge)
Message-Id: <9510182013.AA10642@dino>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 13:13:55 -0700 (PDT)
In-Reply-To: <9510181831.AA06646@ai.iit.nrc.ca> from "Alain Desilets" at Oct 18, 95 02:31:39 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 1865
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

A colleague of mine and I are also doing research which is AI based and are in need of a large corpus for our use.
We would like to use anything that is already available which keeps the structure of the real WWW and does not take anything away. This is in order to create realistic experiments of our approaches.

Thanks in advance for any pointers,

--Alvaro
Computer science and engineering department
University of California, San Diego

>
> Dear spider developers.
>
>
> My name is Alain Desilets. I am a researcher in the Interactive
> Information Group of the National Research Council of Canada.
>
> We are a small group (6 people) developing tools for interactive
> access to information. Our technological angle on this problem is AI
> based approaches, in particular Machine Learning and Agents. You can
> find more about our work at http://ai.iit.nrc.ca/II_public/.
>
> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
> We are aware of the controversy surrounding the creation of new
> spiders and therefore do not plan to develop one. That
> would not only be a duplication of effort but would also introduce a
> new, possibly buggy spider in Koster's already vast list of Web
> critters. Instead, we would like to use a publicly available, well
> behaved and proven spider.
>
> Is there such a spider available for serious research purposes?
>
> Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
> file that would give us the whole of the web in full HTML?
>
>
> Thanks for your help.
>
> Alain Desilets
>
> Institute for Information Technology
> National Research Council of Canada
> Building M-50
> Montreal Road
> Ottawa (Ont)
> K1A 0R6
>
> e-mail: alain@ai.iit.nrc.ca
> Tel: (613) 990-2813
> Fax: (613) 952-7151
>
>

From owner-robots Wed Oct 18 14:13:35 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28102; Wed, 18 Oct 95 14:13:35 -0700
Message-Id:
Date: 18 Oct 1995 15:13:44 -0700
From: "Xiaodong Zhang"
Subject: Re: Looking for a spider
To: robots@webcrawler.com
X-Mailer: Mail*Link SMTP-QM 3.0.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Reply to: RE>>Looking for a spider

7/24/95 - Frontier Technologies licenses Lycos Internet Catalog software

MEQUON, WIS. (July 24) BUSINESS WIRE -July 24, 1995--Frontier Technologies Corp. today announced it has signed an agreement to license the Lycos(TM) Internet Catalog.

The Lycos Catalog has been incorporated into Frontier Technologies' new SuperHighway Access product, called SuperHighway Access CyberSearch(TM), which allows users to perform a Lycos search offline via CD-ROM, connecting to the Internet only once relevant Internet resources have been identified.

The Lycos technology was developed at Carnegie Mellon University, and was recently transferred to Lycos Inc., a newly-created subsidiary of CMG Information Services Inc.

Lycos is a software system which contains a robot that searches the World Wide Web and catalogs the documents it finds. It also includes an information search engine that helps users access information quickly and easily when they type in key words or topics.

The Lycos exploration robot locates new and changed documents and builds abstracts, which consist of title, headings, subheadings, 100 most significant words and the first 20 lines of the document. The catalog is continually updated by the Lycos exploration agent.
Frontier will receive regular updates from Lycos Inc., allowing it to produce monthly issues of SuperHighway Access CyberSearch.

"It's now widely understood that one of the primary barriers to users' productivity on the Internet is finding information," said Dennis Freeman, Frontier Technologies' marketing director. "That's why Internet search services like Lycos are among the Internet's most popular sites."

"Lycos Inc. is pleased to partner with Frontier as they contribute to our continued position as the most widely used and most comprehensive catalog product on the Web," said Bob Davis, CEO of Lycos Inc.

The product, now shipping, consists of a 608-megabyte subset of the Lycos catalog, indexing about half a million web pages, integrated with Frontier's multi-session, multi-protocol Internet browser software. The product is shipped on CD-ROM and is available through Frontier's reseller channel. The CD will be updated monthly (bi-monthly initially).

Frontier is offering the first issue of CyberSearch at $14.95. A charter subscription for 6 issues is priced at $6.75 per month. Subscribers should call 1-800/879-0075 (+1-414/571-0190 outside the U.S.) or access Frontier's web server, http://www.frontiertech.com for further information.

Lycos Inc., with offices in Wilmington, Mass. and Pittsburgh, Penn., is the newly formed corporation based upon technology developed at Carnegie Mellon University.

Frontier Technologies Corp., based in Mequon, is a leading supplier of TCP/IP and Internet-based products that make businesses more competitive in a global market.

CONTACT: Frontier Technologies Corp., Mequon
Nicole Rogers, 414/241-4555 x293
or
Lycos Inc.
Mike Olfe, 508/657-5050 x3124

------------------------------
Date: 10/18/95 3:01 PM
To: Zhang, Xiaodong
From: robots@webcrawler.com

A colleague of mine and I are also doing research which is AI based and are in need of a large corpus for our use.
We would like to use anything that is already available which keeps the structure of the real WWW and does not take anything away. This is in order to create realistic experiments of our approaches.

Thanks in advance for any pointers,

--Alvaro
Computer science and engineering department
University of California, San Diego

>
> Dear spider developers.
>
>
> My name is Alain Desilets. I am a researcher in the Interactive
> Information Group of the National Research Council of Canada.
>
> We are a small group (6 people) developing tools for interactive
> access to information. Our technological angle on this problem is AI
> based approaches, in particular Machine Learning and Agents. You can
> find more about our work at http://ai.iit.nrc.ca/II_public/.
>
> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
> We are aware of the controversy surrounding the creation of new
> spiders and therefore do not plan to develop one. That
> would not only be a duplication of effort but would also introduce a
> new, possibly buggy spider in Koster's already vast list of Web
> critters. Instead, we would like to use a publicly available, well
> behaved and proven spider.
>
> Is there such a spider available for serious research purposes?
>
> Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
> file that would give us the whole of the web in full HTML?
>
>
> Thanks for your help.
>
> Alain Desilets
>
> Institute for Information Technology
> National Research Council of Canada
> Building M-50
> Montreal Road
> Ottawa (Ont)
> K1A 0R6
>
> e-mail: alain@ai.iit.nrc.ca
> Tel: (613) 990-2813
> Fax: (613) 952-7151
>
>

------------------ RFC822 Header Follows ------------------
Received: by zazu.softshell.com with SMTP;18 Oct 1995 14:59:25 -0700
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25902; Wed, 18 Oct 95 13:16:38 -0700
From: amonge@cs.ucsd.edu (Alvaro Monge)
Message-Id: <9510182013.AA10642@dino>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 13:13:55 -0700 (PDT)
In-Reply-To: <9510181831.AA06646@ai.iit.nrc.ca> from "Alain Desilets" at Oct 18, 95 02:31:39 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 1865
Sender: owner-robots@webcrawler.com
Precedence: bulk
Reply-To: robots@webcrawler.com

From owner-robots Wed Oct 18 14:55:47 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29718; Wed, 18 Oct 95 14:55:47 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 14:54:22 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Unfriendly robot owner identified!
Cc: aleonard@well.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Got 'em.

Using whois, I found that the IP address belongs to Library Corp. in Virginia. They're the providers of the "NlightN" search service at:

http://www.nlightn.com/

Anybody know anything about their robot? I know that they've licensed the Lycos data.

Their background information says, "NlightN, a division of The Library Corporation, was formed to develop and market a Universal Index to the world's electronically stored information."

I guess their robot has to work fast to build a universal index...
;-)

Nick

From owner-robots Wed Oct 18 15:19:02 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01014; Wed, 18 Oct 95 15:19:02 -0700
Date: Wed, 18 Oct 1995 15:18:53 -0700 (PDT)
From: Andrew Leonard
Subject: Re: Unfriendly robot owner identified!
To: robots@webcrawler.com
In-Reply-To:
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi, all. I'm a reporter for Wired working on a story about bots, and I'm personally following up on this NlightN robot episode. I've put a call into their Reston VA headquarters asking to talk to someone about their search robot, and I'll keep the list posted on whatever I find out.

Andrew Leonard
Wired Magazine

> Got 'em.
>
> Using whois, I found that the IP address belongs to Library Corp. in
> Virginia. They're the providers of the "NlightN" search service at:
>
> http://www.nlightn.com/
>
> Anybody know anything about their robot? I know that they've licensed the
> Lycos data.
>
> Their background information says, "NlightN, a division of The Library
> Corporation, was formed to develop and market a Universal Index to the
> world's electronically stored information."
>
> I guess their robot has to work fast to build a universal index...
;-)
>
> Nick
>
>
>

From owner-robots Wed Oct 18 15:38:57 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02009; Wed, 18 Oct 95 15:38:57 -0700
From: amonge@cs.ucsd.edu (Alvaro Monge)
Message-Id: <9510182200.AA11857@dino>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 15:00:01 -0700 (PDT)
In-Reply-To: from "Xiaodong Zhang" at Oct 18, 95 03:13:44 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 555
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Unfortunately, I cannot use most robots that I know of because they DO NOT SAVE the entire document, or its hierarchical structure.

Lycos for example:

> The Lycos exploration robot locates new and changed documents and
> builds abstracts, which consist of title, headings, subheadings,
> 100 most significant words and the first 20 lines of the document.

For my research, this is not that useful. I need the entire document, as it appears at the source -- not as saved by some robot, because I want to follow the links within the document.

--Alvaro

From owner-robots Wed Oct 18 16:19:20 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04137; Wed, 18 Oct 95 16:19:20 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 16:18:02 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Really fast searching
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

It's a bit off-topic, but I can't resist sharing something that one of our sharp-eyed engineers found in a certain company's information page about their search service:

> By transparently linking hundreds of data sources, ******* has
> created the world's largest integrated index, already comprised of
> more than 100 gigabytes and growing daily.
A proprietary database
> engine provides immediate response time and actually increases speed
> as the size of the index grows.

We need this algorithm, our engineer says. It starts off with immediate responses, then gets faster. Wowza!

("A meeting on time travel will be held last week.")

Nick

From owner-robots Thu Oct 19 06:29:53 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16844; Thu, 19 Oct 95 06:29:53 -0700
Message-Id: <9510191329.AA12490@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 09:29:15 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Dear Alvaro,

Thanks for responding. I'll let you know if I find something.

I'm interested to know more about your work. Do you have a Web page on it?

Thanks

Alain

From owner-robots Thu Oct 19 06:32:09 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17037; Thu, 19 Oct 95 06:32:09 -0700
Message-Id: <9510191331.AA12583@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 09:31:31 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Dear Zhang,

Thank you for the info. Unfortunately, I am in the same position as Alvaro Monge. I need the original HTML files, as opposed to some condensed version of them produced by a robot.

Alain

From owner-robots Thu Oct 19 06:39:50 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17600; Thu, 19 Oct 95 06:39:50 -0700
Message-Id: <9510191339.AA12691@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 09:39:13 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Sorry!
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Sorry about the previous messages. I intended to send them directly to the people concerned but they somehow got sent to this list.
- Alain

From owner-robots Thu Oct 19 07:53:29 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22425; Thu, 19 Oct 95 07:53:29 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191453.PAA06141@wswiop11.win.tue.nl>
Subject: Re: Unfriendly robot at 205.177.10.2
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 15:53:11 +0100 (MET)
Cc: tbray@opentext.com
In-Reply-To: from "Nick Arnett" at Oct 18, 95 09:30:32 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 989
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

>The correct address of the owner of the robot is 205.252.60.50, which won't
>resolve. Tight security, apparently. Ironically.

Well, on our site (www.win.tue.nl), it's causing no problems at all:

    % grep '205\.252' /usr/www/logs/cern_access.log
    205.252.60.50 - - [13/Oct/1995:12:30:13 +0100] "GET / HTTP/1.0" 302 381
    205.252.60.50 - - [13/Oct/1995:20:58:55 +0100] "GET / HTTP/1.0" 302 381
    % wc /usr/www/logs/cern_access.log
    206422 2062250 22193056 /usr/www/logs/cern_access.log

That is, out of the last 206,422 requests, 2 were from this site.

Lycos wants to index as many documents on a site as it can find. This robot has only made two requests, and it didn't even retrieve our home page (/ is redirected to /win/, which is the actual home page). Perhaps it doesn't follow redirections.

>Nick

--
Reinier Post reinpost@win.tue.nl a.k.a.
me

From owner-robots Thu Oct 19 07:57:03 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22755; Thu, 19 Oct 95 07:57:03 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191456.PAA06159@wswiop11.win.tue.nl>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 15:56:40 +0100 (MET)
In-Reply-To: <9510182200.AA11857@dino> from "Alvaro Monge" at Oct 18, 95 03:00:01 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 1038
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

You (Alvaro Monge) write:

>Unfortunately, I cannot use most robots that I know of because they
>DO NOT SAVE the entire document, or its hierarchical structure.
>
>Lycos for example:
>
>> The Lycos exploration robot locates new and changed documents and
>> builds abstracts, which consist of title, headings, subheadings,
>> 100 most significant words and the first 20 lines of the document.
>
>For my research, this is not that useful. I need the entire document,
>as it appears at the source -- not as saved by some robot, because I
>want to follow the links within the document.

Lycos follows the links of documents; that's how robots work. The summaries are built for indexing purposes. You can't save the full text of all documents because of the disk space requirements (perhaps OpenText can?) and because of legal considerations.

>--Alvaro

--
Reinier Post reinpost@win.tue.nl a.k.a.
me From owner-robots Thu Oct 19 08:44:31 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26046; Thu, 19 Oct 95 08:44:31 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 19 Oct 1995 08:41:05 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Cc: tbray@opentext.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 7:53 AM 10/19/95, Reinier Post wrote: >>The correct address of the owner of the robot is 205.252.60.50, which won't >>resolve. Tight security, apparently. Ironically. > >Well, on our site (www.win.tue.nl), it's causing no problems at all In my e-mail to NlightN, I said that I assume it was unintentional. I can't imagine that anyone would purposely request documents at the rate they were hitting us. Of course, there's no way to know if that was the robot or a human-controlled browser hitting your site from the same host... Thanks! Nick From owner-robots Thu Oct 19 09:10:34 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28065; Thu, 19 Oct 95 09:10:34 -0700 Message-Id: <9510191609.AA14728@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 12:09:49 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In response to Alvaro's message, > > > >> The Lycos exploration robot locates new and changed documents and > >> builds abstracts, which consist of title, headings, subheadings, > >> 100 most significant words and the first 20 lines of the document. > > > >For my research, this is not that useful. I need the entire document, > >as it appears at the source -- not as saved by some robot, because I > >want to follow the links within the document.
Reinier Post writes: > > Lycos follows the links of documents; that's how robots work. > The summaries are built for indexing purposes. You can't save > the full text of all documents because of the disk space requirements > (perhaps OpenText can?) and because of legal considerations. > As for Alvaro, no robot-generated index of the whole web is sufficient for my purpose. My group is working on developing new tools that can process the web and "summarise" it in some novel way. For example: - New and hopefully better keyword extraction algorithms - Automatic generation of hierarchical indexes a la Yahoo - Merging of small indexes into bigger ones - etc... In order to test these new approaches, we need the full HTML, not an index of it. - Alain From owner-robots Thu Oct 19 09:18:30 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28618; Thu, 19 Oct 95 09:18:30 -0700 Date: Fri, 20 Oct 1995 02:18:16 +1000 From: Murray Bent Message-Id: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: re: Lycos unfriendly robot Content-Length: 439 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com According to Reinier Post: >Lycos wants to index as many documents on a site it can find. This >robot has only made two requests, and it didn't even retrieve our home page >(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't >follow redirections. >>Nick >-- >Reinier Post reinpost@win.tue.nl That may be fine if you have shares in Lycos or something. Do you?
mj From owner-robots Thu Oct 19 11:01:14 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05721; Thu, 19 Oct 95 11:01:14 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510191801.TAA19705@wsinis02.win.tue.nl> Subject: Re: Lycos unfriendly robot To: robots@webcrawler.com Date: Thu, 19 Oct 1995 19:01:00 +0100 (MET) In-Reply-To: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au> from "Murray Bent" at Oct 20, 95 02:18:16 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 918 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Murray Bent) write: > > >According to Reinier Post: >>Lycos wants to index as many documents on a site it can find. This >>robot has only made two requests, and it didn't even retrieve our home page >>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't >>follow redirections. > >>>Nick > >>-- >>Reinier Post reinpost@win.tue.nl > >That may be fine if you have shares in Lycos or something. Do you? I don't follow your logic. *What* is fine if I have shares in Lycos? The fact that this visit was made by something that doesn't follow redirections, and therefore is unlikely to be a Lycos robot? >mj For some reason you seem to bear a grudge against Lycos. If my posting did anything to tear open any old wounds, I apologise. -- Reinier Post reinpost@win.tue.nl a.k.a. 
me From owner-robots Sat Oct 21 07:17:11 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06960; Sat, 21 Oct 95 07:17:11 -0700 Date: Sat, 21 Oct 1995 07:17:03 -0700 (PDT) From: Andrew Leonard Subject: Re: Unfriendly robot at 205.252.60.50 To: robots@webcrawler.com Cc: robots@webcrawler.com, tbray@opentext.com In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I contacted NlightN, and their CEO said that their most junior hire was testing a new robot. They were apparently unaware of the robot exclusion protocol but plan to mend their ways. Andrew Leonard Wired Magazine On Thu, 19 Oct 1995, Nick Arnett wrote: > At 7:53 AM 10/19/95, Reinier Post wrote: > >>The correct address of the owner of the robot is 205.252.60.50, which won't > >>resolve. Tight security, apparently. Ironically. > > > >Well, on our site (www.win.tue.nl), it's causing no problems at all > > In my e-mail to NlightN, I said that I assume it was unintentional. I > can't imagine that anyone would purposely request documents at the rate > they were hitting us. Of course, there's no way to know if that was the > robot or a human-controlled browser hitting your site from the same host... > > Thanks! > > Nick > > > From owner-robots Sat Oct 21 11:21:18 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23944; Sat, 21 Oct 95 11:21:18 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 21 Oct 1995 10:35:40 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Cc: robots@webcrawler.com, tbray@opentext.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 7:17 AM 10/21/95, Andrew Leonard wrote: >I contacted NlightN, and their CEO said that their most junior hire was >testing a new robot. 
They were apparently unaware of the robot exclusion >protocol but plan to mend their ways. I haven't heard from them, but our server/spider product manager received a telephone apology. I can't resist pointing out the irony of a search services company that apparently failed to find some critical information about robots on the Internet. On the other hand, we've probably done equally silly things. I hope they'll add a user-agent field, at least. Nick From owner-robots Sat Oct 21 17:47:17 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20668; Sat, 21 Oct 95 17:47:17 -0700 Message-Id: From: kimba@snog.it.com.au (Kim Davies) Subject: Re: Unfriendly robot at 205.252.60.50 To: robots@webcrawler.com Date: Sun, 22 Oct 1995 08:46:39 +0800 (WST) In-Reply-To: from "Nick Arnett" at Oct 21, 95 10:35:40 am X-Mailer: ELM [version 2.4 PL24 PGP2] Content-Type: text Content-Length: 554 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, > >I contacted NlightN, and their CEO said that their most junior hire was > >testing a new robot. They were apparently unaware of the robot exclusion > >protocol but plan to mend their ways. > > I haven't heard from them, but our server/spider product manager received a > telephone apology. Has someone invited them to join this list? If they discussed what they were doing it might be better for all concerned.. 
catchya, -- Kim Davies | "Belief is the death of intelligence" -Snog kimba@it.com.au | http://www.it.com.au/~kimba/ From owner-robots Sun Oct 22 13:14:28 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01215; Sun, 22 Oct 95 13:14:28 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 22 Oct 1995 13:13:12 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 5:46 PM 10/21/95, Kim Davies wrote: >Hi, > >> >I contacted NlightN, and their CEO said that their most junior hire was >> >testing a new robot. They were apparently unaware of the robot exclusion >> >protocol but plan to mend their ways. >> >> I haven't heard from them, but our server/spider product manager received a >> telephone apology. > >Has someone invited them to join this list? If they discussed what they >were doing it might be better for all concerned.. I directed them to the robots pages on www.webcrawler.com, which should lead them to this list. What am I thinking -- the server that they were hammering with their robot includes recent messages from this list (at http://asearch.mccmedia.com/robots/). I suppose that means they might have looked... Nick From owner-robots Mon Oct 23 07:50:14 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03859; Mon, 23 Oct 95 07:50:14 -0700 Date: Mon, 23 Oct 95 10:50:03 EDT From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler) Message-Id: <9510231450.AA10394@pixel.cps.msu.edu> To: robots@webcrawler.com Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Alain Desilets writes: > In order to test our methods we need to acquire a large corpus of > full HTML files from the Web. We plan to use a spider for that task. 
> and Alvaro Monge writes: > A colleague of mine and I are also doing research which is AI based > and are in need of a large corpus for our use. We would like to use > anything that is already available which keeps the structure of the > real WWW and does not take anything away. This is in order to create > realistic experiments of our approaches. > We are also doing research on AI based approaches to processing the web, and toward the goal of having a test bed of the web, we have a text-only copy of a subset of the web (currently about 650 meg) which we have been calling "the proving grounds". It is not possible to get a complete snapshot of the web at any given time, but without images and audio, we can at least have a large, known, subset. It's also to our collective advantage to all be working from the same subset. It is our intention to make the proving grounds available to the public, hopefully within the next two weeks. We used a spider which was a modified htmlgobble, which takes a URL and follows all the links, copying all the documents it finds except image, audio, and video files. The urls inside the documents have been modified so that everything points to the local copy, enabling a spider (or human browser) to traverse the database locally. Before we go public, I have a few questions: (1) We currently don't copy audio, video, image files and instead create a file by the same name with a single character identifying it as video, image, or audio. Would an empty file suffice? Is there another identification scheme that would be more useful? (2) We currently copy postscript, but are considering treating them as we do image files. They take a LOT of space, and are of no utility for the kind of analysis that we want to do. Would it be more useful to keep the postscript, or treat it as we do images (which would then allow us to use the space for a larger web subset)? I appreciate any feedback and I'll announce to the list when it's ready for public use. 
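The link rewriting and placeholder scheme described above could be sketched roughly as follows. This is not the modified htmlgobble itself: the one-character markers, the function names, and the layout that uses the host name as the top-level directory of the mirror are all assumptions made for illustration, and only double-quoted href/src attributes are handled.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Hypothetical one-character markers; the post does not say which characters are used.
MEDIA_MARKERS = {
    ".gif": "I", ".jpg": "I", ".jpeg": "I", ".xbm": "I",
    ".au": "A", ".wav": "A", ".aiff": "A",
    ".mpg": "V", ".mpeg": "V", ".mov": "V",
}

# Simplification: only double-quoted href/src attribute values are recognised.
LINK_ATTR = re.compile(r'(href|src)\s*=\s*"([^"]*)"', re.IGNORECASE)


def local_path(url):
    """Map an absolute http URL to a path relative to the mirror root."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # assumed name for directory pages
    return parts.netloc.lower() + path


def media_marker(url):
    """Return the placeholder character for a media URL, or None for other files."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return MEDIA_MARKERS.get(extension)


def rewrite_links(html, page_url):
    """Point every http link in html at the local copy, relative to the page itself."""
    here = posixpath.dirname(local_path(page_url))

    def fix(match):
        attr, target = match.group(1), match.group(2)
        absolute = urljoin(page_url, target)
        parts = urlparse(absolute)
        if parts.scheme != "http":
            return match.group(0)  # leave mailto:, ftp:, etc. untouched
        relative = posixpath.relpath(local_path(absolute), here)
        if parts.fragment:
            relative += "#" + parts.fragment
        return '%s="%s"' % (attr, relative)

    return LINK_ATTR.sub(fix, html)


PAGE = "http://www.example.org/docs/index.html"
SAMPLE_HTML = (
    '<a href="/img/logo.gif">logo</a> '
    '<A HREF="http://other.example.org/">other</A> '
    '<a href="mailto:a@b">mail</a> '
    '<a href="page2.html#top">next</a>'
)
REWRITTEN = rewrite_links(SAMPLE_HTML, PAGE)
```

With this layout a crawler writing the mirror would store the result of `media_marker(url)` instead of the file body whenever it returns a character; an empty file would work equally well if the type can be recovered from the file name alone.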
Marilyn Wulfekuhler Intelligent Systems Lab, Michigan State University From owner-robots Mon Oct 23 15:27:34 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05597; Mon, 23 Oct 95 15:27:34 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 23 Oct 1995 15:26:16 -0700 To: Andrew Daviel , robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Proposed URLs that robots should search Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 1:51 PM 10/23/95, Andrew Daviel wrote: >With my other hat on (admin@vancouver-webpages.com), I'm >trying to build a database of URLs and other information for businesses >on the Net. I can't quite contain the urge to say, "Isn't everyone?" >Some database registration robots (I believe) search submitted URLs for >keywords, doing some natural language processing to discard modifiers and >prepositions. However, the trend to graphics-dominated homepages makes >such efforts of dubious utility. I wouldn't be so quick to jump to that conclusion. I have seen few, if any, business sites that don't offer text-only versions of their key pages. Also, I'm utterly certain that a good relevancy-ranking engine will do a better job at assigning categories than will an uncontrolled set of people, especially when those people are out to maximize hits, rather than to maximize relevancy. Having said all of that, I'd like to agree that we need some additional information for robots. Could we start simply by having a standard way to set forth the name of the site? An icon for the site would be really nice. It's very frustrating to build a search results list and have no definitive way of describing the site on which the documents reside! Next, I'd like to have the means to name groups of documents (Press releases, product descriptions, as examples of typical business groupings). 
We guess at these from directory names, but that's very haphazard. The secondary naming problem is more difficult because there are many-to-many relationships involved. >In the spirit of /robots.txt, I would like to propose a set of files that >robots would be encouraged to visit: > >/robots.htm - an HTML list of links that robots are encouraged to traverse What does "encouraged" mean? How is it different from (not (robots.txt))? Why HTML? >/descript.txt - a text file describing what the site (or directory) is > all about Agreed. >/keywords.txt - a text file with comma-delimited keywords relevant to the > site (or directory) Disagree greatly. This opens a giant can of worms. Keywords are never enough, often confusing and difficult to maintain. >/linecard.txt - for commercial sites, a text file with comma-delimited > line items (brands) manufactured or stocked This will drown in details. >/sitedata.txt - a text file similar to the InterNIC submissions forms, > with publicly-available site data such as > >Organization: organisation name >Type: commercial/non/profit/educational etc. >Admin: email of administration >Webmaster: email of Web administration >Postal: postal address >ZIP: ZIP/postcode >Country: >Position: Lat/Long >etc. Yes to some of this at least. But there's an assumption that there's a one-to-one relationship between the server and these field data. Often, there isn't and no scheme that fails to deal with that is going to succeed. I'm ready to adapt one of my prototype robots to parse this data for our engine, so here's one hand up for "Yes, I'll implement it." I'm just doing research, but my research does fall in front of our engineers at some point. By the way, today, Verity announced that NetManage and Purveyor have signed up to use our search engine. They join Netscape, Quarterdeck and a few others. Nick P.S. I've replied to the new list server address at webcrawler.com, rather than the Nexor address.
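A minimal sketch of how a robot might read the proposed /sitedata.txt, assuming one "Field: value" pair per line as in the quoted example. The function name, the case-insensitive field names, the '#' comment convention (borrowed from robots.txt), and the sample values are assumptions, not part of the proposal. Each field maps to a list of values so that a site with several postal addresses or administrators is not forced into the one-to-one shape objected to above.

```python
def parse_sitedata(text):
    """Parse 'Field: value' lines into a dict mapping field name to a list of values."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        name, separator, value = line.partition(":")
        if not separator:
            continue  # not a field line
        values = record.setdefault(name.strip().lower(), [])
        value = value.strip()
        if value:
            values.append(value)  # repeated fields accumulate
    return record


# Hypothetical sample file; none of these values come from the thread.
SAMPLE = """\
# site data for a made-up organisation
Organization: Example Org
Type: commercial
Admin: admin@example.org
Webmaster: www@example.org
Postal: 1 Main Street
Postal: 22 Harbour Road
Country:
"""

RECORD = parse_sitedata(SAMPLE)
```

Splitting on the first colon only means values that themselves contain colons, such as URLs, survive intact.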
From owner-robots Mon Oct 23 16:31:22 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10352; Mon, 23 Oct 95 16:31:22 -0700 Message-Id: <9510232331.AA10338@webcrawler.com> To: robots Cc: Andrew Daviel Subject: Re: Proposed URLs that robots should search In-Reply-To: Your message of "Mon, 23 Oct 1995 15:26:16 PDT." Date: Mon, 23 Oct 1995 16:31:17 -0700 From: Martijn Koster Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message , Nick Arnett writes: > Also, I'm utterly certain that a good relevancy-ranking engine will do a > better job at assigning categories than will an uncontrolled set of people, > especially when those people are out to maximize hits, rather than to > maximize relevancy. Yeah, isn't that fun... :-/ Maybe we should have a shared spammer blacklist :-) > [want the name of the site] > [groups of documents] > >In the spirit of /robots.txt, I would like to propose a set of files that > >robots would be encouraged to visit: > > > >/robots.htm - an HTML list of links that robots are encouraged to traverse > > What does "encouraged" mean? How is it different from (not (robots.txt))? Because a robot may not want to traverse the whole site, and would prefer to get "sensible" pages. > Why HTML? Yeah, bad news. > [/keywords] > Disagree greatly. This opens a giant can of worms. Keywords are never > enough, often confusing and difficult to maintain. Hmmm... yes, but it's not necessarily worse than straight HTML text, which is the alternative. > >/linecard.txt - for commercial sites, a text file with comma-delimited > > line items (brands) manufactured or stocked > > This will drown in details. Yup. > >/sitedata.txt - a text file similar to the InterNIC submissions forms, > > with publicly-available site data such as > > > Yes to some of this at least. But there's an assumption that there's a > one-to-one relationship between the server and these field data.
Often, > there isn't and no scheme that fails to deal with that is going to succeed. Well, I hate to repeat myself, but ALIWEB's /site.idx will give you all of the above (OK, not the icon, but you could add that). It doesn't seem to scale too well to large sites that want to describe every single page or resource on their server, but that's not the goal here... Note also that nobody is stopping you from pulling just the URLs from a site.idx and doing your standard robot summarising on that... -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Mon Oct 23 17:06:25 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12787; Mon, 23 Oct 95 17:06:25 -0700 Message-Id: From: kimba@snog.it.com.au (Kim Davies) Subject: Re: Proposed URLs that robots should search To: andrew@andrew.triumf.ca (Andrew Daviel) Date: Tue, 24 Oct 1995 08:03:58 +0800 (WST) Cc: robots@webcrawler.com In-Reply-To: from "Andrew Daviel" at Oct 23, 95 09:51:17 pm X-Mailer: ELM [version 2.4 PL24 PGP2] Content-Type: text Content-Length: 1378 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, > /robots.htm - an HTML list of links that robots are encouraged to traverse A plain text file would be much better suited, similar to the existing robots.txt - reading in plain text and adding it to the stack of URLs to be processed is sure to be more effective than sending the HTML to the robot reasoning engine to parse. > [snip] > > Organization: organisation name > Type: commercial/non/profit/educational etc. > Admin: email of administration > Webmaster: email of Web administration > Postal: postal address > ZIP: ZIP/postcode > Country: > Position: Lat/Long > etc. How are you going to get a system administrator to implement all these files? How many of the system administrators you know even know about robots.txt?
Assuming you want a large chunk of sites to adopt these details, I'd propose it be implemented into the HTTP protocol somehow. An "ADMIN" request, for example, could request the above details from the site just as an "/admin", for example, on IRC, grabs the admin details of a server from the lines in the configuration. If a space was made in a server's configuration or makefile for these details, web administrators would be far more likely to implement them. catchya, -- Kim Davies | "Belief is the death of intelligence" -Snog kimba@it.com.au | http://www.it.com.au/~kimba/ From owner-robots Tue Oct 24 02:48:24 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14601; Tue, 24 Oct 95 02:48:24 -0700 Date: Tue, 24 Oct 1995 02:48:19 -0700 (PDT) From: Andrew Daviel To: robots@webcrawler.com Subject: Re: Proposed URLs that robots should search In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Let's see if I can reply to everyone without getting in a tangle ... :)= >>I'm trying to build a database of URLs for business... >I can't quite contain the urge to say, "Isn't everyone?" Know any good ones? Nothing jumped out at me from CUSI, or Submit-It, etc. >I have seen few .. business sites that don't offer text-only versions I seem to keep seeing sites that say "Works best with Netscape 1.2 - get it!" >Could we start .. standard way to set forth the name of the site? Having it in the <TITLE> of the document root is quite common, but you get "BloggCo Home Page", "Welcome to BloggCo", and sometimes "Welcome to B L O G G C O". I've tried looking for non-dictionary words with some success. >>/linecard.txt - for commercial sites, a text file with comma-delimited >> line items (brands) manufactured or stocked >This will drown in details. >Yup. This was a suggestion from a professional buyer.
Sure, collecting these for the whole world would get out of control, but with a small enough scope it might be manageable. The buyers look up brand names in a huge 12-volume book to find distributors or manufacturers. Finding who stocks Motorcraft in Tipperary can't produce that many records. >Well, I hate to repeat myself, but ALIWEB's /site.idx will give you .. Didn't know about it. Looks like what I was thinking of. I see it has keywords ( >..Disagree greatly. This opens a giant can ... ) > >/robots.htm - an HTML list of links > Why HTML? A simplistic idea. I figured that if existing robots are written to traverse HTML, then giving them an HTML file to start from would be fairly easy. Re. site.idx, is this a fairly open-ended list of fields? I had in mind some fields relevant to larger businesses, like Sales-Email, Info-Email, Tech-Email, Sales-FaxBack, etc. etc. for voice, fax, email where some places may have separate hotlines for hardware, software, licenses, etc. How to handle this for big concerns that have one website and hundreds of regional offices is another problem. I find the Lat/Long format in IAFA a bit strange; I use the "standard" navigational format from navigation books, GPS and Loran, etc. eg. 49D14.7N 123D13.6W, except that as there isn't a degree symbol in ASCII I've used "D", which makes it similar to the NMEA0182 format. The current NMEA0183 standard for navigation equipment would use something like: $LCGLL,4001.74,N,07409.43,W for 40 degrees 1.74 minutes North, 74 degrees 9.43 minutes West. Anyway, it's just bits and easy enough to convert. >How are you going to get a system administrator to implement all these >files? Well, one might assume that a good many HTML authors and Webmasters read comp.infosystems.author.html, or whatever it's called. Or one could just send them all mail ... 50,000 returned mail messages wouldn't make too much of a dent in my disk ... :)= >I'd propose it be implemented into the HTTP protocol .. 
I'd think it might take a while for everyone to update their servers - say, at least 2 years... Andrew Daviel email: advax@triumf.ca From owner-robots Wed Oct 25 15:49:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05021; Wed, 25 Oct 95 15:49:09 -0700 Date: Thu, 26 Oct 1995 08:48:57 +1000 From: Murray Bent <murrayb@icis.qut.edu.au> Message-Id: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: lycos patents Content-Length: 134 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com To add insult to injury, Lycos are patenting spiders and robots. Anyone care to comment on what Lycos Inc. is up to these days? mj From owner-robots Wed Oct 25 15:56:03 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05454; Wed, 25 Oct 95 15:56:03 -0700 Message-Id: <9510252256.AA05447@webcrawler.com> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Scott Stephenson <scott> Date: Wed, 25 Oct 95 15:55:18 -0700 To: robots Subject: Re: lycos patents References: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, What, Lycos is trying to patent spiders and robots. Got any more information on this?!? How can this be possible, as it is certainly not technology that they developed. ss From owner-robots Wed Oct 25 15:58:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05629; Wed, 25 Oct 95 15:58:36 -0700 Message-Id: <9510252258.AA05583@webcrawler.com> To: robots Cc: Murray Bent <murrayb@icis.qut.edu.au> Subject: Re: lycos patents In-Reply-To: Your message of "Thu, 26 Oct 1995 08:48:57 +1000." 
<199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Date: Wed, 25 Oct 1995 15:58:13 -0700 From: Martijn Koster <mak@beach.webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>, Murray Bent writes: > To add insult to injury, Lycos are patenting spiders and robots. Can you elaborate? Where did you hear this, where can we find out more? -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Oct 25 16:09:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06328; Wed, 25 Oct 95 16:09:34 -0700 Date: Wed, 25 Oct 1995 19:08:47 -0400 (EDT) From: Matthew Gray <mkgray@Netgen.COM> X-Sender: mkgray@bokonon To: robots@webcrawler.com Subject: Re: lycos patents In-Reply-To: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Message-Id: <Pine.SOL.3.91.951025190537.13893C-100000@bokonon> Organization: net.Genesis Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > To add insult to injury, Lycos are patenting spiders and robots. I assume he is referring to the comment: > We have a patent pending on our spider technology, which makes it > possible for us to both keep up with the exponential growth of the > Internet, and still find the most popular sites. which appears in the FAQ at http://lycos-tmp1.psc.edu/reference/faq.html I hope when they refer to "our spider technology", they are referring to something genuinely unique. If not there are a great many cases for prior art, notably my Wanderer which (while no longer the best) was the first one around in spring of '93. I agree that some comment or clarification from Lycos would be good. Matthew Gray --------------------------------- voice: (617) 577-9800 net.Genesis fax: (617) 577-9850 56 Rogers St.
mkgray@netgen.com Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray From owner-robots Wed Oct 25 16:19:27 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06783; Wed, 25 Oct 95 16:19:27 -0700 Date: Thu, 26 Oct 1995 09:16:39 +1000 From: Murray Bent <murrayb@icis.qut.edu.au> Message-Id: <199510252316.JAA10010@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: re: Lycos patents Content-Length: 570 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com reference: > From: "Alison O'Balle" <a.oballe@mail.utexas.edu> (Alison O'Balle) > Subject: Catalog of the Internet > To: Multiple recipients of list <web4lib@library.berkeley.edu> [...] > A representative from Lycos made a presentation on campus Thursday morning > in which he said a number of interesting things about the future of the > internet, cataloging,and other topics. [Interesting facts and figures deleted] > They are patenting web spiders and robots. This was glossed over, but the > lycos guy said the patent process was going well for them so far. From owner-robots Wed Oct 25 16:22:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06913; Wed, 25 Oct 95 16:22:14 -0700 Message-Id: <9510252322.AA06904@webcrawler.com> To: fuzzy@cmu.edu Cc: robots Subject: Patents? From: Martijn Koster <m.koster@webcrawler.com> Date: Wed, 25 Oct 1995 16:22:18 -0700 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi Fuzzy, I can't see you in the list of subscribers to the robots list, (to which this is cc'ed) so maybe you missed a message regarding patents there. In http://www.lycos.com/reference/faq.html one reads: > We have a patent pending on our spider technology, which makes it > possible for us to both keep up with the exponential growth of the > Internet, and still find the most popular sites. Can you give any further details, either on the technical nature or the patent application? 
-- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Oct 25 16:45:53 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08081; Wed, 25 Oct 95 16:45:53 -0700 Message-Id: <n1397482621.64443@mail.intouchgroup.com> Date: 25 Oct 1995 16:47:13 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: Re: lycos patents To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > What, Lycos is trying to patent spiders and robots. Got any more > information on this?!? How can this be possible, as it is certainly > not technology that they developed. If this is so, then some interested parties should let the Patent Office (or whatever the corresponding US body is called) know this. Particularly given what a terrible job they have been doing judging software and algorithm patents recently, it's a bad idea to just assume that the Patent Office will get it right. --Roger Dearnaley From owner-robots Wed Oct 25 19:19:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14858; Wed, 25 Oct 95 19:19:25 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510260219.DAA02026@wsinis02.win.tue.nl> Subject: Re: lycos patents To: robots@webcrawler.com Date: Thu, 26 Oct 1995 03:19:08 +0100 (MET) In-Reply-To: <Pine.SOL.3.91.951025190537.13893C-100000@bokonon> from "Matthew Gray" at Oct 25, 95 07:08:47 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1094 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Lycos's patents: >I hope when they refer to "our spider technology", they are referring to >something genuinely unique. 
If not there are a great many cases for >prior art, notably my Wanderer which (while no longer the best) was the >first one around in spring of '93. Mmm ... I think I first saw JumpStation in January '93. http://js.stir.ac.uk/jsbin/js Simple spiders existed before; I used one in November '92 to fill a proxy cache and fake a live Internet connection for a demo, but it wasn't used for indexing purposes. >I agree that some comment or clarification from Lycos would be good. The author has been seen to post to this list, before it moved. I should think the summaries may be patentable; in fact this thought first occurred to me when I saw his short talk on Lycos at WWW'95 in Darmstadt, in the workshop on Web indexing. But I haven't heard from Lycos since. There may be some unusual tricks in running the spiders as well. If XOR-ing bitmaps can be patented, why can't a bunch of details in spider technology? -- Reinier Post reinpost@win.tue.nl From owner-robots Tue Oct 31 06:58:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03475; Tue, 31 Oct 95 06:58:02 -0800 From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella) Message-Id: <9510311459.AA13828@anti.tesi.dsi.unimi.it> Subject: meta tag implementation To: robots@webcrawler.com (Mailing list su robot) Date: Tue, 31 Oct 1995 15:59:26 +0100 (MET) Organization: Dept. of Computer Science, Milan, Italy. X-Mailer: ELM [version 2.4 PL23alpha2] Content-Type: text Content-Length: 772 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi to everybody! I would like to know what do you think about a possible implementation of the meta http-equiv tag on an http-server. 
I' working in this direction to build a complete system to catalogue www docs but I think that the bigger problems is that there isn't any http-server that handle this meta tag (maybe only the WN server) Thanx Davide +--------------------------------------------------+ |Davide Musella | |e-Mail musella@dsi.unimi.it Dept. of | |Phone number +39.(0)2.4390821 Computer Science | |Address: Via Montevideo, 25 University of | | 20144 Milano ITALY Milan, Italy | |http://www.dsi.unimi.it/Users/Tesi/musella | +--------------------------------------------------+ From owner-robots Thu Nov 2 09:30:07 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15340; Thu, 2 Nov 95 09:30:07 -0800 Message-Id: <YkaDzD200YUxASM0sm@andrew.cmu.edu> Date: Thu, 2 Nov 1995 12:28:47 -0500 (EST) From: "Jeffrey C. Chen" <jc7k+@andrew.cmu.edu> To: robots@webcrawler.com (Mailing list su robot) Subject: Re: meta tag implementation Cc: In-Reply-To: <9510311459.AA13828@anti.tesi.dsi.unimi.it> References: <9510311459.AA13828@anti.tesi.dsi.unimi.it> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi everybody! I am a MS student at CMU. I am working on a software tool for collecting full system traces on the Alpha. The tool will also gather statistics by using the on-chip hardware event counters. I am interested in using a web server and a client as my test workload. It would be interesting to identify performance bottlenecks in a web server as it runs over a period of time servicing requests. Does anyone have a simple robot that I can use to exercise a web server? 
Thanks, Jeff From owner-robots Thu Nov 2 10:40:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20410; Thu, 2 Nov 95 10:40:02 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199511021835.UAA17200@krisse.www.fi> Subject: Simple load robot To: robots@webcrawler.com Date: Thu, 2 Nov 1995 20:35:19 +0200 (EET) In-Reply-To: <YkaDzD200YUxASM0sm@andrew.cmu.edu> from "Jeffrey C. Chen" at Nov 2, 95 12:28:47 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 412 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Does anyone have a simple robot that I can use to exercise a web > server? Would this do the job, maybe run multiple times in parallel? (Please replace the url's..)
#!/bin/sh
while true
do
  for i in \
    http://www.fi/ \
    http://www.fi/search.html \
    http://www.fi/index/ \
    http://www.fi/~jaakko/ \
    http://www.fi/sss/ \
    http://www.fi/www/ \
    http://www.fi/links.html
  do
    lynx -source "$i" > /dev/null
  done
done
From owner-robots Mon Nov 6 22:44:28 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15194; Mon, 6 Nov 95 22:44:28 -0800 Date: Tue, 7 Nov 1995 00:43:47 -0600 Message-Id: <9511070643.AA120822@nic.smsu.edu> X-Sender: kdf274s@nic.smsu.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Keith Fischer <kfischer@mail.win.org> Subject: Preliminary robot.faq (Please Send Questions or Comments) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Archive-name: robot.faq Posting-Frequency: variable Last-modified: Nov. 6, 1995 This article is a description and primer for World Wide Web robots and spiders. The following topics are addressed: 1) DEFINING ROBOTS AND SPIDERS 1.1) What is a ROBOT? 1.2) What is a SPIDER? 1.3) What is a search engine? 1.4) How many ROBOTS are there?
1.5) What can be achieved by using ROBOTS? 1.6) What harm can a ROBOT do? 2) THE THEORY BEHIND A ROBOT 2.1) Who can write one? 2.2) How is one written? 2.3) What is the Proposed Standard for Robot Exclusion? 2.4) What are the potential problems? 2.5) How do I use proper Etiquette? 3) THE REALITY OF THE WEB 3.1) Can I visit the entire web? 1) DEFINING ROBOTS AND SPIDERS 1.1) What is a ROBOT? A Robot is a program that traverses the World Wide Web, gathering some sort of information from each site it visits. This journey is accomplished by visiting a web page and then recursively visiting all or some of its linked pages. 1.2) What is a SPIDER? Spiders are synonymous with Robots, as are Wanderers. These names, however, have some misleading implications. For instance, many people think that a spider or wanderer leaves the home site to work its magic, when in reality it never leaves. Rather, the Spider just acts as a sophisticated web browser, automatically retrieving documents and/or images until it is told to stop. I prefer the term Robot and will continue using it throughout this document. 1.3) What is a search engine? A search engine is not a robot. However, some search engines rely heavily on robots. A search engine is nothing more than a glorified index. It searches the index, which resides on the host's computer, and returns the result. A common misconception is that a search engine like Lycos or Yahoo actively searches the web upon request. This is not true; all activity by the robot is done ahead of time. 1.4) How many ROBOTS are there? There are about 30 in existence. Martijn Koster maintains a list at: http://info.webcrawler.com/mak/projects/robots/active.html 1.5) What can be achieved by using ROBOTS? The possibilities are endless. Once you visit a page, you have free run of the HTML. You can retrieve files or the HTML itself. Most robots retrieve pieces of the HTML document. These pieces are then used to build an index, which is later used by a search engine.
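The index-building step in 1.5, and the point in 1.3 that a query runs against a stored index rather than the live web, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_index` and `search` are hypothetical names, the pages are passed in as already-retrieved HTML strings, and the tag stripping is deliberately crude. It is not how any particular search engine works.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it.

    `pages` is a dict of URL -> HTML text that a robot has already retrieved.
    """
    index = defaultdict(set)
    for url, html in pages.items():
        # Crude tag stripping, good enough for an illustration only.
        text = re.sub(r"<[^>]*>", " ", html)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query.

    Only the prebuilt index is consulted; nothing is fetched at query time.
    """
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A robot would call `build_index` ahead of time on whatever it has gathered; the search front end then only ever calls `search`.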
1.6) What harm can a ROBOT do? The robot can do no harm per se, but it can anger a lot of people. If your robot acts irresponsibly, it can fall into a black hole (a link that dynamically makes new links) or, worse, get stuck in a loop. Both of these are certain to wreak havoc on a server. The goal in web traversal is to never be on one server for too long. The solution to the problem of bad HTML, or rather of your robot's handling of bad HTML, is to stay online. Simply put, never leave your robot unattended. 2) THE THEORY BEHIND A ROBOT 2.1) Who can write one? Anyone can write a robot provided that they have web access. But, a word to the wise, tell your system administrators because they WILL feel the system drain and they WILL hear many complaints concerning your activities. But just because the possibility exists doesn't mean you should take on this task half-cocked. Before even thinking about coding a robot: do your research, have an intended goal, and read the following: The Proposed Standard for Robot Exclusion located at: http://info.webcrawler.com/mak/projects/robots/norobots.html The Guidelines for Robot Writers located at: http://info.webcrawler.com/mak/projects/robots/guidelines.html Ethical Web Agents located at: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html 2.2) How is one written? A Robot is nothing more than an executable program. It can be in the form of a script or a binary file. It makes a connection to a web server and requests a document be sent, much the same way a web browser works. The difference is in the automation provided by the robot. 2.3) What is the Proposed Standard for Robot Exclusion? Martijn Koster explains the reason for a robot exclusion standard with the following: "In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g.
certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)." The form the robot exclusion standard takes is given in more detail at: The Proposed Standard for Robot Exclusion located at: http://info.webcrawler.com/mak/projects/robots/norobots.html 2.4) What are the potential problems? The potential problems can't be listed. The list would be far too big and unpredictable. The very nature of the World Wide Web is diversity, and this very diversity makes robot writing both important and increasingly difficult. There is no one right HTML. Pages can be written in many ways and in many formats. My suggestion is to get the spec sheet for HTML and practice, practice, practice, making your robot robust. 2.5) How do I use proper Etiquette? Etiquette is a very touchy subject. Many people stand in opposition to your newly written robot. They don't like the idea that their server will be overrun with seemingly pointless requests. The solution is simple: first, give them the results. Or rather, put up the results of your searches for public consumption. This is the concept of giving back to the community that provided for you. Not to mention, if a person can use your results, the robot's requests may seem to have more merit. Another form of etiquette is slow requests. You've heard the term rapid fire. This means quick requests (a request every second or so); basically put, this brings a server to its figurative knees. The solution is to limit your requests to any given server to one every minute (some say one every five minutes).
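The "one request per server per minute" rule in 2.5 can be sketched as a small per-host scheduler. This is a minimal illustration, not part of any existing robot: the class name `PoliteScheduler`, its methods, and the 60-second default are assumptions made here, and the clock is injectable only so the logic can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Track the last request time per host and report how long to wait."""

    def __init__(self, min_delay=60.0, clock=time.monotonic):
        self.min_delay = min_delay  # seconds between hits on one host (assumed default)
        self.clock = clock          # injectable time source, for testing without sleeping
        self.last_hit = {}          # host -> time of the most recent request

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched; 0.0 means go ahead."""
        host = urlsplit(url).netloc.lower()
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to the URL's host was just made."""
        self.last_hit[urlsplit(url).netloc.lower()] = self.clock()
```

A robot would check `wait_time(url)` before each fetch, work on URLs for other hosts while the answer is non-zero, and call `record(url)` right after issuing the request. Setting `min_delay=300.0` gives the stricter five-minute variant.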
More information about etiquette is located at: The Guidelines for Robot Writers located at: http://info.webcrawler.com/mak/projects/robots/guidelines.html Ethical Web Agents located at: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html 3) THE REALITY OF THE WEB 3.1) Can I visit the entire web? No. So don't try. Gauge your goals in reasonable amounts. ______________________________________________________________ I disclaim everything. The contents of this article might be totally inaccurate, inappropriate, misguided, or otherwise perverse - except for my name (you can probably trust me on that). Copyright (c) 1995 by Keith D. Fischer, all rights reserved. This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may not be distributed for financial gain. This FAQ may not be included in commercial collections or compilations without express permission from the author. ____________________________________________________________ Keith D. Fischer - kfischer@mail.win.org or kfischer@science.smsu.edu Keith D. Fischer kfischer@mail.win.org kdf274s@nic.smsu.edu "Misery loves company" By Anonymous "Today is a good day to die." By Crazy Horse "To be or not to be ..." Hamlet -- William Shakespeare From owner-robots Tue Nov 7 02:37:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27042; Tue, 7 Nov 95 02:37:01 -0800 Date: Tue, 7 Nov 95 10:32:55 GMT Message-Id: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <S.Nisbet@DOC.MMU.AC.UK> Subject: Re: meta tag implementation Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 12:28 PM 11/2/95 -0500, you wrote: >Hi everybody!
> >I am a MS student at CMU. I am working on a software tool for >collecting full system traces on the Alpha. The tool will also gather >statistics by using the on-chip hardware event counters. I am >interested in using a web server and a client as my test workload. It >would be interesting to identify performance bottlenecks in a web server >as it runs over a period of time servicing requests. Does anyone have a >simple robot that I can use to exercise a web server? > >Thanks, >Jeff > > Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor robots that have nothing to do with PERL, could you let me know. I tride asking the same question you asked, but got no replies. From owner-robots Tue Nov 7 04:05:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02122; Tue, 7 Nov 95 04:05:00 -0800 From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella) Message-Id: <9511071205.AA13152@anti.tesi.dsi.unimi.it> Subject: Re: meta tag implementation To: robots@webcrawler.com Date: Tue, 7 Nov 1995 13:05:21 +0100 (MET) In-Reply-To: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> from "Steve Nisbet" at Nov 7, 95 10:32:55 am Organization: Dept. of Computer Science, Milan, Italy. X-Mailer: ELM [version 2.4 PL23alpha2] Content-Type: text Content-Length: 251 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor > robots that have nothing to do with PERL, could you let me know. I tride > asking the same question you asked, but got no replies. No replies until now....sigh!!! 
Davide From owner-robots Tue Nov 7 06:17:49 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08070; Tue, 7 Nov 95 06:17:49 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199511071417.OAA06656@wsinis11.win.tue.nl> Subject: Re: meta tag implementation To: robots@webcrawler.com Date: Tue, 7 Nov 1995 15:17:26 +0100 (MET) In-Reply-To: <9511071205.AA13152@anti.tesi.dsi.unimi.it> from "Davide Musella" at Nov 7, 95 01:05:21 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Davide Musella) write: > >> Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor >> robots that have nothing to do with PERL, could you let me know. I tride >> asking the same question you asked, but got no replies. > >No replies until now....sigh!!! You might use Lynx (2.4.FM); it has a -traverse switch now. Experimental, and I don't think it supports the RES (Robot Exclusion Standard) yet. We have a simple robot written in C, but it doesn't follow the RES either. What's your resaon to stay away from Perl? >Davide -- Reinier Post reinpost@win.tue.nl From owner-robots Tue Nov 7 06:54:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09686; Tue, 7 Nov 95 06:54:36 -0800 Date: Tue, 7 Nov 95 14:41:39 GMT Message-Id: <9511071441.AA11827@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <S.Nisbet@DOC.MMU.AC.UK> Subject: Re: meta tag implementation Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi Davide, thanks very much for the info. I stay away from Perl here because it was badly set up and I have to reinstall it. 
SO its more of a grudge :) Other than that I think its a good thing. I will do as you suggest. All the best in you endeavours. From owner-robots Tue Nov 7 07:11:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10944; Tue, 7 Nov 95 07:11:12 -0800 Message-Id: <m0tCpg2-0003LMC@giant.mindlink.net> Date: Tue, 7 Nov 95 07:11 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some >sort of information from each site it visits. This journey is accomplished >by visiting a web page and then recursively visiting all or some of it's >linked pages. True but misleading; there are much better strategies for covering the web than this kind of direct recursion. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Wed Nov 8 01:30:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21486; Wed, 8 Nov 95 01:30:52 -0800 Date: Wed, 8 Nov 1995 03:30:45 -0600 Message-Id: <9511080930.AA35454@nic.smsu.edu> X-Sender: kdf274s@nic.smsu.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Keith Fischer <kfischer@mail.win.org> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>1.1) What is a ROBOT? >> >> A Robot is a program that traverses the World Wide Web, gathering some >>sort of information from each site it visits. 
This journey is accomplished >>by visiting a web page and then recursively visiting all or some of it's >>linked pages. > >True but misleading; there are much better strategies for covering >the web than this kind of direct recursion. > > >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) 1.1) What is a ROBOT? A Robot is a program that traverses the World Wide Web, gathering some sort of information from each site it visits. This journey is accomplished by visiting a web page and then visiting some or all of its linked pages. The method one follows whether it's recursive or some sort of fuzzy logic determines the effectivness of the search. How is the above. If you like, this will be the new 1.1. Also, could you please elaborate on better stratagies. (I'm assuming you are talking about the fuzzy logic that Yahoo and Lycos use.) Keith kfischer@mail.win.org kdf274s@nic.smsu.edu Keith D. Fischer kfischer@mail.win.org kdf274s@nic.smsu.edu "Misery loves company" By Anonymous "Today is a good day to die." By Crazy Horse "To be or not to be ..." Hamlet -- William Shakespeare From owner-robots Wed Nov 8 05:45:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03365; Wed, 8 Nov 95 05:45:00 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199511081344.NAA17571@wsinis02.win.tue.nl> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) To: robots@webcrawler.com Date: Wed, 8 Nov 1995 14:44:43 +0100 (MET) In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Keith Fischer) write: >1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some >sort of information from each site it visits. 
This journey is accomplished >by visiting a web page and then visiting some or all of its linked pages. >The method one follows whether it's recursive or some sort of fuzzy logic >determines the effectivness of the search. We have a robot which does 'fuzzy' searching, for which your description is appropriate. But in general, the document collection process (= robot) and the search process executed in response to a user query (on the resulting collection) are completely separate. Besides, searching the contents of document collections is not the only purpose of robots; robots can be used to check the validity of hyperlinks, for example. Your description is accurate, as applied to the robot process itself, but it may be confusing. A minor quibble: robots must use some heuristics in determining which links to follow. All robots are 'recursive', and most of them cut off the process in a more or less arbitrary way, which could be called 'fuzzy'. There is no or/or decision here. -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From owner-robots Wed Nov 8 08:38:48 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13242; Wed, 8 Nov 95 08:38:48 -0800 Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) From: YUWONO BUDI <yuwono@uxmail.ust.hk> To: robots@webcrawler.com Date: Thu, 9 Nov 1995 00:37:33 +0800 (HKT) In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 1603 Message-Id: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > >>1.1) What is a ROBOT? > >> > >> A Robot is a program that traverses the World Wide Web, gathering some > >>sort of information from each site it visits.
This journey is accomplished > >>by visiting a web page and then recursively visiting all or some of it's > >>linked pages. > > > >True but misleading; there are much better strategies for covering > >the web than this kind of direct recursion. > > > > > >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) > > > 1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some > sort of information from each site it visits. This journey is accomplished > by visiting a web page and then visiting some or all of its linked pages. > The method one follows whether it's recursive or some sort of fuzzy logic > determines the effectivness of the search. I am not sure understand what the original comment is getting at. But it seems to me that the word "recursive" is somewhat overloaded. To those with CS background, a "recursive" visit implies a "depth first" tree traversal. Most robot implementations that I'm aware of use "breadth first" traversals. Among the reasons is that you would want to be able to limit the depth your robot digs into. Whether depth limitation is more useful than breadth limitation is another issue, IMHO. One thing for sure, stopping the robot after it reaches a certain depth is much simpler than deciding which links to follow/ignore. I don't know what would be the more general term in place of "recursively," "sequentially" perhaps? -Budi. 
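The breadth-first traversal with a depth cut-off described above can be sketched with an explicit queue instead of recursive calls. This is a minimal illustration, not any particular robot's code: `crawl_breadth_first` and its `get_links` callback are hypothetical names, and link extraction is left to the caller so the traversal order can be seen on its own.

```python
from collections import deque


def crawl_breadth_first(start, get_links, max_depth):
    """Visit pages level by level, never going deeper than max_depth links from start.

    `get_links(url)` is a caller-supplied function returning the URLs linked from a page.
    Returns the URLs in the order they were visited.
    """
    seen = {start}                 # every URL ever queued, so nothing is visited twice
    order = []
    queue = deque([(start, 0)])    # (url, depth) pairs, oldest first
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue               # depth cut-off: do not expand this page
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Replacing `queue.popleft()` with `queue.pop()` turns the same loop into a depth-first walk, which is roughly what a directly recursive robot does; the depth limit works either way, but only the breadth-first order finishes each level before starting the next.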
From owner-robots Thu Nov 9 08:53:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12795; Thu, 9 Nov 95 08:53:37 -0800 Resent-Message-Id: <9511091653.AA12783@webcrawler.com> Resent-From: mak@beach.webcrawler.com Resent-To: robots Resent-Date: Thu, 9 Nov 1995 16:53:32 Date: Wed, 8 Nov 95 10:08:51 -0800 From: <owner-robots> Message-Id: <9511081808.AA19321@webcrawler.com> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >From tbray@opentext.com Wed Nov 8 10:08:46 1995 Return-Path: <tbray@opentext.com> Received: from giant.mindlink.net by webcrawler.com (NX5.67f2/NX3.0M) id AA19311; Wed, 8 Nov 95 10:08:46 -0800 Received: from Default by giant.mindlink.net with smtp (Smail3.1.28.1 #5) id m0tDEv9-000343C; Wed, 8 Nov 95 10:08 PST Message-Id: <m0tDEv9-000343C@giant.mindlink.net> Date: Wed, 8 Nov 95 10:08 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Cc: robots@webcrawler.com We're wasting too much time on this. All I meant to say was that the original language strongly suggested that robots use the following algorithm:
  sub RetrievePage(url)
    text = HttpGet(url)
    foreach sub_url in text
      RetrievePage(sub_url)
Whereas lots of robots don't. Obviously it is recursive in that you do pull urls out of pages and eventually follow them, but it doesn't feel recursive. The 'fuzzy' stuff is a complete red herring - except for the special case of 'fuzzy logic' (not what's being done here) the word 'fuzzy' in the information retrieval context is a marketing term without semantic content.
Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Fri Nov 17 09:12:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21835; Fri, 17 Nov 95 09:12:34 -0800 Date: Fri, 17 Nov 1995 09:24:00 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Bad robot: WebHopper bounch! Owner: peter@cartes.hut.fi In-Reply-To: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk> Message-Id: <Pine.LNX.3.91.951117085518.25864A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was checking my stats and this showed up with 1838 hits on the 9th of November. It tried to completely explore an infinite virtual space in one run, with an average time between hits of 4.3 seconds. Its' parser has to be broken because it was exploring a space defined by a ?cookie=number (used for shopping basket session tracking), but failing to preserve the '=' (generating 'cookienumber' instead of 'cookie=number') between calls and causing a new cookie to be assigned to every request. It went into an infinite loop over the same five base pages as it tried to do a depth first search of the site - for a little over two hours. Argh. Anyone else hit by this rather broken robot? 
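One generic guard against this kind of runaway, independent of whatever actually went wrong inside WebHopper, is a hard cap on the number of requests a robot will send to any single host in one run. The sketch below is an assumption-laden illustration: the class name `HostBudget` and the limit of 500 pages are invented for the example and are not taken from any robot discussed on this list.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostBudget:
    """Refuse further requests to a host once a fixed page budget is spent."""

    def __init__(self, max_pages_per_host=500):
        self.max_pages = max_pages_per_host  # assumed limit, tune per crawl
        self.fetched = Counter()             # host -> pages requested so far

    def allow(self, url):
        """Return True and count the request if the host still has budget left."""
        host = urlsplit(url).netloc.lower()
        if self.fetched[host] >= self.max_pages:
            return False
        self.fetched[host] += 1
        return True
```

With such a cap in place, a space that mints a fresh `?cookie=number` URL on every request costs the server at most `max_pages_per_host` hits instead of two hours of them. It does not fix a broken URL parser; it only bounds the damage.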
-- Benjamin Franz, Webmaster, Net Images From owner-robots Thu Nov 23 12:44:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03420; Thu, 23 Nov 95 12:44:36 -0800 Date: Thu, 23 Nov 1995 12:42:51 -0800 (PST) From: Andrew Daviel <andrew@andrew.triumf.ca> To: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk Cc: Daniel Terrer <Daniel.Terrer@sophia.inria.fr> Subject: wwwbot.pl problem Message-Id: <Pine.LNX.3.91.951123111508.16547A-100000@andrew.triumf.ca> Mime-Version: 1.0 Content-Type: text/PLAIN; charset="US-ASCII" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com (I send a request to libwww-perl-request just before my last message to the list, so I might not be on yet. Please Cc any replies to me.) I was having trouble with wwwbot from the libwww-perl-0.40 library. I continued to work on the problem after posting to the perl list. It seems that botcache is not well enough defined, so that a site with User-Agent: * Disallow / would kill subsequent GETs to a site that was previously in the cache. I have made a patch which adds the address to the cache, and fixes a couple of other odd cases, such as where the address is not fully defined working within a domain, and there are host names such as ypsun, ypsun2 etc. which would become confused with the path count. 
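The idea of keying cached exclusion rules by the full host address, so that one site's "Disallow: /" cannot leak onto another, can be sketched as follows. This is not the wwwbot.pl patch and makes no claim about how libwww-perl stores its cache; it is a Python illustration using the standard `urllib.robotparser` module, and the class name `RobotsCache`, the agent name `ExampleBot`, and the `.example` host names are all assumptions.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Cache parsed robots.txt rules, one entry per exact host name."""

    def __init__(self, agent="ExampleBot"):
        self.agent = agent   # hypothetical robot name matched against User-agent lines
        self.rules = {}      # full host name -> RobotFileParser

    def store(self, host, robots_txt):
        """Parse the text of a host's /robots.txt and cache it under that host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host.lower()] = parser

    def allowed(self, url):
        """True/False from the cached rules, or None if this exact host is not cached."""
        host = urlsplit(url).netloc.lower()
        parser = self.rules.get(host)
        if parser is None:
            return None      # caller should fetch this host's /robots.txt first
        return parser.can_fetch(self.agent, url)
```

Because the lookup uses the whole host name as the dictionary key, hosts with similar names (the ypsun and ypsun2 case) get separate entries, and a host that has never been seen returns `None` rather than inheriting another site's rules.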
See ftp://andrew.triumf.ca/pub/wwwbot.patch Andrew Daviel email: advax@triumf.ca TRIUMF voice: 604-222-7376 4004 Wesbrook Mall fax: 604-222-7307 Vancouver BC http://andrew.triumf.ca/~andrew Canada V6T 2A3 49D14.7N 123D13.6W From owner-robots Thu Nov 23 23:45:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07952; Thu, 23 Nov 95 23:45:39 -0800 Date: Fri, 24 Nov 95 16:45:28 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511240745.AA03918@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: yet another robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com For all its worth, we have implemented a robot in order to (surprise surprise) gather web resources to build a (distributed) search database. The robot is called Yobot, and http://rodem.slab.ntt.jp:8080/home/robot-e.html tells you who to complain to if Yobot misbehaves. Thanks, PF From owner-robots Fri Nov 24 13:51:35 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17245; Fri, 24 Nov 95 13:51:35 -0800 Date: Sat, 25 Nov 1995 07:53:43 +1000 (EST) From: David Eagles <eaglesd@planets.com.au> To: robots@webcrawler.com Subject: yet another robot, volume 2 In-Reply-To: <9511240745.AA03918@cactus.slab.ntt.jp> Message-Id: <Pine.LNX.3.91.951125075027.1078A-100000@earth.planets.com.au> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com We, too, have developed a robot to provide Web resource search facilities to Australia and the South Pacific. The crawler engine will only follow links to designated domains, and the search engine allows individual selection of the search domain for queries. Named after a famous Australian spider, the FunnelWeb, the service is available at http://funnelweb.net.au Enjoy. 
Regards, David Eagles From owner-robots Fri Nov 24 15:20:08 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22501; Fri, 24 Nov 95 15:20:08 -0800 Date: Sat, 25 Nov 95 09:29:44 +1100 (EST) Message-Id: <v01530506acdc92c54ddb@[192.190.215.44]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: yet another robot, volume 2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >We, too, have developed a robot to provide Web resource search facilities >to Australia and the South Pacific. The crawler engine will only follow >links to designated domains, and the search engine allows individual >selection of the search domain for queries. > >Named after a famous Australian spider, the FunnelWeb, the service is >available at http://funnelweb.net.au > >Enjoy. >David we tried it out the other day.We lodged AAA and Tourist Radio(2 Sites) Great VISION Keith Ashton >Regards, >David Eagles AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa From owner-robots Fri Nov 24 16:13:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25777; Fri, 24 Nov 95 16:13:17 -0800 Date: Sat, 25 Nov 95 11:13:05 +1100 (EST) Message-Id: <v01530507acdcab771ad7@[192.190.215.44]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: yet another robot, volume 2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>We, too, have developed a robot to provide Web resource search facilities >>to Australia and the South Pacific. The crawler engine will only follow >>links to designated domains, and the search engine allows individual >>selection of the search domain for queries. 
>> >>Named after a famous Australian spider, the FunnelWeb, the service is >>available at http://funnelweb.net.au >> > >>Enjoy. >>David we tried it out the other day.We lodged AAA and Tourist Radio(2 Sites) >Great VISION > >Keith Ashton > > > ____________________________________________________________________________ ___________ David, We just got an Email back from you but there was no content Keith ____________________________________________________________________________ ____________ > > > >>Regards, >>David Eagles > >AAA Australia Announce Archive / Tourist Radio >Home of the Australian Cool Site of the Day ! >http://www.com.au/aaa AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa From owner-robots Sat Nov 25 06:21:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05034; Sat, 25 Nov 95 06:21:14 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199511251419.XAA02550@ktmp.kaist.ac.kr> Subject: Q: Cooperation of robots To: robots@webcrawler.com (Robot Mailing list) Date: Sat, 25 Nov 1995 23:19:12 +0900 (KST) X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 378 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I am newbie to this mailing-list. If I do some mistake, plz reply to me. I have one question : Is there some effort for robots to do gathering informations in cooperative work style? That is, Sharing informations gathered by the other kind of robots with some communication between robots like the that of intelligent agents in Intelligent Agent area. 
- Byung-Gyu Chang From owner-robots Sat Nov 25 10:19:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15907; Sat, 25 Nov 95 10:19:10 -0800 Date: Sat, 25 Nov 1995 13:19:03 -0500 Message-Id: <199511251819.NAA27702@moe.infi.net> X-Sender: magi@infi.net (Unverified) X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Michael Goldberg <magi@infi.net> Subject: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A am developing sites for numerous large associations. I want to provide a service to the members by which they can choose from selected topics..say mortgage interest rates..and a robot goes out and searches selected sites and provides either by e-mail a formated "newsletter" or return a "newsletter" in html. Any suggestions? <<< Media Access Group>>> Local Access to electronic marketing Triad member- Network Hampton Roads 2101 Parks Ave. Suite 606 Virginia Beach, VA 23451 804-422-4481 From owner-robots Sat Nov 25 15:22:58 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01362; Sat, 25 Nov 95 15:22:58 -0800 Date: Sun, 26 Nov 1995 09:24:39 +1000 (EST) From: David Eagles <eaglesd@planets.com.au> To: robots@webcrawler.com Subject: Re: Q: Cooperation of robots In-Reply-To: <199511251419.XAA02550@ktmp.kaist.ac.kr> Message-Id: <Pine.LNX.3.91.951126091817.2816A-100000@earth.planets.com.au> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Sat, 25 Nov 1995, Byung-Gyu Chang wrote: > Hi, I am newbie to this mailing-list. If I do > some mistake, plz reply to me. > > > I have one question : > > Is there some effort for robots to do gathering > informations in cooperative work style? 
> That is, Sharing informations gathered by the other kind of > robots with some communication between robots like > the that of intelligent agents in Intelligent Agent area. > > - Byung-Gyu Chang > I'm not sure if there is any official cooperation going on, but I'm currently enhancing my web crawler (http://funnelweb.net.au) to include support for this type of operation. Basically, here's what I'm planning: The current web crawler, based in Australia, limits its searching and collection to countries in the South Pacific. I'm planning to enhance this so that any URLs found (during the crawling process) for non-South Pacific countries will be forwarded to the web crawler responsible for that domain (as determined by a simple config file - maybe an automated registration process in the future). Similarly, the search engine will allow ANY individual country(s) to be searched (as is the case now for only South Pacific countries), and will fork the request off to the appropriate engine. Is this the type of info you were after? Regards, David Eagles From owner-robots Sun Nov 26 09:10:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17874; Sun, 26 Nov 95 09:10:54 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130502acde4cefcae6@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 26 Nov 1995 09:10:32 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Q: Cooperation of robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 11:19 PM 11/25/95, Byung-Gyu Chang wrote: >Is there some effort for robots to do gathering >informations in cooperative work style? >That is, Sharing informations gathered by the other kind of >robots with some communication between robots like >the that of intelligent agents in Intelligent Agent area.
There are various efforts, but the most significant one is probably the Harvest project at the University of Colorado. I can't remember their URL at the moment, but I know we have a link to it from: http://www.verity.com/customers.html Nick From owner-robots Sun Nov 26 16:57:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11355; Sun, 26 Nov 95 16:57:32 -0800 Date: Mon, 27 Nov 95 09:57:15 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511270057.AA12772@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > A am developing sites for numerous large associations. I want to provide a > service to the members by which they can choose from selected topics..say > mortgage interest rates..and a robot goes out and searches selected sites > and provides either by e-mail a formated "newsletter" or return a > "newsletter" in html. > > Any suggestions? A number of people are working towards the ability to search selected sites, though I haven't heard of anyone trying to put the result in a newletter format. Harvest allows the user to custom build his own database, which is then locally accessed at search time. (http://harvest.cs.colorado.edu/) MetaCrawler, Silk, IBMinfoMarket, and no doubt many others query multiple pre-configured search databases at search time. (http://metacrawler.cs.washington.edu:8080/home.html http://services.bunyip.com:8000/products/silk/silk.html http://www.infomkt.ibm.com/about.htm) I'm looking forward to the day when two of these "meta" search services point to each other and create an infinite search loop.... PF ps. If you're going to the WWW conference in Boston, I'll be chairing a BOF on distributed searching. 
Please see http://rodem.slab.ntt.jp:8080/paulStuff/ From owner-robots Sun Nov 26 18:28:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16185; Sun, 26 Nov 95 18:28:42 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130509acded1bfff27@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 26 Nov 1995 18:28:33 -0800 To: robots@webcrawler.com, owner-robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: BOUNCE robots: Admin request Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:08 AM 11/8/95, <owner-robots@webcrawler.com> wrote: >Whereas lots of robots don't. Obviously it is recursive in that you >do pull urls out of pages and eventually follow them, but it doesn't >feel recursive. The 'fuzzy' stuff is a complete red herring - except >for the special case of 'fuzzy logic' (not what's being done here) the >word 'fuzzy' in the information retrieval context is a marketing term >without semantic content. Minor point -- let's not assume that no one on the list is using fuzzy logic to decide which links to follow. After all, some of us have search engines that use fuzzy logic operators. I'm fascinated by using evidential reasoning to build agents that explore. Nick From owner-robots Sun Nov 26 19:43:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20091; Sun, 26 Nov 95 19:43:06 -0800 Date: Mon, 27 Nov 95 12:42:56 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511270342.AA14195@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Q: Cooperation of robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is there some effort for robots to do gathering > informations in cooperative work style? 
> That is, Sharing informations gathered by the other kind of > robots with some communication between robots like > the that of intelligent agents in Intelligent Agent area. > I haven't seen anything, but I only pay so much attention to this list. I know that one problem is that many robots run to support profit- (or planned profit-) based services, so don't want to share their info. What do you see as the advantage to sharing information? It is offhand not clear to me that much is to be gained by it. For instance, given that each robot-running organization usually has their own way of processing the resources they find, then they have to go out and retrieve the resources in any event. Thus, not much may be saved by sharing information.... PF From owner-robots Mon Nov 27 01:14:04 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06867; Mon, 27 Nov 95 01:14:04 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199511270913.LAA29177@krisse.www.fi> Subject: Re: Q: Cooperation of robots To: robots@webcrawler.com Date: Mon, 27 Nov 1995 11:13:46 +0200 (EET) In-Reply-To: <9511270342.AA14195@cactus.slab.ntt.jp> from "Paul Francis" at Nov 27, 95 12:42:56 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1744 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com francis@cactus.slab.ntt.jp (Paul Francis): > I haven't seen anything, but I only pay so much > attention to this list. I know that one problem is > that many robots run to support profit- (or planned > profit-) based services, so don't want to share their > info. We at http://www.fi/ have a good coverage of the www-resources of Finland. You are right, we are clearly not willing to share our information base with other search engines in Finland (there is another one). 
On the other hand, it might be possible to share the database with some or all of the international search engines as a promotion. We would not lose any markets here in Finland, because our site would always be the fastest way for Finnish customers to perform searching. > What do you see as the advantage to sharing information? > It is offhand not clear to me that much is to be gained > by it. For instance, given that each robot-running > organization usually has their own way of processing > the resources they find, then they have to go out and > retrieve the resources in any event. Thus, not much > may be saved by sharing information.... If the two co-operating parties agree on a common set of information to store about each individual page, both could modify their robots to comply with this. Possibly even just a compressed .tar.gz archive of the pages could do. Anyway it saves bandwidth in international connections and annoys the servers less. I do not believe that our current database would suit anybody else's needs, but maybe the next time we collect all the pages we could fetch all the information that someone else needs too. Feel free to contact me at Jaakko.Hyvatti@www.fi if you are interested. We cover almost all of Finland. From owner-robots Mon Nov 27 08:27:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04759; Mon, 27 Nov 95 08:27:10 -0800 Message-Id: <9511271626.AA04714@webcrawler.com> Original-Received: from research by ns Pp-Warning: Illegal Received field on preceding line X-Mailer: exmh version 1.6.4 10/10/95 From: Fred Douglis <douglis@research.att.com> To: Andrew Daviel <andrew@andrew.triumf.ca> Cc: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk, Daniel Terrer <Daniel.Terrer@sophia.inria.fr> Subject: Re: wwwbot.pl problem In-Reply-To: Your message of "Thu, 23 Nov 1995 12:42:51 PST."
<Pine.LNX.3.91.951123111508.16547A-100000@andrew.triumf.ca> X-Face: *lvs`^NFil<?gI%c@~W[5*dWZ5;4-8#&S`1t,Ey&5R5z7nLBE)TKc?44|-sPxDy<i[jb[s Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com XQu4i;It_f~o>3, KN{Fk?$+k063Tiv(F~;?02MoaTUP/:+;eeHIOHWf_Ob-s*iTugCX^)YVicQB<1: {??RaMPnky^1nA7'2!$REBJNc=skHq:poE<ObzL*~*M-w$9Vxx`Lv>ZcirD$]R#_f8~qT,O[Vc)x, G bKn>8, <X)r, rKv|oipe=j/;e0%f/j:#/bRy('D]"f|zB3 X-Uri: http://www.research.att.com/orgs/ssr/people/douglis Date: Mon, 27 Nov 1995 11:15:47 -0500 Sender: douglis@pelican.research.att.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" I reported this bug a few months ago and I thought a patch had been installed in the distribution. Roy? -- Fred Douglis MIME accepted douglis@research.att.com AT&T Bell Laboratories 908 582-3633 (office) 600 Mountain Ave., Rm. 2B-105 908 582-3063 (fax) Murray Hill, NJ 07974 http://www.research.att.com/orgs/ssr/people/douglis/ From owner-robots Mon Nov 27 12:29:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17259; Mon, 27 Nov 95 12:29:54 -0800 Message-Id: <199511272029.PAA14228@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: harvest Date: Mon, 27 Nov 1995 15:29:38 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com there's been some mention of harvest.. the URL is http://harvest.cs.colorado.edu/ this provides a ton of infrastructure for implementing robots on top of, in the form of gatherers and or brokers. harvest sites cooperate so that once (with caching) a set of data (ftp, http, gopher, wais, etc.) has been "harvested" (or gathered), the global harvest database can reuse the gathered info without re-harvesting (re-gathering) from the target data site. this is "responsible"* robots that dont load up data sites with redundant automated downloading and cooperative robots, via brokering. 
* or ethical: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html see http://harvest.cs.colorado.edu/harvest/technical.html for more. for a linear robot cooperation, harvest provides Summary Object Interchange Format (SOIF), http://harvest.cs.colorado.edu/Harvest/brokers/soifhelp.html arbitrary extensions to SOIF are on the object, object-attribute model. for nonlinear robot cooperation or interaction, brokers can be defined arbitrarily. i'm presently working on an associative AI which i had developed as a standalone program, but am stripping my lame gathering and brokering code for the sophistication of harvest. -john From owner-robots Mon Nov 27 14:39:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18292; Mon, 27 Nov 95 14:39:00 -0800 Date: Mon, 27 Nov 95 15:55:32 EST From: Jason_Murray_at_FCRD@cclink.tfn.com Message-Id: <9510278175.AA817518051@cclink.tfn.com> To: robots@webcrawler.com Subject: Re: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Give me a call (617) 345-2465 or send email (netsoft@aol.com). We are in process of creating just such an agent. Jason Murray DataMarket 306 Union St Rockland MA 02370 Fax 617-871-5816 From owner-robots Mon Nov 27 14:58:48 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18458; Mon, 27 Nov 95 14:58:48 -0800 Message-Id: <30BA6C06.444C@infi.net> Date: Mon, 27 Nov 1995 17:55:18 -0800 From: Michael Goldberg <magi@infi.net> Organization: Media Access Group X-Mailer: Mozilla 2.0b2a (Windows; I; 16bit) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: harvest References: <199511272029.PAA14228@lexington.cs.columbia.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Received your email through the robots listserv... I need an application built for a site I am developing... 
The application allows users of the site to tailor specified areas of interest, say mortgages, and search specific WWW sites and retrieve the information either by email or a formatted newsletter. Can Harvest do this? From owner-robots Mon Nov 27 16:38:38 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19231; Mon, 27 Nov 95 16:38:38 -0800 Message-Id: <199511280038.TAA14968@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: mortgages with: Re: harvest In-Reply-To: Your message of "Mon, 27 Nov 1995 17:55:18 PST." <30BA6C06.444C@infi.net> Date: Mon, 27 Nov 1995 19:38:34 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Received your email through the robots listserv... > I need an application built for a site I am developing... > The application allows users of the site to tailor specified > areas of interest, say mortgages, and search specific WWW sites this is the kind of thing that harvest provides for. basically, in "tailoring" information dynamically (as opposed to going to a static menu system) your user is faced with (recursively) traversing an association graph. the user wants to see data with mortgage numbers. associativity is the service we are providing. better associativity, however, classes data, eg, via SOIF, so that the user has more coherent domains to search through than "every document with numeric strings and the string 'mortgage'". presently, SOIF provides for arbitrary degrees of data classification which is a strong solution for most applications, and generally an optimal solution for applications involving fairly regular data formats, eg, reports or forms. harvest provides for sites to cooperate or interoperate efficiently for applications such as these since no one site could ever have space to replicate the entire internet, or even a significant associative slice of it, in providing a monolithic internet database.
basically the talent of harvest in linear interoperability, via SOIF, is providing the architecture for this recursively infinite association graph traversal in most forms of data, especially business data. > and retrieve the information either by email or a formatted newsletter. > Can Harvest do this? certainly you could put an email or such interface on the system, but your users would probably be happier with something more responsive and flexible like a web interface. an interactive interface provides the opportunity for refining data collection, for discovering new sources of data, etc. -john From owner-robots Mon Nov 27 19:52:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26675; Mon, 27 Nov 95 19:52:36 -0800 Date: Mon, 27 Nov 1995 22:52:30 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199511280352.WAA24695@dolphin.automatrix.com> To: robots@webcrawler.com Subject: How frequently should I check /robots.txt? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm working on a specialized robot to identify Web sites with concert itineraries (by scoring the contents of the file against expected patterns). I will announce it here when I begin exercising it outside my local network. I'm a bit confused about how often I should update my local copy of a site's /robots.txt file. Clearly I shouldn't check it with each access, since that would double the number of accesses my robot would make to a site. I saw nothing in my server's access logs that would suggest that any of the robots that visit our site ever perform a HEAD request for /robots.txt (indicating they were checking for a Last-modified header). So how about it? How often should /robots.txt be checked?
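[Editorial sketch, not part of the original message.] The HEAD-request idea mentioned above can be illustrated as a comparison of two Last-Modified values: the one stored with the robot's cached copy and the one returned by a fresh HEAD request. The function name `robots_txt_changed` and the choice to refetch whenever either value is missing or unparseable are assumptions made for this illustration; no robot discussed in the thread is known to work this way.

```python
from email.utils import parsedate_to_datetime


def robots_txt_changed(cached_last_modified, current_last_modified):
    """Return True if the cached /robots.txt should be fetched again.

    Both arguments are Last-Modified header values (HTTP date strings)
    or None. Hypothetical helper: when in doubt, it asks for a refetch.
    """
    if not cached_last_modified or not current_last_modified:
        # No basis for comparison, so play safe and refetch.
        return True
    try:
        cached = parsedate_to_datetime(cached_last_modified)
        current = parsedate_to_datetime(current_last_modified)
        return current > cached
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: refetch rather than guess.
        return True
```

A robot could call this once per run with the value from a HEAD request, which costs one small request per site rather than one per document fetched.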
Thx, Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Mon Nov 27 20:31:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00165; Mon, 27 Nov 95 20:31:52 -0800 Date: Mon, 27 Nov 1995 23:27:08 -0600 (CST) From: gil cosson <gil@rusty.waterworks.com> To: robots@webcrawler.com Cc: robots@webcrawler.com Subject: Re: How frequently should I check /robots.txt? In-Reply-To: <199511280352.WAA24695@dolphin.automatrix.com> Message-Id: <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How about adding an entry to the robots.txt file that specifies how frequently the robots.txt file should be checked? gil. ========================================================================== "Everybody can be great because anybody can serve. You don't have to have a college degree to serve. You don't have to make your subject and verb agree to serve. You don't have to know the second theory of Thermo Dynamics and physics to serve. You only need a heart full of grace. A soul generated by love." Martin Luther King Jr. On Mon, 27 Nov 1995, Skip Montanaro wrote: > > I'm working on a specialized robot to identify Web sites with concert > itineraries (by scoring the contents of the file against expected patterns). > I will announce it here when I begin exercising it outside my local network. > > I'm a bit confused about how often I should update my local copy of a site's > /robots.txt file. Clearly I shouldn't check it with each access, since that > would double the number of accesses my robot would make to a site. 
> > I saw nothing in my server's access logs that would suggest that any of the > robots that visit our site ever perform a HEAD request for /robots.txt > (indicating they were checking for a Last-modified header). > > So how about it? How often should /robots.txt be checked? > > Thx, > > Skip Montanaro skip@calendar.com (518)372-5583 > Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com > Internet Conference Calendar: http://www.calendar.com/conferences/ > >>> ZLDF: http://www.netresponse.com/zldf <<< > From owner-robots Mon Nov 27 23:22:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03527; Mon, 27 Nov 95 23:22:57 -0800 Message-Id: <9511280722.AA03518@webcrawler.com> To: robots@webcrawler.com Subject: Re: How frequently should I check /robots.txt? In-Reply-To: Your message of "Mon, 27 Nov 1995 23:27:08 CST." <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com> Date: Mon, 27 Nov 1995 23:22:54 -0800 From: Martijn Koster <mak@surfski.webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com>, gil cosson writes: > How about adding an entry to the robots.txt file that specifies how > frequently the robots.txt file should be checked? Hmm.. and then how often do you check if the checking frequency has changed? :-) Seriously though I don't think there'd be a lot of benefit; as an admin you tend not to know when you'll make the next change. From an http point of view robots could be smart, and look at the Expires header. Deciding how often to check for the /robots.txt depends highly on how you run your robot: how many runs per week, how many documents when, etc. I'd say a week is a reasonable time. If your robot supports end-user submissions you could of course be clever about people submitting their /robots.txt URL; that would give them more influence.
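[Editorial sketch, not part of the original message.] The two suggestions above, honouring the Expires header and otherwise rechecking about once a week, can be combined into a single scheduling rule. The names `next_robots_check` and `DEFAULT_LIFETIME` are hypothetical, and treating a missing, unparseable, or already-past Expires value as "no hint" is an assumption of this sketch rather than anything stated in the thread.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Fallback lifetime taken from the "a week is a reasonable time" remark.
DEFAULT_LIFETIME = timedelta(days=7)


def next_robots_check(fetched_at, expires_header=None):
    """Return the time at which /robots.txt should next be fetched.

    fetched_at must be a timezone-aware datetime. expires_header is the
    raw Expires header value from the response, or None.
    """
    if expires_header:
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            expires = None  # Unparseable header: ignore it.
        if expires is not None:
            if expires.tzinfo is None:
                # Assumption: HTTP dates without a zone are taken as UTC.
                expires = expires.replace(tzinfo=timezone.utc)
            if expires > fetched_at:
                return expires
    # No usable hint from the server: fall back to the weekly default.
    return fetched_at + DEFAULT_LIFETIME
```

For example, a copy fetched at noon UTC on 27 Nov 1995 with `Expires: Wed, 29 Nov 1995 12:00:00 GMT` would be rechecked on the 29th, while the same copy with no Expires header would be rechecked on 4 Dec.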
-- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Nov 29 18:16:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11717; Wed, 29 Nov 95 18:16:32 -0800 Message-Id: <9511300215.AA04718@grasshopper.ucsd.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Date: Wed, 29 Nov 95 18:15:27 -0800 To: robots@webcrawler.com Subject: McKinley Spider hit us hard Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A spider from magellan.mckinley.com hit us hard today and did a deep recursive search of our web tree. Not very friendly, but their spider did check /robots.txt which indicates that they may have successfully implemented the robot exclusion protocol. Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html here is their internic info if anyone else wants to complain to them: The McKinley Group (MCKINLEY-DOM) 85 Liberty Ship Way Suite 201 Sausalito, CA 94965 Domain Name: MCKINLEY.COM Administrative Contact, Technical Contact, Zone Contact: Cohen, Alexander J. (ASC2) xcohen@MCKINLEY.COM 415-331-1884 FAX Record last updated on 21-Sep-95. Record created on 14-Jul-94. 
Domain servers in listed order: NS1.NOC.NETCOM.NET 204.31.1.1 NS2.NOC.NETCOM.NET 204.31.1.2