From owner-robots Thu Oct 12 14:39:19 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20349; Thu, 12 Oct 95 14:39:19 -0700
Message-Id: <9510122139.AA20341@webcrawler.com>
To: robots
Subject: The robots mailing list at WebCrawler
From: Martijn Koster
Date: Thu, 12 Oct 1995 14:39:19 -0700
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Welcome to our new home... This mailing list is now open for traffic. For details see:
http://info.webcrawler.com/mailing-lists/robots/info.html

-- Martijn

__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Thu Oct 12 16:09:58 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25602; Thu, 12 Oct 95 16:09:58 -0700
Date: Thu, 12 Oct 95 16:09 PDT
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Something that would be handy
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

It might be nice to enhance robots.txt to include a hint as to how long the file ought to be cached by a robot driver. People who don't understand why probably ought to ignore this message. People who do might want to suggest (a) reasons why this is a silly idea, (b) a syntax/method for doing it, or (c) any implementation difficulties that could ensue.

My suggestion, expressed in the form of Perl code that could be used to implement it:

if (/^\s*CacheHint:\s+(\d+)\s*([dhm])\s*$/) {
    $SecondsToCache = $1;
    if ($2 eq 'd') {
        $SecondsToCache *= 60*60*24;
    } elsif ($2 eq 'h') {
        $SecondsToCache *= 60*60;
    } else {
        $SecondsToCache *= 60;
    }
}

Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
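To make the proposed syntax concrete (the CacheHint field is only the suggestion above, not part of any agreed robots.txt convention), a /robots.txt carrying the hint might read:

# hypothetical example of the proposed CacheHint field
CacheHint: 7 d
User-agent: *
Disallow: /tmp/

A robot driver that recognised the field would keep its cached copy of the file for seven days before re-fetching it; the code above turns "7 d" into 604800 seconds. A robot that does not know the field would presumably just skip the line.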
From owner-robots Fri Oct 13 18:03:54 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29927; Fri, 13 Oct 95 18:03:54 -0700
Date: Sat, 14 Oct 95 11:07:39 0000
From: James
Organization: Tourist Radio Pty Ltd
X-Mailer: Mozilla 1.1N (Macintosh; I; 68K)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: Site Announcement
X-Url: http://info.webcrawler.com/mailing-lists/robots/info.html
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

We wish to advise those running a robot-based search facility that we have two sites, at http://www.com.au/aaa and http://www.world.net/touristradio. We would be grateful if you would ask your robots to visit and announce our sites where possible. If this is bad net etiquette, we apologise; there are huge backlogs with the manual services.

James

From owner-robots Mon Oct 16 08:25:16 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00957; Mon, 16 Oct 95 08:25:16 -0700
Message-Id: <9510161525.AA00951@webcrawler.com>
To: robots
Subject: Re: Site Announcement
In-Reply-To: Your message of "Sat, 14 Oct 1995 11:07:39."
Date: Mon, 16 Oct 1995 08:25:16 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi,

You've asked me to add a link. The best way to get a link added to the WebCrawler is to submit it at http://www.webcrawler.com/WebCrawler/SubmitURLS.html

Regards,

-- Martijn

__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Mon Oct 16 18:36:43 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29862; Mon, 16 Oct 95 18:36:43 -0700
Date: 16 Oct 1995 18:40:48 -0800
From: "Roger Dearnaley"
Subject: How do I let spiders in?
X-Mailer: Mail*Link SMTP-QM 3.0.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Is there any currently supported way of providing spiders access to our (soon to be launched) username & password authenticated site? (Of course, if a customer followed a link generated by this spider search, they will be asked for authentication, but when they can't provide it we will redirect them to a Registration page.)

The security on our site is not meant to be high: it is there primarily so that the forms' CGI scripts have a unique user name to figure out who is doing what. Thus for our site we would probably be happy to just place a user name and password in robots.txt, or some similar low-security solution. However, I can see that for other sites this might not be acceptable, so spider maintainers might want to consider adding fields for the username and password to use to their 'Please index this URL' submission forms. Then, ideally, it should be possible to submit these forms securely.

--Roger Dearnaley (roger_dearnaley@intouchgroup.com)

From owner-robots Wed Oct 18 08:32:24 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12938; Wed, 18 Oct 95 08:32:24 -0700
X-Sender: narnett@hawaii.verity.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 08:31:05 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Unfriendly robot at 205.177.10.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

One of my Web servers (http://asearch.mccmedia.com/) last night was attacked by a very unfriendly robot that requested many documents per second. This robot was originating from 205.177.10.2. I've tried to resolve that IP address, but I'm unable thus far. However, a traceroute shows that a cais.net router was the last hop before the domain in which the offending robot lives, so I sent an e-mail to the postmaster there, hoping that he or she will know whose host that is and will forward it (assuming that whoever owns this thing is a CAIS customer).

Has anyone else encountered this one? It doesn't identify itself at all.

Nick

From owner-robots Wed Oct 18 08:58:47 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14082; Wed, 18 Oct 95 08:58:47 -0700
Date: Wed, 18 Oct 95 08:58 PDT
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Re: Unfriendly robot at 205.177.10.2
Cc: robots@webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 08:31 18/10/95 -0700, Nick Arnett wrote:
>One of my Web servers (http://asearch.mccmedia.com/ last night was attacked
>by a very unfriendly robot that requested many documents per second. This
>robot was originating from 205.177.10.2.

That resolves to 'murph.cais.net' - no idea who they are, never heard of 'em.
- Tim From owner-robots Wed Oct 18 09:06:44 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14459; Wed, 18 Oct 95 09:06:44 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 09:05:20 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: CORRECTION -- Re: Unfriendly robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Whoops -- I pasted the wrong IP address into this message. The unfriendly robot was at 205.252.60.50. Nick From owner-robots Wed Oct 18 09:32:08 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15587; Wed, 18 Oct 95 09:32:08 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 09:30:32 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.177.10.2 Cc: tbray@opentext.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 8:58 AM 10/18/95, Tim Bray wrote: >That resolves to 'murph.cais.net' - no idea who they are, never heard >of 'em. As you may have seen in my correction, that was a mistake on my part. I copied that from the traceroute -- it's the last router before the address space in which the misbehaving robot lives. It is Capitol Area Internet Service and under the assumption that the owner of the robot is one of their customers, I sent a message to the CAIS postmaster. The correct address of the owner of the robot is 205.252.60.50, which won't resolve. Tight security, apparently. Ironically. Nick From owner-robots Wed Oct 18 09:43:26 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16066; Wed, 18 Oct 95 09:43:26 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510181643.RAA22167@wsinis11.win.tue.nl> Subject: Re: Unfriendly robot at 205.177.10.2 To: robots@webcrawler.com Date: Wed, 18 Oct 1995 17:42:55 +0100 (MET) In-Reply-To: from "Nick Arnett" at Oct 18, 95 08:31:05 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 921 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Nick Arnett) write: > >One of my Web servers (http://asearch.mccmedia.com/ last night was attacked >by a very unfriendly robot that requested many documents per second. This >robot was originating from 205.177.10.2. I've tried to resolve that IP >address, but I'm unable thus far. However, a traceroute shows that a >cais.net router was the last hop before the domain in which the offending >robot lives, so I sent an e-mail to the postmaster there, hoping that he or >she will know whose host that is and will forward it (assuming that whoever >owns this thing is a CAIS customer). Here you are: % host 205.177.10.2 Name: murph.cais.net Address: 205.177.10.2 Aliases: >Has anyone else encountered this one? It doesn't identify itself at all. No accesses here from 205.177.10.2 or cais.net. >Nick -- Reinier Post reinpost@win.tue.nl a.k.a. 
me

From owner-robots Wed Oct 18 11:32:15 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21768; Wed, 18 Oct 95 11:32:15 -0700
Message-Id: <9510181831.AA06646@ai.iit.nrc.ca>
Date: Wed, 18 Oct 95 14:31:39 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Dear spider developers,

My name is Alain Desilets. I am a researcher in the Interactive Information Group of the National Research Council of Canada. We are a small group (6 people) developing tools for interactive access to information. Our technological angle on this problem is AI-based approaches, in particular Machine Learning and Agents. You can find more about our work at http://ai.iit.nrc.ca/II_public/.

In order to test our methods we need to acquire a large corpus of full HTML files from the Web. We plan to use a spider for that task. We are aware of the controversy surrounding the creation of new spiders and therefore do not plan to develop one. That would not only be a duplication of effort but would also introduce a new, possibly buggy spider into Koster's already vast list of Web critters. Instead, we would like to use a publicly available, well-behaved and proven spider.

Is there such a spider available for serious research purposes? Or maybe the corpus we need already exists? Is there a CD-ROM or .zip file that would give us the whole of the web in full HTML?

Thanks for your help.

Alain Desilets
Institute for Information Technology
National Research Council of Canada
Building M-50, Montreal Road
Ottawa (Ont) K1A 0R6
e-mail: alain@ai.iit.nrc.ca
Tel: (613) 990-2813
Fax: (613) 952-7151

From owner-robots Wed Oct 18 12:28:54 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23934; Wed, 18 Oct 95 12:28:54 -0700
Date: Wed, 18 Oct 1995 15:34:04 -0400
Message-Id: <199510181934.PAA12177@maple.sover.net>
X-Sender: Leigh.D.Dupee@neinfo.net
X-Mailer: Windows Eudora Version 1.4.4
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Leigh.D.Dupee@neinfo.net (Leigh DeForest Dupee)
Subject: Re: Unfriendly robot at 205.177.10.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Query: All records (ALL): 2.10.177.205.in-addr.arpa
Authoritative Answer
2.10.177.205.in-addr.arpa  PTR  murph.cais.net
10.177.205.in-addr.arpa    NS   cais.com
cais.com                   A    199.0.216.4
Complete: 2.10.177.205.in-addr.arpa

Query: All records (ALL): murph.cais.net
Authoritative Answer
Name does not exist
Complete: NO_DATA murph.cais.net

Best I can come up with!

>One of my Web servers (http://asearch.mccmedia.com/ last night was attacked
>by a very unfriendly robot that requested many documents per second. This
>robot was originating from 205.177.10.2. I've tried to resolve that IP
>address, but I'm unable thus far. However, a traceroute shows that a
>cais.net router was the last hop before the domain in which the offending
>robot lives, so I sent an e-mail to the postmaster there, hoping that he or
>she will know whose host that is and will forward it (assuming that whoever
>owns this thing is a CAIS customer).
>
>Has anyone else encountered this one? It doesn't identify itself at all.
> >Nick > > > --------------------------------------------------------------- Leigh DeForest Dupee Help Me Learn, Inc., Administrator for NEInfo.Net South Stream Road RR3 Box 4203, Bennington, VT 05201 (802) 447-2905 --------------------------------------------------------------- From owner-robots Wed Oct 18 12:49:50 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24697; Wed, 18 Oct 95 12:49:50 -0700 Message-Id: <9510181951.AA08164@pluto.sybgate.sybase.com> X-Sender: dbakin@pluto X-Mailer: Windows Eudora Version 2.1.1 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 12:49:14 -0700 To: robots@webcrawler.com From: David Bakin Subject: Is it a robot or a link-updater? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com As the subject implies, I'm curious if there is a difference, in the impact on the serving site, between a true robot and someone running an automatic link updater? Can they even be told apart by the serving site? -- Dave -- Dave Bakin How much work would a work flow flow if a #include 415-872-1543 x5018 work flow could flow work? From owner-robots Wed Oct 18 13:16:38 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25902; Wed, 18 Oct 95 13:16:38 -0700 From: amonge@cs.ucsd.edu (Alvaro Monge) Message-Id: <9510182013.AA10642@dino> Subject: Re: Looking for a spider To: robots@webcrawler.com Date: Wed, 18 Oct 1995 13:13:55 -0700 (PDT) In-Reply-To: <9510181831.AA06646@ai.iit.nrc.ca> from "Alain Desilets" at Oct 18, 95 02:31:39 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1865 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A colleague of mine and I are also doing research which is AI based and are in need of a large corpus for our use. We would like to use anything that is already available which keeps the structure of the real WWW and does not take anything away. This is in order to create realistic experiments of our approaches. Thanks in advance for any pointers, --Alvaro Computer science and engineering department University of California, San Diego > > Dear spider developpers. > > > My name is Alain Desilets. I am a researcher in the Interactive > Information Group of the National Research Council of Canada. > > We are a small group (6 people) developing tools for interactive > access to information. Our technological angle on this problem is AI > based approaches, in particular Machine Learning and Agents. You can > find more about our work at http://ai.iit.nrc.ca/II_public/. > > In order to test our methods we need to acquire a large corpus of > full HTML files from the Web. We plan to use a spider for that task. > > We are aware of the controversy surrounding the creation of new > spiders and therefore do not plan to develop one. That > would not only be a duplication of effort but would also introduce a > new, possibly buggy spider in Koster's already vast list of Web > critters. Instead, we would like to use a publically available, well > behaved and proven spider. > > Is there such spider available for serious research purpose? > > Or maybe the corpus we need already exists? Is there a CD-ROM or .zip > file that would give us the whole of the web in full HTML? > > > Thanks for your help. 
> > Alain Desilets > > Institute for Information Technology > National Research Concil of Canada > Building M-50 > Montreal Road > Ottawa (Ont) > K1A 0R6 > > e-mail: alain@ai.iit.nrc.ca > Tel: (613) 990-2813 > Fax: (613) 952-7151 > > From owner-robots Wed Oct 18 14:13:35 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28102; Wed, 18 Oct 95 14:13:35 -0700 Message-Id: Date: 18 Oct 1995 15:13:44 -0700 From: "Xiaodong Zhang" Subject: Re: Looking for a spider To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reply to: RE>>Looking for a spider 7/24/95 - Frontier Technologies licenses Lycos Internet Catalog software MEQUON, WIS. (July 24) BUSINESS WIRE -July 24, 1995--Frontier Technologies Corp. today announced it has signed an agreement to license the Lycos(TM) Internet Catalog. The Lycos Catalog has been incorporated into Frontier Technologies" new SuperHighway Access product, called SuperHighway Access CyberSearch(TM), which allows users to perform a Lycos search offline via CD-ROM, connecting to the Internet only once relevant Internet resources have been identified. The Lycos technology was developed at Carnegie Mellon University, and was recently transferred to Lycos Inc., a newly-created subsidiary of CMG Information Services Inc. Lycos is a software system which contains a robot that searches the World Wide Web and catalogs the documents it finds. It also includes an information search engine that helps users access information quickly and easily when they type in key words or topics. The Lycos exploration robot locates new and changed documents and builds abstracts, which consist of title, headings, subheadings, 100 most significant words and the first 20 lines of the document. The catalog is continually updated by the Lycos exploration agent. Frontier will receive regular updates from Lycos Inc., allowing it to produce monthly issues of SuperHighway Access CyberSearch. "It's now widely understood that one of the primary barriers to users" productivity on the Internet is finding information," said Dennis Freeman, Frontier Technologies" marketing director. "That's why Internet search services like Lycos are among the Internet's most popular sites." "Lycos Inc. is pleased to partner with Frontier as they contribute to our continued position as the most widely used and most comprehensive catalog product on the Web," said Bob Davis, CEO of Lycos Inc. The product, now shipping, consists of a 608-megabyte subset of the Lycos catalog, indexing about half a million web pages, integrated with Frontier's multi-session, multi-protocol Internet browser software. The product is shipped on CD-ROM and is available through Frontier's reseller channel. The CD will be updated monthly (bi-monthly initially) Frontier is offering the first issue of CyberSearch at $14.95. A charter subscription for 6 issues is priced at $6.75 per month. Subscribers should call 1-800/879-0075 (+1-414/571-0190 outside the U.S.) or access Frontier's web server, http://www.frontiertech.com for further information. Lycos Inc., with offices in Wilmington, Mass. and Pittsburgh, Penn., is the newly formed corporation based upon technology developed at Carnegie Mellon University. Frontier Technologies Corp., based in Mequon, is a leading supplier of TCP/IP and Internet-based products that make businesses more competitive in a global market. CONTACT: Frontier Technologies Corp., Mequon Nicole Rogers, 414/241-4555 x293 or Lycos Inc. 
Mike Olfe, 508/657-5050 x3124

From owner-robots Wed Oct 18 14:55:47 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29718; Wed, 18 Oct 95 14:55:47 -0700
X-Sender: narnett@hawaii.verity.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 14:54:22 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Unfriendly robot owner identified!
Cc: aleonard@well.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Got 'em.

Using whois, I found that the IP address belongs to Library Corp. in Virginia. They're the providers of the "NlightN" search service at:

http://www.nlightn.com/

Anybody know anything about their robot? I know that they've licensed the Lycos data.
Their background information says, "NlightN, a division of The Library Corporation, was formed to develop and market a Universal Index to the world's electronically stored information." I guess their robot has to work fast to build a universal index... ;-) Nick From owner-robots Wed Oct 18 15:19:02 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01014; Wed, 18 Oct 95 15:19:02 -0700 Date: Wed, 18 Oct 1995 15:18:53 -0700 (PDT) From: Andrew Leonard Subject: Re: Unfriendly robot owner identified! To: robots@webcrawler.com In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, all. I'm a reporter for Wired working on a story about bots, and I'm personally following up on this NlightN robot episode. I've put a call into their Reston VA headquarters asking to talk to someone about their search robot, and I'll keep the list posted on whatever I find out. Andrew Leonard Wired Magazine > Got 'em. > > Using whois, I found that the IP address belongs to Library Corp. in > Virginia. They're the providers of the "NlightN" search service at: > > http://www.nlightn.com/ > > Anybody know anything about their robot? I know that they've licensed the > Lycos data. > > Their background information says, "NlightN, a division of The Library > Corporation, was formed to develop and market a Universal Index to the > world's electronically stored information." > > I guess their robot has to work fast to build a universal index... ;-) > > Nick > > > From owner-robots Wed Oct 18 15:38:57 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02009; Wed, 18 Oct 95 15:38:57 -0700 From: amonge@cs.ucsd.edu (Alvaro Monge) Message-Id: <9510182200.AA11857@dino> Subject: Re: Looking for a spider To: robots@webcrawler.com Date: Wed, 18 Oct 1995 15:00:01 -0700 (PDT) In-Reply-To: from "Xiaodong Zhang" at Oct 18, 95 03:13:44 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 555 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Unfortunately, I cannot use most robots that I know of because they DO NOT SAVE the entire document, or its hierarchical structure. Lycos for example: > The Lycos exploration robot locates new and changed documents and > builds abstracts, which consist of title, headings, subheadings, > 100 most significant words and the first 20 lines of the document. For my research, this is not that useful. I need the entire document, as it appears at the source -- not as saved by some robot, because I want to follow the links within the document. --Alvaro From owner-robots Wed Oct 18 16:19:20 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04137; Wed, 18 Oct 95 16:19:20 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 18 Oct 1995 16:18:02 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Really fast searching Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com It's a bit off-topic, but I can't resist sharing something that one of our sharp-eyed engineers found in a certain company's information page about their search service: > By transparently linking hundreds of data sources, ******* has > created the world's largest integrated index, already comprised of > more than 100 gigabytes and growing daily. 
A proprietary database > engine provides immediate response time and actually increases speed > as the size of the index grows. We need this algorithm, our engineer says. It start off with immediate responses, then gets faster. Wowza! ("A meeting on time travel will be held last week.") Nick From owner-robots Thu Oct 19 06:29:53 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16844; Thu, 19 Oct 95 06:29:53 -0700 Message-Id: <9510191329.AA12490@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 09:29:15 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear Alvaro, Thanks for responding. I'll let you know if I find something. I'm interested to know more about your work. Do you have a Web page on it? Thanks Alain From owner-robots Thu Oct 19 06:32:09 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17037; Thu, 19 Oct 95 06:32:09 -0700 Message-Id: <9510191331.AA12583@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 09:31:31 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear Zhang, Thank you for the info. Unfortunately, I am in the same position as Alvaro Monge. I need the original HTML files, as opposed to some condensed version of it produced by a robot. Alain From owner-robots Thu Oct 19 06:39:50 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17600; Thu, 19 Oct 95 06:39:50 -0700 Message-Id: <9510191339.AA12691@ai.iit.nrc.ca> Date: Thu, 19 Oct 95 09:39:13 EDT From: Alain Desilets To: robots@webcrawler.com Subject: Sorry! Cc: alain@ai.iit.nrc.ca Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Sorry about the previous messages. I intended to send them directly to the people concerned but it somehow got sent to this list. - Alain From owner-robots Thu Oct 19 07:53:29 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22425; Thu, 19 Oct 95 07:53:29 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510191453.PAA06141@wswiop11.win.tue.nl> Subject: Re: Unfriendly robot at 205.177.10.2 To: robots@webcrawler.com Date: Thu, 19 Oct 1995 15:53:11 +0100 (MET) Cc: tbray@opentext.com In-Reply-To: from "Nick Arnett" at Oct 18, 95 09:30:32 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 989 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The correct address of the owner of the robot is 205.252.60.50, which won't >resolve. Tight security, apparently. Ironically. Well, on our site (www.win.tue.nl), it's causing no problems at all: % grep '205\.252' /usr/www/logs/cern_access.log 205.252.60.50 - - [13/Oct/1995:12:30:13 +0100] "GET / HTTP/1.0" 302 381 205.252.60.50 - - [13/Oct/1995:20:58:55 +0100] "GET / HTTP/1.0" 302 381 % wc /usr/www/logs/cern_access.log 206422 2062250 22193056 /usr/www/logs/cern_access.log That is, out of the last 206,422 requests, 2 were from this site. Lycos wants to index as many documents on a site it can find. This robot has only made two requests, and it didn't even retrieve our home page (/ is redirected to /win/, which is the actual home page). Perhaps it doesn't follow redirections. >Nick -- Reinier Post reinpost@win.tue.nl a.k.a. 
me

From owner-robots Thu Oct 19 07:57:03 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22755; Thu, 19 Oct 95 07:57:03 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191456.PAA06159@wswiop11.win.tue.nl>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 15:56:40 +0100 (MET)
In-Reply-To: <9510182200.AA11857@dino> from "Alvaro Monge" at Oct 18, 95 03:00:01 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 1038
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

You (Alvaro Monge) write:

>Unfortunately, I cannot use most robots that I know of because they
>DO NOT SAVE the entire document, or its hierarchical structure.
>
>Lycos for example:
>
>> The Lycos exploration robot locates new and changed documents and
>> builds abstracts, which consist of title, headings, subheadings,
>> 100 most significant words and the first 20 lines of the document.
>
>For my research, this is not that useful. I need the entire document,
>as it appears at the source -- not as saved by some robot, because I
>want to follow the links within the document.

Lycos follows the links of documents; that's how robots work. The summaries are built for indexing purposes. You can't save the full text of all documents because of the disk space requirements (perhaps OpenText can?) and because of legal considerations.

>--Alvaro

--
Reinier Post    reinpost@win.tue.nl
a.k.a. me

From owner-robots Thu Oct 19 08:44:31 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26046; Thu, 19 Oct 95 08:44:31 -0700
X-Sender: narnett@hawaii.verity.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 19 Oct 1995 08:41:05 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.252.60.50
Cc: tbray@opentext.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 7:53 AM 10/19/95, Reinier Post wrote:
>>The correct address of the owner of the robot is 205.252.60.50, which won't
>>resolve. Tight security, apparently. Ironically.
>
>Well, on our site (www.win.tue.nl), it's causing no problems at all

In my e-mail to NlightN, I said that I assume it was unintentional. I can't imagine that anyone would purposely request documents at the rate they were hitting us. Of course, there's no way to know if that was the robot or a human-controlled browser hitting your site from the same host...

Thanks!

Nick
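The request rate is the crux of most of these complaints, and it is cheap to avoid in a robot's fetch loop. A minimal sketch in Perl, assuming the libwww-perl (LWP) library is available; the sixty-second pause is only a commonly quoted rule of thumb, not something any standard prescribes:

# Fetch a list of URLs, pausing between requests so the robot never
# fires off many documents per second the way described above.
use LWP::Simple qw(get);

$delay = 60;                      # seconds between requests; illustrative only
@urls  = @ARGV;                   # URLs to fetch, given on the command line

foreach $url (@urls) {
    $content = get($url);         # returns undef if the fetch fails
    if (defined $content) {
        print "fetched $url (", length($content), " bytes)\n";
    } else {
        print "failed to fetch $url\n";
    }
    sleep $delay;                 # be polite before the next request
}

A real robot would also fetch and honour /robots.txt and send an identifying User-agent header, which is exactly what the robot in this thread failed to do.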
From owner-robots Thu Oct 19 09:10:34 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28065; Thu, 19 Oct 95 09:10:34 -0700
Message-Id: <9510191609.AA14728@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 12:09:49 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

In response to Alvaro's message:

> >> The Lycos exploration robot locates new and changed documents and
> >> builds abstracts, which consist of title, headings, subheadings,
> >> 100 most significant words and the first 20 lines of the document.
> >
> >For my research, this is not that useful. I need the entire document,
> >as it appears at the source -- not as saved by some robot, because I
> >want to follow the links within the document.

Reinier Post writes:

> Lycos follows the links of documents; that's how robots work.
> The summaries are built for indexing purposes. You can't save
> the full text of all documents because of the disk space requirements
> (perhaps OpenText can?) and because of legal considerations.

Like Alvaro, I find that no robot-generated index of the whole web is sufficient for my purpose. My group is working on developing new tools that can process the web and "summarise" it in some novel way. For example:

- New and hopefully better keyword extraction algorithms
- Automatic generation of hierarchical indexes a la Yahoo
- Merging of small indexes into bigger ones
- etc.

In order to test these new approaches, we need the full HTML, not an index of it.

- Alain

From owner-robots Thu Oct 19 09:18:30 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28618; Thu, 19 Oct 95 09:18:30 -0700
Date: Fri, 20 Oct 1995 02:18:16 +1000
From: Murray Bent
Message-Id: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au>
To: robots@webcrawler.com
Subject: re: Lycos unfriendly robot
Content-Length: 439
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

According to Reinier Post:
>Lycos wants to index as many documents on a site it can find. This
>robot has only made two requests, and it didn't even retrieve our home page
>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't
>follow redirections.

>>Nick

>--
>Reinier Post reinpost@win.tue.nl

That may be fine if you have shares in Lycos or something. Do you?

mj

From owner-robots Thu Oct 19 11:01:14 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05721; Thu, 19 Oct 95 11:01:14 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191801.TAA19705@wsinis02.win.tue.nl>
Subject: Re: Lycos unfriendly robot
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 19:01:00 +0100 (MET)
In-Reply-To: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au> from "Murray Bent" at Oct 20, 95 02:18:16 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 918
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

You (Murray Bent) write:

>According to Reinier Post:
>>Lycos wants to index as many documents on a site it can find. This
>>robot has only made two requests, and it didn't even retrieve our home page
>>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't
>>follow redirections.
>
>>>Nick
>
>>--
>>Reinier Post reinpost@win.tue.nl
>
>That may be fine if you have shares in Lycos or something. Do you?

I don't follow your logic. *What* is fine if I have shares in Lycos? The fact that this visit was made by something that doesn't follow redirections, and therefore is unlikely to be a Lycos robot?

>mj

For some reason you seem to bear a grudge against Lycos. If my posting did anything to tear open any old wounds, I apologise.

--
Reinier Post    reinpost@win.tue.nl
a.k.a.
me From owner-robots Sat Oct 21 07:17:11 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06960; Sat, 21 Oct 95 07:17:11 -0700 Date: Sat, 21 Oct 1995 07:17:03 -0700 (PDT) From: Andrew Leonard Subject: Re: Unfriendly robot at 205.252.60.50 To: robots@webcrawler.com Cc: robots@webcrawler.com, tbray@opentext.com In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I contacted NlightN, and their CEO said that their most junior hire was testing a new robot. They were apparently unaware of the robot exclusion protocol but plan to mend their ways. Andrew Leonard Wired Magazine On Thu, 19 Oct 1995, Nick Arnett wrote: > At 7:53 AM 10/19/95, Reinier Post wrote: > >>The correct address of the owner of the robot is 205.252.60.50, which won't > >>resolve. Tight security, apparently. Ironically. > > > >Well, on our site (www.win.tue.nl), it's causing no problems at all > > In my e-mail to NlightN, I said that I assume it was unintentional. I > can't imagine that anyone would purposely request documents at the rate > they were hitting us. Of course, there's no way to know if that was the > robot or a human-controlled browser hitting your site from the same host... > > Thanks! > > Nick > > > From owner-robots Sat Oct 21 11:21:18 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23944; Sat, 21 Oct 95 11:21:18 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 21 Oct 1995 10:35:40 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Cc: robots@webcrawler.com, tbray@opentext.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 7:17 AM 10/21/95, Andrew Leonard wrote: >I contacted NlightN, and their CEO said that their most junior hire was >testing a new robot. They were apparently unaware of the robot exclusion >protocol but plan to mend their ways. I haven't heard from them, but our server/spider product manager received a telephone apology. I can't resist pointing out the irony of a search services company that apparently failed to find some critical information about robots on the Internet. On the other hand, we've probably done equally silly things. I hope they'll add a user-agent field, at least. Nick From owner-robots Sat Oct 21 17:47:17 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20668; Sat, 21 Oct 95 17:47:17 -0700 Message-Id: From: kimba@snog.it.com.au (Kim Davies) Subject: Re: Unfriendly robot at 205.252.60.50 To: robots@webcrawler.com Date: Sun, 22 Oct 1995 08:46:39 +0800 (WST) In-Reply-To: from "Nick Arnett" at Oct 21, 95 10:35:40 am X-Mailer: ELM [version 2.4 PL24 PGP2] Content-Type: text Content-Length: 554 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, > >I contacted NlightN, and their CEO said that their most junior hire was > >testing a new robot. They were apparently unaware of the robot exclusion > >protocol but plan to mend their ways. > > I haven't heard from them, but our server/spider product manager received a > telephone apology. Has someone invited them to join this list? If they discussed what they were doing it might be better for all concerned.. 
catchya, -- Kim Davies | "Belief is the death of intelligence" -Snog kimba@it.com.au | http://www.it.com.au/~kimba/ From owner-robots Sun Oct 22 13:14:28 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01215; Sun, 22 Oct 95 13:14:28 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 22 Oct 1995 13:13:12 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Unfriendly robot at 205.252.60.50 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 5:46 PM 10/21/95, Kim Davies wrote: >Hi, > >> >I contacted NlightN, and their CEO said that their most junior hire was >> >testing a new robot. They were apparently unaware of the robot exclusion >> >protocol but plan to mend their ways. >> >> I haven't heard from them, but our server/spider product manager received a >> telephone apology. > >Has someone invited them to join this list? If they discussed what they >were doing it might be better for all concerned.. I directed them to the robots pages on www.webcrawler.com, which should lead them to this list. What am I thinking -- the server that they were hammering with their robot includes recent messages from this list (at http://asearch.mccmedia.com/robots/). I suppose that means they might have looked... Nick From owner-robots Mon Oct 23 07:50:14 1995 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03859; Mon, 23 Oct 95 07:50:14 -0700 Date: Mon, 23 Oct 95 10:50:03 EDT From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler) Message-Id: <9510231450.AA10394@pixel.cps.msu.edu> To: robots@webcrawler.com Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Alain Desilets writes: > In order to test our methods we need to acquire a large corpus of > full HTML files from the Web. We plan to use a spider for that task. > and Alvaro Monge writes: > A colleague of mine and I are also doing research which is AI based > and are in need of a large corpus for our use. We would like to use > anything that is already available which keeps the structure of the > real WWW and does not take anything away. This is in order to create > realistic experiments of our approaches. > We are also doing research on AI based approaches to processing the web, and toward the goal of having a test bed of the web, we have a text-only copy of a subset of the web (currently about 650 meg) which we have been calling "the proving grounds". It is not possible to get a complete snapshot of the web at any given time, but without images and audio, we can at least have a large, known, subset. It's also to our collective advantage to all be working from the same subset. It is our intention to make the proving grounds available to the public, hopefully within the next two weeks. We used a spider which was a modified htmlgobble, which takes a URL and follows all the links, copying all the documents it finds except image, audio, and video files. The urls inside the documents have been modified so that everything points to the local copy, enabling a spider (or human browser) to traverse the database locally. Before we go public, I have a few questions: (1) We currently don't copy audio, video, image files and instead create a file by the same name with a single character identifying it as video, image, or audio. Would an empty file suffice? Is there another identification scheme that would be more useful? 
(2) We currently copy PostScript files, but are considering treating them as we do image files. They take a LOT of space, and are of no utility for the kind of analysis that we want to do. Would it be more useful to keep the PostScript, or treat it as we do images (which would then allow us to use the space for a larger web subset)?

I appreciate any feedback, and I'll announce to the list when it's ready for public use.

Marilyn Wulfekuhler
Intelligent Systems Lab, Michigan State University
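The step in Marilyn's note most likely to interest other robot writers is the link rewriting: every URL inside the copied documents is modified to point at the local copy, so the subset can be traversed offline. A rough sketch of that one step in Perl; this is not the modified htmlgobble itself, and url_to_path() and the /data/proving-grounds mirror root are invented here purely for illustration:

# Filter an HTML document, rewriting absolute http: links in HREF and SRC
# attributes so they point into a local mirror tree laid out as host/path.
$mirror_root = "/data/proving-grounds";     # invented path for the sketch

sub url_to_path {
    my ($url) = @_;
    return $url unless $url =~ m!^http://([^/]+)(/.*)?$!;
    my ($host, $path) = ($1, defined $2 ? $2 : "/");
    $path .= "index.html" if $path =~ m!/$!;    # directory URL -> index file
    return "$mirror_root/$host$path";
}

while (<>) {
    s{(href|src)\s*=\s*"(http:[^"]+)"}{
        my ($attr, $link) = ($1, $2);
        $attr . '="' . url_to_path($link) . '"';
    }gie;
    print;
}

Relative links can be left alone, since they already resolve within the mirrored tree; a fuller version would also have to handle unquoted attribute values, other URL schemes, and links pointing outside the copied subset.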
From owner-robots Mon Oct 23 15:27:34 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05597; Mon, 23 Oct 95 15:27:34 -0700
X-Sender: narnett@hawaii.verity.com
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 23 Oct 1995 15:26:16 -0700
To: Andrew Daviel, robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Proposed URLs that robots should search
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

At 1:51 PM 10/23/95, Andrew Daviel wrote:

>With my other hat on (admin@vancouver-webpages.com), I'm
>trying to build a database of URLs and other information for businesses
>on the Net.

I can't quite contain the urge to say, "Isn't everyone?"

>Some database registration robots (I believe) search submitted URLs for
>keywords, doing some natural language processing to discard modifiers and
>prepositions. However, the trend to graphics-dominated homepages makes
>such efforts of dubious utility.

I wouldn't be so quick to jump to that conclusion. I have seen few, if any, business sites that don't offer text-only versions of their key pages. Also, I'm utterly certain that a good relevancy-ranking engine will do a better job at assigning categories than will an uncontrolled set of people, especially when those people are out to maximize hits, rather than to maximize relevancy.

Having said all of that, I'd like to agree that we need some additional information for robots. Could we start simply by having a standard way to set forth the name of the site? An icon for the site would be really nice. It's very frustrating to build a search results list and have no definitive way of describing the site on which the documents reside!

Next, I'd like to have the means to name groups of documents (Press releases, product descriptions, as examples of typical business groupings). We guess at these from directory names, but that's very haphazard. The secondary naming problem is more difficult because there are many-to-many relationships involved.

>In the spirit of /robots.txt, I would like to propose a set of files that
>robots would be encouraged to visit:
>
>/robots.htm - an HTML list of links that robots are encouraged to traverse

What does "encouraged" mean? How is it different from (not (robots.txt))? Why HTML?

>/descript.txt - a text file describing what the site (or directory) is
>                all about

Agreed.

>/keywords.txt - a text file with comma-delimited keywords relevant to the
>                site (or directory)

Disagree greatly. This opens a giant can of worms. Keywords are never enough, often confusing and difficult to maintain.

>/linecard.txt - for commercial sites, a text file with comma-delimited
>                line items (brands) manufactured or stocked

This will drown in details.

>/sitedata.txt - a text file similar to the InterNIC submission forms,
>                with publicly-available site data such as
>
>Organization: organisation name
>Type: commercial/non-profit/educational etc.
>Admin: email of administration
>Webmaster: email of Web administration
>Postal: postal address
>ZIP: ZIP/postcode
>Country:
>Position: Lat/Long
>etc.

Yes to some of this at least. But there's an assumption that there's a one-to-one relationship between the server and these field data. Often there isn't, and no scheme that fails to deal with that is going to succeed.

I'm ready to adapt one of my prototype robots to parse this data for our engine, so here's one hand up for "Yes, I'll implement it." I'm just doing research, but my research does fall in front of our engineers at some point.

By the way, today Verity announced that NetManage and Purveyor have signed up to use our search engine. They join Netscape, Quarterdeck and a few others.

Nick

P.S. I've replied to the new list server address at webcrawler.com, rather than the Nexor address.

From owner-robots Mon Oct 23 16:31:22 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10352; Mon, 23 Oct 95 16:31:22 -0700
Message-Id: <9510232331.AA10338@webcrawler.com>
To: robots
Cc: Andrew Daviel
Subject: Re: Proposed URLs that robots should search
In-Reply-To: Your message of "Mon, 23 Oct 1995 15:26:16 PDT."
Date: Mon, 23 Oct 1995 16:31:17 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

In message, Nick Arnett writes:

> Also, I'm utterly certain that a good relevancy-ranking engine will do a
> better job at assigning categories than will an uncontrolled set of people,
> especially when those people are out to maximize hits, rather than to
> maximize relevancy.

Yeah, isn't that fun... :-/ Maybe we should have a shared spammer blacklist :-)

> [want the name of the site]
> [groups of documents]

> >In the spirit of /robots.txt, I would like to propose a set of files that
> >robots would be encouraged to visit:
> >
> >/robots.htm - an HTML list of links that robots are encouraged to traverse
>
> What does "encouraged" mean? How is it different from (not (robots.txt))?

Because a robot may not want to traverse the whole site, and would prefer to get "sensible" pages.

> Why HTML?

Yeah, bad news.

> [/keywords]
> Disagree greatly. This opens a giant can of worms. Keywords are never
> enough, often confusing and difficult to maintain.

Hmmm... yes, but it's not necessarily worse than straight HTML text, which is the alternative.

> >/linecard.txt - for commercial sites, a text file with comma-delimited
> >                line items (brands) manufactured or stocked
>
> This will drown in details.

Yup.

> >/sitedata.txt - a text file similar to the InterNIC submission forms,
> >                with publicly-available site data such as
>
> Yes to some of this at least. But there's an assumption that there's a
> one-to-one relationship between the server and these field data. Often,
> there isn't and no scheme that fails to deal with that is going to succeed.

Well, I hate to repeat myself, but ALIWEB's /site.idx will give you all of the above (OK, not the icon, but you could add that). It doesn't seem to scale too well to large sites that want to describe every single page or resource on their server, but that's not the goal here...

Note also that nobody is stopping you from pulling just the URLs out of a site.idx and doing your standard robot summarising on that...

-- Martijn

__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
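Since the proposed /sitedata.txt is just colon-separated fields, the parsing Nick volunteers for is small. A sketch in Perl; the field names come from Andrew's list above, the format itself is only a proposal, and this is neither Nick's prototype nor the ALIWEB code:

# Read "Field: value" lines from a fetched sitedata.txt-style file given
# on the command line (or stdin) and print a few of the fields back out.
%site = ();

while (<>) {
    chomp;
    next if /^\s*$/ || /^\s*#/;         # skip blank lines and comments
    if (/^([A-Za-z][\w-]*):\s*(.*)$/) {
        $site{lc $1} = $2;              # e.g. $site{'organization'}
    }
}

print "Organization: ", ($site{'organization'} || "unknown"), "\n";
print "Country:      ", ($site{'country'} || "unknown"), "\n";
print "Webmaster:    ", ($site{'webmaster'} || "unknown"), "\n";

A robot could store these fields alongside its index entry for the site; the one-to-many problem Nick raises (one server hosting several organisations) is not something the file format itself can solve.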
From owner-robots Mon Oct 23 17:06:25 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12787; Mon, 23 Oct 95 17:06:25 -0700
From: kimba@snog.it.com.au (Kim Davies)
Subject: Re: Proposed URLs that robots should search
To: andrew@andrew.triumf.ca (Andrew Daviel)
Date: Tue, 24 Oct 1995 08:03:58 +0800 (WST)
Cc: robots@webcrawler.com
In-Reply-To: from "Andrew Daviel" at Oct 23, 95 09:51:17 pm
X-Mailer: ELM [version 2.4 PL24 PGP2]
Content-Type: text
Content-Length: 1378
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Hi,

> /robots.htm - an HTML list of links that robots are encouraged to traverse

A plain text file would be much better suited, similar to the existing robots.txt - reading in plain text and adding it to the stack of URLs to be processed is sure to be more effective than handing HTML to the robot's reasoning engine to parse.

> [snip]
>
> Organization: organisation name
> Type: commercial/non-profit/educational etc.
> Admin: email of administration
> Webmaster: email of Web administration
> Postal: postal address
> ZIP: ZIP/postcode
> Country:
> Position: Lat/Long
> etc.

How are you going to get a system administrator to implement all these files? How many system administrators do you know who even know about robots.txt? Assuming you want a large chunk of sites to adopt these details, I'd propose it be implemented in the HTTP protocol somehow. An "ADMIN" request, for example, could request the above details from the site, just as /admin on IRC grabs the admin details of a server from the lines in its configuration. If a space was made in a server's configuration or makefile for these details, web administrators are far more likely to implement it.

catchya,

--
Kim Davies      | "Belief is the death of intelligence" -Snog
kimba@it.com.au | http://www.it.com.au/~kimba/

From owner-robots Tue Oct 24 02:48:24 1995
Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14601; Tue, 24 Oct 95 02:48:24 -0700
Date: Tue, 24 Oct 1995 02:48:19 -0700 (PDT)
From: Andrew Daviel
To: robots@webcrawler.com
Subject: Re: Proposed URLs that robots should search
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com

Let's see if I can reply to everyone without getting in a tangle ... :)=

>>I'm trying to build a database of URLs for business...
>I can't quite contain the urge to say, "Isn't everyone?"

Know any good ones? Nothing jumped out at me from CUSI, or Submit-It, etc.

>I have seen few .. business sites that don't offer text-only versions

I seem to keep seeing sites that say "Works best with Netscape 1.2 - get it!"

>Could we start .. standard way to set forth the name of the site?

Having it in the <TITLE> of the document root is quite common, but you get "BloggCo Home Page", "Welcome to BloggCo", and sometimes "Welcome to B L O G G C O". I've tried looking for non-dictionary words with some success.

>>/linecard.txt - for commercial sites, a text file with comma-delimited
>> line items (brands) manufactured or stocked
>This will drown in details.
>Yup.

This was a suggestion from a professional buyer. Sure, collecting these for the whole world would get out of control, but with a small enough scope it might be manageable. The buyers look up brand names in a huge 12-volume book to find distributors or manufacturers.
Finding who stocks Motorcraft in Tipperary can't produce that many records. >Well, I hate to repeat myself, but ALIWEB's /site.idx will give you .. Didn't know about it. Looks like what I was thinking of. I see it has keywords ( >..Disagree greatly. This opens a giant can ... ) > >/robots.htm - an HTML list of links > Why HTML? A simplistic idea. I figured that if existing robots are written to traverse HTML, then giving them an HTML file to start from would be fairly easy. Re. site.idx, is this a fairly open-ended list of fields? I had in mind some fields relevant to larger businesses, like Sales-Email, Info-Email, Tech-Email, Sales-FaxBack, etc. etc. for voice, fax, email where some places may have separate hotlines for hardware, software, licenses, etc. How to handle this for big concerns that have one website and hundreds of regional offices is another problem. I find the Lat/Long format in IAFA a bit strange; I use the "standard" navigational format from navigation books, GPS and Loran, etc. eg. 49D14.7N 123D13.6W, except that as there isn't a degree symbol in ASCII I've used "D", which makes it similar to the NMEA0182 format. The current NMEA0183 standard for navigation equipment would use something like: $LCGLL,4001.74,N,07409.43,W for 40 degrees 1.74 minutes North, 74 degrees 9.43 minutes West. Anyway, it's just bits and easy enough to convert. >How are you going to get a system administrator to implement all these >files? Well, one might assume that a good many HTML authors and Webmasters read comp.infosystems.author.html, or whatever it's called. Or one could just send them all mail ... 50,000 returned mail messages wouldn't make too much of a dent in my disk ... :)= >I'd propose it be implemented into the HTTP protocol .. I'd think it might take a while for everyone to update their servers - say, at least 2 years... Andrew Daviel email: advax@triumf.ca From owner-robots Wed Oct 25 15:49:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05021; Wed, 25 Oct 95 15:49:09 -0700 Date: Thu, 26 Oct 1995 08:48:57 +1000 From: Murray Bent <murrayb@icis.qut.edu.au> Message-Id: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: lycos patents Content-Length: 134 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com To add insult to injury, Lycos are patenting spiders and robots. Anyone care to comment on what Lycos Inc. is up to these days? mj From owner-robots Wed Oct 25 15:56:03 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05454; Wed, 25 Oct 95 15:56:03 -0700 Message-Id: <9510252256.AA05447@webcrawler.com> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Scott Stephenson <scott> Date: Wed, 25 Oct 95 15:55:18 -0700 To: robots Subject: Re: lycos patents References: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, What, Lycos is trying to patent spiders and robots. Got any more information on this?!? How can this be possible, as it is certainly not technology that they developed. ss From owner-robots Wed Oct 25 15:58:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05629; Wed, 25 Oct 95 15:58:36 -0700 Message-Id: <9510252258.AA05583@webcrawler.com> To: robots Cc: Murray Bent <murrayb@icis.qut.edu.au> Subject: Re: lycos patents In-Reply-To: Your message of "Thu, 26 Oct 1995 08:48:57 +1000." 
<199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Date: Wed, 25 Oct 1995 15:58:13 -0700 From: Martijn Koster <mak@beach.webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>, Murray Bent wr ites: > To add insult to injury, Lycos are patenting spiders and robots. Can you elaborate? Where did you hear this, where can we find out more? -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Oct 25 16:09:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06328; Wed, 25 Oct 95 16:09:34 -0700 Date: Wed, 25 Oct 1995 19:08:47 -0400 (EDT) From: Matthew Gray <mkgray@Netgen.COM> X-Sender: mkgray@bokonon To: robots@webcrawler.com Subject: Re: lycos patents In-Reply-To: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au> Message-Id: <Pine.SOL.3.91.951025190537.13893C-100000@bokonon> Organization: net.Genesis Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > To add insult to injury, Lycos are patenting spiders and robots. I assume he is referring to the comment: > We have a patent pending on our spider technology, which makes it > possible for us to both keep up with the exponential growth of the > Internet, and still find the most popular sites. which appears in the FAQ at http://lycos-tmp1.psc.edu/reference/faq.html I hope when they refer to "our spider technology", they are referring to something genuinely unique. If not there are a great many cases for prior art, notably my Wanderer which (while no longer the best) was the first one around in spring of '93. I agree that some comment or clarification from Lycos would be good. Matthew Gray --------------------------------- voice: (617) 577-9800 net.Genesis fax: (617) 577-9850 56 Rogers St. mkgray@netgen.com Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray From owner-robots Wed Oct 25 16:19:27 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06783; Wed, 25 Oct 95 16:19:27 -0700 Date: Thu, 26 Oct 1995 09:16:39 +1000 From: Murray Bent <murrayb@icis.qut.edu.au> Message-Id: <199510252316.JAA10010@wittgenstein.icis.qut.edu.au> To: robots@webcrawler.com Subject: re: Lycos patents Content-Length: 570 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com reference: > From: "Alison O'Balle" <a.oballe@mail.utexas.edu> (Alison O'Balle) > Subject: Catalog of the Internet > To: Multiple recipients of list <web4lib@library.berkeley.edu> [...] > A representative from Lycos made a presentation on campus Thursday morning > in which he said a number of interesting things about the future of the > internet, cataloging,and other topics. [Interesting facts and figures deleted] > They are patenting web spiders and robots. This was glossed over, but the > lycos guy said the patent process was going well for them so far. From owner-robots Wed Oct 25 16:22:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06913; Wed, 25 Oct 95 16:22:14 -0700 Message-Id: <9510252322.AA06904@webcrawler.com> To: fuzzy@cmu.edu Cc: robots Subject: Patents? 
From: Martijn Koster <m.koster@webcrawler.com> Date: Wed, 25 Oct 1995 16:22:18 -0700 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi Fuzzy, I can't see you in the list of subscribers to the robots list, (to which this is cc'ed) so maybe you missed a message regarding patents there. In http://www.lycos.com/reference/faq.html one reads: > We have a patent pending on our spider technology, which makes it > possible for us to both keep up with the exponential growth of the > Internet, and still find the most popular sites. Can you give any further details, either on the technical nature or the patent application? -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Oct 25 16:45:53 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08081; Wed, 25 Oct 95 16:45:53 -0700 Message-Id: <n1397482621.64443@mail.intouchgroup.com> Date: 25 Oct 1995 16:47:13 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: Re: lycos patents To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > What, Lycos is trying to patent spiders and robots. Got any more > information on this?!? How can this be possible, as it is certainly > not technology that they developed. If this is so, then some interested parties should let the Patent Office (or whatever the corresponding US body is called) know this. Particularly given what a terrible job they have been doing judging software and algorithm patents recently, it's a bad idea to just assume that the Patent Office will get it right. --Roger Dearnaley From owner-robots Wed Oct 25 19:19:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14858; Wed, 25 Oct 95 19:19:25 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199510260219.DAA02026@wsinis02.win.tue.nl> Subject: Re: lycos patents To: robots@webcrawler.com Date: Thu, 26 Oct 1995 03:19:08 +0100 (MET) In-Reply-To: <Pine.SOL.3.91.951025190537.13893C-100000@bokonon> from "Matthew Gray" at Oct 25, 95 07:08:47 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1094 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Lycos's patents: >I hope when they refer to "our spider technology", they are referring to >something genuinely unique. If not there are a great many cases for >prior art, notably my Wanderer which (while no longer the best) was the >first one around in spring of '93. Mmm ... I think I first saw JumpStation in January '93. http://js.stir.ac.uk/jsbin/js Simple spiders existed before; I used one in November '92 to fill a proxy cache and fake a live Internet connection for a demo, but it wasn't used for indexing purposes. >I agree that some comment or clarification from Lycos would be good. The author has been seen to post to this list, before it moved. I should think the summaries may be patentable; in fact this thought first occurred to me when I saw his short talk on Lycos at WWW'95 in Darmstadt, in the workshop on Web indexing. But I haven't heard from Lycos since. There may be some unusual tricks in running the spiders as well. If XOR-ing bitmaps can be patented, why can't a bunch of details in spider technology? 
-- Reinier Post reinpost@win.tue.nl From owner-robots Tue Oct 31 06:58:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03475; Tue, 31 Oct 95 06:58:02 -0800 From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella) Message-Id: <9510311459.AA13828@anti.tesi.dsi.unimi.it> Subject: meta tag implementation To: robots@webcrawler.com (Mailing list su robot) Date: Tue, 31 Oct 1995 15:59:26 +0100 (MET) Organization: Dept. of Computer Science, Milan, Italy. X-Mailer: ELM [version 2.4 PL23alpha2] Content-Type: text Content-Length: 772 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi to everybody! I would like to know what do you think about a possible implementation of the meta http-equiv tag on an http-server. I' working in this direction to build a complete system to catalogue www docs but I think that the bigger problems is that there isn't any http-server that handle this meta tag (maybe only the WN server) Thanx Davide +--------------------------------------------------+ |Davide Musella | |e-Mail musella@dsi.unimi.it Dept. of | |Phone number +39.(0)2.4390821 Computer Science | |Address: Via Montevideo, 25 University of | | 20144 Milano ITALY Milan, Italy | |http://www.dsi.unimi.it/Users/Tesi/musella | +--------------------------------------------------+ From owner-robots Thu Nov 2 09:30:07 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15340; Thu, 2 Nov 95 09:30:07 -0800 Message-Id: <YkaDzD200YUxASM0sm@andrew.cmu.edu> Date: Thu, 2 Nov 1995 12:28:47 -0500 (EST) From: "Jeffrey C. Chen" <jc7k+@andrew.cmu.edu> To: robots@webcrawler.com (Mailing list su robot) Subject: Re: meta tag implementation Cc: In-Reply-To: <9510311459.AA13828@anti.tesi.dsi.unimi.it> References: <9510311459.AA13828@anti.tesi.dsi.unimi.it> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi everybody! I am a MS student at CMU. I am working on a software tool for collecting full system traces on the Alpha. The tool will also gather statistics by using the on-chip hardware event counters. I am interested in using a web server and a client as my test workload. It would be interesting to identify performance bottlenecks in a web server as it runs over a period of time servicing requests. Does anyone have a simple robot that I can use to exercise a web server? Thanks, Jeff From owner-robots Thu Nov 2 10:40:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20410; Thu, 2 Nov 95 10:40:02 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199511021835.UAA17200@krisse.www.fi> Subject: Simple load robot To: robots@webcrawler.com Date: Thu, 2 Nov 1995 20:35:19 +0200 (EET) In-Reply-To: <YkaDzD200YUxASM0sm@andrew.cmu.edu> from "Jeffrey C. Chen" at Nov 2, 95 12:28:47 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 412 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Does anyone have a simple robot that I can use to exercise a web > server? Would this do the job, maybe run multiple times in parallel? (Please replace the url's..) 
#!/bin/sh
while true
do
  for i in \
    http://www.fi/ \
    http://www.fi/search.html \
    http://www.fi/index/ \
    http://www.fi/~jaakko/ \
    http://www.fi/sss/ \
    http://www.fi/www/ \
    http://www.fi/links.html
  do
    lynx -source $i > /dev/null
  done
done
From owner-robots Mon Nov 6 22:44:28 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15194; Mon, 6 Nov 95 22:44:28 -0800 Date: Tue, 7 Nov 1995 00:43:47 -0600 Message-Id: <9511070643.AA120822@nic.smsu.edu> X-Sender: kdf274s@nic.smsu.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Keith Fischer <kfischer@mail.win.org> Subject: Preliminary robot.faq (Please Send Questions or Comments) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Archive-name: robot.faq Posting-Frequency: variable Last-modified: Nov. 6, 1995 This article is a description and primer for World Wide Web robots and spiders. The following topics are addressed: 1) DEFINING ROBOTS AND SPIDERS 1.1) What is a ROBOT? 1.2) What is a SPIDER? 1.3) What is a search engine? 1.4) How many ROBOTS are there? 1.5) What can be achieved by using ROBOTS? 1.6) What harm can a ROBOT do? 2) THE THEORY BEHIND A ROBOT 2.1) Who can write one? 2.2) How is one written? 2.3) What is the Proposed Standard for Robot Exclusion? 2.4) What are the potential problems? 2.5) How do I use proper Etiquette? 3) THE REALITY OF THE WEB 3.1) Can I visit the entire web? 1) DEFINING ROBOTS AND SPIDERS 1.1) What is a ROBOT? A Robot is a program that traverses the World Wide Web, gathering some sort of information from each site it visits. This journey is accomplished by visiting a web page and then recursively visiting all or some of its linked pages. 1.2) What is a SPIDER? Spiders are synonymous with Robots, as are Wanderers. These names, however, have some misleading implications. For instance, many people think that a spider or wanderer leaves the home site to work its magic, when in reality it never leaves. The Spider rather just acts as a sophisticated web browser, automatically retrieving documents and/or images until it is told to stop. I prefer the term Robot and will continue using it throughout this document. 1.3) What is a search engine? A search engine is not a robot. However, some search engines rely heavily on robots. A search engine is nothing more than a glorified index. It searches the index, which resides on the host's computer, and returns the result. A common misconception is that a search engine like Lycos or Yahoo actively searches the web upon request. This is not true; all activity by the robot is done ahead of time. 1.4) How many ROBOTS are there? There are about 30 in existence. Martijn Koster maintains a list at: http://info.webcrawler.com/mak/projects/robots/active.html 1.5) What can be achieved by using ROBOTS? The possibilities are endless. Once you visit a page, you have free run of the HTML. You can retrieve files or the HTML itself. Most robots retrieve pieces of the HTML document. This is then used to build an index, which is later used by a search engine. 1.6) What harm can a ROBOT do? The robot can do no harm per se, but it can anger a lot of people. If your robot acts irresponsibly it can fall into a black hole, a link that dynamically makes new links, or, worse, it can get stuck in a loop. Both of these actions are certain to wreak havoc on a server. The goal in web traversal is to never be on one server for too long.
The solution to the problem of bad HTML, or rather your robot's handling of bad HTML, is to stay online. Simply put, never leave your robot unattended. 2) THE THEORY BEHIND A ROBOT 2.1) Who can write one? Anyone can write a robot provided that they have web access. But, a word to the wise, tell your system administrators because they WILL feel the system drain and they WILL hear many complaints concerning your activities. But, just because the possibility exists doesn't mean you should take on this task half-cocked. Before even thinking about coding a robot: do your research, have an intended goal, and read the following: The Proposed Standard for Robot Exclusion located at: http://info.webcrawler.com/mak/projects/robots/norobots.html The Guidelines for Robot Writers located at: http://info.webcrawler.com/mak/projects/robots/guidelines.html Ethical Web Agents located at: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html 2.2) How is one written? A Robot is nothing more than an executable program. It can be in the form of a script or a binary file. It makes a connection to a web server and requests a document be sent, much the same way a web browser works. The difference is in the automation provided by the robot. 2.3) What is the Proposed Standard for Robot Exclusion? Martijn Koster explains the reason for a robot exclusion standard with the following: "In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)." The form the robot exclusion standard takes is given in more detail at: The Proposed Standard for Robot Exclusion located at: http://info.webcrawler.com/mak/projects/robots/norobots.html 2.4) What are the potential problems? The potential problems can't be listed. The list would be far too big and unpredictable. The very nature of the World Wide Web is diversity, and this very diversity makes robot writing both important and increasingly difficult. There is no one right HTML. Documents can be written in many ways and in many formats. My suggestion is to get the spec sheet for HTML and practice, practice, practice, making your robot robust. 2.5) How do I use proper Etiquette? Etiquette is a very touchy subject. Many people stand in opposition to your newly written robot. They don't like the idea that their server will be overrun with seemingly pointless requests. The solution is simple: first, give them the results. Or rather, put up for public consumption the results of your searches. This is the concept of giving back to the community that provided for you. Not to mention, if a person can use your results, the robot's requests may seem to have more merit. Another form of etiquette is slow requests. You've heard the term rapid-fire. This means quick requests (a request every second or so); basically put, this brings a server to its figurative knees. The solution is to limit your requests to any given server to one every minute (some say one every five minutes).
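A minimal sketch of that one-request-per-minute rule, in Perl: the host-extraction regex and the example URLs are illustrative only, and retrieval is assumed to go through libwww-perl's LWP::Simple (substitute whatever fetch routine your robot already uses).

#!/usr/bin/perl
# Sketch: per-host politeness delay for a robot.
# Remembers when each host was last contacted and sleeps so that no
# host sees more than one request per $MIN_DELAY seconds.
use strict;
use LWP::Simple ();            # assumes libwww-perl is installed

my $MIN_DELAY = 60;            # seconds between requests to one server
my %last_hit;                  # host -> time() of last request

sub polite_get {
    my ($url) = @_;
    my ($host) = $url =~ m!^http://([^/:]+)!i;
    return undef unless defined $host;

    if (defined $last_hit{$host}) {
        my $wait = $MIN_DELAY - (time() - $last_hit{$host});
        sleep($wait) if $wait > 0;
    }
    $last_hit{$host} = time();
    return LWP::Simple::get($url);   # returns the document body, or undef
}

# Example: these two fetches end up at least a minute apart.
polite_get("http://www.example.com/");
polite_get("http://www.example.com/page2.html");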
More information about etiquette is located at: The Guidelines for Robot Writers located at: http://info.webcrawler.com/mak/projects/robots/guidelines.html Ethical Web Agents located at: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichma nn.html 3) THE REALITY OF THE WEB 3.1) Can I visit the entire web? No. So don't try. Gauge your goals in reasonable amounts. ______________________________________________________________ I disclaim everything. The contents of this article might be totally inaccurate, inappropriate, misguided, or otherwise perverse - except for my name (you can probably trust me on that). Copyright (c) 1995 by Keith D. Fischer, all rights reserved. This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may not be distributed for financial gain. This FAQ may not be included in commercial collections or compilations without express permission from the author. ____________________________________________________________ Keith D. Fischer - kfischer@mail.win.org or kfischer@science.smsu.edu Keith D. Fischer kfischer@mail.win.org kdf274s@nic.smsu.edu "Misery loves company" By Anonymous "Today is a good day to die." By Crazy Horse "To be or not to be ..." Hamlet -- William Shakespeare From owner-robots Tue Nov 7 02:37:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27042; Tue, 7 Nov 95 02:37:01 -0800 Date: Tue, 7 Nov 95 10:32:55 GMT Message-Id: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <S.Nisbet@DOC.MMU.AC.UK> Subject: Re: meta tag implementation Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 12:28 PM 11/2/95 -0500, you wrote: >Hi everybody! > >I am a MS student at CMU. I am working on a software tool for >collecting full system traces on the Alpha. The tool will also gather >statistics by using the on-chip hardware event counters. I am >interested in using a web server and a client as my test workload. It >would be interesting to identify performance bottlenecks in a web server >as it runs over a period of time servicing requests. Does anyone have a >simple robot that I can use to exercise a web server? > >Thanks, >Jeff > > Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor robots that have nothing to do with PERL, could you let me know. I tride asking the same question you asked, but got no replies. From owner-robots Tue Nov 7 04:05:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02122; Tue, 7 Nov 95 04:05:00 -0800 From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella) Message-Id: <9511071205.AA13152@anti.tesi.dsi.unimi.it> Subject: Re: meta tag implementation To: robots@webcrawler.com Date: Tue, 7 Nov 1995 13:05:21 +0100 (MET) In-Reply-To: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> from "Steve Nisbet" at Nov 7, 95 10:32:55 am Organization: Dept. of Computer Science, Milan, Italy. X-Mailer: ELM [version 2.4 PL23alpha2] Content-Type: text Content-Length: 251 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor > robots that have nothing to do with PERL, could you let me know. 
I tride > asking the same question you asked, but got no replies. No replies until now....sigh!!! Davide From owner-robots Tue Nov 7 06:17:49 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08070; Tue, 7 Nov 95 06:17:49 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199511071417.OAA06656@wsinis11.win.tue.nl> Subject: Re: meta tag implementation To: robots@webcrawler.com Date: Tue, 7 Nov 1995 15:17:26 +0100 (MET) In-Reply-To: <9511071205.AA13152@anti.tesi.dsi.unimi.it> from "Davide Musella" at Nov 7, 95 01:05:21 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Davide Musella) write: > >> Hi there Jef, know this sounds cheeky, but if you get any useful repliesfor >> robots that have nothing to do with PERL, could you let me know. I tride >> asking the same question you asked, but got no replies. > >No replies until now....sigh!!! You might use Lynx (2.4.FM); it has a -traverse switch now. Experimental, and I don't think it supports the RES (Robot Exclusion Standard) yet. We have a simple robot written in C, but it doesn't follow the RES either. What's your resaon to stay away from Perl? >Davide -- Reinier Post reinpost@win.tue.nl From owner-robots Tue Nov 7 06:54:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09686; Tue, 7 Nov 95 06:54:36 -0800 Date: Tue, 7 Nov 95 14:41:39 GMT Message-Id: <9511071441.AA11827@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <S.Nisbet@DOC.MMU.AC.UK> Subject: Re: meta tag implementation Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi Davide, thanks very much for the info. I stay away from Perl here because it was badly set up and I have to reinstall it. SO its more of a grudge :) Other than that I think its a good thing. I will do as you suggest. All the best in you endeavours. From owner-robots Tue Nov 7 07:11:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10944; Tue, 7 Nov 95 07:11:12 -0800 Message-Id: <m0tCpg2-0003LMC@giant.mindlink.net> Date: Tue, 7 Nov 95 07:11 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some >sort of information from each site it visits. This journey is accomplished >by visiting a web page and then recursively visiting all or some of it's >linked pages. True but misleading; there are much better strategies for covering the web than this kind of direct recursion. 
Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Wed Nov 8 01:30:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21486; Wed, 8 Nov 95 01:30:52 -0800 Date: Wed, 8 Nov 1995 03:30:45 -0600 Message-Id: <9511080930.AA35454@nic.smsu.edu> X-Sender: kdf274s@nic.smsu.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Keith Fischer <kfischer@mail.win.org> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>1.1) What is a ROBOT? >> >> A Robot is a program that traverses the World Wide Web, gathering some >>sort of information from each site it visits. This journey is accomplished >>by visiting a web page and then recursively visiting all or some of it's >>linked pages. > >True but misleading; there are much better strategies for covering >the web than this kind of direct recursion. > > >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) 1.1) What is a ROBOT? A Robot is a program that traverses the World Wide Web, gathering some sort of information from each site it visits. This journey is accomplished by visiting a web page and then visiting some or all of its linked pages. The method one follows, whether it's recursive or some sort of fuzzy logic, determines the effectiveness of the search. How is the above? If you like, this will be the new 1.1. Also, could you please elaborate on better strategies? (I'm assuming you are talking about the fuzzy logic that Yahoo and Lycos use.) Keith kfischer@mail.win.org kdf274s@nic.smsu.edu Keith D. Fischer kfischer@mail.win.org kdf274s@nic.smsu.edu "Misery loves company" By Anonymous "Today is a good day to die." By Crazy Horse "To be or not to be ..." Hamlet -- William Shakespeare From owner-robots Wed Nov 8 05:45:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03365; Wed, 8 Nov 95 05:45:00 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199511081344.NAA17571@wsinis02.win.tue.nl> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) To: robots@webcrawler.com Date: Wed, 8 Nov 1995 14:44:43 +0100 (MET) In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Keith Fischer) write: >1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some >sort of information from each site it visits. This journey is accomplished >by visiting a web page and then visiting some or all of its linked pages. >The method one follows whether it's recursive or some sort of fuzzy logic >determines the effectiveness of the search. We have a robot which does 'fuzzy' searching, for which your description is appropriate. But in general, the document collection process (= robot) and the search process executed in response to a user query (on the resulting collection) are completely separate. Besides, searching the contents of document collections is not the only purpose of robots; robots can be used to check the validity of hyperlinks, for example. Your description is accurate, as applied to the robot process itself, but it may be confusing.
A minor quibble: robots must use some heuristics in determining which links to follow. All robots are 'recursive', and most of them cut off the process in a more or less arbitrary way, which could be called 'fuzzy'. There is no either/or decision here. -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From owner-robots Wed Nov 8 08:38:48 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13242; Wed, 8 Nov 95 08:38:48 -0800 Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) From: YUWONO BUDI <yuwono@uxmail.ust.hk> To: robots@webcrawler.com Date: Thu, 9 Nov 1995 00:37:33 +0800 (HKT) In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 1603 Message-Id: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > >>1.1) What is a ROBOT? > >> > >> A Robot is a program that traverses the World Wide Web, gathering some > >>sort of information from each site it visits. This journey is accomplished > >>by visiting a web page and then recursively visiting all or some of it's > >>linked pages. > > > >True but misleading; there are much better strategies for covering > >the web than this kind of direct recursion. > > > > > >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) > > > 1.1) What is a ROBOT? > > A Robot is a program that traverses the World Wide Web, gathering some > sort of information from each site it visits. This journey is accomplished > by visiting a web page and then visiting some or all of its linked pages. > The method one follows whether it's recursive or some sort of fuzzy logic > determines the effectiveness of the search. I am not sure I understand what the original comment is getting at. But it seems to me that the word "recursive" is somewhat overloaded. To those with a CS background, a "recursive" visit implies a "depth first" tree traversal. Most robot implementations that I'm aware of use "breadth first" traversals. Among the reasons is that you would want to be able to limit the depth your robot digs into. Whether depth limitation is more useful than breadth limitation is another issue, IMHO. One thing is for sure, stopping the robot after it reaches a certain depth is much simpler than deciding which links to follow/ignore. I don't know what would be the more general term in place of "recursively," "sequentially" perhaps? -Budi.
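To illustrate the breadth-first, depth-limited traversal Budi describes, here is a rough Perl sketch. It is nobody's production robot: the link extraction is a crude regex, the start URL is hypothetical, fetching is assumed to go through libwww-perl's LWP::Simple, and a real robot would also have to honour /robots.txt and pace its requests.

#!/usr/bin/perl
# Sketch: breadth-first traversal with a depth limit, as opposed to
# literal depth-first recursion.
use strict;
use LWP::Simple ();

my $MAX_DEPTH = 3;
my %seen;                                      # URLs already queued
my @queue = (["http://www.example.com/", 0]);  # [url, depth] pairs
$seen{$queue[0][0]} = 1;

while (@queue) {
    my ($url, $depth) = @{ shift @queue };     # FIFO queue => breadth first
    my $html = LWP::Simple::get($url);
    next unless defined $html;

    print "visited $url (depth $depth)\n";
    next if $depth >= $MAX_DEPTH;              # stop digging deeper

    # Naive absolute-link extraction, for illustration only.
    while ($html =~ m!href\s*=\s*"(http://[^"]+)"!gi) {
        my $link = $1;
        next if $seen{$link}++;
        push @queue, [$link, $depth + 1];
    }
}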
From owner-robots Thu Nov 9 08:53:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12795; Thu, 9 Nov 95 08:53:37 -0800 Resent-Message-Id: <9511091653.AA12783@webcrawler.com> Resent-From: mak@beach.webcrawler.com Resent-To: robots Resent-Date: Thu, 9 Nov 1995 16:53:32 Date: Wed, 8 Nov 95 10:08:51 -0800 From: <owner-robots> Message-Id: <9511081808.AA19321@webcrawler.com> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >From tbray@opentext.com Wed Nov 8 10:08:46 1995 Return-Path: <tbray@opentext.com> Received: from giant.mindlink.net by webcrawler.com (NX5.67f2/NX3.0M) id AA19311; Wed, 8 Nov 95 10:08:46 -0800 Received: from Default by giant.mindlink.net with smtp (Smail3.1.28.1 #5) id m0tDEv9-000343C; Wed, 8 Nov 95 10:08 PST Message-Id: <m0tDEv9-000343C@giant.mindlink.net> Date: Wed, 8 Nov 95 10:08 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Preliminary robot.faq (Please Send Questions or Comments) Cc: robots@webcrawler.com We're wasting too much time on this. All I meant to say was that the original language strongly suggested that robots use the following algorithm:

sub RetrievePage(url)
    text = HttpGet(url)
    foreach sub_url in text
        RetrievePage(sub_url)

Whereas lots of robots don't. Obviously it is recursive in that you do pull urls out of pages and eventually follow them, but it doesn't feel recursive. The 'fuzzy' stuff is a complete red herring - except for the special case of 'fuzzy logic' (not what's being done here) the word 'fuzzy' in the information retrieval context is a marketing term without semantic content. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Fri Nov 17 09:12:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21835; Fri, 17 Nov 95 09:12:34 -0800 Date: Fri, 17 Nov 1995 09:24:00 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Bad robot: WebHopper bounch! Owner: peter@cartes.hut.fi In-Reply-To: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk> Message-Id: <Pine.LNX.3.91.951117085518.25864A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was checking my stats and this showed up with 1838 hits on the 9th of November. It tried to completely explore an infinite virtual space in one run, with an average time between hits of 4.3 seconds. Its parser has to be broken because it was exploring a space defined by a ?cookie=number (used for shopping basket session tracking), but failing to preserve the '=' (generating 'cookienumber' instead of 'cookie=number') between calls and causing a new cookie to be assigned to every request. It went into an infinite loop over the same five base pages as it tried to do a depth-first search of the site - for a little over two hours. Argh. Anyone else hit by this rather broken robot?
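Two cheap guards a robot writer could add against this kind of runaway are sketched below in Perl. Both are heuristics: the per-host budget of 500 requests is an arbitrary number, and collapsing URLs that differ only in their query string will also skip some legitimate pages.

#!/usr/bin/perl
# Sketch: blunt defenses against "infinite virtual spaces" such as
# session IDs embedded in URLs or endlessly generated links.
use strict;

my $MAX_PER_HOST = 500;   # hard ceiling on fetches from one host per run
my %host_count;           # host -> number of fetches so far
my %seen_page;            # URL with query string removed -> already fetched?

sub should_fetch {
    my ($url) = @_;
    my ($host) = $url =~ m!^http://([^/:]+)!i;
    return 0 unless defined $host;

    # Guard 1: ignore query strings when deciding whether we have already
    # seen a page, so ?cookie=12345 variants collapse to one entry.
    (my $canon = $url) =~ s/\?.*$//;
    return 0 if $seen_page{$canon}++;

    # Guard 2: give up on a host after a fixed budget of fetches.
    return 0 if ++$host_count{$host} > $MAX_PER_HOST;

    return 1;
}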
-- Benjamin Franz, Webmaster, Net Images From owner-robots Thu Nov 23 12:44:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03420; Thu, 23 Nov 95 12:44:36 -0800 Date: Thu, 23 Nov 1995 12:42:51 -0800 (PST) From: Andrew Daviel <andrew@andrew.triumf.ca> To: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk Cc: Daniel Terrer <Daniel.Terrer@sophia.inria.fr> Subject: wwwbot.pl problem Message-Id: <Pine.LNX.3.91.951123111508.16547A-100000@andrew.triumf.ca> Mime-Version: 1.0 Content-Type: text/PLAIN; charset="US-ASCII" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com (I send a request to libwww-perl-request just before my last message to the list, so I might not be on yet. Please Cc any replies to me.) I was having trouble with wwwbot from the libwww-perl-0.40 library. I continued to work on the problem after posting to the perl list. It seems that botcache is not well enough defined, so that a site with User-Agent: * Disallow / would kill subsequent GETs to a site that was previously in the cache. I have made a patch which adds the address to the cache, and fixes a couple of other odd cases, such as where the address is not fully defined working within a domain, and there are host names such as ypsun, ypsun2 etc. which would become confused with the path count. See ftp://andrew.triumf.ca/pub/wwwbot.patch Andrew Daviel email: advax@triumf.ca TRIUMF voice: 604-222-7376 4004 Wesbrook Mall fax: 604-222-7307 Vancouver BC http://andrew.triumf.ca/~andrew Canada V6T 2A3 49D14.7N 123D13.6W From owner-robots Thu Nov 23 23:45:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07952; Thu, 23 Nov 95 23:45:39 -0800 Date: Fri, 24 Nov 95 16:45:28 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511240745.AA03918@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: yet another robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com For all its worth, we have implemented a robot in order to (surprise surprise) gather web resources to build a (distributed) search database. The robot is called Yobot, and http://rodem.slab.ntt.jp:8080/home/robot-e.html tells you who to complain to if Yobot misbehaves. Thanks, PF From owner-robots Fri Nov 24 13:51:35 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17245; Fri, 24 Nov 95 13:51:35 -0800 Date: Sat, 25 Nov 1995 07:53:43 +1000 (EST) From: David Eagles <eaglesd@planets.com.au> To: robots@webcrawler.com Subject: yet another robot, volume 2 In-Reply-To: <9511240745.AA03918@cactus.slab.ntt.jp> Message-Id: <Pine.LNX.3.91.951125075027.1078A-100000@earth.planets.com.au> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com We, too, have developed a robot to provide Web resource search facilities to Australia and the South Pacific. The crawler engine will only follow links to designated domains, and the search engine allows individual selection of the search domain for queries. Named after a famous Australian spider, the FunnelWeb, the service is available at http://funnelweb.net.au Enjoy. 
Regards, David Eagles From owner-robots Fri Nov 24 15:20:08 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22501; Fri, 24 Nov 95 15:20:08 -0800 Date: Sat, 25 Nov 95 09:29:44 +1100 (EST) Message-Id: <v01530506acdc92c54ddb@[192.190.215.44]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: yet another robot, volume 2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >We, too, have developed a robot to provide Web resource search facilities >to Australia and the South Pacific. The crawler engine will only follow >links to designated domains, and the search engine allows individual >selection of the search domain for queries. > >Named after a famous Australian spider, the FunnelWeb, the service is >available at http://funnelweb.net.au > >Enjoy. >David we tried it out the other day.We lodged AAA and Tourist Radio(2 Sites) Great VISION Keith Ashton >Regards, >David Eagles AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa From owner-robots Fri Nov 24 16:13:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25777; Fri, 24 Nov 95 16:13:17 -0800 Date: Sat, 25 Nov 95 11:13:05 +1100 (EST) Message-Id: <v01530507acdcab771ad7@[192.190.215.44]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: yet another robot, volume 2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>We, too, have developed a robot to provide Web resource search facilities >>to Australia and the South Pacific. The crawler engine will only follow >>links to designated domains, and the search engine allows individual >>selection of the search domain for queries. >> >>Named after a famous Australian spider, the FunnelWeb, the service is >>available at http://funnelweb.net.au >> > >>Enjoy. >>David we tried it out the other day.We lodged AAA and Tourist Radio(2 Sites) >Great VISION > >Keith Ashton > > > ____________________________________________________________________________ ___________ David, We just got an Email back from you but there was no content Keith ____________________________________________________________________________ ____________ > > > >>Regards, >>David Eagles > >AAA Australia Announce Archive / Tourist Radio >Home of the Australian Cool Site of the Day ! >http://www.com.au/aaa AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa From owner-robots Sat Nov 25 06:21:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05034; Sat, 25 Nov 95 06:21:14 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199511251419.XAA02550@ktmp.kaist.ac.kr> Subject: Q: Cooperation of robots To: robots@webcrawler.com (Robot Mailing list) Date: Sat, 25 Nov 1995 23:19:12 +0900 (KST) X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 378 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I am newbie to this mailing-list. If I do some mistake, plz reply to me. I have one question : Is there some effort for robots to do gathering informations in cooperative work style? 
That is, Sharing informations gathered by the other kind of robots with some communication between robots like the that of intelligent agents in Intelligent Agent area. - Byung-Gyu Chang From owner-robots Sat Nov 25 10:19:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15907; Sat, 25 Nov 95 10:19:10 -0800 Date: Sat, 25 Nov 1995 13:19:03 -0500 Message-Id: <199511251819.NAA27702@moe.infi.net> X-Sender: magi@infi.net (Unverified) X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Michael Goldberg <magi@infi.net> Subject: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am developing sites for numerous large associations. I want to provide a service to the members by which they can choose from selected topics..say mortgage interest rates..and a robot goes out and searches selected sites and provides either by e-mail a formatted "newsletter" or returns a "newsletter" in HTML. Any suggestions? <<< Media Access Group>>> Local Access to electronic marketing Triad member- Network Hampton Roads 2101 Parks Ave. Suite 606 Virginia Beach, VA 23451 804-422-4481 From owner-robots Sat Nov 25 15:22:58 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01362; Sat, 25 Nov 95 15:22:58 -0800 Date: Sun, 26 Nov 1995 09:24:39 +1000 (EST) From: David Eagles <eaglesd@planets.com.au> To: robots@webcrawler.com Subject: Re: Q: Cooperation of robots In-Reply-To: <199511251419.XAA02550@ktmp.kaist.ac.kr> Message-Id: <Pine.LNX.3.91.951126091817.2816A-100000@earth.planets.com.au> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Sat, 25 Nov 1995, Byung-Gyu Chang wrote: > Hi, I am newbie to this mailing-list. If I do > some mistake, plz reply to me. > > > I have one question : > > Is there some effort for robots to do gathering > informations in cooperative work style? > That is, Sharing informations gathered by the other kind of > robots with some communication between robots like > the that of intelligent agents in Intelligent Agent area. > > - Byung-Gyu Chang > I'm not sure if there is any official cooperation going on, but I'm currently enhancing my web crawler (http://funnelweb.net.au) to include support for this type of operation. Basically, here's what I'm planning: The current web crawler, based in Australia, limits its searching and collection to countries in the South Pacific. I'm planning to enhance this such that any URLs found (during the crawling process) for non-South Pacific countries will be forwarded to the web crawler responsible for that domain (as determined by a simple config file - maybe an automated registration process in the future). Similarly, the search engine will allow ANY individual country(s) to be searched (as is the case now for only South Pacific countries), and will fork the request off to the appropriate engine. Is this the type of info you were after?
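A rough Perl sketch of the hand-off David describes, routing each discovered URL by its top-level domain. The table, the peer addresses and the example URLs are invented for illustration; in practice they would come from his config file or a registration process.

#!/usr/bin/perl
# Sketch: route URLs found during a crawl either to the local queue or
# to whichever crawler is responsible for that top-level domain.
use strict;

my %responsible = (
    'au' => 'local',
    'nz' => 'local',
    'fi' => 'robot-admin@www.fi',        # hypothetical peer contact
    '*'  => 'robot-admin@example.org',   # hypothetical default peer
);

sub route_url {
    my ($url) = @_;
    my ($host) = $url =~ m!^http://([^/:]+)!i or return;
    my ($tld)  = $host =~ /\.([a-z]+)\.?$/i;
    my $who    = $responsible{lc($tld || '')} || $responsible{'*'};

    if ($who eq 'local') {
        print "crawl locally: $url\n";    # feed our own queue
    } else {
        print "forward $url to $who\n";   # mail or spool for the peer crawler
    }
}

route_url("http://www.com.au/aaa");       # stays local
route_url("http://www.fi/");              # handed to the Finnish crawler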
Regards, David Eagles From owner-robots Sun Nov 26 09:10:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17874; Sun, 26 Nov 95 09:10:54 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130502acde4cefcae6@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 26 Nov 1995 09:10:32 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Q: Cooperation of robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 11:19 PM 11/25/95, Byung-Gyu Chang wrote: >Is there some effort for robots to do gathering >informations in cooperative work style? >That is, Sharing informations gathered by the other kind of >robots with some communication between robots like >the that of intelligent agents in Intelligent Agent area. There are various efforts, but the most significant one is probably the Harvest project at the University of Colorado. I can't remember their URL at the moment, but I know we have a link to it from: http://www.verity.com/customers.html Nick From owner-robots Sun Nov 26 16:57:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11355; Sun, 26 Nov 95 16:57:32 -0800 Date: Mon, 27 Nov 95 09:57:15 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511270057.AA12772@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > A am developing sites for numerous large associations. I want to provide a > service to the members by which they can choose from selected topics..say > mortgage interest rates..and a robot goes out and searches selected sites > and provides either by e-mail a formated "newsletter" or return a > "newsletter" in html. > > Any suggestions? A number of people are working towards the ability to search selected sites, though I haven't heard of anyone trying to put the result in a newletter format. Harvest allows the user to custom build his own database, which is then locally accessed at search time. (http://harvest.cs.colorado.edu/) MetaCrawler, Silk, IBMinfoMarket, and no doubt many others query multiple pre-configured search databases at search time. (http://metacrawler.cs.washington.edu:8080/home.html http://services.bunyip.com:8000/products/silk/silk.html http://www.infomkt.ibm.com/about.htm) I'm looking forward to the day when two of these "meta" search services point to each other and create an infinite search loop.... PF ps. If you're going to the WWW conference in Boston, I'll be chairing a BOF on distributed searching. Please see http://rodem.slab.ntt.jp:8080/paulStuff/ From owner-robots Sun Nov 26 18:28:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16185; Sun, 26 Nov 95 18:28:42 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130509acded1bfff27@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 26 Nov 1995 18:28:33 -0800 To: robots@webcrawler.com, owner-robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: BOUNCE robots: Admin request Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:08 AM 11/8/95, <owner-robots@webcrawler.com> wrote: >Whereas lots of robots don't. Obviously it is recursive in that you >do pull urls out of pages and eventually follow them, but it doesn't >feel recursive. 
The 'fuzzy' stuff is a complete red herring - except >for the special case of 'fuzzy logic' (not what's being done here) the >word 'fuzzy' in the information retrieval context is a marketing term >without semantic content. Minor point -- let's not assume that no one on the list is using fuzzy logic to decide which links to follow. After all, some of us have search engines that use fuzzy logic operators. I'm fascinated by using evidential reasoning to build agents that explore. Nick From owner-robots Sun Nov 26 19:43:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20091; Sun, 26 Nov 95 19:43:06 -0800 Date: Mon, 27 Nov 95 12:42:56 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9511270342.AA14195@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Q: Cooperation of robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is there some effort for robots to do gathering > informations in cooperative work style? > That is, Sharing informations gathered by the other kind of > robots with some communication between robots like > the that of intelligent agents in Intelligent Agent area. > I haven't seen anything, but I only pay so much attention to this list. I know that one problem is that many robots run to support profit- (or planned profit-) based services, so don't want to share their info. What do you see as the advantage to sharing information? It is offhand not clear to me that much is to be gained by it. For instance, given that each robot-running organization usually has their own way of processing the resources they find, then they have to go out and retrieve the resources in any event. Thus, not much may be saved by sharing information.... PF From owner-robots Mon Nov 27 01:14:04 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06867; Mon, 27 Nov 95 01:14:04 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199511270913.LAA29177@krisse.www.fi> Subject: Re: Q: Cooperation of robots To: robots@webcrawler.com Date: Mon, 27 Nov 1995 11:13:46 +0200 (EET) In-Reply-To: <9511270342.AA14195@cactus.slab.ntt.jp> from "Paul Francis" at Nov 27, 95 12:42:56 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 1744 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com francis@cactus.slab.ntt.jp (Paul Francis): > I haven't seen anything, but I only pay so much > attention to this list. I know that one problem is > that many robots run to support profit- (or planned > profit-) based services, so don't want to share their > info. We at http://www.fi/ have a good coverage of the www-resources of Finland. You are right, we are clearly not willing to share our information base with other search engines in Finland (there is another one). On the other hand, it might be possible to share the database with some or all of the international search engines as a promotion. We would not lose any markets here in finland, 'cause always our site would be the fastest way for Finnish customers to perform searching. > What do you see as the advantage to sharing information? > It is offhand not clear to me that much is to be gained > by it. For instance, given that each robot-running > organization usually has their own way of processing > the resources they find, then they have to go out and > retrieve the resources in any event. 
Thus, not much > may be saved by sharing information.... If the two co-operating parties agree of common set of information to stre about each individual page, both could modify their robots to comply with this. Possibly even just a compressed .tar.gz archive of the pages could do. Anyway it saves bandwidth in international connections and annoys the servers less. I do not believe that our current database would suit anybody elses needs, but maybe the next time we collect all the pages we could fetch all the information necessary to someone else too. Feel free to contact me at Jaakko.Hyvatti@www.fi if you are interested. We cover almost all of Finland. From owner-robots Mon Nov 27 08:27:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04759; Mon, 27 Nov 95 08:27:10 -0800 Message-Id: <9511271626.AA04714@webcrawler.com> Original-Received: from research by ns Pp-Warning: Illegal Received field on preceding line X-Mailer: exmh version 1.6.4 10/10/95 From: Fred Douglis <douglis@research.att.com> To: Andrew Daviel <andrew@andrew.triumf.ca> Cc: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk, Daniel Terrer <Daniel.Terrer@sophia.inria.fr> Subject: Re: wwwbot.pl problem In-Reply-To: Your message of "Thu, 23 Nov 1995 12:42:51 PST." <Pine.LNX.3.91.951123111508.16547A-100000@andrew.triumf.ca> X-Face: *lvs`^NFil<?gI%c@~W[5*dWZ5;4-8#&S`1t,Ey&5R5z7nLBE)TKc?44|-sPxDy<i[jb[s Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com XQu4i;It_f~o>3, KN{Fk?$+k063Tiv(F~;?02MoaTUP/:+;eeHIOHWf_Ob-s*iTugCX^)YVicQB<1: {??RaMPnky^1nA7'2!$REBJNc=skHq:poE<ObzL*~*M-w$9Vxx`Lv>ZcirD$]R#_f8~qT,O[Vc)x, G bKn>8, <X)r, rKv|oipe=j/;e0%f/j:#/bRy('D]"f|zB3 X-Uri: http://www.research.att.com/orgs/ssr/people/douglis Date: Mon, 27 Nov 1995 11:15:47 -0500 Sender: douglis@pelican.research.att.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" I reported this bug a few months ago and I thought a patch had been installed in the distribution. Roy? -- Fred Douglis MIME accepted douglis@research.att.com AT&T Bell Laboratories 908 582-3633 (office) 600 Mountain Ave., Rm. 2B-105 908 582-3063 (fax) Murray Hill, NJ 07974 http://www.research.att.com/orgs/ssr/people/douglis/ From owner-robots Mon Nov 27 12:29:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17259; Mon, 27 Nov 95 12:29:54 -0800 Message-Id: <199511272029.PAA14228@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: harvest Date: Mon, 27 Nov 1995 15:29:38 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com there's been some mention of harvest.. the URL is http://harvest.cs.colorado.edu/ this provides a ton of infrastructure for implementing robots on top of, in the form of gatherers and or brokers. harvest sites cooperate so that once (with caching) a set of data (ftp, http, gopher, wais, etc.) has been "harvested" (or gathered), the global harvest database can reuse the gathered info without re-harvesting (re-gathering) from the target data site. this is "responsible"* robots that dont load up data sites with redundant automated downloading and cooperative robots, via brokering. * or ethical: http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html see http://harvest.cs.colorado.edu/harvest/technical.html for more. 
for a linear robot cooperation, harvest provides Summary Object Interchange Format (SOIF), http://harvest.cs.colorado.edu/Harvest/brokers/soifhelp.html arbitrary extensions to SOIF are on the object, object-attribute model. for nonlinear robot cooperation or interaction, brokers can be defined arbitrarily. i'm presently working on an associative AI which i had developed as a standalone program, but am stripping my lame gathering and brokering code for the sophistication of harvest. -john From owner-robots Mon Nov 27 14:39:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18292; Mon, 27 Nov 95 14:39:00 -0800 Date: Mon, 27 Nov 95 15:55:32 EST From: Jason_Murray_at_FCRD@cclink.tfn.com Message-Id: <9510278175.AA817518051@cclink.tfn.com> To: robots@webcrawler.com Subject: Re: Smart Agent help Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Give me a call (617) 345-2465 or send email (netsoft@aol.com). We are in process of creating just such an agent. Jason Murray DataMarket 306 Union St Rockland MA 02370 Fax 617-871-5816 From owner-robots Mon Nov 27 14:58:48 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18458; Mon, 27 Nov 95 14:58:48 -0800 Message-Id: <30BA6C06.444C@infi.net> Date: Mon, 27 Nov 1995 17:55:18 -0800 From: Michael Goldberg <magi@infi.net> Organization: Media Access Group X-Mailer: Mozilla 2.0b2a (Windows; I; 16bit) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: harvest References: <199511272029.PAA14228@lexington.cs.columbia.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Received your email through the robots listserv... I need an application built for a site I am developing... THe application allows users of the site to tailor a specified areas of interest,...say mortgages.. and search specific WWW sites and retrieve the information eith by email or a formatted newsletter. Can Harvest do this? From owner-robots Mon Nov 27 16:38:38 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19231; Mon, 27 Nov 95 16:38:38 -0800 Message-Id: <199511280038.TAA14968@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: mortgages with: Re: harvest In-Reply-To: Your message of "Mon, 27 Nov 1995 17:55:18 PST." <30BA6C06.444C@infi.net> Date: Mon, 27 Nov 1995 19:38:34 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Received your email through the robots listserv... > I need an application built for a site I am developing... > THe application allows users of the site to tailor a specified > areas of interest,...say mortgages.. and search specific WWW sites this is the kind of thing that harvest provides for. basically, in "tailoring" information dynamically (as opposed to going to a static menu system) your user is faced with (recursively) traversing an association graph. the user wants to see data with mortgage numbers. associativity is the service we are providing. better associativity, however, classes data, eg, via SOIF, so that the user has more coherent domains to search through than "every document with numeric strings and the string 'mortgage'". 
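As a rough illustration of the sort of per-document attribute-value summaries two robots could exchange instead of raw pages (a generic sketch, not the actual SOIF syntax -- see the Harvest documentation for the real thing):

#!/usr/bin/perl
# Sketch: emit one summary record per document as "Attribute: value"
# lines, so another robot can load the summaries without re-fetching.
use strict;

sub print_record {
    my ($url, %attrs) = @_;
    print "URL: $url\n";
    foreach my $key (sort keys %attrs) {
        print "$key: $attrs{$key}\n";
    }
    print "\n";                # blank line ends the record
}

# Hypothetical summaries a gatherer might produce.
print_record("http://www.example.com/",
             Title    => "Example home page",
             Type     => "commercial",
             Keywords => "example, sketch");
print_record("http://www.example.com/products.html",
             Title    => "Product list",
             Keywords => "widgets, price list");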
presently, SOIF provides for arbitrary degrees of data classification which is a strong solution for most applications, and generally an optimal solution for applications involving fairly regular data formats, eg, reports or forms. harvest provides for sites to cooperate or interoperate efficiently for applications such as these since no one site could ever have space to replicate the entire internet, or even a significant associative slice of it, in providing a monolithic internet database. basically the talent of harvest in linear interoperability, via SOIF, is providing the architecture for this recursively infinite association graph traversal in most forms of data, especially business data. > and retrieve the information eith by email or a formatted newsletter. > Can Harvest do this? certainly you could put an email or such interface on the system, but your users would probably be happier with something more responsive and flexible like a web interface. an interactive interface provides the opportunity for refining data collection, for discovering new sources of data, etc. -john From owner-robots Mon Nov 27 19:52:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26675; Mon, 27 Nov 95 19:52:36 -0800 Date: Mon, 27 Nov 1995 22:52:30 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199511280352.WAA24695@dolphin.automatrix.com> To: robots@webcrawler.com Subject: How frequently should I check /robots.txt? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm working on a specialized robot to identify Web sites with concert itineraries (by scoring the contents of the file against expected patterns). I will announce it here when I begin exercising it outside my local network. I'm a bit confused about how often I should update my local copy of a site's /robots.txt file. Clearly I shouldn't check it with each access, since that would double the number of accesses my robot would make to a site. I saw nothing in my server's access logs that would suggest that any of the robots that visit our site ever perform a HEAD request for /robots.txt (indicating they were checking for a Last-modified header). So how about it? How often should /robots.txt be checked? Thx, Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Mon Nov 27 20:31:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00165; Mon, 27 Nov 95 20:31:52 -0800 Date: Mon, 27 Nov 1995 23:27:08 -0600 (CST) From: gil cosson <gil@rusty.waterworks.com> To: robots@webcrawler.com Cc: robots@webcrawler.com Subject: Re: How frequently should I check /robots.txt? In-Reply-To: <199511280352.WAA24695@dolphin.automatrix.com> Message-Id: <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How about adding an entry to the robots.txt file that specifies how frequently the robots.txt file should be checked? gil. ========================================================================== "Everybody can be great because anybody can serve. You don't have to have a college degree to serve. You don't have to make your subject and verb agree to serve. 
You don't have to know the second theory of thermodynamics in physics to serve. You only need a heart full of grace. A soul generated by love." Martin Luther King Jr. On Mon, 27 Nov 1995, Skip Montanaro wrote: > > I'm working on a specialized robot to identify Web sites with concert > itineraries (by scoring the contents of the file against expected patterns). > I will announce it here when I begin exercising it outside my local network. > > I'm a bit confused about how often I should update my local copy of a site's > /robots.txt file. Clearly I shouldn't check it with each access, since that > would double the number of accesses my robot would make to a site. > > I saw nothing in my server's access logs that would suggest that any of the > robots that visit our site ever perform a HEAD request for /robots.txt > (indicating they were checking for a Last-modified header). > > So how about it? How often should /robots.txt be checked? > > Thx, > > Skip Montanaro skip@calendar.com (518)372-5583 > Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com > Internet Conference Calendar: http://www.calendar.com/conferences/ > >>> ZLDF: http://www.netresponse.com/zldf <<< > From owner-robots Mon Nov 27 23:22:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03527; Mon, 27 Nov 95 23:22:57 -0800 Message-Id: <9511280722.AA03518@webcrawler.com> To: robots@webcrawler.com Subject: Re: How frequently should I check /robots.txt? In-Reply-To: Your message of "Mon, 27 Nov 1995 23:27:08 CST." <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com> Date: Mon, 27 Nov 1995 23:22:54 -0800 From: Martijn Koster <mak@surfski.webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <Pine.LNX.3.91.951127231751.1718A-100000@rusty.waterworks.com>, gil cosson writes: > How about adding an entry to the robots.txt file that specifies how > frequently the robots.txt file should be checked? Hmm.. and then how often do you check if the checking frequency has changed? :-) Seriously though I don't think there'd be a lot of benefit; as an admin you tend not to know when you'll make the next change. From an http point of view robots could be smart, and look at the Expires header. Deciding how often to check for the /robots.txt depends highly on how you run your robot: how many runs per week, how many documents when, etc. I'd say a week is a reasonable time. If your robot supports end-user submissions you could of course be clever about people submitting their /robots.txt URL; that would give them more influence. -- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Nov 29 18:16:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11717; Wed, 29 Nov 95 18:16:32 -0800 Message-Id: <9511300215.AA04718@grasshopper.ucsd.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Date: Wed, 29 Nov 95 18:15:27 -0800 To: robots@webcrawler.com Subject: McKinley Spider hit us hard Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A spider from magellan.mckinley.com hit us hard today and did a deep recursive search of our web tree. Not very friendly, but their spider did check /robots.txt which indicates that they may have successfully implemented the robot exclusion protocol.
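Back on Skip's question about how often to re-read /robots.txt: one low-tech reading of Martijn's advice is to cache the file per host and only re-fetch it after a fixed time-to-live (a week, say), or sooner if the server sent an Expires header. A minimal Perl sketch; fetch_robots_txt() is a made-up placeholder for whatever HTTP and parsing code a robot already has, not any particular robot's implementation:

    use strict;

    my %robots_cache;              # host => { rules => ..., fetched => ... }
    my $TTL = 7 * 24 * 60 * 60;    # default: re-check after one week

    sub robots_rules_for {
        my ($host) = @_;
        my $entry = $robots_cache{$host};
        if (!$entry || time() - $entry->{fetched} > $TTL) {
            # fetch_robots_txt() is assumed to fetch and parse /robots.txt for $host
            $entry = { rules => fetch_robots_txt($host), fetched => time() };
            $robots_cache{$host} = $entry;
        }
        return $entry->{rules};
    }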
Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html here is their internic info if anyone else wants to complain to them: The McKinley Group (MCKINLEY-DOM) 85 Liberty Ship Way Suite 201 Sausalito, CA 94965 Domain Name: MCKINLEY.COM Administrative Contact, Technical Contact, Zone Contact: Cohen, Alexander J. (ASC2) xcohen@MCKINLEY.COM 415-331-1884 FAX Record last updated on 21-Sep-95. Record created on 14-Jul-94. Domain servers in listed order: NS1.NOC.NETCOM.NET 204.31.1.1 NS2.NOC.NETCOM.NET 204.31.1.2 From owner-robots Wed Nov 29 18:58:31 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15455; Wed, 29 Nov 95 18:58:31 -0800 From: Adminstrator <POSTMASTER@ATTAUST1.austria.attgis.com> To: robots@webcrawler.com Subject: Mail failure Date: Thu, 30 Nov 95 03:57:00 PST Message-Id: <30BD9C2C@mailgate.austria.attgis.com> Encoding: 50 TEXT X-Mailer: Microsoft Mail V3.0 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com User mail received addressed to the following unknown addresses: AUSTRIA/ATTAUST1/mostendo ------------------------------------------------------------------------------ Return-Path: <@issaust.austria.ncr.com:robots@webcrawler.com> Message-Id: <9511300215.AA04718@grasshopper.ucsd.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Date: Wed, 29 Nov 95 18:15:27 -0800 To: robots@webcrawler.com Subject: McKinley Spider hit us hard Sender: owner-robots@webcrawler.com Precedence: bulk Reply-To: robots@webcrawler.com A spider from magellan.mckinley.com hit us hard today and did a deep recursive search of our web tree. Not very friendly, but their spider did check /robots.txt which indicates that they may have successfully implemented the robot exclusion protocol. Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html here is their internic info if anyone else wants to complain to them: The McKinley Group (MCKINLEY-DOM) 85 Liberty Ship Way Suite 201 Sausalito, CA 94965 Domain Name: MCKINLEY.COM Administrative Contact, Technical Contact, Zone Contact: Cohen, Alexander J. (ASC2) xcohen@MCKINLEY.COM 415-331-1884 FAX Record last updated on 21-Sep-95. Record created on 14-Jul-94. 
Domain servers in listed order: NS1.NOC.NETCOM.NET 204.31.1.1 NS2.NOC.NETCOM.NET 204.31.1.2 From owner-robots Thu Nov 30 10:45:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12365; Thu, 30 Nov 95 10:45:43 -0800 Date: Thu, 30 Nov 1995 13:43:58 -0500 From: alain@ai.iit.nrc.ca (Alain Desilets) Message-Id: <9511301843.AA28288@ksl1000.iit.nrc.ca> To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear Marilyn, Just thought I'd check out the status of your robot testbed. My ListSeeker software (http://ai.iit.nrc.ca/II_public/WebView/ListSeeker.html) is now ready for testing. So if your robot testbed is ready for public use, I am prepared to try it out. Sincerely, Alain Desilets Institute for Information Technology National Research Council of Canada Building M-50 Montreal Road Ottawa (Ont) K1A 0R6 e-mail: alain@ai.iit.nrc.ca Tel: (613) 990-2813 Fax: (613) 952-7151 From owner-robots Thu Nov 30 12:30:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18231; Thu, 30 Nov 95 12:30:51 -0800 Date: Thu, 30 Nov 1995 21:29:30 +0100 (MET) From: Karoly Negyesi <chx@cs.elte.hu> X-Sender: chx@turan To: robots@webcrawler.com Subject: Small robot needed Message-Id: <Pine.SV4.3.91.951130212824.4490A@turan> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi! I'd need a very small robot which downloads a given URL (most probably an HTML page) and everything directly referenced (HREFs, LINKs, SRCs) Thanks, ___ ___ Charlie Negyesi chx@cs.elte.hu ___ ___ {~._.~} {~._.~} (+361) 203-5962 (7pm-9pm) {~._.~} {~._.~} _( Y )_ ( * ) Hungary, Budapest ( * ) _( Y )_ (:_~*~_:) ()~*~() H-1462, P.o.box 503 ()~*~() (:_~*~_:) (_)-(_) (_)-(_) May the Bear be with you! (_)-(_) (_)-(_) From owner-robots Thu Nov 30 13:15:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20570; Thu, 30 Nov 95 13:15:32 -0800 Date: Thu, 30 Nov 1995 16:15:21 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199511302115.QAA04958@dolphin.automatrix.com> To: robots@webcrawler.com Subject: New robot turned loose on an unsuspecting public... and a DNS question Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com No, it's not really another Godzilla movie. I started running the Musi-Cal Robot today. It has the following properties: 1. Understands (and obeys!) the robots.txt protocol. 2. Doesn't revisit the same server more than once every 10 minutes. 3. Doesn't revisit the same URL more than once per month. 4. Only groks HTTP URLs at the moment. 5. Announces itself in requests as "Musi-Cal-Robot/0.1". 6. Gives my email ("skip@calendar.com") in the From: field of the request. 7. It's looking for music-related sites, so you may never see it. 8. The HTML parser I'm using is rather slow, which helps avoid network congestion. 9. You should only ever see it running from dolphin.automatrix.com, a machine connected via 28.8k modem - again, a fine network/server congestion avoidance tool. 10. It randomizes its list of outstanding URLs after every pass through the list to minimize beating up a single server. If there's anything I've forgotten to do (like announce it somewhere on Usenet) or any parameter needs obvious tweaking, let me know.
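The 10-minute rule in item 2 comes down to keeping a per-host timestamp and refusing to fetch again until the interval has passed. A Perl sketch of that bookkeeping follows; the names are invented and this is illustrative only, not the Musi-Cal code:

    use strict;

    my %last_visit;                # host => time of the last request to it
    my $MIN_INTERVAL = 10 * 60;    # ten minutes, in seconds

    # Returns true (and records the visit) if it is polite to hit $host now.
    sub may_visit {
        my ($host) = @_;
        my $now = time();
        return 0 if $now - ($last_visit{$host} || 0) < $MIN_INTERVAL;
        $last_visit{$host} = $now;
        return 1;
    }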
I have been struggling with DNS resolution and was wondering if people could give me some feedback. Ideally, I want to make sure I treat all aliases for a server as the same server, so I was attempting to execute gethostbyaddr(gethostbyname('www.wherever.com')) but that seemed terribly slow and tcpdump traces suggested that it would get stuck banging on the same server. Then I tried just the gethostbyname(), but that wasn't much better. For now, I just accept what I have for a host name and map a couple places I know that do round-robin DNS back into the canonical name. What do other robot writers do about name resolution? Feedback appreciated. Thanks, Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Thu Nov 30 17:40:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09699; Thu, 30 Nov 95 17:40:52 -0800 Message-Id: <199512010140.RAA28005@fiji.verity.com> X-Authentication-Warning: fiji.verity.com: Host localhost.verity.com didn't use HELO protocol To: skip@calendar.com Cc: robots@webcrawler.com Subject: Re: New robot turned loose on an unsuspecting public... and a DNS question In-Reply-To: Your message of "Thu, 30 Nov 1995 16:15:21 EST." <199511302115.QAA04958@dolphin.automatrix.com> Date: Thu, 30 Nov 1995 17:40:32 -0800 From: Thomas Maslen <tmaslen@Verity.COM> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > What do other robot writers do about name resolution? In our case... cache the results of lookups so that we only do the gethostbyname("foo") once for any particular "foo". This still gives pretty evil behaviour on, say, a page of links to cool places where almost every link points to a different host, but the average behaviour is much better than not caching. Also, if you're looking for a canonical representation for hosts so that you can test "is this host the same as that one?", I'd suggest that you _not_ try matching the hostnames: rather, do the gethostbyaddr() and then look for an intersection in the sets of IP addresses (but be prepared to rewrite the code next year to deal with IPv6 addresses!). In other words, the canonical representation for a host should be the set of IP addresses, not the hostname strings. Thomas tmaslen@verity.com My opinions, not Verity's From owner-robots Fri Dec 1 08:24:24 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00727; Fri, 1 Dec 95 08:24:24 -0800 Date: Fri, 1 Dec 95 10:33:28 EST From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler) Message-Id: <9512011533.AA14431@pixel.cps.msu.edu> To: robots@webcrawler.com Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, Sorry to say, we had a disk problem and lost the original data. In the meantime, we have ordered a new (9 gig) disk, and also uncovered some more bugs in htmlgobble, and are trying to get things back. The known bugs are fixed, but the word on the new disk is still "any day now". You've been patient so far: sorry I didn't let you know the status earlier. I'll try to keep you informed, and when we have stuff (even before I announce it to the list), I'll let you know. 
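On the name-resolution question a few messages up: Thomas's suggestion of caching one lookup per name and treating two names as the same server when their IP address sets intersect can be done with stock Perl and the Socket module. A sketch under those assumptions, IPv4 only, helper names invented:

    use strict;
    use Socket;    # for inet_ntoa()

    my %addr_cache;    # hostname => reference to a list of dotted-quad addresses

    sub addresses_of {
        my ($host) = @_;
        unless ($addr_cache{$host}) {
            my @ent   = gethostbyname($host);    # one lookup per name, then cached
            my @addrs = @ent ? (map { inet_ntoa($_) } @ent[4 .. $#ent]) : ();
            $addr_cache{$host} = \@addrs;
        }
        return @{ $addr_cache{$host} };
    }

    # Two names refer to the same server if their address sets intersect.
    sub same_host {
        my ($h1, $h2) = @_;
        my %seen = map { $_ => 1 } addresses_of($h1);
        return grep { $seen{$_} } addresses_of($h2);
    }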
Thanks for your patience, Marilyn From owner-robots Fri Dec 1 08:59:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03564; Fri, 1 Dec 95 08:59:15 -0800 Date: Fri, 1 Dec 1995 17:20:47 +0200 (EET) From: Cristian Ionitoiu <cristi@cs.utt.ro> X-Sender: cristi@tempus5 To: robots@webcrawler.com Subject: inquiry about robots Message-Id: <Pine.SUN.3.91.951201171701.5311A-100000@tempus5> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi to everybody, I'm quite new on the list, and I'm interested in Internet navigating robots. I would like to know if there is any robot which offers an API for the programmer? Or is there any publicly available robot together with its sources? And I would prefer a non-perl implementation. Thank you in advance for all your information! --Cristian ============================================================================== CRISTIAN IONITOIU - teaching assistant, Computer Science Department, "Politehnica" University of Timisoara. Email: cristi@utt.ro, cristi@ns.utt.ro, cristi@cs.utt.ro WWW: http://www.utt.ro/~cristi Office: Bdul. Vasile Parvan No. 2, 1900 Timisoara, Romania Private: O.P. 5, C.P. 641, 1900 Timisoara, Romania Fax&Phone: (office): +40 56 192 049 ______________________________________________________________________________ Science is what happens when preconception meets verification. ============================================================================== From owner-robots Fri Dec 1 09:26:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06317; Fri, 1 Dec 95 09:26:06 -0800 Date: Fri, 1 Dec 1995 12:24:16 -0500 From: alain@ai.iit.nrc.ca (Alain Desilets) Message-Id: <9512011724.AA00940@ksl1000.iit.nrc.ca> To: robots@webcrawler.com Subject: Re: Looking for a spider Cc: alain@ai.iit.nrc.ca X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Hi, > > Sorry to say, we had a disk problem and lost the original data. That's a bummer... > In the > meantime, we have ordered a new (9 gig) disk, and also uncovered some more > bugs in htmlgobble, and are trying to get things back. The known bugs are > fixed, but the word on the new disk is still "any day now". > > You've been patient so far: sorry I didn't let you know the status earlier. > > I'll try to keep you informed, and when we have stuff (even before I announce > it to the list), I'll let you know. > Don't worry about me. We have some data here that I can use to test my approach on a small scale, and I am talking to some other people about getting about 1G of additional data. Your data would be a good addition to that (the more data the better). Good luck with your work and let me know how it goes. Alain From owner-robots Fri Dec 1 09:41:33 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07351; Fri, 1 Dec 95 09:41:33 -0800 Date: Fri, 1 Dec 1995 09:40:26 -0800 Message-Id: <199512011740.JAA05988@ix13.ix.netcom.com> From: wessman@ix.netcom.com (Gene Essman ) Subject: Re: Looking for a spider To: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You wrote: > > >Hi, > >Sorry to say, we had a disk problem and lost the original data. In the (snip) Sorry to seem so ignorant, but I have just been hanging around the Internet a short time. In that time, I have wondered about the whole "robot/spider" thing and have a couple of questions.
Perhaps someone could take the time to help me out. Are robots for sale or can one "hire" someone who has one to do some work, or how does that whole thing work? Thanks, Gene Essman From owner-robots Fri Dec 1 10:28:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10173; Fri, 1 Dec 95 10:28:36 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130501ace4f418b338@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 1 Dec 1995 10:28:27 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 9:40 AM 12/1/95, Gene Essman wrote: >Are robots for sale or can one "hire" someone who has one to do some >work, or how does that whole thing work. Verity offers a couple of variations of its Web robot, but they are designed specifically to build Verity search indexes, not as general-purpose robots. The only generally available robot-ish code that I know about is the Harvest Gatherer code. Its primary purpose is to index the server on which it is running, but it's a fairly small step to make it do the same over the wire. I think there's a widespread reluctance to push robots hard in the commercial space, since marketing success would fairly quickly breed failure -- having lots of robots doing redundant work would be a huge inefficiency. Nick From owner-robots Fri Dec 1 17:22:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03387; Fri, 1 Dec 95 17:22:39 -0800 Message-Id: <m0tLgeS-0004gSC@rsoft.rsoft.bc.ca> Date: Sat, 2 Dec 1995 00:12:58 +0000 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: Re: Looking for a spider To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50a)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Not to make this a sales pitch but if you need a real specialized spider for commercial work then we can build one for you that interfaces with ObjectStore, an object database, and any other applications you might have around. Ted Sullivan ---------- From: robots To: robots Subject: Re: Looking for a spider Date: Friday, December 01, 1995 9:40AM You wrote: > > >Hi, > >Sorry to say, we had a disk problem and lost the original data. In the (snip) Sorry to seem so ignorant, but I have just been hanging around the Internet a short time. In that time, I have wondered about the whole "robot/spider" thing and have a couple of questions. Perhaps someone could take the time to help me out. Are robots for sale or can one "hire" someone who has one to do some work, or how does that whole thing work. Thanks, Gene Essman From owner-robots Fri Dec 1 19:54:36 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03896; Fri, 1 Dec 95 19:54:36 -0800 Date: Fri, 1 Dec 1995 20:52:46 -0700 Message-Id: <199512020352.UAA24347@web.azstarnet.com> X-Sender: drose@azstarnet.com X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: drose@AZStarNet.com Subject: Re: Looking for a spider Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ted: I very much need a specialized spider. Could you let me know something about your capabilities? Assume that I want to research *everything* on the web about, say, stamp collecting (not) on an historical and contemporary basis, how would your spider work?
I look forward to hearing from you. -David M. Rose > >Not to make this a sales pitch but if you need a real specialized spider for >commercial work then we can build one for you that interfaces with >ObjectStore a Object Database and any other applications you might have >around. > >Ted Sullivan > ---------- From owner-robots Fri Dec 1 20:47:03 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04060; Fri, 1 Dec 95 20:47:03 -0800 Message-Id: <30BF86FE.183@mcc.tamu.edu> Date: Fri, 01 Dec 1995 22:51:42 +0000 From: Lance Ogletree <Lance.Ogletree@mcc.tamu.edu> X-Mailer: Mozilla 2.0b3 (Macintosh; I; PPC) Mime-Version: 1.0 To: robots@webcrawler.com Subject: MacPower Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Interested in Power Macintosh Computers? Stop by a site on the web. MacPower!!!!!!!! http://mccnet.tamu.edu/MacPower/MacPower.html From owner-robots Sat Dec 2 08:11:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05472; Sat, 2 Dec 95 08:11:51 -0800 Message-Id: <m0tLuX6-0004oqC@rsoft.rsoft.bc.ca> Date: Sat, 2 Dec 1995 05:09:58 +0000 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: Re: Looking for a spider To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50a)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Could you send me your e-mail address to tsullivan@snowymtn.com so we can have this discussion offline the robots mailing list. I am sure the other would appreciate it. Ted ---------- From: robots To: robots Subject: Re: Looking for a spider Date: Friday, December 01, 1995 7:52PM Ted: I very much need a specialized spider. Could you let me know something about your capabilities? Assume that I want to research *everything* on the web about, say, stamp collecting (not) on an historical and contemporary basis, how would your spider work? I look forward to hearing from you. -David M. Rose > >Not to make this a sales pitch but if you need a real specialized spider for >commercial work then we can build one for you that interfaces with >ObjectStore a Object Database and any other applications you might have >around. > >Ted Sullivan > ---------- From i.bromwich@nexor.co.uk Mon Dec 4 02:24:00 1995 Return-Path: <i.bromwich@nexor.co.uk> Received: from lancaster.nexor.co.uk by webcrawler.com (NX5.67f2/NX3.0M) id AA00398; Mon, 4 Dec 95 02:24:00 -0800 X400-Received: by /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 4 Dec 1995 10:23:23 +0000 X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/; Relayed; Mon, 4 Dec 1995 10:23:23 +0000 Date: Mon, 4 Dec 1995 10:23:23 +0000 X400-Originator: i.bromwich@nexor.co.uk X400-Recipients: non-disclosure:; X400-Mts-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:166150:951204102333] Content-Identifier: XT-MS Message Priority: Non-Urgent From: "i.bromwich" <i.bromwich@nexor.co.uk> Message-Id: <"-2131556092-16615-00001 951204102333Z*/I=i/S=bromwich/O=NEXOR/PRMD=NEXOR/ADMD= /C=GB/"@MHS> To: robots-archive <robots-archive@webcrawler.com> Reply-To: mak <mak@webcrawler.com> X-Mua-Version: XT-MUA 1.4 (dornier) of Tue Aug 22 03:03:53 BST 1995 // martijn, can't think of any other way to get these to you easily. 
Get in // touch if you need more help get stop From owner-robots Mon Dec 4 04:36:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00646; Mon, 4 Dec 95 04:36:17 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199512041236.OAA16470@krisse.www.fi> Subject: Re: MacPower To: robots@webcrawler.com Date: Mon, 4 Dec 1995 14:36:07 +0200 (EET) In-Reply-To: <30BF86FE.183@mcc.tamu.edu> from "Lance Ogletree" at Dec 1, 95 10:51:42 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 174 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Interested in Power Macintosh Computers? > Stop by a site on the web. > MacPower!!!!!!!! > http://mccnet.tamu.edu/MacPower/MacPower.html No, I am not very interested. From owner-robots Mon Dec 4 04:47:22 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00689; Mon, 4 Dec 95 04:47:22 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199512041247.OAA16694@krisse.www.fi> Subject: Re: MacPower (an apology, I am very sorry) To: robots@webcrawler.com Date: Mon, 4 Dec 1995 14:47:14 +0200 (EET) In-Reply-To: <199512041236.OAA16470@krisse.www.fi> from "Jaakko Hyvatti" at Dec 4, 95 02:36:07 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 148 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > http://mccnet.tamu.edu/MacPower/MacPower.html > > No, I am not very interested. I am very sorry this reply to the spam got into the list. From owner-robots Tue Dec 5 12:57:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19484; Tue, 5 Dec 95 12:57:01 -0800 From: Michael Van Biesbrouck <mlvanbie@undergrad.math.uwaterloo.ca> Message-Id: <199512052056.PAA24672@mobius07.math.uwaterloo.ca> Subject: Re: McKinley Spider hit us hard To: robots@webcrawler.com Date: Tue, 5 Dec 1995 15:56:33 -0500 (EST) In-Reply-To: <9511300215.AA04718@grasshopper.ucsd.edu> from "Christopher Penrose" at Nov 29, 95 06:15:27 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1110 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > A spider from magellan.mckinley.com hit us hard today and did a > deep recursive search of our web tree. Not very friendly, but their > spider did check /robots.txt which indicates that they may have > successfully implemented the robot exclusion protocol. > > > Christopher Penrose > penrose@ucsd.edu > http://www-crca.ucsd.edu/TajMahal/after.html > > here is their internic info if anyone else wants to complain to them: The spider in question is Wobot/1.00; the correct person to bother with complaints is cedeno@mckinley.com. They visited a site that I watch over on 21 Nov and did nothing after reading /robots.txt. The robots.txt is somewhat long, but not very restrictive. However, it seems to have gone ballistic today on another machine. As a result I will be complaining. In this case it came from radar.mckinley.com. I suggest that other people check their logs and complain if necessary. -- "You're obviously on drugs, Michael Van Biesbrouck but not the right ones."
ACM East Central Winning Team -- bwross about mlvanbie http://csclub.uwaterloo.ca/u/mlvanbie/ From owner-robots Tue Dec 5 22:02:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25735; Tue, 5 Dec 95 22:02:14 -0800 Date: Tue, 5 Dec 1995 22:02:01 -0800 X-Sender: julian @best.com Message-Id: <v01530501acea810e4f32@[206.86.2.106]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: julian@ugorilla.com (Julian Gorodsky) Subject: Re: Returned mail: Service unavailableHELP HELP! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The original message was received at Mon, 4 Dec 1995 20:35:54 -0800 >from julian.vip.best.com [206.86.2.106] > > ----- The following addresses had delivery problems ----- ><Majordomo-Owner@webcrawler.com> (unrecoverable error) > > ----- Transcript of session follows ----- >... while talking to surfski.webcrawler.com.: >>>> RCPT To:<Majordomo-Owner@webcrawler.com> ><<< 554 <Majordomo-Owner@webcrawler.com>... 550 User unknown >554 <Majordomo-Owner@webcrawler.com>... Service unavailable > > ----- Original message follows ----- > >Content-Type: message/rfc822 > >Return-Path: julian@ugorilla.com >Received: from [206.86.2.106] (julian.vip.best.com [206.86.2.106]) by >blob.best.net (8.6.12/8.6.5) with SMTP id UAA10780 for ><Majordomo-Owner@webcrawler.com>; Mon, 4 Dec 1995 20:35:54 -0800 >Date: Mon, 4 Dec 1995 20:35:54 -0800 >X-Sender: julian @best.com >Message-Id: <v01530500ace91a991809@[206.86.2.106]> >Mime-Version: 1.0 >Content-Type: text/plain; charset="us-ascii" >To: Majordomo-Owner@webcrawler.com >From: julian@ugorilla.com (Julian Gorodsky) >Subject: Re: Majordomo results > >>-- >> >>>>>> unsubscribe julian@best.com >>**** unsubscribe: 'julian@best.com' is not a member of list 'robots'. >>>>>> >>>>>> julian@ugorilla.com >>**** Command 'julian@ugorilla.com' not recognized. >>>>>> A Renaissance Project >>**** Command 'a' not recognized. >>>>>> >>>>>> >>**** Help for Majordomo: >> >>This is Brent Chapman's "Majordomo" mailing list manager, version 1.93. >> >>In the description below items contained in []'s are optional. When >>providing the item, do not include the []'s around it. >> >>It understands the following commands: >> >> subscribe [<list>] [<address>] >> Subscribe yourself (or <address> if specified) to the named <list>. >> >> unsubscribe [<list>] [<address>] >> Unsubscribe yourself (or <address> if specified) from the named >><list>. >> >> get [<list>] <filename> >> Get a file related to <list>. >> >> index [<list>] >> Return an index of files you can "get" for <list>. >> >> which [<address>] >> Find out which lists you (or <address> if specified) are on. >> >> who [<list>] >> Find out who is on the named <list>. >> >> info [<list>] >> Retrieve the general introductory information for the named <list>. >> >> lists >> Show the lists served by this Majordomo server. >> >> help >> Retrieve this message. >> >> end >> Stop processing commands (useful if your mailer adds a signature). >> >>Commands should be sent in the body of an email message to >>"Majordomo"or to "<list>-request". >> >>The <list> parameter is only optional if the message is sent to an address >>of the form "<list>-request". >> >> >>Commands in the "Subject:" line NOT processed. >> >>If you have any questions or problems, please contact >>"Majordomo-Owner". > >You have a subscriber named julianrz@best.com >Perhaps there's some confusion. 
> >julian@ugorilla.com >A Renaissance Project julian@ugorilla.com A Renaissance Project From owner-robots Tue Dec 5 22:02:50 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25781; Tue, 5 Dec 95 22:02:50 -0800 Date: Tue, 5 Dec 1995 22:02:39 -0800 X-Sender: julian @best.com Message-Id: <v01530500acea810c4ec1@[206.86.2.106]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: julian@ugorilla.com (Julian Gorodsky) Subject: Re: Returned mail: Service unavailableHELP AGAIN HELP AGAIN! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The original message was received at Mon, 4 Dec 1995 20:38:42 -0800 >from julian.vip.best.com [206.86.2.106] > > ----- The following addresses had delivery problems ----- ><Majordomo-Owner@webcrawler.com> (unrecoverable error) > > ----- Transcript of session follows ----- >... while talking to surfski.webcrawler.com.: >>>> RCPT To:<Majordomo-Owner@webcrawler.com> ><<< 554 <Majordomo-Owner@webcrawler.com>... 550 User unknown >554 <Majordomo-Owner@webcrawler.com>... Service unavailable > > ----- Original message follows ----- > >Content-Type: message/rfc822 > >Return-Path: julian@ugorilla.com >Received: from [206.86.2.106] (julian.vip.best.com [206.86.2.106]) by >blob.best.net (8.6.12/8.6.5) with SMTP id UAA12321 for ><Majordomo-Owner@webcrawler.com>; Mon, 4 Dec 1995 20:38:42 -0800 >Date: Mon, 4 Dec 1995 20:38:42 -0800 >X-Sender: julian @best.com >Message-Id: <v01530501ace91bcc6033@[206.86.2.106]> >Mime-Version: 1.0 >Content-Type: text/plain; charset="us-ascii" >To: Majordomo-Owner@webcrawler.com >From: julian@ugorilla.com (Julian Gorodsky) >Subject: Re: Welcome to robots > >>-- >> >>Welcome to the robots mailing list! >> >>If you ever want to remove yourself from this mailing list, >>send the following command in email to >>"robots-request": >> >> unsubscribe >> >>Or you can send mail to "Majordomo" with the following command >>in the body of your email message: >> >> unsubscribe robots Julian Rozentur <julianrz@best.com> >> >>Here's the general information for the list you've >>subscribed to, in case you don't already have it: >> >> >>This information is also available on the World-Wide Web in >>http://info.webcrawler.com/mailing-lists/robots/info.html >> >>CHARTER >> >>The robots@webcrawler.com mailing-list is intended as a technical >>forum for authors, maintainers and administrators of WWW robots. Its >>aim is to maximise the benefits WWW robots can offer while minimising >>drawbacks and duplication of effort. It is intended to address both >>development and operational aspects of WWW robots. >> >>This list is not intended for general discussion of WWW development >>efforts, or as a first line of support for users of robot facilities. >> >>Postings to this list are informal, and decisions and recommendations >>formulated here do not constitute any official standards. Postings to >>this list will be made available publicly through a mailing list >>archive. The administrator of this list nor his company accept any >>responsibility for the content of the postings. >> >>SUBSCRIPTION DETAILS >> >>To subscribe to this list, send a mail message to >>robots-request@webcrawler.com, with the word subscribe on the first >>line of the body. >> >>To unsubscribe to this list, send a mail message to >>robots-request@webcrawler.com, with the word unsubscribe on the first >>line of the body. 
>> >>Should this fail or should you otherwise need human assistance, send a >>message to owner-robots@webcrawler.com. >> >>To send message to all subscribers on the list itself, mail >>robots@webcrawler.com. >> >>THE ARCHIVE >> >>Messages to this list are archived. The preferred way of accessing the >>archived messages is using the Robots Mailing List Archive provided by >>Hypermail, on http://info.webcrawler.com/mailing-lists/robots/archive/ >> >>Behind the scenes this list is currently managed by Majordomo, an >>automated mailing list manager written in Perl. Majordomo also allows >>acces to archived messages; send mail to robots-request@webcrawler.com >>with the word help in the body to find out how. >> >> >>-- The Robots Mailing List Administrator <owner-robots@webcrawler.com> > >This is the original "warning" that a deluge of someone else's email would >arrive in my box > Not OK. > >julian@ugorilla.com >A Renaissance Project julian@ugorilla.com A Renaissance Project From owner-robots Tue Dec 5 22:43:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28361; Tue, 5 Dec 95 22:43:34 -0800 Message-Id: <199512060643.PAA01955@yamato.mtl.t.u-tokyo.ac.jp> To: robots@webcrawler.com Subject: Indexing two-byte text Date: Wed, 06 Dec 1995 15:43:23 +0900 From: Harry Munir Behrens <behrens@mtl.t.u-tokyo.ac.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello there, here at the Univ. of Tokyo we are currently installing Harvest and were wondering if anybody has experience with the problems encountered when indexing Japanese text. (no word boundaries, two-byte code etc.) I would be very grateful for any help pointing me to an international version of agrep/glimpse or something similar. Cheers, Harry Behrens PhD. candidate Dept. of Electrical Engineering Univ. of Tokyo behrens@mtl.t.u-tokyo.ac.jp From owner-robots Tue Dec 5 23:30:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01387; Tue, 5 Dec 95 23:30:12 -0800 Message-Id: <199512060730.CAA17199@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Indexing two-byte text In-Reply-To: Your message of "Wed, 06 Dec 1995 15:43:23 +0900." <199512060643.PAA01955@yamato.mtl.t.u-tokyo.ac.jp> Date: Wed, 06 Dec 1995 02:30:07 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com maybe people in unicode would have one (agrep), like maybe the folks at http://plan9.att.com/ > here at the Univ. of Tokyo we are currently installing Harvest and were > wondering if anybody has experience with the problems encountered > when indexing Japanese text. (no word boundaries, two-byte code etc.) > I would be very grateful for any help pointing me to an international > version of agrep/glimpse or something similar. From owner-robots Tue Dec 5 23:47:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02152; Tue, 5 Dec 95 23:47:15 -0800 Date: Wed, 6 Dec 95 16:46:55 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512060746.AA15834@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > here at the Univ. of Tokyo we are currently installing Harvest and were > wondering if anybody has experience with the problems encountered > when indexing Japanese text. (no word boundaries, two-byte code etc.) 
> I would be very grateful for any help pointing me to an international > version of agrep/glimpse or something similar. > We are doing a multi-lingual navigation project (called Ingrid) that involves indexing Japanese text. We use JUMAN to extract japanese text (because it is public domain---it actually doesn't do such a good job), and some home grown perl stuff to filter out garbage, weight terms, and do stemming. But, for searching, we are for now doing exact string matching only. I suggest you ask this question on the comp.infosystems.harvest and also on the winter (web internationalization) mailing list at winter@dorado.crpht.lu. (please see http://dorado.crpht.lu:80/~carrasco/winter/ for the winter web page). I think there may be some mule tools for international grep like things, but I'm not absolutely sure about it... PF From owner-robots Wed Dec 6 18:19:17 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04473; Wed, 6 Dec 95 18:19:17 -0800 Message-Id: <v02130503acebfafcb272@[202.243.51.210]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 7 Dec 1995 11:18:57 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 4:46 PM 12/6/95, Paul Francis wrote: >We are doing a multi-lingual navigation project >(called Ingrid) that involves indexing Japanese >text. We use JUMAN to extract japanese text >(because it is public domain---it actually doesn't >do such a good job), and some home grown perl >stuff to filter out garbage, weight terms, and >do stemming. Is there publicly available code to handle stemming for Japanese, or is there a description of the algorithm involved anywhere (in English or in Japanese)? And what sort of garbage remains after using JUMAN? --Mark From owner-robots Wed Dec 6 18:48:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06782; Wed, 6 Dec 95 18:48:42 -0800 Date: Thu, 7 Dec 95 11:48:24 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512070248.AA19999@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is there publicly available code to handle stemming for Japanese, or is > there a description of the algorithm involved anywhere (in English or in > Japanese)? Our Japanese "publisher" code will be made publicly available after 1) it is in decent shape, and 2) we get approval from management to release it (don't worry, we *WILL* get approval, one way or another :-). As for stemming. After making a weak attempt at finding out what other people are doing, we couldn't find anything about Japanese stemming. I think this may be because, since a dictionary is necessary simply to parse out the individual words, algorithmic stemming isn't really necessary. The stems are already in the dictionary. I wanted to minimize dependence on a dictionary, though, so we put our heads together and decided that effective stemming for Japanese simply requires removing any kana that appears after a kanji in a single "term". In other words, the kanji is the stem, in all cases. If the term has no kanji, then we don't stem at all. Though surely this simple algorithm must break for some cases, in our limited experience so far, we haven't found any problems. > > And what sort of garbage remains after using JUMAN? 
> JUMAN doesn't remove any text per se, just tries to separate out the individual terms. So, in general, text has all kinds of junk in it that isn't a valid term, including numbers, various symbols such as stars, circles, X's, etc. So, we try to filter as much of that out as we can without removing any valid stuff. As for JUMAN's term isolation ability, it suffers from a small dictionary. For example "intaanetto" (in romaji, "internet" in English) is broken into "intaa" and "netto", because JUMAN doesn't have "intaanetto" in its dictionary. I believe we'll be able to fix most of these by doing simple phrase detection. That is, if we see that "intaa" is always or very often followed by "netto", we can assume that they constitute a single phrase (or, in the no-white-space case, a single term). We will implement phrase detection next, and expect to have it by late January. PF ps. By the way, our Japanese publisher will be a single component of a multi-lingual publisher that will have language detection built in. We are doing Japanese and English, but expect to add others as they are done. pps. I really don't think this thread is so interesting to the robot list people. Maybe we should take it off-line. From owner-robots Wed Dec 6 23:29:02 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22374; Wed, 6 Dec 95 23:29:02 -0800 Message-Id: <199512070730.JAA08451@dns2.netvision.net.il> X-Sender: smadja@dns2.netvision.net.il X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 07 Dec 1995 09:27:25 -0500 To: robots@webcrawler.com From: Frank Smadja <smadja@netvision.net.il> Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am interested in this thread. Please keep it online or keep me posted. Thanks At 11:48 AM 12/7/95 JST, you wrote: >> >> Is there publicly available code to handle stemming for Japanese, or is >> there a description of the algorithm involved anywhere (in English or in >> Japanese)? > >Our Japanese "publisher" code will be made publicly >available after 1) it is in decent shape, and 2) we >get approval from management to release it (don't >worry, we *WILL* get approval, one way or another :-). > >As for stemming. After making a weak attempt at finding >out what other people are doing, we couldn't find >anything about Japanese stemming. I think this may be >because, since a dictionary is necessary simply to >parse out the individual words, algorithmic stemming >isn't really necessary. The stems are already in the >dictionary. > >I wanted to minimize dependence on a dictionary, though, >so we put our heads together and decided that effective >stemming for Japanese simply requires removing any kana >that appears after a kanji in a single "term". In other >words, the kanji is the stem, in all cases. If the term >has no kanji, then we don't stem at all. > >Though surely this simple algorithm must break for some >cases, in our limited experience so far, we haven't found >any problems. > >> >> And what sort of garbage remains after using JUMAN? >> > >JUMAN doesn't remove any text per se, just tries to separate >out the individual terms. So, in general, text has all >kinds of junk in it that isn't a valid term, including >numbers, various symbols such as stars, circles, X's, etc. >So, we try to filter as much of that out as we can without >removing any valid stuff. 
> >As for JUMAN's term isolation ability, it suffers from a >small dictionary. For example "intaanetto" (in romaji, >"internet" in English) is broken into "intaa" and "netto", >because JUMAN doesn't have "intaanetto" in its dictionary. >I believe we'll be able to fix most of these by doing >simple phrase detection. That is, if we see that "intaa" >is always or very often followed by "netto", we can assume >that they constitute a single phrase (or, in the no-white-space >case, a single term). We will implement phrase detection >next, and expect to have it by late January. > >PF > >ps. By the way, our Japanese publisher will be a single >component of a multi-lingual publisher that will have >language detection built in. We are doing Japanese and >English, but expect to add others as they are done. > >pps. I really don't think this thread is so interesting >to the robot list people. Maybe we should take it off-line. > > From owner-robots Wed Dec 6 23:42:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23372; Wed, 6 Dec 95 23:42:25 -0800 Date: Thu, 7 Dec 95 16:42:14 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512070742.AA21981@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I am interested in this thread. Please keep it online or keep me posted. > I have heard that there are only 4 numbers in computer science...0, 1, 2, and many. Thus, it seems that many people are interested in this thread.... :-) I'm more than happy to keep in online. PF From owner-robots Thu Dec 7 05:15:16 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08376; Thu, 7 Dec 95 05:15:16 -0800 Message-Id: <v02130503acec96754345@[202.243.51.214]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 7 Dec 1995 22:15:48 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >> I am interested in this thread. Please keep it online or keep me posted. > >I have heard that there are only 4 numbers in >computer science...0, 1, 2, and many. > >Thus, it seems that many people are interested >in this thread.... :-) > >I'm more than happy to keep in online. > >PF Susumu Shimizu has started a Japanese language robots mailing list if anyone is interested: w3-search@rodem.slab.ntt.jp You can contact him at shimizu@rodem.slab.ntt.jp to join. The charter members are those of us who attended his BOF at the recent Japan WWW Conference '95 in Kobe. --Mark From owner-robots Thu Dec 7 07:30:26 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16064; Thu, 7 Dec 95 07:30:26 -0800 Message-Id: <199512071526.AAA08232@luxion.mtl.t.u-tokyo.ac.jp> To: robots@webcrawler.com Subject: Indexing two-byte text Date: Fri, 08 Dec 1995 00:26:27 +0900 From: Harry Munir Behrens <behrens@mtl.t.u-tokyo.ac.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi guys, terrific echo, thanks to all that were interested and helpful. I have asked around some more in the university circus and we have arrived at the following project plan: We are putting in place a three-phase system based on JUMAN (for now) and an existing dictionary based rule-based system. In the first phase the system scans the text looking for two- and four- kanji components that the dictionary knows. 
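Since stemming comes up again here: Paul's kana-stripping rule from earlier in the thread (if a term contains kanji, the kanji is the stem and trailing kana is dropped; kanji-free terms are left alone) is essentially a one-regex affair. A sketch in present-day Perl with Unicode character classes, anachronistic for 1995 two-byte codes but it shows the rule; illustrative only, not the Ingrid code, and it assumes the input has already been decoded to character data:

    use strict;
    use utf8;

    # Stem one Japanese term: if it contains kanji, strip any trailing kana;
    # a term with no kanji at all is returned unchanged.
    sub stem_term {
        my ($term) = @_;
        return $term unless $term =~ /\p{Han}/;
        $term =~ s/[\p{Hiragana}\p{Katakana}]+\z//;
        return $term;
    }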
These known components are singled out as "sure hits" and are stemmed where appropriate. In the second phase we run JUMAN over the resulting text. The third phase is going to be very similar to the first, but will be only for verification purposes; meaning that if JUMAN generates terms the dictionary doesn't know about, error messages are output. The fourth stage is manual editing of these error messages :-( If there's anybody out there who is interested in more detailed info please get in touch at behrens@mtl.t.u-tokyo.ac.jp I'm happy for any comments, suggestions etc. Harry Behrens PhD. candidate Dept. of Electrical Engineering Univ. of Tokyo behrens@mtl.t.u-tokyo.ac.jp From owner-robots Thu Dec 7 12:44:47 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03647; Thu, 7 Dec 95 12:44:47 -0800 Message-Id: <v02130504aced024e3b85@[202.243.51.208]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 8 Dec 1995 05:45:10 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 12:26 AM 12/8/95, Harry Munir Behrens wrote: >We are putting in place a three-phase system based on JUMAN >(for now) and an existing dictionary based rule-based system. Is the "existing dictionary-based rule system" different from juman? Is juman not a dictionary-based rule system? --Mark From owner-robots Thu Dec 7 12:52:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03988; Thu, 7 Dec 95 12:52:19 -0800 Date: Thu, 7 Dec 1995 14:52:09 -0500 (EST) From: Randall Hill <rlh@conan.ids.net> To: robots@webcrawler.com Subject: Either a spider or a hacker? ww2.allcon.com Message-Id: <Pine.SUN.3.90.951207143241.24599B-100000@conan.ids.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi all, I'm setting up a new site and am getting persistent requests from ww2.allcon.com for a single file, home.shtml, that is under development and is not linked to anything. The default for my server is index.html, which has NOT been requested. Anyone seen them before? TIA, -randy hill From owner-robots Thu Dec 7 23:51:21 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09790; Thu, 7 Dec 95 23:51:21 -0800 Message-Id: <v02130509aced92cda9fe@[202.243.51.212]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 8 Dec 1995 16:51:06 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com francis@cactus.slab.ntt.jp (Paul Francis) wrote: >Our Japanese "publisher" code will be made publicly >available after 1) it is in decent shape, and 2) we >get approval from management to release it (don't >worry, we *WILL* get approval, one way or another :-). You may get approval, but I assume that it couldn't be freely used for commercial purposes? >As for stemming. After making a weak attempt at finding >out what other people are doing, we couldn't find >anything about Japanese stemming. I think this may be >because, since a dictionary is necessary simply to >parse out the individual words, algorithmic stemming >isn't really necessary. The stems are already in the >dictionary.
>I wanted to minimize dependence on a dictionary, though, >so we put our heads together and decided that effective >stemming for Japanese simply requires removing any kana >that appears after a kanji in a single "term". In other >words, the kanji is the stem, in all cases. If the term >has no kanji, then we don't stem at all. > >Though surely this simple algorithm must break for some >cases, in our limited experience so far, we haven't found >any problems. I don't think perfection is necessary here anyway to produce a useful system. But couldn't you just swap out the dictionary for a better dictionary? I just got a copy of juman, though, and although I just glanced at the files, it seemed like the dictionary was broken up by parts of speech. But most new coinages in a language tend to be nouns I would think. This could be a business opportunity for someone--just like software companies in the U.S. buy their spell checkers from specialized companies, someone could develop and market a morphological root dictionary for Japanese. >As for JUMAN's term isolation ability, it suffers from a >small dictionary. For example "intaanetto" (in romaji, >"internet" in English) is broken into "intaa" and "netto", >because JUMAN doesn't have "intaanetto" in its dictionary. >I believe we'll be able to fix most of these by doing >simple phrase detection. That is, if we see that "intaa" >is always or very often followed by "netto", we can assume >that they constitute a single phrase (or, in the no-white-space >case, a single term). We will implement phrase detection >next, and expect to have it by late January. Ha! A programmer's solution. It seems like just upping the dictionary is more straightforward. ;-) >ps. By the way, our Japanese publisher will be a single >component of a multi-lingual publisher that will have >language detection built in. We are doing Japanese and >English, but expect to add others as they are done. I'm not sure what you mean by a "publisher"--I'm not sure what this does. Is this different from Ingrid? --Mark From owner-robots Fri Dec 8 00:25:22 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11463; Fri, 8 Dec 95 00:25:22 -0800 Date: Fri, 8 Dec 95 17:25:12 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9512080825.AA29356@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > You may get approval, but I assume that it couldn't be freely used for > commercial purposes? You are right. I suppose you could liscense it, but I hardly think it would be worth it. A good programmer could throw it together in a week.... > > I don't think perfection is necessary here anyway to produce a useful > system. But couldn't you just swap out the dictionary for a better > dictionary? I just got a copy of juman, though, and although I just glanced One major problem is that all the better dictionaries we know are commercial, so it broke our requirement for freely usable code. Second, I think using a dictionary is a never-ending battle. Each specialization has its own terms and require their own dictionary. Further, language evolves fast, especially in fast-moving fields. I don't want the headache of always trying to maintain the dictionary. > >case, a single term). We will implement phrase detection > >next, and expect to have it by late January. > > Ha! A programmer's solution. It seems like just upping the dictionary is > more straightforward. 
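Paul's kana-stripping rule above is concrete enough to write down. A minimal sketch, assuming the terms have already been isolated (by JUMAN or otherwise) and are handled as decoded character strings; the Unicode script classes here stand in for whatever two-byte character ranges a real implementation would test.

    use strict;
    use warnings;

    # Stem a single isolated term: if it contains kanji, drop every kana
    # character that follows a kanji (the kanji is the stem); if it has no
    # kanji at all, leave it alone.
    sub stem_term {
        my ($term) = @_;
        return $term unless $term =~ /\p{Han}/;    # no kanji: don't stem
        my $seen_kanji = 0;
        my $stem = '';
        for my $c (split //, $term) {
            $seen_kanji = 1 if $c =~ /\p{Han}/;
            # skip hiragana/katakana (and the long-vowel mark) once a
            # kanji has been seen
            next if $seen_kanji
                 && $c =~ /[\p{Hiragana}\p{Katakana}\x{30FC}]/;
            $stem .= $c;
        }
        return $stem;
    }

So an inflected verb keeps only its kanji, a pure-kana or pure-romaji term passes through unchanged, and the cases where this is too blunt remain the open question Paul flags.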
;-) Your a manager, eh? :-) But, we need phrase detection in any event. So, I hope it handles the term isolation part as well. > > I'm not sure what you mean by a "publisher"--I'm not sure what this does. > Is this different from Ingrid? > "Publisher" is the (rather poor) term we use for the component of Ingrid that takes a resource, automatically pulls out key terms, generates some other info about the resource (size, type, title, etc.) and gives it to the component of Ingrid that inserts it into the navigation topology. PF From owner-robots Fri Dec 8 02:19:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16967; Fri, 8 Dec 95 02:19:34 -0800 Message-Id: <v02130517acedc10886ff@[202.243.51.212]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 8 Dec 1995 19:19:15 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Indexing two-byte text Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 5:25 PM 12/8/95, Paul Francis wrote: >One major problem is that all the better dictionaries we know >are commercial, so it broke our requirement for freely usable >code. Second, I think using a dictionary is a never-ending >battle. Each specialization has its own terms and require their >own dictionary. Further, language evolves fast, especially in >fast-moving fields. I don't want the headache of always trying >to maintain the dictionary. But by that time you'll be off on another project, and it'll be someone else's headache. ;-) >> I'm not sure what you mean by a "publisher"--I'm not sure what this does. >> Is this different from Ingrid? > >"Publisher" is the (rather poor) term we use for the component >of Ingrid that takes a resource, automatically pulls out key >terms, generates some other info about the resource (size, type, >title, etc.) and gives it to the component of Ingrid that inserts >it into the navigation topology. Is this navigation topology the part that you intend to patent? --Mark From owner-robots Fri Dec 8 07:08:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02625; Fri, 8 Dec 95 07:08:42 -0800 Date: Fri, 8 Dec 95 18:16:44 EST From: smadja@netvision.net.il Subject: RE: Indexing two-byte text To: robots@webcrawler.com X-Mailer: Chameleon ARM_55, TCP/IP for Windows, NetManage Inc. Message-Id: <Chameleon.951208181716.smadja@Haifa.netvision.net.il> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How can we get JUMAN ? Is it freeware, shareware, commercial? Thanks ------------------------------------- Name: Frank Smadja E-mail: smadja@netvision.net.il Date: 12/08/95 Time: 18:16:44 ------------------------------------- From owner-robots Fri Dec 8 20:16:18 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20477; Fri, 8 Dec 95 20:16:18 -0800 Message-Id: <v02130505aceebd0aeb88@[202.243.51.214]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 9 Dec 1995 13:16:43 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: RE: Indexing two-byte text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >How can we get JUMAN ? >Is it freeware, shareware, commercial? 
> >Thanks > >------------------------------------- >Name: Frank Smadja >E-mail: smadja@netvision.net.il >Date: 12/08/95 >Time: 18:16:44 >------------------------------------- You can find software like this by using Archie, and picking a Japanese Archie server. Look for the most recently uploaded version. Juman, for instance, is on the Sony Computer Science Labs FTP server, among others. I think it's freeware or public domain, and it's written by the Nara Institute of Something-or-Other. --Mark From owner-robots Sun Dec 10 17:33:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17521; Sun, 10 Dec 95 17:33:15 -0800 Message-Id: <199512110132.KAA25906@azalea.kawasaki.flab.fujitsu.co.jp> To: robots@webcrawler.com Subject: RE: Indexing two-byte text In-Reply-To: Your message of "Sat, 9 Dec 1995 13:16:43 +0900" References: <v02130505aceebd0aeb88@[202.243.51.214]> X-Mailer: Mew beta version 0.91 on Emacs 19.28.1, Mule 2.2 Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Date: Mon, 11 Dec 1995 10:30:03 +0900 From: Noboru Iwayama <iwayama@flab.fujitsu.co.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark> I think it's freeware or public domain, and it's written by the Nara Mark> Institute of Something-or-Other. You can get JUMAN from ftp://pr.aist-nara.ac.jp/pub/nlp/tools/juman/ Noboru I From owner-robots Tue Dec 12 20:28:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20743; Tue, 12 Dec 95 20:28:25 -0800 From: ecarp@tssun5.dsccc.com Date: Tue, 12 Dec 1995 22:25:31 -0600 Message-Id: <9512130425.AA27447@tssun5.> To: robots@webcrawler.com Subject: Freely available robot code in C available? X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com DSC Communications is a multinational company located in Plano, TX with offices all over the world. We have lots of technically-savvy people in the company, but not a lot of information on what different divisions are doing within the company, especially in regards to web activities. Since the division that I work for is Information Services, we feel that we would like to get a handle on who is running web servers in the company, what they have on them, and what they are being used for. The idea is to eliminate duplication of effort (two or more departments put up servers, each with similar information), and provide consistent information to our internal departments. I myself have been running a server (both internally and externally) for over a year, and have many years of CS experience, so I feel that the task of collecting information on who is doing what wouldn't be a overwhelming task. It is felt that the best way of collecting the information needed would be to either write some sort of web collection program from scratch or obtain a freely-available one from the net and modify it for our needs. I have read the proposed FAQ and all of the etiquette documents, and the plan of attack is to write or obtain a robot that would scan HTML text only, signaling the server that we can handle only text (avoiding the overhead of having to download images only to discard them), then build an Oracle database composed of URLs and text which could be searchable via an SQL query. Comments or sample source code on doing such a task, or pointers to freely- available code, would be greatly appreciated. If no such code is available, pointers on writing such a beast would be also appreciated. 
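Ed asks for pointers on writing such a beast. This is not the C robot he actually wants, but a sketch of the overall loop in Perl, leaning on the libwww-perl modules (LWP::RobotUA handles /robots.txt and request spacing); every hostname, address, and name below is a placeholder, not DSC's real setup.

    #!/usr/bin/perl
    # Sketch of a text-only, stay-inside-the-firewall robot loop.
    use strict;
    use warnings;
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use URI;

    my $ua = LWP::RobotUA->new('internal-robot/0.1', 'robot-admin@example.com');
    $ua->delay(1);                                  # minutes between hits on one host
    $ua->default_header('Accept' => 'text/html');   # tell the server: text only, please

    my @queue = ('http://www.example.com/');        # internal start page (placeholder)
    my %seen;

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        next unless $url =~ m{^http://[^/]+\.example\.com}i;   # never leave the domain

        my $res = $ua->get($url);
        next unless $res->is_success && $res->content_type eq 'text/html';
        my $html = $res->decoded_content;

        # collect links before throwing the markup away
        my @links;
        HTML::LinkExtor->new(
            sub { my ($tag, %attr) = @_;
                  push @links, $attr{href} if $tag eq 'a' && $attr{href} },
            $url
        )->parse($html);
        push @queue, map { URI->new_abs($_, $url)->canonical->as_string } @links;

        (my $text = $html) =~ s/<[^>]*>/ /gs;       # crude tag stripping
        store_page($url, $text);                    # stub: however URL/text get stored
    }

    sub store_page { }

The same skeleton works whether the storage end is Oracle, flat files, or a text engine; the only robot-specific parts are the robots.txt handling, the delay, and the stay-inside-the-domain test.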
One more note: if I hadn't made it clear already, the robot would, under no circumstances, be allowed to search outside the DSC domain, and we have no direct access to the outside world except through our firewall (which will only filter selected packets from selected sites, and the internal web server isn't on the list). This is intended to be an 'internal use only' project, and so would not be used to generate revenue, nor would it be allowed to roam the net at large. The other restriction on the server is that it must be written in C. ANSI C is not a requirement. Any help or comments would be greatly appreciated. Thanks in advance... -- Ed Carp, Senior Operations Analyst, DSC Communications Please note that I do not speak for DSC Communications, nor are any statements made herein meant to be taken as a position, official or otherwise, of DSC Communications. From owner-robots Tue Dec 12 21:52:55 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25270; Tue, 12 Dec 95 21:52:55 -0800 Message-Id: <9512130553.AA03554@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Tue, 12 Dec 95 23:54:40 -0600 To: robots@webcrawler.com Subject: Freely available robot code in C available? References: <9512130425.AA27447@tssun5.> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com If you get any good C code would you please send it along to me I am a CS student and am trying to write an indexing robot I have alreay wrote a cheesy performance robot which you can find on yahoo. it is called bomb. it is very simple and plain jane From owner-robots Tue Dec 12 22:21:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26942; Tue, 12 Dec 95 22:21:57 -0800 Message-Id: <v02130500acf420454c14@[202.243.51.222]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 13 Dec 1995 15:21:33 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Freely available robot code in C available? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Since the division that I work for is Information Services, we feel that we >would like to get a handle on who is running web servers in the company, >what they have on them, and what they are being used for. The idea is to >eliminate duplication of effort (two or more departments put up servers, each >with similar information), and provide consistent information to our internal >departments. Ed: Why don't you just buy a turnkey package from Open Text, Architext, or one of the other companies selling this sort of thing rather than make it from scratch? --Mark From owner-robots Wed Dec 13 05:45:49 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11573; Wed, 13 Dec 95 05:45:49 -0800 Date: Wed, 13 Dec 95 08:42:23 EST From: "Jim Meritt" <jmeritt@smtpinet.aspensys.com> Message-Id: <9511138188.AA818873047@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Harvest question Content-Length: 563 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com First, is someone aware of a Harvest list? Next, my problem. I've gotten Harvest-1.4 patch level 1 onto a Sun Sparcstation 20 running Solaris 2.3. 
Watching the logs during gathering shows that it appears to be Gathering, but when I try the broker, I don't get errors on the broker screen, just "no hits" and in the broker.out log I get "GL_do_query_inline: connect: Connection refused". What is it trying to connect to, and does anyone have a suggestion on how to get this working? Jim Meritt From owner-robots Wed Dec 13 07:47:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14957; Wed, 13 Dec 95 07:47:09 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130501acf4a5481384@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 13 Dec 1995 07:48:15 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Freely available robot code in C available? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:25 PM 12/12/95, ecarp@tssun5.dsccc.com wrote: >... then build an Oracle database >composed of URLs and text which could be searchable via an SQL query. Aside from the question of why you want to build your own, rather than buying an off-the-shelf solution (we have one, too) -- why Oracle? A text search engine will give much better performance and have many more text-oriented features than Oracle or another RDBMS. Search engines are a kind of database, of course, but one that is oriented toward text, rather than fielded data (which some of them also support). Nick From owner-robots Wed Dec 13 13:18:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01989; Wed, 13 Dec 95 13:18:37 -0800 Message-Id: <9512132114.AA21778@tssun5.> Comments: Authenticated sender is <ecarp@tssun5.dsccc.com> From: "Edwin Carp" <ecarp@tssun5.dsccc.com> Organization: DSC Communications To: narnett@Verity.COM (Nick Arnett), robots@webcrawler.com Date: Wed, 13 Dec 1995 15:15:17 +0000 Subject: Re: Freely available robot code in C available? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Date: Wed, 13 Dec 1995 07:48:15 -0800 > To: robots@webcrawler.com > From: narnett@Verity.COM (Nick Arnett) > Subject: Re: Freely available robot code in C available? > Reply-to: robots@webcrawler.com > At 10:25 PM 12/12/95, ecarp@tssun5.dsccc.com wrote: > >... then build an Oracle database > >composed of URLs and text which could be searchable via an SQL > >query. > > Aside from the question of why you want to build your own, rather > than buying an off-the-shelf solution (we have one, too) -- why > Oracle? A text search engine will give much better performance and > have many more text-oriented features than Oracle or another RDBMS. > Search engines are a kind of database, of course, but one that is > oriented toward text, rather than fielded data (which some of them > also support). The problem with an off-the-shelf solution is that most of them are not flexible enough for our needs. Also, we are tied to a product that does not allow us to make any changes unless we go back to the vendor. Customizations are likely to be expensive, and this project is being done on a literal shoestring, using existing hardware and home-grown software. Oracle, because that's what we have in-house, and we have lots and lots of reporting and search tools for it.
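On the storage side, the URL-plus-text table Ed has in mind is small. A sketch against the DBI interface, with the connection string, table, and column names all invented; a real Oracle schema would need a LONG or CLOB column for page bodies, and Nick's point stands that a plain LIKE over such a column is exactly where an RDBMS starts to hurt.

    use strict;
    use warnings;
    use DBI;

    # placeholder connection details
    my $dbh = DBI->connect('dbi:Oracle:internalweb', 'webbot', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    # one row per fetched page
    my $ins = $dbh->prepare(
        'INSERT INTO web_pages (url, fetched, body) VALUES (?, SYSDATE, ?)');

    sub store_page {
        my ($url, $text) = @_;
        $ins->execute($url, $text);
    }

    # the simplest possible search; a text engine would rank and stem instead
    my $hits = $dbh->selectall_arrayref(
        'SELECT url FROM web_pages WHERE body LIKE ?', undef, '%firewall%');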
From owner-robots Wed Dec 13 17:06:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16103; Wed, 13 Dec 95 17:06:12 -0800 Message-Id: <v02130500acf5287dd3c9@[202.243.51.222]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 14 Dec 1995 10:05:47 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Harvest question Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > First, is someone aware of a Harvest list? > > Next, my problem. I've gotten Harvest-1.4 patch level 1 onto a Sun > Sparcstation 20 running Solaris 2.3. Watching the logs during > gathering shows that it appears to be Gatherering, but when I try the > broker, I don't get errors on the broker screen, just "no hits" and in > the broker.out log I get "GL_do_query_inline: connect: Connection > refused". What is it trying to connect to, and does anyone have a > suggestion on how to get this working? > > Jim Meritt There's a full-blown newsgroup, comp.infosystems.harvest --Mark From owner-robots Thu Dec 14 02:40:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14819; Thu, 14 Dec 95 02:40:19 -0800 Date: Thu, 14 Dec 1995 10:34:13 GMT From: cs0sst@isis.sunderland.ac.uk (Simon.Stobart) Message-Id: <9512141034.AA19413@osiris.sund.ac.uk> To: robots@webcrawler.com Subject: Announcement and Help Requested X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com New Robot Announcement ~~~~~~~~~~~~~~~~~~~~~~ Name: IncyWincy Home: University of Sunderland, UK Implementation Language: C++ Supports Robot Exclusion standard: Yes Purpose: Various research projects Status: This robot has not yet been released outside of Sunderland Authors: Simon Stobart, Reg Arthington Help Requested ~~~~~~~~~~~~~~ The user-agent, from and referer http fields are not set to anything currently. Obviously, I wish these to conatin informative information. So, how do you send this information to the web server? The values which I wish to set these fields to are: User-Agent: IncyWincy V?.? From: simon.stobart@sunderland.ac.uk Many Thanks |------------------------------------+-------------------------------------| | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | | Lecturer in Computing, | Voice: (+44) 091 515 2783 | | School of Computing | Fax: (+44) 091 515 2781 | | & Information Systems, + ------------------------------------| | University of Sunderland, SR1 3SD, | 007: Balls Q? | | England. | Q: Bolas 007! | |------------------------------------|-------------------------------------| From owner-robots Thu Dec 14 03:58:35 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18495; Thu, 14 Dec 95 03:58:35 -0800 Date: Thu, 14 Dec 1995 10:34:13 GMT From: cs0sst@isis.sunderland.ac.uk (Simon.Stobart) Message-Id: <9512141034.AA19413@osiris.sund.ac.uk> To: robots@webcrawler.com Subject: Announcement and Help Requested X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com New Robot Announcement ~~~~~~~~~~~~~~~~~~~~~~ Name: IncyWincy Home: University of Sunderland, UK Implementation Language: C++ Supports Robot Exclusion standard: Yes Purpose: Various research projects Status: This robot has not yet been released outside of Sunderland Authors: Simon Stobart, Reg Arthington Help Requested ~~~~~~~~~~~~~~ The user-agent, from and referer http fields are not set to anything currently. 
Obviously, I wish these to conatin informative information. So, how do you send this information to the web server? The values which I wish to set these fields to are: User-Agent: IncyWincy V?.? From: simon.stobart@sunderland.ac.uk Many Thanks |------------------------------------+-------------------------------------| | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | | Lecturer in Computing, | Voice: (+44) 091 515 2783 | | School of Computing | Fax: (+44) 091 515 2781 | | & Information Systems, + ------------------------------------| | University of Sunderland, SR1 3SD, | 007: Balls Q? | | England. | Q: Bolas 007! | |------------------------------------|-------------------------------------| From owner-robots Thu Dec 14 05:45:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23662; Thu, 14 Dec 95 05:45:40 -0800 Date: Thu, 14 Dec 95 08:47:11 EST From: "Jim Meritt" <jmeritt@smtpinet.aspensys.com> Message-Id: <9511148189.AA818959719@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Re[2]: Harvest question Content-Length: 399 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I know about the newsgroup - which is why I asked about a mailing list... ______________________________ Reply Separator _________________________________ Subject: Re: Harvest question Author: robots@webcrawler.com at SMTPINET Date: 12/13/95 8:26 PM > First, is someone aware of a Harvest list? > Jim Meritt There's a full-blown newsgroup, comp.infosystems.harvest --Mark From owner-robots Thu Dec 14 07:53:50 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00520; Thu, 14 Dec 95 07:53:50 -0800 From: mschrimsher@twics.com Message-Id: <v02130503acf5f7cc13e2@[202.243.51.222]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 00:54:17 +0900 To: robots@webcrawler.com Subject: Robot on the Rampage Cc: w3-search@rodem.slab.ntt.jp, infotalk@square.brl.ntt.jp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Can anyone identify the following robot: 206.214.202.44 It went through my site (a 600-page web directory service) grabbing several pages a second, despite the fact that I prohibit robots in my robots.txt file. --Mark From owner-robots Thu Dec 14 09:37:21 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06123; Thu, 14 Dec 95 09:37:21 -0800 Date: Fri, 15 Dec 1995 02:37:23 +0900 From: shimizu@rodem.slab.ntt.jp (Susumu Shimizu) Message-Id: <199512141737.CAA24695@rodem.slab.ntt.jp> To: mschrimsher@twics.com Cc: robots@webcrawler.com, w3-search@rodem.slab.ntt.jp, infotalk@square.brl.ntt.jp In-Reply-To: <v02130503acf5f7cc13e2@[202.243.51.222]> (mschrimsher@twics.com) Subject: Re: Robot on the Rampage Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark, here you are. 
Name: magellan.mckinley.com Address: 206.214.202.44 -- shim From owner-robots Thu Dec 14 10:14:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07660; Thu, 14 Dec 95 10:14:54 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199512141814.TAA01403@wsinis10.win.tue.nl> Subject: Re: Robot on the Rampage To: robots@webcrawler.com Date: Thu, 14 Dec 1995 19:14:53 +0100 (MET) In-Reply-To: <v02130503acf5f7cc13e2@[202.243.51.222]> from "mschrimsher@twics.com" at Dec 15, 95 00:54:17 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 244 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (mschrimsher@twics.com) write: > >Can anyone identify the following robot: > > 206.214.202.44 % host 206.214.202.44 Name: magellan.mckinley.com Try http://www.mckinley.com/ to find out more about their search service. -- Reinier From owner-robots Thu Dec 14 11:34:46 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11581; Thu, 14 Dec 95 11:34:46 -0800 Date: Thu, 14 Dec 1995 14:32:43 -0600 (CST) From: Cees Hek <hekc@phoenix.cis.mcmaster.ca> To: robots@webcrawler.com Subject: Checking Log files In-Reply-To: <v02130503acf5f7cc13e2@[202.243.51.222]> Message-Id: <Pine.LNX.3.91.951214142420.2308A-100000@phoenix.cis.mcmaster.ca> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anyone have a small script that will parse a log file (NCSA 1.3 common log format) and check for "nasty" robots. I don't have a /robots.txt file on the server, since we welcome anyone to index our site, but I would like to keep track of any robots that are hammering the system. Currently our log file grows at about a half a Meg a day, and I don't have time to go through it myself. Any help would be appreciated Cees Hek Computing & Information Services Email: hekc@mcmaster.ca McMaster University Hamilton, Ontario, Canada From owner-robots Thu Dec 14 13:08:37 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15607; Thu, 14 Dec 95 13:08:37 -0800 Message-Id: <9512142109.AA05041@marys.smumn.edu> Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Thu, 14 Dec 95 15:11:15 -0600 To: robots@webcrawler.com Subject: Re: Checking Log files References: <Pine.LNX.3.91.951214142420.2308A-100000@phoenix.cis.mcmaster.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com why not set up a cron job to grep out for all access to robot.txt = out of the log file= From owner-robots Thu Dec 14 18:12:20 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00533; Thu, 14 Dec 95 18:12:20 -0800 Message-Id: <v02130503acf687e1f79d@[202.237.148.40]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 11:11:57 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Checking Log files Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Does anyone have a small script that will parse a log file (NCSA 1.3 >common log format) and check for "nasty" robots. 
I don't have a >/robots.txt file on the server, since we welcome anyone to index our >site, but I would like to keep track of any robots that are hammering the >system. > >Currently our log file grows at about a half a Meg a day, and I don't have >time to go through it myself. Any help would be appreciated > > >Cees Hek >Computing & Information Services Email: hekc@mcmaster.ca >McMaster University >Hamilton, Ontario, Canada You can make a robots.txt file that permits all accesses, and then check the log for requests for that file. But it won't catch robots that don't check for robots.txt. --Mark From owner-robots Thu Dec 14 18:46:45 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02442; Thu, 14 Dec 95 18:46:45 -0800 Message-Id: <n1393155260.72132@mail.intouchgroup.com> Date: 14 Dec 1995 18:50:11 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 19:17:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04222; Thu, 14 Dec 95 19:17:42 -0800 Message-Id: <n1393153402.82326@mail.intouchgroup.com> Date: 14 Dec 1995 19:20:30 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 19:43:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05763; Thu, 14 Dec 95 19:43:43 -0800 Message-Id: <n1393151841.75020@mail.intouchgroup.com> Date: 14 Dec 1995 19:46:33 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 20:13:46 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07395; Thu, 14 Dec 95 20:13:46 -0800 Message-Id: <n1393150039.85103@mail.intouchgroup.com> Date: 14 Dec 1995 20:16:52 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
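Putting together what Cees asked for with Mark's permit-everything robots.txt trick and Kevin's grep idea: a small sketch that reads an NCSA common-format access log, counts requests per client per minute, and reports the heavy hitters along with whether they ever asked for /robots.txt. The script name and threshold are arbitrary; tune them to the site.

    #!/usr/bin/perl
    # usage: perl robotwatch.pl access_log
    use strict;
    use warnings;

    my %per_minute;     # host -> minute -> number of requests
    my %read_robots;    # hosts that fetched /robots.txt

    while (<>) {
        # common log format: host ident user [date] "request" status bytes
        my ($host, $ts, $request) =
            /^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"/ or next;
        my ($minute) = $ts =~ /^([^:]+:\d\d:\d\d)/ or next;   # drop the seconds
        $per_minute{$host}{$minute}++;
        $read_robots{$host}++ if $request =~ m{/robots\.txt};
    }

    my $threshold = 30;    # flag anything above 30 requests in a single minute
    for my $host (sort keys %per_minute) {
        my ($peak) = sort { $b <=> $a } values %{ $per_minute{$host} };
        next unless $peak > $threshold;
        printf "%-40s peak %3d req/min %s\n", $host, $peak,
               $read_robots{$host} ? "(asked for /robots.txt)"
                                   : "(never asked for /robots.txt)";
    }

It will not notice a well-behaved robot that spreads its requests out, and, as Mark says, nothing in a log can make a robot read /robots.txt in the first place.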
From owner-robots Thu Dec 14 20:43:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08988; Thu, 14 Dec 95 20:43:51 -0800 Message-Id: <n1393148232.94882@mail.intouchgroup.com> Date: 14 Dec 1995 20:46:54 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [5]RE>Checking Log files To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [5]RE>Checking Log files 12/14/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Thu Dec 14 22:23:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14421; Thu, 14 Dec 95 22:23:54 -0800 Message-Id: <v0213050aacf6c4753632@[202.237.148.34]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:23:30 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: [5]RE>Checking Log files Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Date: 14 Dec 1995 20:46:54 -0800 >From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> >Subject: [5]RE>Checking Log files >To: robots@webcrawler.com >Sender: owner-robots@webcrawler.com >Precedence: bulk >Reply-To: robots@webcrawler.com > > [5]RE>Checking Log files 12/14/95 > >Thanks for you message. > >I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail >before I get back. Is there any way to stop Roger's infinite loop? January 5 is a long way off. --Mark From owner-robots Fri Dec 15 02:08:57 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25729; Fri, 15 Dec 95 02:08:57 -0800 Message-Id: <9512151006.AA25636@webcrawler.com> X-Mailer: exmh version 1.5 11/22/94 To: robots@webcrawler.com Subject: Re: [5]RE>Checking Log files In-Reply-To: Your message of "Fri, 15 Dec 95 15:23:30 +0900." <v0213050aacf6c4753632@[202.237.148.34]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 95 09:55:02 +0000 From: M.Levy@cs.ucl.ac.uk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Is there any way to stop Roger's infinite loop? January 5 is a long way off. > > --Mark > > Maybe it's worth mailing the system administrator at twics.com From owner-robots Fri Dec 15 02:55:10 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28164; Fri, 15 Dec 95 02:55:10 -0800 Message-Id: <n1393125956.34566@mail.intouchgroup.com> Date: 15 Dec 1995 02:58:18 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>[5]RE>Checking Log files 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 03:55:58 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01483; Fri, 15 Dec 95 03:55:58 -0800 Message-Id: <n1393122315.53763@mail.intouchgroup.com> Date: 15 Dec 1995 03:58:38 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. 
I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 08:12:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03110; Fri, 15 Dec 95 08:12:39 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199512151254.VAA10969@ktmp.kaist.ac.kr> Subject: Wobot? To: robots@webcrawler.com (Robot Mailing list) Date: Fri, 15 Dec 1995 21:54:17 +0900 (KST) X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 504 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anyone know about Wobot from magellan.mckinley.com? They identify themselves as "Wobot" in the User-Agent field. Martijn Koster writes in the "List of Robots" html page: -- McKinley Robot It's unclear who administers this, but a number of people have complained about rapid-fire hits from magellan.mckinley.com. There have been no replies to direct complaints. Not very nice... -- Yeah, okay. I know what Wobot is now. My question is: is there any method to exclude *only* Wobot in my access list? -chitos From owner-robots Fri Dec 15 08:46:56 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06966; Fri, 15 Dec 95 08:46:56 -0800 Date: Fri, 15 Dec 1995 11:46:48 -0500 (EST) From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> To: Robots mailing list <robots@webcrawler.com> Subject: Announcing NaecSpyr, a new. . . robot? Message-Id: <Pine.SGI.3.91.951215000641.16482A-100000@umbc10.umbc.edu> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com NaecSpyr is an agent that checks if URLs have changed. In purpose, it is similar to URL Minder, w3new, and Web Watch; in implementation, it takes a slightly different approach, running centrally on a server (like URL Minder) but providing a web interface for each user. NaecSpyr may not be a robot according to the definition in the robots homepage (it doesn't scan HTML for new URLs), but it's compliant with the robot protocol, anyway. ;> See <http://www.gl.umbc.edu/~mabzug1/NaecSpyr> for a little (*very* little) more info. Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu So many bytes, so few CPS. From owner-robots Fri Dec 15 09:03:29 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08911; Fri, 15 Dec 95 09:03:29 -0800 Message-Id: <n1393103858.61698@mail.intouchgroup.com> Date: 15 Dec 1995 09:06:06 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:08:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09443; Fri, 15 Dec 95 09:08:25 -0800 Message-Id: <n1393103559.81023@mail.intouchgroup.com> Date: 15 Dec 1995 09:11:21 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]Wobot? 12/15/95 Thanks for you message.
I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:23:32 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10895; Fri, 15 Dec 95 09:23:32 -0800 Message-Id: <n1393102654.36226@mail.intouchgroup.com> Date: 15 Dec 1995 09:26:58 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]Announcing NaecSpyr, a new. . . robot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:55:54 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14150; Fri, 15 Dec 95 09:55:54 -0800 Message-Id: <n1393100724.52255@mail.intouchgroup.com> Date: 15 Dec 1995 09:58:31 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]Announcing NaecSpyr, a n 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 09:55:56 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14154; Fri, 15 Dec 95 09:55:56 -0800 Message-Id: <n1393100712.52210@mail.intouchgroup.com> Date: 15 Dec 1995 09:59:09 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:06:31 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15105; Fri, 15 Dec 95 10:06:31 -0800 Message-Id: <n1393100072.90931@mail.intouchgroup.com> Date: 15 Dec 1995 10:09:01 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:14:24 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15864; Fri, 15 Dec 95 10:14:24 -0800 Date: Fri, 15 Dec 1995 10:14:19 -0800 From: gordon@BASISinc.com (Gordon Bainbridge) Message-Id: <9512151814.AA01071@outland.BASISinc.com> To: robots@webcrawler.com Subject: Re: [3]RE>[5]RE>Checking Log fi X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ----- Begin Included Message ----- From owner-robots@webcrawler.com Fri Dec 15 09:58 PST 1995 Date: 15 Dec 1995 09:06:06 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Reply-To: robots@webcrawler.com [3]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
----- End Included Message ----- Please, please, please, if you check your mail, take care of this. I'm receiving this message from you constantly. I'll be away for a week, and fear that your messages will completely overflow my mailbox. DO SOMETHING!!! From owner-robots Fri Dec 15 10:20:18 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16063; Fri, 15 Dec 95 10:20:18 -0800 From: micah@fsu.fsufay.edu (Micah A. Williams) Message-Id: <199512151819.NAA05590@fsu.fsufay.edu> Subject: Re: [2]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Date: Fri, 15 Dec 95 13:19:50 EST In-Reply-To: <n1393122315.53763@mail.intouchgroup.com>; from "Roger Dearnaley" at Dec 15, 95 3:58 am X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In the words of Roger Dearnaley, > > [2]RE>[5]RE>Checking Log fi 12/15/95 > > Thanks for you message. > > I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail > before I get back. > I think it's obvious that some sort of recursion or infinite loop is happening at the site this mail is originating from: mail.intouchgroup.com. All the duplicate messages have different mail-spooler ID's from this site, so the mail is being queued over and over again for some reason. (Perhaps it is being sumbitted repeatedly by a mail client)... Oh well...I guess we could get a lot of messages between now and Jan 5th :-) Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? Thanks. -Micah -- ==================================================================== Micah A. Williams micah@fsu.uncfsu.edu Computer Science tndf20c@prodigy.com Fayetteville State University http://fsu.uncfsu.edu/~micah Bjork WebPage: http://fsu.uncfsu.edu/~micah/bjork.html ==================================================================== From owner-robots Fri Dec 15 10:32:18 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17084; Fri, 15 Dec 95 10:32:18 -0800 Message-Id: <n1393098526.84465@mail.intouchgroup.com> Date: 15 Dec 1995 10:35:10 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:43:44 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18068; Fri, 15 Dec 95 10:43:44 -0800 Message-Id: <n1393097842.22896@mail.intouchgroup.com> Date: 15 Dec 1995 10:46:37 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>[3]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>[3]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
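Back on Byung-Gyu Chang's question about shutting out *only* Wobot: under the robot exclusion standard that is just a /robots.txt record naming that User-agent, roughly as below. This assumes the robot reads the file at all; one that ignores it (which is what the complaints about magellan.mckinley.com suggest) has to be refused at the server or router by address instead.

    # /robots.txt -- turn away only McKinley's robot, welcome everything else
    User-agent: Wobot
    Disallow: /

    User-agent: *
    Disallow: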
From owner-robots Fri Dec 15 10:43:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18065; Fri, 15 Dec 95 10:43:43 -0800 Message-Id: <n1393097842.22938@mail.intouchgroup.com> Date: 15 Dec 1995 10:45:57 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [3]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [3]Announcing NaecSpyr, a n 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:48:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18534; Fri, 15 Dec 95 10:48:40 -0800 Message-Id: <n1393097544.41865@mail.intouchgroup.com> Date: 15 Dec 1995 10:51:21 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [5]RE>[5]RE>Checking Log fi To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [5]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 10:49:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18645; Fri, 15 Dec 95 10:49:51 -0800 From: Vince Taluskie <vince@psa.pencom.com> Message-Id: <199512151849.MAA10429@psa.pencom.com> Subject: Contact for Intouchgroup.com To: robots@webcrawler.com Date: Fri, 15 Dec 1995 12:49:41 -0600 (CST) In-Reply-To: <n1393100712.52210@mail.intouchgroup.com> from "Roger Dearnaley" at Dec 15, 95 09:59:09 am X-Mailer: ELM [version 2.4 PL24] Content-Type: text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com WHOIS shows the following as an administrative contact at the site: Hunter, Kurt (KH258) kurt_hunter@INTOUCHGROUP.COM 415-974-5000 I phoned Kurt and left voicemail for him about this user asking him to disable the auto-responder on the account.... Cheers, Vince -- ___ ____ __ | _ \/ __/| \ Vince Taluskie, at Fidelity Investments Boston, MA | _/\__ \| \ \ Pencom Systems Administration Phone: 617-563-8349 |_| /___/|_|__\ vince@pencom.com Pager: 800-253-5353, #182-6317 -------------------------------------------------------------------------- "We are smart, we make things go" From owner-robots Fri Dec 15 10:59:39 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19483; Fri, 15 Dec 95 10:59:39 -0800 Message-Id: <n1393096884.81309@mail.intouchgroup.com> Date: 15 Dec 1995 11:02:20 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]RE>[2]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]RE>[2]RE>[5]RE>Checking Log fi 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:12:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20619; Fri, 15 Dec 95 11:12:00 -0800 From: micah@fsu.fsufay.edu (Micah A. 
Williams) Message-Id: <199512151911.OAA06554@fsu.fsufay.edu> Subject: Re: [3]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Date: Fri, 15 Dec 95 14:11:30 EST In-Reply-To: <n1393103858.61698@mail.intouchgroup.com>; from "Roger Dearnaley" at Dec 15, 95 9:06 am X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com -- ==================================================================== Micah A. Williams micah@fsu.uncfsu.edu Computer Science tndf20c@prodigy.com Fayetteville State University http://fsu.uncfsu.edu/~micah Bjork WebPage: http://fsu.uncfsu.edu/~micah/bjork.html ==================================================================== From owner-robots Fri Dec 15 11:14:51 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20891; Fri, 15 Dec 95 11:14:51 -0800 Message-Id: <n1393095974.37243@mail.intouchgroup.com> Date: 15 Dec 1995 11:18:46 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:22:20 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21568; Fri, 15 Dec 95 11:22:20 -0800 Date: Fri, 15 Dec 1995 11:22:29 -0800 From: gordon@BASISinc.com (Gordon Bainbridge) Message-Id: <9512151922.AA01083@outland.BASISinc.com> To: robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ----- Begin Included Message ----- From owner-robots@webcrawler.com Fri Dec 15 10:59 PST 1995 From: micah@fsu.fsufay.edu (Micah A. Williams) Subject: Re: [2]RE>[5]RE>Checking Log fi To: robots@webcrawler.com Date: Fri, 15 Dec 95 13:19:50 EST Reply-To: robots@webcrawler.com In the words of Roger Dearnaley, > > [2]RE>[5]RE>Checking Log fi 12/15/95 > > Thanks for you message. > > I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail > before I get back. > Oh well...I guess we could get a lot of messages between now and Jan 5th :-) Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? Thanks. -Micah ----- End Included Message ----- I've already tried it. My mail was returned with the message "Unknown Quicktime Receipient(s)". Not only that, but my message has been returned to me TWICE. Does anyone else have any ideas? I'll be gone for a week, and don't want my mail box cluttered with hundreds of e-mails from good ol' Roger. I guess I could unsubscribe until January 5, but I'd rather not do it if there's an alternative. -Gordon Bainbridge BASIS Inc Emeryville, CA From owner-robots Fri Dec 15 11:25:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21899; Fri, 15 Dec 95 11:25:52 -0800 Message-Id: <n1393095317.74876@mail.intouchgroup.com> Date: 15 Dec 1995 11:28:52 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [4]Announcing NaecSpyr, a n To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [4]Announcing NaecSpyr, a n 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. 
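None of which helps Gordon's mailbox today, but for anyone setting up the next auto-responder, the usual guards against exactly this loop are to answer the envelope sender rather than the list, to stay quiet on anything marked as bulk or list traffic, and to answer each correspondent at most once. A rough sketch of the header checks; everything about the mail setup here is illustrative, not anyone's actual configuration.

    #!/usr/bin/perl
    # Reads one incoming message on stdin and decides whether a vacation
    # reply is safe to send at all.  (Header continuation lines are ignored
    # for brevity.)
    use strict;
    use warnings;

    my %hdr;
    while (<STDIN>) {
        last if /^\s*$/;                                # blank line ends the header
        $hdr{lc $1} = $2 if /^([\w-]+):\s*(.*)/;
    }

    my $sender = $hdr{'sender'} || $hdr{'return-path'} || $hdr{'from'} || '';

    exit 0 if ($hdr{'precedence'} || '') =~ /^(bulk|junk|list)/i;  # list or bulk mail
    exit 0 if $sender =~ /owner-|-request|daemon|mailer/i;         # list machinery
    exit 0 if $sender eq '';

    # a real vacation program also keeps a per-sender log so nobody is
    # answered more than once per trip
    send_reply($sender);
    sub send_reply { }    # stub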
From owner-robots Fri Dec 15 11:25:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21900; Fri, 15 Dec 95 11:25:52 -0800 Message-Id: <n1393095317.74946@mail.intouchgroup.com> Date: 15 Dec 1995 11:28:25 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>[3]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[3]RE>[5]RE>Checking 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:32:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22520; Fri, 15 Dec 95 11:32:00 -0800 Date: Fri, 15 Dec 1995 14:31:50 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199512151931.OAA10572@dolphin.automatrix.com> To: robots@webcrawler.com Cc: postmaster@mail.intouchgroup.com, owner-robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi In-Reply-To: <199512151819.NAA05590@fsu.fsufay.edu> References: <n1393122315.53763@mail.intouchgroup.com> <199512151819.NAA05590@fsu.fsufay.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? I already sent postmaster@mail.intouchgroup.com a note about the problem. No response yet. (Dear postmaster: For what it's worth, this fellow's mailbot has spewed, oh I don't know, maybe 30 messages back at the robots mailing list. I suspect any other lists he's on are similarly affected.) Perhaps the robots list owner could remove this fellow from the list for now... Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Fri Dec 15 11:32:34 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22578; Fri, 15 Dec 95 11:32:34 -0800 From: micah@fsu.fsufay.edu (Micah A. Williams) Message-Id: <199512151932.OAA06841@fsu.fsufay.edu> Subject: Dearnaley Auto Reply Cannon? To: robots@webcrawler.com Date: Fri, 15 Dec 95 14:32:05 EST X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm sure many of you have figured this out already but I think maybe Mr. Dearlarney is running some kind of automated reply program from his account. Any mail that is sent to his inbox gets an auto reply with the body of the mail being the subject followed by ... "I'm gone 'til Jan 5, etc..". The recursion is occuring because he is a member of the very list he is sending auto-replys to. So not only is he receiving and auto-replying to his own replys over and over again, he is also starting new recursion threads with any new message sent to the list. (Actually, this is kinda cool..I like recursion problems..but I'm sure the list maintainer and everybody else dislikes a mailbox full of Re:Re:Re: messages) .. The solution: (As Bonnie Scott pointed out to me) Temporarily remove Roger Dearlaney from the list. Sorry If I wasted bandwidth with this, but I just got a sudden realization of how this was all unfolding. -Micah -- ==================================================================== Micah A. 
Williams micah@fsu.uncfsu.edu Computer Science tndf20c@prodigy.com Fayetteville State University http://fsu.uncfsu.edu/~micah Bjork WebPage: http://fsu.uncfsu.edu/~micah/bjork.html ==================================================================== From owner-robots Fri Dec 15 11:35:33 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22843; Fri, 15 Dec 95 11:35:33 -0800 Message-Id: <9512151935.AA05873@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Fri, 15 Dec 95 13:36:13 -0600 To: robots@webcrawler.com References: <199512151819.NAA05590@fsu.fsufay.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How hard would this be.. Would the admin please take him off the darn list please From owner-robots Fri Dec 15 11:36:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22904; Fri, 15 Dec 95 11:36:09 -0800 Message-Id: <n1393094695.13294@mail.intouchgroup.com> Date: 15 Dec 1995 11:38:24 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [2]RE>[2]RE>[5]RE>Checking To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[2]RE>[5]RE>Checking 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:42:31 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23546; Fri, 15 Dec 95 11:42:31 -0800 Message-Id: <n1393094316.33627@mail.intouchgroup.com> Date: 15 Dec 1995 11:43:55 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [1]Contact for Intouchgroup To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [1]Contact for Intouchgroup.com 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 11:42:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23568; Fri, 15 Dec 95 11:42:40 -0800 Message-Id: <n1393094310.33444@mail.intouchgroup.com> Date: 15 Dec 1995 11:45:14 -0800 Priority: Urgent From: "Saul Jacobs" <saul_jacobs@mail.intouchgroup.com> Subject: Re: [2]RE>[5]RE>Checking Lo To: robots@webcrawler.com, "Skip Montanaro" <skip@calendar.com> Cc: owner-robots@webcrawler.com, "postmaster@mail.intouchgroup.co" <postmaster@mail.intouchgroup.com> X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com intouch Reply 12/15/95 Subject:RE>>[2]RE>[5]RE>Checking Log fi 11:43 I am the postmaster. I am working on killing our user's forward. The mails should stop within 2 hours. But check out our webite: http://WorldWideMusic.com/ Saul Jacobs Coputer Systems Manger intouch group, inc. -------------------------------------- Date: 12/15/95 11:42 To: Saul Jacobs From: Skip Montanaro Could the list maintainer maybe mail root@mail.intouchgroup.com and inform them of this problem? I already sent postmaster@mail.intouchgroup.com a note about the problem. No response yet. (Dear postmaster: For what it's worth, this fellow's mailbot has spewed, oh I don't know, maybe 30 messages back at the robots mailing list. I suspect any other lists he's on are similarly affected.) 
Perhaps the robots list owner could remove this fellow from the list for now... Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< ------------------ RFC822 Header Follows ------------------ Received: by mail.intouchgroup.com with SMTP;15 Dec 1995 11:39:17 -0800 Received: (from skip@localhost) by dolphin.automatrix.com (8.6.12/8.6.12) id OAA10572; Fri, 15 Dec 1995 14:31:50 -0500 Date: Fri, 15 Dec 1995 14:31:50 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199512151931.OAA10572@dolphin.automatrix.com> To: robots@webcrawler.com CC: postmaster@mail.intouchgroup.com, owner-robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi In-Reply-To: <199512151819.NAA05590@fsu.fsufay.edu> References: <n1393122315.53763@mail.intouchgroup.com> <199512151819.NAA05590@fsu.fsufay.edu> Reply-To: skip@calendar.com (Skip Montanaro) From owner-robots Fri Dec 15 11:47:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24012; Fri, 15 Dec 95 11:47:06 -0800 Message-Id: <n1393094037.53612@mail.intouchgroup.com> Date: 15 Dec 1995 11:49:43 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: [5]Wobot? To: robots@webcrawler.com X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [5]Wobot? 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. From owner-robots Fri Dec 15 12:00:44 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25042; Fri, 15 Dec 95 12:00:44 -0800 Message-Id: <199512152000.PAA12523@tinman.dev.prodigy.com> X-Sender: bonnie@192.203.241.117 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:00:35 -0400 To: robots@webcrawler.com From: bonnie@dev.prodigy.com (Bonnie Scott) Subject: Re: [2]RE>[5]RE>Checking Lo X-Mailer: <Windows Eudora Version 2.0.2> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I am the postmaster. I am working on killing our user's forward. The mails >should stop within 2 hours. > >But check out our webite: http://WorldWideMusic.com/ > >Saul Jacobs >Coputer Systems Manger >intouch group, inc. Thanks Saul, if you're on this list. I had taken matters into my own hands a half hour ago, and told majordomo I was Roger and I told it to "unsubscribe robots." I then did a "who robots," and Roger doesn't appear to be on it anymore. I'll apologize to Roger and his autoresponder myself. :) I thought that the mail client community figured out that autoresponders should reply to "Sender:" and not even bother to answer "Precedence: bulk" messages (both of which are present in this list's headers) back in '93 or so with the big MCI mail snafu. Bonnie Scott Prodigy Services Company (whose mail client ALWAYS replies to sender, even if there's a "Reply-to:". Not my app. 
<g>) From owner-robots Fri Dec 15 12:06:30 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25532; Fri, 15 Dec 95 12:06:30 -0800 From: ecarp@tssun5.dsccc.com Date: Fri, 15 Dec 1995 14:03:20 -0600 Message-Id: <9512152003.AA10658@tssun5.> To: robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Perhaps the list maintainer can filter out this address until Jan 5. - that might be an easier and faster solution. From owner-robots Fri Dec 15 14:38:55 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06614; Fri, 15 Dec 95 14:38:55 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800acf5eebdebc4@[199.221.45.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 16:38:28 -0500 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Announcement and Help Requested Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >New Robot Announcement Can you fill out http://info.webcrawler.com/mak/projects/robots/active.html so I have all the bits I need toadd you to the list? >The user-agent, from and referer http fields are not set to anything >currently. Obviously, I wish these to conatin informative information. So, >how do you send this information to the web server? Ehr, by adding them as headers to the request? >User-Agent: IncyWincy V?.? Check the HTTP spec, it suggests forms like IncyWincy/1.1 -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Fri Dec 15 15:44:12 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10472; Fri, 15 Dec 95 15:44:12 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130504acf7b860b790@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:45:18 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Wobot? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 9:54 PM 12/15/95, Byung-Gyu Chang wrote: >Did anyone know about Wobot from magellan.mckinley.com ? >They represents them "Wobot" in User-Agent field. > >Martijn Koster write in "List of Robots" html page : >-- >McKinley Robot > >It's unclear who administers this, but a number of people have complained >about rapid-fire hits from magellan.mckinley.com. There have been no replies >to direct complaints. Not very nice... I've forwarded some of the complaints here to the head of development at McKinley. When I hear back, I'll post to the list. If all else fails, I do have his home phone number... ;-) Often, this sort of thing is an isolated test gone haywire. Nick From owner-robots Fri Dec 15 19:43:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25112; Fri, 15 Dec 95 19:43:52 -0800 X-Sender: mak@surfski.webcrawler.com (Unverified) Message-Id: <v02140808acf7b2ecfb47@[199.221.45.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 15 Dec 1995 15:25:39 -0800 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Vacation wars Cc: bonnie@dev.prodigy.com (Bonnie Scott) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <199512152000.PAA12523@tinman.dev.prodigy.com>, Bonnie Scott writes: > Thanks Saul, if you're on this list. 
I had taken matters into my own hands a > half hour ago, and told majordomo I was Roger and I told it to "unsubscribe > robots." I then did a "who robots," and Roger doesn't appear to be on it > anymore. I'll apologize to Roger and his autoresponder myself. :) Ah, that explains why I couldn't find him. :-) Thanks; sometimes I can catch these in time, but this time I had to be bleeped away from my day off :-/ > I thought that the mail client community figured out that autoresponders > should reply to "Sender:" and not even bother to answer "Precedence: bulk" > messages (both of which are present in this list's headers) back in '93 or > so with the big MCI mail snafu. Quite. Not sure quite what "Mail*Link SMTP-QM 3.0.2" is, but with lots of gatewaying and simplistic PC packages nowadays this does happen every once in a while. For anyone thinking about using autoresponders on UNIX, check out mailagent, which goes to all sorts of lengths to prevent such loops (but let me declare auto-reponder mail off-topic for this group) I guess it is time to modify majordomo to filter out vaction messages... Sorry for the inconvenience caused, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Sat Dec 16 05:48:06 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26270; Sat, 16 Dec 95 05:48:06 -0800 Date: Sat, 16 Dec 1995 14:47:55 +0100 (MET) From: Bjorn-Olav Strand <bjorn-ol@ifi.uio.no> To: robots@webcrawler.com Subject: Re: [2]RE>[5]RE>Checking Log fi In-Reply-To: <199512151819.NAA05590@fsu.fsufay.edu> Message-Id: <Pine.SUN.3.91.951216144542.3188A-100000@beli.ifi.uio.no> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Fri, 15 Dec 1995, Micah A. Williams wrote: > > I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail > > before I get back. > Oh well...I guess we could get a lot of messages between > now and Jan 5th :-) He sends a reply on all the mail he gets that he is on vacation. But when the mail is from robots@webcrawler.com he will send it to that address, and then get the message back, and then reply to it again... There are 2 solutions. 1. Take him off the list. 2. Talk to his postmaster. ----- XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX Bjorn-Olav Strand . Nedre Berglia 56 . 1353 BAERUMS VERK . NORWAY (+47) 967 68 054 . bolav@pobox.com . http://www.pobox.com/~bolav/ XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX From owner-robots Sat Dec 16 12:28:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19283; Sat, 16 Dec 95 12:28:19 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510103acf8dc452db8@[205.240.23.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 16 Dec 1995 12:28:38 -0800 To: robots@webcrawler.com From: david@olympic.net (David Henderson) Subject: Re: [2]RE>[5]RE>Checking Lo Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>I am the postmaster. I am working on killing our user's forward. The mails >>should stop within 2 hours. >> >>But check out our webite: http://WorldWideMusic.com/ >> >>Saul Jacobs >>Coputer Systems Manger >>intouch group, inc. > >Thanks Saul, if you're on this list. 
I had taken matters into my own hands a >half hour ago, and told majordomo I was Roger and I told it to "unsubscribe >robots." I then did a "who robots," and Roger doesn't appear to be on it >anymore. I'll apologize to Roger and his autoresponder myself. :) > >I thought that the mail client community figured out that autoresponders >should reply to "Sender:" and not even bother to answer "Precedence: bulk" >messages (both of which are present in this list's headers) back in '93 or >so with the big MCI mail snafu. > >Bonnie Scott >Prodigy Services Company >(whose mail client ALWAYS replies to sender, even if there's a "Reply-to:". > Not my app. <g>) congratulation bonnie, ______________________________________________________________________ David Henderson QUICKimage Homepage Development and Marketing HOME PH/FAX: 360-377-2182 WORK PH: 206-443-1430 WORK FAX: 206-443-5670 From owner-robots Sat Dec 16 12:48:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20374; Sat, 16 Dec 95 12:48:25 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130501acf8e09606e0@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 16 Dec 1995 12:49:27 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Vacation wars Cc: m.koster@webcrawler.com (Martijn Koster) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 3:25 PM 12/15/95, Martijn Koster wrote: >... Not sure quite what "Mail*Link SMTP-QM 3.0.2" is... FYI, it's the StarNine QuickMail-SMTP gateway package. It gateways Internet mail to a Macintosh QuickMail server. (StarNine, which publishes the Mac Web server, WebStar, recently was acquired by Quarterdeck.) Nick From owner-robots Sun Dec 17 18:53:29 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20783; Sun, 17 Dec 95 18:53:29 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510100acfa871416e8@[205.240.23.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 17 Dec 1995 18:53:55 -0800 To: robots@webcrawler.com From: david@olympic.net (David Henderson) Subject: New Robot??? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I discovered this robot hitting my server. From: hyrax.bio.indiana.edu. User-Agent: WebSCANNER libwww-perl/0.20 I have a very limited knowledge about robots so far. Is this a known robot? 
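One quick way to get a handle on questions like this is to tally the agent log. A minimal Perl sketch, assuming an NCSA-style agent_log with one User-Agent string per line; the filename, like everything else here, is only an example:

  #!/usr/bin/perl
  # Tally User-Agent strings from an NCSA-style agent_log
  # (one agent string per line).  The filename is only an example.
  open(LOG, "agent_log") || die "can't open agent_log: $!";
  while (<LOG>) {
      chomp;
      $seen{$_}++;
  }
  close(LOG);
  foreach $agent (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
      printf "%6d  %s\n", $seen{$agent}, $agent;
  }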
______________________________________________________________________ David Henderson Webmaster QUICKimage HOME PH/FAX: 360-377-2182 WORK PH: 206-443-1430 WORK FAX: 206-443-5670 From owner-robots Mon Dec 18 04:23:53 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16169; Mon, 18 Dec 95 04:23:53 -0800 From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9512181220.AA03749@kronos> Subject: Re: Announcement and Help Requested To: robots@webcrawler.com Date: Mon, 18 Dec 1995 12:20:21 +0000 (GMT) In-Reply-To: <9512141034.AA19413@osiris.sund.ac.uk> from "Simon.Stobart" at Dec 14, 95 10:34:13 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jeremy.Ellman@mari.co.uk > > New Robot Announcement > ~~~~~~~~~~~~~~~~~~~~~~ > Name: IncyWincy > Home: University of Sunderland, UK > Implementation Language: C++ > Supports Robot Exclusion standard: Yes > Purpose: Various research projects > Status: This robot has not yet been released outside of Sunderland > Authors: Simon Stobart, Reg Arthington > > Help Requested > ~~~~~~~~~~~~~~ > The user-agent, from and referer http fields are not set to anything currently. Obviously, I wish these to conatin informative information. So, how do you send this information to the web server? > > The values which I wish to set these fields to are: > > User-Agent: IncyWincy V?.? > From: simon.stobart@sunderland.ac.uk > > Many Thanks > > |------------------------------------+-------------------------------------| > | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | > | Lecturer in Computing, | Voice: (+44) 091 515 2783 | > | School of Computing | Fax: (+44) 091 515 2781 | > | & Information Systems, + ------------------------------------| > | University of Sunderland, SR1 3SD, | 007: Balls Q? | > | England. | Q: Bolas 007! | > |------------------------------------|-------------------------------------| > From owner-robots Mon Dec 18 06:46:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22346; Mon, 18 Dec 95 06:46:15 -0800 Date: Mon, 18 Dec 1995 14:45:51 GMT From: cs0sst@isis.sunderland.ac.uk (Simon.Stobart) Message-Id: <9512181445.AA10893@osiris.sund.ac.uk> To: robots@webcrawler.com Subject: Re: Announcement and Help Requested X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I got this from you - but there is no message. Simon ----- Begin Included Message ----- From owner-robots@webcrawler.com Mon Dec 18 14:34 GMT 1995 From: jeremy@mari.co.uk (Jeremy.Ellman) Subject: Re: Announcement and Help Requested Date: Mon, 18 Dec 1995 12:20:21 +0000 (GMT) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Jeremy.Ellman@mari.co.uk > > New Robot Announcement > ~~~~~~~~~~~~~~~~~~~~~~ > Name: IncyWincy > Home: University of Sunderland, UK > Implementation Language: C++ > Supports Robot Exclusion standard: Yes > Purpose: Various research projects > Status: This robot has not yet been released outside of Sunderland > Authors: Simon Stobart, Reg Arthington > > Help Requested > ~~~~~~~~~~~~~~ > The user-agent, from and referer http fields are not set to anything currently. Obviously, I wish these to conatin informative information. So, how do you send this information to the web server? > > The values which I wish to set these fields to are: > > User-Agent: IncyWincy V?.? 
> From: simon.stobart@sunderland.ac.uk > > Many Thanks > > |------------------------------------+-------------------------------------| > | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | > | Lecturer in Computing, | Voice: (+44) 091 515 2783 | > | School of Computing | Fax: (+44) 091 515 2781 | > | & Information Systems, + ------------------------------------| > | University of Sunderland, SR1 3SD, | 007: Balls Q? | > | England. | Q: Bolas 007! | > |------------------------------------|-------------------------------------| > ----- End Included Message ----- From owner-robots Mon Dec 18 07:44:40 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25834; Mon, 18 Dec 95 07:44:40 -0800 Date: Mon, 18 Dec 1995 10:42:39 -0600 (CST) From: Cees Hek <hekc@phoenix.cis.mcmaster.ca> To: robots@webcrawler.com Subject: Re: Checking Log files In-Reply-To: <v02130503acf687e1f79d@[202.237.148.40]> Message-Id: <Pine.LNX.3.91.951218102344.6222A-100000@phoenix.cis.mcmaster.ca> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Now that things have calmed down a bit on this list..... :-) What I was looking for was something that actually did some analysis on the log file, like a log statistics package but one that was geared toward robots. It would check to see if the robots are actually following the standard for Robot exclusion. It could check if there are multiple accesses to the server and how far apart they are, how many times in a month the robot returns, and how often the robots.txt file is accessed to name a few. If nothing like this has been written, I may just write it myself (if I can find some free time). I would welcome any suggestions as to what a program like this should contain. For now though I guess I will have to live with grepping the log file.... Cees Hek Computing & Information Services Email: hekc@mcmaster.ca McMaster University Hamilton, Ontario, Canada On Fri, 15 Dec 1995, Mark Schrimsher wrote: > You can make a robots.txt file that permits all accesses, and then check > the log for requests for that file. But it won't catch robots that don't > check for robots.txt. > > --Mark From owner-robots Mon Dec 18 14:59:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19060; Mon, 18 Dec 95 14:59:15 -0800 Message-Id: <9512182259.AA19051@webcrawler.com> To: robots Subject: test; please ignore From: Martijn Koster <m.koster@webcrawler.com> Date: Mon, 18 Dec 1995 14:59:15 -0800 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [2]RE>[2]RE>[5]RE>Checking 12/15/95 Thanks for you message. I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail before I get back. Do ingore this message; it's not true, and merely a test-case for a new bounce rule in majordomo, designed to prevent at least some "Roger Dearnaley" problems. The fact you saw this message indicates the initial easy fix didn't work, so it's back to the drawing board :-( Don't worry, further testing will take place on a specific test list... 
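Picking up Cees Hek's log-analysis question above: a rough Perl sketch of the per-host report he describes, assuming a Common Log Format access_log. The filename and the 50-hit threshold are arbitrary, and computing inter-request intervals would additionally require parsing the timestamps (e.g. with Time::Local):

  #!/usr/bin/perl
  # Rough sketch of the per-host robot report described above.
  # Assumes a Common Log Format access_log; names and thresholds are examples.
  open(LOG, "access_log") || die "can't open access_log: $!";
  while (<LOG>) {
      # host ident user [date] "METHOD /path HTTP/1.0" status bytes
      next unless m!^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)!;
      ($host, $path) = ($1, $3);
      $hits{$host}++;
      $robotstxt{$host}++ if $path eq "/robots.txt";
  }
  close(LOG);
  foreach $host (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
      next if $hits{$host} < 50;    # only report busy clients
      printf "%-30s %6d hits, robots.txt fetched %d times\n",
             $host, $hits{$host}, $robotstxt{$host} || 0;
  }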
-- Martijn __________ Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Mon Dec 18 19:07:09 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24551; Mon, 18 Dec 95 19:07:09 -0800 Message-Id: <v02130503acfbdc9fea77@[202.237.148.35]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 19 Dec 1995 12:06:43 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: test; please ignore Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > [2]RE>[2]RE>[5]RE>Checking 12/15/95 > >Thanks for you message. > >I am on vacation until Jan 5th, 1996, and am unlikely to check my e-mail >before I get back. > >Do ingore this message; it's not true, and merely a test-case for >a new bounce rule in majordomo, designed to prevent at least some >"Roger Dearnaley" problems. The fact you saw this message indicates >the initial easy fix didn't work, so it's back to the drawing board :-( >Don't worry, further testing will take place on a specific test list... > >-- Martijn >__________ >Email: m.koster@webcrawler.com >WWW: http://info.webcrawler.com/mak/mak.html Martijn: How can I subscribe to the test list. ;-) --<arl From owner-robots Thu Dec 21 07:50:52 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11270; Thu, 21 Dec 95 07:50:52 -0800 Date: Thu, 21 Dec 1995 07:57:20 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510101acfec33d79dd@[204.182.64.25]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: david@quickimage.com (David Henderson) Subject: Re: test; please ignore Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com test _____________________________________________________________ David Henderson - Webmaster - QUICKimage _____ HOME PH/FAX: 360-377-2182 / \ WORK PH: 206-443-1430 @ 0 0 @ WORK FAX: 206-443-5670 | \_/ | Check out my newest creation "MeatPower" \_____/ at 'http://www.qinet.com/meat' From owner-robots Fri Dec 22 07:31:42 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07619; Fri, 22 Dec 95 07:31:42 -0800 Message-Id: <199512221530.HAA09354@sparty.surf.com> Date: Thu, 21 Dec 95 19:29:36 -0800 From: Murray Bent <murrayb@surf.com> X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Unfriendly Lycos , again ... X-Url: http://www.whitehouse.gov/White_House/Publications/html/Publications.html Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Is anyone getting requests from an anonymous robot, presumably, from the lycos domain presumably (cmu.edu), as follows .. bragi.cc.cmu.edu - - [15/Dec/1995:11:42:50 -0800] "GET /" 200 3039 bragi.cc.cmu.edu - - [15/Dec/1995:22:30:21 -0800] "GET /" 200 3039 bragi.cc.cmu.edu - - [17/Dec/1995:21:08:52 -0800] "GET /" 200 3039 bragi.cc.cmu.edu - - [22/Dec/1995:06:51:27 -0800] "GET /" 200 3413 Nothing appears in the agents log. 
mj From owner-robots Sat Dec 23 05:57:14 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08742; Sat, 23 Dec 95 05:57:14 -0800 Message-Id: <01BAD181.37E12BC0@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: Inter-robot Comms Port Date: Sat, 23 Dec 1995 21:54:20 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Has anyone thought about applying for a TCP port number dedicated to intercommunication between various robots, as well as the additional protocol for exchange of info? As many of you will have seen, and some I have spoken to, I have developed a web crawler (http://funnelweb.net.au) which performs searches/indexing for the South Pacific countries (based in Australia), selectable by individual country. I have received a lot of queries in regard to others using the code for various projects in other countries (and even internal corporate networks). As a result, I'm currently implementing a distributed searching/indexing facility. The best approach I can think of is to have a dedicated port which can be used by remote agents to either conduct searches of another country's data or to register a URL for processing and indexing by that agent for the remote database (hope that makes sense). Any comments on this would be GREATLY appreciated. Regards, David Eagles PlaNET Consulting Pty Limited Brisbane Australia From owner-robots Tue Dec 26 12:42:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04648; Tue, 26 Dec 95 12:42:19 -0800 Message-Id: <199512262042.PAA25736@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Inter-robot Comms Port In-Reply-To: Your message of "Sat, 23 Dec 1995 21:54:20." <01BAD181.37E12BC0@pluto.planets.com.au> Date: Tue, 26 Dec 1995 15:42:08 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com this is precisely what harvest http://www.cs.colorado.edu/harvest provides for, with an information model which is widely applicable. this software provides a very nice place to build such systems on top of, imo. for some of the related TRs see.. http://harvest.cs.colorado.edu/harvest/user-manual-1.1/node73.html -john > Has anyone thought about applying for a TCP port number dedicated to > intercommunication between various robots, as well as the additional > protocol for exchange of info? As many of you will have seen, and some > I have spoken to, I have developed a web crawler > (http://funnelweb.net.au) which performs searches/indexing for the South > Pacific countries (based in Australia), selectable by individual > country. I have received a lot of queries in regard to others using the > code for various projects in other countries (and even internal > corporate networks). As a result, I'm currently implementing a > distributed searching/indexing facility. The best approach I can think > of is to have a dedicated port which can be used by remote agents to > either conduct searches of another country's data or to register a URL > for processing and indexing by that agent for the remote database (hope > that makes sense). > > Any comments on this would be GREATLY appreciated.
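Purely to make the quoted proposal concrete - nothing here is an agreed protocol, and the port number, peer name, and verbs are invented for illustration - a URL-registration exchange could be as simple as one line per URL over a TCP connection. A Perl sketch of the sending side:

  #!/usr/bin/perl
  # Illustration only: push discovered URLs to a peer robot over a
  # hypothetical dedicated port, one "REGISTER <url>" line per URL.
  use IO::Socket::INET;

  $peer = "robot.example.au";    # invented peer name and port number
  $port = 2001;

  $sock = IO::Socket::INET->new(PeerAddr => $peer,
                                PeerPort => $port,
                                Proto    => 'tcp')
      or die "connect to $peer:$port failed: $!";
  foreach $url (@ARGV) {
      print $sock "REGISTER $url\r\n";
      $reply = <$sock>;          # expect something like "OK" or "SKIP reason"
      print "$url => $reply";
  }
  print $sock "QUIT\r\n";
  close($sock);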
From owner-robots Tue Dec 26 14:42:43 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09548; Tue, 26 Dec 95 14:42:43 -0800 Message-Id: <199512261040.CAA07443@www2> Date: Tue, 26 Dec 95 02:40:33 -0800 From: Super-User <root@www2> X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Inter-robot Comms Port X-Url: http://www.niyp.com/cgi/nyp_narrow.cgi Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >this is precisely what harvest http://www.cs.colorado.edu/harvest provides Don't let this stop you trying to build a better one, though! From owner-robots Tue Dec 26 16:47:38 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14539; Tue, 26 Dec 95 16:47:38 -0800 Date: Wed, 27 Dec 1995 01:45:49 +0100 (GMT+0100) From: Carlos Baquero <cbm@di.uminho.pt> To: Super-User <root@www2.webcrawler.com> Cc: robots@webcrawler.com Subject: Re: Inter-robot Comms Port In-Reply-To: <199512261040.CAA07443@www2> Message-Id: <Pine.LNX.3.91.951227013834.154C-100000@poe.di.uminho.pt> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 565 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Tue, 26 Dec 1995, Super-User wrote: > >this is precisely what harvest http://www.cs.colorado.edu/harvest provides > > Don't let this stop you trying to build a better one, though! > Yes. Specially if its a common interface for the interchange of information among robot databases. But profit might interfere with such a project. Carlos Baquero PhD Student, Distributed Systems Fax +351 (53) 612954 University of Minho, Portugal Voice +351 (53) 604475 cbm@di.uminho.pt http://shiva.di.uminho.pt/~cbm From owner-robots Thu Dec 28 13:38:21 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09788; Thu, 28 Dec 95 13:38:21 -0800 To: robots@webcrawler.com Subject: Re: Unfriendly Lycos , again ... X-Url: http://www.miranova.com/%7Esteve/ References: <199512221530.HAA09354@sparty.surf.com> From: steve@miranova.com (Steven L. Baur) Date: 28 Dec 1995 13:36:07 -0800 In-Reply-To: Murray Bent's message of 21 Dec 1995 19:29:36 -0800 Message-Id: <m2u42keurc.fsf@diana.miranova.com> Organization: Miranova Systems, Inc. Lines: 8 X-Mailer: September Gnus v0.26/XEmacs 19.13 Mime-Version: 1.0 (generated by tm-edit 7.38) Content-Type: text/plain; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Is anyone getting requests from an anonymous robot, presumably, > from the lycos domain presumably (cmu.edu), as follows .. BRAGI.CC.CMU.EDU - - [16/Dec/1995:19:18:55 +0800] "GET /" 200 2427 One request since August doesn't seem unfriendly to me. 
-- steve@miranova.com baur From owner-robots Thu Dec 28 17:47:01 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21949; Thu, 28 Dec 95 17:47:01 -0800 Message-Id: <01BAD5E2.CD1A7BA0@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: Inter-robot Communications - Part II Date: Fri, 29 Dec 1995 11:42:56 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Well, I never expected to receive such a favourable response about a standard port/protocol for communication between robots. Although the work being done by Harvest was mentioned several times, the great majority of people who replied thought the Harvest system was too complicated now, and I believe it also lacks some useful features (and it's not on a standardised port yet). I'm going away for a couple of weeks, but I'll put some thought into it during that time. Any comments, requests, ideas for any aspect would be greatly appreciated (after all, that's how the Internet was built). When I return I'll set up a part of my WWW server dedicated to this project (think I'll call it Project Asimov - seems appropriate for a global robot communications system). The key features I have thought of so far are listed below, so you can comment on these also (ie. tell me if I'm being too stupid/ambitious/etc) 1. Dedicated port approved as an Internet standard port number. (What does this require?) 2. Protocol (similar to FTP I think) which allows remote agents to exchange URL's, perform searches and get the results in a standard format, database mirroring(?), etc. The idea behind this is that if Robot A finds a URL handled by another remote Robot (such as by domain name, keywords(?), etc), then it can inform the remote robot of its existence. Similarly, if a user wants to search for something which happens to be handled by the remote server, a standard data format will be returned which can then be presented in any format. 3. A method of correlating Robots with specialties (what the robot is for). An approach similar to DNS may come in handy here - limited functionality could be obtained by using a "hosts" type file (called "robots" ?), while large scale, transparent functionality would probably require a centralised site which would maintain a list of all known robots and their specialties. Remote robots would download the list (or search parts of it) as required. This could probably be another protocol command on the port above. 4. A standard set of data, plus some way to extend it for implementation-specific users. I use the following fields in FunnelWeb URL Title (from <TITLE>) Headings (from <Hx>) Link Descriptions (from <A HREF="">...</A>) Keywords (from user entry) Body Text (from all other non-HTML text) Document Size (from Content-Length: server field) Last-Modified Date (from Last-Modified: server field) Time-To-Live (server dependent) This also highlights one MAJOR consideration - These fields are generally only useful to HTML robots. Something needs to be considered to handle any input format, including FTP, WAIS and GOPHER. Well, this is now MUCH longer than I first intended it to be. Sorry to have wasted your time and bandwidth. Hope you all had a great Christmas and have a Happy New Year.
Regards, David From owner-robots Fri Dec 29 06:06:20 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22493; Fri, 29 Dec 95 06:06:20 -0800 Message-Id: <9512291705.AA3706@wscnotes.hammer.net> To: robots <robots@webcrawler.com> From: "Christopher J. Tomasello/WSC" <Christopher_J.._Tomasello@hammer.net> Date: 29 Dec 95 9:04:46 EDT Subject: unknown robot Mime-Version: 1.0 Content-Type: Text/Plain Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Anyone with information on this robot please respond to the group or perferably to me ctomasello@hammer.net There is a robot hitting our web server on a regular basis. It hits every file on the server in a very rapid rate (many requests per second). The curious thing about this is that the robot is using our IP/domain name to gain access. So in the log files it looks like one of our internal servers is hitting the site. Also, all the the requests this robot makes are returning 404 errors. I have heard rumors that the Alta Vista spider is doing this kind of spoofing - but I have also heard that it is not. Any information would be greatly appreciated. From owner-robots Fri Dec 29 08:26:19 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29356; Fri, 29 Dec 95 08:26:19 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad09a89121ec@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 29 Dec 1995 08:26:27 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Inter-robot Communications - Part II Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 4:42 AM 12/29/95, David Eagles wrote: > Well, I never expected to receive such a favourable response > about a standard port/protocol for communication between robots. Cool. Many robot authors among the respondents? >The key features I have thought of so far as listed below, > so you can comment on these also (ie. tell m,e if I'm being too > stupid/ambitious/etc) > >1. Dedicated port approved as an Internet standard port number. > (What does this require?) Not sure, but there's no point until there is an RFC which specifies what the protocol on that port is doing, so concentrate on that first. >2. Protocol (similar to FTP I think) which allows remote agents > to exchange URL's, perform searchs and get the results in a standard > format, database mirroring(?), etc. Why on earth like FTP? FTP is reseonably complex and inefficient. If we're talking about the web, use HTTP! We know how/that that works, we have many implementations, and it's reaseonably OK. It's at least as efficient as FTP for this kind of thing, and when HTTP/NG comes along you can just plug that in. This allows you to concentrate on just the data format; so just invent a new Media type: text/foo. > The idea behind this is that if Robot A finds a URL handled by another > remote Robot (such as by domain name, keywords(?), etc), then it can > inform the remote robot of it's existance. This would be easy deployable if you use HTTP: POST a form or PUT a file using a client library such as libwww-perl, and handle it in a CGI script. Hey, we'll just link it to our submit form :-) This has been discussed before actually, we never got time to make it go anywhere... > Similarly, if a user wants to search for something which happens to be > handled by the remote server, a standard data format will be returned > which can them be presented in any format. 
Distributed interactive searching is more complicated than that though... what do you do when there are three thousand of these servers around? It is also complicated because these days you don't want all results; there are too many of them. But masaging the results using whatever selection and relevance feedback is very robot-specific, because everyone uses different kinds of search engines. This sounds to me like a problem for which there is no good and easy answer. However, it'd be nice to come up with a way to efficiently hoover other robots for URL's; this could be done with a mechanism such as you describe. What we can learn from Harvest is that these issues can be separated into different processes, making it all a bit more flexible and clear. >3. A method of correlating Robots with specialties (what the robot is for). > An approach similar to DNS may come in handy here - > limited functionality could be obtained by using a "hosts" type file > (called "robots" ?), while large scale, transparent functionality would > probably require a centralised site which would maintain a list of all > know robots and their specialties. Remote robots would download the > list( or search parts of it) as required. This could probably be > another protocol command on the port above. The words "scalable" and "centralised site" don't mix :-) Hmmm... expressing "what the robot is for" is probably very difficult to express. Meta information categorization is always a nightmare. What classification to use? >4. A standard set of data, plus some way to extend it for implementation > specific users. I use the following fields in FunelWeb > URL > Title (from <TITLE>) > Headings (from <Hx>) > Link Descriptions (from <A HREF="">...</A>) > Keywords (from user entry) > Body Text (from all other non-HTML text) > Document Size (from Content-Length: server field) > Last-Modified Date (from Last-Modified: server field) > Time-To-Live (server dependant) Hmm, this is where it gets tricky. URL, Title, and Keywords are obvious. Content-length and Last-Modified Date sound good, but do under-represent the HTTP server response; what about Content-language and other variants? Headings, link descriptions, and body text: hmmm. Which headers, how are they ordered? same for links? What is "body text" in the company of HTML tables etc? What about losing info from in-line images and HEAD elements? What about frames? This is a slippery slope; why not simply send the entire document content compressed? As efficient, and gives complete freedom) Also check out the URC work, sounds like some potential overlap here. (Damn, I'm starting to sound like Dan Connoly :-) > This also highlights one MAJOR consideration - These fields are generally > only useful to HTML robots. Something needs to be considered to handle > any input format, including FTP, WAIS and GOPHER. Even if you ignore that for now you'd be scoring... I'll share a different idea I had about this stuff (oe, or do I need to patent it first these days?). If in the distributed gathering part we start sending URL's, HTTP response headers, and complete content, doesn't the word "caching proxy" spring to mind? I need to think more about this, but it sounds to me that if we had an efficient way of updating and pre-loading distributed caches using a between-cache protocol, we'd be killing two birds with one stone: better caching performance than the current 30%, and complete freedom to do whatever you want with the content for robot purposes... 
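To make the POST-a-form suggestion earlier in this message concrete, here is a sketch using the later LWP flavour of libwww-perl; the endpoint and form field names are invented, and the 1995-era libwww-perl interface was different:

  #!/usr/bin/perl
  # Sketch: tell a remote robot about a URL by POSTing to a CGI script.
  # The endpoint and form field names are invented.
  use LWP::UserAgent;
  use HTTP::Request::Common qw(POST);

  $ua = LWP::UserAgent->new;
  $ua->agent("FunnelWeb-notify/0.1");    # identify ourselves, as usual

  $res = $ua->request(POST "http://peer.example.net/cgi-bin/register-url",
                      [ url => $ARGV[0], source => "FunnelWeb" ]);
  if ($res->is_success) { print "accepted\n"; }
  else { print "failed: ", $res->status_line, "\n"; }

The receiving side is then just an ordinary CGI script that queues the submitted URL for the local robot, which is more or less what the existing submit forms already do.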
Happy New Year, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Fri Dec 29 08:50:27 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00753; Fri, 29 Dec 95 08:50:27 -0800 Message-Id: <199512291646.OAA01900@desterro.edugraf.ufsc.br> X-Sender: fernando@edugraf.ufsc.br Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 29 Dec 1995 14:43:48 -0400 To: robots@webcrawler.com From: fernando@edugraf.ufsc.br (Luiz Fernando) Subject: Re: unknown robot X-Mailer: <PC Eudora Version 1.4> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 09:04 AM 12/29/95 EDT, robots@webcrawler.com wrote: >I have heard rumors that the Alta Vista spider is doing this kind of spoofing >but I have also heard that it is not. Any information would be greatly >appreciated. btw, I would also like to receive any info on AltaVista's inner workings, please. fernando ---------------------------------------- fernando@hipernet.ufsc.br http://www.hiperNet.ufsc.br From owner-robots Fri Dec 29 09:47:16 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02998; Fri, 29 Dec 95 09:47:16 -0800 From: <monier@pa.dec.com> Message-Id: <9512291743.AA05536@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: unknown robot In-Reply-To: Your message of "29 Dec 95 09:04:46 EDT." <9512291705.AA3706@wscnotes.hammer.net> Date: Fri, 29 Dec 95 09:43:17 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Gang, I'm the father of Scooter, the robot behind Alta Vista. I can guarantee you that the robot does not do anything funny: no IP spoofing or other arguable behavior. It is usually run from scooter.pa-x.dec.com, sometimes for short tests from inside the Digital firewall. It sets the following fields: User-Agent: Scooter/1.0 scooter@pa.dec.com From: scooter@pa.dec.com and it is registered at Martijn's site. It's anything but a stealth robot. So please help squash this rumor. I'll run this message by our network gurus, they might think of ways of catching the bad guys. Cheers, --Louis From owner-robots Fri Dec 29 12:10:00 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08527; Fri, 29 Dec 95 12:10:00 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199512292009.PAA24281@umbc8.umbc.edu> Subject: Re: Inter-robot Communications - Part II To: robots@webcrawler.com Date: Fri, 29 Dec 1995 15:09:51 -0500 (EST) In-Reply-To: <01BAD5E2.CD1A7BA0@pluto.planets.com.au> from "David Eagles" at Dec 29, 95 11:42:56 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 2707 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "DE" == David Eagles spake thusly: DE> DE> Well, I never expected to receive such a favourable response about a DE> standard port/protocol for communication between robots. Although the DE> work being done by Harvest was mentioned several times, the great DE> majority of people who replied thought the Harvest system was too DE> complicated now, and I believe it also lacks some useful features (and DE> it's not on a standardised port yet). DE> DE> I'm going away for a couple of weeks, but I'll put some thought into it DE> during that time. Any comments, requests, ideas for any aspect would be DE> greatly appreciated (after all, that's how the Internet was built).
DE> When I return I'll set up a part of my WWW server dedicated to this DE> project (think I'll call it Project Asimov - seems appropriate for a DE> global robot communications system). DE> DE> The key features I have thought of so far are listed below, so you can DE> comment on these also (ie. tell me if I'm being too DE> stupid/ambitious/etc) DE> DE> 1. Dedicated port approved as an Internet standard port number. (What DE> does this require?) DE> 2. Protocol (similar to FTP I think) which allows remote agents to DE> exchange URL's, perform searches and get the results in a standard DE> format, database mirroring(?), etc. The idea behind this is that if DE> Robot A finds a URL handled by another remote Robot (such as by domain DE> name, keywords(?), etc), then it can inform the remote robot of its DE> existence. Similarly, if a user wants to search for something which DE> happens to be handled by the remote server, a standard data format will DE> be returned which can then be presented in any format. DE> 3. A method of correlating Robots with specialties (what the robot is DE> for). An approach similar to DNS may come in handy here - limited DE> functionality could be obtained by using a "hosts" type file (called DE> "robots" ?), while large scale, transparent functionality would probably DE> require a centralised site which would maintain a list of all known DE> robots and their specialties. Remote robots would download the list (or DE> search parts of it) as required. This could probably be another DE> protocol command on the port above. [snip] Before doing anything new on robot/agent communication, you may wish to look into some of the already in-place efforts, ie. KQML and KIF. Check out <http://www.cs.umbc.edu/kse>. They don't do everything you want, but they do provide a framework. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu If you believe in telekinesis, raise my hand. From owner-robots Fri Dec 29 15:10:16 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17154; Fri, 29 Dec 95 15:10:16 -0800 Message-Id: <01BAD696.20CAA5A0@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: Inter-robot Communications - Part II Date: Sat, 30 Dec 1995 09:06:36 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'll share a different idea I had about this stuff (oe, or do I need to patent it first these days?). If in the distributed gathering part we start sending URL's, HTTP response headers, and complete content, doesn't the word "caching proxy" spring to mind? I need to think more about this, but it sounds to me that if we had an efficient way of updating and pre-loading distributed caches using a between-cache protocol, we'd be killing two birds with one stone: better caching performance than the current 30%, and complete freedom to do whatever you want with the content for robot purposes... I'd actually had the same idea, but admittedly hadn't thought of taking it as far as a distributed cache situation. I currently generate my database using the cache files from a fairly large Australian ISP.
Just tar all the appropriate files I require (which is easy in the case of FunnelWeb because it uses domain names to determine if the data is relevant to it or not - I just tar *.au, *.nz, etc), then download them and let FunnelWeb go crazy. Accessing the local filesystem makes the initial data gathering very fast obviously, and I can then re-visit each host in the database and try a more thorough traversal of the site. Now, if we had a distributed cache mechanism, I wouldn't need to grab their cache file anymore - the robot itself could either access the cache files directly, or talk to the local cache handler using the between-cache protocol. The storage format of the CERN proxy-cache is quite convenient for file access by robots (except it should compress the data - I haven't looked at it lately, so if it does now please ignore the last comment). Unfortunately, the same problems come up as I described in the last message. It is a waste of bandwidth, time and storage to completely duplicate entire caches. The ideal way would be to have some selection criteria, but what? Later all, David
From owner-robots Fri Dec 29 17:01:25 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22417; Fri, 29 Dec 95 17:01:25 -0800 To: robots@webcrawler.com Cc: John_R_R_Leavitt@NL.CS.CMU.EDU Subject: Re: Unfriendly Lycos , again ... In-Reply-To: Your message of "28 Dec 95 13:36:07 PST." <m2u42keurc.fsf@diana.miranova.com> Date: Fri, 29 Dec 95 20:00:37 EST Message-Id: <9464.820285237@NL.CS.CMU.EDU> From: John_R_R_Leavitt@NL.CS.CMU.EDU Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com steve@miranova.com (Steven L. Baur) wrote: >> Is anyone getting requests from an anonymous robot, presumably, >> from the lycos domain presumably (cmu.edu), as follows .. > >BRAGI.CC.CMU.EDU - - [16/Dec/1995:19:18:55 +0800] "GET /" 200 2427 > >One request since August doesn't seem unfriendly to me. >-- >steve@miranova.com baur Please be aware that cmu.edu != lycos.com. We had our roots in CMU (and you may note that some of us are still getting/sending mail from there), but our operations are now almost entirely moved over to the lycos.com domain. Those remaining are in the cs.cmu.edu (computer science) subdomain, not cc.cmu.edu (computer club). Also, our spiders have always clearly identified themselves with the User-agent header. -John. John R. R. Leavitt | Director, Product Development | Lycos Inc. 412 268 8259 | jrrl@lycos.com | http://agent2.lycos.com:8001/jrrl/ Reading: Half the Day is Night by Maureen McHugh From owner-robots Sat Dec 30 12:02:15 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14410; Sat, 30 Dec 95 12:02:15 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad0b372b6628@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 30 Dec 1995 12:02:24 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: RE: Inter-robot Communications - Part II Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 2:06 AM 12/30/95, David Eagles wrote: > Now, if we had a distributed cache mechanism, I wouldn't need to grab > their cache file anymore - the robot itself could either access the > cache files directly, or talk to the local cache handler using the > between-cache protocol. Quite.
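For the cache-harvesting approach David describes, here is a sketch of the file-selection step. The directory layout assumed here - <root>/<scheme>/<host>/<path> - is only a guess at how a CERN-style cache stores entries, and the cache root and domain patterns are examples; adjust for the real layout.

  #!/usr/bin/perl
  # Sketch: walk a proxy-cache tree and list the cached files whose host
  # directory matches the domains we care about.
  # ASSUMPTION: cache laid out as <root>/<scheme>/<host>/<path>.
  use File::Find;

  $root = "/var/spool/proxy-cache";   # made-up cache root
  @want = ('\.au$', '\.nz$');         # South Pacific domains, per FunnelWeb

  find(\&check, $root);

  sub check {
      return unless -f $File::Find::name;
      # pull the host component out of <root>/<scheme>/<host>/...
      return unless $File::Find::name =~ m!^\Q$root\E/[^/]+/([^/]+)/!;
      $host = $1;
      foreach $pat (@want) {
          if ($host =~ /$pat/) { print "$File::Find::name\n"; last; }
      }
  }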
>The storage format of the CERN proxy-cache is quite convenient for file > access by robots (except it should compress the data - I haven't looked > at it lately, so if it does now please ignore the last comment). Hmmm... I have the feeling the CERN cache is far from ideal these days. > Unfortunately, the same problems come up as I described in the last > message. It is a waste of bandwidth, time and storage to completely > duplicate entire caches. The ideal way would be to have some > selection criteria, but what? You could do all sorts, but for just the caching side you can use the standard caching mechanisms, and base it on popularity etc. For content subject selection you'd have to find some other ways, but at least if you're sitting on a complete cache you have the freedom to choose. Wouldn't it be handy if you could run a java/Safe-perl/whatever selector on the remote cache, so it can choose for itself according to _your_ rules instead of the server? :-) Happy New Year all, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Sun Dec 31 08:54:08 1995 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09461; Sun, 31 Dec 95 08:54:08 -0800 Date: Sun, 31 Dec 1995 11:49:24 -0600 (CST) From: gil cosson <gil@rusty.waterworks.com> To: robots@webcrawler.com Subject: please add my site Message-Id: <Pine.LNX.3.91.951231114800.15693A-100000@rusty.waterworks.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Please crawl my site, but don't kill me... I am at http://www.waterworks.com thanks, gil. From owner-robots Mon Jan 1 09:56:04 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08566; Mon, 1 Jan 96 09:56:04 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140804ad0dbf423937@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 1 Jan 1996 09:56:20 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: please add my site Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 11:49 AM 12/31/95, gil cosson wrote: >Please crawl my site, but don't kill me... :-) > I am at http://www.waterworks.com Before anyone else is tempted: this is not an appropriate message for this mailing list; the list is for technical discussions. To invite robots round, check out Submit-it! or submit by hand to the services you want. Happy New Year all, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Jan 2 00:40:53 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15807; Tue, 2 Jan 96 00:40:53 -0800 Date: Tue, 2 Jan 96 17:40:33 KST From: dhkim@sarang.kyungsung.ac.kr (Dong-Hyun Kim) Message-Id: <9601020840.AA04808@sarang.kyungsung.ac.kr> To: robots@webcrawler.com Subject: Please Help ME!! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi... my dear Everyone. Happy New Year.. I've run Harvest program. But It can't Search and Gather 2bytes Languages like korean, japaness, etc So.. I'm so blue.. How can I fix that problem... If it can be possible where do I fix? Please help me.... I want search 2bytes Languages~~~~ from... 
DH http://sarang.kyungsung.ac.kr:8585 From owner-robots Tue Jan 2 12:28:10 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21443; Tue, 2 Jan 96 12:28:10 -0800 Message-Id: <199601022028.PAA18749@lexington.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Inter-robot Communications - Part II In-Reply-To: Your message of "Sat, 30 Dec 1995 12:02:24 MST." <v02140800ad0b372b6628@[199.221.45.139]> Date: Tue, 02 Jan 1996 15:28:06 -0500 From: "John D. Pritchard" <jdp@cs.columbia.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com hi, i think some of the issues coming up could be resolved with a another approach to the protocol problem. i think that negotiated protocols would be one of the most important aspects of a robot to robot link so that new protocols may be created or old ones extended at will. in the simplest approach, protocols are named and versioned, which ident strings can be used for communication set up. this presumes a basic subset of all robot protocols, which would be stateless, of course, for transferring the version string. higher level protocols may want to be stateful, as some search engines/web sites are now for narrow-casting, or for caching. (www.sony.com: the magic cookies it puts into its forms expire, or at least appear to "expire", they're having lots of problems lately) more interestingly, a minimal (stateless) inter-robot language would also provide for more sophisticated forms of negotiation. for example, systems with dynamic local indexing, eg, expiring magic cookies, could negotiate an expired context with previous search info which effectively restores the lost context. in such an environment, caching info is predetermined. for example, saving GET strings with search info. this meets the ideas raised by martijn. as mentioned there are lots of ways to do this, which can be negotiated to some degree. downloading a Java or Safe-perl program is one way, and could be a subset of the proposed protocol, since various people would have various ideas on how to do this. for example, what kind of namespace the script enters, or are there a class of scripts with particular init arguments. these things provide for narrow casting techniques which would be valuable for "client robots", intelligent agents. so, ive presented a radical view which would tend to promote imaginations to prefer just downloading scripts to some anarchy of protocol extensions. but no one has to support any extension one doesnt want to. on the otherhand, with more and more Java browsers, it's possible to download Java code (protocols or protocol extensions) into clients, providing a means for such anarchy. so, can we provide a framework for this kind of environment? i think most of it is already provided via HTTP and Java. the merging of these things under a common umbrella just serves to solve concurrency problems like caching and promote robot interoperability, like a more open than opendoc, ie, corba, http://www.cilabs.org/ approach to robots. another approach would have negotiated semantics, ie, lay out a protocol for doing everything you ever want to do. a whole new language. i think protocol ident strings are useful as a family identifier, but the string approach would require something more, maybe a concatenation of extension ident strings, for identifying extensions. this is the deterministic perspective. a one-degree less deterministic perspective would have extensions negotiated as methods required, as required. 
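To make the ident-string idea concrete (the protocol name and extension labels below are invented for illustration, not a proposal): each side announces a protocol family/version plus the extensions it speaks, and the pair proceed with the intersection. A small Perl sketch:

  #!/usr/bin/perl
  # Illustrative ident-string negotiation: keep only the extensions both
  # sides advertise, on top of a minimal stateless base protocol.
  sub negotiate {
      my ($ours, $their_banner) = @_;        # $ours is an array ref
      my %mine = map { $_ => 1 } @$ours;
      my ($family, $exts) = split(' ', $their_banner, 2);
      return undef unless $family =~ m!^ROBOTP/1\.!;   # base protocol check
      return [ grep { $mine{$_} } split(/,/, defined $exts ? $exts : "") ];
  }

  # Demo: we speak three extensions, the peer advertises two.
  $shared = negotiate([qw(SEARCH REGISTER CACHE-PUSH)],
                      "ROBOTP/1.0 REGISTER,EXPIRE-CONTEXT");
  print "negotiated: @$shared\n";            # prints "negotiated: REGISTER"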
the idea of "robustness in failures" as a general model of error catching and zero or more (degrees of) adaptation in communication can make a mess of intended versus actual semantics if applied to a protocol namespace. however, it's required anyway so the question is only to what degree it is a part of the architecture. von neumann permits it to be a principal component of his self-reproducing automata. this is the nondeterministic perspective. -john From owner-robots Wed Jan 3 06:43:07 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13806; Wed, 3 Jan 96 06:43:07 -0800 From: Byung-Gyu Chang <chitos@ktmp.kaist.ac.kr> Message-Id: <199601031357.WAA08007@ktmp.kaist.ac.kr> Subject: Re: Please Help ME!! To: robots@webcrawler.com Date: Wed, 3 Jan 1996 22:57:23 +0900 (KST) In-Reply-To: <9601020840.AA04808@sarang.kyungsung.ac.kr> from "Dong-Hyun Kim" at Jan 2, 96 05:40:33 pm X-Mailer: ELM [version 2.4 PL21-h4] Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-kr Content-Transfer-Encoding: 7bit Content-Length: 484 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Plz use comp.infosystems.harvest newsgroup. I think that Harvest is not the interest of this mailing-list ... ;) > > Hi... my dear Everyone. > Happy New Year.. > > I've run Harvest program. > But It can't Search and Gather 2bytes Languages like korean, japaness, etc > So.. I'm so blue.. > > How can I fix that problem... If it can be possible where do I fix? > Please help me.... > I want search 2bytes Languages~~~~ > > from... DH > > http://sarang.kyungsung.ac.kr:8585 > From owner-robots Fri Jan 5 13:31:54 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12199; Fri, 5 Jan 96 13:31:54 -0800 Message-Id: <n1391273301.8920@mail.intouchgroup.com> Date: 5 Jan 1996 13:34:36 -0800 From: "Roger Dearnaley" <roger_dearnaley@mail.intouchgroup.com> Subject: Infinite e-mail loop To: " " <robots@webcrawler.com> X-Mailer: Mail*Link SMTP-QM 3.0.2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I would like to appologise to everyone on the robots list for the inconvenience which my brain damaged mail software's vacation autoresponder caused about three weeks ago. --Roger Dearnaley From owner-robots Fri Jan 5 16:10:17 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23704; Fri, 5 Jan 96 16:10:17 -0800 Date: Fri, 5 Jan 1996 19:10:05 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601060010.TAA06882@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Infinite e-mail loop In-Reply-To: <n1391273301.8920@mail.intouchgroup.com> References: <n1391273301.8920@mail.intouchgroup.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I would like to appologise to everyone on the robots list for the inconvenience which my brain damaged mail software's vacation autoresponder caused about three weeks ago. No big deal. How many replies would you like? 
:-) Skip Montanaro skip@calendar.com (518)372-5583 Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com Internet Conference Calendar: http://www.calendar.com/conferences/ >>> ZLDF: http://www.netresponse.com/zldf <<< From owner-robots Fri Jan 5 17:57:45 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29373; Fri, 5 Jan 96 17:57:45 -0800 Date: Sat, 6 Jan 96 09:20:51 +1100 (EST) Message-Id: <v01530504ad13f013b1d5@[192.190.215.47]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Up to date list of Robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear sirs, Madams, We wish to make efficient use of services provided by active Robots. We have a submission service for new URL's called World Announce Archive at: http://www.com.au/aaa/linkform.html Does anyone have a up to date list of robots and search engines on the Web. The exisiting material is patchy. Keith AAA Australia Announce Archive / Tourist Radio Home of the Australian Cool Site of the Day ! http://www.com.au/aaa Postal: AAA Australia Announce Archive / Tourist Radio P.O. Box 202, Caringbah 2229 Australia From owner-robots Sun Jan 7 09:35:00 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21583; Sun, 7 Jan 96 09:35:00 -0800 Date: Sat, 6 Jan 1996 09:36:04 -0500 (EST) From: Matthew Gray <mkgray@Netgen.COM> X-Sender: mkgray@fairbanks To: robots@webcrawler.com Subject: Web Robot Message-Id: <Pine.SOL.3.91.960106092927.18700A-100000@fairbanks> Organization: net.Genesis Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Last night we were hit a few hundred times (spaced over many hours) by a robot calling itself 'Web Robot/OTWR:001p116 libwww/2.17' coming from 205.216.146.163. It did not request /robots.txt but was otherwise perfectly reasonable. This one is not on the list of known robots. On a related note, I've been running a number of robots, with User-Agent's as follows: webTool Wander Mk1 Matthew Gray ---------------------------- voice: (617) 577-9800 x240 net.Genesis fax: (617) 577-9850 68 Rogers St. mkgray@netgen.com Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray From owner-robots Sun Jan 7 13:16:16 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04582; Sun, 7 Jan 96 13:16:16 -0800 Date: Mon, 8 Jan 96 08:15:57 +1100 (EST) Message-Id: <v01530503ad168491c3b4@[192.190.215.50]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: radio@mpx.com.au (James) Subject: Re: Web Robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Can someone please advise where we can find this list of known robots. James AAA AAA World Announce Archive Home: Australian Cool Site of the Day ! and Daily News. Web: http://www.com.au/aaa Email: radio@mpx.com.au Postal: AAA Australia Announce Archive / Tourist Radio P.O. 
Box 202, Caringbah 2229 Australia From owner-robots Sun Jan 7 14:28:33 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08532; Sun, 7 Jan 96 14:28:33 -0800 Comments: Authenticated sender is <jakob@cybernet.dk> From: "Jakob Faarvang" <jakob@jubii.dk> Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Sun, 7 Jan 96 23:30:07 +0100 (CET) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: Web Robots Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) X-Info: Evaluation version at mail.cybernet.dk Message-Id: 22300767208411@cybernet.dk X-Info: cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Can someone please advise where we can find this list of known robots. http://info.webcrawler.com/mak/projects/robots/active.html - Jakob Faarvang From owner-robots Sun Jan 7 14:53:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09636; Sun, 7 Jan 96 14:53:22 -0800 Date: Sun, 7 Jan 96 2:50:47 CET From: Thomas Stets <stets@stets.bb.bawue.de> Message-Id: <30ef26f7.stets@stets.bb.bawue.de> Subject: Does this count as a robot? To: robots@webcrawler.com X-Mailer: ELM [version 2.3 PL11] for OS/2 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am currently writing (actually it's already written and I'm testing) a program to copy a subtree of a server to my machine. The reason behind this is that every so often I find a web site with some interesting information, but I don't have the time (or money - I have to pay for my connection) to study it all. From the first days of accessing the web I wished I could copy pages or complete subtress to my computer, graphics and all. Well, now I can. :-) OTOH, I don't want to upset anyone with my program. Any comments are appreciated. Here is the basic functionality: - The program starts at a given URL and follows all links that are in the same directory or below. (Starting with http://x/a/b/c/... it would follow /a/b/c/d/... but not /a/b/e/...) (except for IMG graphics) - It will, optionally, follow links to other servers one level deep. - No links with .../cgi-bin/... or ?parameters are followed. - Only http: links are followed. - No Document is requested twice. (To prevent loops) - It will identify itself with User-agent: and From: - It will use HEAD requests when refreshing pages. The program was started primarily for my own use, but I might release it as shareware (when I'm sure it's well-behaved). Since it is intended for the consumer market (it is written for OS/2), the users of this program will generally be connected by modem, (In my case currently with 14.400 bps) which helps keeping used bandwidth down. What I'd like to know: - Should this Program use /robots.txt? Is it the type of program that robots.txt is supposed to control? It is basically a web-browser, the retrieved pages will just be read offline. - How fast should I make my requests? Since this is not a robot in the sense that it visits many different hosts, and since it is not intended to traverse the whole server (after all, I have to store all the data on my PC and I have to pay for the connection), I'd rather not wait too long between requests. My Idea is to read single pages in a similar way the IBM WebExplorer does it: read the main dokument and get all the embedded graphics as fast as possible. Then wait some time (some seconds) before making the next request. 
- How is the general feeling towards copying web-pages for non-commercial use? TIA Thomas Stets -- ----------------------------------------------------------------------------- Thomas Stets ! Words shrink things that were Holzgerlingen, Germany ! limitless when they were in your ! head to no more than living size stets@stets.bb.bawue.de ! when they're brought out. CIS: 100265,2101 ! [Stephen King] ----------------------------------------------------------------------------- From owner-robots Mon Jan 8 01:59:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10834; Mon, 8 Jan 96 01:59:29 -0800 Date: Mon, 8 Jan 1996 09:59:44 GMT From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9601080959.AA09596@kronos> To: robots@webcrawler.com Subject: Re: Does this count as a robot? X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I am currently writing (actually it's already written and I'm testing) > a program to copy a subtree of a server to my machine. > Sounds like HTMLGOBBLE. Why not just do use that? I've been trying to fix some bugs in it but it's only real problem is that it does not respect robots.txt Jeremy Ellman MARI Computer Systems From owner-robots Mon Jan 8 07:57:56 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24873; Mon, 8 Jan 96 07:57:56 -0800 Date: Mon, 8 Jan 1996 08:09:24 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: Does this count as a robot? In-Reply-To: <30ef26f7.stets@stets.bb.bawue.de> Message-Id: <Pine.LNX.3.91.960108080309.2792A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Sun, 7 Jan 1996, Thomas Stets wrote: > Here is the basic functionality: > > - The program starts at a given URL and follows all links that > are in the same directory or below. (Starting with http://x/a/b/c/... > it would follow /a/b/c/d/... but not /a/b/e/...) > (except for IMG graphics) > - It will, optionally, follow links to other servers one level deep. > - No links with .../cgi-bin/... or ?parameters are followed. > - Only http: links are followed. > - No Document is requested twice. (To prevent loops) > - It will identify itself with User-agent: and From: > - It will use HEAD requests when refreshing pages. From your description, it is vulnerable to looping still. Many sites use symbolic links from lower to upper levels. If you try to suck 'everything', you will end up in an infinite recursion. You need a depth limit (no more than X '/' elements in the URL), and probably a total pages limit (no more than Y pages total) to prevent any obscure cases from sucking it down an unexpected rat hole. -- Benjamin Franz From owner-robots Mon Jan 8 08:05:05 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25286; Mon, 8 Jan 96 08:05:05 -0800 Message-Id: <199601081603.LAA00699@revere.musc.edu> Comments: Authenticated sender is <lindroth@atrium.musc.edu> From: "John Lindroth" <lindroth@musc.edu> Organization: Medical University of South Carolina To: "Christopher J. 
Tomasello/WSC" <Christopher_J.._Tomasello@hammer.net>, robots@webcrawler.com Date: Mon, 8 Jan 1996 11:03:54 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: unknown robot Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Have you ruled out the AutoSurf option in the latest Moasic (for Windows). It basically does the same thing as a robot, only to create a table of URL links. -John > To: robots <robots@webcrawler.com> > From: "Christopher J. Tomasello/WSC" > <Christopher_J.._Tomasello@hammer.net> > Date: 29 Dec 95 9:04:46 EDT > Subject: unknown robot > Reply-to: robots@webcrawler.com > Anyone with information on this robot please respond to the group or perferably > to me ctomasello@hammer.net > > There is a robot hitting our web server on a regular basis. It hits every file > on the server in a very rapid rate (many requests per second). The curious > thing about this is that the robot is using our IP/domain name to gain access. > So in the log files it looks like one of our internal servers is hitting the > site. Also, all the the requests this robot makes are returning 404 errors. > > I have heard rumors that the Alta Vista spider is doing this kind of spoofing - > but I have also heard that it is not. Any information would be greatly > appreciated. > > > ============================================= John Lindroth Senior Systems Programmer Academic & Research Computing Services Center for Computing & Information Technology Medical University of South Carolina E-Mail: lindroth@musc.edu URL: http://www.musc.edu/~lindroth ============================================= Any opinions expressed are mine, not my employer's. And they may be wrong (gasp!) ============================================= From owner-robots Mon Jan 8 10:11:41 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00846; Mon, 8 Jan 96 10:11:41 -0800 Subject: Re: Does this count as a robot? From: YUWONO BUDI <yuwono@uxmail.ust.hk> To: robots@webcrawler.com Date: Tue, 9 Jan 1996 02:10:06 +0800 (HKT) In-Reply-To: <Pine.LNX.3.91.960108080309.2792A-100000@ns.viet.net> from "Benjamin Franz" at Jan 8, 96 08:09:24 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 1458 Message-Id: <96Jan9.021013hkt.19035-3+186@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > On Sun, 7 Jan 1996, Thomas Stets wrote: > > Here is the basic functionality: > > > > - The program starts at a given URL and follows all links that > > are in the same directory or below. (Starting with http://x/a/b/c/... > > it would follow /a/b/c/d/... but not /a/b/e/...) > > (except for IMG graphics) > > - It will, optionally, follow links to other servers one level deep. > > - No links with .../cgi-bin/... or ?parameters are followed. > > - Only http: links are followed. > > - No Document is requested twice. (To prevent loops) > > - It will identify itself with User-agent: and From: > > - It will use HEAD requests when refreshing pages. > > >From your description, it is vulnerable to looping still. Many sites use > symbolic links from lower to upper levels. If you try to suck > 'everything', you will end up in an infinite recursion. You need a depth > limit (no more than X '/' elements in the URL), and probably a total > pages limit (no more than Y pages total) to prevent any obscure cases > from sucking it down an unexpected rat hole. 
One trick that I use to get around symbolic-link loops is to detect any recurring path segment (a /x/) in a URL. Hopefully, no web author creates a subdirectory with the same name as its parent or grand* parent directory (in which case my robot would think there is a loop and stop there). So far (half a thousand sites we visited), I haven't seen such a case. -Budi. From owner-robots Mon Jan 8 10:42:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01557; Mon, 8 Jan 96 10:42:23 -0800 Message-Id: <9601081842.AA01551@webcrawler.com> To: robots@webcrawler.com Subject: Re: Does this count as a robot? In-Reply-To: Your message of "Tue, 09 Jan 96 02:10:06 +0800." <96Jan9.021013hkt.19035-3+186@uxmail.ust.hk> Date: Mon, 08 Jan 96 18:41:18 +0000 From: M.Levy@cs.ucl.ac.uk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >One trick that I use to get around symbolic-link loops is to >detect any recurring path segment (a /x/) in a URL. Hopefully, >no web author creates a subdirectory with the same name as >its parent or grand* parent directory (in which case my robot would >think there is a loop and stop there). So far (half a thousand >sites we visited), I haven't seen such a case. > >-Budi. er, yes, but if there was such a naming convention then you wouldn't be able to tell the difference between that and a recurring path segment. Would you? |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||Micah Levy Department of Computer Science || || University College London || ||Web Page: http://www.cs.ucl.ac.uk/students/M.Levy/ || ||Email: M.Levy@cs.ucl.ac.uk Cestor@delphi.com || || zcacma0@cs.ucl.ac.uk Micah@delphi.com || |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| From owner-robots Mon Jan 8 11:03:53 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01697; Mon, 8 Jan 96 11:03:53 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601081904.UAA01225@wsinis10.win.tue.nl> Subject: avoiding infinite regress for robots To: robots@webcrawler.com Date: Mon, 8 Jan 1996 20:04:19 +0100 (MET) In-Reply-To: <Pine.LNX.3.91.960108080309.2792A-100000@ns.viet.net> from "Benjamin Franz" at Jan 8, 96 08:09:24 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 1154 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Benjamin Franz wrote: >Many sites use >symbolic links from lower to upper levels. If you try to suck >'everything', you will end up in an infinite recursion. You need a depth >limit (no more than X '/' elements in the URL), and probably a total >pages limit (no more than Y pages total) to prevent any obscure cases >from sucking it down an unexpected rat hole. I'm surprised that no spider seems to use the page content to guess whether or not two document trees are equal. For example, one heuristic would be to keep a checksum for every visited page, and to decide that two subtrees are probably equal if its root nodes and their children have iddentical checksums. Do spiders use the content to cut off walks, and if not, is it because alternative techniques are sufficient? Since my own spiders are rather simple-minded (and not widely used), I'd be interested in seeing a more informed opinion on the usefulness of comparing content. >Benjamin Franz -- Reinier Post reinpost@win.tue.nl a.k.a. 
<A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] From owner-robots Mon Jan 8 14:10:10 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02970; Mon, 8 Jan 96 14:10:10 -0800 Message-Id: <01BADE68.EAF50080@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: avoiding infinite regress for robots Date: Tue, 9 Jan 1996 08:03:08 +-1100 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BADE68.EAFE2840" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ------ =_NextPart_000_01BADE68.EAFE2840 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Benjamin Franz wrote: >Many sites use=20 >symbolic links from lower to upper levels. If you try to suck=20 >'everything', you will end up in an infinite recursion. You need a = depth=20 >limit (no more than X '/' elements in the URL), and probably a total=20 >pages limit (no more than Y pages total) to prevent any obscure cases=20 >from sucking it down an unexpected rat hole. I'm surprised that no spider seems to use the page content to guess = whether or not two document trees are equal. For example, one heuristic would be = to keep a checksum for every visited page, and to decide that two subtrees are = probably equal if its root nodes and their children have iddentical checksums. Do spiders use the content to cut off walks, and if not, is it because alternative techniques are sufficient? Since my own spiders are rather simple-minded (and not widely used), I'd be interested in seeing a more informed opinion on the usefulness of comparing content. Yep. This is one of the ways FunnelWeb (the latest version I haven't = quite released yet) checks for looping. What would be REALLY nice, however, would be if the HTML spec was = extended to include a Filename: field sent by the server for every = request. The field would specify the exact filename after all links = were resolved and would therefor eliminate a lot of the guess work, = parsing, etc required by clients, spiders, etc. Hope everyone had a great New Year. 
From owner-robots Mon Jan 8 16:35:02 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03922; Mon, 8 Jan 96 16:35:02 -0800 Message-Id: <199601090034.TAA10724@honsu.cis.ohio-state.edu> Subject: Re: Does this count as a robot?
To: robots@webcrawler.com Date: Mon, 8 Jan 1996 19:34:51 -0500 (EST) In-Reply-To: <9601081842.AA01551@webcrawler.com> from "M.Levy@cs.ucl.ac.uk" at Jan 8, 96 06:41:18 pm From: yuwono@uxmail.ust.hk (YUWONO BUDI) X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 864 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com According to your message: > > >One trick that I use to get around symbolic-link loops is to > >detect any recurring path segment (a /x/) in a URL. Hopefully, > >no web author creates a subdirectory with the same name as > >its parent or grand* parent directory (in which case my robot would > >think there is a loop and stop there). So far (half a thousand > >sites we visited), I haven't seen such a case. > > er, yes, but if there was such a naming convention then you wouldn't be able > to tell the difference between that and a recurring path segment. > Would you? I wouldn't. Then again I need not, because my philosophy on robot behavior is "be as non-aggressive as possible," so my robot would simply give it up. If I were a web site admin, I would appreciate that in a robot. Anyway, this trick was actually a 5-minute-hacking solution. -Budi. From owner-robots Tue Jan 9 01:39:03 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05871; Tue, 9 Jan 96 01:39:03 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199601090938.LAA18356@krisse.www.fi> Subject: Recursing heuristics (Re: Does this..) To: robots@webcrawler.com Date: Tue, 9 Jan 1996 11:38:47 +0200 (EET) In-Reply-To: <199601090034.TAA10724@honsu.cis.ohio-state.edu> from "YUWONO BUDI" at Jan 8, 96 07:34:51 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 2009 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I wouldn't. Then again I need not, because my philosophy on robot > behavior is "be as non-aggressive as possible," so my robot would > simply give it up. If I were a web site admin, I would appreciate > that in a robot. > Anyway, this trick was actually a 5-minute-hacking solution. > > -Budi. The robots home pages mention some heuristics that sould be used in recursive traversal, but it does not currently count those recently mentioned here, not to mention it does not count all that are necessary for a modern robot. I think it is time to collect a definitive list of minimum requirements and possible refinements for traversal algorithms. My robot Hämähäkki indexes *.fi -domain, Finland, 207147 URL:s currently, and I could list the following rules it follows (expressed as something like regular expressions): - check the recursion depth limit - check with robots.txt and if the path already was fetched and has not expired, whatever rules are used for that. - recurse only '.*/', '.*\.html?' and paths that seem like they just are missing the ending '/' and usually cause redirection to a index. This means something like '.*/~?[a-zA-Z0-9]+' that does not match '.*bin.*', '.*cgi.*' or '.*\..*' As you see I do not use HEAD check for the type of every link like for example the MOMspider. I might in the future. - drop paths like '.*/cgi-bin/.*', '.*[?=+].*' - drop paths like '.*\.html?/.*' - interpret things like '.*//.*', '.*/\./.*' and '.*/\.\.//*' correctly. These are quite restrictive and might make me miss something, but that's minor and they serve well. 
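For first-time robot writers, rules like these translate fairly directly into a URL filter. The Perl below is only a paraphrase of the description above, not Hämähäkki's actual code, and it assumes the depth, robots.txt and expiry checks happen elsewhere.

#!/usr/bin/perl
use strict;
use warnings;

sub worth_recursing {
    my ($path) = @_;

    # Drop CGI-ish and query-ish paths outright.
    return 0 if $path =~ m{/cgi-bin/} or $path =~ /[?=+]/;

    # Drop things like /doc.html/extra -- a file used as a directory.
    return 0 if $path =~ m{\.html?/};

    # Recurse into directories and HTML documents...
    return 1 if $path =~ m{/$} or $path =~ /\.html?$/;

    # ...and into paths that look like a directory missing its trailing '/'
    # (e.g. /~user or /pub), but not anything with 'bin', 'cgi' or a dot.
    return 1 if $path =~ m{/~?[a-zA-Z0-9]+$}
            and $path !~ /bin|cgi/
            and $path !~ /\./;

    return 0;
}

for my $p ('/staff/', '/staff/index.html', '/~jaakko', '/cgi-bin/search',
           '/doc.html/extra', '/pic.gif') {
    printf "%-20s %s\n", $p, worth_recursing($p) ? 'recurse' : 'skip';
}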
I am about to add recursion detection with content comparison by crc shortly. Even if it has not been a problem as only one or two sites out of 755 symlinked substantial subtrees and they were easy to pick out by hand. Otherwise none of the sites hit the recursion limit. Am I missing something important here? Let's collect something useful for the first-time robot-writers. And others. From owner-robots Tue Jan 9 04:18:56 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06291; Tue, 9 Jan 96 04:18:56 -0800 Message-Id: <01BADEDF.A3F92F40@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: Recursion Date: Tue, 9 Jan 1996 22:12:58 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Seems to me like there are quite a few people using CRC-like methods to detect recursion. As I am in the process of trying to work out a means of inter-robot communication, I think it may be useful to use a standard CRC algorithm. This way, communicating robots can more quickly and easily determine which URL's to exchange/reject. Now the tricky part - everyone will have they're own technique/algorithm for this so what will the standard be? Does anyone have a particularly good algorithm they would care to make available. It should produce a value which can be represented on any machine architecture (ie. doesn't use long long's, etc). A single or double "long" value may be the most simple, but of course suggestions would be welcomed (I won't suggest using a crypto key generation algorithm, even though I'd like to). Regards, David From owner-robots Tue Jan 9 07:33:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06807; Tue, 9 Jan 96 07:33:37 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130500ad183995c409@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 9 Jan 1996 07:34:35 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Duplicate docs (was avoiding infinite regress...) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I'm surprised that no spider seems to use the page content to guess whether or >not two document trees are equal. For example, one heuristic would be to keep >a checksum for every visited page, and to decide that two subtrees are probably >equal if its root nodes and their children have iddentical checksums. We've had requests for that behavior, not only due to sym links, but also because there are many copies of the same document within an enterprise network, and even more so when you're indexing large parts of the Internet. (Imagine how many copies of FAQs are out there, for example.) I think there are two main reasons it hasn't happened yet. One is just that it hasn't risen high enough in the priority list, at least for those of us who have commercial spider tools. For the most part, people are still happy just to get a spider *working* in a convenient, maintainable manner. Thus, most haven't even realized that sym links and duplicates are an issue. Second, the problem of duplicates is a slippery slope.
It's probably not hard to find 80 or 90 percent of them, but getting the last bunch, which aren't *exact* duplicates, is going to have to be quite clever, since brute force will probably be slow, at best. Nick From owner-robots Tue Jan 9 13:13:02 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11321; Tue, 9 Jan 96 13:13:02 -0800 From: mabzug1@gl.umbc.edu Message-Id: <199601092112.QAA03572@umbc10.umbc.edu> Subject: Re: Recursion To: robots@webcrawler.com Date: Tue, 9 Jan 1996 16:12:37 -0500 (EST) In-Reply-To: <01BADEDF.A3F92F40@pluto.planets.com.au> from "David Eagles" at Jan 9, 96 10:12:58 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1063 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "DE" == David Eagles spake thusly: DE> DE> Seems to me like there are quite a few people using CRC-like methods to = [snip] DE> Now the tricky part - everyone will have they're own technique/algorithm = DE> for this so what will the standard be? Does anyone have a particularly = DE> good algorithm they would care to make available. It should produce a = [snip] DE> simple, but of course suggestions would be welcomed (I won't suggest = DE> using a crypto key generation algorithm, even though I'd like to). Might I suggest the standard 'message digest' algorithm, md5, described in rfc1321? An md5 header line is even (officially) part of HTTP, although I haven't seen too many servers that return it. . . yet. There's a standard C implementation, and Neil Winton even put together a Perl implementation. See <http://www.gl.umbc.edu/~mabzug1/md5/md5.html> for (marginally) more information. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu 1st rule of intelligent tinkering - save all the parts From owner-robots Tue Jan 9 15:46:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15896; Tue, 9 Jan 96 15:46:22 -0800 Message-Id: <01BADF3B.12252A40@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: Recursion Date: Wed, 10 Jan 1996 09:07:27 +-1100 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ---------- From: mabzug1@gl.umbc.edu[SMTP:mabzug1@gl.umbc.edu] Sent: Wednesday, January 10, 1996 3:12 To: robots@webcrawler.com Subject: Re: Recursion "DE" == David Eagles spake thusly: DE> DE> Seems to me like there are quite a few people using CRC-like methods to = [snip] DE> Now the tricky part - everyone will have they're own technique/algorithm = DE> for this so what will the standard be? Does anyone have a particularly = DE> good algorithm they would care to make available. It should produce a = [snip] DE> simple, but of course suggestions would be welcomed (I won't suggest = DE> using a crypto key generation algorithm, even though I'd like to). Might I suggest the standard 'message digest' algorithm, md5, described in rfc1321? An md5 header line is even (officially) part of HTTP, although I haven't seen too many servers that return it. . . yet. There's a standard C implementation, and Neil Winton even put together a Perl implementation.
See <http://www.gl.umbc.edu/~mabzug1/md5/md5.html> for (marginally) more information. I agree totally. This was actually the crypto algorithm I was thinking of but couldn't think of the name. The fact that HTTP already specifies it's use (which I didn't know) doesn't really leave any other logical alternative. Guess this means I've got more work to do now :-( Thanks, David From owner-robots Wed Jan 10 06:48:31 1996 Return-Path:
<owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09610; Wed, 10 Jan 96 06:48:31 -0800 Date: Wed, 10 Jan 1996 09:48:22 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601101448.JAA20509@dolphin.automatrix.com> To: robots@webcrawler.com Subject: MD5 in HTTP headers - where? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com It was mentioned yesterday that there is an HTTP response header useful for sending back an MD5 digest. I just did a quick scan through http://www.w3.org/hypertext/WWW/Protocols/HTTP/HTTP2.html and didn't find any response headers that looked like they were related to use of MD5. Can someone give me a pointer? Thanks, Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Wed Jan 10 07:32:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11794; Wed, 10 Jan 96 07:32:28 -0800 Date: Wed, 10 Jan 1996 10:35:10 -0500 (EST) From: Adam Jack <ajack@corp.micrognosis.com> X-Sender: ajack@becks To: Robots <robots@webcrawler.com> Subject: robots.txt extensions Message-Id: <Pine.SUN.3.91.960110095606.1141B-100000@becks> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello, Since this list started I've only ever seen one suggestion for an extension to robots.txt. That, from Tim Bray, http://info.webcrawler.com/mailing-lists/robots/0001.html seemed sensible enough -- to add expiry information for the robots.txt file itself. No response appears to have been given -- did people not think it worth while? Did people think the HTTP response field, Expires, should be used for that? I don't know if this was discussed to death somewhere -- but are people still considering extensions to robots.txt? I'd be interested in any pointers to an archive of such a discussion. If there is point in discussion additions pls read on -- otherwise bin this mail. MinRequestInterval: X Minimum request interval in seconds, (0=no minimum), with a default, if missing, of 60. This is for those of us lowely enough not to have huge gathering tasks and the luxury ;-) of a backlog of URLs over distributed sites. (I.e. Those of us doing a sequential search exhausting our interest in a site in one slurp.) Additionally local admins would have more control over wanderers that visted. DefaultIndex: index.html Stating that XXXX/ and XXXX/index.html are identicle. You can argue that this is lamely inadequate - or that it makes a saving. I know the bigger issue is recusion. Here I am merely hoping to save those single page recusions. CGIMask: *.cgi Rather than guessing at CGI urls -- why not get the local admin to answer it? I know that the WN server uses a file extension to indicate a CGI script -- not /cgi-bin/. Q: Are CGI scripts universally avoided in advance -- or do robots look at the HTTP flags of results to try to work out wether some content is dynamically generated? Finally -- I never understood why robots.txt was exclusion only. Why does it not have some of positive hints added? I.e. you are allowed & welcome to browse XXXX/fred.html. Was this a choice built upon pragmatism -- thinking that this would open a can of worms? 
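None of these fields exist in the current exclusion standard, so the Perl below is only a sketch of how a robot might pick up the proposed MinRequestInterval, DefaultIndex and CGIMask lines if they were ever adopted; it reads a local copy of robots.txt and ignores the User-agent record structure for brevity.

#!/usr/bin/perl
use strict;
use warnings;

my %ext = (
    MinRequestInterval => 60,      # proposed default of one request a minute
    DefaultIndex       => undef,
    CGIMask            => undef,
);

open my $fh, '<', 'robots.txt' or die "can't read robots.txt: $!";
while (my $line = <$fh>) {
    $line =~ s/#.*//;                              # strip comments
    next unless $line =~ /^\s*(\S+)\s*:\s*(\S+)/;
    my ($field, $value) = ($1, $2);
    $ext{$field} = $value if exists $ext{$field};  # only the proposed fields
}
close $fh;

printf "wait at least %d seconds between requests\n", $ext{MinRequestInterval};
print  "treat '/' and '/$ext{DefaultIndex}' as the same document\n"
    if defined $ext{DefaultIndex};
print  "skip URLs matching '$ext{CGIMask}'\n" if defined $ext{CGIMask};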
Thanks for any feedback, Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html From owner-robots Wed Jan 10 08:12:34 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14262; Wed, 10 Jan 96 08:12:34 -0800 Message-Id: <199601101612.LAA17875@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com Date: Wed, 10 Jan 1996 11:08:47 -0400 Subject: Does anyone else consider this irresponsible? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Altavista, while a marvelous example of what can be done when you throw multiple hundreds of thousands of dollars at the idea of indexing Internet accessible resources, appears to extract data from a host by connecting to EVERY tcp port on the machine. Each probe appears to look for an HTTP service and if found, walks the tree on that port. Ignoring, for the moment that there are 32,000 available ports to probe and that that many tcp connections would seem to be rather excessive... Does anyone else have a problem with this kind of behavior? While I am cognizant of the use of the robots.txt file, it seems more than a little antisocial to index materials that are, for all intents and purposes, unpublished. I, for one, do not believe that just because I run a server on a port, that that gives anyone permission to index and provide others navigation to the material I serve from that port. Many times, a client needs to have access to the service, in the same manner as a typical user, and imposing passwords on the service is an unacceptable burden. I'm looking for comments on this before I take it to a higher level. Thanks. </rr> From owner-robots Wed Jan 10 08:33:08 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15456; Wed, 10 Jan 96 08:33:08 -0800 Message-Id: <199601101632.LAA22167@northsea.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 1996 11:08:47 -0400." <199601101612.LAA17875@mail.internet.com> Date: Wed, 10 Jan 1996 11:32:33 -0500 From: Stan Norton <norton@northsea.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <199601101612.LAA17875@mail.internet.com>, "Robert Raisch, The Inter net Company" writes: > >Altavista, while a marvelous example of what can be done when >you throw multiple hundreds of thousands of dollars at the idea >of indexing Internet accessible resources, appears to extract >data from a host by connecting to EVERY tcp port on the machine. >... > >I'm looking for comments on this before I take it to a higher >level. > >Thanks. </rr> agreed. absurd behavior. Stan -- Stan Norton -- norton@northsea.com From owner-robots Wed Jan 10 08:47:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16272; Wed, 10 Jan 96 08:47:26 -0800 Message-Id: <199601101646.IAA00680@sparty.surf.com> Date: Tue, 09 Jan 96 20:45:01 -0800 From: Super-User <murrayb@surf.com> X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? 
X-Url: http://www.lombard.com/cgi-bin/PACenter/Graph/graph Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com This search engine seemed to me to use and promote the robots-exclusion protocol. Maybe there should be a "URL delete" facility, since the scooter robot was operating long before the search engine became so accessible. Given the amount of resources devoted to this, I'm sure they could provide a "URL delete" facility! Any other requests?? murrayb@surf.com From owner-robots Wed Jan 10 09:38:49 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19011; Wed, 10 Jan 96 09:38:49 -0800 From: <monier@pa.dec.com> Message-Id: <9601101732.AA21836@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 96 11:32:33 EST." <199601101632.LAA22167@northsea.com> Date: Wed, 10 Jan 96 09:32:44 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Gang, I am the father of Scooter, and for the second time I need to rescue my robot from groundless accusations. Soon I'll have enough material for a book... Scooter is a regular robot: it follows links, and only follows links. It does not guess IP addresses, or try out all possible files names (one of my favorite), or spy on sites to guess the "secret test port", or anything like that. In this particular instance, I have to insist that Scooter does not "extract data from a host by connecting to EVERY tcp port". Over 130,000 sites times 32,000 possible ports would amount to a lot of stupid pinging with not much return! The Web is large enough, there is no need to invent new and exotic techniques to access more data. My current estimate BTW is that there are at least 50 million Web pages (text of some sort) publicly available and indexable (not covered by a robots.txt file), so there is really no lack of raw material. Could the next person who feels an urge to speak for the Alta Vista robot please check with me first? Nothing about this project is very secret, all you have to do is ask. Cheers, --Louis From owner-robots Wed Jan 10 10:44:38 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22425; Wed, 10 Jan 96 10:44:38 -0800 Message-Id: <199601101844.NAA21870@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com Date: Wed, 10 Jan 1996 13:40:45 -0400 Subject: Re: Does anyone else consider this irresponsible? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Re: Hoped for URL Delete facility in Altavista 1. There is no economic incentive for AV to provide such a feature. In fact, there is a strong disincentive as this would affect their claims of "N millions of URLs listed." 2. Why should I have to "unlist" something that they (IMHO) never should have harvested in the first place? 
</rr> From owner-robots Wed Jan 10 11:16:30 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23625; Wed, 10 Jan 96 11:16:30 -0800 Date: Wed, 10 Jan 1996 14:16:15 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601101916.OAA21393@dolphin.automatrix.com> To: robots@webcrawler.com Cc: bob@dolphin.automatrix.com, dick@dolphin.automatrix.com Subject: Responsible behavior, Robots vs. humans, URL botany... In-Reply-To: <199601101612.LAA17875@mail.internet.com> References: <199601101612.LAA17875@mail.internet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Robert Raisch writes: ... indexing Internet accessible resources, appears to extract data from a host by connecting to EVERY tcp port on the machine. I suspect it's a case of some fool deciding that the ends justify the means. On the Lycos search page they proudly announce: Lycos indexes 91% of the web! Select that link and you get: Lycos has indexed over 10.75 million pages throughout the world.... What could the Alta Vista folks do to top that? How about: You have access to all 8 billion words found in over 16 million Web pages. One way to get to stuff the Lycos folks couldn't find was to be a little more rapacious (ooh, I like that word - makes me think of Jurrasic Park...). <digression> Not to let the AV folks be the only ones getting jabbed, I'll take advantage of the opportunity to jab Lycos a little. They have a small table on their 91% page: Lycos 91% 10.75 Million Open Text 12% 0.80 Million Infoseek 6% 0.40 Million Yahoo <1% 0.05 Million It is obviously a case of apples and oranges to compare Lycos with Yahoo (I can't comment on the others, although I believe they use robots as well), since Yahoo is a reasonably well-organized human-built index. I tend to be able to find things in Yahoo. Lycos, for all the scoring, abstracts, searching options, yadda, yadda, yadda, is still a robot-generated index with all the problems for us mere humans that implies. We tend to like things a bit more structured. I don't normally find poring over a robot's search engine output all that fruitful. I still can't seem to write queries to any of the search engines that provide all that great a "usefulness quotient", even with a degree in Computer Science. If most of what's out there is crap (for the sake of argument, let's just pick a number out of thin air, say, 91%... :-), users of Lycos and the other robot indexes are bound to need real big shovels. On the other hand, presumably the Yahoo folks or the submitters of URLs to Yahoo at least sniff the URLs before deciding whether to add them to the database. In addition, Yahoo tends to index the trunks of URL trees (which I find more useful), not every friggin' leaf and branch. Hypothetical conversation between two botanists on a field trip: Ooh, Bob! look at this oak leaf! It sure is a whole lot different than the one we found on that other tree! Let's remember where we found it! Put that other one back... Has anyone considered adding an option to the various robot search engines that would restrict the depth of URLs returned to a query or at least use the number of components in a URL's path to help score the page? </digression> Sorry for the digression. I'm done venting. Please return to work now. Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! 
http://www.calendar.com/concerts/ From owner-robots Wed Jan 10 11:20:59 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23825; Wed, 10 Jan 96 11:20:59 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601101920.OAA20604@umbc8.umbc.edu> Subject: Re: MD5 in HTTP headers - where? To: robots@webcrawler.com Date: Wed, 10 Jan 1996 14:20:46 -0500 (EST) In-Reply-To: <199601101448.JAA20509@dolphin.automatrix.com> from "Skip Montanaro" at Jan 10, 96 09:48:22 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 737 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "SM" == Skip Montanaro spake thusly: SM> SM> SM> It was mentioned yesterday that there is an HTTP response header useful for SM> sending back an MD5 digest. I just did a quick scan through SM> SM> http://www.w3.org/hypertext/WWW/Protocols/HTTP/HTTP2.html SM> SM> and didn't find any response headers that looked like they were related to SM> use of MD5. Can someone give me a pointer? As the person who first made the claim, guess the burden of proof is on me. See the IETF draft, available at: <http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-00.txt>. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu It's hard to RTFM when you can't find the FM. . . From owner-robots Wed Jan 10 11:24:30 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24010; Wed, 10 Jan 96 11:24:30 -0800 Message-Id: <199601101924.LAA29177@scam.XCF.Berkeley.EDU> X-Authentication-Warning: scam.XCF.Berkeley.EDU: Host localhost [127.0.0.1] didn't use HELO protocol To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 1996 11:32:33 EST." <199601101632.LAA22167@northsea.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Id: <29174.821301875.1@scam.XCF.Berkeley.EDU> Date: Wed, 10 Jan 1996 11:24:35 -0800 From: Eric Hollander <hh@scam.XCF.Berkeley.EDU> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >In message <199601101612.LAA17875@mail.internet.com>, "Robert Raisch, The Inte r >net Company" writes: >> >>Altavista, while a marvelous example of what can be done when >>you throw multiple hundreds of thousands of dollars at the idea >>of indexing Internet accessible resources, appears to extract >>data from a host by connecting to EVERY tcp port on the machine. >>... > >agreed. absurd behavior. you'll probably find more interesting data if you scan udp ports, anyway. e From owner-robots Wed Jan 10 11:29:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24311; Wed, 10 Jan 96 11:29:28 -0800 From: mnorman@netcom.com (Mark Norman) Message-Id: <199601101928.LAA04725@netcom11.netcom.com> Subject: Re: Does anyone else consider this irresponsible? To: robots@webcrawler.com Date: Wed, 10 Jan 1996 11:28:55 -0800 (PST) In-Reply-To: <9601101732.AA21836@evil-twins.pa.dec.com> from "monier@pa.dec.com" at Jan 10, 96 09:32:44 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 182 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Your reply to the complaint said your robot finds web sites just by following links. But what links does it start with? 
Thanks, and thanks for participating in this mail list. bye! From owner-robots Wed Jan 10 11:52:06 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25506; Wed, 10 Jan 96 11:52:06 -0800 Message-Id: <199601101951.OAA23476@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com, monier@pa.dec.com Date: Wed, 10 Jan 1996 14:48:14 -0400 Subject: Re: Does anyone else consider this irresponsible? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Louis and others, I owe you an apology. It appears that I jumped to an erroneous conclusion when I assumed that Scooter harvested data from every available port. In my defence, it was the only assumption I could reach, based upon the information I had available to me. It seems that, rather than the behavior I suggested, the URLs with which I had a problem were inadvertently exposed to Scooter through the general publishing of an employee's hotlist. Thank you for the clarification and I regret any inconvenience this may have caused. It should be noted that I reached my current state of education regarding this matter via the "link:hostname" mechanism in Altavista. An excellent resource with innovative features. My compliments. </rr> Robert Raisch chief scientist The Internet Company On 10 Jan 96 at 9:32, monier@pa.dec.com wrote: > Gang, > > I am the father of Scooter, and for the second time I need to rescue my robot > from groundless accusations. Soon I'll have enough material for a book... From owner-robots Wed Jan 10 12:09:09 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26420; Wed, 10 Jan 96 12:09:09 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad19ba7e82cc@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 10 Jan 1996 12:09:45 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: FAQ again. Cc: kfischer@mail.win.org Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi all, I've been getting a lot of robot questions recently, so decided the FAQ time is now :-) I wrote the stuff below, and cross-checked with Keith Fischer's preliminary FAQ of early November last year; think I have addressed most of the questions he proposed. Pending comments I'l HTML-ise it and add it to the robot pages this week. Regards, ______________ WWW Robot Frequently Asked Questions Last updated: 10 January 1996 Maintained by Martijn Koster <m.koster@webcrawler.com> Location: http://info.webcrawler.com/mak/projects/robots/faq.html 1) About WWW robots 1.1) What is a WWW robot? 1.2) What is an agent? 1.3) What is a search engine? 1.4) What kinds of robots are there? 1.5) Aren't robots bad for the web? 1.6) Where do I find out more about robots? 2) Indexing robots 2.1) How does a robot decide where to visit? 2.2) How does an indexing robot decide what to index? 2.3) How do I register my page with a robot? 3) For Server Administrators 3.1) How do I know if I've been visited by a robot? 3.2) I've been visited by a robot. Now what? 3.3) A robot is traversing my whole site too fast! 3.4) How do I keep a robot off my server? 4) Robots exclusion standard 4.1) Why do I find entries for /robots.txt in my log files? 4.2) How do I prevent robots scanning my site? 
4.3) Where do I find out how /robots.txt files work? 4.4) Will the /robots.txt standard be extended? 5) Availability 5.1) Where can I use a robot? 5.2) Where can I get a robot? 5.3) Where can I get the source code for a robot? 5.4) I'm writing a robot, what do I need to be careful of? 5.5) I've written a robot, how do I list it? 1) About Web Robots =================== 1.1) What is a WWW robot? ------------------------- A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it is still a robot. Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them. 1.2) What is an agent? ---------------------- The word "agent" is used with lots of meanings in computing these days. Specifically: - Autonomous agents are programs that do travel between sites, deciding themselves when to move and what to do (e.g. General Magic's Telescript). These can only travel between special servers and are currently not widespread on the Internet. - Intelligent agents are programs that help users with things, such as choosing a product, or guiding a user through form filling, or even helping users find things. These generally have little to do with networking. - User-agent is the technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator, Email User-agents like Qualcomm Eudora, etc. 1.3) What is a search engine? ----------------------------- A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot. 1.4) What other kinds of robots are there? ------------------------------------------ Robots can be used for a number of purposes: - Indexing (see section 2) - HTML validation - Link validation - "What's New" monitoring - Mirroring See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the list... 1.5) Aren't robots bad for the web? ----------------------------------- There are a few reasons people believe robots are bad for the Web: - Certain robot implementations can (and have in the past) overloaded networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes. - Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects. - Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.
But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions. So no, robots aren't inherently bad, nor inherently brilliant; they simply need careful attention. 1.6) Where do I find out more about robots? ------------------------------------------- There is a Web robots home page on: http://info.webcrawler.com/mak/projects/robots/robots.html While this is hosted at the site of one of the major robots, it is an unbiased and reasonably comprehensive collection of information, maintained by Martijn Koster <m.koster@webcrawler.com>. Of course the latest version of this FAQ is there. You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots. 2) Indexing robots ================== 2.1) How does a robot decide where to visit? -------------------------------------------- This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web. Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot. Sometimes other sources for URLs are used, such as scanners of USENET postings, published mailing list archives, etc. Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs. 2.2) How does an indexing robot decide what to index? ----------------------------------------------------- If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags. We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on... 2.3) How do I register my page with a robot? -------------------------------------------- You guessed it, it depends on the service :-) Most services have a link to a URL submission form on their search page. Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you. 3) For Server Administrators ============================ 3.1) How do I know if I've been visited by a robot? --------------------------------------------------- You can check your server logs for sites that retrieve many documents, especially in a short time. If your server supports User-agent logging you can check for retrievals with unusual User-agent header values. Finally, if you notice a site repeatedly checking for the file '/robots.txt', chances are that it is a robot too. 3.2) I've been visited by a robot. Now what? -------------------------------------------- Well, nothing :-) The whole idea is they are automatic; you don't need to do anything. If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots at <URL: http://info.webcrawler.com/mak/projects/robots/robots.html>), and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!
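As a rough illustration of the log check described in 3.1 above, a few lines of Perl along these lines can summarise who is hitting a server and who has asked for '/robots.txt' (the Common Log Format and the "access_log" file name are assumptions; adjust both for your own server):

  #!/usr/bin/perl
  # Sketch: summarise an access log in Common Log Format, counting
  # requests per client and noting who fetched /robots.txt.
  $log = shift || "access_log";
  open(LOG, $log) || die "can't open $log: $!\n";
  while (<LOG>) {
      # CLF: host ident user [date] "METHOD path HTTP/x.x" status bytes
      next unless m!^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)!;
      ($host, $path) = ($1, $3);
      $count{$host}++;
      $robots{$host}++ if $path eq "/robots.txt";
  }
  close(LOG);
  foreach $host (sort { $count{$b} <=> $count{$a} } keys %count) {
      printf "%-40s %6d requests%s\n", $host, $count{$host},
          $robots{$host} ? " (fetched /robots.txt)" : "";
  }

A client near the top of that listing that also fetched /robots.txt is almost certainly a robot.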
3.3) A robot is traversing my whole site too fast! -------------------------------------------------- This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file. First of all check if it is a problem by checking the load of your server, and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick. However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash. If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try and find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots on <URL: http://info.webcrawler.com/mak/projects/robots/robots.html>. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain. If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others. 3.4) How do I keep a robot off my server? Read the next section... 4) Robots exclusion standard ============================ 4.1) Why do I find entries for /robots.txt in my log files? ----------------------------------------------------------- They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion; see question 4.3. If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server. Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-) 4.2) How do I prevent robots scanning my site? ---------------------------------------------- The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server: User-agent: * Disallow: / but it's easy to be more selective than that; see 4.3. 4.3) Where do I find out how /robots.txt files work? ---------------------------------------------------- You can read the whole standard on the Robot Page <URL: http://info.webcrawler.com/mak/projects/robots/robots.html> but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots.
It is best explained with an example (The vertical bar on the left is not part of the contents): | # /robots.txt file for http://webcrawler.com/ | # mail webmaster@webcrawler.com for constructive criticism | | User-agent: webcrawler | Disallow: | | User-agent: lycra | Disallow: / | | User-agent: * | Disallow: /tmp | Disallow: /logs The first two lines, starting with '#', specify a comment. The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere. The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off. The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note that the '*' is a special token; it's not a regular expression. Two common errors: regular expressions are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp'. And you shouldn't put more than one path on a Disallow line (this may change in a future version of the spec). 4.4) Will the /robots.txt standard be extended? ----------------------------------------------- Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress. 5) Availability =============== 5.1) Where can I use a robot? ----------------------------- If you mean a search service, check out the various directory pages on the Web, such as Netscape's <URL: http://home.netscape.com/home/internet-directory.html> or try one of the Meta search services such as <URL: http://metasearch.com/> 5.2) Where can I get a robot? ----------------------------- Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly. In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's. 5.3) Where can I get the source code for a robot? ------------------------------------------------- See 5.2 -- some may be willing to give out source code. 5.4) I'm writing a robot, what do I need to be careful of? ---------------------------------------------------------- Lots. First read through all the stuff on the robot page http://info.webcrawler.com/mak/projects/robots/robots.html then read the proceedings of past WWW Conferences, and the complete HTTP and HTML specs. Yes; it's a lot of work :-) 5.5) I've written a robot, how do I list it? --------------------------------------------- Simply fill in http://info.webcrawler.com/mak/projects/robots/form.html and mail the result to Martijn Koster <m.koster@webcrawler.com> with a subject of "Addition to the list of robots". THE END -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Jan 10 12:46:18 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28305; Wed, 10 Jan 96 12:46:18 -0800 Date: Wed, 10 Jan 1996 13:59:57 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601101959.AA21857@tssun5.> To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible?
X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From owner-robots@webcrawler.com Wed Jan 10 13:30 CST 1996 > From: <monier@pa.dec.com> > To: robots@webcrawler.com > Subject: Re: Does anyone else consider this irresponsible? > Date: Wed, 10 Jan 96 09:32:44 -0800 > X-Mts: smtp > Could the next person who feels an urge to speak for the Alta Vista robot please > check with me first? Nothing about this project is very secret, all you have to > do is ask. OK ... um ... how about source? ;) From owner-robots Wed Jan 10 13:27:50 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00457; Wed, 10 Jan 96 13:27:50 -0800 From: <monier@pa.dec.com> Message-Id: <9601102121.AA22127@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider this irresponsible? In-Reply-To: Your message of "Wed, 10 Jan 96 13:59:57 CST." <9601101959.AA21857@tssun5.> Date: Wed, 10 Jan 96 13:21:06 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Got a 1GB machine to run it on? OK, maybe not the source. It's too buggy and I would be embarrassed (;-)). --Louis From owner-robots Wed Jan 10 14:04:52 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02364; Wed, 10 Jan 96 14:04:52 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140807ad19d0dfc63f@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 10 Jan 1996 14:05:30 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: robots.txt extensions Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:35 AM 1/10/96, Adam Jack wrote: >Hello, > >Since this list started I've only ever seen one suggestion >for an extension to robots.txt. An extension discussion document sounds like an ideal, though belated, New Year's resolution :-) >to add expiry information for the >robots.txt file itself. No response appears to have been given >-- did people not think it worth while? Did people think the >HTTP response field, Expires, should be used for that? Yes, and I also don't think it's something widely wanted, and it will be confusing to people who don't understand all the ins and outs anyway (how about a separate 'funny messages in /robots.txt' thread? :-). The thing about expires is that it is a prediction, and people are not good at making predictions; they want a "I changed it, now update all your robots out there" push scheme. Does submitting a '/robots.txt' manually to robots bump it up in the queue (it does in WebCrawler)? Then you could use submit-it to do the push :-) >I don't know if this was discussed to death somewhere -- but >are people still considering extensions to robots.txt? I'd be >interested in any pointers to an archive of such a discussion. My thoughts never made it to the list :-) >If there is a point in discussing additions pls read on -- >otherwise bin this mail. No, by all means. But most of all I want to keep things simple. >MinRequestInterval: X > > Minimum request interval in seconds (0=no minimum), > with a default, if missing, of 60. > > This is for those of us lowly enough not to have huge > gathering tasks and the luxury ;-) of a backlog of URLs > over distributed sites. (I.e. Those of us doing a > sequential search exhausting our interest in a site in > one slurp.) Additionally local admins would have more > control over wanderers that visited.
Interesting, I didn't think people still did that :-) I think 60 is a sensible default, so let's think about why you would change it from that. There seems little point in setting it much higher, because even on the worst platform one request per minute is no problem (unless previous connections are still open). But who would set it much lower? Only someone who wants to run a robot against their own site, in which case they can control the speed themselves... So is it worth doing it at all? >DefaultIndex: index.html > > Stating that XXXX/ and XXXX/index.html are identical. > > You can argue that this is lamely inadequate - or that it > makes a saving. I know the bigger issue is recursion. Here > I am merely hoping to save those single page recursions. Yes, I do argue that this is lamely inadequate; I too think checksums are the way for this, even if it is post-retrieval; pre-retrieval is always a guess (even if we could have an If-not-md5 HTTP header). >CGIMask: *.cgi > > Rather than guessing at CGI URLs -- why not get the local > admin to answer it? I know that the WN server uses a file > extension to indicate a CGI script -- not /cgi-bin/. > > Q: Are CGI scripts universally avoided in advance -- or do > robots look at the HTTP flags of results to try to work > out whether some content is dynamically generated? I always think you shouldn't make a distinction between dynamically generated output and static output. What you should pay attention to is things like Expires, forms and queries, and outrageous recursion... >Finally -- I never understood why robots.txt was exclusion only. >Why does it not have some of positive hints added? I.e. you are >allowed & welcome to browse XXXX/fred.html. Was this a choice >built upon pragmatism -- thinking that this would open a can of >worms? Ha, finally someone who understands me! :-)) Yes, the can is really opened up when you start allowing keywords and stuff. I did think maybe one or both of a 'Visit' and a 'Meta' header would be a reasonable idea: 'Visit' would allow URLs to be listed for retrieval, and nothing more. So you could do: | Disallow: / | Visit: /welcome.html | Visit: /products.html | Visit: /keywords-and-overview-for-robots.html Which would be kinda cool and simple, but doesn't scale well to many URLs, or to more meta data. 'Meta' would specify a link to a separate document, using some TBD format (or formats, using content negotiation to pick text/url-list, text/aliweb, urc/foo or whatever) which further guides content selection and meta data for a site. Other requests I've had are regular expression support (e.g. for Disallow: *.html3), and allowing multiple paths per disallow line. What do people think of the above? Any others? -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Jan 10 14:05:00 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02379; Wed, 10 Jan 96 14:05:00 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140808ad19d8f2ac66@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 10 Jan 1996 14:05:36 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Robots / source availability? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Any robot author could have written: > maybe not the source. It's too buggy and I would be embarrassed (;-)). :-) Who is currently selling/giving away robot binaries and/or source?
I'd like to add that info to the robots listed... people ask me all the time. -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Wed Jan 10 16:18:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08903; Wed, 10 Jan 96 16:18:44 -0800 Message-Id: <30F45854.1697@corp.micrognosis.com> Date: Wed, 10 Jan 1996 19:22:44 -0500 From: Adam Jack <ajack@corp.micrognosis.com> Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: robots.txt extensions References: <v02140807ad19d0dfc63f@[199.221.45.139]> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Martijn Koster wrote: > > [...] sequential search exhausting our interest in a site in > > Interesting, I didn't think people still did that :-) Martijn -- think lowly, very very lowly ... ;-) People will always start somewhere -- and, as I mentioned in point to point mail, it is us beginners that *all* ought to be wary of. Robots.txt seems the first line of defense. A site can make explicit statements in it. Being explicit is a good reason for a MinRequestInterval. > I think 60 is a sensible default, so let's think about why you would > change it from that. [...] But who would set it much lower? > 60 might be sensible for your needs -- but what about others' search needs? Consider people like me who get libwwwperl and a spare afternoon and a goal. Robots, Spiders et al. will get more and more prolific and they won't all have long term aims and/or budgets. In testing, and in practice, I felt myself get tempted to hack down the 60 second default to, say, 30 ... then I read that an 'OK' robot on the active list did once-a-second :-) :-) ...... Soon, rabid thoughts of 60 *micro*seconds came to mind ... However - if any site ever mentioned a preference for, say, 120 seconds - then I'd be happy to oblige. I think this information is a good addition. It needn't be of use to the thundering giants -- it is the WWW site that benefits. > >DefaultIndex: index.html > > > > Stating that XXXX/ and XXXX/index.html are identical. > > > > You can argue that this is lamely inadequate - or that it > > makes a saving. I know the bigger issue is recursion. Here > > I am merely hoping to save those single page recursions. > > Yes, I do argue that this is lamely inadequate; I too think checksums > are the way for this, even if it is post-retrieval; pre-retrieval is > always a guess (even if we could have an If-not-md5 HTTP header). > Again - giants versus the lowly. This misses a saving for those who don't have MD5 capabilities. Also, as for whether checksums are the answer - that seems odd: a robot must either cache a whole site of checksums, or load the checksum lists when a site's URL is individually accessed (for those non-sequential giants). All this to see if a URL is the same as one already seen? Is this not a huge processing overhead? Is this mechanism suggested only because existing HTTP servers and header fields would need no change to support it?
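For concreteness, the bookkeeping behind the post-retrieval checksum idea being debated here can be one in-memory table per crawl, roughly one digest (16 bytes of MD5) per document fetched. A rough Perl sketch, assuming an MD5 module such as Digest::MD5 is available (the names are illustrative only):

  use Digest::MD5 qw(md5_hex);

  %seen = ();            # digest of content -> first URL seen with it

  # Returns the URL already recorded with identical content, or undef.
  sub duplicate_of {
      ($url, $content) = @_;
      $digest = md5_hex($content);
      return $seen{$digest} if exists $seen{$digest};
      $seen{$digest} = $url;
      return undef;
  }

  # e.g. if http://host/ and http://host/index.html return the same
  # bytes, the second call reports the first URL and the robot can
  # skip indexing the copy.

Note this only helps after the duplicate has already been fetched; saving the fetch itself would need something like the If-not-md5 header mentioned earlier.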
Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html From owner-robots Wed Jan 10 16:38:34 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09762; Wed, 10 Jan 96 16:38:34 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199601110038.CAA11256@krisse.www.fi> Subject: Re: robots.txt extensions To: robots@webcrawler.com Date: Thu, 11 Jan 1996 02:38:12 +0200 (EET) In-Reply-To: <Pine.SUN.3.91.960110095606.1141B-100000@becks> from "Adam Jack" at Jan 10, 96 10:35:10 am X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 3148 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Adam: > If there is a point in discussing additions pls read on -- > otherwise bin this mail. Sure is. In the following I will comment on your proposals from the viewpoint 'Does it solve any problems? Will anyone implement it?', because I think no one will implement any extensions if there is nothing to gain and, more specifically, no problems to solve with them. > http://info.webcrawler.com/mailing-lists/robots/0001.html > > seemed sensible enough -- to add expiry information for the > robots.txt file itself. No response appears to have been given The problem is: someone changes robots.txt while a cached copy is still trusted by a robot. Adding expiry info does not prevent sysadmins from editing robots.txt before its expiration, so it still has to be retrieved at some sensible interval before expiration if the expiry is set too far away in the future. Retrieving robots.txt every 100th to 1000th GET, or at minimum every 8 hours and at maximum every couple of days, will not increase net traffic and solves the problem better than expiry fields. And because every robot has to handle robots.txt expiration sensibly anyway, no sysadmin sees this as a problem and will not implement the new field. > MinRequestInterval: X > > Minimum request interval in seconds (0=no minimum), > with a default, if missing, of 60. There is no problem with request intervals with well-behaved robots, and the ill-behaving ones - will they obey it anyway? So there is no problem and it does not even get solved :-) Again nobody will implement this. > DefaultIndex: index.html > > Stating that XXXX/ and XXXX/index.html are identical. Checksums are easier and have to be implemented anyway, because most sites will not have this field implemented. And because checksums work, this is unnecessary and no one will use it. > CGIMask: *.cgi Hmm. Disallow: with regular expressions would be more generic. But again: how many cases can be found where this is necessary? > Finally -- I never understood why robots.txt was exclusion only. > Why does it not have some of positive hints added? I.e. you are > allowed & welcome to browse XXXX/fred.html. Was this a choice > built upon pragmatism -- thinking that this would open a can of > worms? I do not believe it is a problem to give robots URLs; they are pretty good at finding them themselves. Also, listing a URL in robots.txt does not bring the robot in for a visit - a submission to the robot admin will. On the other hand, the lack of exclusion of robots from sites/URLs was a severe problem and was well solved by robots.txt. Also, while updating the information content of a site, sysadmins and ordinary users surely will forget to update robots.txt. (Directories are more static and therefore the current scheme works.) I am sorry I sound quite negative.. Actually, the ideas might be pretty good.
I do not mean to be rude :-) I actually have a new idea too: Textarchive: /allpages.zip or Textarchive: /publicdocs.tar.gz (or with any other compressed archive format) ..instructs robots to fetch all there is in a compressed format. Is this a simple enough interface for everyone to accept? Too simple? From owner-robots Wed Jan 10 18:43:35 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16707; Wed, 10 Jan 96 18:43:35 -0800 From: mnorman@netcom.com (Mark Norman) Message-Id: <199601110243.SAA20444@netcom11.netcom.com> Subject: Re: Does anyone else consider... To: robots@webcrawler.com Date: Wed, 10 Jan 1996 18:43:09 -0800 (PST) X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 138 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Louis, you said your robot only "follows links" to find web sites, but how do you get the links you give it as a starting point? thanks. From owner-robots Wed Jan 10 19:58:35 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21787; Wed, 10 Jan 96 19:58:35 -0800 From: <monier@pa.dec.com> Message-Id: <9601110353.AA22505@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider... In-Reply-To: Your message of "Wed, 10 Jan 96 18:43:09 PST." <199601110243.SAA20444@netcom11.netcom.com> Date: Wed, 10 Jan 96 19:53:28 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Louis, you said your robot only "follows links" to find web sites, but > how do you get the links you give it as a starting point? thanks. I just started from a few well-known sources, like the NCSA archives, and the Web is sufficiently connected to do the rest. I did not have the guts to give it a single URL, but I bet that it would take quite a bit of work to find a URL that would not connect to the whole Web. Think about how many pages mention Yahoo for example, and how quickly the search will branch after that. And then of course I use any URL people contribute. ----------------- Since I'm here, and in the interest of saving bandwidth I want to respond to Skip who was missing one important point and calling me a fool (;-)). Alta Vista uses a fast robot. I ran this robot for a week and got 16M pages. If I had run it for two weeks I would no doubt have 25-30M pages today. Once I restart the robot the index will contain more pages, unless it finds a lot of sites with better /robots.txt in which case it will delete these pages, and I may report a smaller index for a while, which would be fine with me. Notice that I said a "better" robots.txt, because I would actually enjoy seeing every webmaster put up a good file and save everyone the trouble to fetch, index, and read stuff that was never intended to be indexed. Every chance I get to educate another person, specially a reporter, about the Robots Exclusion Standard, I do it, because it's our only chance so far to improve the quality of what ends up in Web indexes. And of course if webmasters used password protection on ports that are not intended for public usage it would make life somewhat easier: I have answered enough "you have violated my secret test site" messages. My point is that I don't want to maximize at all cost the number of pages to report: I am interested in finding out how large the Web is, and giving everyone access to its complete index. 
And while doing this I want to report facts, not engage in a p...ing contest with some outfit who has reportedly indexed 91% of an absolutely unknown and moving figure. Alta Vista is a research project with no place for this kind of creative arithmetic. --Louis From owner-robots Wed Jan 10 21:32:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28339; Wed, 10 Jan 96 21:32:44 -0800 Message-Id: <30F4A20A.1B91@corp.micrognosis.com> Date: Thu, 11 Jan 1996 00:37:14 -0500 From: Adam Jack <ajack@corp.micrognosis.com> Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: robots.txt extensions References: <199601110038.CAA11256@krisse.www.fi> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jaakko Hyvatti wrote: > > I am sorry I sound quite negative.. Actually, the ideas might be > pretty good. I do not mean to be rude :-) > Don't worry about that. I appreciate your information. Thanks. Okay - so zero for 3... How about I just comment on the following : monier@pa.dec.com wrote: > > sites with better /robots.txt [...] > index, and read stuff that was never intended to be indexed. > Our admin knows + cares little for our content. Our content is automatically transfered to our own sub-tree locations each hour. We are in a position to determine what is worthy & what not -- but not in a position to modify /robots.txt. I am sure our site is not alone in this... Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html From owner-robots Wed Jan 10 22:03:48 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00606; Wed, 10 Jan 96 22:03:48 -0800 Message-Id: <v02130500ad1a569db95e@[202.237.148.20]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 11 Jan 1996 15:02:50 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Does anyone else consider... Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Louis: >Alta Vista uses a fast robot. I ran this robot for a week and got 16M pages. >If I had run it for two weeks I would no doubt have 25-30M pages today. Once I Are you saying that your entire database was obtained in a week?! What do the dates mean in the listings that are returned?--the date the file was created, changed, or documented by AV? I've been using AV to trace links to our page, and I can use the advanced query and divide things up by time period. When I do this the distribution of pages is over a matter of months, not a week. (By the way, there seems to be a bug in the date feature, since it returns data even when you set it to the future, and there are a couple other odd things.) Are you able to update the database in real time, or do you have to rebuild it every time you add/revise data? --Mark From owner-robots Thu Jan 11 01:44:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13378; Thu, 11 Jan 96 01:44:21 -0800 Date: Thu, 11 Jan 1996 09:44 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <0099C391D03F1640.5846@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Re: robots.txt extensions X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello all, I've just joined the Robots list, so this is a general 'hello'. 
> 'Visit' would allow URLs to be listed for retrieval, and nothing more. > So you could do: > > | Disallow: / > | Visit: /welcome.html > | Visit: /products.html > | Visit: /keywords-and-overview-for-robots.html > > Which would be kinda cool and simple, but doesn't scale well to many > URL's, or to more meta data. I'd use this for a Visit: /changes.html which contains a 'w3new' type list of all pages (all pages I want indexed) on the server with the most recently modified at the top... crawlers can do what they like with the information but at least it is there. It might help however to qualify the 'Visit' keyword in some way to say that the information is ordered. Regards, Martin Kiff mgk@newton.npl.co.uk From owner-robots Thu Jan 11 03:57:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20139; Thu, 11 Jan 96 03:57:11 -0800 Date: Thu, 11 Jan 1996 06:56:59 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601111156.GAA27257@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider... In-Reply-To: <9601110353.AA22505@evil-twins.pa.dec.com> References: <199601110243.SAA20444@netcom11.netcom.com> <9601110353.AA22505@evil-twins.pa.dec.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Since I'm here, and in the interest of saving bandwidth I want to respond to Skip who was missing one important point and calling me a fool (;-)). My apologies also. I was responding in part to Robert Raisch's message. A foundation built on sand... Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Thu Jan 11 08:02:49 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02821; Thu, 11 Jan 96 08:02:49 -0800 From: <monier@pa.dec.com> Message-Id: <9601111557.AA23045@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Does anyone else consider... In-Reply-To: Your message of "Thu, 11 Jan 96 15:02:50 +0900." <v02130500ad1a569db95e@[202.237.148.20]> Date: Thu, 11 Jan 96 07:57:21 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The database was obtained in 8 days. The date is last-modified as reported by the server, which is often bogus, but there is nothing I can do, except educate more webmasters (;-)). This should be better documented, we are working on documentation right now. The database is updated in real time, i.e. while queries come in: the news index for example is constantly in flux since articles come in and expire all the time. --Louis From owner-robots Thu Jan 11 09:23:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07254; Thu, 11 Jan 96 09:23:28 -0800 Message-Id: <199601111723.JAA23233@meitner.cs.washington.edu> In-Reply-To: m.koster@webcrawler.com's message of Wed, 10 Jan 1996 14:05:36 -0700 To: robots@webcrawler.com Subject: Re: Robots / source availability? References: <v02140808ad19d8f2ac66@[199.221.45.139]> Date: Thu, 11 Jan 1996 09:23:23 PST From: Erik Selberg <speed@cs.washington.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Martijn Koster writes: > Who is currently selling/giving away robot binaries and/or source? > I'd like to add that info to the robots listed... people ask me all the time. 
There's a mini-robot available with the recent release of libwww (v4.0) from w3.org -Erik From owner-robots Sat Jan 13 07:06:34 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08811; Sat, 13 Jan 96 07:06:34 -0800 Message-Id: <30F7CB49.3E4A@wsnet.com> Date: Sat, 13 Jan 1996 09:10:01 -0600 From: Alison Gwin <alison@wsnet.com> Organization: Coldwell Banker Smith X-Mailer: Mozilla 2.0b3 (Win95; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: (no subject) X-Url: http://info.webcrawler.com/mailing-lists/robots/info.html Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Could someone point me to a simple application that will scan newsgroups of interest to me and save the email addresses from those newsgroups? I'm sure that such an application exists, but can't find one anywhere. Creating one from scratch seems like such a waste of effort when I know there's probably one out there already. Thanks! From owner-robots Sat Jan 13 10:58:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24036; Sat, 13 Jan 96 10:58:29 -0800 X-Sender: dhender@oly.olympic.net Message-Id: <v01510101ad1db16bf2ef@[205.240.23.66]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 13 Jan 1996 11:00:17 -0800 To: robots@webcrawler.com From: david@quickimage.com (David Henderson) Subject: Robots not Frames savvy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com When will robots support frames???? _____________________________________________________________ David Henderson - Webmaster - QUICKimage HOME PH/FAX: 360-377-2182 WORK PH: 206-443-1430 WORK FAX: 206-443-5670 From owner-robots Sat Jan 13 16:03:31 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11650; Sat, 13 Jan 96 16:03:31 -0800 Message-Id: <v02130501ad1df5a3c7c2@[202.243.51.216]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 14 Jan 1996 09:04:30 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Spam Software Sought Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 9:10 AM 1/13/96, Alison Gwin wrote: >Could someone point me to a simple application that will scan newsgroups >of interest to me and save the email addresses from those newsgroups? >I'm sure that such an application exists, but can't find one anywhere. >Creating one from scratch seems like such a waste of effort when I know >there's probably one out there already. Thanks! Why would you want such a program? You wouldn't be working for Canter and Siegel, would you? From owner-robots Sun Jan 14 00:53:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10100; Sun, 14 Jan 96 00:53:22 -0800 Date: Sun, 14 Jan 96 12:02:20 EST From: smadja@netvision.net.il Subject: Re: Does anyone else consider... To: robots@webcrawler.com X-Mailer: Chameleon ARM_55, TCP/IP for Windows, NetManage Inc. Message-Id: <Chameleon.960114120412.smadja@Haifa.netvision.net.il> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 11 Jan 96 07:57:21 -0800 monier@pa.dec.com wrote: > >The database was obtained in 8 days. The date is last-modified as reported by >the server, which is often bogus, but there is nothing I can do, except educate >more webmasters (;-)).
This should be better documented, we are working on >documentation right now. >The database is updated in real time, i.e. while queries come in: the news index >for example is constantly in flux since articles come in and expire all the time. > > --Louis Louis: What do you mean by the DB is updated in real time? Do you rescan the web continuously checking for updates and new pages (with a 1-week cycle), or do you have some other strategy? Thanks ------------------------------------- Name: Frank Smadja E-mail: smadja@netvision.net.il Date: 01/14/96 Time: 12:02:20 ------------------------------------- From owner-robots Sun Jan 14 10:53:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07332; Sun, 14 Jan 96 10:53:23 -0800 From: <monier@pa.dec.com> Message-Id: <9601141846.AA26989@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Horror story Date: Sun, 14 Jan 96 10:46:32 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was taking a look at the Alta Vista database and found out the following fact: about 5% of all sites visited (>100,000 total) have a non-empty /robots.txt file. Horrifying! This suggests that before we add all sorts of improvements to the standard we should try to educate the webmasters: it does not matter whether we have 36 options on how often to refetch the damn file if nobody uses it and robots still fall down holes, wander in test areas, gobble up access logs... Seriously, how about some sort of concerted effort to educate webmasters everywhere that with two minutes of their time they can make everyone's life better: less visits to their site, less junk in indexes (indices?). --Louis From owner-robots Sun Jan 14 12:05:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11141; Sun, 14 Jan 96 12:05:44 -0800 Date: Sun, 14 Jan 1996 15:05:31 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601142005.PAA01945@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Re: Horror story In-Reply-To: <9601141846.AA26989@evil-twins.pa.dec.com> References: <9601141846.AA26989@evil-twins.pa.dec.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Seriously, how about some sort of concerted effort to educate webmasters everywhere that with two minutes of their time they can make everyone's life better: less visits to their site, less junk in indexes (indices?). Sounds good in principle. In practice, however, since most people want to tout as many "hits" as possible, they may be less inclined than you might think to squelch robots. Here are a few suggestions: 1. Every site that uses robots (Lycos, Alta Vista, Webcrawler, ...) should have an easily found link to Martijn's norobots page. I know some do already. Others mention robots.txt but don't provide a link. 2. If possible, expose several "load-and-go" annotated robots.txt files (maybe on Martijn's site), each with clear statements of the particular file's goals. I know there are a few on the norobots page, but I doubt there are very many sites with /cyberworld directories. 3. Every robot site that supports URL inputs should mention robots.txt in both the submission form and the submission response page. 4. How about a robots.txt creation Web form? 5. Are there some good non-Web places to get a little publicity? What about WebWeek, Interactive Age, and other Internet rags? Could a short article be written? 6. All the major Web servers should come with a little blurb about robots.txt.
7. How about a little IMG like the Point Communications Top 5% graphic that points to the norobots site? Anybody with a robots.txt file could display it proudly (we are, after all, a pretty elite group if Louis's message is indicative of reality). It could have a little image and a catchy phrase like: robots.txt - the diaphragm for your Web server -- Skip Montanaro | Looking for a place to promote your music venue, new CD skip@calendar.com | or next concert tour? Place a focused banner ad in (518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/ From owner-robots Sun Jan 14 16:05:07 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24212; Sun, 14 Jan 96 16:05:07 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199601150005.CAA02337@krisse.www.fi> Subject: Re: Horror story To: robots@webcrawler.com Date: Mon, 15 Jan 1996 02:04:58 +0200 (EET) In-Reply-To: <199601142005.PAA01945@dolphin.automatrix.com> from "Skip Montanaro" at Jan 14, 96 03:05:31 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 760 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Skip Montanaro: > 6. All the major Web servers should come with a little blurb about > robots.txt. That is the most important thing here! And not only should all server installation instructions have a step called 'Creating /robots.txt' and an example with 'Disallow: /cgi-bin/' in it, any software that creates information or scripts that should not be indexed, like statistics packages, query frontends, database gateways... should come with /robots.txt and specific instructions! Now it is just a question of who is going to do the real work here: to list all such software with the developers' contact addresses, and formulate a letter that impresses upon them the need to include these instructions in the next release. I believe it might even work. From owner-robots Sun Jan 14 17:50:05 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29935; Sun, 14 Jan 96 17:50:05 -0800 Message-Id: <9601150149.AA29926@webcrawler.com> Date: Sun, 14 Jan 1996 17:38:00 -0800 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: Re: Horror story To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com 3. Every robot site that supports URL inputs should mention robots.txt in both the submission form and the submission response page. How about somebody, say maybe Louis as he would certainly have the resources (I would do it but don't have a budget for that kind of stuff, unless of course somebody came up with a few man weeks), run your little spider again against your complete data set of URLs and, while looking for links on ONLY the topmost page of each site, record any mail address that has "webmaster..." or something similar in it. Then send a little message to the webmaster saying in effect that a group of sites that use spiders (mention a few of the big ones that are on this mailing list) have informally got together and, in order to properly index your site in the future, would like to make a little suggestion.... we have noticed that you do not have a robots.txt file.... Then include a short example that the webmaster could cut and paste into the file system, with a short tutorial on how to do it.
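The cut-and-paste example mentioned here could be as short as the following (the excluded paths are only common illustrations, not a recommendation for any particular site):

  # /robots.txt -- tells cooperating robots which parts of this server
  # not to fetch; see the robots home page for the full standard:
  # http://info.webcrawler.com/mak/projects/robots/robots.html

  User-agent: *
  Disallow: /cgi-bin
  Disallow: /tmp
  Disallow: /logs

The accompanying tutorial only needs to say that the file goes in the server's document root, so that it is served as /robots.txt.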
Two things would happen, 1) lots of people would get the message and 2) if you sent out >100,000 e-mail messages one weekend surely some of the trade publications would write up a few articles on our behalf after their webmaster got a message and realized that it was send to the world. We would not hit everybody, but it would sure hit a lot of the sites that could do the 2 minute piece of work and set it up properly. Ted Sullivan tsullivan@snowymtn.com From owner-robots Sun Jan 14 19:25:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05452; Sun, 14 Jan 96 19:25:23 -0800 Date: Sun, 14 Jan 1996 19:24:47 -0800 Message-Id: <199601150324.TAA07209@one.mind.net> X-Sender: belisle@mind.net X-Mailer: Windows Eudora Version 1.4.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: belisle@mind.net (Hal Belisle) Subject: Gopher Protocol Question Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi! my name is Hal Belisle and I am new to this list. I am writing a small web client (spider?) that automatically searches specific sites on a bi-weekly basis for information using existing search engines. I am having a hard time with gophers. I can gain access and retrieve files, but I can't seem to give them a valid search string. What exactly do you replace the ? you normally see in web searches with when you query a gopher? Any help or pointers to other sources of information (i.e. a listserve for web clients) would be greatly appreciated. Thanks in advance Hal Belisle From owner-robots Sun Jan 14 22:40:14 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16486; Sun, 14 Jan 96 22:40:14 -0800 Message-Id: <9601150640.AA16476@webcrawler.com> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Brian Pinkerton <bp> Date: Sun, 14 Jan 96 22:40:09 -0800 To: robots@webcrawler.com Subject: Re: Horror story References: <9601150149.AA29926@webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The robots.txt format got some good press in the Nov. issue of WebWeek, and more extensive coverage on the whole robots issue is in the works. Skip is right: a lot of people want as much exposure as possible, and aren't likely to pay attention to ideas that might reduce that exposure! On the flip side, offering a way to specify what files on a site are most important *to* index would be seen as a big step forward. Martijn may have more to say on this issue. :) cheers, bri From owner-robots Sun Jan 14 23:10:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17982; Sun, 14 Jan 96 23:10:11 -0800 Message-Id: <199601150709.XAA05451@sparty.surf.com> Date: Sun, 14 Jan 96 11:07:40 -0800 From: Murray Bent <murrayb@surf.com> Organization: Web21 Inc. X-Mailer: Mozilla 1.12 (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Horror story X-Url: http://home.netscape.com/ Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Any ideas for a back-up plan in case the 'robots.txt' approach does not gain more than 5% of the sites (admittedly representing more than 5% of the cool content). There are lots of other web middleware facilities in the works, and a "deny permission" element or quality of service element is in some them. If we want smarter robots we need the middleware too. 
- there are docs on the w3.org site for all manner of proposals, watch out .. some of them may happen! - without these middleware facilities, the number of dumb robots and site slurpers will *proliferate* . Is robots.txt the solution of choice to the 'personalised' robots that will come in the box with Windows 96 say? mj From owner-robots Mon Jan 15 12:06:15 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26938; Mon, 15 Jan 96 12:06:15 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601152006.PAA24219@umbc10.umbc.edu> Subject: Re: Horror story To: robots@webcrawler.com Date: Mon, 15 Jan 1996 15:06:04 -0500 (EST) In-Reply-To: <199601150709.XAA05451@sparty.surf.com> from "Murray Bent" at Jan 14, 96 11:07:40 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 974 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "MB" == Murray Bent spake thusly: MB> MB> Any ideas for a back-up plan in case the 'robots.txt' approach MB> does not gain more than 5% of the sites (admittedly representing MB> more than 5% of the cool content). MB> I hate to sound heretical, but why is everyone so concerned about this '5%' figure? I'm sure everyone on this list is a sufficiently sophisticated programmer to have developed the sort of complex web systems that robots.txt is supposed to protect, but most web servers probably don't need a robots.txt. Now, 'most' might not be 95% of all web servers, but the problem is still not as bad as it might seem. We don't *need* to inform every 'webmaster' who downloads a server kit and can write HTML; only the hackers. Monier, do you have any guesses on what fraction of servers *should* have a robots.txt? -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Assembly programmers drive stick shifts. From owner-robots Mon Jan 15 13:35:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27701; Mon, 15 Jan 96 13:35:23 -0800 Subject: New Robot Announcement From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Mon, 15 Jan 1996 13:38:29 -0800 Message-Id: <1390409387-681890@aktiv.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com New Robot Announcement Name: Duppies (rhymes with puppies) Author: Larry Burke, AKTIV Software Platform: Mac OS (considering Windows NT port) User-Agent: Duppies/1.0 Purpose: Allows website administrator to provide searchable index of their own site as well as other related sites. Has facilities to perform timed updates. Performs several other utility functions. Includes filtering system to limit indexing to files meeting specified criteria. Single program performs robot function, text indexing, and search processing either as a CGI or a stand-alone web server. Important Note: It is our intention to make Duppies available commercially to web administrators. Any comments on this would be welcomed. We feel a large missing part of many web sites is the lack of a site specific index (try finding anything at www.apple.com). Status: Currently being implemented by the Government of British Columbia to index all the official ministry sites. Supports Robot Exclusion standard: Yes, and we have implemented all the other robot niceties we could think of (and read "Internet Agents" by Fah-Chun Cheong). 
I have been following this list since July or August. For more information visit "http://www.aktiv.com/duppies/duppies.html". -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com phone: 604.383.4195 From owner-robots Mon Jan 15 15:13:27 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28452; Mon, 15 Jan 96 15:13:27 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601152313.SAA05450@umbc10.umbc.edu> Subject: Re: New Robot Announcement To: robots@webcrawler.com Date: Mon, 15 Jan 1996 18:13:01 -0500 (EST) In-Reply-To: <1390409387-681890@aktiv.com> from "Larry Burke" at Jan 15, 96 01:38:29 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 999 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "LB" == Larry Burke spake thusly: LB> LB> Name: Duppies (rhymes with puppies) LB> LB> Purpose: Allows website administrator to provide searchable index of LB> their own site as well as other related sites. Has facilities to perform LB> timed updates. Performs several other utility functions. Includes LB> filtering system to limit indexing to files meeting specified criteria. LB> Single program performs robot function, text indexing, and search LB> processing either as a CGI or a stand-alone web server. LB> LB> Important Note: It is our intention to make Duppies available LB> commercially to web administrators. Any comments on this would be LB> welcomed. We feel a large missing part of many web sites is the lack of a LB> site specific index (try finding anything at www.apple.com). Exactly how is this better than Harvest, which is free? -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Naaah, real men don't read docs. From owner-robots Mon Jan 15 15:56:55 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28815; Mon, 15 Jan 96 15:56:55 -0800 Subject: Re: New Robot Announcement From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Mon, 15 Jan 1996 15:56:21 -0800 Message-Id: <1390401115-150821641@gco.gov.bc.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Exactly how is this better than Harvest, which is free? Harvest server software currently runs only on UNIX machines. Duppies was designed for the web serving community who either do not know and don't want to know UNIX or do not have a UNIX box available. -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com phone: 604.383.4195 From owner-robots Mon Jan 15 22:09:15 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16221; Mon, 15 Jan 96 22:09:15 -0800 To: robots@webcrawler.com Subject: Re: robots.txt extensions X-Url: http://www.miranova.com/%7Esteve/ References: <199601110038.CAA11256@krisse.www.fi> From: Steven L Baur <steve@miranova.com> Date: 15 Jan 1996 22:06:27 -0800 In-Reply-To: Jaakko Hyvatti's message of 10 Jan 1996 16:38:12 -0800 Message-Id: <m2wx6slllo.fsf@miranova.com> Organization: Miranova Systems, Inc. 
Lines: 23 X-Mailer: September Gnus v0.26/Emacs 19.30 Mime-Version: 1.0 (generated by tm-edit 7.38) Content-Type: text/plain; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>>>> "Jaakko" == Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> writes: >> Finally -- I never understood why robots.txt was exclusion only. >> Why does it not have some of positive hints added? I.e. you are >> allowed & welcome to browse XXXX/fred.html. Was this a choice >> built upon pragmatism -- thinking that this would open a can of >> worms? I too would like to see something like this. Or at least some way of prioritizing pages. Jaakko> I do not believe it is a problem to give robots URLs, they are Jaakko> pretty good at finding them themselves. A little too good sometimes. The problem comes when one has an archive of something like a manual that is under development. I maintain two such archives, and had several robots going through the pages while pieces were being changed actively, while other more static pages were ignored. Regards, -- steve@miranova.com baur From owner-robots Tue Jan 16 01:39:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25968; Tue, 16 Jan 96 01:39:23 -0800 Date: Tue, 16 Jan 1996 09:39:23 GMT From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9601160939.AA05701@kronos> To: robots@webcrawler.com Subject: Re: New Robot Announcement X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > LB> Important Note: It is our intention to make Duppies available > LB> commercially to web administrators. Any comments on this would be > LB> welcomed. We feel a large missing part of many web sites is the lack of a > LB> site specific index (try finding anything at www.apple.com). > > Exactly how is this better than Harvest, which is free? > > -- Or FreeWAIS, SWISH, and a host of others From owner-robots Tue Jan 16 05:54:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08486; Tue, 16 Jan 96 05:54:26 -0800 Date: Tue, 16 Jan 1996 08:54:14 -0500 From: Skip Montanaro <skip@automatrix.com> Message-Id: <199601161354.IAA05082@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Re: robots.txt extensions In-Reply-To: <m2wx6slllo.fsf@miranova.com> References: <199601110038.CAA11256@krisse.www.fi> <m2wx6slllo.fsf@miranova.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Steven L. Baur writes: A little too good sometimes. The problem comes when one has an archive of something like a manual that is under development. I maintain two such archives, and had several robots going through the pages while pieces were being changed actively, while other more static pages were ignored. Hmmm... seems like you need a Disallow: item to keep those pesky robots away from your development tree. I too think a positive hint would be useful. If a robot is well-behaved, it will take some period of time to munch my entire site. I'd like to be able to suggest where it should munch first. Skip Montanaro | Looking for a place to promote your music venue, new CD, skip@calendar.com | festival or next concert tour? Place a focused banner (518)372-5583 | ad in Musi-Cal! 
http://www.calendar.com/concerts/ From owner-robots Tue Jan 16 09:18:33 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18167; Tue, 16 Jan 96 09:18:33 -0800 Date: Tue, 16 Jan 1996 10:28:59 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601161628.AA06920@tssun5.> To: robots@webcrawler.com Subject: Re: New Robot Announcement X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From owner-robots@webcrawler.com Mon Jan 15 19:28 CST 1996 > Subject: Re: New Robot Announcement > From: Larry Burke <lburke@aktiv.com> > To: <robots@webcrawler.com> > Mime-Version: 1.0 > Date: Mon, 15 Jan 1996 15:56:21 -0800 > > >Exactly how is this better than Harvest, which is free? > > Harvest server software currently runs only on UNIX machines. Duppies was > designed for the web serving community who either do not know and don't > want to know UNIX or do not have a UNIX box available. I was under the impression that most web servers were running on a UNIX box. What else are you going to run a server on? I would argue that NT doesn't have the horsepower, and tehre aren't a lot of alternatives. From owner-robots Tue Jan 16 11:18:50 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22816; Tue, 16 Jan 96 11:18:50 -0800 Message-Id: <30FBF9C5.45DD@interworld.com> Date: Tue, 16 Jan 1996 14:17:25 -0500 From: David@interworld.com (David Levine) Organization: InterWorld, Really Cool Stuff Division X-Mailer: Mozilla 2.0b5 (WinNT; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: New Robot Announcement References: <9601161628.AA06920@tssun5.> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ed Carp @ TSSUN5 wrote: > I was under the impression that most web servers were running > on a UNIX box. > What else are you going to run a server on? I would argue > that NT doesn't > have the horsepower, and tehre aren't a lot of alternatives. NT can be extremely powerful when running on a Dec Alpha. My company provided the software for a server running on such a system which receives approximately 1,000,000 GETs a day. Still pretty fast. David Levine From owner-robots Tue Jan 16 11:42:39 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24167; Tue, 16 Jan 96 11:42:39 -0800 Subject: Re: New Robot Announcement From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Tue, 16 Jan 1996 11:42:17 -0800 Message-Id: <1390329959-155101283@gco.gov.bc.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I was under the impression that most web servers were running on a UNIX box. >What else are you going to run a server on? I would argue that NT doesn't >have the horsepower, and tehre aren't a lot of alternatives. There are many sites that are run on the Mac OS using either WebStar or MacHTTP. See "http://brad.net/machttp_talk/sites.by.title.html" for a partial list. I don't mean to be an advocate for the Apple Internet Server products but they are easy to use and plenty powerful for many server applications. And NT products are becoming very respectable as well. Check out "http://www.cc.gatech.edu/gvu/user_surveys/survey-10-1995/graphs/info/which _server.html" if you want some recent statistics on server usage. -------------------- Larry Burke AKTIV Software Victoria, B.C. 
email: lburke@aktiv.com web: www.aktiv.com phone: 604.383.4195 From owner-robots Tue Jan 16 11:51:08 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24631; Tue, 16 Jan 96 11:51:08 -0800 Date: Tue, 16 Jan 1996 13:05:36 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601161905.AA13813@tssun5.> To: robots@webcrawler.com Subject: Alta Vista searches WHAT?!? X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There has been a concern raised on another list that I belong to, about the privacy implications of robots and such. The specific example was that the Alta Vista web crawler didn't only index linked documents, but any and all documents that it could find at a site! Is this true, and if so, how is it doing it? How does one keep documents private? I sure don't want my personal correspondence sitting out on someone's database just because my home directory happens to be readable! From owner-robots Tue Jan 16 12:48:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27956; Tue, 16 Jan 96 12:48:21 -0800 Comments: Authenticated sender is <jakob@cybernet.dk> From: "Jakob Faarvang" <jakob@jubii.dk> Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Tue, 16 Jan 96 21:49:19 +0100 (CET) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: New Robot Announcement Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Message-Id: 20491955603076@cybernet.dk X-Info: cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I was under the impression that most web servers were running on a UNIX box. > What else are you going to run a server on? I would argue that NT doesn't > have the horsepower, and tehre aren't a lot of alternatives. FYI: We run all our stuff on NT. Our web server currently handles more than 50 virtual domains and more than 50,000 hits per day without complaining. On a 32 MB Pentium 100. But let's not make this an OS war, for heaven's sake. - Jakob With kind regards, Jakob Faarvang Jubii / cybernet.dk -- Jakob Faarvang - jakob@jubii.dk / jakob@cybernet.dk Jubii - all of Denmark's World Wide Web database http://www.jubii.dk From owner-robots Tue Jan 16 13:24:47 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28843; Tue, 16 Jan 96 13:24:47 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140803ad21b7e9a71c@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 16 Jan 1996 13:25:45 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Alta Vista searches WHAT?!? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 1:05 PM 1/16/96, Ed Carp @ TSSUN5 wrote: >There has been a concern raised on another list that I belong to, about the >privacy implications of robots and such. The specific example was that the >Alta Vista web crawler didn't only index linked documents, but any and all >documents that it could find at a site! Is this true, and if so, how is it >doing it? What is this Alta-Vista vicious rumour mill stuff on the list recently? :-) If you have a question, ask the robot author, his email address is on the robots page...
It would also help if you included the complete referenced article -- I wouldn't be in the least surprised if the person's files in question were in fact reacheable from the web, and therefore findable by any browser or robot. >How does one keep documents private? I sure don't want my personal >correspondence sitting out on someone's database just because my home directory >happens to be readable! To protect pages from other people, configure your server to return "access denied" for them... -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Jan 16 13:34:19 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28920; Tue, 16 Jan 96 13:34:19 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140806ad21baf85f1b@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 16 Jan 1996 13:35:17 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: BOUNCE robots: Admin request Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Date: Tue, 16 Jan 96 03:28:31 -0800 From: <owner-robots> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Approved: robbie From s.nisbet@doc.mme.ac.uk Tue Jan 16 03:27:50 1996 Return-Path: <s.nisbet@doc.mme.ac.uk> Received: from ehlana.mmu.ac.uk by webcrawler.com (NX5.67f2/NX3.0M) id AA02048; Tue, 16 Jan 96 03:27:50 -0800 Received: from patsy.doc.aca.mmu.ac.uk by ehlana with SMTP (PP); Tue, 16 Jan 1996 11:26:46 +0100 Received: from raphael.doc.aca.mmu.ac.uk by patsy.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA20450; Tue, 16 Jan 96 11:26:28 GMT Received: from jd-e114-07.doc.aca.mmu.ac.uk by raphael.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA02693; Tue, 16 Jan 96 11:26:39 GMT Date: Tue, 16 Jan 96 11:26:38 GMT Message-Id: <9601161126.AA02693@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <s.nisbet@doc.mme.ac.uk> Subject: Re: Horror story Maybe Im being a little touchy, but I take exception to being 'educated'. I subscribe to the robots line and others, because Im interested and because I know my stuff and want to know where things are going. A great many 'Web Masters' do a lot more than run webs, which in turn require a lot of effort and a lot of seperate tasks. I suspect as Mordechai T. Abzug points out that the majority of sites dont need a robots.txt and maybe are not that interested in robots. Think about it, they already have a lot on their plates with the admin of their respectives webs as it is. Steve Nisbet Web Admin (and other web related stuff!!) Department of Computing and sub webs Manchester Metro Uni. 
http://www.doc.mmu.ac.uk/ -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Jan 16 13:50:00 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29112; Tue, 16 Jan 96 13:50:00 -0800 Message-Id: <199601162149.QAA26754@revere.musc.edu> Comments: Authenticated sender is <lindroth@atrium.musc.edu> From: "John Lindroth" <lindroth@musc.edu> Organization: Medical University of South Carolina To: robots@webcrawler.com Date: Tue, 16 Jan 1996 16:49:50 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: New Robot Announcement Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > >Exactly how is this better than Harvest, which is free? > > > > Harvest server software currently runs only on UNIX machines. Duppies was > > designed for the web serving community who either do not know and don't > > want to know UNIX or do not have a UNIX box available. > > I was under the impression that most web servers were running on a UNIX box. > What else are you going to run a server on? I would argue that NT doesn't > have the horsepower, and tehre aren't a lot of alternatives. Larry's original post stated that the robot would run under the MacOS. While our main server is on a unix workstation, many of our departments run on Macs. And with each department's info distributed on its own mac server, no single system gets a lot of hits. I can't say that I think that the Mac is a great platform to run a server, but I think they have identified a niche market that just might work. MHO, -John Lindroth MUSC Web Master ============================================= John Lindroth Senior Systems Programmer Academic & Research Computing Services Center for Computing & Information Technology Medical University of South Carolina E-Mail: lindroth@musc.edu URL: http://www.musc.edu/~lindroth ============================================= Any opinions expressed are mine, not my employer's. And they may be wrong (gasp!) ============================================= From owner-robots Tue Jan 16 14:55:36 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00539; Tue, 16 Jan 96 14:55:36 -0800 From: <monier@pa.dec.com> Message-Id: <9601162247.AA00329@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? In-Reply-To: Your message of "Tue, 16 Jan 96 13:05:36 CST." <9601161905.AA13813@tssun5.> Date: Tue, 16 Jan 96 14:47:51 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hum, one more time. Scooter, the robot behind Alta Vista, follows links, and only follows links. If the "directory browsing" option is enabled on a server, and someone publishes the URL for a directory, then the robots gets back a page of HTML which lists every file as a link, but that is not intentional. And yes, this has led to embarrassing situations, but again, it's not intentional. In the absence of strong conventions about directory names or file extensions it is hard for a robot to exclude anything a-priori. I wish it was easier... To keep a document private, list it in /robots.txt, password-protect it, change the protection on the file, or simpler: do not leave it in your Web hierarchy. Can you imagine what happens when someone uses / as web root, exposing for example the password file? It has happened! 
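For concreteness, here is a minimal sketch of the first and third of those options, in Perl, with invented path names; this is not Scooter's or any server's actual code, and password protection is server-specific so it is not shown.

    # Ask well-behaved robots to skip a subtree, then make a file unreadable
    # to the server itself.  All paths below are invented for the example.
    my $docroot = "/usr/local/etc/httpd/htdocs";   # hypothetical document root

    open(ROBOTS, ">>$docroot/robots.txt") || die "robots.txt: $!";
    print ROBOTS "User-agent: *\nDisallow: /private/\n";
    close(ROBOTS);

    # Clear group/other permission bits so the HTTP server (running as another
    # user, e.g. nobody) can no longer read the file at all.
    chmod(0600, "$docroot/private/notes.html") || warn "chmod: $!";

The chmod route only helps if the server really does run under a different user id; and listing a file in robots.txt, of course, only keeps out robots that honour the standard.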
Remember that what a robot does, anyone with a browser can do: find this private file and then post to usenet for example, robots have no magic powers! The bottom line is that the usual danger is not aggressive robots, but clueless Web masters. --Louis From owner-robots Tue Jan 16 15:02:51 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00626; Tue, 16 Jan 96 15:02:51 -0800 Message-Id: <30FC2E66.77FC@corp.micrognosis.com> Date: Tue, 16 Jan 1996 18:01:58 -0500 From: Adam Jack <ajack@corp.micrognosis.com> Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? References: <9601161905.AA13813@tssun5.> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ed Carp @ TSSUN5 wrote: > > There has been a concern raised [...] > Alta Vista [..] I think all on the list ought reply for Louis :) :) Adam P.S. Ed, That was raised here a week ago. It was a *mistake* -- Alta Vista doesn't access other than is linked. -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? From owner-robots Tue Jan 16 15:04:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00673; Tue, 16 Jan 96 15:04:11 -0800 Date: Wed, 17 Jan 96 00:00:01 +0100 Message-Id: <9601162300.AA03865@indy2> X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ<V^N`hYM;pD#v&!` _A?V7^R! X-Url: http://www-ihm.lri.fr/~tronche/ From: "Tronche Ch. le pitre" <Christophe.Tronche@lri.fr> To: robots@webcrawler.com In-Reply-To: <9601161905.AA13813@tssun5.> (ecarp@tssun5.dsccc.com) Subject: Re: Alta Vista searches WHAT?!? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > There has been a concern raised on another list that I belong to, about the > privacy implications of robots and such. The specific example was that the > Alta Vista web crawler didn't only index linked documents, but any and all > documents that it could find at a site! Is this true, and if so, how is it > doing it? How does one keep documents private? I sure don't want my personal > correspondence sitting out on someone's database just because my home directory > happens to be readable! Not only your home directory, but also your mail directory. And all of them are readable by anybody at your site (or at least by your HTTP server). Just stay cool. Alta Vista or any other robots will certainly not access data that couldn't be accessed by another mean, this is just a classical security issue. Speaking about privacy, I feel more concerned by being cross-indexed in multiple robots-built databases. For example, we may suppose that, after some years of a career, you've left behind you some data about you in many of the organizations you've worked for. A robot could collect all these data to create a file about you. Of course, none of these infos may be very "sensitive", but, from some kind of "holistic" point of view, their gathering would permit to infer some interesting properties about yourself... May be... 
By the way, if every data in the world become available on the World Wide Web, such as dictionaries, encyclopedia, personal files, and so on, the Web may become MUCH LARGER than it's now. Have we any evidence the index databases will be able to scale to this extent ? +--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From owner-robots Tue Jan 16 17:31:42 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03549; Tue, 16 Jan 96 17:31:42 -0800 Message-Id: <9601170131.AA03543@webcrawler.com> Date: Tue, 16 Jan 1996 17:30:00 -0800 From: Ted Sullivan <tsullivan@blizzard.snowymtn.com> Subject: RE: Alta Vista searches WHAT?!? To: robots <robots@webcrawler.com> X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com If you put your files in a file system area that the Web server has access to then they will get picked up. Robots cannot find things that they cannot see. It comes down to site security, publish what you desire the world to see and hide the rest. Ted Sullivan ---------- From: robots To: robots Subject: Alta Vista searches WHAT?!? Date: Wednesday, January 17, 1996 2:53PM There has been a concern raised on another list that I belong to, about the privacy implications of robots and such. The specific example was that the Alta Vista web crawler didn't only index linked documents, but any and all documents that it could find at a site! Is this true, and if so, how is it doing it? How does one keep documents private? I sure don't want my personal correspondence sitting out on someone's database just because my home directory happens to be readable! From owner-robots Tue Jan 16 19:05:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06396; Tue, 16 Jan 96 19:05:29 -0800 Message-Id: <v0213050cad221644aca5@[202.237.148.6]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 17 Jan 1996 12:00:59 +0900 To: robots@webcrawler.com From: mschrimsher@twics.com (Mark Schrimsher) Subject: Re: Alta Vista searches WHAT?!? Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 0:00 AM 1/17/96, Tronche Ch. le pitre wrote: >Speaking about privacy, I feel more concerned by being cross-indexed >in multiple robots-built databases. For example, we may suppose that, >after some years of a career, you've left behind you some data about >you in many of the organizations you've worked for. A robot could >collect all these data to create a file about you. Of course, none of >these infos may be very "sensitive", but, from some kind of "holistic" >point of view, their gathering would permit to infer some interesting >properties about yourself... May be... This is really the issue, I think, but the main problem is Usenet archives and mailing list archives, not indexing normal web pages. HTML documents tend to disappear, and presumably they would be eliminated from the robots index eventually. 
Archives tend to be permanent, and the participants in newsgroups and especially mailing lists are often not aware they're writing for posterity. --Mark From owner-robots Wed Jan 17 00:17:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19484; Wed, 17 Jan 96 00:17:37 -0800 Message-Id: <199601170817.JAA05933@storm.certix.fr> Comments: Authenticated sender is <savron@world-net.sct.fr> From: savron@world-net.sct.fr To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:07:02 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: robots.txt , authors of robots , webmasters .... Priority: normal X-Mailer: Pegasus Mail for Windows (v2.10) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com A few thoughts about the robots stuff : -- there should be no need to include a line such as : /cgi-bin/ in robots.txt because it should come as a standard of indexer robots The one exception I see is an automated query of search engines . -- Webmasters complaining about robots indexing partially built document trees . So why are they linked to the main tree ??? -- I agree with the proposed 'positive' extension of robots.txt to include 'these pages should score more than the others of my site' -- I don't understand why , if a web site is publicly accessible it shouldn't be indexable and so why there is a need for such a thing as robots.txt . -- Correct me if I'm wrong on this : If webmasters want to reserve access to certain pages to certain specific users they can do it , without needing to passwording it , by giving the pages names to these users and not linking them to the main tree . As robots follows the links they find and can't guess ( well , if you don't choose an obvious page name ) ( snoopers sort of robots ) you are pretty safe ( and if you really need it -- setup a password query form ( only a partial tree is reserved ) -- choose another port than 80 and password it too ( in case of a http port scanner sort of robot ) -- Why in the HTTP protocol there is not such an info about the required delay between to successive queries to the same server ( see the webmasters complaining about rapid fire queries from robots ) that the webserver should send in the header of each answer . If anyone wants to comment on this , I will be pleased to hear his opinion Bye Bye From owner-robots Wed Jan 17 00:17:38 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19488; Wed, 17 Jan 96 00:17:38 -0800 Message-Id: <199601170817.JAA05940@storm.certix.fr> Comments: Authenticated sender is <savron@world-net.sct.fr> From: savron@world-net.sct.fr To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:07:03 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Web robots and gopher space -- two separate worlds Priority: normal X-Mailer: Pegasus Mail for Windows (v2.10) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Why web robots doesn't follow gopher links when they step on one ? If anyone wants to comment , especially web robot authors , feel free Thanks a lot From owner-robots Wed Jan 17 02:22:58 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25105; Wed, 17 Jan 96 02:22:58 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601171023.LAA03915@wsinis10.win.tue.nl> Subject: Re: Alta Vista searches WHAT?!? 
To: robots@webcrawler.com Date: Wed, 17 Jan 1996 11:23:27 +0100 (MET) In-Reply-To: <9601161905.AA13813@tssun5.> from "Ed Carp @ TSSUN5" at Jan 16, 96 01:05:36 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 3747 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Ed Carp @ TSSUN5) write: > >There has been a concern raised on another list that I belong to, about the >privacy implications of robots and such. >The specific example was that the >Alta Vista web crawler didn't only index linked documents, but any and all >documents that it could find at a site! Did you also get the messages in which the author explained that this isn't true? >Is this true, and if so, how is it doing it? How does one keep documents >private? I sure don't want my personal correspondence sitting out on >someone's database just because my home directory happens to be readable! I have a big problem with your phrase 'happens to be'. There have been more discussions like this, in which people were quite happy to make a bunch of documents available without restriction, except to indexers. Their main idea was that it is common practice to keep documents 'out of sight' without actually indicating access restrictions explicitly. I think this is plainly wrong. On Unix, if you want to indicate who is allowed access to your files, you use file permissions. If a certain file of mine is world readable, the implication is that I, the author, intentionally allow the rest of the world to read my file. (Here, 'the world' means any user with access to the file system.) I have, occasionally, browsed other people's directories and found stuff that wasn't intended for me to be read; I always assumed a mistake on their part, and decided not to read on, as a matter of courtesy. But the mistake was theirs. The same principle has always been assumed on the Internet, I guess. Iif you serve files off a WWW server without access restrictions, you intend to make them available to the rest of the world. There is no way of knowing the purpose of the accesses you get for your documents: it may be an individual user, a WWW indexer, or a secret program operated by the FBI/Mossad/KGB/whoever to scan for suspect activities. It's the access permissions that specify your intentions, not the existence of explicit references to the files, or the set of users you have told the URLs to your site explicitly, or anything else. In my opinion, it's a mistake to accuse robots of malicious behaviour when all they do is find files that have been made available to them. robots.txt should be regarded as a service to robots, a way of saying: don't bother to index this, the results won't justify the load it will place on the network and on my system. To honour this is a matter of courtesy. If you don't want robots to get access to your documents at all, then set proper access restrictions on the documents themselves. The only problem I see is that 'the world' is not the same for everybody. For example, suppose user A wants all files to be readable for all other users on the system. To user A, 'the world' is all users on the system. User A makes all files world readable. Now suppose that user B runs a WWW server, making all files on the system available to the whole Internet. (User B will think twice before doing this on purpose, but it may be a configuration error.) Suddenly, user A's files have become available to the whole Internet community. 
Suppose that user C (a WWW indexer) finds user A's files. It is unreasonable for user A to blame C, when B is at fault. Obviously, there must be a way for A to correct the problem, and get the files removed from C's index. This is possible in most WWW indexers. But if A is indignant at the mere fact that C found his files, s/he's barking up the wrong tree. -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> From owner-robots Wed Jan 17 02:54:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26288; Wed, 17 Jan 96 02:54:11 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601171054.LAA04028@wsinis10.win.tue.nl> Subject: Re: robots.txt , authors of robots , webmasters .... To: robots@webcrawler.com Date: Wed, 17 Jan 1996 11:54:42 +0100 (MET) In-Reply-To: <199601170817.JAA05933@storm.certix.fr> from "savron@world-net.sct.fr" at Jan 17, 96 09:07:02 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 2079 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (savron@world-net.sct.fr) write: > >A few thoughts about the robots stuff : > >-- there should be no need to include a line such as : > /cgi-bin/ > in robots.txt > because it should come as a standard of indexer robots That would be a kludge. It doesn't identify CGI scripts exactly (I do not usually include /cgi-bin/ in references to my CGI scripts) and it is not necessary to exclude CGI scripts categorically (I sometimes serve a set of files through a CGI script). Furthermore, better heuristics exist (e.g. don't follow forms/POST requests). >-- Webmasters complaining about robots indexing partially built >document trees . So why are they linked to the main tree ??? Well, it would help if WWW servers took more pains to send accurate Expires: and Last-modified: headers. >-- I agree with the proposed 'positive' extension of robots.txt to >include 'these pages should score more than the others of my site' Perhaps, but once you're on that road, ALIWEB may be a better approach. >-- I don't understand why , if a web site is publicly accessible it >shouldn't be indexable and so why there is a need for such a thing as >robots.txt . Neither do I (see separate message). >-- Correct me if I'm wrong on this : If webmasters want to reserve >access to certain pages to certain specific users they can do it , >without needing to passwording it , by giving the pages names to >these users and not linking them to the main tree . Wrong (see that message): third parties have the right to poke for URLs, IMHO. Access restriction (password-based or otherwise) will do the job. >-- Why in the HTTP protocol there is not such an info about the >required delay between to successive queries to the same server ( see >the webmasters complaining about rapid fire queries from robots ) >that the webserver should send in the header of each answer . There is an HTTP response meaning "please don't return for a while, I'm busy".
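(That response is status 503, Service Unavailable -- see the draft referenced just below -- and the HTTP specs describe a Retry-After header to go with it. A sketch of the exchange in Perl, not any particular server's or robot's code; the 120-second figure is arbitrary:)

    # Server side: what such a reply looks like on the wire.
    my $reply = "HTTP/1.0 503 Service Unavailable\r\n"
              . "Retry-After: 120\r\n"                 # seconds to stay away
              . "Content-Type: text/plain\r\n"
              . "\r\n"
              . "Busy; please retry later.\r\n";
    print $reply;

    # Robot side: back off before re-queueing the URL.
    if ($reply =~ m{^HTTP/1\.\d 503}) {
        my ($wait) = $reply =~ /^Retry-After:\s*(\d+)/mi;
        sleep($wait || 600);                           # long default if no header given
    }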
http://www.w3.org/pub/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html#Code503 -- Reinier Post reinpost@win.tue.nl From owner-robots Wed Jan 17 06:43:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05128; Wed, 17 Jan 96 06:43:28 -0800 Message-Id: <m0tcZ2i-0009mqC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: robots.txt , authors of robots , webmasters .... To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:40:51 -0500 (EST) In-Reply-To: <199601170817.JAA05933@storm.certix.fr> from "savron@world-net.sct.fr" at Jan 17, 96 09:07:02 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 1801 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com savron@world-net.sct.fr wrote: > > A few thoughts about the robots stuff : > > -- there should be no need to include a line such as : > /cgi-bin/ > in robots.txt > because it should come as a standard of indexer robots > > The one exception I see is an automated query of search engines . > > -- Webmasters complaining about robots indexing partially built > document trees . So why are they linked to the main tree ??? > > -- I agree with the proposed 'positive' extension of robots.txt to > include 'these pages should score more than the others of my site' > > -- I don't understand why , if a web site is publicly accessible it > shouldn't be indexable and so why there is a need for such a thing as > robots.txt . > > -- Correct me if I'm wrong on this : If webmasters want to reserve > access to certain pages to certain specific users they can do it , > without needing to passwording it , by giving the pages names to > these users and not linking them to the main tree . > As robots follows the links they find and can't guess ( well , if you > don't choose an obvious page name ) ( snoopers sort of robots ) you > are pretty safe ( and if you really need it > > -- setup a password query form ( only a partial tree is reserved ) > -- choose another port than 80 and password it too ( in case of a > http port scanner sort of robot ) > > -- Why in the HTTP protocol there is not such an info about the > required delay between to successive queries to the same server ( see > the webmasters complaining about rapid fire queries from robots ) > that the webserver should send in the header of each answer . > > If anyone wants to comment on this , I will be pleased to hear his > opinion > > Bye Bye > > Please take me off of your list -- From owner-robots Wed Jan 17 06:44:03 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05157; Wed, 17 Jan 96 06:44:03 -0800 Message-Id: <m0tcZ3K-0009mqC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: Web robots and gopher space -- two separate worlds To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:41:30 -0500 (EST) In-Reply-To: <199601170817.JAA05940@storm.certix.fr> from "savron@world-net.sct.fr" at Jan 17, 96 09:07:03 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 237 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com savron@world-net.sct.fr wrote: > > Why web robots doesn't follow gopher links when they step on one ? 
> > If anyone wants to comment , especially web robot authors , feel free > > Thanks a lot > Please take me off of your list -- From owner-robots Wed Jan 17 06:45:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05226; Wed, 17 Jan 96 06:45:26 -0800 Message-Id: <m0tcZ4X-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: Alta Vista searches WHAT?!? To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:42:45 -0500 (EST) In-Reply-To: <199601171023.LAA03915@wsinis10.win.tue.nl> from "Reinier Post" at Jan 17, 96 11:23:27 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 3958 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reinier Post wrote: > > You (Ed Carp @ TSSUN5) write: > > > >There has been a concern raised on another list that I belong to, about the > >privacy implications of robots and such. > > >The specific example was that the > >Alta Vista web crawler didn't only index linked documents, but any and all > >documents that it could find at a site! > > Did you also get the messages in which the author explained that > this isn't true? > > >Is this true, and if so, how is it doing it? How does one keep documents > >private? I sure don't want my personal correspondence sitting out on > >someone's database just because my home directory happens to be readable! > > I have a big problem with your phrase 'happens to be'. > > There have been more discussions like this, in which people were quite happy > to make a bunch of documents available without restriction, except to indexers. > Their main idea was that it is common practice to keep documents 'out of > sight' without actually indicating access restrictions explicitly. I think > this is plainly wrong. On Unix, if you want to indicate who is allowed access > to your files, you use file permissions. If a certain file of mine is world > readable, the implication is that I, the author, intentionally allow the rest > of the world to read my file. (Here, 'the world' means any user with access > to the file system.) I have, occasionally, browsed other people's directories > and found stuff that wasn't intended for me to be read; I always assumed a > mistake on their part, and decided not to read on, as a matter of courtesy. > But the mistake was theirs. > > The same principle has always been assumed on the Internet, I guess. > Iif you serve files off a WWW server without access restrictions, > you intend to make them available to the rest of the world. > There is no way of knowing the purpose of the accesses you get for your > documents: it may be an individual user, a WWW indexer, or a secret program > operated by the FBI/Mossad/KGB/whoever to scan for suspect activities. > > It's the access permissions that specify your intentions, not the existence > of explicit references to the files, or the set of users you have told > the URLs to your site explicitly, or anything else. > > In my opinion, it's a mistake to accuse robots of malicious behaviour > when all they do is find files that have been made available to them. > > robots.txt should be regarded as a service to robots, a way of saying: > don't bother to index this, the results won't justify the load it will > place on the network and on my system. To honour this is a matter of > courtesy. If you don't want robots to get access to your documents at > all, then set proper access restrictions on the documents themselves. 
> > The only problem I see is that 'the world' is not the same for everybody. > > For example, suppose user A wants all files to be readable for all > other users on the system. To user A, 'the world' is all users > on the system. User A makes all files world readable. > > Now suppose that user B runs a WWW server, making all files on the system > a vailable to the whole Internet. (User B will think twice before doing this > on purpose, but it may be a configuration error.) Suddenly, user A's files > have become available to the whole Internet community. Suppose that user C > (a WWW indexer) finds user A's files. It is unreasonable for user A to > blame C, when B is at fault. Obviously, there must be a way for A to correct > the problem, and get the files removed from C's index. This is possible in > most WWW indexers. But if A is indignant at the mere fact that C found his > files, s/he's barking up the wrong tree. > > -- > Reinier Post reinpost@win.tue.nl > a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> > [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] > Please take me off of your list -- From owner-robots Wed Jan 17 06:46:08 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05293; Wed, 17 Jan 96 06:46:08 -0800 Message-Id: <m0tcZ5E-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: robots.txt , authors of robots , webmasters ....OMOMOM[D To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:43:28 -0500 (EST) In-Reply-To: <199601171054.LAA04028@wsinis10.win.tue.nl> from "Reinier Post" at Jan 17, 96 11:54:42 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 2213 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reinier Post wrote: > > You (savron@world-net.sct.fr) write: > > > >A few thoughts about the robots stuff : > > > >-- there should be no need to include a line such as : > > /cgi-bin/ > > in robots.txt > > because it should come as a standard of indexer robots > > That would be a kludge. It doesn't identify CGI scripts exactly > (I do not usually include /cgi-bin/ in references to my CGI scripts) > and it is not necessary tp exclude CGI scripts categorically > (I sometimes serve a set of files through a CGI script). Furthermore, > netter heuristics exist (eg. don't follow forms/POST requests). > > >-- Webmasters complaining about robots indexing partially built > >document trees . So why are they linked to the main tree ??? > > Well, it would help if WWW servers took more pains to send accurate > Expires: and Last-modified: headers. > > >-- I agree with the proposed 'positive' extension of robots.txt to > >include 'these pages should score more than the others of my site' > > Perhaps, but once you're on that road, ALIWEB may be a better approach. > > >-- I don't understand why , if a web site is publicly accessible it > >shouldn't be indexable and so why there is a need for such a thing as > >robots.txt . > > Neither do I (see separate message). > > >-- Correct me if I'm wrong on this : If webmasters want to reserve > >access to certain pages to certain specific users they can do it , > >without needing to passwording it , by giving the pages names to > >these users and not linking them to the main tree . > > Wrong (see that message): third parties have the right to poke for URLs, IMHO. > Access restriction (password-based or otherwise) will do the job. 
> > >-- Why in the HTTP protocol there is not such an info about the > >required delay between to successive queries to the same server ( see > >the webmasters complaining about rapid fire queries from robots ) > >that the webserver should send in the header of each answer . > > There is an HTTP response meaning "please don't return for a while, I'm busy". > > http://www.w3.org/pub/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html#Code503 > > -- > Reinier Post reinpost@win.tue.nl > -- From owner-robots Wed Jan 17 06:47:13 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05366; Wed, 17 Jan 96 06:47:13 -0800 Message-Id: <m0tcZ6N-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: robots.txt , authors of robots , webmasters ....OM To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:44:39 -0500 (EST) In-Reply-To: <199601171054.LAA04028@wsinis10.win.tue.nl> from "Reinier Post" at Jan 17, 96 11:54:42 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 2246 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Reinier Post wrote: > > You (savron@world-net.sct.fr) write: > > > >A few thoughts about the robots stuff : > > > >-- there should be no need to include a line such as : > > /cgi-bin/ > > in robots.txt > > because it should come as a standard of indexer robots > > That would be a kludge. It doesn't identify CGI scripts exactly > (I do not usually include /cgi-bin/ in references to my CGI scripts) > and it is not necessary tp exclude CGI scripts categorically > (I sometimes serve a set of files through a CGI script). Furthermore, > netter heuristics exist (eg. don't follow forms/POST requests). > > >-- Webmasters complaining about robots indexing partially built > >document trees . So why are they linked to the main tree ??? > > Well, it would help if WWW servers took more pains to send accurate > Expires: and Last-modified: headers. > > >-- I agree with the proposed 'positive' extension of robots.txt to > >include 'these pages should score more than the others of my site' > > Perhaps, but once you're on that road, ALIWEB may be a better approach. > > >-- I don't understand why , if a web site is publicly accessible it > >shouldn't be indexable and so why there is a need for such a thing as > >robots.txt . > > Neither do I (see separate message). > > >-- Correct me if I'm wrong on this : If webmasters want to reserve > >access to certain pages to certain specific users they can do it , > >without needing to passwording it , by giving the pages names to > >these users and not linking them to the main tree . > > Wrong (see that message): third parties have the right to poke for URLs, IMHO. > Access restriction (password-based or otherwise) will do the job. > > >-- Why in the HTTP protocol there is not such an info about the > >required delay between to successive queries to the same server ( see > >the webmasters complaining about rapid fire queries from robots ) > >that the webserver should send in the header of each answer . > > There is an HTTP response meaning "please don't return for a while, I'm busy". 
> > http://www.w3.org/pub/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html#Code503 > > -- > Reinier Post reinpost@win.tue.nl > Please take me off of your list -- From owner-robots Wed Jan 17 06:48:47 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05454; Wed, 17 Jan 96 06:48:47 -0800 Message-Id: <m0tcZ1Q-0009mrC@walnut.holli.com> From: wlamb@walnut.holli.com (Wayne Lamb) Subject: Re: Alta Vista searches WHAT?!? To: robots@webcrawler.com Date: Wed, 17 Jan 1996 09:39:32 -0500 (EST) In-Reply-To: <v0213050cad221644aca5@[202.237.148.6]> from "Mark Schrimsher" at Jan 17, 96 12:00:59 pm X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 1088 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark Schrimsher wrote: > > At 0:00 AM 1/17/96, Tronche Ch. le pitre wrote: > >Speaking about privacy, I feel more concerned by being cross-indexed > >in multiple robots-built databases. For example, we may suppose that, > >after some years of a career, you've left behind you some data about > >you in many of the organizations you've worked for. A robot could > >collect all these data to create a file about you. Of course, none of > >these infos may be very "sensitive", but, from some kind of "holistic" > >point of view, their gathering would permit to infer some interesting > >properties about yourself... May be... > > This is really the issue, I think, but the main problem is Usenet archives > and mailing list archives, not indexing normal web pages. HTML documents > tend to disappear, and presumably they would be eliminated from the robots > index eventually. Archives tend to be permanent, and the participants in > newsgroups and especially mailing lists are often not aware they're writing > for posterity. > > --Mark > Please take me off you list of mail> > -- From owner-robots Wed Jan 17 08:22:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10530; Wed, 17 Jan 96 08:22:21 -0800 Date: Wed, 17 Jan 1996 09:35:00 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601171535.AA07837@tssun5.> To: robots@webcrawler.com Subject: Re: New Robot Announcement X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From owner-robots@webcrawler.com Tue Jan 16 17:55 CST 1996 > Date: Tue, 16 Jan 1996 14:17:25 -0500 > From: David@interworld.com (David Levine) > Mime-Version: 1.0 > To: robots@webcrawler.com > Subject: Re: New Robot Announcement > Content-Transfer-Encoding: 7bit > > Ed Carp @ TSSUN5 wrote: > > I was under the impression that most web servers were running > > on a UNIX box. > > What else are you going to run a server on? I would argue > > that NT doesn't > > have the horsepower, and tehre aren't a lot of alternatives. > > > NT can be extremely powerful when running on a Dec Alpha. So can linux ;) From owner-robots Wed Jan 17 10:47:41 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19080; Wed, 17 Jan 96 10:47:41 -0800 Message-Id: <199601171847.KAA05176@mir.cs.washington.edu> In-Reply-To: reinpost@win.tue.nl's message of Wed, 17 Jan 1996 11:23:27 +0100 (MET) To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? 
References: <199601171023.LAA03915@wsinis10.win.tue.nl> Date: Wed, 17 Jan 1996 10:47:36 PST From: Erik Selberg <speed@cs.washington.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Here's a slightly different tack --- While I think that the /robots.txt is very nice, I don't think it's a worthwhile, or even workable, solution to the Idiot's Security Problem. The Idiot's Security Problem: this is when an idiot I puts some private data P on the Web but attempts to keep them private, by having either a subtle link somewhere or none at all. Later, a robot R finds data P and puts it in some database D. Now, the /robots.txt won't do a bit of good here. Why? Because (a) robots don't have to support the robots.txt file, and (b) because the goal is to keep said data _private_ from everyone, not just robots. The problem is that users feel that hiding data is a good solution to security. Robots just publicly announce that security of that form is bogus. The issue people have with robots I think is bogus; what they should be addressing is that there needs to be a better form of protection on the Web, or at least a more intuitive method of setting access control lists than the funky .htaccess file stuff (or at least a better UI!). -Erik From owner-robots Wed Jan 17 10:48:19 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19145; Wed, 17 Jan 96 10:48:19 -0800 Date: Wed, 17 Jan 1996 10:59:43 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... In-Reply-To: <199601171054.LAA04028@wsinis10.win.tue.nl> Message-Id: <Pine.LNX.3.91.960117104601.18185A-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Wed, 17 Jan 1996, Reinier Post wrote: > You (savron@world-net.sct.fr) write: > > > >A few thoughts about the robots stuff : > > > >-- there should be no need to include a line such as : > > /cgi-bin/ > > in robots.txt > > because it should come as a standard of indexer robots > > That would be a kludge. It doesn't identify CGI scripts exactly > (I do not usually include /cgi-bin/ in references to my CGI scripts) > and it is not necessary tp exclude CGI scripts categorically > (I sometimes serve a set of files through a CGI script). Furthermore, > netter heuristics exist (eg. don't follow forms/POST requests). And then you risk falling down rat holes like Usenet archives. I have *over* 100,000 archived Usenet articles online on the Web via my Usenet-Web software. The links are all GET to facilitate bookmarking. Now - I know enough to have a robots.txt file blocking that tree from indexing. But many of the people who have downloaded my software (many hundreds of people) are unlikely to use robots.txt. But since the installation instructions will generally lead people to put the script in /cgi-bin/ - a smart indexer will avoid it because /cgi-bin/ is dangerous in general to index. 
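(For concreteness, a crawler-side link filter built from the kind of patterns listed just below might look like this -- a sketch only, not Usenet-Web or any real indexer's code; the URLs are invented for the example:)

    # Skip URLs that look like script invocations rather than plain documents.
    my @skip_patterns = ('\.pl$', '\.cgi$', '\?', 'cgi-bin');

    sub worth_fetching {
        my ($url) = @_;
        foreach my $pat (@skip_patterns) {
            return 0 if $url =~ /$pat/;
        }
        return 1;
    }

    print worth_fetching("http://www.example.com/cgi-bin/usenet?msg=1") ? "yes\n" : "no\n";  # no
    print worth_fetching("http://www.example.com/docs/manual.html")     ? "yes\n" : "no\n";  # yes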
It is very wise in general to avoid all links that match any of these regexs: \.pl$ \.cgi$ \?.*$ cgi-bin -- Benjamin Franz From owner-robots Wed Jan 17 12:18:26 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24747; Wed, 17 Jan 96 12:18:26 -0800 Message-Id: <9601172019.AA05400@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Wed, 17 Jan 96 14:20:11 -0600 To: robots@webcrawler.com Subject: Re: New Robot Announcement References: <9601171535.AA07837@tssun5.> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Begin forwarded message: > > Date: Wed, 17 Jan 1996 09:35:00 -0600 > From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) > To: robots@webcrawler.com > Subject: Re: New Robot Announcement > X-Sun-Charset: US-ASCII > Sender: owner-robots@webcrawler.com > Reply-To: robots@webcrawler.com > > > > From owner-robots@webcrawler.com Tue Jan 16 17:55 CST 1996 > > Date: Tue, 16 Jan 1996 14:17:25 -0500 > > From: David@interworld.com (David Levine) > > Mime-Version: 1.0 > > To: robots@webcrawler.com > > Subject: Re: New Robot Announcement > > Content-Transfer-Encoding: 7bit > > > > Ed Carp @ TSSUN5 wrote: > > > I was under the impression that most web servers were running > > > on a UNIX box. > > > What else are you going to run a server on? I would argue > > > that NT doesn't > > > have the horsepower, and tehre aren't a lot of alternatives. > > > > > > NT can be extremely powerful when running on a Dec Alpha. > > So can linux ;) > So can a Vic 20.. But thats not the real problem is it. Hell any machine can be real powerful if you let it. Half of you would laugh if I told you we were runnign our Webserver on an Apple Quadra 950 running A/UX - apples version of UNIX but then again it does have 80Megs of RAM which makes it hum From owner-robots Wed Jan 17 13:20:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28380; Wed, 17 Jan 96 13:20:25 -0800 Date: Wed, 17 Jan 96 16:19:00 EST From: "Jim Meritt" <jmeritt@smtpinet.aspensys.com> Message-Id: <9600178219.AA821926152@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Re: BOUNCE robots: Admin request Content-Length: 2245 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com So maybe those particular folks shouldn't get on mailing lists talking about it? 
Jim ______________________________ Reply Separator _________________________________ Subject: BOUNCE robots: Admin request Author: robots@webcrawler.com at SMTPINET Date: 1/16/96 7:24 PM Date: Tue, 16 Jan 96 03:28:31 -0800 From: <owner-robots> To: owner-robots Subject: BOUNCE robots: Admin request X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com Approved: robbie From s.nisbet@doc.mme.ac.uk Tue Jan 16 03:27:50 1996 Return-Path: <s.nisbet@doc.mme.ac.uk> Received: from ehlana.mmu.ac.uk by webcrawler.com (NX5.67f2/NX3.0M) id AA02048; Tue, 16 Jan 96 03:27:50 -0800 Received: from patsy.doc.aca.mmu.ac.uk by ehlana with SMTP (PP); Tue, 16 Jan 1996 11:26:46 +0100 Received: from raphael.doc.aca.mmu.ac.uk by patsy.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA20450; Tue, 16 Jan 96 11:26:28 GMT Received: from jd-e114-07.doc.aca.mmu.ac.uk by raphael.doc.aca.mmu.ac.uk (4.1/SMI-4.1) id AA02693; Tue, 16 Jan 96 11:26:39 GMT Date: Tue, 16 Jan 96 11:26:38 GMT Message-Id: <9601161126.AA02693@raphael.doc.aca.mmu.ac.uk> X-Sender: steven@raphael.doc.aca.mmu.ac.uk X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Steve Nisbet <s.nisbet@doc.mme.ac.uk> Subject: Re: Horror story Maybe Im being a little touchy, but I take exception to being 'educated'. I subscribe to the robots line and others, because Im interested and because I know my stuff and want to know where things are going. A great many 'Web Masters' do a lot more than run webs, which in turn require a lot of effort and a lot of seperate tasks. I suspect as Mordechai T. Abzug points out that the majority of sites dont need a robots.txt and maybe are not that interested in robots. Think about it, they already have a lot on their plates with the admin of their respectives webs as it is. Steve Nisbet Web Admin (and other web related stuff!!) Department of Computing and sub webs Manchester Metro Uni. http://www.doc.mmu.ac.uk/ -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Thu Jan 18 00:24:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09158; Thu, 18 Jan 96 00:24:37 -0800 To: robots@webcrawler.com, w3-search@rodem.slab.ntt.com, NCGUR@uccmvsa.ucop.edu, www-vrml@wired.com Cc: amf@pdp.crl.sony.co.jp Subject: [ANNOUNCE] CFP: AAAI-96 WS on Internet-based Information Systems Date: Thu, 18 Jan 96 17:23:19 +0900 Message-Id: <8724.821953414@orange> From: Alexander Franz <amf@pdp.crl.sony.co.jp> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Call for Papers (brief version) AAAI-96 Workshop on Internet-based Information Systems August 4 or 5, Portland, Oregon The purpose of this workshop is to examine the state of the art, and explore the future, of network-based systems for browsing, searching, and sharing information in text and other forms. The focus will be on interactivity and Artificial Intelligence techniques. We solicit submissions relevant to these areas. Electronic submissions are due by March 18, 1996. 
For full details, please see the workshop home page: http://www.cs.cmu.edu/~amf/iis96.html From owner-robots Thu Jan 18 06:24:32 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26215; Thu, 18 Jan 96 06:24:32 -0800 Date: Thu, 18 Jan 1996 10:36:33 GMT Message-Id: <199601181036.KAA06151@admin.nj.devry.edu> X-Sender: bsran@admin.nj.devry.edu X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: bsran@admin.nj.devry.edu (Bhupinder S. Sran) Subject: Robot Research Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi. I am looking for materials to help me compare the various search engines on the web as a part of my research as a Ph.D student in Information Management. I have got plenty of information from the links provided on the search pages. I would appreciate if you could point me to any research on the search engines or to a source where I can get more information about how each engine works (e.g. How does it index the documents, how does it rank them, what are the theoretical basis for the search engine, etc) Bhupinder S. Sran :) :> :-) :> :) :-> :} :] :-) :> :} :> :-) :) :> :) :} :-) :) :> :) :> SMILE! It makes everyone wonder what you are up to :) Bhupinder S. Sran, Professor, CIS Department DeVry Technical Institute, Woodbridge, NJ 07095 Email: bsran@admin.nj.devry.edu Phone: 908-634-3460 Home Page: http://admin.nj.devry.edu/~bsran :) :> :-) :> :) :-> :} :] :-) :> :} :> :-) :) :> :) :} :-) :) :> :) :> From owner-robots Thu Jan 18 08:24:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04100; Thu, 18 Jan 96 08:24:21 -0800 Message-Id: <199601181623.LAA12548@mail.internet.com> Comments: Authenticated sender is <raisch@mail.internet.com> From: "Robert Raisch, The Internet Company" <raisch@internet.com> Organization: The Internet Company To: robots@webcrawler.com Date: Thu, 18 Jan 1996 11:20:15 -0400 Subject: Re: robots.txt , authors of robots , webmasters .... Priority: normal X-Mailer: Pegasus Mail for Windows (v2.01) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On 17 Jan 96 at 11:54, Reinier Post wrote: > Wrong (see that message): third parties have the right to poke for URLs, IMHO. > Access restriction (password-based or otherwise) will do the job. Quite frankly, I am surprised and extremely dismayed at this comment. Previously, I wrongly accused Alta-Vista of indexing pages that I had no interest in having indexed. It turned out that rather than poking each TCP port for an HTTP server, Alta-Vista actually did what every other 'bot does and follows all the links it can find. I spent some tube-time sleuthing and discovered that the pages were indeed referenced from other, generally accessible pages. I now believe my indignation at the possibility of this port-poking behavior was based on two separate considerations: 1. that the poking of ports would impose an unwelcome burden on my servers, and 2. that there are indeed pages I would not like to publish broadly that are nonetheless available behind ports I don't share with others. Having put the first issue to rest, it is now this second idea that attracts my attention. Where did we get the idea that just because a thing is accessible, that that gives us the moral right to access it, perhaps against the interests of its owner? 
In another message, Reinier states his belief that if a user makes the mistake of exposing his home directory to the web, that we (as robot owners) can index anything we find there with impunity; that the error is on the part of the web-master and not on the part of the robot's designer. Let me see if I understand Reinier's point and can perhaps state it another way: If I leave my house unlocked, I have given my permission for any and all to come in and read my personal papers. Does this strike anyone else as somewhat absurd? In our enthusiasm to become the cartographers of this new region of the information universe, do we not run the risk of violating the privacy of the indigenous peoples we find there? I believe that this "-WE- are the most comprehensive index of cyberspace" mentality is very dangerous and suggests a kind of information vigiliantism that I find personally distasteful. Perhaps what is really needed is a reevaluation of the role of the robots.txt file. If we take the stance, as I believe we should, that the decision to be indexed belongs in the hands of the owner of the data, not in the mechanical claws of wild roving robots, the robots.txt file should become the a source of permission not exclusion from indexing. And most importantly, that the expectation should be one of privacy, not exposure. In other words, we should not index a web-site if there is no robots.txt file to be retrieved that gives explicit permission to do so. Do any others feel as I do that control over use of my information is my responsibility and mine alone? That the assumption should be not to index a site that has not explicitly given permission to be indexed? (I don't expect much agreement here, to be honest. But I thought I would ask.) It should be noted that there is a fairly strong case to be made that a robot threshing through a non-published web site is an illegal activity under the abuse of computing facilities statute in U.S. law. </rr> From owner-robots Thu Jan 18 08:52:34 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06048; Thu, 18 Jan 96 08:52:34 -0800 Date: Thu, 18 Jan 1996 09:03:56 -0800 (PST) From: Benjamin Franz <snowhare@netimages.com> X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... In-Reply-To: <199601181623.LAA12548@mail.internet.com> Message-Id: <Pine.LNX.3.91.960118084915.21806B-100000@ns.viet.net> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 18 Jan 1996, Robert Raisch, The Internet Company wrote: > > Perhaps what is really needed is a reevaluation of the role of > the robots.txt file. If we take the stance, as I believe we > should, that the decision to be indexed belongs in the hands of > the owner of the data, not in the mechanical claws of wild > roving robots, the robots.txt file should become the a source of > permission not exclusion from indexing. And most importantly, > that the expectation should be one of privacy, not exposure. > > In other words, we should not index a web-site if there is no > robots.txt file to be retrieved that gives explicit permission > to do so. If you will review recent messages here you will discover that only about 5% of sites *have* a robots.txt file. This means that using the prescription of 'don't index unless there is a robots.txt file' would result in about one site in twenty being indexed *at best*. 
Because of there being such a low probability of a site with a robots.txt file linking to *another* site with a robots.txt file, the reality would be orders of magnitude worse that that. A robot would have to be exceptionally lucky to find a few hundred sites that way. In other words - it completely destroys the usefullness of robots for resource discovery. It is, and must be, the responsibility of each site to provide their own document security. If you don't want your pages indexed - add access control or *don't put them on the web*. It is *trivial* on most servers to block directory trees from remote access. You could even specifically target the search engines for blocking. If you don't want people reading your material - don't leave it on the table in the reading room of the library (which is what you are doing when you place documents on the WWW with no access control). -- Benjamin Franz From owner-robots Thu Jan 18 10:15:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11613; Thu, 18 Jan 96 10:15:21 -0800 Subject: Re: Re: robots.txt , authors of robots , webmasters .... From: Larry Burke <lburke@aktiv.com> To: <robots@webcrawler.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Date: Thu, 18 Jan 1996 10:14:55 -0800 Message-Id: <1390162401-3861802@gco.gov.bc.ca> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >If I leave my house unlocked, I have >given my permission for any and all to come in and read my >personal papers. Does this strike anyone else as somewhat >absurd? It is generally accepted in modern society that one should not enter into someone elses home without permission. A web server, by its very nature, invites public access. >Do any others feel as I do that control over use of my >information is my responsibility and mine alone? That the >assumption should be not to index a site that has not explicitly >given permission to be indexed? (I don't expect much agreement >here, to be honest. But I thought I would ask.) It is unfortunate that the robots.txt standard supports exclusion and not permission. The HTTP standard should have had indexing permission built right into it such that all servers would support some type of call that tells robots where they are allowed to go. This could have made it necessary for permission and denial to be set up during server configuration. -------------------- Larry Burke AKTIV Software Victoria, B.C. email: lburke@aktiv.com web: www.aktiv.com phone: 604.383.4195 From owner-robots Thu Jan 18 11:21:20 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16047; Thu, 18 Jan 96 11:21:20 -0800 Message-Id: <9601181922.AA06192@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Thu, 18 Jan 96 13:22:37 -0600 To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... References: <199601181623.LAA12548@mail.internet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > > Begin forwarded message: > > > Previously, I wrongly accused Alta-Vista of indexing pages that > I had no interest in having indexed. It turned out that rather > than poking each TCP port for an HTTP server, Alta-Vista > actually did what every other 'bot does and follows all the > links it can find. I spent some tube-time sleuthing and > discovered that the pages were indeed referenced from other, > generally accessible pages. 
> > I now believe my indignation at the possibility of this > port-poking behavior was based on two separate considerations: > > 1. that the poking of ports would impose an unwelcome > burden on my servers, and First I dont think that too many web-robot writers would write it so that it would probe all ports on a machine, rather would write the option to look at other ports on the runners command or if they were to find a differant port mentioned in a url. > > Where did we get the idea that just because a thing is > accessible, that that gives us the moral right to access it, > perhaps against the interests of its owner > > In another message, Reinier states his belief that if a user > makes the mistake of exposing his home directory to the web, > that we (as robot owners) can index anything we find there with > impunity; that the error is on the part of the web-master and > not on the part of the robot's designer. > > Let me see if I understand Reinier's point and can perhaps > state it another way: If I leave my house unlocked, I have > given my permission for any and all to come in and read my > personal papers. Does this strike anyone else as somewhat > absurd? > > In our enthusiasm to become the cartographers of this new > region of the information universe, do we not run the risk of > violating the privacy of the indigenous peoples we find there? > > I believe that this "-WE- are the most comprehensive index of > cyberspace" mentality is very dangerous and suggests a kind of > information vigiliantism that I find personally distasteful. > > Perhaps what is really needed is a reevaluation of the role of > the robots.txt file. If we take the stance, as I believe we > should, that the decision to be indexed belongs in the hands of > the owner of the data, not in the mechanical claws of wild > roving robots, the robots.txt file should become the a source of > permission not exclusion from indexing. And most importantly, > that the expectation should be one of privacy, not exposure. > > In other words, we should not index a web-site if there is no > robots.txt file to be retrieved that gives explicit permission > to do so. > > It should be noted that there is a fairly strong case to be > made that a robot threshing through a non-published web site is > an illegal activity under the abuse of computing facilities > statute in U.S. law. First off I do think that we as computer users on Unix systems think that there should be some level of protection of our documents that if it was intended to be private then they should be protected. But in the other case if they set up a web direcory then they are saying that this information is PUBLIC and any one that wishes to search it out can freely look at it. They should take the time and trouble to lock it up and make it so that no one but the intended people can see it. True if I leave my house unlocked I dont want anyone going into it but that is the risk I take isnt it? Also I feel that no one is realy out there doing a ip sweap of every number out there trying to connect to port 80 as of yet to find every server they possibley can, not only would that take forever but would put a big burden on not only there machine and users but the time wasted to find what??? I also feel that it is up to the web robot writers to share in some responsiblity to write robots that do not try to go out of the WWW published directorys and maybe themselfs just not even look into folders that might be of test related documents.. 
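(As a concrete illustration of the exclusion convention this thread keeps coming back to: a /robots.txt along the following lines is enough to keep compliant robots out of an unpublished test area. The directory names here are made up, and, as noted elsewhere in the discussion, the file only binds robots that choose to honour it.)

# hypothetical /robots.txt
User-agent: *
Disallow: /test/
Disallow: /private/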
Why look into a folder called test unless there is a freely published html document that refers to it? From owner-robots Thu Jan 18 13:13:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23266; Thu, 18 Jan 96 13:13:25 -0800 Date: Thu, 18 Jan 1996 14:27:35 -0600 From: ecarp@tssun5.dsccc.com (Ed Carp @ TSSUN5) Message-Id: <9601182027.AA24864@tssun5.> To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > It should be noted that there is a fairly strong case to be > made that a robot threshing through a non-published web site is > an illegal activity under the abuse of computing facilities > statute in U.S. law. I doubt it. In the first place, if you put up a web server on a well-known port, there isn't a DA in this country that will support a prosecution based on this, even if the site isn't "published". First of all, if you don't want the site accessed on that port, it's *your* responsibility to protect it. That's why we have login and password programs - if you don't have a modicum of protection on your site, the courts will take a very dim view of you trying to get someone nailed. Doing probes on other ports ("twisting the knobs", as it's called) to see "what's out there" is generally considered to be an unfriendly act, though. I think it's patently absurd to suggest that robots by default have no right to access your pages - if you don't want anyone looking at your pages, why put up a web site, if not just for your own self-aggrandizement? From owner-robots Thu Jan 18 13:44:28 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25465; Thu, 18 Jan 96 13:44:28 -0800 Date: Thu, 18 Jan 1996 22:37:34 +0100 (GMT+0100) From: Carlos Baquero <cbm@di.uminho.pt> To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... In-Reply-To: <199601181623.LAA12548@mail.internet.com> Message-Id: <Pine.LNX.3.91.960118222336.97C-100000@poe.di.uminho.pt> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1203 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 18 Jan 1996, Robert Raisch, The Internet Company wrote: > Do any others feel as I do that control over use of my > information is my responsibility and mine alone? That the > assumption should be not to index a site that has not explicitly > given permission to be indexed? (I don't expect much agreement > here, to be honest. But I thought I would ask.) > I have some sympathy for your argument but I cannot agree. Suppose that the mass media needed explicit authorization to publish info or photographs of public activities. There wouldn't be too much info for the public, I guess. And that would be very bad ... I think that there is a legal notion of public and private places. It would be invasive to publish a photo of myself inside my house, taken through the window, but once I get into the street I am aware that a photo of mine can appear in a newspaper. I do think that unprotected places published on the web are public places by default.
Carlos Baquero Distributed Systems Fax +351 (53) 612954 University of Minho, Portugal Voice +351 (53) 604475 cbm@di.uminho.pt http://shiva.di.uminho.pt/~cbm From owner-robots Thu Jan 18 15:54:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04434; Thu, 18 Jan 96 15:54:44 -0800 Message-Id: <30FEDED0.5526@corp.micrognosis.com> Date: Thu, 18 Jan 1996 18:59:12 -0500 From: Adam Jack <ajack@corp.micrognosis.com> Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0b3 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: robots.txt , authors of robots , webmasters .... References: <Pine.LNX.3.91.960118084915.21806B-100000@ns.viet.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Benjamin Franz wrote: > > On Thu, 18 Jan 1996, Robert Raisch, The Internet Company wrote: > > > In other words, we should not index a web-site if there is no > > robots.txt file to be retrieved that gives explicit permission > > to do so. > > In other words - it completely destroys the usefullness of robots for > resource discovery. > I wonder whether it would, instead, be the single best mechanism for mass education. What if a number of the major robots proclaimed that this was to be the case as of XXXXX date? No longer would a site be accessed (for either update or review) -- if it didn't have a robots.txt. I doubt these robots would lose out -- since they, undoubtedly, have not completed a full WWW index. They'd still have plenty of information -- and it would probably be of increased quality also.... People who want their data accessed would conform. Editing a robots.txt is no more difficult an activity than using submit-it. Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? From owner-robots Thu Jan 18 17:06:48 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07788; Thu, 18 Jan 96 17:06:48 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601190107.CAA13141@wsinis10.win.tue.nl> Subject: Re: robots.txt , authors of robots , webmasters .... To: robots@webcrawler.com Date: Fri, 19 Jan 1996 02:07:26 +0100 (MET) In-Reply-To: <199601181623.LAA12548@mail.internet.com> from "Robert Raisch, The Internet Company" at Jan 18, 96 11:20:15 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 3313 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Robert Raisch, The Internet Company) write: >Where did we get the idea that just because a thing is >accessible, that that gives us the moral right to access it, >perhaps against the interests of its owner? I stated my source: the Unix environment. You are misrepresenting what I said. There is no such moral right; the normal obligation remains to notify people of likely mistakes. If I leave my personal papers on the shelves in a public library, you have the moral right to open them. You do not have the moral right to take advantage of what is obviously a mistake, once you discover it. BTW, the attitude that a robot is just like a human user probably stems from the Unix environment, too. Under Unix, people routinely scan the whole filesystem in order to search for information.
The Internet is very Unix-minded; rather typical is AFS, a single global world-wide Unix-like file system where everyone is supposed to hook up their own files. >In another message, Reinier states his belief that if a user >makes the mistake of exposing his home directory to the web, >that we (as robot owners) can index anything we find there with >impunity; Yes; provided that the usual amount of care and politeness is observed, and under the moral obligation to correct mistakes once they are reported. You left that out in your summary. >that the error is on the part of the web-master and >not on the part of the robot's designer. That's what I think. >Let me see if I understand Reinier's point and can perhaps >state it another way: If I leave my house unlocked, I have >given my permission for any and all to come in and read my >personal papers. Does this strike anyone else as somewhat >absurd? Yes. I regard a WWW site like a public exhibit (maybe in someone's backyard). Not as a person's private home. >In our enthusiasm to become the cartographers of this new >region of the information universe, do we not run the risk of >violating the privacy of the indigenous peoples we find there? Indigenous people gain access to the Internet, not the other way round. (Except through malicious attacks and sloppy Webmasters.) >I believe that this "-WE- are the most comprehensive index of >cyberspace" mentality is very dangerous and suggests a kind of >information vigiliantism that I find personally distasteful. Many people hold your views, many hold mine. I don't think there's an easy solution. (Like you, I grew nervous when I found out how much Altavista knows about me.) Wouldn't it be possible for robots to generate email to the Webmaster if no robots.txt was found, offering an example robots.txt file and a pointer to relevant documentation? The robot might still start its indexing process, provided that the Webmaster has a way to undo the results. >It should be noted that there is a fairly strong case to be >made that a robot threshing through a non-published web site is >an illegal activity under the abuse of computing facilities >statute in U.S. law. Not so in the Netherlands, where entering a computer is only illegal if a lock (protection) was broken to gain access. -- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A> [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] From owner-robots Thu Jan 18 17:11:31 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08127; Thu, 18 Jan 96 17:11:31 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130506ad249da4e234@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 18 Jan 1996 17:12:29 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: robots.txt , authors of robots , webmasters .... Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Where did we get the idea that just because a thing is >accessible, that that gives us the moral right to access it, >perhaps against the interests of its owner? There's a difference between making something accessible with the intention of sharing it, as is the case in putting it on the Web without security, and allowing it to be accessible without the intention of sharing it. 
The moral argument is less clear when you dig a bit deeper into the publisher's intentions, which may not include support for automated access that would consume untoward resources. >In our enthusiasm to become the cartographers of this new >region of the information universe, do we not run the risk of >violating the privacy of the indigenous peoples we find there? The privacy argument is a difficult one to reconcile with the other watch-word of the Internet, freedom. We could talk at great length about that, but a robots list isn't the place, I think. >In other words, we should not index a web-site if there is no >robots.txt file to be retrieved that gives explicit permission >to do so. We thought and discussed this approach at some length when we got close to releasing the 1.0 version of our spider. Our pre-release version had basically no restrictions on it except that it wouldn't follow links from one server to another; it was designed to index just one site at a time. We even considered a scheme in which we'd look for robots.txt, and if it wasn't present, generate an e-mail to the webmaster, suggesting that one should be in place, with pointers to references. After X days, if we still didn't find a robots.txt, we'd consider silence to be consent to index anything the robot finds. However, clearer heads prevailed, I think, and we left things as they were. The fundamental reason that we scrapped the idea was that it was just too complex. Too many things could go wrong, it added a lot of administrative overhead, etc. Let's remember that the marketplace usually eventually solves these problems. Robot defenses can and will be built. In fact, we discovered early on that inet-d is a pretty good defense, since it limits the number of connections. Our first design of the robot was based on the typical limits of inet-d. I suspect that robot designers time would be better spent on reaching consensus on distributed systems that will make the whole wretched mess more efficient by combining pull and push methods of building indexes. There is going to be a marketplace for the meta-information that robots are generating. The sooner that robot developers agree on standards along the lines of Harvest (but simpler, perhaps), the sooner that trade in meta-information can begin to mature... and the less likely that one big player will set the standards by sheer size. For example, what if Microsoft announced a robot standard tomorrow...? Nick From owner-robots Thu Jan 18 18:59:12 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11507; Thu, 18 Jan 96 18:59:12 -0800 Date: 18 Jan 96 16:04:51 EST From: John Lammers <JLAMMERS@CSI.compuserve.com> To: Robots List <ROBOTS@webcrawler.com> Subject: re: privacy, courtesy, protection Message-Id: <CSI_6188-3773@CompuServe.COM> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Robert Raisch asks: >>Do any others feel as I do that control over use of my >>information is my responsibility and mine alone? Yes...so exercise that control and restrict access to the material that you don't want read. Control over the security of your site and your data is your responsibility and yours alone. I'm not saying that robots should TRY to invade your privacy, but your comparison of your web site and your house is a little off, I think. >>If I leave my house unlocked, I have given my permission for any and >>all to come in and read my personal papers. Does this strike anyone >>else as somewhat absurd? 
I think it's more analogous to leaving your office unlocked in a office building accessible to the public. You don't expect someone to sit down and read all your stuff, but then again, you don't necessarily expect that no one will notice what you have lying around your office. Robots are accustomed to going where they can. The robots.txt file is as much or more for the robot's benefit as the site's. I tend to agree with an earlier contributor that many sites don't have a robots.txt, have no need for one, and can't be expected to have one. If all these sites are excluded from indexes.... Besides, if you're relying on robots faithfully abiding by whatever you have in robots.txt for your security scheme, you're only keeping out the robots (and human browsers) who don't want your private data. Anyone that wants it can get it, if you don't protect it. I'm not advocating that, I'm just saying that's the case. Like it or not, putting info on the Web is publishing. Lack of advertising doesn't mean something hasn't been published. The failure of a chapter to appear in the table of contents doesn't mean it's not in the book. Again, I don't WANT your privacy invaded, but if you put your stuff in a public place and don't restrict access to it.... -- John Lammers From owner-robots Thu Jan 18 19:02:37 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11559; Thu, 18 Jan 96 19:02:37 -0800 Message-Id: <199601190302.VAA06301@sam.neosoft.com> X-Mailer: Post Road Mailer (Green Edition Ver 1.03a) To: robots@webcrawler.com From: Edward Stangler <mred@neosoft.com> Date: Thu, 18 Jan 1996 20:59:44 CST Subject: Re: Alta Vista searches WHAT?!? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ** Reply to note from Erik Selberg <speed@cs.washington.edu> 01/17/96 10:47am PST > Now, the /robots.txt won't do a bit of good here. Why? Because (a) > robots don't have to support the robots.txt file, and (b) because the > goal is to keep said data _private_ from everyone, not just > robots. The problem is that users feel that hiding data is a good > solution to security. Robots just publicly announce that security of > that form is bogus. The issue people have with robots I think is > bogus; what they should be addressing is that there needs to be a > better form of protection on the Web, or at least a more intuitive > method of setting access control lists than the funky .htaccess file > stuff (or at least a better UI!). What if you're using ROBOTS.TXT to exclude CGI's which don't appear in /cgi-bin? What if the CGI's--or any data types unknown to the robot--are indistinguishable from directory pathnames or acceptable data types except if (a) it is excluded with something like ROBOTS.TXT or (b) the robot spends considerable time and resources to analyze it? -Ed- mred@neosoft.com http://www.neosoft.com/~mred 1:106/1076 - 30:30/0 - 85:842/105 74620,2333 From owner-robots Fri Jan 19 02:33:11 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13465; Fri, 19 Jan 96 02:33:11 -0800 Date: Fri, 19 Jan 1996 10:32 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <0099C9E1F38A92E0.992E@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Server name in /robots.txt X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Looking up your site in the indexes is indeed educational... 
I have found the same pages appearing under multiple domain names - the canonical DNS name, various CNAME equivalents and the raw IP address *despite* having a <BASE HREF="http://xxx.xxx.xxx.xxx/xxx.html"> giving a 'preferred URL' in the header. Obviously indexers don't (or some indexer don't) recognise this and just build on incorrect, but currently working, links from other pages. Would it be an option to include a the preferred site name in the /robots.txt file? Couldn't enforce anything of course but would act as a reminder to the robots. Regards, Martin Kiff mgk@newton.npl.co.uk From owner-robots Fri Jan 19 05:12:10 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14219; Fri, 19 Jan 96 05:12:10 -0800 Date: Fri, 19 Jan 1996 08:11:55 -0500 From: AJAJR@aol.com Message-Id: <960119081146_201004653@mail04.mail.aol.com> To: robots@webcrawler.com Subject: Polite Request #2 to be Removed form List Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Sir: Thank you very much for including me on this list. At this time I would like to politely request for the second time that my name now be removed. Thank you in advance for your kind consideration which is much appreciated. From owner-robots Fri Jan 19 10:00:15 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15378; Fri, 19 Jan 96 10:00:15 -0800 Message-Id: <m0tdL6g-0003DtC@giant.mindlink.net> Date: Fri, 19 Jan 96 10:00 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Server name in /robots.txt Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I have found >the same pages appearing under multiple domain names - the canonical DNS >name, various CNAME equivalents and the raw IP address *despite* having a > > <BASE HREF="http://xxx.xxx.xxx.xxx/xxx.html"> > >giving a 'preferred URL' in the header. Obviously indexers don't >(or some indexer don't) recognise this and just build on incorrect, >but currently working, links from other pages. Yes, well, reading a variety of specs carefully makes it clear that HTML does *not* at the current time provide a mechanism for specifying the "canonical name" of the current page. Having noticed this [several tens of thousands of times] during the construction of the Open Text Index, I tried rattling cages over in the HTML Working Group, and discovered a complete lack of consensus; some people feel that this is an appropriate use of <BASE>, as did I; others, including people who *really* know HTML, think <META HTTP-EQUIV="URI" CONTENT="http://xxx.xxx.xxx/xxx.html"> is more appropriate. I tried to get them to make up their minds, but couldn't generate sufficient interest. I don't care [nor would any other robot flogger, I think] which mechanism is used, as long as one is available. This is a sub-issue of the larger [and unfortunately largely ignored] issue of WWW metadata. The canonical-name issue could be settled, I suppose, if we and Infoseek and Lycos and Excite got together and said "do it THIS way". I think <META> is probably the way to go, since <BASE> is overloaded for other functions. Anyone who's written a serious robot knows that the aliasing available in the IP/DNS mechanisms and the Unix filesystems, plus the habit of mirroring good stuff, makes automatic duplicate detection [at the least] very difficult. 
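A rough Perl sketch of how a robot could pick up either of the two canonical-name hints mentioned above; the pattern matching is deliberately crude (it assumes the attribute order shown and double quotes), and the variable names are only illustrative:

# $html holds the page source, $fetched_url the URL it came from.
my $canonical;
if ($html =~ /<META\s+HTTP-EQUIV\s*=\s*"URI"\s+CONTENT\s*=\s*"([^"]+)"/i) {
    $canonical = $1;               # explicit <META HTTP-EQUIV="URI" ...>
} elsif ($html =~ /<BASE\s+HREF\s*=\s*"([^"]+)"/i) {
    $canonical = $1;               # fall back on <BASE HREF="...">
} else {
    $canonical = $fetched_url;     # no hint in the page at all
}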
Checksums help, but not with volatile pages. Making things more difficult is the fact that there are lots of people that *like* having their pages show up multiple times. Sigh, Tim Bray, Open Text From owner-robots Fri Jan 19 10:21:33 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15510; Fri, 19 Jan 96 10:21:33 -0800 Message-Id: <9601191821.AA23577@grasshopper.ucsd.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Date: Fri, 19 Jan 96 10:21:09 -0800 To: robots@webcrawler.com Subject: Re: Server name in /robots.txt Cc: penrose@grasshopper.ucsd.edu References: <m0tdL6g-0003DtC@giant.mindlink.net> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com With my agent, I have dealt with the lack of a canonical form fairly well, though Tim is right: there is no perfect solution. My agent searches the url database for each alias attributed to the found url. I think, though, that alias checking works quite well. For http sites I have chosen to first use the name in the found url if that name contains the text "www"; if it does not, I then choose an alias which does. Of course this fails to select www.netscape.com if the reference was www.mcom.com, but I think that it is a reasonable solution given that it is very difficult to infer a preference between two such domain names without direct specification. A good solution would be the inclusion of a new optional html tag: <host="www.anchovies.com"> Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html From owner-robots Fri Jan 19 11:30:21 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15862; Fri, 19 Jan 96 11:30:21 -0800 Date: Fri, 19 Jan 1996 14:30:18 -0500 From: AJAJR@aol.com Message-Id: <960119143017_121464524@mail02.mail.aol.com> To: robots@webcrawler.com Subject: un-subcribe Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com un-subcribe From owner-robots Fri Jan 19 12:26:56 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16283; Fri, 19 Jan 96 12:26:56 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140801ad259fbe620e@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 19 Jan 1996 12:28:03 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Server name in /robots.txt Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:00 AM 1/19/96, Tim Bray wrote: > some people feel that this is an appropriate >use of <BASE>, as did I; others, including people who *really* know HTML, >think <META HTTP-EQUIV="URI" CONTENT="http://xxx.xxx.xxx/xxx.html"> is more >appropriate. I tried to get them to make up their minds, but couldn't >generate sufficient interest. My pet-peeve: you _don't_ want HTTP-EQUIV: why on earth would you want it in an HTTP header? Same goes for all the Author stuff and what have you > I think <META> is probably the way to go,
All IMHO of course :-) -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Fri Jan 19 12:47:52 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16428; Fri, 19 Jan 96 12:47:52 -0800 Message-Id: <199601192047.MAA03794@wally.cs.washington.edu> In-Reply-To: Edward Stangler's message of Thu, 18 Jan 1996 20:59:44 CST To: robots@webcrawler.com Subject: Re: Alta Vista searches WHAT?!? References: <199601190302.VAA06301@sam.neosoft.com> Date: Fri, 19 Jan 1996 12:47:48 PST From: Erik Selberg <speed@cs.washington.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Edward Stangler writes: >> Now, the /robots.txt won't do a bit of good here. Why? Because (a) >> robots don't have to support the robots.txt file, and (b) because the >> goal is to keep said data _private_ from everyone, not just >> robots. The problem is that users feel that hiding data is a good >> solution to security. Robots just publicly announce that security of >> that form is bogus. The issue people have with robots I think is >> bogus; what they should be addressing is that there needs to be a >> better form of protection on the Web, or at least a more intuitive >> method of setting access control lists than the funky .htaccess file >> stuff (or at least a better UI!). > What if you're using ROBOTS.TXT to exclude CGI's which don't appear in /cgi-bin? > What if the CGI's--or any data types unknown to the robot--are indistinguishable from > directory pathnames or acceptable data types except if (a) it is excluded with > something like ROBOTS.TXT or (b) the robot spends considerable time and resources to > analyze it? I'm not arguing here about the usefulness of robots.txt to exclude things that shouldn't be searched, i.e. to tell the robot of useless things, etc. For example, I recently discovered that many robots, like Lycos and Inktomi, have played many a game of "Hunt the Wumpus" because the links in the game are URLs to cgi-bin scripts. However, robots.txt should not be something used to enforce security. From owner-robots Fri Jan 19 15:13:19 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17394; Fri, 19 Jan 96 15:13:19 -0800 Message-Id: <01BAE717.8EBE2500@pluto.planets.com.au> From: David Eagles <eaglesd@planets.com.au> To: "'robots@webcrawler.com'" <robots@webcrawler.com> Subject: RE: Server name in /robots.txt Date: Sat, 20 Jan 1996 09:13:25 +-1100 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BAE717.8ECEEDE0" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > A good solution would be the inclusion of a new optional html tag: > <host="www.anchovies.com"> True, but if we're in the process of adding new tags, etc, wouldn't it be better to add an HTTP response field generated by the server as in the URL line below: HTTP/9.5 200 Document follows Date: ... Server: ... URL: http://www.anchovies.com/pizza Since the server has absolute knowledge of what its hostname is (hopefully :-), this would even solve the www.mcom.com and www.netscape.com problem.
This can also resolve the other problems we all have trying to guess whether the URL we requested was actually a file or if it was a directory and the server automatically supplied the default (index.html) for it. Regards, David From owner-robots Fri Jan 19 15:27:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17562; Fri, 19 Jan 96 15:27:29 -0800 Message-Id: <m0tdQDD-0003E3C@giant.mindlink.net> Date: Fri, 19 Jan 96 15:27 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: RE: Server name in /robots.txt Cc: "'robots@webcrawler.com'" <robots@webcrawler.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 09:13 AM 1/20/96 +-1100, David Eagles wrote: > >> A good solution would be the inclusion of a new optional html tag: >> <host="www.anchovies.com"> > >True, but if we're in the process of adding new tags, etc, wouldn't it be >better to
add an HTTP response field generated by the server as in Nope. This can't be done automatically. I may not want to tell the world that the preferred name for this happens to be the one that this particular server is operating under at this particular moment. All the server can know is what it's known as at the moment. We need more indirection. Also, I not only want to solve the hostname problem, I want to solve the /a/./b/./c/ and /a/x/../b/y/../c/ and hard-link and symlink file problems. To do this, a *human* (or a document management system) needs to assert what the canonical name for something is. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Fri Jan 19 15:52:14 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18824; Fri, 19 Jan 96 15:52:14 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199601192352.AAA02006@wsinis10.win.tue.nl> Subject: Re: Server name in /robots.txt To: robots@webcrawler.com Date: Sat, 20 Jan 1996 00:52:51 +0100 (MET) In-Reply-To: <v02140801ad259fbe620e@[199.221.45.139]> from "Martijn Koster" at Jan 19, 96 12:28:03 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 740 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Martijn Koster) write: >My pet-peeve: you _don't_ want HTTP-EQUIV: why on earth would you >want it in a HTTP header? Same goes for all the Auhtor stuff and what >have you It's better to limit the information to HTTP-served documents than to limit it to HTML documents. Am I wrong? >Not sure, but if you go META, promote NAME="URI", and keep it in HTML. Why? Ism't it possible to just have a pointer to a URC or somthing similar, and keep the metadata separate from the document itself as much as possible? I think that would make maintenance a lot easier. Mixing information *about* a document in with the HTML source *in* the document seems a bad idea to me. All IMHO of course :-) -- Reinier Post reinpost@win.tue.nl From owner-robots Sat Jan 20 07:21:27 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01974; Sat, 20 Jan 96 07:21:27 -0800 Date: Sat, 20 Jan 1996 09:21:16 -0600 From: blea@hic.net Message-Id: <199601201521.JAA01979@news.hic.net> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Subject: Re: Polite Request #2 to be Removed form List To: robots@webcrawler.com In-Reply-To: <960119081146_201004653@mail04.mail.aol.com> X-Mailer: SPRY Mail Version: 04.00.06.17 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Well, I'm not the administrator of the list but I do have the instructions for taking your self off the list. They were sent to me when I first subscribed and I saved em.... ( good Idea since they seem to be different for every mailing list which causes a lot of confusion). Hope this helps! ;) Bill > subscribe Subscription of: `/RFC-822=blea(a)hic.net/O=NEXOR/PRMD=NEXOR/ADMD= /C=GB/' to: `robots, Distribution Lists, NEXOR Ltd, GB' successful. 
[List owner has been notified of subscription] > help You may use the following commands: SUBSCRIBE // subscribe yourself to the list UNSUBSCRIBE // unsubscribe yourself from the list SHOW [attribute type] // shows list [or attribute values] MEMBERS // shows the membership of the list LIST [DN] // list the lists [beneath specified DN] LOCATION [DN] // returns [or sets] location in the DIT WHICH [O/R Name] // display which lists the O/R Name is on HELP // returns this message STOP // ignores any subsequent text In the above commands, if an optional O/R Name is not specified, the originator's address will be used instead. Similarly if an optional DN (Directory Name) is not specified, the current location in the DIT is used. Finally, lines beginning with // are treated as comments. > stop From owner-robots Sat Jan 20 16:19:42 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02482; Sat, 20 Jan 96 16:19:42 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130500ad27367c3a8c@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 20 Jan 1996 16:20:38 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Who sets standards (was Server name in /robots.txt) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The canonical-name issue could be settled, I suppose, if we and Infoseek and >Lycos and Excite got together and said "do it THIS way". I think ><META> is probably the way to go, since <BASE> is overloaded for other >functions. Not to be too parochial, but I think it's at least as important to get buy-in from companies who are shipping commercial spiders, since there are a lot more copies of them out there than there are of the services. A standard endorsed by the services doesn't matter to someone who's indexing his or her intranet. Nick From owner-robots Sat Jan 20 18:00:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08156; Sat, 20 Jan 96 18:00:25 -0800 From: "Mordechai T. Abzug" <mabzug1@gl.umbc.edu> Message-Id: <199601210200.VAA21280@umbc10.umbc.edu> Subject: Re: Server name in /robots.txt To: robots@webcrawler.com Date: Sat, 20 Jan 1996 21:00:16 -0500 (EST) In-Reply-To: <m0tdQDD-0003E3C@giant.mindlink.net> from "Tim Bray" at Jan 19, 96 03:27:00 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 2662 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "DE" == David Eagles spake thusly: DE> A good solution would be the inclusion of a new optional html tag: DE> <host="www.anchovies.com"> "SE" == Someone Else space thusly: SE> True, but if we're in the process of adding new tags, etc, wouldn't it be >better to add an HTTP response field generated by the server as in "TB" == Tim Bray spake thusly: TB> Nope. This can't be done automatically. I may not want to tell the TB> world that the preferred name for this happens to be the one that TB> this particular server is operating under at this particular moment. TB> All the server can know is what it's known as at the moment. We need TB> more indirection. TB> TB> Also, I not only want to solve the hostname problem, I want to solve TB> the /a/./b/./c/ and /a/x/../b/y/../c/ and hard-link and symlink file TB> problems. To do this, a *human* (or a document management system) TB> needs to assert what the canonical name for something is. We've got 5% of *webmasters* using /robots.txt. 
Does anyone think we'll be able to convince more than a minute fraction of *HTML writers* to conform to some obscure standard? Remember, you don't need to know anything to become an HTML author. And don't forget the legacy problem: there already are millions of documents in existence. Would *you* like to go ahead and modify each of the docs you've written to conform to any new standard? Proposing a new standard for *servers* is even worse. If I've got a server up, running, and configured the way I like, you'll need a shotgun to convince me to go through the hassle of downloading, compiling, testing, and configuring some new version unless it comes with *major* benefits. If this really is a problem for your robot, I'd suggest that you solve it yourself. One suggestion: use some sort of document comparison algorithm. If you only wish to avoid perfect duplication (ie. symlinks, hard links, etc.) I'd suggest using MD5 (don't use checksums; too unreliable for something the size of the web) to generate digests, and use the digests as the keys for an associative array, with the URL as value. Every time you download a document, take its digest and make sure you don't already have that digest. Note that for a corpus of *this* size, even MD5 might not be perfectly reliable (ie. as in 29 people for 50/50 chance of same birthday) so once you have the first match, you might want to use some longer comparison (perhaps download the original document again?) to confirm that they really are the same thing. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Set laser printers to "stun". From owner-robots Sat Jan 20 20:07:50 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14435; Sat, 20 Jan 96 20:07:50 -0800 Message-Id: <m0tdr3x-0003E7C@giant.mindlink.net> Date: Sat, 20 Jan 96 20:07 PST X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Tim Bray <tbray@opentext.com> Subject: Re: Who sets standards (was Server name in /robots.txt) Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Not to be too parochial, but I think it's at least as important to get >buy-in from companies who are shipping commercial spiders, since there are >a lot more copies of them out there than there are of the services. A >standard endorsed by the services doesn't matter to someone who's indexing >his or her intranet. Reasonable point. It's just that the big-league index vendors at the moment have the mind-share and the visibility to get the world's attention on this issue. I wouldn't care where a standard came from myself. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Mon Jan 22 02:22:53 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16426; Mon, 22 Jan 96 02:22:53 -0800 Date: Mon, 22 Jan 1996 09:16 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <0099CC32C0158FC0.EFF2@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Re: Server name in /robots.txt X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Tim: > The Canonical-name issue could be settled, I suppose, if we and Infoseek > and Lycos and Excite got together and said 'do it THIS way', I think > <META> is probably the way to go, since <BASE> is overloaded for > other functions. 
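A short Perl sketch of the digest scheme Mordechai Abzug suggests above: an associative array keyed on the MD5 digest of each page, with the URL as the value. The Digest::MD5 module is one place to get md5_hex(); the surrounding names are illustrative only.

use Digest::MD5 qw(md5_hex);

my %seen;                          # digest => URL of the first copy seen
sub note_page {
    my ($url, $content) = @_;
    my $digest = md5_hex($content);
    if (exists $seen{$digest}) {
        # Probable duplicate; as suggested, re-compare the full text
        # before trusting the digest match.
        return $seen{$digest};
    }
    $seen{$digest} = $url;
    return undef;                  # not seen before
}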
Sounds fine to me - certainly having sensible 'we do it this way' information beats the current situation, that of there not being any advice on the indexers' home pages (none that I could find, that is) I would say that if people have gone to the trouble of setting up HTTP-EQUIV="URI" in their pages then they are doing it for a reason and it should be the first choice (as it can cope with multiple entries can it not? [1]) Without a HTTP-EQUIV="URI" could the BASE information be used as a hint? Presumably neither the HTTP-EQUIV and BASE data should be taken un-checked... The page should be re-read from the URL's supplied and, if valid and pointing to the same information, those URL's indexed? Now in the absence of any HTTP-EQUIV or BASE data - i.e. the normal case - the indexer could fall back on a default server name in the /robots.txt file - solving a part, but at least a part, of the problem for the webweaver. Tim again: > Also .... I want to solve > the /a/./c/ and /a/x/../b/y/../c and hard-link and symlink file problems. It is up to the individual webweavers to decide whether they are going down the slippery route of links, they can decide to or not. I would say though that it's a 'natural solution' when faced with the need to move a page, it's only a while afterwards that they notice the chickens coming home to roost. I fell into that hole and I thought I had done what I could with <BASE HREF=....> Note though that webweavers often have no control of the CNAME's in the DNS and certainly no control of third parties picking up raw IP addresses (but from where?) and using those in their links. Using <BASE HREF...> can contain the damage if indexers make use of it even in its strict meaning. While I am typing, a hopefully trivial question. Who actually reads the HTTP-EQUIV? The documentation I've read doesn't say whether it should be understood by the httpd daemon and handed out in response to a HEAD request (I haven't noticed my, maybe elderly copies of the, CERN or NCSA daemons actually doing this) or should it be understood by the browser or robot as it GETs the page? Regards, Martin Kiff mgk@newton.npl.co.uk [1] Guesswork based on the libwww.pl library discarding 'multiple' URI's From owner-robots Mon Jan 22 05:58:17 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25266; Mon, 22 Jan 96 05:58:17 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140806ad2936a38b21@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 22 Jan 1996 05:59:21 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Server name in /robots.txt Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >While I am typing, a hopefully trivial question. Who actually reads the >HTTP-EQUIV? The documentation I've read doesn't say whether it should >be understood by the httpd daemon and handed out in response to a HEAD >request (I haven't noticed my, maybe elderly copies of the, CERN or >NCSA daemons actually doing this) The server is supposed to parse the document, and slam the value into an HTTP header. This is of course a waste of server CPU and bandwidth for the majority of cases, and opens a whole can of worms with the semantics of HTTP header namespace collisions. It isn't widely implemented. > or should it be understood by the >browser or robot as it GETs the page? 
This makes far more sense -- let user-agents decide what they want to do with the data; parsing it out of an HTTP stream is simple, and a browser could even drop the connection after reading the field if that's all it needed. So the idea is that you can do both HTTP-EQUIV=foo and NAME=bar in the same META tag. The last draft I saw on the subject had HTTP-EQUIV as the main thing, with NAME being optional. I think it makes far more sense to have NAME, and abolish HTTP-EQUIV, or at least make it a secondary choice. In fact it'd be good if robots started to promote this. I'd add it to WebCrawler if I wasn't buried in other work... -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html

From owner-robots Tue Jan 23 10:26:18 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18415; Tue, 23 Jan 96 10:26:18 -0800 Organization: CNR - Istituto Tecnologie Informatiche Multimediale Date: Tue, 23 Jan 1996 19:26:19 -0100 From: davide@jargo.itim.mi.cnr.it (Davide Musella (CNR)) Message-Id: <199601232026.TAA00561@jargo> To: robots@webcrawler.com Subject: HEAD request [was Re: Server name in /robots.txt] Cc: m.koster@webcrawler.com X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com

> The server is supposed to parse the document, and slam the value into > an HTTP header. This is of course a waste of server CPU and bandwidth > for the majority of cases, and opens a whole can of worms with the > semantics of HTTP header namespace collisions.

That isn't the only way to handle the META info: the WN server does it using a table, so it parses each document only once a day. OK, it isn't the best way, but there are many ways to solve it.

> This makes far more sense -- let user-agents decide what they want to do > with the data;

Yes, but if they can work with just the data contained in an HTTP header, why request the whole document... You can save about 90% of the retrieval time, and the load on the net will be a bit lower.

> So the idea is that you can do both HTTP-EQUIV=foo and NAME=bar in the > same META tag. The last draft I saw on the subject had HTTP-EQUIV > as the main thing, with NAME being optional. I think it makes far > more sense to have NAME, and abolish HTTP-EQUIV, or at least make > it a secondary choice. > In fact it'd be good if robots started to promote this. I'd add it > to WebCrawler if I wasn't buried in other work...

But if WebCrawler can index a doc by the content of the META NAME tag, it can also use the META HTTP-EQUIV tag: with a HEAD request it can get the indexing info without parsing the document, and be sure of having the best indexing info about that doc, 'cause the author has indexed it for you. I've made some alterations to that draft, to make it clearer and more exhaustive. You can find the draft below. Suggestions are welcome. Davide ----------- Davide Musella davide@jargo.itim.mi.cnr.it

INTERNET DRAFT Davide Musella draft-musella-html-metatag-02.txt National Research Council

The META Tag of HTML [...]

1. Introduction

At present the syntax of the META HTTP-EQUIV tag is not strict, which allows different keywords to be used to define the same things. Constructs like: <META HTTP-EQUIV = "author" CONTENT = "Pennac, Rossi"> or: <META HTTP-EQUIV = "writer" CONTENT = "Pennac, Rossi"> could represent the same concept with two different syntaxes. The aim of this Draft is to define which words to use to describe the contents of an HTML document.
There are also some simple rules to implement a Boolean logic (AND and OR) for the CONTENT field.

2. The META Tag

The META element is used within the HEAD element to embed document meta-information not defined by other HTML elements. Such information can be extracted by servers/clients for use in identifying, indexing and cataloging specialized document meta-information. Although it is generally preferable to use named elements that have well defined semantics for each type of meta-information, such as title, this element is provided for situations where strict SGML parsing is necessary and the local DTD is not extensible. In addition, HTTP servers can read the contents of the document head to generate response headers corresponding to any elements defining a value for the attribute HTTP-EQUIV. This provides document authors with a mechanism (not necessarily the preferred one) for identifying information that should be included in the response headers of an HTTP request. The META element has three attributes: - HTTP-EQUIV - NAME - CONTENT The HTTP-EQUIV and the NAME attributes are mutually exclusive.

3. HTTP-EQUIV.

This attribute binds the element to an HTTP response header. If the semantics of the HTTP response header named by this attribute is known, then the contents can be processed based on a well defined syntactic mapping, whether or not the DTD includes anything about it. HTTP header names are case insensitive. If absent, the NAME attribute should be used to identify the meta-information, and it should not be used within an HTTP response header. It is possible to use any text string, but if you want to define these properties you have to use the following words: keywords: to indicate the keywords of the document author: to indicate the author of the document timestamp: to indicate when the document was authored (HTTP-date format) expire: to indicate the expiry date of the document (HTTP-date format) language: to indicate the language of the document (using the ISO 3166 code or ISO 639 code) abstract: to indicate the abstract of the document organization: to indicate the organization of the author revision: to indicate the revision number of the document (format: 00, 01, 02, or 000, 001, ...) An HTTP server must process these tags for a HEAD HTTP request. Do not name an HTTP-EQUIV attribute the same as a response header that should typically only be generated by the HTTP server. Some inappropriate names are "Server", "Date", and "Last-Modified". Whether a name is inappropriate depends on the particular server implementation. It is recommended that servers ignore any META elements that specify HTTP equivalents (case insensitively) to their own reserved response headers.

4. NAME.

This attribute can be used to define properties such as "number of pages" or "preferred browser", or any info an author wants to insert in his document. The keywords indicated in the previous paragraph for HTTP-EQUIV are also valid in the NAME context. An example: <META NAME= "Maybe Published By" CONTENT = "McDraw Bill"> or <META NAME= "keywords" CONTENT = "manual, scouting"> Do not use the META element to define information that should be associated with an existing HTML element.

5. CONTENT

Used to supply a value for a named property. It can contain more than one piece of information; it is possible to use the Boolean operators (AND, OR) to give a Boolean definition of the field. The AND operator will be represented by the SPACE (ASCII[32]) and the OR operator by the COMMA (ASCII[44]).
The AND operator is processed before the OR operator. So a string like this: "Red ball, White pen" means :"(Red AND ball) OR (White AND pen)". Example: <META HTTP-EQUIV= "Keywords" CONTENT= "Italy Products, Italy Tourism"> The spaces between a comma and a word or vice versa are ignored. 6. Cataloguing an HTML document These 'keywords' were specifically conceived to catalogue HTML documents. This allows the software agents to index at best your own document. To do a preliminary indexing, it's important to use at least the HTTP-EQUIV meta-tag "keywords". From owner-robots Wed Jan 24 01:41:47 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22716; Wed, 24 Jan 96 01:41:47 -0800 Date: Wed, 24 Jan 1996 09:41 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <0099CDC88EF981C0.07A8@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Activity from 205.252.60.5[0-8] X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm getting scatter-like accesses from the sites: 205.252.60.5[0-8] who call themselves 'Merritt/1.0', don't look at /robots.txt and don't have 'referrer' details. (205.252.60.71 is mentioned in the 'List of Robots' but as an address which does read /robots.txt. Same subnet, hmmmm). Is there any way of tracing these addresses back to source as they are scanning areas where pages used to be and are no longer - and picking up a "302 redirect". If I know they are useful robots (and likely to make use of the 302 redirect) I'll carry on as I am. If they are useful and are going to ignore the 302 I'll feed back a "404 Page departed" or something. Is there any way of finding this sort of information out? How do robots (generally) handle 302's? and all sorts of other questions :-) Regards, Martin Kiff mgk@newton.npl.co.uk P.S. Sorry to pop up in the list and start a quick-fire list of questions, point me at references if this has been covered recently. Finding the <Meta ...> discussion/dissection highly useful, thanks.... From owner-robots Wed Jan 24 03:17:40 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23044; Wed, 24 Jan 96 03:17:40 -0800 From: mannina@crrm.univ-mrs.fr Message-Id: <950817131856.ZM9327@mitac> Date: Thu, 17 Aug 1995 13:18:40 +0200 In-Reply-To: m.koster@webcrawler.com (Martijn Koster) "Re: Server name in /robots.txt" (Jan 22, 5:59am) References: <v02140806ad2936a38b21@[199.221.45.139]> X-Mailer: Z-Mail 4.0 (4.0.0 Aug 21 1995) To: robots@webcrawler.com Subject: Re: Server name in /robots.txt Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com -- ***** ***** ***** * * * * * * * ** ** * * * * * * * * * * * * * * * ****** * * * * * * Centre de Recherche Restrospective de Marseille ///////////////////////////////////// // Mannina Bruno // // D.E.A Veille Technologique // // e.mail : // // mannina@crrm.univ-mrs.fr // ///////////////////////////////////// From owner-robots Fri Jan 26 07:48:54 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03027; Fri, 26 Jan 96 07:48:54 -0800 From: "Mark Norman" <mnorman@hposl41.cup.hp.com> Message-Id: <9601260750.ZM7639@hpisq3cl.cup.hp.com> Date: Fri, 26 Jan 1996 07:50:00 -0800 X-Mailer: Z-Mail (3.2.1 10apr95) To: robots@webcrawler.com Subject: test. please ignore. 
Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com test From owner-robots Fri Jan 26 09:23:51 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08591; Fri, 26 Jan 96 09:23:51 -0800 Message-Id: <199601261717.MAA16787@elephant.dev.prodigy.com> X-Sender: bonnie@192.203.241.111 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 26 Jan 1996 12:16:41 -0400 To: robots@webcrawler.com From: bonnie@wp.prodigy.com (Bonnie Scott) Subject: Any info on "E-mail America"? X-Mailer: <Windows Eudora Version 2.0.2> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com To all-- Does anyone know about "E-mail America" and if they obey the robot exclusion protocol? I found this on another mailing list: From: Justin_Kerr@calunet.com (Justin Kerr) Organization: CaluNET Online, Inc. Subject: Re: A new Alta Vista? To: online-news@marketplace.com > >Eric Meyer wrote: > >>Is another major search engine about to go on-line? We've suddenly >>been getting bursts of sequential hits on every file, including a >>bunch of mockup files that don't even have permissions set, much as >>happened right before Alta Vista's debut. They come from a variety of >>domains, mostly numeric. Anyone else experiencing this in recent days? Justin Kerr wrote: >Probably from those pig-dog digital marketers at "E-mail America," >which is currently trolling every tendril of the Internet, searching >for publicly available e-mail addresses and selling lists to firms >for the purposes of sending out wads of junk e-mail. I'm going to have to post a warning for our members if this is true. Bonnie Scott Prodigy Services Company From owner-robots Fri Jan 26 18:19:13 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09159; Fri, 26 Jan 96 18:19:13 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140800ad2dd08d44a8@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 26 Jan 1996 18:20:24 -0700 To: davide@jargo.itim.mi.cnr.it (Davide Musella (CNR)), robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: HEAD request [was Re: Server name in /robots.txt] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi all, Sorry for the delay, but to prevent this discussion going around in circles I wrote up all arguments and solutions I could think of in internet-draft form, and appended it to this message. Please have a look, it is directly relevant to this discussion, and I'd love comments... At 7:26 PM 1/23/96, Davide Musella (CNR wrote in reply to me: >> The server is supposed to parse the document, and slam the value into >> an HTTP header. This is of course a waste of server CPU and bandwidth >> for the majority of cases, and opens a whole can of worms with the >> semantics of HTTP header namespace collisions. > >It isn't the only way to handle the META info, The WN server does it using >a table, so they parse the document only once a day. >Ok, it isn't the best way, but there are many ways to resolve it. That is better, and my non-existant ideal server would do it only at document submission time. However, it is server-side parsing, which is unfortunate, and it still requires server changes for the majority of deployed servers. I think that expecting the entire server codebase to change for a few user-agents (say robots) is unrealistic. 
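One way to picture the client-side alternative: the agent issues an ordinary GET, harvests <META NAME=...> pairs as the document streams in, and simply stops reading at </HEAD> -- the "stop retrieving after reading the META tags" idea that comes up below. A minimal sketch, assuming a raw socket, a placeholder host, and none of the error handling a real robot needs:

use IO::Socket::INET;

sub head_meta {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(PeerAddr => $host,
                                     PeerPort => 80,
                                     Proto    => 'tcp') or return;
    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";

    my %meta;
    while (my $line = <$sock>) {
        # Collect NAME/CONTENT pairs as they stream past.
        while ($line =~ /<META\s+NAME="([^"]+)"\s+CONTENT="([^"]+)"/gi) {
            $meta{lc $1} = $2;
        }
        last if $line =~ m!</HEAD>!i;   # drop the rest of the document
    }
    close $sock;
    return %meta;
}

my %meta = head_meta('www.example.com', '/');   # placeholder URL
print "$_: $meta{$_}\n" for sort keys %meta;

Whether that byte saving matters compared with a plain full GET is exactly the trade-off being argued in this thread.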
>> This makes far more sense -- let user-agents decide what they want to do >> with the data; > >Yes, but if they can work only with the data content in an HTTP header, >why request the whole document... >You can save the 90% of retrieve time, and the load of the net will be a >bit lower. Similarly you can do a full GET, and stop retrieving after reading the META Tags in the HEAD. This would save almost as many bytes. The main question is though, will the agent/browser do just a HEAD and be satisfied with that, or will it do a full GET anyway? If it does a full get, as I believe it will, see below, than you're simply duplicating information and wasting bytes. >> So the idea is that you can do both HTTP-EQUIV=foo and NAME=bar in the >> same META tag. The last draft I saw on the subject had HTTP-EQUIV >> as the main thing, with NAME being optional. I think it makes far >> more sense to have NAME, and abolish HTTP-EQUIV, or at least make >> it a secondary choice. >> In fact it'd be good if robots started to promote this. I'd add it >> to WebCrawler if I wasn't buried in other work... > >But, if the webCrawler can index a doc by the content of the META NAME tag >it can also use the META HTTP-EQUIV tag so it can use an HEAD request >have the indexing info without parse the document and be sure to have >the best indexing info about that doc, 'cause the author has indexed it > for you. But WebCrawler also needs content: for a full-text index, to find new links, for analysis etc. So we'll continue GETting. But say we didn't need the whole content; if we'd do a HEAD, chances are the server doesn't support HTTP-EQUIV, and we have to follow with a full GET anyway. This means time, bandwidth, and the agents' resources are wasted. So in practice we'd do a GET first time, and parse it out of the HTML document. Have you actually got any indication that anyone on the client side wants HTTP-EQUIV, thinks its better than the alternatives, and wants to implement it? >I've made some alterations to that draft, to be clearer and more exhaustive. OK. I believe most of the comments I made below still hold. Regards, -- Martijn -------- Martijn Koster January 1995 HTTP-EQUIV Considered harmful Status of this Memo This document is a working document. The latest version may be found on <URL:http://info.webcrawler.com/mak/projects/meta/ equiv-harmful.html> This is a working document only, it should neither be cited nor quoted in any formal document. Distribution of this document is unlimited. Please send comments to the author. Abstract The use of the HTML META element with HTTP-EQUIV attribute for generalised meta data should be discouraged. Table of Contents 1. Introduction 1.1. The definition of the META element 1.2. The definition of names 1.3. Overview of objections 2. HTTP-EQUIV considered harmful 2.1. It is not backwards compatible 2.2. It prevents future additions 2.3. The inclusion of HTTP-specific instructions in HTML is counter to the protocol-independent nature of HTML. 2.4. It opens up name-space conflicts in HTTP. 2.5. In the common case this information is a unnecesarry duplication. 2.6. It requires server-side parsing. 2.7. It does not allow for rich meta data formats. 2.8. It does not allow for meta data content negotiation. 3. Alternatives for meta data in HTML 3.1. Using NAME 3.2. A META HTTP method 3.3. Using Accept headers for meta data 3.4. Using the <LINK> element 3.5. Using content-selection 4. Conclusion and Recommendations 5. Security Considerations 6. Author's Address 7. References 1. 
Introduction

This introduction explains the current specification of the META element, and the extension of the META element as proposed in [HTML]. Section 2 will further discuss these issues.

1.1 The definition of the META element

The <META> element is defined in RFC 1866 [HTML] as follows: The <META> element is an extensible container for use in identifying specialized document meta-information. Meta-information has two main functions: * to provide a means to discover that the data set exists and how it might be obtained or accessed; and * to document the content, quality, and features of a data set, indicating its fitness for use. Each <META> element specifies a name/value pair. If multiple META elements are provided with the same name, their combined contents-- concatenated as a comma-separated list--is the value associated with that name. ... HTTP servers may read the content of the document <HEAD> to generate header fields corresponding to any elements defining a value for the attribute HTTP-EQUIV. The set of names, and the syntax and semantics of the associated values, is not defined in RFC 1866, and this may partly be the reason that few, if any, WWW user-agents act on this information. Because of this lack of demand, few servers implement the HTTP header generation for <META> elements with an HTTP-EQUIV attribute.

1.2 The definition of names

Current work in progress [META] aims to address this issue by specifying semantics and some syntax for the following set of names: keywords, author, timestamp, expire, language, abstract, organization, revision. It furthermore modifies the definition in RFC 1866 by specifying that: The HTTP-EQUIV and the NAME attributes are mutually exclusive. and that the CONTENT attribute value has to conform to a specific syntax: The SPACE (ASCII[32]) character specifies boolean AND, and the COMMA (ASCII[44]) specifies boolean OR. Finally it encourages the use of the "keywords" name/value pair to aid document cataloguing.

1.3 Overview of objections

This concept of the <META> element with the HTTP-EQUIV attribute has several drawbacks: - it is not backwards compatible - it prevents future additions - the inclusion of HTTP-specific instructions in HTML is counter to the protocol-independent nature of HTML. - it opens up name-space conflicts in HTTP. - in the common case this information is an unnecessary duplication. - it requires server-side parsing. - it does not allow for rich meta data formats. - it does not allow for meta data content negotiation. These drawbacks are further explained in section 2. Alternatives are suggested in section 3. Recommendations are presented in section 4.

2. HTTP-EQUIV considered harmful

2.1 It is not backwards compatible

Disallowing the combined use of the HTTP-EQUIV and NAME attributes would make some previously conforming HTML documents non-conforming. This is undesirable. For example, the instance: <META HTTP-EQUIV="..." NAME="..."> is legal in RFC 1866, but does not conform to the proposed extension.

2.2 It prevents future additions

The imposition of a syntax and semantics on all CONTENT attribute values precludes the definition of future, conflicting value syntaxes. This severely reduces the extensibility of the <META> element. For example, the instances: <META NAME="cost" CONTENT="10 dollars"> <META NAME="bestquote" CONTENT="Et tu, Brute"> would have the semantics of "10 AND dollars", and "Et AND tu OR Brute".
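To make the objection concrete, the comma/space rule of [META] amounts to roughly this parse (an editorial sketch, not text from either draft):

sub parse_content {
    my ($content) = @_;
    # COMMA (ASCII 44) is OR, SPACE (ASCII 32) is AND; AND binds tighter.
    my @or_groups = split /\s*,\s*/, $content;
    return map { [ split ' ', $_ ] } @or_groups;   # list of AND-term groups
}

# parse_content("Red ball, White pen") => ([Red, ball], [White, pen])
# parse_content("Et tu, Brute")        => ([Et, tu], [Brute])  -- the quotation is mangled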
It should also be noted that this constraint is in conflict with the proposed extension itself, in that it prescribes HTTP-conforming values for HTTP-EQUIV attributes named after HTTP headers, which do not use the AND/OR logic.

2.3 The inclusion of HTTP-specific instructions in HTML is counter to the protocol-independent nature of HTML.

The WWW's success is in no small part due to the protocol-independent nature of HTML, allowing it to be served from FTP servers, Gopher servers, and directly from file systems. Similarly, the content-independent nature of HTTP has advantages. The inclusion of HTTP-specific instructions goes counter to this clean separation, and this negatively affects both the meta information and the HTML document. If a browser only supports META HTTP-EQUIV it will not be able to act on this information when served via a protocol other than HTTP; so the meta data goes to waste, and the space is wasted in the HTML.

2.4 It opens up name-space conflicts in HTTP.

There is a possible conflict between HTTP-EQUIV attribute values and HTTP header values, as the META and HTTP definitions of syntax and semantics may differ. This complicates the future extension of both the META and HTTP work. More importantly, it means that user ignorance can easily result in inadvertently non-conforming HTTP protocol: if a user chooses an HTTP-EQUIV header which is defined in HTTP but doesn't use the correct syntax and semantics, the server ends up sending bad protocol unless it specifically checks the syntax. Even if the syntax is correct there may be semantic problems, which may confuse, or might be used for spoofing. There is already some unclarity in the values proposed now. For example, the relationship, if any, between HTTP-EQUIV="expire" and the HTTP header "Expires", or HTTP-EQUIV="Timestamp" and the HTTP header "Last-modified" etc. is unclear.

2.5 In the common case this information is an unnecessary duplication.

The most common HTTP methods (GET and POST) result in the HTML content being transmitted. In this case, the information specified in the HTTP-EQUIV is sent twice: once as an HTTP header, and once in the document content. This is an unnecessary waste of bandwidth. Limiting the generation of HTTP headers for HTTP-EQUIV attributes to HEAD requests would alleviate this duplication, but this may mean the construct is used too little to make it worthwhile to standardise and implement. The use of meta data is especially important for the special category of User-agents known as robots [ROBOT], and they could conceivably be modified to do HEAD requests. However, this is unlikely to happen as robots generally need the entire content: they need to parse content to find new URLs, they often use full-text indexing technology which works best on complete content, and they may wish to do further analysis on the content to assess desirability or statistical properties.

2.6 It requires server-side parsing.

HTTP servers need to parse the HTML document in order to generate headers for HTTP-EQUIV attributes. This is undesirable for a number of reasons: - implementing even a partial HTML parser correctly is considerable effort. - it means servers may need to be modified as the HTML standard develops. - the parsing consumes additional CPU and memory resources. The client is the one using and applying the META data; it should be given most of the flexibility and burden.

2.7 It does not allow for rich meta data formats.

The data transmitted in the HTTP header has to conform to strict syntax rules.
At the very least they may not contain a CR-LF followed by a non-space character, nor two consecutive CR-LF pairs. The proposal provides no encoding mechanism, so these restrictions must also be present in the CONTENT provided with an HTTP-EQUIV attribute. In addition the values are restricted by the DTD. This limits the power of expression of the meta data.

2.8 It does not allow for meta data content negotiation.

The CONTENT values of HTTP-EQUIV attributes cannot be negotiated. This means one cannot specify a preference to receive meta data in HTML, URC, or IAFA format. It also means the language of the meta data cannot be selected. This limits the power of expression of the meta data.

3. Alternatives for meta data in HTML

This section aims to show that alternative solutions exist that do not share the same, or as many, problems. It is not meant to be a complete overview of alternatives, nor a complete analysis of each alternative, let alone a full solution specification.

3.1 Using NAME

Rather than concentrating on the HTTP-EQUIV attribute as the main use of the <META> tag, one can concentrate on using the NAME attribute. This would remove all HTTP-related problems, and turn it into a meta data construct that is restrictive, but at least simple and safe. It also requires no server-side modification. This would promote the use of NAME as a way of associating general meta data, and leave HTTP-EQUIV as a separate issue. HTTP-EQUIV could either be removed altogether, or restricted to standardised HTTP headers. This would require a rewording of the relevant section in RFC 1866, deprecating the use of HTTP-EQUIV for non-HTTP meta data. For backwards compatibility HTTP-EQUIV would be allowed, and User-agents would be encouraged to substitute NAME for HTTP-EQUIV where the NAME is missing or the HTTP-EQUIV value specifies a non-HTTP attribute. This still has the problem of severely limiting the power of expression of the meta data.

3.2 A META HTTP method

A new-to-be-defined HTTP method META could be used to request meta data associated with a URL. This would return separate content, and behave like a normal GET. This would give the meta data complete expressive power, as any kind of content can be returned. In fact, content can be negotiated, for format, language, compression etc. This would also work for non-HTML documents. This then requires a meta data content type to be defined, but this could be as simple as text/plain, or made as complex as desired. This still has the problem of requiring server modification to link a URL with its meta data. At least the modification is restricted to a new method, and no changes or pre-parsing are required. It would also limit the accessibility of the meta data to the use of the HTTP protocol.

3.3 Using Accept headers for meta data

Rather than using a new method as suggested in 3.2, Accept headers can be used in conjunction with a standard GET to request the meta data. All the advantages of 3.2 are inherited, but this requires no server modification beyond standard content negotiation. This would still limit the accessibility of the meta data to the use of the HTTP protocol, but as it is an HTTP construct for an HTTP solution this is acceptable.

3.4 Using the <LINK> element

A different proposal [RELREV] seeks to standardise values of the HTML <LINK> element, which expresses relations between documents. One such relation is "META", indicating that one document contains meta data for another.
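(For illustration: with such a relation a page might carry something like <LINK REL="META" HREF="http://host/doc.meta"> in its HEAD -- the exact attribute values are whatever [RELREV] settles on -- and a client could pull the reference out with a single pattern match along these lines:)

sub meta_link {
    my ($html) = @_;    # sketch only; a real client would use an HTML parser
    return $1 if $html =~ /<LINK\s+REL="META"\s+HREF="([^"]+)"/i;
    return undef;
}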
This construct would again alleviate all the HTTP problems, and allow complete expressive power for meta data. This does not require modifications to servers, nor does it rely on a correct implementation of content negotiation, and it is applicable to protocols other than HTTP. This doesn't provide a solution for non-HTML data, but as it is an HTML construct this is acceptable. A perceived disadvantage is that it means the entire document needs to be transmitted to find the META data, but this is not true; a client can stop receiving the transmission as soon as the relevant LINK element has been found. A disadvantage is that a second request is required. I believe this is not a severe problem, as retrieving the meta data is not the common browsing case, and it is offset by the advantages.

3.5 Using content-selection

One could see meta data specification as a specific instance of content-selection, and provide a PICS [PICS] based solution. This would limit the expressive power of the meta data, and lock it into a whole new set of problems. This may be an area for research, but appears to be an unsatisfactory solution.

4. Conclusion and Recommendations

We have seen that the use of HTTP-EQUIV has several drawbacks, ranging from being limited as meta data to potential conflicts with HTTP, a separate protocol. We have also seen that several alternatives exist which do not share these drawbacks. My recommendation is: - to deprecate the use of <META> HTTP-EQUIV for general meta data, and to specify that it only be used, with extreme caution, for existing HTTP header values. This could be a cheap way of providing configuration information to a server, or directly to a client if some protocol does not support the construct. - to specify that <META> NAME can be used for general meta data, but not promote its use beyond the provision of free-text keywords in the language of the document. - to urge that future specifications of values for NAME specify a clear semantics and syntax for CONTENT on a per-NAME basis. - to promote the use of the <LINK> tag to specify meta information. - to define a simple content-type for meta data, such as the IAFA templates, and encourage research on more advanced formats - to see what comes out of URC research, and how this fits with the above.

5. Security Considerations

There are several security implications in the use of meta data. On an abstract level it is always difficult to guarantee that the meta data is applicable to the document. It is therefore important to present the user with easy means to find out how and where meta data was obtained. The use of HTTP-EQUIV opens up potential problems if data in the HTML document is passed into protocol header fields unchecked. Syntactically incorrect values may result in invalid protocol being transmitted. Syntactically correct values may give an opportunity for spoofing of certain fields. This document argues against the use of HTTP-EQUIV.

6. Author's Address

Martijn Koster, Software engineer at the WebCrawler group of America OnLine. Email address: m.koster@webcrawler.com See <URL:http://info.webcrawler.com/mak/mak.html> for more information

7. References

[HTML] RFC 1866, Hypertext Markup Language - 2.0, T. Berners-Lee & D. Connolly, November 1995. <URL:ftp://ds.internic.net/rfcs rfc1866.txt> [META] The META Tag of HTML, Davide Musella, January 1996.
<URL:http://info.webcrawler.com/mailing-lists/robots/0324.html> <URL:ftp://ds.internic.net/internet-drafts/ draft-musella-html-metatag-02.txt> [ROBOT] World-Wide Web Robots, Wanderers and Spiders, Martijn Koster, <URL:http://info.webcrawler.com/mak/projects/robots/robots.html> [RELREV] Hypertext links in HTML, M. Maloney & L. Quin <URL:ftp://ds.internic.net/internet-drafts/ draft-ietf-html-relrev-00.txt> [HTTP] The HyperText Transfer Protocol <URL:http://www.w3.org/pub/WWW/Protocols/> [PICS] The Platform for Internet Content Selection <URL:http://www.w3.org/pub/WWW/PICS/> From owner-robots Sat Jan 27 09:49:48 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25401; Sat, 27 Jan 96 09:49:48 -0800 Date: Sat, 27 Jan 1996 12:48:08 -0500 From: "The YakkoWakko. Webmaster" <webmaster@yakkowakko.res.cmu.edu> Message-Id: <199601271748.MAA22079@yakkowakko.res.cmu.edu> To: robots@webcrawler.com Subject: www.pl? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com My site is being hit by a robot which sets the "User-Agent" field to www.pl, and made no check for /robots.txt. Does anyone know anything about this robot, especially about is responsible for it? From owner-robots Sun Jan 28 21:51:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19406; Sun, 28 Jan 96 21:51:25 -0800 Message-Id: <310C5F92.2C2E@randomc.com> Date: Mon, 29 Jan 1996 00:48:02 -0500 From: Mark Krell <markk@ra1.randomc.com> Organization: Equity International Webcenter X-Mailer: Mozilla 2.0b4 (WinNT; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: New URL's from Equity Int'll Webcenter X-Url: http://www.eiw.com/search.htm Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com url="http://www.eiw.com" name="Equity International Webcenter" string="Inside travel information about Atlanta, GA; Atlanta Business, Global Exchange, 3-D website, classifieds, Full service commercial website. Rated G." keywords="atlanta, georgia, GA, 3D, 3-D, business, classifieds, fulton, gwinnett, cobb, henry, clayton, business, commercial, equity, eiw, usa, world trade, family website" ----- url="http://www.eiw.com/colonel/" name="The Colonels' Atlanta Home Page" string="Insider information for travelers to Atlanta, including ways to save money while you're there. Also tips about American and southern English for visitors." keywords="atlanta, georgia, food, travel, discounts, coupons, american, english, language, restaurants, hotels, southern, saving, money, kentucky colonel, ky colonel" ----- url="http://www.eiw.com/hats/" name="Hats! Hats!! Hats!!! string="Anything you want a baseball hat to say..." keywords="hats, caps, baseball, custom, novelty, novelties" ----- url="http://www.eiw.com/equity/" name="Equity Software Co. - EQUISOFT Atlanta" string"Software Developers - Hardware Mnufacturers" keywords="ti, ti7000, emulators, cpu, texas instruments software, systems" --- url="http://www.eiw.com/wendell/" name="Patterson Realty, Franklin, NC" string="Your Smoky Mountain Vacation Hideaway property" keywords="nc, north carolina, franklin, smoky mountains, vacation land, recreational property" ----- New submissions... If we are using a bad format for them, please advise. Replies advising us of which machines we are listed on will avoid duplicate submissions from us. 
Thank you, Mark Krell, webmaster, eiw.com 205.160.16.171 From owner-robots Mon Jan 29 09:25:35 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21433; Mon, 29 Jan 96 09:25:35 -0800 Organization: CNR - Istituto Tecnologie Informatiche Multimediale Date: Mon, 29 Jan 1996 18:21:28 -0100 From: davide@jargo.itim.mi.cnr.it (Davide Musella (CNR)) Message-Id: <199601291921.SAA06997@jargo> To: robots@webcrawler.com Subject: Re: HEAD request [was Re: Server name in /robots.txt] Cc: m.koster@webcrawler.com X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Martijn Koster wrote: > That is better, and my non-existant ideal server would do it only at > document submission time. However, it is server-side parsing, which > is unfortunate, and it still requires server changes for the majority > of deployed servers. I think that expecting the entire server codebase to > change for a few user-agents (say robots) is unrealistic. Yeah, that's also my idea of the ideal server, but why unrealistic... robots action is the base of the internet, if we want to follow the internet growing, we must study new cataloguing techniques, 'cause the actual methods aren't sufficient. So little changing to make the net better aren't so unrealistic. Then also your suggestions need some changing of the servers. > >Yes, but if they can work only with the data content in an HTTP header, > >why request the whole document... > >You can save the 90% of retrieve time, and the load of the net will be a > >bit lower. > Similarly you can do a full GET, and stop retrieving after reading the > META Tags in the HEAD. This would save almost as many bytes. > The main question is though, will the agent/browser do just a HEAD and > be satisfied with that, or will it do a full GET anyway? If it does a full > get, as I believe it will, see below, than you're simply duplicating > information and wasting bytes. It's the price to pay. We are tring to lower the entropy of an http answer of a robot request, so somewhere the entropy must grow, we must decide where. Different methods move only the place in which it happens. The problem is: How much can we spend (in cpu,disk, memory & net load) to implement some new, essential, characteristic for the web cataloguing ?? > But WebCrawler also needs content: for a full-text index, to find new > links, for analysis etc. So we'll continue GETting. Yes, this is a problem.. the solution could be to make another table.. But I'm doing some statistics to calculate the amount of this and of others possible table. I'll post the results. > Have you actually got any indication that anyone on the client > side wants HTTP-EQUIV, thinks its better than the alternatives, > and wants to implement it? I've received many signals, but nobody told me anything about this method. So, nobody said me "I don't agree with the http-equiv method" (except you) but also I've not received any positive signal. >-- Martijn > > HTTP-EQUIV Considered harmful I'm glad to read this. I hope we can find the best way to resolve this problem. > 2. HTTP-EQUIV considered harmful > > 2.1 It is not backwards compatible > > Disallowing the combined use of the HTTP-EQUIV and NAME attribtues > would make some previously conforming HTML documents non-conforming. > This is undesirable. > > For exaple, the instance: > > <META HTTP-EQUIV="..." NAME="..."> > > is legal in RFC 1866, but does not conform to the proposed extension. Yes, it's true... 
I had not thought about this problem, but is it a big problem? The idea is that this Meta info MUST be used, and that the html-author, normally is not a computer-oriented :) person, so to make the utilization of this tag easier I've done a linear syntax. It could add some redundancy in the code, nothing more. If you think there are too many documents with the meta tag inside (?) and with both the http-equiv and name attributes in the same tag (?), I'll have any problem to cut off that sentence . > 2.2 It prevents future additions > > The imposition of a syntax and semantics on all CONTENT attribute > values precludes the definiton of future conflicting values syntaxes. > This severly reduces the extendibility of the <META> element. > > For example, the instances: > > <META NAME="cost" CONTENT="10 dollars"> > <META NAME="bestquote" CONTENT="Et tu, Brute"> > > would have the semantics of "10 AND dollars", and "Et AND tu OR Brute". > > It should also be noted that this constraint is in conflict with > the proposed extension itself, in that it prescribed HTTP > conforming values for HTTP-EQUIV attributes named after HTTP headers, > which donot use the AND/OR logic. You know I've only formalized the CONTENT use. I haven't changed anything. If you found CONTENT="Et tu, Brute" you understood "Et AND tu OR Brute" also before my draft. In fact you suggested me to change that part in: "Keyword phrases are separated by commas?" but I thought it wasn't more clear than my definition, so I didn't. > 2.3 The inclusion of HTTP-specific instructions in HTML is counter to > the protocol-independent nature of HTML. > > The inclusion of HTTP-specific instructions goes counter to this > clean separation, and this negatively affects both the meta > information and the HTML document. If a browser only supports META > HTTP-EQUIV it will not be able to act on this information when served > via a protocol other than HTTP; so the meta data goes to waste, and > the space is wasted in the HTML. This was also before my Draft. > 2.4 It opens up name-space conflicts in HTTP. > > There is a possible conflict between HTTP-EQUIV attribute values and > HTTP header values, as the META and HTTP definitions of syntax and > semantics may differ. This complicates the future extension of both > the META and HTTP work. > Even is the syntax is correct there may be semantic problems, > which may confuse, or might be used for spoofing. but I've written: Do not name an HTTP-EQUIV attribute the same as a response header that should typically only be generated by the HTTP server. Some inappropriate names are "Server", "Date", and "Last-Modified". Whether a name is inappropriate depends on the particular server implementation. It is recommended that servers ignore any META elements that specify HTTP equivalents (case insensitively) to their own reserved response headers. > There is even already some unclarity in the values proposed now. > For example, the relationship, if any, between HTTP-EQUIV="expire" > and the HTTP header "Expires", or HTTP-EQUIV="Timestamp" and the > HTTP header "Last-modified" etc. is unlear. A server can decide how long is the expire time but it is the same for all the docs in the web. The timetable of a school could have an expire time of an year and an internet-draft has an expire time of six months. How can you say to the server :"ehi my docs will expire in a year" rather than "in six months"? I think the difference between TIMESTAP and Last-modified is clear. we are speaking about documents, not file. 
The file is only the box that contains the documents, nothing more. > 2.5 In the common case this information is a unnecesarry duplication. > > The most common HTTP methods (GET and POST) result in the HTML > content being transmitted. In this case, the information specified > in the HTTP-EQUIV is sent twice: once as HTTP header, and once in > the document content. This is an unnecesarry waste of bandwidth. That info must be included only in the answer of an HEAD request, it is specified in the draft. > Limiting the generation of HTTP headers for HTTP-EQUIV attributes > to HEAD requests would alleviate this duplication, but this may > mean the contruct is used too little to make it worthwile to > standardise and implement. ops.. > > The use of meta data is especially important for the special > category of User-agents known as robots [ROBOT], and they could be > conceivably modified to do HEAD requests. However, this is unlikely > to happen as robots generally need the entire content: they need to > parse content to find new URL's, they often use full-text indexing > technology which works best on complete content, and they may wish > to do further analysis on the content to assess desirability or > statistical properties. Ok, but I'm talking only about the cataloguing, 'cause I guess the most among the robots want to index the web, if a robot wants the whole doc to do some analysis, probably it doesn't need the meta info neither. > 2.6 It requires server-side parsing. > > HTTP servers needs to parse the HTML document in order to generate > headers for HTTP-EQUIV attributes. This is undesirable for a number > of reasons: > > - implementing even a partial HTML parser correctly is considerable > effort. > - it means servers may need to be modified as the HTML standard > develops. > - the parsing consumes additional CPU and memory resources. > > The client is the one using and applying the META data; it should > be given most of the flexibility and burden. > > 2.7 It does not allow for rich meta data formats. > > The data transmitted in the HTTP header has to conform to strict > syntax rules. At the very least they may not contain a CR-LF > followed by a non-space or another two CR-LF pair. The proposal > provides no encoding mechanism, so these restrictions must be present > also in the CONTENT provided with an HTTP-EQUIV attribute. > This limits the power of expression of the meta data. Yes, but you have to think that this meta info are extracted by the author himself, to be sure that the info you receive are exact, you must be sure that the author has understood at best how to use the metatag. > 2.8 It does not allow for meta data content negotiation. > > The CONTENT values of HTTP-EQUIV attributes can not be negotiated. > This means one cannot specify a preference to receive meta data > in HTML, URC, or IAFA format. It also means the language of the > meta data cannot be selected. > > This limits the power of expression of the meta data. > > > 3. Alternatives for meta data in HTML > > This section aims to show that alternative solutions exist that > do not share the same, or as many, problems, > > It is not meant to be a complete overview of alternatives, > nor a complete analysis of each alternative, let alone > a full solution specification. > > 3.1 Using NAME [...] > 3.2 A META HTTP method [...] > 3.3 Using Accept headers for meta data [...] > 3.4 Using the <LINK> element [...] New method with new problems. 
How do we solve the problem of retrieving the external/internal links contained in an HTML doc?? And who must build the external meta-file?? Do you think an author can build it? bye Davide ------------ Davide Musella National Research Council, Milan, ITALY tel. +39.(0)2.70643271 e-mail: davide@jargo.itim.mi.cnr.it

From owner-robots Wed Jan 31 10:31:58 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02332; Wed, 31 Jan 96 10:31:58 -0800 Date: Wed, 31 Jan 1996 15:25:58 -0600 Message-Id: <199601312125.PAA11574@server1.argenet.com.ar> X-Sender: renato@mail.argenet.com.ar X-Mailer: Windows Eudora Version 1.4.3 Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable To: robots@webcrawler.com From: renato@argenet.com.ar (Renato Mario Rossello) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com

Hi everyone, I'm a new list user from Argentina and first of all I want to apologize for my poor English. I'm a data processing engineering (Ingeniería Informática in my country) student and, together with two other students, I'm developing a specific web robot to retrieve Spanish resources on the Web. This project is for an Artificial Intelligence (AI) course. We are using Perl 4 and libwww-perl on a Linux system. We have problems with the database engine. We tried some, but they work horribly and, before trying new engines, we would like to hear someone's advice. Thanks & Bye renato@argenet.com.ar (Renato Rossello)

From owner-robots Wed Jan 31 20:02:04 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17495; Wed, 31 Jan 96 20:02:04 -0800 Date: Thu, 1 Feb 1996 01:00:04 -0600 Message-Id: <199602010700.BAA03240@server1.argenet.com.ar> X-Sender: renato@mail.argenet.com.ar X-Mailer: Windows Eudora Version 1.4.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: renato@argenet.com.ar (Renato Mario Rossello) Subject: Requesting info on database engines Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com

Hi everyone, More specifically than in my previous mail, the Spanish robot which I'm developing will use a database engine to: - Manage a 30000-word dictionary for language recognition of resources. - Manage an index of the existing Spanish URLs in order to serve queries based upon keywords. - Manage a list of URLs to be visited later by an agent (wanderer). - Manage queries. So I need a database engine that supports: - Multiple indexes for each database. - Handling of more or less 100000 records in an acceptable way. Thanks for your help & Bye renato@argenet.com.ar (Renato Rossello)

From owner-robots Sat Feb 3 05:41:16 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22775; Sat, 3 Feb 96 05:41:16 -0800 Message-Id: <199602031341.HAA20477@mailhost.onramp.net> Comments: Authenticated sender is <rtglennr@mailhost.onramp.net> From: "Richard Glenner" <rtglennr@Onramp.NET> To: robots@webcrawler.com Date: Sat, 3 Feb 1996 07:42:38 +0000 Subject: News Clipper for newsgroups - Windows Priority: urgent X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com

Does anyone know if there is a News Clipper application for newsgroups that runs under Windows 95, or a helper application that accomplishes the same thing for Netscape?
Especially one that uses filter rules similar to Pegasus PMAIL that you can create a general set of filters that can be executed as a group. The last alternative is an example of how to write one in Visual Basic. I currently use Newshound, and SIFT and I am not happy with the accuracy. TIA Richard rtglennr@onramp.net From owner-robots Wed Feb 7 19:49:43 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06198; Wed, 7 Feb 96 19:49:43 -0800 Message-Id: <31197182.2479@stratus.geog.mankato.msus.edu> Date: Wed, 07 Feb 1996 21:44:02 -0600 From: Charlie Brown <chuck@stratus.geog.mankato.msus.edu> Organization: Nowhere man! X-Mailer: Mozilla 2.0b5 (Win95; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Wanted: Web Robot code - C/Perl X-Url: http://info.webcrawler.com/mailing-lists/robots/index.html#23 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello everyone, You may consider this message bothersome, but in all the threads in the archive, I didn't find the answer I am looking for. I would like some code for implementing a robot in Perl (1st choice.) Please don't suggest I do something else with my time. I need a customizable robot, and if I can't find code, I will write it myself. I would appreciate any help you may offer. Thanks -- Charlie Brown - chuck@stratus.geog.mankato.msus.edu Software Designer @ Owatonna Tool Company, Owatonna, MN Part-time student at Mankato State University, Mankato, MN From owner-robots Thu Feb 8 05:04:41 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00761; Thu, 8 Feb 96 05:04:41 -0800 From: "Patrick 'Zapzap' Lin" <pat@popeye.cyberstation.fr> Message-Id: <199602081309.OAA00424@popeye.cyberstation.fr> Subject: Re: Wanted: Web Robot code - C/Perl To: robots@webcrawler.com Date: Thu, 8 Feb 1996 14:09:57 +0100 (MET) In-Reply-To: <31197182.2479@stratus.geog.mankato.msus.edu> from "Charlie Brown" at Feb 7, 96 09:44:02 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 133 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com hello, i am interested too to a code for a robot in perl. 
if you find something can you tell what or where please thank for all Pat From owner-robots Thu Feb 8 08:30:10 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10657; Thu, 8 Feb 96 08:30:10 -0800 Message-Id: <9602081627.AA08268@marys.smumn.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Kevin Hoogheem <khooghee@marys.smumn.edu> Date: Thu, 8 Feb 96 10:30:42 -0600 To: robots@webcrawler.com Subject: Re: Wanted: Web Robot code - C/Perl References: <31197182.2479@stratus.geog.mankato.msus.edu> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Get Libwww for Perl and check out the skeleton robot they have and then make one yourself ;)- From owner-robots Thu Feb 8 08:44:43 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11367; Thu, 8 Feb 96 08:44:43 -0800 Date: Thu, 8 Feb 96 08:43:59 -0800 From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Message-Id: <9602081643.AA23579@grasshopper.ucsd.edu> To: robots@webcrawler.com Subject: Perl Spiders Cc: penrose@grasshopper.ucsd.edu Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There is a good book on the subject, with perl source code: Internet Agents: Spiders, Wanderers, Brokers, and Bots by Fah-Chun Cheong New Riders The WebWalker perl source can be found at: http://deluge.stanford.:8000/book/WebWalker it requires the libwww-perl: http://www.ics.uci.edu/WebSoft/libwww-perl/ I'd give you my spider, NetNose, which I wrote mostly before I had this book, but it is not as well organized and complete at the http protocol level. Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html From owner-robots Thu Feb 8 09:21:02 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13742; Thu, 8 Feb 96 09:21:02 -0800 Date: Thu, 8 Feb 1996 11:20:41 -0600 (CST) From: Keith Fischer <kfischer@mail.win.org> X-Sender: kfischer@winc0 To: robots@webcrawler.com Subject: Re: Wanted: Web Robot code - C/Perl In-Reply-To: <199602081309.OAA00424@popeye.cyberstation.fr> Message-Id: <Pine.SOL.3.91.960208111653.8827A-100000@winc0> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Notice to all: By Feb 12, 1996 (4 days from now) The code you seek should be available on the robot and search engine faq. I'll release the address and the full body html at that time. Thanks for your time. Keith D. Fischer kfischer@sy.smsu.edu kfischer@mail.win.org On Thu, 8 Feb 1996, Patrick 'Zapzap' Lin wrote: > hello, > i am interested too to a code for a robot in perl. > if you find something can you tell what or where please > > thank for all > Pat > From owner-robots Thu Feb 8 16:13:00 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09496; Thu, 8 Feb 96 16:13:00 -0800 From: dino@im.mgt.ncu.edu.tw (dino) Message-Id: <9602090013.AA21600@mgt.ncu.edu.tw> Subject: Re: Perl Spiders To: robots@webcrawler.com Date: Fri, 9 Feb 1996 08:11:08 +0800 (CST) In-Reply-To: <9602081643.AA23579@grasshopper.ucsd.edu> from "Christopher Penrose" at Feb 8, 96 08:43:59 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 569 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > The WebWalker perl source can be found at: > http://deluge.stanford.:8000/book/WebWalker ^^^^^^^^^^^^^^^^^ Hello..:) sorry, I seem can't connect to this site? lack something ?? 
-- _________Jiing-Chau Shieh _________________________________________NCU_MIS__ ______////(__///_ E-mail: dino@im.mgt.ncu.edu.tw I.am........\ O ) ________\_// managers@im.mgt.ncu.edu.tw ...Xfish.../___)_<<<__ ___/ \\ URL: http://www.mgt.ncu.edu.tw/~dino/ ___________________\\\ .................................................... From owner-robots Thu Feb 8 16:32:44 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10928; Thu, 8 Feb 96 16:32:44 -0800 Message-Id: <9602090032.AA26417@grasshopper.ucsd.edu> Content-Type: text/plain Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Date: Thu, 8 Feb 96 16:31:58 -0800 To: robots@webcrawler.com Subject: Re: Perl Spiders Cc: penrose@grasshopper.ucsd.edu References: <9602090013.AA21600@mgt.ncu.edu.tw> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Oops! I apologize for sending out a bad url... The WebWalker perl source can be found at: http://www.mcp.com/softlib/Internet/WebWalker Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html From owner-robots Thu Feb 8 16:49:06 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12184; Thu, 8 Feb 96 16:49:06 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130503ad404a3faa3b@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 8 Feb 1996 16:50:00 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: News Clipper for newsgroups - Windows Cc: "Richard Glenner" <rtglennr@Onramp.NET> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 11:42 PM 2/2/96, Richard Glenner wrote: >Does anyone know if there is a News Clipper application for >newsgroups that runs under Windows 95. or A helper application that >accomplishes the same thing for Netscape? Especially one that uses >filter rules similar to Pegasus PMAIL that you can create a general >set of filters that can be executed as a group. The last alternative is an >example of how to write one in Visual Basic. Verity's Agent Server does this, but it's not priced for small applications. If you think you might want to consider it, I can direct you to the appropriate product manager. Nick Arnett Internet Marketing Manager Verity Inc. From owner-robots Thu Feb 8 17:47:23 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16054; Thu, 8 Feb 96 17:47:23 -0800 From: chuck@stratus.Geog.Mankato.MSUS.EDU (Charlie Brown) Message-Id: <9602090145.AA03410@stratus> Subject: Re: Perl Spiders To: robots@webcrawler.com Date: Thu, 8 Feb 1996 19:45:31 -0600 (CST) In-Reply-To: <9602090013.AA21600@mgt.ncu.edu.tw> from "dino" at Feb 9, 96 08:11:08 am X-Mailer: ELM [version 2.4 PL24] Content-Type: text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > The WebWalker perl source can be found at: > > http://deluge.stanford.:8000/book/WebWalker > ^^^^^^^^^^^^^^^^^ > Hello..:) > > sorry, I seem can't connect to this site? > lack something ?? Obviously, this is Stanford U, so the correct address is deluge.stanford.edu However, as of tonight, I can't get up a connection. I have recieved quite a few replies, and will post a summary next week. In the meantime, I believe Keith Fischer will be releasing a faq on Monday. 
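For anyone in this thread who just wants a starting point before the FAQ and summary appear, a bare-bones crawler along the following lines is about the minimum. This is a sketch, not WebWalker and not anyone's production robot: it assumes the LWP::RobotUA and HTML::LinkExtor modules from libwww-perl, uses placeholder URLs and addresses, and leaves out indexing, error handling and almost everything else.

use LWP::RobotUA;
use HTTP::Request;
use HTML::LinkExtor;

# LWP::RobotUA fetches and obeys /robots.txt for each site, and enforces
# a per-host delay, so the politeness rules are not left entirely to us.
my $ua = LWP::RobotUA->new('ExampleRobot/0.1', 'you@your.site');  # placeholder name/address
$ua->delay(1);                      # at least one minute between hits on a host

my @queue = ('http://www.example.com/');   # placeholder start URL
my %seen;
my $fetched = 0;

while (@queue and $fetched < 25) {  # small hard limit for the sketch
    my $url = shift @queue;
    next if $seen{$url}++;

    my $res = $ua->request(HTTP::Request->new(GET => $url));
    next unless $res->is_success and $res->content_type eq 'text/html';
    $fetched++;

    print "fetched $url (", length($res->content), " bytes)\n";
    # ... hand $res->content to an indexer here ...

    # Queue up <A HREF> links, resolved against the response base.
    my $extor = HTML::LinkExtor->new(undef, $res->base);
    $extor->parse($res->content);
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        push @queue, "$attr{href}" if $tag eq 'a' and $attr{href};
    }
}

The Cheong book and the libwww-perl distribution mentioned earlier in the thread cover the same ground in much more depth.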
Chuck -- Charlie Brown -- chuck@stratus.geog.mankato.msus.edu Software Designer @ Owatonna Tool Company, Part-time Student <A Href="Http://www.geog.mankato.msus.edu/~chuck>Chuck</a> From owner-robots Fri Feb 9 01:39:24 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13205; Fri, 9 Feb 96 01:39:24 -0800 From: mannina@crrm.univ-mrs.fr Message-Id: <960209113938.ZM7511@mitac> Date: Fri, 9 Feb 1996 11:39:30 +0200 In-Reply-To: Christopher Penrose <penrose@grasshopper.ucsd.edu> "Re: Perl Spiders" (Feb 8, 4:31pm) References: <9602090013.AA21600@mgt.ncu.edu.tw> <9602090032.AA26417@grasshopper.ucsd.edu> X-Mailer: Z-Mail 4.0 (4.0.0 Aug 21 1995) To: robots@webcrawler.com Subject: Re:Re: Perl Spiders Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, >The WebWalker perl source can be found at: >http://www.mcp.com/softlib/Internet/WebWalker Sorry, but your new adress doesn't work... -- ***** ***** ***** * * * * * * * ** ** * * * * * * * * * * * * * * * ****** * * * * * * Centre de Recherche Retrospective de Marseille http://crrm.univ-mrs.fr ///////////////////////////////////// // Mannina Bruno // // D.E.A Veille Technologique // // e.mail : // // mannina@crrm.univ-mrs.fr // ///////////////////////////////////// From owner-robots Fri Feb 9 08:59:31 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03294; Fri, 9 Feb 96 08:59:31 -0800 Date: Fri, 9 Feb 96 08:58:47 -0800 From: Christopher Penrose <penrose@grasshopper.ucsd.edu> Message-Id: <9602091658.AA02351@grasshopper.ucsd.edu> To: robots@webcrawler.com Subject: Here is WebWalker Cc: penrose@grasshopper.ucsd.edu Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Since various people have been having difficulty reaching this url: http://www.mcp.com/softlib/Internet/WebWalker I will email it to people who request it. It is simply an ascii perl script about 80k in size. It is also available at: http://www-crca.ucsd.edu/websoft/WebWalker Again, this script requires the libwww perl library found at: http://charlotte.ics.uci.edu/pub/websoft/libwww-perl/ Christopher Penrose penrose@ucsd.edu http://www-crca.ucsd.edu/TajMahal/after.html From owner-robots Fri Feb 9 20:57:03 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00406; Fri, 9 Feb 96 20:57:03 -0800 From: dino@im.mgt.ncu.edu.tw (dino) Message-Id: <9602100457.AA04194@mgt.ncu.edu.tw> Subject: Re: Here is WebWalker To: robots@webcrawler.com Date: Sat, 10 Feb 1996 12:56:50 +0800 (CST) In-Reply-To: <9602091658.AA02351@grasshopper.ucsd.edu> from "Christopher Penrose" at Feb 9, 96 08:58:47 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Content-Length: 943 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Since various people have been having difficulty reaching > this url: > > http://www.mcp.com/softlib/Internet/WebWalker > > I will email it to people who request it. It is simply an ascii perl > script about 80k in size. It is also available at: > > http://www-crca.ucsd.edu/websoft/WebWalker > > Again, this script requires the libwww perl library found at: > > http://charlotte.ics.uci.edu/pub/websoft/libwww-perl/ > hello..:) ya..I got this pero script... But it need one TaskFile ....How to write this TaskFile ? Any Examples ?? 
Thanks..:) -- _________Jiing-Chau Shieh _________________________________________NCU_MIS__ ______////(__///_ E-mail: dino@im.mgt.ncu.edu.tw I.am........\ O ) ________\_// managers@im.mgt.ncu.edu.tw ...Xfish.../___)_<<<__ ___/ \\ URL: http://www.mgt.ncu.edu.tw/~dino/ ___________________\\\ .................................................... From owner-robots Mon Feb 12 13:19:53 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17847; Mon, 12 Feb 96 13:19:53 -0800 Date: Mon, 12 Feb 1996 15:19:43 -0600 Message-Id: <9602122119.AA24141@wins0> X-Sender: kfischer@pop.win.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: kfischer@mail.win.org (Keith D. Fischer) Subject: The "Robot and Search Engine FAQ" X-Mailer: <Windows Eudora Version 2.0.2> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Will be delayed until Tuesday, Feb 13, 1996. Sorry for the delay folks. You can expect the FAQ up and running at 9:00 am Central Standard Time. Keith D. Fischer From owner-robots Mon Feb 12 14:31:32 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21807; Mon, 12 Feb 96 14:31:32 -0800 Date: Mon, 12 Feb 1996 15:28:08 -0700 (MST) From: Kenneth DeMarse <pinkman@bpaosf.bpa.arizona.edu> To: robots@webcrawler.com Subject: algorithms Message-Id: <Pine.OSF.3.91.960212152120.9335B-100000@bpaosf.bpa.arizona.edu> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello, I am looking for websites, journals, etc.. that will provide detailed information on the following items. I need the information for a class presentation discussing "internet spiders". I am only concerned with spiders used for information retrieval and indexing, or website/link maintenance. Specifically, I am interested in finding: 1) Information on what algorithms spiders use to avoid overloading servers 2) Information on what indexing/similarity/etc algorithms are used on the information that a spider gathers, and what kind of searches (ie keyword, concept space, etc.) these algorithms support. 3) Actual code of any spider The student in the class are all advanced programmers, so I don't want to bore them with the definition of a spider, etc.. I want to get into algorithmic detail. I would appreciate any and all information. please send mail to : pinkman@bpaosf.bpa.arizona.edu From owner-robots Tue Feb 13 07:56:18 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17520; Tue, 13 Feb 96 07:56:18 -0800 From: rvaquero@gugu.usal.es (Jose Raul Vaquero Pulido) Message-Id: <9602131558.AA00821@gugu.usal.es> Subject: Re: algorithms too To: robots@webcrawler.com Date: Tue, 13 Feb 1996 16:58:42 +0100 (GMT+0100) In-Reply-To: <Pine.OSF.3.91.960212152120.9335B-100000@bpaosf.bpa.arizona.edu> from "Kenneth DeMarse" at Feb 12, 96 03:28:08 pm X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 812 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello: First, Sorry for my bad English. I'm doing the doctorate in documentation (expeciality in the use of the Spiders and its utility in the libraries and documentation centers). I'm looking for information about: 1) Where could I find more about the use of the spiders in the libraries or documentation center? 
2) Where could I find basical information about spiders (I have read everything of http://info.webcrawler.com/mak/projects/ and its links. and I have read everything in this list (since january 1996)). I know that my request is very basical for everybody here, but I realy need this information because I don't know where go and because I'm new in this material. Thank everybody for all: please send mail to : rvquero@gugu.usal.es From owner-robots Tue Feb 13 10:24:29 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26364; Tue, 13 Feb 96 10:24:29 -0800 Date: Tue, 13 Feb 1996 12:24:15 -0600 Message-Id: <9602131824.AA25204@wins0> X-Sender: kfischer@pop.win.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: kfischer@mail.win.org (Keith D. Fischer) Subject: The Robot And Search Engine FAQ X-Mailer: <Windows Eudora Version 2.0.2> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Due to the size of the file and number of links, I've decided to just post the url. If anyone would like a copy just yell, I'll mail it. The Robot and Search Engine FAQ can be found at http://science.smsu.edu/robot/faq/robot.html p.s. It's somewhat pollished but, you may still need your construction hat. Please comment, my goal is to provide the best possible FAQ. Keith D. Fischer kfischer@mail.win.org kfischer@sy.smsu.edu From owner-robots Wed Feb 14 06:19:22 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05282; Wed, 14 Feb 96 06:19:22 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: <v02140802ad46e0c02821@[199.221.45.139]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 14 Feb 1996 06:21:20 -0700 To: robots@webcrawler.com From: "John McGrath - Money Spider Ltd." <spider@enterprise.net> (by way of m.koster@webcrawler.com (Martijn Koster)) Subject: Money Spider WWW Robot for Windows Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi - This may be of interest to you... PRESS RELEASE Internet Software Innovation announcement ========================================= Money Spider Ltd. has developed a world first for the internet. The product is a software tool so useful - everybody will wonder how they managed without it. The name of the product is: 'Arachnid - the Personal WWW Robot for Windows'. Arachnid is used in conjunction with your existing internet connection, and effectively gathers information from targeted servers / URLs. What Arachnid does is to collect HTML, GIFs, AVIs, WAV files automatically and save them locally to disk. For high speed collection of HTML, all other file options can be disabled. Current benchmarks using a 28,800 bps modem have given results as startling as 20 average HTML pages per minute. Your online time is drastically reduced with Arachnid, as all the HTML etc. can be viewed offline at your leisure. ------------------------------------------------------------------------- Other powerful features allow things like: Mailing lists to be constructed as Arachnid can optionally save all mailto: addresses found. JAVA, Internet Explorer, VRML sites can be detected by Arachnid and lists made of the pages as detected. Automatic emailing on the fly - as email addresses are encountered, pre- defined email could be sent. The user interface for Arachnid is very 'friendly' with tab dialogue construction and an inbuilt URL database. 
----------------------------------------------------------------------- A downloadable pre-release 16-bit version will be available by the end of the week. Further details and screenshots can be viewed at our web site: http://www.moneyspider.com Please contact: Mr. David Ogden Sales Director Money Spider Ltd. 67 Albert Street, Rugby, Warwickshire CV21 2SN UK Email: david@moneyspider.com Phone: 01788-552057 Fax: 01788-552056 -----End From owner-robots Wed Feb 14 06:52:17 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07292; Wed, 14 Feb 96 06:52:17 -0800 Date: Wed, 14 Feb 96 09:32:58 EST From: "Tangy Verdell" <TVerdell@dca.com> Message-Id: <9601148243.AA824320372@smtphost.dca.com> To: robots@webcrawler.com Subject: robots.txt changes how often? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com in my robot, i don't want to check the robots.txt everytime i visit that host because i think that robots.txt doesn't change very often - ususally. is this true? From owner-robots Wed Feb 14 08:16:16 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12878; Wed, 14 Feb 96 08:16:16 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: <v02130502ad47b9e4a676@[192.187.143.12]> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 14 Feb 1996 08:16:24 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Money Spider WWW Robot for Windows Cc: spider@enterprise.net Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 6:21 AM 2/14/96, "John McGrath - Money Spider Ltd." (by wrote: >Arachnid is used in conjunction with your existing internet connection, >and effectively gathers information from targeted servers / URLs. > >What Arachnid does is to collect HTML, GIFs, AVIs, WAV files automatically >and save them locally to disk. For high speed collection of HTML, all other >file options can be disabled. Current benchmarks using a 28,800 bps modem >have given results as startling as 20 average HTML pages per minute. Are the pages then searcheable on the client, or is the idea mostly to create a local cache? >Mailing lists to be constructed as Arachnid can optionally save >all mailto: addresses found. ... >Automatic emailing on the fly - as email addresses are encountered, pre- >defined email could be sent. Why would I want a spider to send e-mail automatically like this to people whom I may or may not know? Nick From owner-robots Wed Feb 14 08:51:15 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16483; Wed, 14 Feb 96 08:51:15 -0800 Date: Wed, 14 Feb 96 11:33:33 EST From: "Tangy Verdell" <TVerdell@dca.com> Message-Id: <9601148243.AA824327479@smtphost.dca.com> To: robots@webcrawler.com Subject: robots.txt Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com how standardized is the robots.txt file. has anyone ran into problems where a significant number of sites have made typos or erros in their robots.txt file. for example: spelling User-agent as useragent. ______________________________ Reply Separator _________________________________ Subject: The Robot And Search Engine FAQ Author: robots@webcrawler.com at Internet Date: 2/13/96 3:03 PM Due to the size of the file and number of links, I've decided to just post the url. If anyone would like a copy just yell, I'll mail it. The Robot and Search Engine FAQ can be found at http://science.smsu.edu/robot/faq/robot.html p.s. 
It's somewhat pollished but, you may still need your construction hat. Please comment, my goal is to provide the best possible FAQ. Keith D. Fischer kfischer@mail.win.org kfischer@sy.smsu.edu From owner-robots Wed Feb 14 11:14:25 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26499; Wed, 14 Feb 96 11:14:25 -0800 X-Sender: dchandler@abilnet.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Darrin Chandler <dchandler@abilnet.com> Subject: Re: robots.txt changes how often? Date: Wed, 14 Feb 1996 12:14:16 -0700 Message-Id: <19960214191416557.AAA46@darrin.abilnet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 09:32 2/14/96 EST, you wrote: > in my robot, i don't want to check the robots.txt everytime i visit > that host because i think that robots.txt doesn't change very often - > ususally. > > is this true? > Depends on what you mean by "visit". Checking once per day would be very little overhead, but would make your program fairly responsive to changes. Also, use the "If-modified-since" request header so that you'll get a 304 result if robots.txt hasn't changed. Darrin ______________________________________________ _/| _| _| _| _/_| _| _| _| _| _|_|_| _/ _| _| _| _| _/_|_| _|_|_| _| _| _| _| _| _| _/ _| _| _| _| _| _| _| _| _| _/ _| _|_|_| _| _| _| _|_| _|_|_| _| _|_|_| Darrin Chandler, Duke of URL Ability Software & Productions Email: dchandler@abilnet.com WWW: http://www.abilnet.com/ ______________________________________________ From owner-robots Wed Feb 14 11:55:50 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29298; Wed, 14 Feb 96 11:55:50 -0800 Message-Id: <199602141955.LAA04940@wally.cs.washington.edu> In-Reply-To: kfischer@mail.win.org's message of Tue, 13 Feb 1996 12:24:15 -0600 To: robots@webcrawler.com Subject: Re: The Robot And Search Engine FAQ Gcc: nnfolder:misc-mail References: <9602131824.AA25204@wins0> From: Erik Selberg <selberg@cs.washington.edu> Date: Wed, 14 Feb 1996 11:55:36 PST Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I noticed that you have a somewhat extensive section on robot-based search services, yet you don't mention other types of services, such as human-created (which I think Yahoo is, although I could be wrong) or meta (such as MetaCrawler or SavvySearch). I only mention these because it you have a few sections that mention some extra benefits of various search services available which aren't related at all to robots, so I'm advocating either eliminating those sections and sticking just to robots (which I don't agree with) or adding sections which form a more complete picture. Since I'm suggesting them, I'd be happy to help in any way with these additions. Thanks! -Erik From owner-robots Wed Feb 14 12:19:55 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01315; Wed, 14 Feb 96 12:19:55 -0800 Date: Wed, 14 Feb 1996 20:19:17 GMT From: cs0sst@isis.sunderland.ac.uk (Simon.Stobart) Message-Id: <9602142019.AA18801@osiris.sund.ac.uk> To: robots@webcrawler.com Subject: Re: Money Spider WWW Robot for Windows X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Why would I want a spider to send e-mail automatically like this to people > whom I may or may not know? Well I suppose you could then have internet mailshots ! 
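To make the caching idea Darrin describes concrete, here is a rough sketch of that logic, assuming a Perl 5 robot with the libwww-perl (LWP) modules installed; the once-a-day interval, the helper name and the hashes are illustrative choices, not part of any standard.

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Date qw(time2str);

my %robots_rules;       # host => cached /robots.txt body
my %robots_checked;     # host => time we last checked it

# Re-check /robots.txt at most once a day; send If-Modified-Since so an
# unchanged file only costs a 304 response.
sub refresh_robots {
    my ($host) = @_;
    return if time() - ($robots_checked{$host} || 0) < 24 * 60 * 60;

    my $req = HTTP::Request->new(GET => "http://$host/robots.txt");
    $req->header('If-Modified-Since' => time2str($robots_checked{$host}))
        if $robots_checked{$host};

    my $res = LWP::UserAgent->new->request($req);
    if ($res->is_success) {                  # new or changed file
        $robots_rules{$host} = $res->content;
    }
    elsif ($res->code != 304) {              # 404 etc.: treat as "no restrictions"
        delete $robots_rules{$host};
    }                                        # 304 means the cached copy is still good
    $robots_checked{$host} = time();
}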
|------------------------------------+-------------------------------------| | Simon Stobart, | Net: simon.stobart@sunderland.ac.uk | | Lecturer in Computing, | Voice: (+44) 091 515 2838 | | School of Computing | Fax: (+44) 091 515 2781 | | & Information Systems, + ------------------------------------| | University of Sunderland, SR1 3SD, | 007: Balls Q? | | England. | Q: Bolas 007! | |------------------------------------|-------------------------------------| From owner-robots Wed Feb 14 22:18:59 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08475; Wed, 14 Feb 96 22:18:59 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199602150618.IAA00648@krisse.www.fi> Subject: Re: robots.txt changes how often? To: robots@webcrawler.com Date: Thu, 15 Feb 1996 08:18:47 +0200 (EET) In-Reply-To: <9601148243.AA824320372@smtphost.dca.com> from "Tangy Verdell" at Feb 14, 96 09:32:58 am X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 244 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > in my robot, i don't want to check the robots.txt everytime i visit > that host because i think that robots.txt doesn't change very often - > ususally. > > is this true? Sure. I check it only every 8 hours. From owner-robots Thu Feb 15 00:39:09 1996 Return-Path: <owner-robots> Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15222; Thu, 15 Feb 96 00:39:09 -0800 From: Jaakko Hyvatti <Jaakko.Hyvatti@www.fi> Message-Id: <199602150838.KAA03129@krisse.www.fi> Subject: Re: robots.txt To: robots@webcrawler.com Date: Thu, 15 Feb 1996 10:38:43 +0200 (EET) In-Reply-To: <9601148243.AA824327479@smtphost.dca.com> from "Tangy Verdell" at Feb 14, 96 11:33:33 am X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 5318 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Tangy Verdell <TVerdell@dca.com>: > how standardized is the robots.txt file. has anyone ran into problems where a > significant number of sites have made typos or erros in their robots.txt file. Of 810 Finnish (*.fi) sites 731 have no robots.txt, and of the rest 79 the following 10 have something wrong with theirs. I'll list them here because they are so few, just for laughs and to give you some material. No actual typos here, but other mistakes (no empty lines between entries, multiple directories on one line, empty user-agent). (I have mailed the webmasters, but it doesn't always help.. and if the server software gives status 200 OK instead of 404, how can we expect to have a conforming robots.txt?) ******** http://pmtpc2.hut.fi/robots.txt # robots.txt for http://pmtpc2.hut.fi/ User-agent: * Disallow: User-agent: JumpStation Disallow: User-agent: Webcrawler/0.00000001 Disallow: User-agent: Lycos/x.x Disallow: User-agent: EIT-Link-Verifier-Robot/0.2 Disallow: User-agent: Disallow: ******** http://www.cardinal.fi/robots.txt <text/html><body> <p><strong>Error: file '/usr/Web/cardinal/robots.txt' can not open No such file or directory</strong></p> ******** http://www.kemi.fi/robots.txt <TITLE>WWW.KEMI.FI

City of Kemi
Kemin Kaupungin ATK-osasto
Keskuspuistokatu 20
Puhelin: 9698- 259224
e-mail: webmaster@kemi.fi
URL-osoite meni metsään !!!
www.kemi.fi
webmaster@kemi.fi
******** http://www.datum.fi/robots.txt
Error 404
Unable to open the specified file
httpd-info@glaci.com
******** http://joynws1.joensuu.fi/robots.txt
Error 404
Unable to open the specified file
httpd-info@glaci.com
******** http://kevdog.abo.fi/robots.txt Help me out here. You've requested a file called "robots.txt". That file does not exist on this site, nor has it ever existed. There just simply has never been such a file here. (Actually, that is not entirely true. The file you're currently reading is called "robots.txt".) Each day, though, 5 or 10 people try to check out "robots.txt". It's annoying. So, could you be so kind as to e-mail me and tell me what site gave you the URL saying that there was a "robots.txt" file on this site. I'll contact the webmaster there and make a suitable plea that they remove the link. Or perhaps I should make some goofy "robots.txt" file. Kev kev@ray.abo.fi ******** http://kirke.helsinki.fi/robots.txt This is the file "robots.txt"

Here is an extract of the server error log:

[Wed Jun 29 05:56:09 1994] httpd: access to /usr/local/www/robots.txt failed for beta.xerox.com, reason: file does not exist
[Wed Jun 29 13:36:05 1994] httpd: access to /usr/local/www/robots.txt failed for beta.xerox.com, reason: file does not exist
[Thu Jun 30 14:46:31 1994] httpd: access to /usr/local/www/robots.txt failed for beta.xerox.com, reason: file does not exist
[Mon Aug 1 18:46:59 1994] httpd: access to /usr/local/www/robots.txt failed for beta.xerox.com, reason: file does not exist
[Sat Aug 20 09:13:31 1994] httpd: access to /usr/local/www/robots.txt failed for pentland.stir.ac.uk, reason: file does not exist
[Mon Aug 22 22:32:36 1994] httpd: access to /usr/local/www/robots.txt failed for halsoft.com, reason: file does not exist
[Tue Aug 23 01:11:08 1994] httpd: access to /usr/local/www/robots.txt failed for indy1.lri.fr, reason: file does not exist
[Tue Aug 23 01:35:45 1994] httpd: access to /usr/local/www/robots.txt failed for indy1.lri.fr, reason: file does not exist

As you can see, the file robots.txt has not existed on this server. I created it solely to get this message through. If you read this message I'd appreciate it if you'd send me the document address containing the link to http://kirke.helsinki.fi/robots.txt.

It is no big deal, but I am curious!

Regards,

Heikki Lehväslaiho, Server Manager, Heikki.Lehvaslaiho@Helsinki.FI
******** http://laaksonen.csc.fi/robots.txt
no robots
******** http://teknet.tky.hut.fi/robots.txt
# Robots.txt file for teknet.tky.hut.fi, robots welcome!
User-agent: *
Disallow: /cgi-bin /linux /inet /gallup
******** http://www.csc.fi/robots.txt
User-agent: *
Disallow: /app-defaults/
Disallow: /backup/
Disallow: /cgi-bin/
Disallow: /htbin/
Disallow: /tools/
Disallow: /math_topics/backup/
Disallow: /math_topics/data/
Disallow: /math_topics/icons/
Disallow: /math_topics/scripts/
Disallow: /math_topics/texts/
Disallow: /math_topics/wais/
Disallow: /programming/wais/
User-agent: Peregrinator-Mathematics
Disallow: /math_topics/GAMS/
Disallow: /math_topics/opt/
Disallow: /math_topics/net/
From owner-robots Thu Feb 15 01:28:59 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18270; Thu, 15 Feb 96 01:28:59 -0800 Date: Thu, 15 Feb 1996 10:43:22 +0100 (MET) From: Michael De La Rue To: Eric Nolan Cc: robots@webcrawler.com Subject: Re: Commercial Robot Vendor Recommendations Request In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Wed, 14 Feb 1996, Eric Nolan wrote: There really isn't much serious software available for 95. If you had Perl then that would change, but I believe that Perl isn't properly ported to 95. Perhaps you should consider upgrading to NT or Linux. Commercial packaging isn't really viable either. Consider Artistic-licensed or freeware packages, because they are better/cheaper/mostly both. > We are a organization in search of Vendors of commercially packaged > automated Robot Surfer, URL filer, and Html/Graphics Grab package for > Win95 Platform So far we have only found Blue squirrel. Can anyone please > recommend other vendors. Sorry I couldn't be of more help. Scottish Climbing Archive: Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/ From owner-robots Thu Feb 15 02:26:03 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21296; Thu, 15 Feb 96 02:26:03 -0800 Date: Thu, 15 Feb 1996 11:40:41 +0100 (MET) From: Michael De La Rue To: robots@webcrawler.com Subject: Canonical Names for documents (was Re: Server name in /robots.txt) In-Reply-To: <0099CC32C0158FC0.EFF2@NEWTON.NPL.CO.UK> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com To go back to the original argument which started the Meta debate, I would like to suggest a tag along the lines of <LINK> as a way of building in the logic for canonical names. Then, pressure could be put on people building mirroring packages (which seem to be becoming more popular) to put these in. This will mess up MD5 hashing schemes, but those wouldn't work anyway, as most mirror systems will be changing BASENAME tags. The idea would be that original authors would be encouraged to put this in as much as possible (most won't, but probably those who have well-maintained pages will), and otherwise mirror robots MUST do it. The search engines should then identify these as the same document and provide one header followed by a list of mirror sites, possibly with last-modified dates, for example:

  this-is-main-link
    this is first 250 words from the text..
    alternates
  • mirror sites WITH VISIBLE URL to allow choice by user
The reason to favour this over automatic selection or something similar is that it gives the user the choice of where to go (does he need the original, or is one of the mirror sites known to be reliable or otherwise?).

LINK has the advantages that:
• it's an agreed standard tag that shouldn't have any browser interactions
• it stays away from the META info controversy
• it could be used to implement a 'go to original' button in browsers

On the subject of the META tag etc.: I don't agree with the efficiency argument, because I think the meta scheme can be implemented properly (just keep a meta-data cache and re-parse if a document gets changed under you), so people that don't are their own problem. I'm largely convinced by the argument that MK put in against the HTTP-equiv tag, provided that some alternative methods are added to HTTP to lower bandwidth requirements. Just saying that alternatives could exist isn't enough. I think that the proposal is better than having no meta-data. I realise that initially the meta-data is going to be largely ignored, but as people start to maintain pages it will begin to be useful, especially where (like me) you're dealing with a large number of sites who are willing to cooperate loosely, and many of whom will be quite willing to implement it.

Finally, getting the tops of documents: what's a good way of implementing MK's suggestion of only getting the first couple of k of a document to allow extracting META info? I would probably want to stop at the end of the header, but how do you know when this has ended in a badly written document? Do you stop at the first non-head element (
?) or is there something more robust Scottish Climbing Archive: Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/ From owner-robots Thu Feb 15 10:30:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20582; Thu, 15 Feb 96 10:30:34 -0800 Date: Thu, 15 Feb 1996 13:27:07 -0500 (EST) From: Hayssam Hasbini Subject: Robots and search engines technical information. To: robots@webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I am doing a research paper at Boston University about the concept of operation of robots and spiders, and on search engines and the way they store information and index the web. Can anybody give me useful links or hints where I can find such detailed information (not just general introduction) I would really apreciate your help. Thanks. Hayssam. From owner-robots Thu Feb 15 10:30:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20532; Thu, 15 Feb 96 10:30:05 -0800 Organization: CNR - Istituto Tecnologie Informatiche Multimediale Date: Thu, 15 Feb 1996 19:24:59 -0100 From: davide@jargo.itim.mi.cnr.it (Davide Musella (CNR)) Message-Id: <199602152024.TAA09402@jargo> To: robots@webcrawler.com Subject: fdsf X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com fsdfsdf From owner-robots Fri Feb 16 03:44:52 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19582; Fri, 16 Feb 96 03:44:52 -0800 Date: Fri, 16 Feb 1996 12:45:55 +0100 From: wwwd@win.tue.nl (http) Message-Id: <199602161145.MAA19855@wsinis10.win.tue.nl> To: robots@webcrawler.com Subject: Tutorial Proposal for WWW95 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I have looked at the tutorial subjects of previous WWW conferences (2nd 3rd and 4th) and did not see a tutorial on robots and search engines. So I have written a proposal for a tutorial at WWW95, entitled "Finding Web Information using Search Engines, Index Databases and Robots." The tutorial will cover robot technology (including the robot exclusion protocol, which the next version of our own robot will obey) and the technology used by popular search engines and index databases. Systems that will certainly be presented are ALIWEB, JumpStation, WWWWorm, Yahoo, Tradewave Galaxy, WebCrawler, Lycos, Infoseek, Alta Vista. Suggestions for other systems to review are welcome. But what I would like most is information on how some of these systems work, apart from what can be found on their Web sites and in WWW conference papers. I'm already giving tutorials on this subject, but for the WWW conference I would like to receive information first-hand, from reliable sources. All help is welcome to make this tutorial a success (should it be accepted, which of course remains to be seen). Please send info to debra@win.tue.nl Thanks a lot. Paul. From owner-robots Fri Feb 16 05:58:14 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19944; Fri, 16 Feb 96 05:58:14 -0800 From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9602161358.AA06256@kronos> Subject: Re: Robots and search engines technical information. 
To: robots@webcrawler.com Date: Fri, 16 Feb 1996 13:58:55 +0000 (GMT) In-Reply-To: from "Hayssam Hasbini" at Feb 15, 96 01:27:07 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Hi, > I am doing a research paper at Boston University about the concept of > operation of robots and spiders, and on search engines and the way they > store information and index the web. > Can anybody give me useful links or hints where I can find such detailed > information (not just general introduction) > I would really apreciate your help. > Thanks. > Hayssam. > I've forwarded you a paper as RTF. Hope this helps. Jeremy From owner-robots Fri Feb 16 07:13:42 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20211; Fri, 16 Feb 96 07:13:42 -0800 Date: Fri, 16 Feb 96 09:49:39 EST From: "Tangy Verdell" Message-Id: <9601168244.AA824494289@smtphost.dca.com> To: robots@webcrawler.com Subject: Re: Robots and search engines technical information. Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The closest thing I found is a book named: Internet Agents Spiders, Wanderers, Brokers, and Bots Fah-Chun Cheong New Riders 1-56205-463-5 $32.00 Bots and Other Internet Beasties Joseph Williams Sams Net 1-57521-016-9 Published in April I know in the Internet Agents book he goes over spider and search engine algorithms, but not in detail. But he does point you in the direction to get more detailed info. There is another book called Agents Unleashed. But it is very theoretical, for the academic type. ______________________________ Reply Separator _________________________________ Subject: Robots and search engines technical information. Author: robots@webcrawler.com at Internet Date: 2/15/96 3:23 PM Hi, I am doing a research paper at Boston University about the concept of operation of robots and spiders, and on search engines and the way they store information and index the web. Can anybody give me useful links or hints where I can find such detailed information (not just general introduction) I would really apreciate your help. Thanks. Hayssam. From owner-robots Fri Feb 16 10:55:01 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09700; Fri, 16 Feb 96 10:55:01 -0800 Message-Id: <199602161854.KAA10851@wally.cs.washington.edu> In-Reply-To: "Tangy Verdell"'s message of Fri, 16 Feb 96 09:49:39 EST To: robots@webcrawler.com Subject: Re: Robots and search engines technical information. Gcc: nnfolder:misc-mail References: <9601168244.AA824494289@smtphost.dca.com> From: Erik Selberg Date: Fri, 16 Feb 1996 10:54:53 PST Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Bots and Other Internet Beasties > Joseph Williams > Sams Net > 1-57521-016-9 > Published in April Just an aside: the publisher of this book was sending out pretty broad "Please write this book!" messages to lots of people on various mailing lists and Web sites. Both Oren and I received one, as well as most of the rest of the UW Softbots group. Needless to say, while I don't want to say anything about Mr. Williams or his writing, I did want to mention that the publisher's standards for authors seemed quite low. 
-Erik From owner-robots Fri Feb 16 11:13:36 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11637; Fri, 16 Feb 96 11:13:36 -0800 Date: Fri, 16 Feb 1996 19:13 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <0099E02B4AEEA620.0EF2@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Re: robots.txt changes how often? X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I seem to have lost the attribution to this question, sorry... > in my robot, i don't want to check the robots.txt everytime i visit > that host because i think that robots.txt doesn't change very often - > ususally. I can imagine the /robots.txt _files_ don't change very frequently but what would happen if you met a script? I'm wondering whether a script would/could deflect robot activity towards the 'quieter' hours on the server. I was sure that there was a 'server too busy' error code, one of the '500's, but I can only see a '503 Service unavailable' in the draft spec I have.... What would robots do if presented with that when reading /robots.txt, back off and come back later? back off completely? ignore and carry on searching? Regards, Martin Kiff National Physical Laboratory, UK http://www.npl.co.uk/ From owner-robots Fri Feb 16 11:36:28 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14102; Fri, 16 Feb 96 11:36:28 -0800 Message-Id: <9602161933.AA17917@powell.mcom.com> From: "Darren R. Hardy" To: robots@webcrawler.com Subject: URL measurement studies? Date: Fri, 16 Feb 1996 11:33:26 -0800 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anyone know of a measurement study for URL lengths (not document size, but the actual number of bytes in the URL itself). If I were to guess at a histogram, I'd say: 60-80% are <256 bytes 80-90% are <512 bytes 90-95% are <1k bytes 95-97% are <4k bytes 97-99% are <8k bytes 100% are <16k bytes Maybe that's overestimating the length of URLs in the Internet? Based on Harvest's WWW Home Pages Gatherer, I found that 94.5% were <64 bytes, 98.5% were <80 bytes, 99.9% were <128 bytes, but that's only representative of typically short "home page" URLs. Thoughts? -Darren From owner-robots Fri Feb 16 13:16:42 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25345; Fri, 16 Feb 96 13:16:42 -0800 Message-Id: <199602162116.QAA01751@santoor.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: URL measurement studies? In-Reply-To: Your message of "Fri, 16 Feb 1996 11:33:26 PST." <9602161933.AA17917@powell.mcom.com> Date: Fri, 16 Feb 1996 16:16:33 -0500 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com i think that the current trend to making virtual domains for every new project, eg, movies toystory.com and i think brokenarrow.com, shows a trend which differentiates "usable" or "key" URLs from other stuff. if i see a URL on the bottom of my TV screen or on the side of a bus, i cant remember much more than foobar.com or fizhot.net, and i know to prepend "http://www." to anything. i think people always evolve things this way, at least at first. remember the phone numbers used to be BAkersfield9-4455 as a memory device. this is a different device, but the fact that we dont use it anymore, that we are "trained" as kids to recall 7 digits while older generations needed a memory device is interesting. 
-john From owner-robots Sat Feb 17 01:32:00 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12163; Sat, 17 Feb 96 01:32:00 -0800 Message-Id: <199602170927.KAA08947@storm.certix.fr> Comments: Authenticated sender is From: savron@world-net.sct.fr To: robots@webcrawler.com Date: Sat, 17 Feb 1996 10:15:19 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: about robots.txt content errors Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Why not doing some sort of 'fuzzy' or 'loose' scan to compensate for typos and others errors ? From owner-robots Sat Feb 17 06:36:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12771; Sat, 17 Feb 96 06:36:32 -0800 Date: Sat, 17 Feb 1996 07:36:12 -0700 Message-Id: <9602171436.AA183398@lamar.ColoState.EDU> X-Sender: drj@lamar.colostate.edu X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: joseph williams Subject: Re: Robots and search engines technical information. Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Sams.Net used a standard "call for authors" approach for the book. Virtually all edited books use this format. Only about a third of the authors who responded had their work accepted. The decisions regarding whose works were accepted were mine, not the publishers, and I think you'll find that "low standards" is an inappropriate assessment of the contributions made by the 20 international authors. I only posted the "call for authors" to Infosys, comp.ai, and Ray Johnson's Software Agents listserv, so the author base was tightly and appropriately focused. There are several chapters that deal with robots per se, but the book itself is not targeted specifically for the "robots" audience (for example, the designer of the Savvy Search engine wrote the core robots chapters). Rather, the focus is more broadly on the general topics of agents. The target reader is a beginner, although the last fifth of the book is for more advanced readers. I think you'll find the overall depth and quality of the book will be pleasantly surprising. Joseph Williams, Ph.D. Colorado State University CIS Department At 10:54 AM 2/16/96 PST, you wrote: >> Bots and Other Internet Beasties >> Joseph Williams >> Sams Net >> 1-57521-016-9 >> Published in April > >Just an aside: the publisher of this book was sending out pretty broad >"Please write this book!" messages to lots of people on various >mailing lists and Web sites. Both Oren and I received one, as well as >most of the rest of the UW Softbots group. Needless to say, while I >don't want to say anything about Mr. Williams or his writing, I did >want to mention that the publisher's standards for authors seemed >quite low. > >-Erik > From owner-robots Sat Feb 17 09:52:28 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17695; Sat, 17 Feb 96 09:52:28 -0800 Message-Id: <01B9DE66.6A2D8D80@posse.mnsinc.com> From: Sean Parker To: "'robots@webcrawler.com'" Subject: Dot dot problem... Date: Fri, 17 Feb 1995 13:00:15 -0500 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The robot I am coding has problems dealing with links that point one level upward to the parent page with ".." 
I end up pulling the first link, re-pulling all the pages up to that nasty guy link, and then going up a level again. Help me or your server dies when my robot hits.. :) Thanx, Sean From owner-robots Sun Feb 18 09:01:00 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02506; Sun, 18 Feb 96 09:01:00 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 18 Feb 1996 09:03:18 -0700 To: robots@webcrawler.com From: Steve Livingston (by way of m.koster@webcrawler.com (Martijn Koster)) Subject: Robot Databases Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Any thoughts on databases that are appropriate for robots? Are there alternatives to the standard relational databases (Oracle, etc.)? Cheers, Steve From owner-robots Sun Feb 18 09:13:44 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03761; Sun, 18 Feb 96 09:13:44 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199602181714.SAA25229@wsinis10.win.tue.nl> Subject: Re: Dot dot problem... To: robots@webcrawler.com Date: Sun, 18 Feb 1996 18:14:48 +0100 (MET) In-Reply-To: <01B9DE66.6A2D8D80@posse.mnsinc.com> from "Sean Parker" at Feb 17, 95 01:00:15 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 271 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Sean, A relative URL with .. is not a "nasty guy link". >Help me or your server dies when my robot hits.. :) Um, read the specs and implement your robot accordingly. Better, use existing software that already does that. -- Reinier Post reinpost@win.tue.nl From owner-robots Sun Feb 18 10:13:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09406; Sun, 18 Feb 96 10:13:05 -0800 Date: Sun, 18 Feb 96 19:09:19 +0100 Message-Id: <9602181809.AA02707@indy2> X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ To: robots@webcrawler.com In-Reply-To: (mouche@hometown.com) Subject: Re: Robot Databases Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Any thoughts on databases that are appropriate for robots? > > Are there alternatives to the standard relational databases (Oracle, etc.)? This is about large databases in general, not especially for robots. In my experience "standard" databases (like SQL databases) are slow and hog disk space. Hand-crafted databases can be much more compact, just think about some of the large databases accessible through the Web (such as "The Internet Movie Database", entirely hand-written with PERL). The interesting point is that writing a database engine by hand doesn't necessary take longer than using a client-server system, just do it the object-oriented way by using some largely available reusable software components (like persistent string dictionaries). SQL databases offers many functionality you don't need, at the price of performance. Do you really need rollback, full concurrency, full-time availability, full crash recovery, and can you pay the price (in term of performance) ? 
Or you need only some more simple programming model, with only one writer and many readers, don't care about one request that failed (how many Web page have you failed to download today ?), and don't care to lose the last few lines your robot has found ? This is some kind of RISC vs CISC controversy. Moreover, SQL technology can't handle complex data structure (graphs) and it can't make use of the "structure" of your data to perform efficiently. For example, URLs can be stored in a very compact way in a lexical tree, something very difficult to do with a relational database. I think that SQL technology was good when it was created, about 20 years ago. At that time, creating a database model with an associated language was a way to make "reusable components": by putting your experience, the operations you think you need in a query language. Also it was the time for people to start playing with "powerful" computers (64 kB of RAM, 10 Mb of disk space). "Yeah ! What a machine ! We gonna put everything we can think about in our new language !". Time of the great dinosaurs like PL/1; today's trends are much more "put the reusability in the language, put the functionality in the libraries". SQL is still good for some applications such as accounting (quite a profitable market segment), and small databases, with at most a few thousand lines. And object-oriented databases ? I'm quite skeptical about them. Specially those using a relational database for the persistence. May be someone else had more enthusiastic experiences ? +--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From owner-robots Sun Feb 18 10:31:28 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10950; Sun, 18 Feb 96 10:31:28 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 18 Feb 1996 10:32:12 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Dot dot problem... Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The robot I am coding has problems dealing with links that point one level >upward to >the parent page with ".." I end up pulling the first link, re-pulling >all the pages >up to that nasty guy link, and then going up a level again. In addition to endorsing Reinier's pointed remark, I'd like to ad that you should be aware that your robots will likely find many references to the same pages, in addition to ".." references. An essential part of any well-behaved robot is a database of some sort that keeps track of which pages have been visited, so that you don't waste your time with duplicate efforts... and more importantly, you don't waste others' bandwidth and cycles. This is basic stuff. If you're writing a new robot, you'd do well to push the envelope a bit by figuring out means of avoiding inclusion of the more subtle duplicates created by symbolic links, etc., rather than re-inventing capabilities that are well established. 
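As a rough sketch of that bookkeeping in plain Perl 5 (no particular library, helper names made up for illustration): resolve "." and ".." against the base URL before queueing anything, then keep a 'seen' table so each canonical URL is fetched at most once.

my %seen;                     # canonical URL => 1 once queued or fetched

# Resolve a link found in $base against that base URL and squeeze out
# "./" and "dir/../" segments so equivalent URLs compare equal.
sub canonicalize {
    my ($base, $link) = @_;
    $link =~ s/#.*$//;                                      # drop fragments
    return $link if $link =~ m!^[A-Za-z][A-Za-z0-9+.-]*:!;  # already absolute
    my ($site, $path) = $base =~ m!^(http://[^/]+)(/[^#]*)?!;
    return undef unless defined $site;
    $path = '/' unless defined $path && length $path;
    if ($link =~ m!^/!) {                                   # absolute path, same site
        $path = $link;
    } else {                                                # relative to base directory
        $path =~ s![^/]*$!!;
        $path .= $link;
    }
    1 while $path =~ s!/\./!/!;                             # "/./"   -> "/"
    1 while $path =~ s![^/]+/\.\./!!;                       # "x/../" -> ""
    return $site . $path;
}

sub should_visit {
    my ($url) = @_;
    return 0 unless defined $url;
    return !$seen{$url}++;                                  # true only the first time
}

# e.g. canonicalize('http://somehost/dir/sub/page.html', '../other.html')
#      gives 'http://somehost/dir/other.html'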
Nick From owner-robots Sun Feb 18 10:31:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10965; Sun, 18 Feb 96 10:31:34 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sun, 18 Feb 1996 10:32:25 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Anyone doing a Java-based robot yet? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Is anybody on the list yet working on a robot in Java? I'd be interesting in sharing code for the basic robot functions over the next few months. Nick From owner-robots Sun Feb 18 11:13:42 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14589; Sun, 18 Feb 96 11:13:42 -0800 Message-Id: <31277BA1.39D7@corp.micrognosis.com> Date: Sun, 18 Feb 1996 14:18:57 -0500 From: Adam Jack Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Anyone doing a Java-based robot yet? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Nick Arnett wrote: > Is anybody on the list yet working on a robot in Java? > I'd be interesting in sharing code for the basic robot > functions over the next few months. I'd be interested in knowing what Java had to offer to this task. I see nothing that makes it a good robot language. It also lacks the supporting capabilities that have been provided for other languages, e.g. perl and the libwww work. I am genuinely interested in hearing why it might be a good idea. Regards, Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? From owner-robots Sun Feb 18 13:08:03 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23506; Sun, 18 Feb 96 13:08:03 -0800 Message-Id: <9602182107.AA23493@webcrawler.com> Date: Sun, 18 Feb 1996 13:05:00 -0800 From: Ted Sullivan Subject: RE: Robot Databases To: robots X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Yes there is. We work with an OODBMS called ObjectStore. Check "http://www.odi.com" for more info from the creator of the DB. We use it to model the real world - Engineering data usually - using a set of C++ classes & objects stored in the database. We allow access to this virtual world via a Web server that writes HTML responses on the fly based on data in the database. If you want to talk more about it off-line send me some e-mail to me at tsullivan@snowymtn.com. Ted ---------- From: robots To: robots Subject: Robot Databases Date: Monday, February 19, 1996 9:33AM Any thoughts on databases that are appropriate for robots? Are there alternatives to the standard relational databases (Oracle, etc.)? Cheers, Steve From owner-robots Sun Feb 18 13:44:48 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25955; Sun, 18 Feb 96 13:44:48 -0800 Message-Id: <9602182144.AA25944@webcrawler.com> Date: Sun, 18 Feb 1996 13:12:00 -0800 From: Ted Sullivan Subject: Re: Robot Databases To: robots X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com As I mentioned in the reply to the originator of this thread I use ObjectStore to model my virtual worlds. 
Mainly because the type of data that I mine from drawings, part lists 2d/3d Models and text documents are very similar to distributed HTML documents. Lots of text & pictures that all relate together in a network or graph. Not the stuff you can usually easily normalize and put in to a bunch of tables. I am curious though, what type of databases do the Yahoo's, Infoseeks & Alta Vista's of the world use to store their data. Anybody know? Ted ---------- From: robots To: robots Subject: Re: Robot Databases Date: Monday, February 19, 1996 10:42AM > Any thoughts on databases that are appropriate for robots? > > Are there alternatives to the standard relational databases (Oracle, etc.)? This is about large databases in general, not especially for robots. In my experience "standard" databases (like SQL databases) are slow and hog disk space. Hand-crafted databases can be much more compact, just think about some of the large databases accessible through the Web (such as "The Internet Movie Database", entirely hand-written with PERL). The interesting point is that writing a database engine by hand doesn't necessary take longer than using a client-server system, just do it the object-oriented way by using some largely available reusable software components (like persistent string dictionaries). SQL databases offers many functionality you don't need, at the price of performance. Do you really need rollback, full concurrency, full-time availability, full crash recovery, and can you pay the price (in term of performance) ? Or you need only some more simple programming model, with only one writer and many readers, don't care about one request that failed (how many Web page have you failed to download today ?), and don't care to lose the last few lines your robot has found ? This is some kind of RISC vs CISC controversy. Moreover, SQL technology can't handle complex data structure (graphs) and it can't make use of the "structure" of your data to perform efficiently. For example, URLs can be stored in a very compact way in a lexical tree, something very difficult to do with a relational database. I think that SQL technology was good when it was created, about 20 years ago. At that time, creating a database model with an associated language was a way to make "reusable components": by putting your experience, the operations you think you need in a query language. Also it was the time for people to start playing with "powerful" computers (64 kB of RAM, 10 Mb of disk space). "Yeah ! What a machine ! We gonna put everything we can think about in our new language !". Time of the great dinosaurs like PL/1; today's trends are much more "put the reusability in the language, put the functionality in the libraries". SQL is still good for some applications such as accounting (quite a profitable market segment), and small databases, with at most a few thousand lines. And object-oriented databases ? I'm quite skeptical about them. Specially those using a relational database for the persistence. May be someone else had more enthusiastic experiences ? 
+--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From owner-robots Sun Feb 18 15:55:04 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03489; Sun, 18 Feb 96 15:55:04 -0800 From: Message-Id: <9602182347.AA01907@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Robot Databases In-Reply-To: Your message of "Sun, 18 Feb 96 13:12:00 PST." <9602182144.AA25944@webcrawler.com> Date: Sun, 18 Feb 96 15:47:28 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Alta Vista uses proprietary index technology. No d.b. --Louis From owner-robots Sun Feb 18 18:47:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05071; Sun, 18 Feb 96 18:47:34 -0800 Message-Id: <9602190247.AA05065@webcrawler.com> Date: Sun, 18 Feb 1996 18:44:00 -0800 From: Ted Sullivan Subject: Re: Robot Databases To: robots X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I wondered about that. I use your service quite a lot - sure is fast. I guess you could call ours "proprietary" I suppose, it's just that we use ObjectStore as a data store and manage the object indexing & searching using a combination of the supported sets and collections in the OODBMS along with our own storage, search and retrieval objects written in C++. You guys ever going to publish a white paper on how Alta Vista works, or maybe even sell the underlying technology as a product? Ted ---------- From: robots To: robots Subject: Re: Robot Databases Date: Monday, February 19, 1996 4:28PM Alta Vista uses proprietary index technology. No d.b. --Louis From owner-robots Sun Feb 18 19:38:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05384; Sun, 18 Feb 96 19:38:32 -0800 Date: Mon, 19 Feb 96 12:37:52 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9602190337.AA23818@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Ingrid ready for prelim alpha testing.... Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Apologies in advance for the cross-posting, (and for the length of this message) but... I am pleased to finally be able to announce that we will begin preliminary testing of Ingrid software. We are looking for a small group of volunteers. What Ingrid Is: For those of you who haven't followed Ingrid, it is a technique for whole-web searching that doesn't require a central search database. Instead, it automatically generates links between similar web resources, resulting in an infrastructure that can be (we hope) efficiently searched robot-style (at search time). While it is certain to have disadvantages to central search databases, one advantage we hope it has is that it pushes the control over searching to the edges (publishers and navigators). 
Anyway, please see http://rodem.slab.ntt.jp:8080/home/index-e.html for more info (please don't imagine that the quality of our home-page is a reflection on the quality of our software :-) What We Have: We have early versions of software that: 1. Publisher: Finds (using a robot) and processes text resources (isolates words, selects high-weight terms, etc.) in English and Japanese (please contact us if you want to add your favorite language...character set is not a restriction). 2. Server: Inserts the processed text into a global linked infrastructure, and then answers queries about that text and text that it is linked to. 3. Navigator: Allows users to search/navigate that infrastructure using a robot/GUI navigator running on their own workstation. This navigator has some nice relevance feedback features you don't find on current web search engines (that I'm familiar with). The navigator can also be coupled with your browser. The Publisher is in Perl. The Server is in C, and the Navigator is in C and uses Motif. We are running on Sun OS4 and Solaris. All interactions take place using UDP packets with a high port number. (This software, by the way, will all be freely available.) What We Want To Do: Starting around the end of this month, we want to start building the infrastructure. The primary purposes are: 1. to try to find out how well it scales, 2. to flush out bugs, installation procedures, etc., 3. to learn what kind of features we should add to the navigator. Who We Are Looking For: On the publishing side, we need voluteers that: 1. Are willing to spend a bit of time with this. While we are trying to make this all as simple as possible, you will have to spend some time at installation, check to see that things are working properly, and so on. Also, we expect there to be frequent re-installs early on, for bug fixes etc. 2. Have a machine that is continuously internet-accessible and has some spare memory/CPU/bandwidth. (Just how much we don't know...that is part of what we want to learn. It should, however, be negligible compared to running a web server.) 3. Ideally (though not absolutely necessary) you should have a set of resources that you yourself would like to be able to search (and that you would like others to be able to search). In particular, we are able to loosely couple Ingrid with other local search engines in that our search result not only finds resources for you, but points you to search engines that contain those resources so that you can do additional searching. Also, if you've already installed Harvest, our software can use your Harvest configuration files, so you don't have to do it again. I would say that our ideal first "customer" is someone that has already installed multiple Harvest (or similar) databases, and would like to see those databases coupled in some nice way. It is ideal because such a person gets some immediate personal benefit from using Ingrid. What You Should Do: Please send me mail (francis@slab.ntt.jp) if you are interested in participating. (Do so even if you want to postpone your participation till we have something more solid.) We can then discuss particulars. Thanks, PF From owner-robots Mon Feb 19 02:24:37 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11134; Mon, 19 Feb 96 02:24:37 -0800 From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9602191025.AA08672@kronos> Subject: Re: robots.txt changes how often? 
To: robots@webcrawler.com Date: Mon, 19 Feb 1996 10:25:06 +0000 (GMT) In-Reply-To: <0099E02B4AEEA620.0EF2@NEWTON.NPL.CO.UK> from "Martin Kiff" at Feb 16, 96 07:13:00 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I'm wondering whether a script would/could deflect robot activity towards > the 'quieter' hours on the server. I was sure that there was a 'server too > busy' error code, one of the '500's, but I can only see a '503 Service > unavailable' in the draft spec I have.... > I've thought this for quite a while. If you look at anyone's server stats you find that they're pretty quiet between Midnight- 7 am (local time). Maybe there should be a line in robots.txt that says what they're preferred time for robot access is (in GMT). Like many other sites, we are happy for robots to look at us, but would prefer that they came out of office hours. I don't believe there is a way to express this now. Hence we don't have a robots.txt because there is nothing constructive we can put it. Correct me if I'm wrong... Jeremy From owner-robots Mon Feb 19 10:31:28 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13878; Mon, 19 Feb 96 10:31:28 -0800 From: David Schnardthorst Message-Id: <199602191831.MAA08661@strydr.strydr.com> Subject: Robot for Sun To: robots@webcrawler.com Date: Mon, 19 Feb 1996 12:31:18 -0600 (CST) Organization: Stryder Communications, Inc. Address: 869 St. Francois, Florissant, Mo. 63031 Telephone: (314)838-6839 Fax: (314)838-8527 X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 688 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am currently in need of a good robot for Sun. The robot can either be used to build a resource database, or it can be used for mirroring. If anybody has source that they are willing to give me, in PERL or C, it would be greatly appreciated. Thank You, ============================================================================ David Schnardthorst, Systems/Network Eng. * Phone: (314)838-6839 Stryder Communications, Inc. * Fax: (314)838-8527 869 St. Francois * E-Mail: ds3721@strydr.com Florissant, MO 63031 * URL: http://www.strydr.com ============================================================================ From owner-robots Mon Feb 19 11:41:20 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14411; Mon, 19 Feb 96 11:41:20 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 19 Feb 1996 11:42:14 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Anyone doing a Java-based robot yet? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Nick Arnett wrote: > >> Is anybody on the list yet working on a robot in Java? ... >I am genuinely interested in hearing why it might be a good >idea. None that I know of that's specific to robots. My interest is driven by the overall advantages of Java -- easier that C++, cross-platform, distributable, etc. Nick From owner-robots Mon Feb 19 11:52:08 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14513; Mon, 19 Feb 96 11:52:08 -0800 Message-Id: <199602191952.OAA17020@santoor.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Anyone doing a Java-based robot yet? 
In-Reply-To: Your message of "Sun, 18 Feb 1996 14:18:57 EST." <31277BA1.39D7@corp.micrognosis.com> Date: Mon, 19 Feb 1996 14:51:54 -0500 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is anybody on the list yet working on a robot in Java? > > I'd be interesting in sharing code for the basic robot > > functions over the next few months. > > I'd be interested in knowing what Java had to offer to this > task. I see nothing that makes it a good robot language. It > also lacks the supporting capabilities that have been provided > for other languages, e.g. perl and the libwww work. > > I am genuinely interested in hearing why it might be a good > idea. how to you figure java doesnt have the functional equivalent of libwww when it has ftp and http classes built on a general protocol and client classes? it lacks urlencode/urldecode. anyone have this done? what makes java nice for robots? with interfaces, fetching and using remote objects is the functional equivalent to CORBA 2.0, albeit much simpler, which i find to be a big advantage. this kind of robotics :-) is really interesting. protocols written in java can be passed down to clients, ala the plain text protocol in the java tutorial. -john From owner-robots Mon Feb 19 13:23:58 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15050; Mon, 19 Feb 96 13:23:58 -0800 Message-Id: <3128EB84.6D78@corp.micrognosis.com> Date: Mon, 19 Feb 1996 16:28:36 -0500 From: Adam Jack Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: "John D. Pritchard" Cc: robots@webcrawler.com Subject: Re: Anyone doing a Java-based robot yet? References: <199602191952.OAA17020@santoor.cs.columbia.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com John D. Pritchard wrote: > how to you figure java doesnt have the functional equivalent of > libwww when > it has ftp and http classes built on a general protocol and client > classes? it lacks urlencode/urldecode. anyone have this done? > I has ftp and http. It does not have (since the BETA): HTML parsing WWW date handling robots.txt parsing/handling a UserAgent concept NNTP, Gopher et al. It does not allow the UserAgent control over HTTP so one can not utilize the information in that protocol. Its URL support is also pretty basic. BTW - It does have urlencode (but not urldecode) see : http://www.javasoft.com/JDK-1.0/api/java.net.URLEncoder.html#_top_ Java is also slow and, IMHO, not well integrated with databases etc. If one wanted to compare it with Perl -- it also lacks text parsing. > what makes java nice for robots? with interfaces, fetching and using > remote objects is the functional equivalent to CORBA 2.0, albeit much > simpler, which i find to be a big advantage. this kind of robotics > :-) is really interesting. > I didn't ask what made Java a nice environment -- I was focusing on WWW robots as in the sense of this mailing list. I agree -- Java is an improvement on C++ for many application developments. I just don't see what it buys a robot builder. Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? 
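Since the urlencode/urldecode question above keeps coming up, here is what the two transformations amount to, sketched in Perl. This is the classic percent-escaping of RFC 1738, written out for illustration only; it is not any particular library's routine.

======================================================================
#!/usr/bin/perl
# Minimal sketch of URL percent-encoding and decoding (RFC 1738 style),
# the "urlencode/urldecode" pair discussed above.  Illustrative only;
# a real encoder must also decide which reserved characters (/ : ? &)
# to leave alone depending on which part of the URL it is escaping.

sub url_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9_.-])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

sub url_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

print url_encode("a query with spaces & symbols"), "\n";
print url_decode("a%20query%20with%20spaces%20%26%20symbols"), "\n";
======================================================================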
From owner-robots Mon Feb 19 14:02:41 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15357; Mon, 19 Feb 96 14:02:41 -0800 Message-Id: <199602192202.RAA19609@santoor.cs.columbia.edu> To: Adam Jack Cc: "John D. Pritchard" , robots@webcrawler.com Subject: Re: Anyone doing a Java-based robot yet? In-Reply-To: Your message of "Mon, 19 Feb 1996 16:28:36 EST." <3128EB84.6D78@corp.micrognosis.com> Date: Mon, 19 Feb 1996 17:02:35 -0500 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > > I has ftp and http. It does not have (since the BETA): > > HTML parsing net.www.html.Parser net.www.html.Document > WWW date handling > robots.txt parsing/handling > a UserAgent concept i think this could be nicely constructed under the net package > NNTP, Gopher et al. net.nntp > It does not allow the UserAgent control over HTTP so one can not > utilize the information in that protocol. Its URL support is also > pretty basic. i would guess that it's good for subclassing and extending > BTW - It does have urlencode (but not urldecode) see : > http://www.javasoft.com/JDK-1.0/api/java.net.URLEncoder.html#_top_ java.net.html.URL.toString()? > Java is also slow and, IMHO, not well integrated with databases etc. > If one wanted to compare it with Perl -- it also lacks text > parsing. java.util.StringTokenzier java.io.StringTokenzier -john From owner-robots Mon Feb 19 14:15:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15531; Mon, 19 Feb 96 14:15:53 -0800 Message-Id: <3128F7CE.4AF0@corp.micrognosis.com> Date: Mon, 19 Feb 1996 17:21:02 -0500 From: Adam Jack Organization: CSK/Micrognosis Inc. X-Mailer: Mozilla 2.0 (X11; I; SunOS 5.5 sun4m) Mime-Version: 1.0 To: "John D. Pritchard" Cc: robots@webcrawler.com Subject: Re: Anyone doing a Java-based robot yet? References: <199602192202.RAA19609@santoor.cs.columbia.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com John D. Pritchard wrote: > > > I has ftp and http. It does not have (since the BETA): > As I said -- "since the BETA". > net.www.html.Parser > net.www.html.Document > net.nntp > java.net.html.URL.toString()? These are not in anything later than Alpha 3. > > If one wanted to compare it with Perl -- it also lacks text > > parsing. > > java.util.StringTokenzier > java.io.StringTokenzier Enough said ;) Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? From owner-robots Mon Feb 19 15:11:33 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16036; Mon, 19 Feb 96 15:11:33 -0800 Message-Id: <199602192312.SAA15764@spear.bos.nl> X-Sender: martijn@www.findit.nl Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 20 Jan 1996 00:11:52 -0100 To: robots@webcrawler.com From: martijn@findit.nl (Martijn De Boef) Subject: url locating X-Mailer: Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi , i am new here, but have a lot of questions in mind. lets start wth just 1 how on earth do i get all the domain names?? i mean i have a bot, i have an indexer, al i need now is some script to get all the domains, how do i do this ?? please help me. Martijn de Boef ---------------------------------- p.s. 
Whats in a name :) From owner-robots Mon Feb 19 18:56:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01942; Mon, 19 Feb 96 18:56:32 -0800 Date: Mon, 19 Feb 1996 21:58:31 -0500 Message-Id: <9602200258.AA08204@super.mhv.net> X-Sender: tsmith@pop.mhv.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Terry Smith Subject: Re: url locating Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Martijn de Boef wrote >Hi , i am new here, but have a lot of questions in mind. >lets start wth just 1 > >how on earth do i get all the domain names?? > I'm sure there are better ways to get started than the domain names, but if you need a list of com, org, and edu visit
ftp://rs.internic.net/netinfo/
for other domains, I expect there are similar sites. ftp://rs.internic.net/netinfo/ Specifically ftp://rs.internic.net/netinfo/domain-info.txt is just the names. Watch your parsing as some run together. e.g., ------ 07975.COM 0NE.COM 1-411.COM 1-800-24-BANKING.COM1-800-CAR-SEARCH.COM1-800-CARIBE-1.COM --------- ...netinfo/domains.txt might also be useful Not knowing how new you are, I'll add this: make sure you know how to use whois to turn a domain name into a person. This is how angry Webmaster will track you or your service provider down. Knowing what name comes up when someone converts your IP address can be very useful. tas From owner-robots Tue Feb 20 03:48:11 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18803; Tue, 20 Feb 96 03:48:11 -0800 Message-Id: From: cs31dw@ee.surrey.ac.uk (David A Weeks) Subject: Re: Anyone doing a Java-based robot yet? To: robots@webcrawler.com Date: Tue, 20 Feb 1996 11:47:26 +0000 (GMT) In-Reply-To: from "Nick Arnett" at Feb 18, 96 10:32:25 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1112 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, I must admit, as part of my final year project in computing, I am doing a Java Based Robot; called 'KeyWord'. What I have found is that Java has plenty of predefined classes to use for web searching and document parsing. For example, the 'InputStream' classes allow various ways to read in files (extendable if you so wish) and the String class has numerous methods to manipulate text files (as I am currently doing). In addition, Java has the generalism to enable local file saving and an object-orientated way of constructing a data-base (although nowadays, the term 'knowledge-base' is often used). Before questions about this Java robot are asked, there are four thing to state : 1. KeyWord obeys robots.txt (again the String class makes this easy) 2. KeyWord is restricted to one request per minute regardless of the location. 3. KeyWord keeps a history of visited URLs to avoid duplication. 4. Keyword does not search by depth so does not run into 'black holes'. The URL of the current documentation can be found at : http://eeisun2.city.ac.uk/~ftp/Guinness/Hello.html Regards, Dave Weeks. From owner-robots Tue Feb 20 07:16:37 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00656; Tue, 20 Feb 96 07:16:37 -0800 Date: Tue, 20 Feb 96 10:03:49 EST From: "Tangy Verdell" Message-Id: <9601208248.AA824840496@smtphost.dca.com> To: robots@webcrawler.com Subject: Re[2]: Anyone doing a Java-based robot yet? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I went through this deliberation. We decided to go with C++ but for specific reasons. Since we are using CORBA distributed objects in our product, we needed a way of communication between these objects Now, there is a product called BlackWidow by Orbeline which allows Java apps to communicate with DOBs. But it is in Beta. Also, linking in the stuff already done for robots in C and perl would be more cumbersome than just using C++. However, I would like to share classes but since I work for a company who will own this code I can't do so. But I still would like to share ideas. Tangy TVerdell@dca.com ______________________________ Reply Separator _________________________________ Subject: Re: Anyone doing a Java-based robot yet? 
Author: robots@webcrawler.com at Internet Date: 2/18/96 3:26 PM Nick Arnett wrote: > Is anybody on the list yet working on a robot in Java? > I'd be interesting in sharing code for the basic robot > functions over the next few months. I'd be interested in knowing what Java had to offer to this task. I see nothing that makes it a good robot language. It also lacks the supporting capabilities that have been provided for other languages, e.g. perl and the libwww work. I am genuinely interested in hearing why it might be a good idea. Regards, Adam -- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html ajack@corp.micrognosis.com -> ajack@netcom.com -> ajack@?.??? At 10:54 AM 2/16/96 PST, you wrote: >> Bots and Other Internet Beasties >> Joseph Williams >> Sams Net >> 1-57521-016-9 >> Published in April > >Just an aside: the publisher of this book was sending out pretty broad >"Please write this book!" messages to lots of people on various >mailing lists and Web sites. Both Oren and I received one, as well as >most of the rest of the UW Softbots group. Needless to say, while I >don't want to say anything about Mr. Williams or his writing, I did >want to mention that the publisher's standards for authors seemed >quite low. > >-Erik > From owner-robots Tue Feb 20 08:05:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04726; Tue, 20 Feb 96 08:05:05 -0800 Date: Tue, 20 Feb 1996 11:04:58 -0500 From: Skip Montanaro Message-Id: <199602201604.LAA10234@dolphin.automatrix.com> To: robots@webcrawler.com Subject: Re: Robot Databases In-Reply-To: <9602182347.AA01907@evil-twins.pa.dec.com> References: <9602182144.AA25944@webcrawler.com> <9602182347.AA01907@evil-twins.pa.dec.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Alta Vista uses proprietary index technology. No d.b. Ditto for Musi-Cal. No slow, hog-like relational database here. What's more, like the movie database, it's written entirely in an interpreted scripting lanugage (Python in our case, not Perl). Skip Montanaro | Looking for a place to promote your music venue, new CD, skip@calendar.com | festival or next concert tour? Place a focused banner (518)372-5583 | ad in Musi-Cal! http://www.calendar.com/concerts/ "Used to be, a good day was when I got some work done. Now I'm happy to just make it through the mail..." -me From owner-robots Tue Feb 20 23:48:37 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03025; Tue, 20 Feb 96 23:48:37 -0800 Message-Id: <312AEACB.FC8@netvision.net.il> Date: Wed, 21 Feb 1996 09:50:03 +0000 From: Frank Smadja X-Mailer: Mozilla 2.0 (WinNT; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Anyone doing a Java-based robot yet? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am confused here, when you talk about Java robots, are you talking about a Java applet running under Netscape? I thought that for security purposes, Java applets could only open connections to the server they came from. Or maybe you're talking about a real Java program? Frank David A Weeks wrote: > > Hi, > > I must admit, as part of my final year project in computing, > I am doing a Java Based Robot; called 'KeyWord'. > > What I have found is that Java has plenty of predefined classes to use for > web searching and document parsing. For example, the 'InputStream' classes allow ... 
From owner-robots Wed Feb 21 06:17:12 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22835; Wed, 21 Feb 96 06:17:12 -0800 Date: Wed, 21 Feb 1996 16:29:27 +0200 From: Pertti Kasanen Message-Id: <199602211429.QAA00362@postiitti.akumiitti.fi> To: robots@webcrawler.com Subject: Anyone doing a Java-based robot yet? In-Reply-To: References: Organization: Akumiitti Oy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Nick Arnett writes: > Is anybody on the list yet working on a robot in Java? I'd be interesting > in sharing code for the basic robot functions over the next few months. I am. Or it is still only in paper, I haven't started any serious coding yet. Please keep me informed, I hope to be able to contribute something in the next few weeks. Pertti -- Pertti Kasanen Internet: Pertti.Kasanen@akumiitti.fi Akumiitti Ltd tel: +358 208 300 400 Tekniikantie 17 B fax: +358 207 300 400 02150 Espoo, Finland http://www.akumiitti.fi/ From owner-robots Wed Feb 21 07:10:40 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26551; Wed, 21 Feb 96 07:10:40 -0800 From: Mr David A Weeks Message-Id: <9602211455.AA03608@central.surrey.ac.uk> Subject: Re: Anyone doing a Java-based robot yet?6 To: robots@webcrawler.com Date: Wed, 21 Feb 96 14:55:45 GMT In-Reply-To: <312AEACB.FC8@netvision.net.il>; from "Frank Smadja" at Feb 21, 96 9:50 am X-Mailer: ELM [version 2.3 PL3] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I am confused here, when you talk about Java robots, are you > talking about a Java applet running under Netscape? I thought > that for security purposes, Java applets could only open > connections to the server they came from. Or maybe you're > talking about a real Java program? I should distinguish between a Java applet and a Java application. The project I am working on is of the latter type. Applications can indeed open connectionsto wherever they like as you define the URL address as part of the URL class. (See doc:///doc/api/net.www.html.URL.html#_top_ for more details) For more information about Java security see doc:///doc/security/security.html Dave Weeks. http://eeisun2.city.ac.uk/~ftp/Guinness/Hello.html From owner-robots Wed Feb 28 21:28:09 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01021; Wed, 28 Feb 96 21:28:09 -0800 Message-Id: From: John Messerly To: "'robots@webcrawler.com'" Subject: Altavista indexing password files Date: Wed, 28 Feb 1996 19:54:52 -0800 X-Mailer: Microsoft Exchange Server Internet Mail Connector Version 4.0.822 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com #1 may be interesting to the readership of this list. >---------- >The Risks Digest Volume 17: Issue 70 > > Thursday 8 February 1996 > > >Risks of web robots > >Joe A. Dellinger >Sun, 4 Feb 96 20:50:03 CST > >Here are three risks of "web robots" I've run across recently that I >think Risks readers might find interesting. >1 The first is probably already well known to Risks readers: password > files accidentally being exported to the world. Web servers are just >yet > another way of making that mistake. > > Here is a post that has already had wide circulation (and may have > already appeared in Risks... 
I'm unable to scan back issues to check > right now because of heavy network load): > > >Subject: BoS: Misconfigured Web Servers > > > > A friend of mine showed me a nasty little "trick" over the weekend. >He > > went to a Web Search server (http://www.altavista.digital.com/) and > > did a search on the following keywords - > > > > root: 0:0 sync: bin: daemon: > > > > You get the idea. He copied out several encrypted root passwords >from > > passwd files, launched CrackerJack and a 1/2 MB word file and had a > > root password in under 30 minutes. All without accessing the site's > > server, just the index on a web search server! > > > .... > > > The guy that showed me this found it funny, but I find it >disturbing. > > Are there that many sites that are that poorly configured? > > > > Mark_W_Loveless@smtp.bnr.com > > I just verified that indeed this search does work, although to my >relief > the majority of the "hits" found are legitimate documents discussing > UNIX security. The risks are fairly obvious. > > 1Here is a variation on the above risk that I HAVEN'T seen discussed > before, however. See what happens if you search AltaVista for THESE > keywords: > > "unpublished proprietary source code actual intended reserved >copyright > notice" > > The results of this search are even more frightening, at least to me. > > > The general risk is not just that you can conveniently find password > files, but ANY kind of document that shouldn't be widely distributed: > > material useful for breaking into your system, copyrighted material, > illegal material, libelous material, incriminating or embarrassing > material, etc... > >2 The second risk works the other way: fooling stupid web robots so > as to lure people to your web site. > > A month ago I tried searching for "eisner reciprocity paradox" on > WebCrawler, hoping to find that it had indexed a paper of mine that I > had reprinted electronically under my home page. Nope, it hadn't (or >at > least I was unable to find it using any of the likely keywords I could > > think of!). Instead the single match was on a URL intriguingly >entitled > "The information source". > > Gee, this "information source" must have an article in it about >Eisner's > Reciprocity Paradox, one that I hadn't known of before! So I followed > the link, and ended up at something unexpected: " > http://www.graviton.com/red/", "The Red Herring Home Page"! (It comes > complete with gifs of red fish!) > > A little experimentation revealed that almost ANY obscure search would > > match "The information source", often as the only matching document > found. As near as I could figure out, his site recognized probes by >web > robots and then threw a dictionary at them! (His point made, he has > since stopped, although the Red Herring page is still there for your > perusal.) > > I contacted the author, Tom White, and asked for more details. He >didn't > want to give his secrets away, but did reply: > > > I will say that I spent no more than an hour on the whole thing, > including > > writing the page, and it was effective far beyond what I thought a > silly > > trick like that would muster. I think that by virtue of not hiding > what > > I am trying to do, people who write web indexers may see the page >and > think > > of ways to subvert feeble attempts like mine - which is a good thing > > since > > the page could have as easily been any propaganda I wanted to push >on > people. > > The risk? 
It can be frustratingly difficult (or impossible) to get a >web > robot's attention for a legitimate page you WANT indexed, or to find a > > page you know is there amist all the distractions of "false hits". >Part > of the clutter may be wildly off-topic pages engineered to fool web > robots into thinking that almost anything matches them. (Or simply >long > rambling pages containing lots of poems and such... documents that > "fool" the robots more by accident than design.) > >3 Finally, the act of being searched can cause problems for certain >kinds > of sites: ones that carry hundreds of thousands of distinct URLs, >often > generated only on demand, and that don't expect any one site to ever > have reason to download ALL of them, whether all at once or a few at a > > time. > > See for example "http://xxx.lanl.gov/RobotsBeware.html". The authors > state there: "This www server has been under all-too-frequent attack > from `intelligent agents' (a.k.a. `robots') that mindlessly download > every link encountered, ultimately trying to access the entire >database > through the listings links. In most cases, these processes are run by > well-intentioned but thoughtless neophytes, ignorant of common sense > guidelines." > > They have been forced to take a "proactive" stance to protect > themselves: "We are not willing to play sitting duck to this >nonsensical > method of `indexing' information." The rather UNIQUE hot link that > follows, "(Click here to initiate automated `seek-and-destroy' against > > your site.)", doesn't actually do anything but pause for 30 seconds, >I'm > told... > > I'll let readers examine the page and draw their own Risks! > From owner-robots Wed Feb 28 22:43:50 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01212; Wed, 28 Feb 96 22:43:50 -0800 From: Message-Id: <9602290636.AA14358@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: Altavista indexing password files In-Reply-To: Your message of "Wed, 28 Feb 96 19:54:52 PST." Date: Wed, 28 Feb 96 22:36:35 -0800 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Alright, time for stories around the campfire. The story is true. It was brought to our attention quite early on, and we ... took action ... and contacted the appropriate people to warn them of the danger. Right now we no longer index /etc/passwd, and in fact every time scooter comes accross such a file, it logs an entry prefixed by "save my job". I will eventually get to contacting all these poor souls. --Louis From owner-robots Fri Mar 1 08:15:25 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00246; Fri, 1 Mar 96 08:15:25 -0800 Message-Id: <199603011258.NAA06014@storm.certix.fr> Comments: Authenticated sender is From: savron@world-net.sct.fr To: John Messerly Date: Fri, 1 Mar 1996 12:42:51 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: Altavista indexing password files Cc: robots@webcrawler.com Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Why did you share an important information about 3 weeks old ? 
Thank you From owner-robots Fri Mar 1 12:18:07 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02223; Fri, 1 Mar 96 12:18:07 -0800 Message-Id: From: John Messerly To: "'savron@world-net.sct.fr'" Cc: "robots@webcrawler.com" Subject: RE: Altavista indexing password files Date: Fri, 1 Mar 1996 12:18:21 -0800 X-Mailer: Microsoft Exchange Server Internet Mail Connector Version 4.0.822 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I shared this information as soon as I came across it because I believed that those on this mailing list would also share my concern. (If the information is well known and stale, then I apologize for propogating excess noise.) It has been stated often in this forum that the existing documents discussing robots guidelines should be strengthened. As an addition to the existing list good proposals, this incident suggests that issues surrounding problem #1 should be explicitly discussed. [The opinions expressed in this message are my own personal views and do not necessarily reflect the official views of Microsoft Corporation.] >---------- >From: savron@world-net.sct.fr[SMTP:savron@world-net.sct.fr] >Sent: Friday, March 01, 1996 4:42 AM >To: John Messerly >Cc: robots@webcrawler.com >Subject: Re: Altavista indexing password files > >Why did you share an important information about 3 weeks old ? > >Thank you > From owner-robots Sat Mar 2 23:34:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08169; Sat, 2 Mar 96 23:34:34 -0800 Date: Sun, 3 Mar 1996 02:34:32 -0500 From: "Gordon V. Cormack" Message-Id: <199603030734.CAA29234@plg.uwaterloo.ca> To: robots@webcrawler.com Subject: BSE-Slurp/0.6 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The robot identifying itself as User-Agent: BSE-Slurp/0.6 From: gauthier@cs.berkeley.edu read but did not heed my /robots.txt file. Does anybody know where this one comes from? It appears in tons of agent logs, but I don't see an entry for it in "known robots." Here's my /robots.txt file, as read by lynx: ------ barley 104>lynx -source http://barley.uwaterloo.ca/robots.txt # robots.txt for http://barley.uwaterloo.ca # this is a search engine. go away. User-agent: * Disallow: / ------ Here it is again, using telnet: ------ barley 105>telnet barley.uwaterloo.ca 80 Trying 129.97.186.13... Connected to barley. Escape character is '^]'. GET /robots.txt HTTP/1.0 # robots.txt for http://barley.uwaterloo.ca # this is a search engine. go away. User-agent: * Disallow: / Connection closed by foreign host. barley 106> ------ Gordon V. Cormack CS Dept, University of Waterloo, Canada N2L 3G1 gvcormac@uwaterloo.ca http://cormack.uwaterloo.ca/~gvcormac From owner-robots Sun Mar 3 08:54:21 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09165; Sun, 3 Mar 96 08:54:21 -0800 From: "Mordechai T. Abzug" Message-Id: <199603031654.LAA05546@xsa02.gl.umbc.edu> Subject: Re: BSE-Slurp/0.6 To: robots@webcrawler.com Date: Sun, 3 Mar 1996 11:54:14 -0500 (EST) In-Reply-To: <199603030734.CAA29234@plg.uwaterloo.ca> from "Gordon V. Cormack" at Mar 3, 96 02:34:32 am X-Mailer: ELM [version 2.4 PL25] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 724 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "GVC" == Gordon V. 
Cormack spake thusly: GVC> GVC> GVC> The robot identifying itself as GVC> GVC> User-Agent: BSE-Slurp/0.6 GVC> From: gauthier@cs.berkeley.edu GVC> GVC> read but did not heed my /robots.txt file. Does anybody know where GVC> this one comes from? It appears in tons of agent logs, but I don't GVC> see an entry for it in "known robots." The gent gave you contact information (gauthier@cs.berkeley.edu); why don't you use it? Seems to me like he's got his heart in the right place, but has a major bug in his robots.txt parser. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu The first rule of intelligent tinkering is save all parts. From owner-robots Sun Mar 10 18:42:20 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06108; Sun, 10 Mar 96 18:42:20 -0800 From: "Mark Norman" Message-Id: <9603101809.ZM4183@hpisq3cl.cup.hp.com> Date: Sun, 10 Mar 1996 18:09:28 -0800 X-Mailer: Z-Mail (3.2.1 10apr95) To: robots@webcrawler.com Subject: Can I retrieve image map files? Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Is there a way to get web servers to send an image map file (i.e., the ASCII file containing the URLs and coordinates)? When I give a web browser the URL of the map file the server always invokes the imagemap program instead of sending the map file. Thanks. From owner-robots Sun Mar 10 20:09:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06751; Sun, 10 Mar 96 20:09:53 -0800 Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 11 Mar 1996 13:08:17 +0900 To: robots@webcrawler.com From: mark@gol.com (Mark Schrimsher) Subject: Re: BSE-Slurp/0.6 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >The robot identifying itself as > > User-Agent: BSE-Slurp/0.6 > From: gauthier@cs.berkeley.edu > >read but did not heed my /robots.txt file. Does anybody know where >this one comes from? It appears in tons of agent logs, but I don't >see an entry for it in "known robots." This is Inktomi at UC Berkeley. ==========[ n o t e t o m y c o r r e s p o n d e n t s ]========= My e-mail address has changed from mschrimsher@twics.com to mark@gol.com From owner-robots Mon Mar 11 06:32:59 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08994; Mon, 11 Mar 96 06:32:59 -0800 Message-Id: <01BB0F2D.9EC869E0@ts03-25.qtm.net> From: Douglas Summersgill To: "'robots@webcrawler.com'" Subject: Robots available for Intranet applications Date: Mon, 11 Mar 1996 09:31:55 -0500 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I am looking to find robots available for use on a corporate "Intranet" in order to create a current resource database. The corporation has many unadvertised web applications and I have been tasked with creating some sort of directory for general use. If anybody knows of any available (sale/shareware/freeware) please respond. Thanks in advance. 
From owner-robots Mon Mar 11 14:33:19 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12844; Mon, 11 Mar 96 14:33:19 -0800 From: Sylvain Duclos Date: Mon, 11 Mar 1996 22:31:46 GMT Message-Id: <199603112231.WAA21009@minotaure.ift.ulaval.ca> To: summersgill@qtm.net, robots@webcrawler.com Subject: Re: Robots available for Intranet applications Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > >I am looking to find robots available for use on a corporate "Intranet" in order to create a current resource database. > >The corporation has many unadvertised web applications and I have been tasked with creating some sort of directory for general use. > >If anybody knows of any available (sale/shareware/freeware) please respond. > >Thanks in advance. > > try: http://info.webcrawler.com/mak/projects/robots/active.html for the list of robots Good luck, Sylvain. From owner-robots Mon Mar 11 15:21:17 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13195; Mon, 11 Mar 96 15:21:17 -0800 Date: Mon, 11 Mar 96 18:18:23 EST Message-Id: <9603112318.AA00195@pop.btg.com> X-Sender: mslabins@pop.btg.com Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: mslabinski@btg.com (Mark Slabinski) Subject: Re: Robots available for Intranet applications X-Mailer: Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I have a similiar interest. Please pass along any info to me as well. Thanks Mark >I am looking to find robots available for use on a corporate "Intranet" in order to create a current resource database. > >The corporation has many unadvertised web applications and I have been tasked with creating some sort of directory for general use. > >If anybody knows of any available (sale/shareware/freeware) please respond. > >Thanks in advance. > > > ========================================================== | Mark Slabinski BTG, Inc. | | mslabinski@btg.com 1945 Old Gallows | | phone: 703-761-7716 Vienna, Va 22182 | | fax: 703-761-3245 | ========================================================== From owner-robots Mon Mar 11 19:26:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14299; Mon, 11 Mar 96 19:26:06 -0800 From: "Mordechai T. Abzug" Date: Mon, 11 Mar 1996 22:24:27 -0500 Message-Id: <199603120324.WAA11525@umbc9.umbc.edu> To: robots@webcrawler.com, risko@csl.sri.com Subject: "What's new" in web pages is not necessarily reliable Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I have a little web robot running to keep track of changes to web pages. This came in very handy this semester, as I have an instructor who puts all assignments on the web: instead of having to check manually when the instructor put up the assignment, I just set up to monitor his "What's new" page and get mailed within 25 hours of a change. This worked for a while. . . then a homework assigment was added to a different page without being mentioned in the what's new. Oops. He was very understanding, but this class of problems is something robot users -- as well as people using netscape 2's update feature -- will want to keep in mind. Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Stupidity got us into this mess, why can't it get us out? 
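For anyone who wants to run the same kind of page monitor Mordechai describes, a minimal sketch follows. It is not his actual robot: it assumes libwww-perl 5 (LWP), mentioned elsewhere on this list, and the URL and state file are made-up placeholders. It also shares the blind spot he points out -- it only notices changes on the page it is told to watch.

======================================================================
#!/usr/bin/perl
# Minimal sketch of a page-change monitor: HEAD the page, compare the
# Last-Modified header with the value remembered from the previous run,
# and report when it differs.  Requires libwww-perl 5 (LWP).  The URL
# and state file below are hypothetical.

use LWP::UserAgent;
use HTTP::Request;

my $url   = "http://www.example.edu/class/whatsnew.html";   # hypothetical
my $state = "/tmp/whatsnew.stamp";                           # hypothetical

my $ua  = LWP::UserAgent->new;
my $res = $ua->request(HTTP::Request->new("HEAD", $url));
die "HEAD $url failed: " . $res->code . "\n" unless $res->is_success;

my $current = $res->header("Last-Modified") || "";

my $previous = "";
if (open(STAMP, "<$state")) {
    $previous = <STAMP>;
    close(STAMP);
    $previous = "" unless defined $previous;
    chomp($previous);
}

if ($current ne $previous) {
    print "$url changed (Last-Modified: $current)\n";   # or mail yourself here
    open(STAMP, ">$state") || die "can't write $state: $!\n";
    print STAMP "$current\n";
    close(STAMP);
}
======================================================================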
From owner-robots Wed Mar 13 08:45:52 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01077; Wed, 13 Mar 96 08:45:52 -0800 Date: Wed, 13 Mar 96 11:43:59 EST From: "Jim Meritt" Message-Id: <9602138267.AA826747917@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: verify URL Content-Length: 237 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Is there a simple/short/free app that can check for the presence of a URL and return a yes/no if it can get it? Note: Doesn't have to actually GET it, just verify that it is possible. Jim Meritt From owner-robots Wed Mar 13 09:04:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01415; Wed, 13 Mar 96 09:04:32 -0800 From: Vince Taluskie Message-Id: <199603131704.MAA23213@kalypso.cybercom.net> Subject: Re: verify URL To: robots@webcrawler.com Date: Wed, 13 Mar 1996 12:04:28 -0500 (EST) In-Reply-To: <9602138267.AA826747917@smtpinet.aspensys.com> from "Jim Meritt" at Mar 13, 96 11:43:59 am X-Mailer: ELM [version 2.4 PL24] Content-Type: text Content-Length: 671 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Is there a simple/short/free app that can check for the presence of a > URL and return a yes/no if it can get it? > > Note: Doesn't have to actually GET it, just verify that it is > possible. How about using a HEAD rather than a GET request ? Vince -- ___ ____ __ | _ \/ __/| \ Vince Taluskie, at Fidelity Investments Boston, MA | _/\__ \| \ \ Pencom Systems Administration Phone: 617-563-8349 |_| /___/|_|__\ vince@pencom.com Pager: 800-253-5353, #182-6317 -------------------------------------------------------------------------- "We are smart, we make things go" From owner-robots Wed Mar 13 10:10:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02015; Wed, 13 Mar 96 10:10:23 -0800 Date: Wed, 13 Mar 1996 10:09:04 -0800 Message-Id: <199603131809.KAA24061@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Re: Robots available for Intranet applications Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >> >>I am looking to find robots available for use on a corporate "Intranet" in order to create a current resource database. >> >>The corporation has many unadvertised web applications and I have been tasked with creating some sort of directory for general use. >> >>If anybody knows of any available (sale/shareware/freeware) please respond. >> >>Thanks in advance. >> >> I have a huge list of robots on my site! Check it out! Hope it helps! Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Wed Mar 13 10:19:46 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02109; Wed, 13 Mar 96 10:19:46 -0800 Date: Wed, 13 Mar 1996 18:58:02 +0100 (GMT+0100) From: Carlos Baquero To: robots@webcrawler.com Subject: Re: verify URL In-Reply-To: <199603131704.MAA23213@kalypso.cybercom.net> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 652 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Using the simple/short/free libWWWperl you can issue those HEAD requets. 
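Concretely, the yes/no check Jim asked for is only a few lines with libwww-perl 5 (LWP). This is a sketch of the idea, not anyone's posted code:

======================================================================
#!/usr/bin/perl
# Sketch of the yes/no URL check asked for above.  A HEAD request is
# used so the document body is never transferred.  Requires libwww-perl
# 5 (LWP).  Usage:  checkurl.pl http://some.host/some/path

use LWP::UserAgent;
use HTTP::Request;

my $url = shift || die "usage: checkurl.pl URL\n";
my $ua  = LWP::UserAgent->new;
my $res = $ua->request(HTTP::Request->new("HEAD", $url));
print $res->is_success ? "yes\n" : "no\n";
exit($res->is_success ? 0 : 1);
======================================================================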
Carlos Baquero Distributed Systems Fax +351 (53) 612954 University of Minho, Portugal Voice +351 (53) 604475 cbm@di.uminho.pt http://shiva.di.uminho.pt/~cbm On Wed, 13 Mar 1996, Vince Taluskie wrote: > > > > Is there a simple/short/free app that can check for the presence of a > > URL and return a yes/no if it can get it? > > > > Note: Doesn't have to actually GET it, just verify that it is > > possible. > > How about using a HEAD rather than a GET request ? > > Vince From owner-robots Wed Mar 13 10:42:35 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02281; Wed, 13 Mar 96 10:42:35 -0800 Message-Id: <9603131834.AA9553@smtpmail.qds.com> To: robots From: Brian Gregory/Quantitative Data Systems Date: 13 Mar 96 10:40:35 EDT Subject: libww and robot source for Sequent Dynix/Ptx 4.1.3 Mime-Version: 1.0 Content-Type: Text/Plain Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anyone know if a port of the libwww to Sequent dynix/ptx 4.1.3 is available.? Is anyone running a robot on a Sequent system? From owner-robots Wed Mar 13 11:02:41 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02449; Wed, 13 Mar 96 11:02:41 -0800 Date: Wed, 13 Mar 96 14:01:05 EST From: "Jim Meritt" Message-Id: <9602138267.AA826756126@smtpinet.aspensys.com> To: robots@webcrawler.com Subject: Re[2]: verify URL Content-Length: 299 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You happen to have an ftp URL of it? Jim ______________________________ Reply Separator _________________________________ Subject: Re: verify URL Author: robots@webcrawler.com at SMTPINET Date: 3/13/96 2:21 PM Using the simple/short/free libWWWperl you can issue those HEAD requets. From owner-robots Wed Mar 13 11:57:49 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02938; Wed, 13 Mar 96 11:57:49 -0800 Date: Wed, 13 Mar 96 20:57:37 +0100 Message-Id: <9603131957.AA19633@indy2> X-Face: $)p(\g8Er<<5PVeh"4>0m&);m(]e_X3<%RIgbR>?i=I#c0ksU'>?+~)ztzpF&b#nVhu+zsv x4[FS*c8aHrq\<7qL/v#+MSQ\g_Fs0gTR[s)B%Q14\;&J~1E9^`@{Sgl*2g:IRc56f:\4o1k'BDp!3 "`^ET=!)>J-V[hiRPu4QQ~wDm\%L=y>:P|lGBufW@EJcU4{~z/O?26]&OLOWLZ To: robots@webcrawler.com In-Reply-To: <9602138267.AA826747917@smtpinet.aspensys.com> (jmeritt@smtpinet.aspensys.com) Subject: Re: verify URL Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Is there a simple/short/free app that can check for the presence of a > URL and return a yes/no if it can get it? 
What about this one: ====================================================================== #!/bin/sh code=`/bin/echo "HEAD $2 HTTP/1.0\n" | telnet $1 80 | awk '/^HTTP\/*\.*/ { print $2; exit }'` if [ "$code" = 200 ]; then echo yes else echo no fi ====================================================================== +--------------------------+------------------------------------+ | | | | Christophe TRONCHE | E-mail : tronche@lri.fr | | | | | +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 | | | Fax : 33 - 1 - 69 41 65 86 | +--------------------------+------------------------------------+ | ###### ** | | ## # Laboratoire de Recherche en Informatique | | ## # ## Batiment 490 | | ## # ## Universite de Paris-Sud | | ## #### ## 91405 ORSAY CEDEX | | ###### ## ## FRANCE | |###### ### | +---------------------------------------------------------------+ From owner-robots Wed Mar 13 12:32:10 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03150; Wed, 13 Mar 96 12:32:10 -0800 Date: Wed, 13 Mar 96 13:23:37 MST From: Sibylle Gonzales Message-Id: <9603131323.A06235@huachuca-emh17.army.mil> To: robots@webcrawler.com Subject: robot authentication parameters Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Text item: Text_1 Does anyone know how to modify the MOMspider webcrawler to allow the passing of authentication paramaters to the Netscape Commerce server for access restricted areas on the web server? Input is greatly appreciated! From owner-robots Wed Mar 13 13:02:49 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03345; Wed, 13 Mar 96 13:02:49 -0800 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199603132104.WAA00896@wsinis10.win.tue.nl> Subject: Re: verify URL To: robots@webcrawler.com Date: Wed, 13 Mar 1996 22:04:23 +0100 (MET) In-Reply-To: <9602138267.AA826747917@smtpinet.aspensys.com> from "Jim Meritt" at Mar 13, 96 11:43:59 am X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 308 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Jim Meritt) write: > > Is there a simple/short/free app that can check for the presence of a > URL and return a yes/no if it can get it? lynx -head, if you're on Unix or VMS. Similar programs exist as part of libwww-perl, or as standalone scripts. -- Reinier Post reinpost@win.tue.nl From owner-robots Wed Mar 13 14:06:19 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03732; Wed, 13 Mar 96 14:06:19 -0800 Message-Id: <01BB10E5.692AFA00@av-admin-dhcp.pa.dec.com> From: Debbie Swanson To: "'robots@webcrawler.com'" Cc: 'Debbie Swanson' Subject: RE: verify URL Date: Wed, 13 Mar 1996 14:00:15 -0500 Encoding: 17 TEXT, 42 UUENCODE X-Ms-Attachment: WINMAIL.DAT 0 00-00-1980 00:00 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com To my knowledge, no. ---------- From: Jim Meritt[SMTP:jmeritt@smtpinet.aspensys.com] Sent: Wednesday, March 13, 1996 11:43 AM To: robots@webcrawler.com Subject: verify URL Is there a simple/short/free app that can check for the presence of a URL and return a yes/no if it can get it? Note: Doesn't have to actually GET it, just verify that it is possible. Jim Meritt begin 600 WINMAIL.DAT M>)\^(A(3`0:0" `$```````!``$``0>0!@`(````Y 0```````#H``$-@ 0` M`@````(``@`!!) 
&`$ "```"````# ````,``# $````"P`/#@$````"`?\/ M`0```&$`````````M3O"P"QW$!JAO @`*RI6PA4````M6EO+(E?/$;ST``#X M((W%)( ```````"!*Q^DOJ,0&9UN`-T!#U0"`````$1E8F)I92!3=V%N``,P`0`` M``D```!D``(P`0````4```!33510`````!X``S !````%@```')O8F]T M``$P`0```!@` M```G2!54DP`F00!!8 #``X```#,!P,`#0`.```` M#P`#``,!`2" `P`.````S <#``T`#0`[`",``P!1`0$)@ $`(0```#E"-C2!54DP```(!<0`!````%@````&[$0]0N@,@ M9YQ\U!'/O/0``/@@C<4``!X`'@P!````!0```%--5% `````'@`?# $````4 M````9'-W86YS;VY <&$N9&5C+F-O;0`#``80F=LF$P,`!Q M`0``'@`($ $` M``!E````5$]-64M.3U=,141'12Q.3RTM+2TM+2TM+2U&4D]-.DI)34U%4DE4 M5%--5% Z2DU%4DE45$!3351024Y%5$%34$5.4UE30T]-4T5.5#I7141.15-$ M05DL34%20T@Q,RPQ.3DV,0`````"`0D0`0```$("```^`@``2@0``$Q:1G66 MBVB"_P`*`0\"%0*H!>L"@P!0`O()`@!C: K 2!K;F^$=VP)@&=E+" )H) 0;B<%0!& M..!?// <, #00I '0&P<8$?,151#D1T0:G43P#C63S[C0Z$$`$$K<&\$$&G] M`F!E'59%#R8H.8PY_SL/%R#U"H4680!1D ```P`0$ `````#`!$0`````$ ` L!S#@>N0X#Q&[`4 `"##@>N0X#Q&[`1X`/0`!````!0```%)%.B `````4/,` ` end From owner-robots Wed Mar 13 18:34:04 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05088; Wed, 13 Mar 96 18:34:04 -0800 X-Sender: gyld@mail.best.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 13 Mar 1996 18:16:31 -0800 To: robots@webcrawler.com From: gyld@in-touch.com (Dan Gildor) Subject: Re: robot authentication parameters Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Uh... my own home brewed perl based link checker does handle server authorization. I can see if I can cut out the relevant code if you'd like. Basically, its a couple of added lines to the server request (picked off of the HTTP spec), with the password being base64 encoded. -Dan >Text item: Text_1 > > Does anyone know how to modify the MOMspider webcrawler to allow the > passing of authentication paramaters to the Netscape Commerce server > for access restricted areas on the web server? > > Input is greatly appreciated! ============================================================================ Dan Gildor inTouch Technologies Director of Technical Productions 1130 Sherman Avenue gyld@in-touch.com Menlo Park, CA 94025 http://www.in-touch.com Tel: (415) 854-8036 "Making your net work" Fax: (415) 233-0155 ============================================================================ From owner-robots Wed Mar 13 23:36:10 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06022; Wed, 13 Mar 96 23:36:10 -0800 Date: Thu, 14 Mar 1996 08:36:04 +0100 (MET) From: Josef Pellizzari X-Sender: jpellizz@sp051.cern.ch To: robots@webcrawler.com Subject: Re: Robots available for Intranet applications In-Reply-To: <199603131809.KAA24061@norway.it.earthlink.net> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi there, I couldn't find the robots, the only links I've seen are: Please sign Web Knitter's. guestbook! What is Web Knitter.? Promote your Web Site!!! I want to apply for a web site! Yhere is a lot of empty space at the end of the page, could it be that I was not patient enough and the links are there? 
Ciao, Josef -------------------------------------------------------------------- Josef PELLIZZARI tel : +41 22 767 9627 CN Division 31 2-013 fax : +41 22 767 7155 CERN mail: Josef.Pellizzari@cern.ch CH-1211 Geneve 23 -------------------------------------------------------------------- On Wed, 13 Mar 1996, Jared Williams wrote: > >> > >>I am looking to find robots available for use on a corporate "Intranet" in > order to create a current resource database. > >> > >>The corporation has many unadvertised web applications and I have been > tasked with creating some sort of directory for general use. > >> > >>If anybody knows of any available (sale/shareware/freeware) please respond. > >> > >>Thanks in advance. > >> > >> > I have a huge list of robots on my site! Check it out! > > Hope it helps! > > Jared Williams > > Want a NICE SITE? > > Visit Web Knitter (R) > > http://home.earthlink.net/~williams > e-mail: williams@earthlink.net > > > From owner-robots Thu Mar 14 09:30:55 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08175; Thu, 14 Mar 96 09:30:55 -0800 Date: Thu, 14 Mar 1996 09:29:17 -0800 Message-Id: <199603141729.JAA21991@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Re: Robots available for Intranet applications Cc: Promote.your.web.site@norway.it.earthlink.net Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 08:36 AM 3/14/96 +0100, you wrote: > >Hi there, > >I couldn't find the robots, the only links I've seen are: > > Please sign Web Knitter's. guestbook! > > What is Web Knitter.? > > Promote your Web Site!!! > > I want to apply for a web site! > >Yhere is a lot of empty space at the end of the page, could it be that I >was not patient enough and the links are there? > >Ciao, Josef > >-------------------------------------------------------------------- > Josef PELLIZZARI tel : +41 22 767 9627 > CN Division 31 2-013 fax : +41 22 767 7155 > CERN mail: Josef.Pellizzari@cern.ch > CH-1211 Geneve 23 >-------------------------------------------------------------------- > >On Wed, 13 Mar 1996, Jared Williams wrote: > >> >> >> >>I am looking to find robots available for use on a corporate "Intranet" in >> order to create a current resource database. >> >> >> >>The corporation has many unadvertised web applications and I have been >> tasked with creating some sort of directory for general use. >> >> >> >>If anybody knows of any available (sale/shareware/freeware) please respond. >> >> >> >>Thanks in advance. >> >> >> >> >> I have a huge list of robots on my site! Check it out! >> >> Hope it helps! >> Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net >> >> >> > > Hi! You were patient enough you just have to look in the right places. : ^ ) It's under the the link "Promote your web site!" It not only has the link to submit your URL but it also has the link to the web robot itself...I hope that that is what you're looking for! Have a nice day!!! Jared Williams Want a NICE SITE? 
Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Thu Mar 14 09:48:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08338; Thu, 14 Mar 96 09:48:53 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 14 Mar 1996 09:49:12 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Robots available for Intranet applications Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I am looking to find robots available for use on a corporate "Intranet" in >order to create a current resource database. We have such a robot available for our search products. Take a look at our site (http://www.verity.com/). We have quite a few customers using it for this purpose. It is not a free-ranging robot; you tell it which servers and which directory trees you want to index. In other words, it doesn't jump from server to server, so you can focus it on your intranet or parts of it. The basic robot that comes with our Web server products will only index the machine on which it's running. There's an upgrade to a version that will index other machines (anywhere on the net). These products are shipping for a number of Unix and Windows platforms. In addition, although it's not shipping yet, Netscape has announced its Catalog server, which is designed for this specific purpose. It is also based on our search engine. There's information about it on their site, (http://www.netscape.com/). Nick Arnett Internet Marketing Manager Verity Inc. From owner-robots Sat Mar 16 09:29:21 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21397; Sat, 16 Mar 96 09:29:21 -0800 Date: Sat, 16 Mar 1996 18:31:42 +0100 (MET) From: Michael De La Rue To: robots@webcrawler.com Subject: Re: robot authentication parameters In-Reply-To: <9603131323.A06235@huachuca-emh17.army.mil> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com This is in LibWWWPerl 5b6.. but that's a perl5 binary. How much time do you have available for this? I'm trying to work on momspider for other reasons. It might be interesting to upgrade it to work with the latest LibWWWPerl. Quite alot of work though. Alternatively, just look at the latest versions of LibWWWperl and re-write them for perl4. Might be easier to do the upgrade though. On Wed, 13 Mar 1996, Sibylle Gonzales wrote: > > Text item: Text_1 > > Does anyone know how to modify the MOMspider webcrawler to allow the > passing of authentication paramaters to the Netscape Commerce server > for access restricted areas on the web server? > > Input is greatly appreciated! 
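The "couple of added lines" Dan Gildor mentioned boil down to one Authorization header carrying the base64-encoded user:password pair, i.e. HTTP Basic authentication as described in the HTTP spec. Below is a minimal stand-alone sketch, assuming libwww-perl (LWP) and MIME::Base64 are installed; the URL, user name and password are placeholders, and wiring the header into MOMspider's own request code is the part left to the reader.

#!/usr/bin/perl
# Sketch only: fetch one protected page with HTTP Basic authentication.
# Assumes libwww-perl (LWP) and MIME::Base64 are available; the URL,
# user name and password below are placeholders.
use LWP::UserAgent;
use HTTP::Request;
use MIME::Base64 qw(encode_base64);

my $url  = 'http://www.example.com/protected/index.html';   # placeholder
my $user = 'spider';                                         # placeholder
my $pass = 'secret';                                         # placeholder

my $ua = LWP::UserAgent->new;
$ua->agent('example-robot/0.1');

my $req = HTTP::Request->new(GET => $url);
# The "couple of added lines": a base64-encoded "user:password" pair
# sent in an Authorization header (HTTP Basic authentication).
$req->header('Authorization' => 'Basic ' . encode_base64("$user:$pass", ''));
# libwww-perl also provides an equivalent convenience call:
# $req->authorization_basic($user, $pass);

my $res = $ua->request($req);
if ($res->is_success) {
    print $res->content;
} else {
    print "Failed: ", $res->status_line, "\n";
}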
> Scottish Climbing Archive: Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/ From owner-robots Fri Mar 22 12:25:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00810; Fri, 22 Mar 96 12:25:32 -0800 From: jm@circle-slide.indianapolis.sgi.com (jon madison) Message-Id: <9603221515.ZM19070@circle-slide.indianapolis.sgi.com> Date: Fri, 22 Mar 1996 15:15:52 -0500 In-Reply-To: Douglas Summersgill "Robots available for Intranet applications" (Mar 11, 9:31am) References: <01BB0F2D.9EC869E0@ts03-25.qtm.net> X-Face: wT@!QyzV&.Q}K8PKQ90246#h4)}^Q#u|m5{gyvLyz=XrhvSP3"77M:lY.RQJC*^K]"a]{v5jS/dP8t!$L.Q'\\u|Vx*7wGC`N!kB6iYX@d?}XQ97&OdU@LQKOrKFkGb'H&'I[jq_9Y-CsJqfd?EBS;;Js`b+n^t!UK0)h_aQb[U4,T#/t0!{C[=y]d mailto:jm@sgi.com me: From owner-robots Fri Mar 22 14:45:29 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01531; Fri, 22 Mar 96 14:45:29 -0800 From: David Schnardthorst Message-Id: <199603222245.QAA09001@strydr.strydr.com> Subject: Re: Robots available for Intranet applications To: robots@webcrawler.com Date: Fri, 22 Mar 1996 16:45:14 -0600 (CST) In-Reply-To: <9603221515.ZM19070@circle-slide.indianapolis.sgi.com> from "jon madison" at Mar 22, 96 03:15:52 pm Organization: Stryder Communications, Inc. Address: 869 St. Francois, Florissant, Mo. 63031 Telephone: (314)838-6839 Fax: (314)838-8527 X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 521 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In the Original, jon madison Says > >we use lycos source. Where can we get lycos source? ============================================================================ David Schnardthorst, Systems/Network Eng. * Phone: (314)838-6839 Stryder Communications, Inc. * Fax: (314)838-8527 869 St. Francois * E-Mail: ds3721@strydr.com Florissant, MO 63031 * URL: http://www.strydr.com ============================================================================ From owner-robots Mon Mar 25 03:00:43 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13204; Mon, 25 Mar 96 03:00:43 -0800 Message-Id: <9603251102.AA00945@inrete.it> Date: Mon, 25 Mar 96 11:59:46 -0100 From: Francesco X-Mailer: Mozilla 1.2N (Windows; I; 16bit) Mime-Version: 1.0 To: robots@webcrawler.com Subject: How to...??? Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Dear all subscribers, happy to found this amazing mailing list. The most, really, easy question: how to submit a new URL to the largest number of robot? Thanks a lot. Francesco From owner-robots Mon Mar 25 05:53:54 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13663; Mon, 25 Mar 96 05:53:54 -0800 From: jeremy@mari.co.uk (Jeremy.Ellman) Message-Id: <9603251353.AA06543@kronos> Subject: Re: How to...??? To: robots@webcrawler.com Date: Mon, 25 Mar 1996 13:53:30 +0000 (GMT) In-Reply-To: <9603251102.AA00945@inrete.it> from "Francesco" at Mar 25, 96 11:59:46 am X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > The most, really, easy question: > > how to submit a new URL to the largest number of robot? > "Billed as the "fastest way to publicize your website", "Submit It!" 
is a single point of entry to 15 of the Web's info search engines, including Yahoo, Infoseek, Lycos, and WebCrawler. " > Extract fron Netsurf Magazine.... > > > From owner-robots Mon Mar 25 08:23:00 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14352; Mon, 25 Mar 96 08:23:00 -0800 From: "Mark Norman" Message-Id: <9603250829.ZM25712@hpisq3cl.cup.hp.com> Date: Mon, 25 Mar 1996 08:29:43 -0800 X-Mailer: Z-Mail (3.2.1 10apr95) To: robots@webcrawler.com Subject: image map traversal Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello, I previously asked the question: how can you retrieve an image map file from a web server? (the ascii file with the urls and coordinates). Someone answered that web servers do not allow you to retrieve this file. This leads me to wonder how a robot can descend the document tree of a web site if, as is the case for many sites, the path to lower level documents is through image maps and not through explicit hyperlinks? You could interrogate the map completely if you know its dimensions, but I don't know how to get the dimensions. But this would require many, many queries of the image map to find its hotspots. Thanks for any help. From owner-robots Mon Mar 25 09:47:16 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15178; Mon, 25 Mar 96 09:47:16 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 25 Mar 1996 09:48:00 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: image map traversal Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Hello, > >I previously asked the question: how can you retrieve an image map file from a >web server? (the ascii file with the urls and coordinates). Someone answered >that web servers do not allow you to retrieve this file. This leads me to >wonder how a robot can descend the document tree of a web site if, as is the >case for many sites, the path to lower level documents is through image maps >and not through explicit hyperlinks? I think you summarized the problem well and there's no obvious solution as long as the image map is the only path by which one could reach some of the documents. Not only robots would be affected, so would people using Lynx or any other text-based browser, or those who didn't care to download the image map. These are reasons that it is always a good idea to have a text-only alternative page. Nick From owner-robots Mon Mar 25 09:51:56 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15229; Mon, 25 Mar 96 09:51:56 -0800 Date: Mon, 25 Mar 1996 09:37:27 -0800 (PST) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: image map traversal In-Reply-To: <9603250829.ZM25712@hpisq3cl.cup.hp.com> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com -----BEGIN PGP SIGNED MESSAGE----- On Mon, 25 Mar 1996, Mark Norman wrote: > Hello, > I previously asked the question: how can you retrieve an image map file from a > web server? (the ascii file with the urls and coordinates). Someone answered > that web servers do not allow you to retrieve this file. 
This leads me to > wonder how a robot can descend the document tree of a web site if, as is the > case for many sites, the path to lower level documents is through image maps > and not through explicit hyperlinks? The simple answer is: you can't. One more *very* good reason not to hide a site exclusively behind imagemaps. The search engines probably won't be able to index your site. > You could interrogate the map completely if you know its dimensions, but I > don't know how to get the dimensions. There are a number of tools that could figure its dimensions either via the HEIGHT and WIDTH tags or by looking in the image file itself. In fact, given the occasional use of HEIGHT and WIDTH to resize images, you should use both. Look for a perl program called 'wwwimagesize'. > But this would require many, many queries of the image map to find its > hotspots. Yup. Order X * Y accesses. Even a modestly sized image map is too large to explore *completely* via automation. You could accelerate the exploration by using a low resolution grid, say every 15 pixels or so (which would still require about 278 accesses to explore a 250x250 map), with the trade off that you could miss very small hotzones. You should also store the *final* URL instead of the XY coord as the URL of the page fetched to prevent false duplication of pages. Benjamin Franz -----BEGIN PGP SIGNATURE----- Version: 2.6.2 iQCVAwUBMVbZhOjpikN3V52xAQEj5gP+J193pvHZiVFDOC3w0t4RhnkcIjPgalWh fw8hW5tzFBtTq85ZapMdgPi7R4l6Bkmqca54M8kefR0ehe9I1Be5KF154BHN3b7E jXf41H8AkRmMK7JBmEX3E0qIUgdXmu2fttvGJdvAx4Kpt2qKt/z9aUEgwjkTJixR Ql05Jg4cuYs= =BLTl -----END PGP SIGNATURE-----
From owner-robots Mon Mar 25 10:09:46 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15339; Mon, 25 Mar 96 10:09:46 -0800 Date: Mon, 25 Mar 1996 10:09:40 -0800 Message-Id: <199603251809.KAA01280@iceland.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: ReHowto...??? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Francesco, If you want to publicize your site and quit having to search around for links, visit my site. It has plenty of search engine links that I think that you would be happy with! It's at: http://home.earthlink.net/~williams Enjoy! Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net
From owner-robots Mon Mar 25 10:12:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15374; Mon, 25 Mar 96 10:12:34 -0800 Date: Mon, 25 Mar 1996 10:12:28 -0800 Message-Id: <199603251812.KAA01586@iceland.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Info on authoring a Web Robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I have a question and I think that I can find an answer for it in this mailing group. I would like to author a web robot but I don't know where to start. Does anyone know of any publications or web sites that give info on making web robots? I would greatly appreciate any feedback!!! Thanks!!! Jared Williams Want a NICE SITE?
Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Mon Mar 25 10:24:59 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15479; Mon, 25 Mar 96 10:24:59 -0800 Date: Mon, 25 Mar 1996 13:23:03 -0600 (CST) From: Cees Hek To: robots@webcrawler.com Subject: Re: image map traversal In-Reply-To: <9603250829.ZM25712@hpisq3cl.cup.hp.com> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Mon, 25 Mar 1996, Mark Norman wrote: > I previously asked the question: how can you retrieve an image map file from a > web server? (the ascii file with the urls and coordinates). Someone answered > that web servers do not allow you to retrieve this file. This leads me to > wonder how a robot can descend the document tree of a web site if, as is the > case for many sites, the path to lower level documents is through image maps > and not through explicit hyperlinks? This depends on where the server keeps the image map files. If they are kept right beside the HTML documents (below the document ROOT). Then you should be able to rebuild the URL for it from the HREF and retrieve it as a text file. However, if the server stores these files somewhere that is not accessible by the world (ie if it gets the map file location through conf files), then your SOL. Here is an example that someone uses on one of our servers that works this way. Just take out the cgi-bin/imagemap/ and you've got the URL for the Map file. > You could interrogate the map completely if you know its dimensions, but I > don't know how to get the dimensions. But this would require many, many queries > of the image map to find its hotspots. This would probably work, but I would not be impressed if someone hit our site a couple hundered times just to get some URLs. I guess this is an instance where Client side Imagemaps have a distinct advantage. Cees Hek Computing & Information Services Email: hekc@mcmaster.ca McMaster University Hamilton, Ontario, Canada From owner-robots Mon Mar 25 11:08:51 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15871; Mon, 25 Mar 96 11:08:51 -0800 Date: Mon, 25 Mar 1996 13:14:40 -0500 From: frizzlefry@nucleus.atom.com Message-Id: Subject: image map traversal To: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com please take me off your mailing lists From owner-robots Mon Mar 25 15:15:38 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17363; Mon, 25 Mar 96 15:15:38 -0800 Date: Mon, 25 Mar 1996 15:14:41 -0800 Message-Id: <199603252314.PAA18117@sweden.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Links Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com When I submit a URL to a Web Robot what is the deciding facter as to how close to the top of a search engine the link appears? Is there any way that I can get my links to be closer to the top? Thanks!!! Jared Williams Want a NICE SITE? 
Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Mon Mar 25 20:06:16 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18774; Mon, 25 Mar 96 20:06:16 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 25 Mar 1996 20:07:33 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Links Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 3:14 PM 3/25/96, Jared Williams wrote: >When I submit a URL to a Web Robot what is the deciding facter as to how >close to the top of a search engine the link appears? The degree to which the engine considers your submission to match against the user's query. As such it is a function of both the user's query and the retrieval engine, neither of which is under your control :-) >Is there any way that I can get my links to be closer to the top? Sure -- get all the users to enter the text of your document as their query, or replace the retrieval engine so it displays yours first? Seriously, in general search services don't give you the option to influence the order much. If they did, everyone would abuse it, and it would render itself useless. People do try, usually by purposefully exploiting weaknesses in the implementation, or the very nature of the retrieval engines. Such spammers can relatively easily be detected, and of course easily get blocked from the search engine forever... That's called quality control :-) Just wondering: What makes your page so much more deserving to be on the top of a results list? -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Mon Mar 25 21:50:22 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19054; Mon, 25 Mar 96 21:50:22 -0800 From: doucette@tinkerbell.macsyma.com (Chuck Doucette) Message-Id: <9603260049.ZM15550@tinkerbell.macsyma.com> Date: Tue, 26 Mar 1996 00:49:04 -0500 X-Mailer: Z-Mail (3.2.2 10apr95 MediaMail) To: robots@webcrawler.com Subject: Limiting robots to top-level page only (via robots.txt)? Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Since I found out that Alta Vista (probably among others) indexes each page on our site (and not just the top-level page), I've been trying to find out how to prevent that. The sub-pages should certainly be accessible to anyone who can read the top-level page; however, someone may not have the context to go directly to a sub-page without going through our top-level page first. So, if indeed I wanted to prevent a robot from indexing any page other than the default one for the top-level (http://www.macsyma.com/), how could I do that? It's my understanding that the syntax for disallow assumes the top-level URL (http://www.macsyma.com) and matches on any trailing characters (such as /). This isn't stated clearly in the Robot exclusion documents I've read. With this syntax, I see no way of allowing "http://www.macsyma.com/" but preventing "http://www.macsyma.com/*.html" since regular expressions aren't allowed (nor multiple disallow fields?). Chuck -- Chuck Doucette e-mail: doucette@macsyma.com Macsyma, Inc. phone: (617) 646-4550 20 Academy St., Suite 201 fax: (617) 646-3161 Arlington MA 02174-6436 / U.S.A. 
URL: http://www.macsyma.com From owner-robots Tue Mar 26 01:12:22 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19628; Tue, 26 Mar 96 01:12:22 -0800 Message-Id: <199603260911.KAA13514@GuJ.de> Comments: Authenticated sender is From: "Detlev Kalb" Organization: Gruner + Jahr (Doku) To: robots@webcrawler.com Date: Tue, 26 Mar 1996 10:09:42 +0000 Subject: Re: Info on authoring a Web Robot X-Confirm-Reading-To: kalb.detlev@guj.de X-Pmrqc: 1 Priority: normal X-Mailer: Pegasus Mail for Windows (v2.0-WB4) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jared, try the WWW Robot and Search Engine FAQ at http://science.smsu.edu/robot/faq/robot.html You'll find links to MOMspider and to a patch that will modify MOMspider into a full fledged web crawler. I haven't tried it myself yet, but maybe it helps. Yours Detlev Kalb email: kalb.detlev@guj.de Fon: 0049 40 3703 2021 Fax: 0049 40 3703 5652 From owner-robots Tue Mar 26 02:03:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19712; Tue, 26 Mar 96 02:03:05 -0800 From: Jaakko Hyvatti Message-Id: <199603261002.MAA15685@krisse.www.fi> Subject: Re: Limiting robots to top-level page only (via robots.txt)? To: robots@webcrawler.com Date: Tue, 26 Mar 1996 12:02:40 +0200 (EET) In-Reply-To: <9603260049.ZM15550@tinkerbell.macsyma.com> from "Chuck Doucette" at Mar 26, 96 00:49:04 am X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com doucette@tinkerbell.macsyma.com (Chuck Doucette): > So, if indeed I wanted to prevent a robot from indexing any page other > than the default one for the top-level (http://www.macsyma.com/), how > could I do that? With the Disallow: lines in robots.txt one disallows all pathname strings starting with the specified string. If you want to allow / but disallow /..*, you may list all your top-level files and directories: User-agent: * Disallow: /xyz.html Disallow: /sales ... which you have to update any time you add new files/dirs, or you could disallow all files by their first letter and list all possible filename starting letters, so you do not have to maintain robots.txt too often: User-agent: * Disallow: /a Disallow: /b Disallow: /c Disallow: /d ... Disallow: /z From owner-robots Tue Mar 26 02:20:59 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19764; Tue, 26 Mar 96 02:20:59 -0800 Date: Tue, 26 Mar 1996 11:20:49 GMT Message-Id: <199603261120.LAA18088@Cindy.mhm.fr> X-Sender: merlin@10.1.1.2 X-Mailer: Windows Eudora Pro Version 2.1.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Thomas Merlin Subject: Image Maps Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Wait a minute... A robot can go through an image map can't it ? By just exploring everything in the current directory and the sub-directories, as long as it isn't blocked by a .htaccess... Do all robots quit when they run into an image map ? Thanks. Tom. 
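A server-side map reference gives a robot nothing to parse, so in general the only option is the coarse-grid probing Benjamin Franz described earlier in this thread. Here is a rough sketch of that idea, assuming libwww-perl is installed; the map URL, image size and 15-pixel step are placeholders, and the request count it implies is exactly why crawlers rarely attempt it.

#!/usr/bin/perl
# Sketch only: probe a server-side imagemap on a coarse grid and collect
# the URLs it redirects to.  The map URL, image dimensions and step size
# are placeholders; a real robot would also honour /robots.txt.
use LWP::UserAgent;
use HTTP::Request;

my $map  = 'http://www.example.com/cgi-bin/imagemap/front.map';  # placeholder
my ($w, $h) = (250, 250);   # image size, e.g. from WIDTH/HEIGHT or the file
my $step = 15;              # coarse grid: roughly (250/15)^2, a few hundred hits

my $ua = LWP::UserAgent->new;
$ua->agent('example-robot/0.1');

my %target;
for (my $y = 0; $y < $h; $y += $step) {
    for (my $x = 0; $x < $w; $x += $step) {
        # An imagemap CGI answers "GET map?x,y" with a redirect to the page
        # behind that hotspot; simple_request() keeps the 302 visible.
        my $res = $ua->simple_request(HTTP::Request->new(GET => "$map?$x,$y"));
        $target{ $res->header('Location') }++ if $res->is_redirect;
        sleep 1;   # be polite; this cost is why robots rarely bother
    }
}
# Record the final URLs, not the x,y pairs, so duplicates collapse.
print "$_\n" for sort keys %target;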
_______________________________________________________ Thomas Merlin http://www.cybertheque.fr/~merlin Grolier Interactive Europe http://www.club-internet.fr From owner-robots Tue Mar 26 04:37:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20050; Tue, 26 Mar 96 04:37:05 -0800 Date: Tue, 26 Mar 96 18:13:21 EST From: "ACHAKS" Encoding: 13 Text Message-Id: <9602268278.AA827859769@.inf.com> To: robots@webcrawler.com Subject: Request for Source code in C for Robots Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi friends, I am planning to make a robot(Window3.1 based) using the NetAPI calls. I could extract the url but I need to parse it to get subsequent url's. Is there any public domain(shareware or freeware) parser(in C) available for the same which I can use? Is there any source code for robots in C available which I can go through ? It would be better if it implements robots exclusion protocol. Thanks in anticipation, Angs. From owner-robots Tue Mar 26 06:20:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20384; Tue, 26 Mar 96 06:20:23 -0800 From: "Jeannine Washington" Organization: TECHNOLOGIES To: robots@webcrawler.com Date: Tue, 26 Mar 1996 08:31:01 MST Subject: RCPT: Re: Info on authoring a Web Robot Priority: normal X-Mailer: Pegasus Mail v3.22 Message-Id: <132DF1D6AAB@n115.tvi.cc.nm.us> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Confirmation of reading: your message - Date: 26 Mar 96 10:09 To: robots@webcrawler.com Subject: Re: Info on authoring a Web Robot Was read at 8:31, 26 Mar 96. From owner-robots Tue Mar 26 10:31:03 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21709; Tue, 26 Mar 96 10:31:03 -0800 X-Sender: gyld@mail.best.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 26 Mar 1996 10:28:25 -0800 To: robots@webcrawler.com From: gyld@in-touch.com (Dan Gildor) Subject: robots that index comments Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was wondering, is it standard practice for most of the larger indexing robots (Alta Vista, WebCrawler, etc), to index keywords placed in comments towards the top of an html document? Or do they just ignore them? I was just thinking that for a set of pages under development for a client, that we could embed certain key words in comments that don't quite fit into the flow of the page text and still have them indexed and hittable in searches. Thanks. 
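Whether any particular engine indexes comment text is a question for that engine, but on the indexer side the mechanics come down to whether <!-- ... --> blocks are stripped before the words are extracted. A small sketch of the two behaviours follows; the sample page and its keyword comment are invented.

#!/usr/bin/perl
# Sketch only: show what an indexer sees depending on whether it strips
# HTML comments before extracting words.  The page below is made up.
my $html = <<'EOF';
<HTML><HEAD><TITLE>Rabbit care</TITLE></HEAD><BODY>
<!-- rabbit rabbits bunny bunnies buns lapin -->
<H1>Keeping rabbits</H1>
<P>Notes on hutches and feeding.</P>
</BODY></HTML>
EOF

my $keep = $html;
$keep =~ s/<!--(.*?)-->/ $1 /gs;   # unwrap comments so their words survive
$keep =~ s/<[^>]*>/ /g;            # then strip ordinary tags

my $strip = $html;
$strip =~ s/<!--.*?-->/ /gs;       # throw comment blocks away first
$strip =~ s/<[^>]*>/ /g;

print "Indexing with comments:    ", join(' ', split ' ', $keep),  "\n";
print "Indexing without comments: ", join(' ', split ' ', $strip), "\n";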
-Dan ============================================================================ Dan Gildor inTouch Technologies Director of Technical Productions 1130 Sherman Avenue gyld@in-touch.com Menlo Park, CA 94025 http://www.in-touch.com Tel: (415) 854-8036 "Making your net work" Fax: (415) 233-0155 ============================================================================ From owner-robots Tue Mar 26 10:57:02 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21883; Tue, 26 Mar 96 10:57:02 -0800 Message-Id: <199603261856.KAA17847@sparty.surf.com> Date: Tue, 26 Mar 96 10:56:28 -0800 From: murray bent Organization: icis X-Mailer: Mozilla 1.12IS (X11; I; IRIX 5.3 IP22) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: robots that index comments X-Url: http://www.infoseek.com/ Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I was wondering, is it standard practice for most of the >robots (Alta Vista, WebCrawler, etc), to index keywords robots retrieve documents, its indexers that index documents. To check how the larger indexing sites have done their job just do an experiment - use the same query (say "Ford") on each of the sites and look at the first documents returned with a weighting of 1.00000 . A simple way to do this is use the metacrawler at : http://metacrawler.cs.washington.edu:8080 From owner-robots Tue Mar 26 16:33:08 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24018; Tue, 26 Mar 96 16:33:08 -0800 Date: Tue, 26 Mar 1996 19:33:04 -0500 From: Pinano@aol.com Message-Id: <960326193302_178717163@emout07.mail.aol.com> To: robots@webcrawler.com Subject: Re: UNSUBSCRIBE ROBOTS Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Unsuscribe Pinano@aol.com ROBOTS From owner-robots Tue Mar 26 20:44:11 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25285; Tue, 26 Mar 96 20:44:11 -0800 From: Date: Tue, 26 Mar 1996 20:44:05 -0800 (PST) Subject: Re: Image Maps To: robots@webcrawler.com In-Reply-To: <199603261120.LAA18088@Cindy.mhm.fr> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Tue, 26 Mar 1996, Thomas Merlin wrote: > Wait a minute... > > A robot can go through an image map can't it ? > > By just exploring everything in the current directory and the > sub-directories, as long as it isn't blocked by a .htaccess... > > Do all robots quit when they run into an image map ? > > Thanks. > > Tom. > > _______________________________________________________ > Thomas Merlin http://www.cybertheque.fr/~merlin > Grolier Interactive Europe http://www.club-internet.fr > > You got a very good point in asking if all robots quit when running into an image map. If they don't, then the question is, *when* do they quit or are they in some kind of a *loop* and just keep going and going like the Eveready Bunny. Speaking of which leads to another thing that troubles me, if a site owner is allocated say 200MB of transfer (throughput, whatever) for their web site, and everyone starts using robots and spiders, then won't these robots and spiders *alone* use up the average small web site's transfer allocation? 
If the charge for additional transfers above the web site allocation is say $20 per 200MB, then couldn't active robot/spider activity (especially if one gets stuck in a loop on a given site and keeps hitting it for days) break the piggy bank of a small web site owner? Just a rhetorical aside.. ^^The Net Surfer^^^ From owner-robots Wed Mar 27 02:31:14 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26219; Wed, 27 Mar 96 02:31:14 -0800 Message-Id: <199603271030.LAA20288@GuJ.de> Comments: Authenticated sender is From: "Detlev Kalb" Organization: Gruner + Jahr (Doku) To: robots@webcrawler.com Date: Wed, 27 Mar 1996 11:28:31 +0000 Subject: keywords in META-element X-Confirm-Reading-To: kalb.detlev@guj.de X-Pmrqc: 1 Priority: normal X-Mailer: Pegasus Mail for Windows (v2.0-WB4) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I would like to include some keywords in the HEAD of my web pages to facilitate indexing and catalogue construction of search engines. On the net I saw 2 alternatives using the META element: What difference does it make with respect to indexing? Which alternative is preferable? Any comments would be appreciated. Detlev Kalb email: kalb.detlev@guj.de Fon: 0049 40 3703 2021 Fax: 0049 40 3703 5652 From owner-robots Wed Mar 27 02:56:56 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26298; Wed, 27 Mar 96 02:56:56 -0800 Date: Wed, 27 Mar 1996 10:54 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <0099FF543909A700.982F@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Re: Image Maps X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Re: Robots and Image maps.... Think of it in reverse... not all robots give due respect to the /robots.txt file so if you have a section of tree you want obvious to (people running graphical) browsers then you can reference it with an server-side clickable map. If you *do* want robots to see the tree then you *do* want to include text links in parallel... > .... if a site owner > is allocated say 200MB of transfer (throughput, whatever) for their web > site, and everyone starts using robots and spiders, then won't these > robots and spiders *alone* use up the average small web site's transfer > allocation? That's a question to ask the provider, I don't suppose many can bundle individuals' robots.txt files into a general /robots.txt. It's a good thing to ask however - just to express concern. You ought to get an offer from the provider to state "if the server is getting hammered by a particular site we'll block it". Regards, Martin Kiff National Physical Laboratory, UK mgk@newton.npl.co.uk / mgk@webfeet.co.uk From owner-robots Wed Mar 27 05:08:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26582; Wed, 27 Mar 96 05:08:06 -0800 Organization: CNR - Istituto Tecnologie Informatiche Multimediale Date: Wed, 27 Mar 1996 14:08:22 -0100 From: davide@jargo.itim.mi.cnr.it (Davide Musella) Message-Id: <199603271508.OAA05571@jargo> To: robots@webcrawler.com Subject: Re: keywords in META-element X-Sun-Charset: US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I would like to include some keywords in the HEAD of my web pages to > facilitate indexing and catalogue construction of search engines. 
On > the net I saw 2 alternatives using the META element: > > > > At the moment, any robot that uses the META tag content to index a document does so with its own private semantics for this tag. The HTTP-EQUIV method is not yet implemented in any HTTP server, so you can use the NAME method, but some groups are working to implement this HTML feature (the PICT group, e.g.). Davide ----------------------------------------------------------------------------- Davide Musella Institute for Multimedia Technologies, National Research Council, Milan, ITALY tel. +39.(0)2.70643271 e-mail: davide@jargo.itim.mi.cnr.it http://jargo.itim.mi.cnr.it/
From owner-robots Wed Mar 27 07:08:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26956; Wed, 27 Mar 96 07:08:24 -0800 Date: Wed, 27 Mar 1996 09:08:09 -0600 (CST) Message-Id: <199603271508.JAA00244@wins0.win.org> X-Sender: kfischer@pop.win.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: kfischer@mail.win.org (Keith D. Fischer) Subject: Re: Info on authoring a Web Robot X-Mailer: Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com try looking at http://science.smsu.edu/robot/faq/index.html Keith. >I have a question and I think that I can find an answer for it in this >mailing group. I would like to author a web robot but I don't know where to >start. Does anyone know of any publications or web sites that give info on >making web robots? I would greatly appreciate any feedback!!! > > >Thanks!!! > Jared Williams > > Want a NICE SITE? > Visit Web Knitter (R) > > http://home.earthlink.net/~williams > e-mail: williams@earthlink.net > >
From owner-robots Wed Mar 27 15:31:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29802; Wed, 27 Mar 96 15:31:05 -0800 Message-Id: From: Lee Fisher To: "'robots@webcrawler.com'" Subject: Announce: ActiveX Search (IFilter) spec/sample Date: Wed, 27 Mar 1996 15:31:44 -0800 X-Mailer: Microsoft Exchange Server Internet Mail Connector Version 4.0.837.3 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I just realized that we just published something that might be of relevance to folks on this mailing list... About 2 weeks ago at our Internet developer's conference we released a variety of new Internet products, some of which have interfaces for ISVs. Nearly all of that stuff is on , which is a collection of development interfaces for client- and server-side Internet stuff. The "ActiveX Search" stuff, an OLE/COM IFilter interface, is something that might be of interest to web crawlers and search engines. Info on this is in the ActiveX SDK (available at above URL), in the \InetSDK\Samples\HTMLFilt subdirectory. The spec for it is also in that directory, as the source. Quoting from the readme: ----- snip ----- snip ----- The IFilter interface was designed primarily to provide a uniform mechanism to extract character streams from formatted data. The goal was to provide ISVs with an interface that extracts text as the initial step in content indexing document data. IFilter can be implemented over any document format and the ISV can choose any API or interface to read the data format. For example, a content filter can be written that reads data using the Win32 file APIs or uses the OLE storage interfaces.
Any software author who stores textual data should consider implementing a content filter for the document format to allow content indexing systems to extract text. The sample filter in this directory will extract text and properties from HTML pages. In addition to raw content, headings (level 1 to 6), title and anchors are emitted as pseudo-properties. Title is also published as a full property available via IFilter::GetValue. ----- snip ----- snip ----- So, search engines and crawlers which grok IFilter will be able to break up OLE-based code and get the contents of it. The HTMLFilt sample here implements an IFilter-based sample which reads HTML text. Hope that someone finds this useful... __ Lee Fisher, leefi@microsoft.com From owner-robots Thu Mar 28 14:18:31 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05533; Thu, 28 Mar 96 14:18:31 -0800 Date: Thu, 28 Mar 1996 14:16:53 -0800 Message-Id: <199603282216.OAA01387@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Re: Links Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 08:07 PM 3/25/96 -0700, you wrote: >At 3:14 PM 3/25/96, Jared Williams wrote: > >>When I submit a URL to a Web Robot what is the deciding facter as to how >>close to the top of a search engine the link appears? > >The degree to which the engine considers your submission to match against >the user's query. As such it is a function of both the user's query and >the retrieval engine, neither of which is under your control :-) > >>Is there any way that I can get my links to be closer to the top? > >Sure -- get all the users to enter the text of your document as their >query, or replace the retrieval engine so it displays yours first? > >Seriously, in general search services don't give you the option to >influence the order much. If they did, everyone would abuse it, and >it would render itself useless. > >People do try, usually by purposefully exploiting weaknesses in the >implementation, or the very nature of the retrieval engines. Such >spammers can relatively easily be detected, and of course easily get >blocked from the search engine forever... That's called quality >control :-) > >Just wondering: What makes your page so much more deserving to be on >the top of a results list? > > >-- Martijn > >Email: m.koster@webcrawler.com >WWW: http://info.webcrawler.com/mak/mak.html > > > > If you want go ahead and check out my sight and reply me with e-mail telling me if you think it deserves to be higher on the list. Thanks... Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Thu Mar 28 19:19:49 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07092; Thu, 28 Mar 96 19:19:49 -0800 From: "Mordechai T. 
Abzug" Message-Id: <199603290319.WAA02762@rpa07.gl.umbc.edu> Subject: Re: Links To: robots@webcrawler.com Date: Thu, 28 Mar 1996 22:19:43 -0500 (EST) In-Reply-To: <199603282216.OAA01387@norway.it.earthlink.net> from "Jared Williams" at Mar 28, 96 02:16:53 pm X-Mailer: ELM [version 2.4 PL25] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1158 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "JW" == Jared Williams spake thusly: JW> JW>If you want go ahead and check out my sight and reply me with e-mail telling JW>me if you think it deserves to be higher on the list. JW> JW>Thanks... JW> http://home.earthlink.net/~williams JW> e-mail: williams@earthlink.net Um, *no* site should be at the top of every search list. No matter how good and important your site is, I don't want to see it when I'm searching for info on Classical Origami. ;> Now, if you want your list to appear with reasonable prominence on searches that want your type of site, just add a description of what your site is on the site's main page. If you describe it in sufficiently specific terms, indexers will respond to queries that match your site's description -- which is the normal theory behind indexers. There is a new, unfortunate tendency of sites to put lots of graphics on the home page, and explain themselves in a secondary page. This is *not* good for indexers. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Spelling errors? NO WAY! I've got an ERROR-CORRECTING MODEM!
From owner-robots Thu Mar 28 22:17:20 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07714; Thu, 28 Mar 96 22:17:20 -0800 Date: Thu, 28 Mar 1996 22:15:58 -0800 Message-Id: <199603290615.WAA03729@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Re: Links Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Um, *no* site should be at the top of every search list. No matter how good >and important your site is, I don't want to see it when I'm searching for info >on Classical Origami. ;> > >Now, if you want your list to appear with reasonable prominence on searches >that want your type of site, just add a description of what your site is on >the site's main page. If you describe it in sufficiently specific terms, >indexers will respond to queries that match your site's description -- which is >the normal theory behind indexers. > >There is a new, unfortunate tendency of sites to put lots of graphics on the >home page, and explain themselves in a secondary page. This is *not* good for >indexers. > >-- > Mordechai T. Abzug >http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu >Spelling errors? NO WAY! I've got an ERROR-CORRECTING MODEM! > > Sorry for the misunderstanding. I of course meant in the category to which it belongs. Jared Williams Want a NICE SITE?
Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Fri Mar 29 01:56:28 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08032; Fri, 29 Mar 96 01:56:28 -0800 Date: Fri, 29 Mar 1996 10:56:12 +0100 (MET) From: Michael De La Rue To: robots@webcrawler.com Subject: Re: Links (don't bother checking; I've done it for you) In-Reply-To: <199603282216.OAA01387@norway.it.earthlink.net> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 28 Mar 1996, Jared Williams wrote: > At 08:07 PM 3/25/96 -0700, you wrote: > >At 3:14 PM 3/25/96, Jared Williams wrote: > > > > > > If you want go ahead and check out my sight and reply me with e-mail telling > me if you think it deserves to be higher on the list. ...then you're someone trying to make money from it.. > Thanks... To save you all the bother it's just like the Metasearch page, but it charges you $130 for the service as opposed to free.. and I think they possibly also sell a web authoring tool, but the page wasn't sufficiently clear about what it was exactly they were doing.. Then, at the end of his page he had the following keywords repeated California Northwest Animation Promotion Web Site Development Web Site Knitter Web Money about 50 times each.. okay, so the action to take in his case is obvious (deindex any siGHt(e) beloning to him), but what is the general case to stop this? It's very difficult as far as I can see. It's deliberate worthless junk, trying to get in at the level of people who are providing worthwhile information about california/web sites etc. Seems quite similar to the problem of spammed email/netnews to me. In that case the answer was for responsible sites to ban users who do the spamming and for irresponsible sites to be separated from all their neighbourhood, cutting them off from the network.. Dosen't really work here, because this is too trivial a matter (he's not doing any active damage) for that kind of response, but, what he is doing is messing up the catologues. I suggest that the major catalogue brokers might want to start removing sites/domains that don't protect people from this. Then the peer pressure would become strong from neigbourhood sites so that this would be controlled. Obviously anything which is put as not to access in a robots.txt would be something that shouldn't be complained about? Okay, it's not a big problem now, but it's much easier to write hundreds of junk pages than one properly researched web page, so the swamp is coming upon us pretty fast. Actually cutting off a few domains (as they are noticed or complained about) might be worthwhile.. What about each search page coming up with a rating button for each page (relevance to search/quality).. Pages which are consistently marked down would eventually get examined somehow and downrated?? Do censorship issues come in here, or would that just be stopped from being a problem by the level of competition? 
Scottish Climbing Archive: Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/ From owner-robots Fri Mar 29 03:19:55 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08199; Fri, 29 Mar 96 03:19:55 -0800 X-Sender: radio@mail.mpx.com.au Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 30 Mar 1995 22:28:21 +1000 To: robots@webcrawler.com From: radio@mpx.com.au (Keith) Subject: Re: Links This Site is about Robots Not Censorship Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Are You Kidding or Something ? This is not the place for this sort of Censorship Plans. Here we have millions of people supporting the Blue Ribbon Campaign and you want to organise the Size of the Jail Cell Who are you to Judge? Keith >On Thu, 28 Mar 1996, Jared Williams wrote: > >> At 08:07 PM 3/25/96 -0700, you wrote: >> >At 3:14 PM 3/25/96, Jared Williams wrote: >> > >> > >> >> If you want go ahead and check out my sight and reply me with e-mail telling >> me if you think it deserves to be higher on the list. > > ...then you're someone trying to make money from it.. > >> Thanks... > >To save you all the bother it's just like the > > Metasearch page, but it charges you $130 for the >service as opposed to free.. and I think they possibly also sell a web >authoring tool, but the page wasn't sufficiently clear about what it was >exactly they were doing.. > >Then, at the end of his page he had the following keywords repeated > >California >Northwest >Animation >Promotion Web Site >Development Web Site >Knitter Web >Money > >about 50 times each.. > >okay, so the action to take in his case is obvious (deindex any siGHt(e) >beloning to him), but what is the general case to stop this? It's very >difficult as far as I can see. It's deliberate worthless junk, trying to >get in at the level of people who are providing worthwhile information >about california/web sites etc. > >Seems quite similar to the problem of spammed email/netnews to me. In >that case the answer was for responsible sites to ban users who do the >spamming and for irresponsible sites to be separated from all their >neighbourhood, cutting them off from the network.. Dosen't really work >here, because this is too trivial a matter (he's not doing any active >damage) for that kind of response, but, what he is doing is messing up >the catologues. > >I suggest that the major catalogue brokers might want to start removing >sites/domains that don't protect people from this. Then the peer pressure >would become strong from neigbourhood sites so that this would be controlled. >Obviously anything which is put as not to access in a robots.txt would be >something that shouldn't be complained about? > >Okay, it's not a big problem now, but it's much easier to write hundreds >of junk pages than one properly researched web page, so the swamp is >coming upon us pretty fast. > >Actually cutting off a few domains (as they are noticed or complained >about) might be worthwhile.. > >What about each search page coming up with a rating button for each >page (relevance to search/quality).. Pages which are consistently marked >down would eventually get examined somehow and downrated?? Do censorship >issues come in here, or would that just be stopped from being a problem >by the level of competition? 
> > > Scottish Climbing Archive: >Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/ AAA World Announce Archive Search Engine Home: Australian Cool Site and Daily News. Web: http://www.aaa.com.au Email: webmaster@aaa.com.au Postal: AAA/The Radio EDGE 2AM16 P.O. Box 202, Caringbah 2229 Australia From owner-robots Fri Mar 29 05:02:55 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08458; Fri, 29 Mar 96 05:02:55 -0800 Message-Id: <9603291302.AA08452@webcrawler.com> To: robots@webcrawler.com Subject: Re: Links (don't bother checking; I've done it for you) In-Reply-To: Your message of "Fri, 29 Mar 1996 10:56:12 +0100." Date: Fri, 29 Mar 1996 13:01:18 +0000 From: Chris Brown Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Michael wrote: > Then, at the end of his page he had the following keywords repeated > > California > Northwest > Animation > Promotion Web Site > Development Web Site > Knitter Web > Money > > about 50 times each.. He also has "Buisness", "PageWeb", and "InternetInternet" among many others repeated about 10 times in a META KEYWORDS field. Maybe he's trying to attract poor spellers to his site. :-) Yes! What a great idea! An internet spelling coaching service could promote themselves by getting a dictionary of common misspellings indexed along with their page. Anyone typing a misspelled word into a search engine would come up with "Welcome to the Internet Spelling Service. Click *here* for the correct spelling of your word ($1)". On the other hand, maybe not... We don't want to give Jared any ideas... Regards, Chris Brown From owner-robots Fri Mar 29 05:09:52 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08495; Fri, 29 Mar 96 05:09:52 -0800 From: Jaakko Hyvatti Message-Id: <199603291309.PAA10090@krisse.www.fi> Subject: Re: Links (don't bother checking; I've done it for you) To: robots@webcrawler.com Date: Fri, 29 Mar 1996 15:09:30 +0200 (EET) In-Reply-To: from "Michael De La Rue" at Mar 29, 96 10:56:12 am X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Michael De La Rue : > about 50 times each.. > > okay, so the action to take in his case is obvious (deindex any siGHt(e) > beloning to him), but what is the general case to stop this? It's very > difficult as far as I can see. It's deliberate worthless junk, trying to > get in at the level of people who are providing worthwhile information > about california/web sites etc. The question is not how to stop this. Anyone still can put anything to his/hers pages, and it is you who fetches the information from his private property. Don't you never ever dare to question that. (Ok, so some covernments do.) The question is how a search engine can provide its customers the most valuable information. There are many other problems here and this one is only one of them. This is a matter of creating better and better algorithms and heuristics to evaluate the goodness of the match between customers query and the information content of the page. Also the tools to examine the query results matter. Even the plain title displayed on the results page often says that this page is not worth looking. Remember also that a customer is not interested to hear your personal opinions on what is worth reading or not, or who do you not like. In conclusion, do not panic. 
Just make a note and remember it when writing a search engine. From owner-robots Fri Mar 29 06:42:36 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08829; Fri, 29 Mar 96 06:42:36 -0800 Date: Fri, 29 Mar 1996 15:40:33 +0100 (MET) From: Michael De La Rue To: robots@webcrawler.com Subject: Re: Links This Site is about Robots Not Censorship In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 30 Mar 1995, Keith wrote: > Are You Kidding or Something ? > This is not the place for this sort of Censorship Plans. > Here we have millions of people supporting the Blue Ribbon Campaign > and you want to organise the Size of the Jail Cell > Who are you to Judge? Judgement is my most highly valued human atribute. I judge everything and everyone. Always in my own views; always as fairly as I can; always trying not to let my judgement affect the way I treat someone unless it should, but I do judge. I suspect you do too. > Keith I think this is the kind of thing that we are really having the problem with. Censorship:- You write a book, I stop you publishing it. Freedom:- You write a book, I ignore it completely. Stalinism:- I write a book, You HAVE to read it. It's everywhere etc. The catalogues aren't put there for the benefit of the web authors; they're for the readers. They run on machines belonging to somebody, using network connect belonging to somebody paid for by somebody (e.g. DEC, DEC, in the case of my favourite). They set out with a specific aim (getting publicity for DEC by providing something I want) and to try to stop that is vandalism. Trying to get round their prioritising algorithm is trying to stop that. Them not indexing you is their choice. If you want a place where everyone has the right to be indexed, then you are looking at government sponsorship or a not-for-profit organisation having to be set up (not such a bad idea) If they stop indexing people because of valuable content (e.g. politics/sex/competing products) I will find a different engine. If they start dumping stuff that tries to get in the way of a normal search then I will move even more strongly over to them. I've switched almost entirely to Alta Vista, simple because of the ease with which I can personally chop down the search list I get back by excluding URLs/Titles/etc..) and almost get back to a level of useful information. again :- The internet does not belong to you. In fact it barely exists as an object... It's all someones property and you have no right to what you haven't paid for; just a set of mutually agreed friendly offers. Trying to use someone elses computer (DECs) for something other than they intended it can be a crime in most US states and the UK (DEC intended to make it possible to search for information across the World Wide Web to publicise themselves :- this person IMHO is attempting to use the computer to advertise his buisness; he should set up his own search engine or buy Lycos/Infoseek advertising space). Oh; and just to end this diatribe; since people like Kieth will have these feelings, this is a need which should be addressed and could easily be by just having a censor-junk radio button on the Web pages. Anyone who wants to wade through page after page of empty adverts would easily be able to do so. Possibly we just need some clearer set of ways of categorising pages and this is the real answer. 
I don't think that the person who started the thread really wants me to come to his page (I push up his hit rating which he will pay for in one way or another, and I'm not likely to ever buy his particular service).. Perhaps the only way forward in the long run is some set of META info which it's obviously in your best interest to get right. Mutually exclusive values of properties for the page, such as commercial/non-commercial? P.S. a final thought: do the people who do site rating count as censors? If I write a bad web page and it doesn't get included in the Point top 5%, should I sue them :-) Scottish Climbing Archive: Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/ From owner-robots Fri Mar 29 09:11:48 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09553; Fri, 29 Mar 96 09:11:48 -0800 X-Sender: dchandler@abilnet.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Darrin Chandler Subject: Re: Links (don't bother checking; I've done it for you) Date: Fri, 29 Mar 1996 10:11:37 -0700 Message-Id: <19960329171137409.AAA98@ganymede.abilnet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:56 3/29/96 +0100, you wrote: >Then, at the end of his page he had the following keywords repeated > >California >Northwest >Animation >Promotion Web Site >Development Web Site >Knitter Web >Money > >about 50 times each.. > >okay, so the action to take in his case is obvious (deindex any siGHt(e) >beloning to him), but what is the general case to stop this? It's very >difficult as far as I can see. It's deliberate worthless junk, trying to >get in at the level of people who are providing worthwhile information >about california/web sites etc. Well, if you were to analyze the comments of a web page and compare the ratio of unique words to total words, you could cull a large percentage of these types of pages. Even better, you could use that metric to decide whether your indexer should include comments, which means you can still index the page, but without the bogus keywords. ______________________________________________ _/| _| _| _| _/_| _| _| _| _| _|_|_| _/ _| _| _| _| _/_|_| _|_|_| _| _| _| _| _| _| _/ _| _| _| _| _| _| _| _| _| _/ _| _|_|_| _| _| _| _|_| _|_|_| _| _|_|_| Darrin Chandler, Duke of URL Ability Software & Productions Email: dchandler@abilnet.com WWW: http://www.abilnet.com/ ______________________________________________ From owner-robots Fri Mar 29 10:41:54 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10174; Fri, 29 Mar 96 10:41:54 -0800 Date: Fri, 29 Mar 1996 10:53:28 -0800 (PST) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: Links (don't bother checking; I've done it for you) In-Reply-To: <19960329171137409.AAA98@ganymede.abilnet.com> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Fri, 29 Mar 1996, Darrin Chandler wrote: > At 10:56 3/29/96 +0100, you wrote: > >Then, at the end of his page he had the following keywords repeated > > > >California > >Northwest > >Animation > >Promotion Web Site > >Development Web Site > >Knitter Web > >Money > > > >about 50 times each.. > > > >okay, so the action to take in his case is obvious (deindex any siGHt(e) > >beloning to him), but what is the general case to stop this? It's very > >difficult as far as I can see. 
It's deliberate worthless junk, trying to > >get in at the level of people who are providing worthwhile information > >about california/web sites etc. > > Well, if you were to analyze the comments of a web page and compare the > ratio of unique words to total words, you could cull a large percentage of > these types of pages. Even better, you could use that metric to decide > whether your indexer should include comments, which means you can still > index the page, but without the bogus keywords. Ugh. That is a bad heuristic. I use keywords to 'cover the bases' for search engines. IOW I try to guess possible mis-spellings, alternative spellings, related concepts that are likely to be searched on etc. The result is perhaps twenty to fifty unique words that act as a 'wide net' for those looking for information covered by the site. The primary defect of whole body text indexing is that the people issuing search requests are frequently very poor at actually generating *good* search requests that will match all relevant information. You don't believe me? Try searching for information related to rabbits. As sample searches try buns bunny bunnies rabbit rabbits bunnyrabbits "bunny rabbit" "bunny rabbits" buny bunnys devilbunnies ( >;-) ) The individual words usually cough up very different sub-sets of pages related to rabbits. A *good* search request would look for all of them - in the absence of such searches, I would keyword a page to all of them. And I would be correct to do so. But your unique words rejection heuristic would likely deny the page. -- Benjamin Franz From owner-robots Fri Mar 29 10:56:59 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10285; Fri, 29 Mar 96 10:56:59 -0800 Date: Fri, 29 Mar 1996 11:09:48 -0800 X-Sender: dhender@oly.olympic.net Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: dhender@olympic.net (David Henderson) Subject: Re: Links (don't bother checking; I've done it for you) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >>okay, so the action to take in his case is obvious (deindex any siGHt(e) >>beloning to him), but what is the general case to stop this? It's very >>difficult as far as I can see. It's deliberate worthless junk, trying to >>get in at the level of people who are providing worthwhile information >>about california/web sites etc. I am not sure I would disagree with that. Search engines are an important tool. I liked the idea of allowing users to rate a site. I think, as a whole Internet users are reasonable and responsible. I personally have used CGI methods to deliver keyword lists to robots/spiders. However, I consider my action carefully. I feel that I am doing a service to the Internet community by more efficiently delivering information or services that people need or want. I find it interesting that Lycos allows users to manually remove sites from their database. How easy it would be for a competitor to remove MY listing. But, It has never happened. By allowing users to rate a site based on their impression of it's value in relation to their keyword search and then applying that new value in the database would soon encourage Site Authors to be responsible. Not to mention I would be able to find 'sex' and 'porn' so much faster, as all those sites using bogus keywords would soon be devalued. ;/ Thanx, David Henderson, Website developer. 
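To make the ratio test Darrin describes concrete, here is a minimal Perl sketch; the 0.25 threshold, the 20-word minimum and the restriction to HTML comments are illustrative assumptions, not something any engine discussed here is known to use:

#!/usr/bin/perl
# Sketch only: decide whether to index the text inside HTML comments.
# Assumption: a low unique/total word ratio in the comments suggests
# repeated "bogus" keywords, so the comments are dropped while the
# rest of the page is still indexed.  Thresholds are arbitrary.
sub comments_look_bogus {
    my ($html) = @_;
    my $comment_text = '';
    while ($html =~ /<!--(.*?)-->/gs) { $comment_text .= " $1"; }
    my @words = ($comment_text =~ /(\w+)/g);
    return 0 if @words < 20;              # too little text to judge
    my %seen;
    $seen{lc $_}++ for @words;
    my $ratio = keys(%seen) / @words;     # unique words / total words
    return $ratio < 0.25;                 # mostly repetition => treat as stuffed
}

# usage: index the page, but skip comment text when it looks stuffed
# $index_comments = !comments_look_bogus($page_html);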
________________________________________________________________ David Henderson Technical Services iwtnet/QUICKimage WK PH: 206-443-1430 WK FX: 206-443-5670 HM PH/FX: 360-377-2182 http://www.quickimage.com http://www.qinet.com From owner-robots Fri Mar 29 11:18:58 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10489; Fri, 29 Mar 96 11:18:58 -0800 Message-Id: <315C379A.167E@austin.ibm.com> Date: Fri, 29 Mar 1996 13:18:50 -0600 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 2.01 (X11; I; AIX 1) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Links This Site is about Robots Not Censorship References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Michael De La Rue wrote: > > On Thu, 30 Mar 1995, Keith wrote: > > > Are You Kidding or Something ? I thought this mailing list was supposed to be a technical discussion of httpd robots and intelligent network agents. Please post things that don't pertain to "How To Build A (Better) httpbot" somewhere else. -- Rob Turk Unofficially Speaking. "The only thing that saves us from the bureaucracy is its inefficiency."--Eugene McCarthy From owner-robots Fri Mar 29 12:32:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11000; Fri, 29 Mar 96 12:32:06 -0800 Date: Fri, 29 Mar 1996 20:31 UT From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <009A013722D8BEA0.021D@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Re: Links (don't bother checking; I've done it for you) X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I find it interesting that Lycos allows users to manually remove sites from > their database. How easy it would be for a competitor to remove MY listing. It could be confirmed before removal... Lycos (or others) could read the page and send a message to the address in the header asking for confirmation. O.K. People who do not put in the appropriate LINK REV deserve to get their pages deleted ;-) Regards, Martin Kiff mgk@newton.npl.co.uk / mgk@webfeet.co.uk From owner-robots Fri Mar 29 12:52:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11090; Fri, 29 Mar 96 12:52:06 -0800 X-Sender: dchandler@abilnet.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Darrin Chandler Subject: Re: Links (don't bother checking; I've done it for you) Date: Fri, 29 Mar 1996 13:51:42 -0700 Message-Id: <19960329205142245.AAA223@ganymede.abilnet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 10:53 3/29/96 -0800, you wrote: ... >> Well, if you were to analyze the comments of a web page and compare the >> ratio of unique words to total words, you could cull a large percentage of >> these types of pages. Even better, you could use that metric to decide >> whether your indexer should include comments, which means you can still >> index the page, but without the bogus keywords. > >Ugh. That is a bad heuristic. I use keywords to 'cover the bases' for >search engines. IOW I try to guess possible mis-spellings, alternative >spellings, related concepts that are likely to be searched on etc. The >result is perhaps twenty to fifty unique words that act as a 'wide net' >for those looking for information covered by the site. 
The primary defect >of whole body text indexing is that the people issuing search requests >are frequently very poor at actually generating *good* search requests >that will match all relevant information. You don't believe me? ... >The individual words usually cough up very different sub-sets of pages >related to rabbits. A *good* search request would look for all of them - in >the absence of such searches, I would keyword a page to all of them. And I >would be correct to do so. But your unique words rejection heuristic would >likely deny the page. What I proposed was that the page *would* be indexed, but the keywords in comments or META tags may be discarded. Indeed, this would be detrimental in examples such as you gave. However, not everyone is as conscientious when adding keywords to their HTML. In total, I believe the signal:noise ratio would increase using my heuristic, even though some good references would not be returned. Some time soon I'll be ready to test out my ideas. At that time I will certainly post my results, and probably give the URL for people here to see for themselves. In the meantime, I welcome further discussion... ______________________________________________ _/| _| _| _| _/_| _| _| _| _| _|_|_| _/ _| _| _| _| _/_|_| _|_|_| _| _| _| _| _| _| _/ _| _| _| _| _| _| _| _| _| _/ _| _|_|_| _| _| _| _|_| _|_|_| _| _|_|_| Darrin Chandler, Duke of URL Ability Software & Productions Email: dchandler@abilnet.com WWW: http://www.abilnet.com/ ______________________________________________ From owner-robots Fri Mar 29 13:18:07 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11232; Fri, 29 Mar 96 13:18:07 -0800 Date: Fri, 29 Mar 1996 13:16:45 -0800 Message-Id: <199603292116.NAA00675@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: The Letter To End All Letters Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Wow! Quite a lot of different viewpoints! I especially liked the "article" length letter done by Michael De La Rue. The spelling bee idea was a nice touch. (Don't worry, Chris, I didn't get any ideas; I like your sense of humor though.) Well... With all of these opinions I think that I should clear up any misunderstandings to the best of my ability. 1. For one thing I didn't want to get into the subject of censorship. I'll remain neutral in this sensitive matter so as to extinguish this subject as quickly as possible. (This mailing list is about ROBOTS isn't it?) I also believe that search engines should have good quality links; that's one of the reasons I didn't include offbeat subjects. 2. Another letter said that I offer the same services as meta search. I'm not going to argue with that; I do. And as long as we're on the subject, Post Master as well. I feel that I'm going to change this policy due to my belief in quality. Also an interesting note: I don't offer a web authoring tool, although I might in the future, and there is no need to refer to the company as "they". Currently there is only one person, me, working for Web Knitter. And (please don't judge me on this factor) I'm fifteen. I think that covers the majority of the ones sent.
In order not to inflame any more opinions (which I shall use as constructive criticism), and not be taken off of any search engines, I'll take down all of those "key words" that have irritated so many people (it won't get taken out today but should be gone by Tuesday. I'll be out of town this weekend). I hope that this letter will end all of the letters that have been coming and turn the subject back to where it belongs. :) In this letter it isn't my intention to offend anyone in any possible way, and if you find any mistakes in this letter that you may be tempted to reply to and quote, please resist the urge! They weren't written there on purpose. My apologies to all for starting this thread, as I think that I inadvertently set off a bomb that's been waiting to explode. :) So with this written I hope to have all questions answered and will end my letter! Have a nice day! Jared Williams Webmaster of Web Knitter From owner-robots Sat Mar 30 06:19:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02207; Sat, 30 Mar 96 06:19:24 -0800 Date: Sat, 30 Mar 1996 14:19 BST From: MGK@NEWTON.NPL.CO.UK (Martin Kiff) Message-Id: <009A01CC53C1B180.0B7A@NEWTON.NPL.CO.UK> To: robots@webcrawler.com Subject: Heuristics.... X-Vms-To: SMTP%"robots@webcrawler.com" X-Vms-Cc: MGK Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com .... into the stream of discussion about heuristics (which is a way of saying that I've lost track of who has said what... apologies) > Well, if you were to analyze the comments of a web page and compare the > ratio of unique words to total words, you could cull a large percentage of > these types of pages. Even better, you could use that metric to decide > whether your indexer should include comments, which means you can still > index the page, but without the bogus keywords. Do you need this complexity? I guess, and it is only a guess, that people assume a 'WAIS' like behaviour in the weighting. I.e. the number of times *that* word has appeared over the total number of words in the document. (If I've got this wrong you can correct me privately :-). A linear relationship therefore... But does it need to be linear? How does a log (*that* word) / total number behave? Artificially loading the document with keywords then becomes counter-productive as you are also increasing the total number of words. Time for some back of envelope work I think.... > What I proposed was that the page *would* be indexed, but the keywords in > comments or META tags may be discarded. I would vote for ignoring comments but surely you *must* include the Regards, Martin Kiff mgk@newton.npl.co.uk / mgk@webfeet.co.uk From owner-robots Sat Mar 30 08:22:09 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02542; Sat, 30 Mar 96 08:22:09 -0800 Message-Id: <315D6E54.27EE@mail.fc-net.fr> Date: Sat, 30 Mar 1996 17:24:36 +0000 From: christophe grandjacquet Organization: COMME VOUS VOULEZ !
X-Mailer: Mozilla 2.0 (Macintosh; I; 68K) Mime-Version: 1.0 To: robots@webcrawler.com Subject: unscribe References: <19960329205142245.AAA223@ganymede.abilnet.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com unscribe From owner-robots Sat Mar 30 08:23:52 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02552; Sat, 30 Mar 96 08:23:52 -0800 Message-Id: <315D6EBB.6B47@mail.fc-net.fr> Date: Sat, 30 Mar 1996 17:26:19 +0000 From: christophe grandjacquet Organization: COMME VOUS VOULEZ ! X-Mailer: Mozilla 2.0 (Macintosh; I; 68K) Mime-Version: 1.0 To: robots@webcrawler.com Subject: unsubscibe References: <199603292116.NAA00675@norway.it.earthlink.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com unsubscibe From owner-robots Sat Mar 30 18:17:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04818; Sat, 30 Mar 96 18:17:23 -0800 X-Sender: mak@surfski.webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 30 Mar 1996 18:18:52 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Admin: how to get off this list Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com chris cobb suggested I sent out a reminder about getting on/off this list. The complete instructions are on-line on: http://info.webcrawler.com/mailing-lists/robots/info.html and the important thing to remember is to send requests to robots-request@webcrawler.com, or if that fails to owner-robots@webcrawler.com, _not_ robots@webcrawler.com. Regards, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Mon Apr 1 09:28:40 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12649; Mon, 1 Apr 96 09:28:40 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 1 Apr 1996 09:29:25 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >On Fri, 29 Mar 1996, Darrin Chandler wrote: >The individual words usually cough up very different sub-sets of pages >related to rabbits. A *good* search request would look for all of them - in >the absence of such searches, I would keyword a page to all of them. And I >would be correct to do so. But your unique words rejection heuristic would >likely deny the page. You're saying that a search on "buns" should return pages about rabbits? ;-) The nit I'd like to pick here is that you're describing good recall (finding all of the relevant documents), which is only half of the search accuracy problem. The other half is precision, which is finding only relevant documents. A thesaurus/dictionary-based semantic network could return all of the documents that you describe... but the problem would remain that it would *also* return many, many other documents that have words with some sort of linguistic connection to these. Balancing precision and recall is the big problem in search. Robots that compile additional evidence can help in ways that go beyond just indexing the words. For example, capturing HTML zone information can help score documents based on where words appear. 
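As a rough illustration of the zone idea mentioned above, a document's score for a term could be boosted when the term appears in the <TITLE> or a heading rather than only in body text; the 8/4/1 weights in this Perl sketch are invented for the example, not taken from any engine discussed here:

#!/usr/bin/perl
# Sketch: score one document for one query term, giving extra weight
# to "zones" such as <TITLE> and <H1>..<H6>.  Weights are arbitrary.
sub zone_score {
    my ($html, $term) = @_;
    my $score = 0;
    my ($title)   = $html =~ /<title>(.*?)<\/title>/is;
    my @headings  = $html =~ /<h[1-6][^>]*>(.*?)<\/h[1-6]>/igs;
    $score += 8 * count_term($title,      $term);   # title zone
    $score += 4 * count_term("@headings", $term);   # heading zones
    $score += 1 * count_term($html,       $term);   # everything else
    return $score;
}

sub count_term {
    my ($text, $term) = @_;
    return 0 unless defined $text;
    my $n = 0;
    $n++ while $text =~ /\b\Q$term\E\b/ig;
    return $n;
}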
Nick From owner-robots Mon Apr 1 09:28:31 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12639; Mon, 1 Apr 96 09:28:31 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 1 Apr 1996 09:29:12 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Heuristics.... Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Do you need this complexity? I guess, and it is only a guess, that people >assume a 'WAIS' like behaviour in the weighting. I.e. the number of times >*that* word has appeared over the total number of words in the document. >(If I've got this wrong you can correct me privately :-). A linear >relationship therefore... When you say "people assume..." do you mean users? In that case, I think they assume that don't have to worry about the underlying algorithms; people want relevant results, which shouldn't have to depend on any deep understanding of what happens behind the scenes. >But does it need to be linear? How does a > > log (*that* word) / total number Our density operator is not linear... and it gives pretty good results, although density is rarely the only evidence involved in coming up with a relevancy score. Nick From owner-robots Mon Apr 1 15:01:57 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA16254; Mon, 1 Apr 96 15:01:57 -0800 Date: Mon, 1 Apr 1996 15:13:30 -0800 (PST) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: Search accuracy In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Mon, 1 Apr 1996, Nick Arnett wrote: > >On Fri, 29 Mar 1996, Darrin Chandler wrote: > > >The individual words usually cough up very different sub-sets of pages > >related to rabbits. A *good* search request would look for all of them - in > >the absence of such searches, I would keyword a page to all of them. And I > >would be correct to do so. But your unique words rejection heuristic would > >likely deny the page. > > You're saying that a search on "buns" should return pages about rabbits? ;-) Actually, yes. ;-). Those who frequently talk about rabbits use 'buns' as a synonym for rabbits. And a search on buns on Alta Vista *does* return pages involving rabbits. Along with discussions of food, hair, long distance running in cold weather, as well as human and non-human anatomy. This is where skill in constructing a search to exclude things that are *not* of interest comes in handy. > The nit I'd like to pick here is that you're describing good recall > (finding all of the relevant documents), which is only half of the search > accuracy problem. The other half is precision, which is finding only > relevant documents. A thesaurus/dictionary-based semantic network could > return all of the documents that you describe... but the problem would > remain that it would *also* return many, many other documents that have > words with some sort of linguistic connection to these. Yup. > Balancing precision and recall is the big problem in search. Robots that > compile additional evidence can help in ways that go beyond just indexing > the words. For example, capturing HTML zone information can help score > documents based on where words appear. 
The general problem is that while as an author I can tell the search engines that a list of words are relevant to the topic of my page, it is incumbent on the *searcher* to exclude irrelevant topics - because I have no way to determine that as an author. If the search engines even *allowed* specifiying a list of irrelevant but potentially searched keywords, it would help. So when someone searched 'buns AND pictures' I could rank my pages *lower*. But even that marginal assistance is not available with the current search engines. Parsing the HTML structure simply will not (cannot) resolve the search problem of 'buns'. In *each* of the ones listed in my example, 'buns' are in fact the highly relevant element of each page - but only a sub-set are relevant to *me* as I am only interested in one *kind* of buns (Well, ok, I'm interested in some of the others. But they still are not relevant to my search for information on rabbits). -- Benjamin Franz From owner-robots Mon Apr 1 17:23:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00336; Mon, 1 Apr 96 17:23:23 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 1 Apr 1996 17:20:39 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 3:13 PM 4/1/96, Benjamin Franz wrote: >On Mon, 1 Apr 1996, Nick Arnett wrote: >> You're saying that a search on "buns" should return pages about rabbits? ;-) > >Actually, yes. ;-). Those who frequently talk about rabbits use 'buns' as a >synonym for rabbits. And a search on buns on Alta Vista *does* return pages >involving rabbits. Along with discussions of food, hair, long distance >running in cold weather, as well as human and non-human anatomy. This is >where skill in constructing a search to exclude things that are *not* of >interest comes in handy. I'm not sure that this is a good direction -- expecting people to define the subjects to exclude. After all, this isn't how we tend to "search" when we have a human being to help us. If you walked up to a reference librarian and said "Rabbits," what kind of response would you expect? I think it would be something along the lines of "What about rabbits?" Yet we expect computers to be better mind readers than humans! Fuzzy logic -- the more evidence, the better -- seems to get people to relevant documents with fewer iterations. For example, you could probably come up with a query that would get rid of the documents that use "buns" to refer to anatomy (though it's not obvious to me, actually), but why not spend that energy and time providing more words, phrases and other evidence that a document is about rabbits, so that the anatomy documents fall to the bottom of the relevancy list? Having said all that, I should add that I realize that most of the search engines used by the major Internet search services don't support this kind of search -- they're generally limited to Boolean logic. As robots are increasingly able to extract evidence from documents and context, the limitations of Boolean search will become more and more obvious. Ditto as people learn the more sophisticated search query techniques. 
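One way to picture the "more evidence" ranking described above, as opposed to a strict Boolean match: accumulate a score from however many of the query's evidence terms a document contains, so pages matching several rabbit-related terms float above pages that merely mention "buns". The term list and weights in this Perl sketch are invented for illustration:

#!/usr/bin/perl
# Sketch: rank documents by accumulated evidence rather than Boolean
# match.  %evidence maps query terms to made-up weights; a document
# scores higher the more of them it contains.
my %evidence = (
    'rabbit'  => 1.0,
    'rabbits' => 1.0,
    'bunny'   => 0.8,
    'bunnies' => 0.8,
    'buns'    => 0.3,   # weak on its own, useful alongside the others
);

sub evidence_score {
    my ($text) = @_;
    my $score = 0;
    for my $term (keys %evidence) {
        $score += $evidence{$term} if $text =~ /\b\Q$term\E\b/i;
    }
    return $score;
}

# rank fetched documents, highest score first (assuming %docs maps URL => text)
# my @ranked = sort { evidence_score($docs{$b}) <=> evidence_score($docs{$a}) } keys %docs;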
Nick From owner-robots Mon Apr 1 17:54:45 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00699; Mon, 1 Apr 96 17:54:45 -0800 Date: Mon, 1 Apr 1996 18:01:11 -0800 (PST) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: Search accuracy In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Mon, 1 Apr 1996, Nick Arnett wrote: > At 3:13 PM 4/1/96, Benjamin Franz wrote: > >On Mon, 1 Apr 1996, Nick Arnett wrote: > > >> You're saying that a search on "buns" should return pages about rabbits? ;-) > > > >Actually, yes. ;-). Those who frequently talk about rabbits use 'buns' as a > >synonym for rabbits. And a search on buns on Alta Vista *does* return pages > >involving rabbits. Along with discussions of food, hair, long distance > >running in cold weather, as well as human and non-human anatomy. This is > >where skill in constructing a search to exclude things that are *not* of > >interest comes in handy. > > I'm not sure that this is a good direction -- expecting people to define > the subjects to exclude. After all, this isn't how we tend to "search" > when we have a human being to help us. If you walked up to a reference > librarian and said "Rabbits," what kind of response would you expect? I > think it would be something along the lines of "What about rabbits?" Yet > we expect computers to be better mind readers than humans! Not really. The whole search page is an implicit 'information on X' request. More significantly - a great deal of library science has to deal with the issues of categorization and cross-referencing. The lack of which on the web is the fundamental issue that lead to full body text indexing being the search mechanism of choice on the WWW in the first place. > Fuzzy logic -- the more evidence, the better -- seems to get people to > relevant documents with fewer iterations. For example, you could probably > come up with a query that would get rid of the documents that use "buns" to > refer to anatomy (though it's not obvious to me, actually), but why not > spend that energy and time providing more words, phrases and other evidence > that a document is about rabbits, so that the anatomy documents fall to the > bottom of the relevancy list? Better yet might be iterated searching with ratings. You do an initial search, then you can mark matches on the first page of returned results for relevancy and rekey the search. The search engine could then re-rank the returned results via a smart attempt to place the docs in an N-space based on word frequencies and other measurable properties of the documents. In a primitive way, Alta Vista kind of does this with its ranking options for search terms. -- Benjamin Franz From owner-robots Mon Apr 1 19:05:07 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01034; Mon, 1 Apr 96 19:05:07 -0800 From: mred@neosoft.com Message-Id: <199604020305.VAA29842@sam.neosoft.com> To: robots@webcrawler.com X-Mailer: Post Road Mailer (Green Edition Ver 1.05c) Date: Mon, 1 Apr 1996 21:03:07 CST Subject: Re: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ** Reply to note from Benjamin Franz 04/01/96 6:01pm -0800 > Better yet might be iterated searching with ratings. You do an initial > search, then you can mark matches on the first page of returned results > for relevancy and rekey the search. 
The search engine could then re-rank > the returned results via a smart attempt to place the docs in an N-space I once saw a search engine which did this. What was the name of it? -Ed- mred@neosoft.com From owner-robots Wed Apr 3 10:12:00 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10218; Wed, 3 Apr 96 10:12:00 -0800 Subject: Re: Search accuracy From: YUWONO BUDI To: robots@webcrawler.com Date: Thu, 4 Apr 1996 02:11:41 +0800 (HKT) In-Reply-To: <199604020305.VAA29842@sam.neosoft.com> from "mred@neosoft.com" at Apr 1, 96 09:03:07 pm X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 797 Message-Id: <96Apr4.021146hkt.19061-3812+174@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > ** Reply to note from Benjamin Franz 04/01/96 6:01pm > -0800 > > > Better yet might be iterated searching with ratings. You do an initial > > search, then you can mark matches on the first page of returned results > > for relevancy and rekey the search. The search engine could then re-rank > > the returned results via a smart attempt to place the docs in an N-space > > I once saw a search engine which did this. What was the name of it? We have an experimental search engine that does this so called relevance feedback at: http://dbx.cs.ust.hk:8000/index.html (note: it's not the default page) The idea is to expand the initial query by adding keywords from hit-URL's marked as `relevant' (with respect to user's query) by the user. -Budi. From owner-robots Thu Apr 4 10:00:41 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17663; Thu, 4 Apr 96 10:00:41 -0800 Message-Id: <9604041654.AA2522@worldcom-45.worldcom.com> To: Nick Arnett Cc: robots , Paul Nelson , dschulze , pcondo From: Judy Feder Date: 4 Apr 96 7:05:32 Subject: Re: Search accuracy Mime-Version: 1.0 Content-Type: Text/Plain Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You're saying that a search on "buns" should return pages about rabbits? ;-) The nit I'd like to pick here is that you're describing good recall (finding all of the relevant documents), which is only half of the search accuracy problem. The other half is precision, which is finding only relevant documents. A thesaurus/dictionary-based semantic network could return all of the documents that you describe... but the problem would remain that it would *also* return many, many other documents that have words with some sort of linguistic connection to these. Balancing precision and recall is the big problem in search. Robots that compile additional evidence can help in ways that go beyond just indexing the words. For example, capturing HTML zone information can help score documents based on where words appear. Nick Re: Nick's comments on semantic networks. I'm very pleased to see him giving a plug for semantics, but I'd like to clarify one thing. A true semantic network (which today is only offered by Excalibur Technologies' RetrievalWare) does not force the user to make the precision/recall tradeoff. Yes, the semantic network does boost recall by building in literally millions of word links (so, stock is linked to equity, share, trade, bond, security, etc.). However, unlike a thesaurus, or any other tool used in search engines today, the semantic network also lets you specify word meaning. Thus, you can specify a search on stock as "shares issued by a company...," telling the system to ignore references to soup stock, live stock, retail stock, etc. 
I agree, that leveraging fielded or zone information is also a very useful part of the mix. But, the bottom line is that semantic networks provide the most accurate searching -- precision and recall -- available to users and Web site developers today. For more on this, see the TREC results at the NIST WWW site, Judy From owner-robots Thu Apr 4 11:01:07 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18104; Thu, 4 Apr 96 11:01:07 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 4 Apr 1996 11:01:45 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >A true semantic network ... does not >force the user to make the precision/recall tradeoff. Yes, the semantic >network does boost recall by building in literally millions of word links (so, >stock is linked to equity, share, trade, bond, security, etc.). However, >unlike a thesaurus, or any other tool used in search engines today, the >semantic network also lets you specify word meaning. Thus, you can specify a >search on stock as "shares issued by a company...," telling the system to >ignore references to soup stock, live stock, retail stock, etc. If there exists a search technology that is so accurate that it never finds irrelevant documents and always finds all of the relevant ones, we'd like to buy it. Any time you ask a more accurate question of a good search engine, you'll get more accurate results, regardless of whether you're using a "true" semantic network, knowledgebase, Cliff Notes or anything else that helps define the concept on which you searching. In any event, this isn't the place to flog our products, features, etc. One of the things that I'd find really interesting would be research into the construction of semantic networks or other knowledgebases from Web topology. That would be a fascinating byproduct of a spider's explorations. I only know of one experiment along those lines, being done by one of our customers. Anyone else looking at this? Nick From owner-robots Thu Apr 4 12:09:49 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18604; Thu, 4 Apr 96 12:09:49 -0800 Message-Id: <199604042009.PAA18569@play.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: Search accuracy In-Reply-To: Your message of "Thu, 04 Apr 1996 11:01:45 PST." Date: Thu, 04 Apr 1996 15:09:42 -0500 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > One of the things that I'd find really interesting would be research into > the construction of semantic networks or other knowledgebases from Web > topology. That would be a fascinating byproduct of a spider's > explorations. I only know of one experiment along those lines, being done > by one of our customers. Anyone else looking at this? i think some people around here, in Judith Klavan's or Kathy McKeown's group(s), have done and are doing work in this vein. i looked around but couldn't find anything. 
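For illustration only, the expansion side of such a word-link network can be sketched in a few lines of Perl; the hand-built table below is a toy stand-in for the million-link networks described above and is not drawn from any real product:

#!/usr/bin/perl
# Sketch: expand a query term through a tiny hand-built table of word
# links -- a much simplified flavour of a semantic network.  The table
# entries are invented examples.
my %links = (
    'stock' => [ 'equity', 'share', 'bond', 'security' ],
    'java'  => [ 'sun microsystems', 'object-oriented', 'applet' ],
);

sub expand_query {
    my (@terms) = @_;
    my %out;
    for my $t (map { lc } @terms) {
        $out{$t} = 1;
        $out{$_} = 1 for @{ $links{$t} || [] };
    }
    return keys %out;
}

# expand_query('stock') => stock, equity, share, bond, security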
-john From owner-robots Thu Apr 4 12:23:58 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18712; Thu, 4 Apr 96 12:23:58 -0800 Message-Id: <199604041933.NAA14325@whistler.dorm.net> Comments: Authenticated sender is From: "Andy Warner" To: robots@webcrawler.com Date: Thu, 4 Apr 1996 14:23:45 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Clean up Bots... Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com First off, is anybody running a 'bot dedicated purely to cleaning up their database? If not, why not ? Also, how tough would it be to have a user scheduled cleaning bot. I don't mean the tradtional go find the delete entry page and submit the link that doesn't work. Why not either have a link right there to automatically recheck the site or have a script checking the referer logs for 404s (and other errors). For every error site send the bot to verify the problem and delete the entry if it is valid. -- Andy Warner andy@andy.net 01000001 01101110 01100100 01110010 01100101 01110111 01010111 01100001 01110010 01101110 01100101 01110010 http://www.andy.net/~andy/ http://www.dorm.net From owner-robots Thu Apr 4 14:50:13 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19673; Thu, 4 Apr 96 14:50:13 -0800 Date: Thu, 4 Apr 1996 16:50:08 -0600 (CST) From: Daniel C Grigsby To: robots@webcrawler.com Subject: Re: Search accuracy In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 4 Apr 1996, Nick Arnett wrote: > > One of the things that I'd find really interesting would be research into > the construction of semantic networks or other knowledgebases from Web > topology. That would be a fascinating byproduct of a spider's > explorations. I only know of one experiment along those lines, being done > by one of our customers. Anyone else looking at this? > > Nick, could you please expand this last paragraph. I'd love to hear your ideas. Sounds fascinating. Thanks, Dan From owner-robots Thu Apr 4 16:03:48 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20053; Thu, 4 Apr 96 16:03:48 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 4 Apr 1996 16:04:30 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Nick, could you please expand this last paragraph. I'd love to hear your >ideas. Sounds fascinating. If you assume that Web authors are creating links that have some sort of conceptual logic behind them, then a robot may be able to infer knowledge from the choice of documents that are linked, as well as the locations of the links and the link texts. For example, look at the highly structured pages of Yahoo, where the text of each link actually describes the pages to which it is linked. Let's take "Java," for example. One might guess that if a robot examined the pages linked to the word Java, it might often find terms such as "object-oriented" and "Sun Microsystems," to name a couple. Although your software might not be able to figure out the nature of the conceptual connections among them, it could observe the connection and use them as evidence when someone searches on "Java." 
That is to say, when someone would search on "Java," documents that contain "object-oriented" and "Sun Microsystems" would be ranked as more relevant. This assumes a completely automated approach. Probably more practical would be to present the results of the analysis to a human editor, who could tune the knowledgebase. This could help address the big problem with semantic networks and other sorts of conceptual knowledgebases -- automation of their creation and maintenance. There have been two general approaches. One is to automatically extract a semantic network from a dictionary. This works well within the limits of a dictionary's vocabulary, but many, if not most of the interesting words, especially for new information, (the proper noun "Java," for example) aren't in dictionaries. The alternative is to build your own knowledgebase from the ground up, but that's not easy. The results are effective, but few people have the resources to build robust ones. Even more interesting to me (and perhaps more practical) is the idea of using a robot to extract subjective information from the Web. For example, if you could accurately recognize people's "my favorite links" lists, you might be able to come up with a pool of opinions rapidly. Then you could do the kind of analysis that Pattie Maes has been doing at the MIT Media Lab -- if you like "A" and "B" and I like "A", then my agent will bring "B" to my attention. Of course, as one of our engineers observed, we might discover that the most talked-about concept on the Web is "click here" or perhaps "under construction." ;-) Nick From owner-robots Thu Apr 4 17:00:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20393; Thu, 4 Apr 96 17:00:24 -0800 Message-Id: From: cs31dw@ee.surrey.ac.uk (David A Weeks) Subject: Re: Search accuracy To: robots@webcrawler.com Date: Fri, 5 Apr 1996 02:00:15 +0100 (BST) In-Reply-To: from "Daniel C Grigsby" at Apr 4, 96 04:50:08 pm X-Mailer: ELM [version 2.4 PL24] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1024 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > On Thu, 4 Apr 1996, Nick Arnett wrote: > > > > > One of the things that I'd find really interesting would be research into > > the construction of semantic networks or other knowledgebases from Web > > topology. That would be a fascinating byproduct of a spider's > > explorations. I only know of one experiment along those lines, being done > > by one of our customers. Anyone else looking at this? > > > > My final year project is exactly this. I am using a robot to compile a knowledge-base on various keywords. The documentation is at : http://eeisun2.city.ac.uk/~ftp/Guinness/Hello.html Early results are at : http://eeisun2.city.ac.uk/~ftp/Guinness/results2.html It would appear that a knowledge-base hooked up to an index catalogue can be extremely useful in returning more accurate results. Naturally, the user still has a lot of the responsibility in making his query as less ambigous as possible. Any feedback will be much appreciated. --------------------- Dave Weeks. University of Surrey. 
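A crude sketch of the byproduct described above, and the kind of data a project like this might collect: for every link a spider follows, associate the words in the anchor text with the words found in the linked document, so that a term like "Java" accumulates co-occurring evidence such as "Sun" or "object-oriented". The fetch call assumes libwww-perl is available; the stop list and data layout are arbitrary:

#!/usr/bin/perl
# Sketch: while crawling, tally which words co-occur behind links whose
# anchor text contains a given term, e.g. $assoc{'java'}{'sun'}++.
use LWP::Simple;                 # assumes libwww-perl is installed

my %assoc;                       # term => { co-occurring word => count }
my %stop = map { $_ => 1 } qw(the and a of to in is for click here);

sub note_link {
    my ($anchor_text, $url) = @_;
    my $page = get($url) or return;
    $page =~ s/<[^>]*>/ /g;                        # crude de-tagging
    my @words = grep { !$stop{$_} } map { lc } ($page =~ /(\w+)/g);
    for my $term (map { lc } ($anchor_text =~ /(\w+)/g)) {
        next if $stop{$term};
        $assoc{$term}{$_}++ for @words;
    }
}

# later, the most frequent entries under $assoc{'java'} become extra
# evidence when someone searches on "java".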
From owner-robots Thu Apr 4 17:27:27 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20522; Thu, 4 Apr 96 17:27:27 -0800 Message-Id: <199604050125.UAA01755@mail.internet.com> Comments: Authenticated sender is From: "Robert Raisch, The Internet Company" Organization: The Internet Company To: Judy Feder , robots@webcrawler.com Date: Thu, 4 Apr 1996 20:22:09 -0400 Subject: Re: Search accuracy Cc: Paul Nelson , dschulze , pcondo , matt@pls.com Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com It should be noted that while semantic networks offer greater accuracy in the assessment of search results, they imply a rather crushing editorial burden on *someone* as well. They also address only one form of search and not the most prevalent form at that. It is all well and fine to limit your search to stock -- a term related to finance and corporate worth, until you search for feldarcarb -- a recently coined term related to initial offerings of Internet-related companies whose value has been artificially inflated by ignorance on the part of the market and the public. (Apologies to Battlestar Galactica.) Unless someone has updated the network to include this new terminology and how it relates to the rest of the corpus, your search may prove less than completely satisfying. And herein is the problem, how often is the semantic network enriched and by whom? Since language and meaning are not static, easily encompass-able things, an effective semantic network must be considered to be a verb rather than a noun. The whom above is usually an expert in the chosen field rather than a general purpose editor or information specialist. This is another problem relating to semantic networks as the needed experts would rather be doing something interesting, something other than creating what amounts to a very detailed dictionary of the specific terminology within a particular field. This is why semantic network based indexing is popular in highly technical fields such as medicine, where the changes to the dictionary are not especially frequent. "Femur" has been around a very long time indeed. This raises another issue I have with semantic-network indexing, that of the difference between what I call the Expanders and the Contractors. Expanders are your typical searcher, where one broadens the initial search, casting the net ever wider to catch more information rather than less. This kind of behaviour is typical of the general public and allows one to capture results that otherwise might have eluded the searcher usually because of lack of forethought in constructing the search. Most of us do not spend much time building an effective search schema before we jump into the index and thus depend on the serendipitous nature of this kind of search to show us something for which we were originally unprepared and might not original expect to find without help. Contractors are those, usually in a highly technical field, who know there is one perfect match for the search, if only they could specify their query properly. Semantic networks are perfect for this kind of searcher, providing much of the "inferred" value of the search -- where *I* know which stock I mean when I start out and only wish to see those results that have something very specifically to do with finance. 
All in all, I prefer approaches like PLS' where the document is subjected to a statistical analysis, one where each word is indexed as well as its relationship with all the other words in the document. Along with the standard arsenal of boolean, fielded, and adjacency search features, I feel this represents a very reasonable "middle-ground" for the typical searcher. Of course, most of this is more than a little academic since the *vast* majority of all searches initiated online are for single keywords rather than more complexly constructed, multi-termed queries. Go figure. We have a wealth of brushes, paints and inks within easy reach and yet most of us would rather photocopy pretty pictures out of coffee table books. ;) From owner-robots Thu Apr 4 18:10:45 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21885; Thu, 4 Apr 96 18:10:45 -0800 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 4 Apr 1996 18:11:28 -0800 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 8:22 PM 4/4/96, Robert Raisch, The Internet Company wrote: >All in all, I prefer approaches like PLS' where the document is >subjected to a statistical analysis, one where each word is >indexed as well as its relationship with all the other words in >the document. Along with the standard arsenal of boolean, >fielded, and adjacency search features, I feel this represents a >very reasonable "middle-ground" for the typical searcher. I think you're saying two things at once here -- statistical analysis helps, but a variety of algorithms/operators is important. This would seem to be quite true; as was said here earlier, the more evidence, the better. Statistical analysis (aside from its indexing speed and size issues) is done on a corpus, not individual documents. This presents the problem of combining search results from multiple corpuses. That's not an issue until you try to leverage search across a bunch of indexes whose corpuses have different co-occurring word frequencies. We find that customers don't generally turn on our statistical operators when they're available. Do you get better search results with co-occurring word ("concept") search turned on? Search algorithms are like cold medicine -- if you combine a bunch of them, you minimize the side effects. >Of course, most of this is more than a little academic since >the *vast* majority of all searches initiated online are for >single keywords rather than more complexly constructed, >multi-termed queries. I'm not sure this ends up being true; even though each search may add just one term, people often are building multi-word searches through trial and error. There aren't many one-word searches that yield useful results on the big Web indexes, in my experience. I suspect that search is like page layout when PageMaker came out. No one thought they'd need to learn typesetting "language," but they did. Today, people don't think they'll learn query languages... but I predict that the basics of a query language will be familiar to most Internet users within a few years. Of course, the question is, what query language...
;-) Nick From owner-robots Fri Apr 5 04:59:15 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23468; Fri, 5 Apr 96 04:59:15 -0800 Date: Fri, 5 Apr 1996 07:59:11 -0500 (EST) From: Ellen M Voorhees Message-Id: <199604051259.HAA11208@hawk.scr.siemens.com> To: robots@webcrawler.com Subject: Re: Search accuracy Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > >All in all, I prefer approaches like PLS' where the document is > >subjected to a statistical analysis, one where each word is > >indexed as well as its relationship with all the other words in > >the document. Along with the standard arsenal of boolean, > >fielded, and adjacency search features, I feel this represents a > >very reasonable "middle-ground" for the typical searcher. > > I think you're saying two things at once here -- statistical analysis > helps, but a variety of algorithms/operators is important. This would > seem to be quite true; as was said here earlier, the more evidence, the > better. Statistical analysis (aside from its indexing speed and size issues) > is done on a corpus, not individual documents. This presents the problem > of combining search results from multiple corpuses. That's not an issue > until you try to leverage search across a bunch of indexes whose corpuses > have different co-occurring word frequencies. We find that customers don't > generally turn on our statistical operators when they're available. Do you > get better search results with co-occurring word ("concept") search > turned on? In many retrieval systems (e.g., SMART, INQUERY) the functions used to weight terms include both within-document factors (number of times the term occurs in the document) and corpus-wide factors (number of documents in which the term occurs). These systems get much better results with weighting schemes that include both factors as compared to the results obtained when using weighting schemes that lack one of the factors. What are TOPIC's statistical operators? The combination of search results from multiple corpora is receiving attention in the text retrieval research community. The TREC (Text REtrieval Conference) workshop series sponsored by NIST has a track devoted to the topic (the Database Merging Track) that I lead. There are also a couple of papers on the topic in the SIGIR-95 proceedings: one paper by Jamie Callan and his colleagues at UMASS and one by me and my colleagues at Siemens. > >Of course, most of this is more than a little academic since > >the *vast* majority of all searches initiated online are for > >single keywords rather than more complexly constructed, > >multi-termed queries. > > I'm not sure this ends up being true; even though each search may add just > one term, people often are building multi-word searches through trial and > error. There aren't many one-word searches that yield useful results on > the big Web indexes, in my experience. > > I suspect that search is like page layout when PageMaker came out. No one > thought they'd need to learn typesetting "language," but they did. Today, > people don't think they'll learn query languages... but I predict that the > basics of a query language will be familiar to most Internet users within a > few years. Of course, the question is, what query language... ;-) > > Nick I disagree that people are going to learn query languages to search the Internet. The statistical systems mentioned above do a very good job of retrieving relevant documents using English phrases as a query.
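The two factors mentioned above combine into the familiar tf*idf style of term weight. A minimal Perl sketch of one common form follows; the exact formula and log base vary between systems (SMART, INQUERY and others use their own variants), so this is only illustrative:

#!/usr/bin/perl
# Sketch: weight of a term in a document = within-document factor (tf)
# times corpus-wide factor (idf).  One common form among many.

# $tf       = times the term occurs in this document
# $df       = number of documents in the corpus containing the term
# $num_docs = total number of documents in the corpus
sub tf_idf {
    my ($tf, $df, $num_docs) = @_;
    return 0 if $tf == 0 || $df == 0;
    return (1 + log($tf)) * log($num_docs / $df);
}

# e.g. a term occurring 3 times, found in 10 of 10,000 documents:
# tf_idf(3, 10, 10000) is roughly 2.1 * 6.9 = 14.5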
Ellen Voorhees Siemens Corporate Research, Inc. ellen@scr.siemens.com From owner-robots Fri Apr 5 06:04:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23603; Fri, 5 Apr 96 06:04:53 -0800 Date: Fri, 5 Apr 1996 06:16:16 -0800 (PST) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: Search accuracy In-Reply-To: Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Thu, 4 Apr 1996, Nick Arnett wrote: > I'm not sure this ends up being true; even though each search may add just > one term, people often are building multi-word searches through trial and > error. There aren't many one-word searches that yield useful results on > the big Web indexes, in my experience. It seem probable that the usefulness of single word searches varies about inversely with general interest of the topic. -- Benjamin Franz "Who is pondering the fact that a search on the word 'snowhare' returns almost *nothing* but pages by or related to himself." From owner-robots Fri Apr 5 07:55:00 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23927; Fri, 5 Apr 96 07:55:00 -0800 Date: Fri, 5 Apr 96 16:54 BST-1 From: cgoodier@cix.compulink.co.uk (Colin Goodier) Subject: Re: (Fwd) Re: Search accuracy To: robots@webcrawler.com Message-Id: X-Mailer: Cixread v3.5 (Registered to cgoodier) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In-Reply-To: Jorg, Thanks for the message re 'semantic networks' , looks useful. I'll take a look at it some time over Easter (probably Sunday). I've been working my way through some of the stuff I've accumulated to read, on HTTP, TCP etc. Hopefully by the time Easter's over I'll have got through that and the WW3 library stuff. I picked up a bit of code that might be useful from an Aussie site on Berkeley Sockets (how to implement TCP/IP). So I think I'm beginning to get a glimmering of how to do this! Perhaps we should meet sometime next week? I'll be in on Thursday morning for a TSC assignment group meeting. Fortunately there's some overlap between this coursework and the TSC coursework, or at least the area that I'm covering, as I'm researching Internet Protocols for that one! I'm just looking at Linux again to see if there's anything I can do with that to get started with playing with TCP etc. Still having to run it off the CD-ROM due to lack of space however :-( Colin From owner-robots Fri Apr 5 09:33:30 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24383; Fri, 5 Apr 96 09:33:30 -0800 Message-Id: <9604051733.AA24377@webcrawler.com> Date: Fri, 5 Apr 1996 09:29:00 -0800 From: Ted Sullivan Subject: Re: Search accuracy To: robots Cc: John Spencer X-Mailer: Worldtalk (NetConnex V3.50c)/MIME Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Yes, I am. My interest is related to data mining Engineering Design / Construct / Maintain information related to plant development. Mostly focused on the Pulp and Paper & Mining Industries. The goal is to present the user a Web Interface that allows the user to query the whole dataset ( 2d, 3d, commercial data ( purchase orders, engineering specs etc), project schedule, documentation (memo's, faxes, contracts and calculation sheets), and ISO 9000 procedures ... ) and get back links to the data regardless of what Engineering application created it or uses it. 
Of course this is more then a Web server or a search engine, it is trying to locate important information in the off hours and store up references ( in a OODBMS) so that during the working hours my customers can find what they are looking for. The system uses a inference engine to try and locate the data and tie it together. The main idea is that once you find a useful piece of information the page has links to all other related piece and you can navigate around in a dynamic document. Bold, yes; can I actually do it, good question; am I trying, sure am. Ted ---------- From: robots To: robots Subject: Re: Search accuracy Date: Thursday, April 04, 1996 11:26AM >A true semantic network ... does not >force the user to make the precision/recall tradeoff. Yes, the semantic >network does boost recall by building in literally millions of word links (so, >stock is linked to equity, share, trade, bond, security, etc.). However, >unlike a thesaurus, or any other tool used in search engines today, the >semantic network also lets you specify word meaning. Thus, you can specify a >search on stock as "shares issued by a company...," telling the system to >ignore references to soup stock, live stock, retail stock, etc. If there exists a search technology that is so accurate that it never finds irrelevant documents and always finds all of the relevant ones, we'd like to buy it. Any time you ask a more accurate question of a good search engine, you'll get more accurate results, regardless of whether you're using a "true" semantic network, knowledgebase, Cliff Notes or anything else that helps define the concept on which you searching. In any event, this isn't the place to flog our products, features, etc. One of the things that I'd find really interesting would be research into the construction of semantic networks or other knowledgebases from Web topology. That would be a fascinating byproduct of a spider's explorations. I only know of one experiment along those lines, being done by one of our customers. Anyone else looking at this? Nick From owner-robots Sat Apr 6 19:54:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01904; Sat, 6 Apr 96 19:54:32 -0800 X-Sender: Mitchell Elster X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Mitchell Elster Subject: VB and robot development Date: Sat, 06 Apr 1996 23:04:15 Message-Id: <19960406230415.0285bf38.in@BitMaster> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Anyone developing robots in VB? If not, why not? If so, I'm looking for an HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be appreciated. Thanks, Mitch Elster elsterm@bwcc.com From owner-robots Sun Apr 7 00:39:47 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04940; Sun, 7 Apr 96 00:39:47 -0800 From: mred@neosoft.com Message-Id: <199604070839.DAA25961@sam.neosoft.com> To: robots@webcrawler.com X-Mailer: Post Road Mailer (Green Edition Ver 1.05c) Date: Sun, 7 Apr 1996 03:37:12 CST Subject: Re: VB and robot development Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ** Reply to note from Mitchell Elster 04/07/96 03:24am > Anyone developing robots in VB? If not, why not? If so, I'm looking for an > HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be > appreciated. Um--perhaps because VB doesn't run under UNIX? 
-Ed- mred@neosoft.com From owner-robots Sun Apr 7 00:52:45 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04986; Sun, 7 Apr 96 00:52:45 -0800 Date: Sun, 7 Apr 1996 04:51:52 -0400 Message-Id: <199604070851.EAA29323@dal1820.computek.net> From: erc@dal1820.computek.net To: robots@webcrawler.com Subject: Re: VB and robot development Cc: elsterm@bwcc.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Interesting idea - I was thinking about this idea this weekend, in fact. You don't need to really do much HTML parsing, though - just look for links. Easy enough to do... ______________________________ Reply Separator _________________________________ Subject: VB and robot development Sent To: robots@webcrawler.com Author: elsterm@bwcc.com Reply To: robots@webcrawler.com Date: 4/7/96 1:36:11 AM Anyone developing robots in VB? If not, why not? If so, I'm looking for an HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be appreciated. Thanks, Mitch Elster elsterm@bwcc.com From owner-robots Sun Apr 7 05:37:11 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05313; Sun, 7 Apr 96 05:37:11 -0700 Comments: Authenticated sender is From: "Jakob Faarvang" Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Sun, 7 Apr 1996 14:35:54 +0100 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: VB and robot development Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Message-Id: 12390778817265@cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Anyone developing robots in VB? If not, why not? If so, I'm looking for an > HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be > appreciated. We're developing our robot in VB. - Jakob From owner-robots Sun Apr 7 18:10:03 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08518; Sun, 7 Apr 96 18:10:03 -0700 From: Ian McKellar Message-Id: <199604080109.JAA11914@tartarus.uwa.edu.au> Subject: Re: VB and robot development To: robots@webcrawler.com Date: Mon, 8 Apr 1996 09:09:56 +0800 (WST) Cc: [3~@tartarus.uwa.edu.au In-Reply-To: "12390778817265@cybernet.dk" from Jakob Faarvang at "Apr 7, 96 02:35:54 pm" X-Mailer: ELM [version 2.4ME+ PL11 (25)] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Anyone developing robots in VB? If not, why not? If so, I'm looking for an > > HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be > > appreciated. > > We're developing our robot in VB. I'm quite impressed. VB seems to me to be a bit underpowered. I would use Perl or C. I do hope you are using VB under NT. 
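A minimal sketch, in Perl rather than VB, of the "just look for links" approach mentioned above. It assumes the page has already been fetched into $html (an assumption for illustration, not part of any robot discussed here) and pulls out the HREF targets of straightforward anchor tags; it is not a real HTML parser and will miss unusual markup.

    # Extract anchor targets from a page already held in $html.
    # Illustrative only: no relative-URL resolution, no de-duplication.
    my $html = join '', <STDIN>;    # e.g. pipe a saved page into this script
    my @links;
    while ($html =~ /<a\s[^>]*href\s*=\s*"?([^">\s]+)/ig) {
        push @links, $1;
    }
    print "$_\n" for @links;

Resolving the extracted links against the page's base URL, and checking robots.txt before fetching them, is where most of the real work lies.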
Ian -- -------------------------------------------------------------------------- yakk@ucc.gu.uwa.edu.au yakk@tartarus.uwa.edu.au yakk@s30.dialup.uwa.edu.au http://www.ucc.gu.uwa.edu.au/~yakk/ (My currently very very lame web pages) ftp://ftp.ucc.gu.uwa.edu.au/pub/mirror/guitar (Only Australian OLGA Mirror) For a good time call: "deltree windows" From owner-robots Sun Apr 7 19:06:42 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08794; Sun, 7 Apr 96 19:06:42 -0700 Comments: Authenticated sender is From: "Jakob Faarvang" Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Mon, 8 Apr 1996 04:05:51 +0100 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: VB and robot development Cc: [3~@tartarus.uwa.edu.au Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Message-Id: 02085636522437@cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com From: Ian McKellar > > > Anyone developing robots in VB? If not, why not? If so, I'm looking for an > > > HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be > > > appreciated. > > > > We're developing our robot in VB. > I'm quite impressed. VB seems to me to be a bit underpowered. I would use > Perl or C. I do hope you are using VB under NT. VB is not underpowered. Database technologies like ROCK-E-T and others are extremely fast and powerful (we're in the process of converting at the moment - the Access stuff is really bad). As a robot, VB is also great.. Parsing is fairly easy, and the whole thing runs fast .. It's not entirely finished yet, but I expect it will be within 5-6 weeks (other projects are being developed at the same time)... - Jakob Faarvang jakob@jubii.dk From owner-robots Sun Apr 7 23:49:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10020; Sun, 7 Apr 96 23:49:34 -0700 X-Sender: dchandler@abilnet.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Darrin Chandler Subject: Re: VB and robot development Date: Sun, 7 Apr 1996 23:49:24 -0700 Message-Id: <19960408064923574.AAA113@defiant.abilnet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was about to give a similar reply when I saw yours. VB is certainly powerful enough to do a robot in. I'm using Borland Delphi myself. At 04:05 04/08/96 +0100, you wrote: > From: Ian McKellar > >> > > Anyone developing robots in VB? If not, why not? If so, I'm looking for an >> > > HTML parsing tool to use under VB (Prefer VB 3.0). Any help will be >> > > appreciated. >> > >> > We're developing our robot in VB. > >> I'm quite impressed. VB seems to me to be a bit underpowered. I would use >> Perl or C. I do hope you are using VB under NT. > >VB is not underpowered. Database technologies like ROCK-E-T and >others are extremely fast and powerful (we're in the process of >converting at the moment - the Access stuff is really bad). As a >robot, VB is also great.. Parsing is fairly easy, and the whole >thing runs fast .. It's not entirely finished yet, but I expect it >will be within 5-6 weeks (other projects are being developed at the >same time)... 
> >- Jakob Faarvang >jakob@jubii.dk > From owner-robots Mon Apr 8 13:36:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12690; Mon, 8 Apr 96 13:36:24 -0700 Message-Id: <199604082034.NAA07157@norway.it.earthlink.net> Subject: Re: VB and robot development Date: Mon, 8 Apr 96 13:33:36 +0100 From: To: Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >I was about to give a similar reply when I saw yours. VB is certainly >powerful enough to do a robot in. I'm using Borland Delphi myself. Nearly any programming system which can use HTTP and parse text can be used to build robots. Now that Allegiant Technologies has published Marionet, there's a fair number of Mac folks using Director, SuperCard, and HyperCard discussing robot building. To help stave off a potential onslaught of porrly-designed robots built from such high-level systems, I'm doing everything I can to make sure everyone has a chance to read "A Standard for Robot Exclusion" and "Guidelines for Robot Writers". Put a link to these on your web page, and pray along with me. :) - Richard Gaskin Fourth World Software Tools for SuperCard, Director, HyperCard, OMO, and more.... Mail: Ambassador@FourthWorld.com Web: www.FourthWorld.com 1-800-288-5825 From owner-robots Mon Apr 8 17:17:04 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14269; Mon, 8 Apr 96 17:17:04 -0700 Message-Id: <2.2.32.19960409001946.00b5c238@mango.mangonet.com> X-Sender: mikey@mango.mangonet.com X-Mailer: Windows Eudora Pro Version 2.2 (32) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 08 Apr 1996 20:19:46 -0400 To: robots@webcrawler.com, info@webcrawler.com From: Mike Rodriguez Subject: Problem with your Index Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi, There seems to be a problem in the way WC is indexing my site. If you do a query on: southern cross astronomical society the first URLs that comes up is to my site. You'll notice that the URL in the link is http://mango.mangonet.com/scas ^^^^^ but it SHOULD BE http://www.mangonet.com/scas/ ^^^ Notice that the hostname, mango, is listed. The alias www should be listed instead. Can you tell me why this is happening? Thanks for any help you can offer. --------------------------------------------------------------------------- -- Mike Rodriguez Finger for PGP public key. -- Mangonet Communications, Inc. (800) 554-0033 -- http://www.mangonet.com/ -- South Florida's Graphics and Web Design Firm. --------------------------------------------------------------------------- From owner-robots Mon Apr 8 19:47:20 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15012; Mon, 8 Apr 96 19:47:20 -0700 X-Sender: mak@surfski.webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 8 Apr 1996 18:49:09 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Re: Problem with your Index Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 8:19 PM 4/8/96, Mike Rodriguez wrote: >There seems to be a problem in the way WC is indexing my site. For WC specific problems please follow-up to wc@webcrawler.com, or me personally, not the robots list which is for technical robot development discussions. >[robot indexed host.domain instead of www.domain] >Can you tell me why this is happening? Someone had a link to mango rather than www, so we just used it. 
I've forwarded your message to someone who can help you further... -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Thu Apr 11 01:02:37 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28982; Thu, 11 Apr 96 01:02:37 -0700 From: Mr David A Weeks Message-Id: <9604110657.AA15373@central.surrey.ac.uk> Subject: Handling keyword repetitions To: robots@webcrawler.com Date: Thu, 11 Apr 96 7:57:31 BST X-Mailer: ELM [version 2.3 PL3] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi all, I was wondering what people's views are regarding keyword repetitions found in many web documents. Sometimes they're in the tags and sometimes they form a large block at the bottom of the page. Obviously the authors of such pages are trying to score higher ratings on the index catalogue searches. Should catalogues really index these keyword repetitions? Regards, Dave Weeks. ---------------------------- cs31dw@surrey.ac.uk http://eeisun2.city.ac.uk/~ftp/Guinness/Hello.html From owner-robots Thu Apr 11 18:11:57 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04732; Thu, 11 Apr 96 18:11:57 -0700 Message-Id: <199604120111.CAA21289@earth.ftech.co.uk> Comments: Authenticated sender is From: "Alan" Organization: Visionaries.co.uk To: robots@webcrawler.com Date: Fri, 12 Apr 1996 02:11:07 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: Handling keyword repetitions Cc: Mr David A Weeks Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From: Mr David A Weeks > Subject: Handling keyword repetitions > To: robots@webcrawler.com > Date: Thu, 11 Apr 96 7:57:31 BST > Reply-to: robots@webcrawler.com Hello David, and all readers, David poses an interesting question indeed, and one which has, I'm sure, been discussed at length on this list. I should expect the archives hold some interesting points of view on this matter. From a personal point of view, I guess I'd have to base my answer on the sophistication of the robot concerned. From one point of view, I have to say NO NO NO! Come on, think about it. Remember the "red herring page"? I have spotted its URL in my search results some 10 times in three months, whilst querying diverse subjects. This on-line dictionary (Not) is a classic example of how the older robots were open to abuse. Putting "You've been had" at the top of a dictionary on a web page is a laugh OK, but perhaps this answers your question. So this is the argument against indexing the "word spam". If the page is about what it is about, there should be little need for key-word sections. The content should speak louder than key-words. A page about UFOs and Aliens will use these words frequently in the body of the document, not to attract visitors, but because that is what it is about. An author should be able to place most conceivable key-words for its content in its content. There are however some exceptions to this thinking; graphic sites are one. (Although IMO, ALT text is grossly underused, and this underuse is particularly apparent on corporate sites that have the funds to pay someone to know better.) Another exception is the ever popular mispeeling ov werds, abbreviations, acronyms, or slang. Key-words can help here, but as you say, do you use them or not? I know for one that I would exclude pages where there was "word spam" (a rough check for repeated-word runs is sketched below).
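A rough check for the repeated-word runs Alan is describing takes only a few lines. This Perl sketch is illustrative only; the threshold of five repeats is an arbitrary choice, not something any robot discussed here is known to use.

    # Flag a page whose text contains the same word more than $max_run
    # times in a row (e.g. "buy buy buy buy buy buy ...").
    sub looks_like_word_spam {
        my ($text, $max_run) = @_;
        $max_run ||= 5;
        my ($prev, $run) = ('', 0);
        foreach my $word (map { lc } $text =~ /(\w+)/g) {
            $run = ($word eq $prev) ? $run + 1 : 1;
            return 1 if $run > $max_run;
            $prev = $word;
        }
        return 0;    # no suspicious run found
    }

An index would probably treat such a page with suspicion rather than refuse it outright, since long repetition also occurs in perfectly honest pages.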
This is so unsightly, amateurish, and frankly quite rude! How to do this? I guess, I'd start by excluding all text using the same as the . I would treat as suspicious, all text within a link to the active page, which would render invisible in most pages without . Pages starting in AARDVARK and ending ZYGOTE could also be steered with a wide berth:-) I'd say that if you had a sophisticated enough neurology, you could identify key-word sections, and compare them with page content, and downgrade pages with poor scores, or repetition within poor grammar constructs. I think you'd certainly be doing the public a service! Hope some of this is useful T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts T-Shirts Oops, I mean Alan . On the other hand, > Hi all, > > I was wondering what people's views are on regarding key-word repetitions found in many web documents. Sometimes they're in the tags and sometimes they form a large block at t > e bottom of the page. > > Obviously the authors of such pages are trying to score higher ratings on the index catalogue searches. > > Should catalogues really index these key-word repetitions? > > > Regards, > Dave Weeks. > > > > ---------------------------- > cs31dw@surrey.ac.uk > http://eeisun2.city.ac.uk/~ftp/Guinness/Hello.html > -- ".....UFO Scatters Crash Debris Across UK Web Site........" ...**.....:..*.*....^...*...*..*.*....*............*.*..*.. ...*..".. M y G o d ......*..:..*..*....*...*...:*.. .^.*..*... It's F u l l o f S t a r s ..*.:....*.. .*.... http://www.visionaries.co.uk/webcat/ufos00.hts :.*.. From owner-robots Fri Apr 12 00:40:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00747; Fri, 12 Apr 96 00:40:24 -0700 Message-Id: <01BB2808.8CD2DF60@pax-ca7-20.ix.netcom.com> From: chris cobb To: "'robots@webcrawler.com'" Subject: word spam Date: Fri, 12 Apr 1996 00:22:55 -0400 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com This may have been discussed in past sections, but it comes to mind regarding the use of large blocks of random or repetitive keyword text in web pages - either to obstruct a crawler's indexing mechanism or to increase the ranking of a page. There is a feature in many word processors - Word comes to mind - as part of the "Grammer Check" section. After executing a grammer check, Word displays statistics about the document including some proprietary grammatical ranking scales and the score of the document on each. I don't recall the actual name of these scales, but they were created by the dictionary/thesaurus /encyclopedia people to rank the 'literary level' of a document. 
Typically the score represents a grade level, like 9.4, indicating that the document is constructed and written in a manner indicative of a ninth grade reading level. I suggest that it would be possible to run these algorithms on web pages to determine if the page contained sufficient structure to be worthy of indexing. Obviously, a page like the 'red herring' page would be thrown out unless the author spent time to construct the page so it would appear as normal sentences. Some analysis using these algorithms could probably help determine at what level a document is 'useless' for information purposes. 'Useless' documents should still be indexed, but the default webcrawler query might exclude them unless instructed not to. This would be necessary if a person were searching for a foreign language or math/science type of page. Just some thoughts. Chris Cobb From owner-robots Sat Apr 13 19:16:46 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12174; Sat, 13 Apr 96 19:16:46 -0700 Message-Id: <31702EC4.19D0@helix.net> Date: Sat, 13 Apr 1996 15:46:29 -0700 From: arutgers X-Mailer: Mozilla 3.0B2 (Win95; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: word spam References: <01BB2808.8CD2DF60@pax-ca7-20.ix.netcom.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi chris cobb wrote: > > This may have been discussed in past sections, but it comes to mind regarding the use > of large blocks of random or repetitive keyword text in web pages - either to obstruct a crawler's indexing > mechanism or to increase the ranking of a page. > > There is a feature in many word processors - Word comes to mind - as part of the > "Grammer Check" section. After executing a grammer check, Word displays statistics about ... > > Chris Cobb Nice idea, if you think people will always use good English grammar and sentences for their web pages, but often in things like a list or an index there is no need for good grammar, and point form works far better for many things. (When was the last time you saw "Click here to go to the ________ page."?) For example, look at the search results from any robot such as WebCrawler: they are very useful, but there is little English grammar and few if any complete sentences. Or any company's web page: it probably has a series of links at the bottom that reads "Technical Support; Ordering; On-line Catalog", again not much of a sentence. Also the 'ideal' web page is not too long and probably has some info in point form, including the title and the primary heading on the page. Then there are the language and identification issues. First, you can have a page in French that does not use the French character set tag; this would immediately get a very low score and be discarded, despite its use to people fluent in French. Second, identifying is a problem because the computer has to recognize a word (i.e. noun, verb, etc.) in order to check the grammar. There are lots of company and product names that would confuse it and lower the score. (Is 'descent' a proper noun as in the game, or a verb, and should it be indexed?) Grammar checkers are great for essays, but web pages are not essays. You would, however, be able to use parts of grammar checkers differently. You could have a 'literacy level'. The average number of characters in a word over a document is usually about 4 for a high school student, and if it's >5 the writer probably went to university (a rough version of this measure is sketched below).
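A crude version of that average-word-length measure is easy to compute. The Perl below is only a sketch of the idea in the two messages above; the 4- and 5-character thresholds are their rules of thumb, not established cut-offs.

    # Average word length of a document's visible text, as a cheap
    # "literacy level" signal.  Thresholds are rules of thumb only.
    sub average_word_length {
        my ($text) = @_;
        my @words = $text =~ /([A-Za-z]+)/g;
        return 0 unless @words;
        my $total = 0;
        $total += length $_ for @words;
        return $total / @words;    # ~4 for casual prose, >5 for more formal text
    }

As the rest of the message notes, the score is easily skewed by lists, code, and non-English text, so at best it is one weak signal among several.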
Again though this is affected by leaving out the 'the's and other parts of a proper sentence or if your web page has the word 'supercalifagilisticexpaldocious'(sp?; from Mary Poppins). This is actually part of how the grammar checkers assign grade levels. The other use would be to pick out good keywords. Such as in the 'descent' example a modified grammar checker could decide if 'descent' refers to the game or the verb, by looking at the rest of the sentence, captialization etc. If it's the game it's worth indexing. As far as developing something to ignore large blocks of random or repeditive text, good idea. Andrew From owner-robots Sun Apr 14 10:18:55 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14624; Sun, 14 Apr 96 10:18:55 -0700 Message-Id: <199604141819.TAA26103@earth.ftech.co.uk> Comments: Authenticated sender is From: "Alan" Organization: Visionaries.co.uk To: robots@webcrawler.com, robots@webcrawler.com Date: Sun, 14 Apr 1996 18:18:17 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: word spam Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.31) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi > chris cobb wrote: > > There is a feature in many word processors - Word comes to mind - as part of the > > "Grammer Check" section. After executing a grammer check, Word displays statistics about ... > Andrew Wrote > Nice idea, if you think people will always use good english grammar and > sentences for their web pages, And do they want indexing anyway? > but often in things like a list or an index there > is no need for good grammar and point form works far better for many things. But there is need for structure which can rate as highly as grammar > (When was the last time you saw "Click here to go to the ________ page."?) Today, yesterday, the day before and the day, after. Stupud instructions abound. It's sadly everywhere. Even on my toothpicks (instructions:insert between teeth, ang wiggle from left to right!) > For example, > look at the search results from any robot such as webcrawler, they are very useful > but there is little english grammar and few if any complete sentences. Or any > companies web page, it probably has a series of links at bottom that reads > "Technical Support; Ordering; On-line Catalog", again not much of a sentence. Also But does it say ? "Technical Support; Ordering; On-line Catalog", "Technical Support; Ordering; On-line Catalog", "Technical Support; Ordering; On-line Catalog", "Technical Support; Ordering; On-line Catalog", >As far as developing something to ignore large blocks of random or > repeditive text, good idea. > Andrew > agreed Alan -- "Just when you thought is was safe to go back to Burgers.." ...**.....:..*.*....^...*...*..*.*....*............*.*..*.. ...*..".. M y G o d ......*..:..*..*....*...*...:*.. .^.*..*... It's F u l l o f S t a r s ..*.:....*.. .*.... http://www.visionaries.co.uk/zine/madcows.hts .:.*.. From owner-robots Mon Apr 15 07:43:34 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24033; Mon, 15 Apr 96 07:43:34 -0700 From: Trevor Jenkins Organisation: Don't put it down; put it away! 
To: robots@webcrawler.com Date: Mon, 15 Apr 1996 09:49:19 +0000 Message-Id: <5796.tfj@apusapus.demon.co.uk> Subject: Re: word spam Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23) via wpkGate v2.01 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On 12 Apr 96 at 0:22, chris cobb wrote: > This may have been discussed in past sections, but it comes to mind regarding the use > of large blocks of random or repetitive keyword text in web pages - either to obstruct a crawler's indexing > mechanism or to increase the ranking of a page. > > There is a feature in many word processors - Word comes to mind - as part of the > "Grammer Check" section. Sadly, some index engines are incluing grammatically correct pages that are not really pages. For example, use the Alta Vista engine and look for "posix". You will get "hundreds" of perl, Tcl/Tk or other scripts that include "posix.pl" or similar. These scripts are syntactically correct but are not what I was expecting to see. The grammar checker, in case, has verified that the content is okay but I would contend that the page should have been excluded from consideration. Regards, Trevor. -- Procrastinate Now! From owner-robots Mon Apr 15 09:02:29 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24487; Mon, 15 Apr 96 09:02:29 -0700 Date: Mon, 15 Apr 1996 08:14:05 -0700 (PDT) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: word spam In-Reply-To: <5796.tfj@apusapus.demon.co.uk> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Mon, 15 Apr 1996, Trevor Jenkins wrote: > On 12 Apr 96 at 0:22, chris cobb wrote: > > > This may have been discussed in past sections, but it comes to mind regarding the use > > of large blocks of random or repetitive keyword text in web pages - either to obstruct a crawler's indexing > > mechanism or to increase the ranking of a page. > > > > There is a feature in many word processors - Word comes to mind - as part of the > > "Grammer Check" section. > > Sadly, some index engines are incluing grammatically correct pages > that are not really pages. For example, use the Alta Vista engine and > look for "posix". You will get "hundreds" of perl, Tcl/Tk or other > scripts that include "posix.pl" or similar. These scripts are > syntactically correct but are not what I was expecting to see. The > grammar checker, in case, has verified that the content is okay but I > would contend that the page should have been excluded from > consideration. posix AND NOT (posix.pl) Add in some ranking keywords for the specific area of interest and Alta Vista can produce quite good results. It gets back to what I said about it being the responsibility of the searcher to tailor the search. There is no feasible way for either the indexing engines or the content providers to *exclude* things as being irrelevant to searchers without assistance from the person making the query. Almost everything is relevant to *someone*, even if not to you. If you get a large irrelevant return from a search - retune your search to exclude the non-relevant material. 
-- Benjamin Franz From owner-robots Mon Apr 15 20:51:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00449; Mon, 15 Apr 96 20:51:06 -0700 Message-Id: <2.2.32.19960416035621.0030a96c@pop.tiac.net> X-Sender: wadland@pop.tiac.net X-Mailer: Windows Eudora Pro Version 2.2 (32) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 15 Apr 1996 23:56:21 -0400 To: robots@webcrawler.com From: Ken Wadland Subject: Re: word spam Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >> Sadly, some index engines are incluing grammatically correct pages >> that are not really pages. For example, use the Alta Vista engine and >> look for "posix". >posix AND NOT (posix.pl) This still gets 20,000 hits! Yes, you can correct this particular case with a revised query; but, wouldn't it be nice if the search engines were a little smarter about the context of the word in the document? As another example, try searching for "HTML". AltaVista gets 900,000 matches. I have yet to find a query for documents about HTML which works on any of the search engines. For example, excluding "(HTML)" excludes all documents! I had one heck of a time finding the RFC for HTTP because of this. From owner-robots Mon Apr 15 22:51:25 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA01424; Mon, 15 Apr 96 22:51:25 -0700 Subject: Re: word spam From: YUWONO BUDI To: robots@webcrawler.com Date: Tue, 16 Apr 1996 13:50:51 +0800 (HKT) In-Reply-To: <2.2.32.19960416035621.0030a96c@pop.tiac.net> from "Ken Wadland" at Apr 15, 96 11:56:21 pm X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 839 Message-Id: <96Apr16.135104hkt.19105-20961+109@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > >posix AND NOT (posix.pl) > > This still gets 20,000 hits! > > Yes, you can correct this particular case with a revised query; but, > wouldn't it be nice if the search engines were a little smarter about the > context of the word in the document? > > As another example, try searching for "HTML". AltaVista gets 900,000 > matches. I have yet to find a query for documents about HTML which works on > any of the search engines. For example, excluding "(HTML)" excludes all > documents! > > I had one heck of a time finding the RFC for HTTP because of this. Try visualizing what you expect to see while typing the keywords, hopefully the search engine will pick up the vibration :-) Or you could simply type in: "HTML specification", or spell it out "hypertext markup language" Seriously, single-word queries rarely work. -Bude. From owner-robots Tue Apr 16 07:10:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03307; Tue, 16 Apr 96 07:10:05 -0700 Date: Tue, 16 Apr 1996 06:21:40 -0700 (PDT) From: Benjamin Franz X-Sender: snowhare@ns.viet.net To: robots@webcrawler.com Subject: Re: word spam In-Reply-To: <2.2.32.19960416035621.0030a96c@pop.tiac.net> Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Mon, 15 Apr 1996, Ken Wadland wrote: > >> Sadly, some index engines are incluing grammatically correct pages > >> that are not really pages. For example, use the Alta Vista engine and > >> look for "posix". > > >posix AND NOT (posix.pl) > > This still gets 20,000 hits! Yup. That is because you didn't give me any more information about what you were *exactly* looking for. 
Assuming you are looking for the standards docs: Searching on "posix AND NOT (posix.pl) AND compliance AND IEEE AND fips" and telling Alta Vista to list 'fips' first resulted in 167 documents, starting with the standards docs for POSIX at the NIST. It took me a few iterations to tune the search - but it still took under 5 minutes (most of that time because of poor network performance at alternet stalling my page loads for long periods of time). > Yes, you can correct this particular case with a revised query; but, > wouldn't it be nice if the search engines were a little smarter about the > context of the word in the document? > > As another example, try searching for "HTML". AltaVista gets 900,000 > matches. I have yet to find a query for documents about HTML which works on > any of the search engines. For example, excluding "(HTML)" excludes all > documents! > > I had one heck of a time finding the RFC for HTTP because of this. Using a search engine to look for 'http' (or 'HTML' or 'gif' or 'jpg' or any other string that appears as a structural element of HTML markup) on the WWW is like searching for an acronym that matches an article of speech such as 'AND'. It is simply going to appear too many times in too many places to be useful as a search criteria and the search engines with rightly tell you "I don't think so." This is true in large part because of the huge number of pages with broken HTML resulting in out of context 'structural' strings. Alta Vista reports a mere 83 million hits on 'http'. So you search for some other feature of the information that is not so common: A search for 'RFC' with the keyword 'hypertext' did the trick. The sixth item returned was a pointer to the March 2nd 1.0 http draft. A better approach for any WWW related standards is of course to go directly to http://www.w3.org/. The search engines are powerful enough to find information rapidly and accurately, but you do have to *specify* the information you are looking for. It is no different than using a modern library catalog. -- Benjamin Franz From owner-robots Tue Apr 16 10:32:37 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04533; Tue, 16 Apr 96 10:32:37 -0700 From: "Mark Norman" Message-Id: <9604161032.ZM2205@hpisq3cl.cup.hp.com> Date: Tue, 16 Apr 1996 10:32:53 -0700 X-Mailer: Z-Mail (3.2.1 10apr95) To: robots@webcrawler.com Subject: http directory index request Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com What is the HTTP request that returns a list of the files in a directory? This is obviously a no-brainer since all of the browsers can make this request, but I don't see anything in the documentation I have on HTTP that describes this. I am looking at the document from the IETF which has this header: HTTP Working Group R. Fielding, UC Irvine INTERNET-DRAFT H. Frystyk, MIT/LCS T. Berners-Lee, MIT/LCS Expires in six months January 19, 1996 Hypertext Transfer Protocol -- HTTP/1.1 Thank you. 
From owner-robots Tue Apr 16 10:31:25 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04517; Tue, 16 Apr 96 10:31:25 -0700 Message-Id: <9604161731.AA07473@marys.smumn.edu> Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable From: Kevin Hoogheem Date: Tue, 16 Apr 96 12:33:16 -0500 To: robots@webcrawler.com Subject: Re: word spam References: <2.2.32.19960416035621.0030a96c@pop.tiac.net> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I was wondering if anyone did this when they index. This idea I have is kinda from Dykstra's paper on the indexing problem he wrote about a long time ago; sorry, I don't have the article numbers handy, but what it was doing is going through papers, separating keywords from noise words, and indexing only the keywords. Well, what I was thinking is that most people's web robots have, or could easily take, a list of noise words and then not index them, only indexing words that are not in that list. And if someone puts sex sex sex sex sex that many times right in a row, that should not really get indexed. It might be safe to say that for every keyword a noise word must have been either in front of it or behind it. oh well From owner-robots Tue Apr 16 11:36:04 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05072; Tue, 16 Apr 96 11:36:04 -0700 Message-Id: From: David Levine To: "'robots@webcrawler.com'" Subject: RE: http directory index request Date: Tue, 16 Apr 1996 14:39:40 -0400 X-Mailer: Microsoft Exchange Server Internet Mail Connector Version 4.12.736 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BB2BA2.8C1E8F90" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There is no such command. This would be a security breach of a fairly high order. Sometimes web servers will return a directory when you access a directory that has no default file. For instance, if a web server is set up to serve "index.html" when a file is not specified (i.e. http://foo.bar/blah/) and there IS no index.html in that directory, some web servers will send a directory list instead. This is user configurable, however, and can be turned off. -- David Levine, Application Engineer InterWorld Technology Ventures, Inc. david@interworld.com http://www.interworld.com/staff/david/ >---------- >From: Mark Norman[SMTP:mnorman@hposl41.cup.hp.com] >Sent: Tuesday, April 16, 1996 1:32 PM >To: robots@webcrawler.com >Subject: http directory index request > >What is the HTTP request that returns a list of the files in a directory? >This >is obviously a no-brainer since all of the browsers can make this request, >but >I don't see anything in the documentation I have on HTTP that describes >this. I >am looking at the document from the IETF which has this header: > >HTTP Working Group R. Fielding, UC Irvine >INTERNET-DRAFT H. Frystyk, MIT/LCS > T. Berners-Lee, MIT/LCS >Expires in six months January 19, 1996 > > > Hypertext Transfer Protocol -- HTTP/1.1 > > >Thank you.
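To make David's point concrete: there is no separate "list directory" method in HTTP. A robot simply GETs the URL ending in "/" and takes whatever the server decides to return. The Perl below is a bare-bones illustration; the host and path are made up for the example.

    # Fetch a directory-style URL with a plain GET over a socket.  What comes
    # back -- a default document, a generated listing, or an error -- is
    # entirely the server's choice.  Host and path are illustrative only.
    use IO::Socket::INET;

    my $host = 'www.example.com';
    my $sock = IO::Socket::INET->new(PeerAddr => $host,
                                     PeerPort => 80,
                                     Proto    => 'tcp')
        or die "connect failed: $!";
    print $sock "GET /some/directory/ HTTP/1.0\r\nHost: $host\r\n\r\n";
    print while <$sock>;    # response headers plus whatever body was chosen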
From owner-robots Tue Apr 16 12:41:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05566; Tue, 16 Apr 96 12:41:53 -0700 From: "Mordechai T. Abzug" Message-Id: <199604161941.PAA00249@rpa04.gl.umbc.edu> Subject: Re: http directory index request To: robots@webcrawler.com Date: Tue, 16 Apr 1996 15:41:35 -0400 (EDT) In-Reply-To: <9604161032.ZM2205@hpisq3cl.cup.hp.com> from "Mark Norman" at Apr 16, 96 10:32:53 am X-Mailer: ELM [version 2.4 PL25] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 1496 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mark Norman spake thusly: >What is the HTTP request that returns a list of the files in a directory? This >is obviously a no-brainer since all of the browsers can make this request, but >I don't see anything in the documentation I have on HTTP that describes this. [Hm. This should probably be a FAQ, under the more general category of "How do I get a server to send me _____?"] The answer: GET, same as normal. Or to be more precise, there is no such thing as either a request for a file or for a directory, only a request for a "resource". When you send the server a request, it is at its own liberty to decide what to send back. Simple servers generally return data corresponding to some section of the file system, but this need not be the case. Servers are at their own liberty to decide how to respond. The directory behavior you saw follows the choice of some servers to return directory contents when a URL corresponds to a directory in the file system and the directory does not contain some default HTML file. For instance, our server is configured to return a directory for http://www.gl.umbc.edu/~mabzug1/ if the directory didn't contain a file called index.html, but if I touch index.html, it'll return the file. You can try to get a directory by manipulating the URL in various ways, but you are not guaranteed success. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu God does not play dice. -Albert Einstein From owner-robots Tue Apr 16 13:31:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05951; Tue, 16 Apr 96 13:31:24 -0700 Message-Id: <2.2.32.19960416203635.003091d0@pop.tiac.net> X-Sender: wadland@pop.tiac.net X-Mailer: Windows Eudora Pro Version 2.2 (32) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 16 Apr 1996 16:36:35 -0400 To: robots@webcrawler.com From: Ken Wadland Subject: Re: word spam Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > Searching on "posix AND NOT (posix.pl) AND compliance AND IEEE AND > fips" and telling Alta Vista to list 'fips' first resulted in 167 > documents, starting with the standards docs for POSIX at the NIST. I'm not saying that Alta Vista isn't a powerful search engine. It most certainly is. As a computer scientist, I, too, can quickly revise the query to something similar to yours. In fact, I sometimes claim that the only thing my PhD really means is that I know how to use libraries REALLY well. ;-) But, your typical Internet user has never had a course in Boolean logic. What started this thread is the observation that smarter indexing could result in better query results. Search engines that understand the difference between a text word, a title word and an HTML tag will invariably return better results for simple queries than ones that don't.
Do you disagree with this conclusion? From owner-robots Tue Apr 16 17:19:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07731; Tue, 16 Apr 96 17:19:05 -0700 From: reinpost@win.tue.nl (Reinier Post) Message-Id: <199604170021.CAA05638@wsinis10.win.tue.nl> Subject: Re: word spam To: robots@webcrawler.com Date: Wed, 17 Apr 1996 02:21:12 +0200 (MET DST) In-Reply-To: <2.2.32.19960416203635.003091d0@pop.tiac.net> from "Ken Wadland" at Apr 16, 96 04:36:35 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Content-Length: 1833 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com You (Ken Wadland) write: [...] >But, your typical Internet user has never had a course in Boolean >logic. How can you expect people to express themselves without a language? Some query languages and techniques may be easier to learn than others, which probably means that it takes a minimal amount of experimentation for the user to learn how to compose effective queries, but in any case, some level of understanding on the part of the user of the querying mechanism is inevitable. > What started this thread is the observation that smarter indexing >could result in better query results. Not quite; it was a suggestion rather than an observation. Boolean logic may be rather difficult to learn, but once the principle is understood, its application will always be easier to understand than that of a hidden weighting mechanism over which the user has no control. >Search engines that understand the >difference between a text word, a title word and an HTML tag will invariably >return better results for simple queries than one that doesn't. Do you >disagree with this conclusion? As a reader of this forum, I would be interested to see any mention of evidence one way or the other. But I strongly agree with what Benjamin Franz appears to be saying, namely, if 'smartness' means knowing what the user wants, the user must supply that information in one way or another. Magic doesn't quite work with computers ... If, on the other hand, 'smartness' means outguessing the user at what the user wants, a 'smart' query technique may be rather effective, up to the point that the user feels a need to understand what it is doing. -- Reinier Post reinpost@win.tue.nl a.k.a. me [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] [LINK] From owner-robots Tue Apr 16 18:49:01 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08185; Tue, 16 Apr 96 18:49:01 -0700 Message-Id: <199604170148.CAA26151@earth.ftech.co.uk> Comments: Authenticated sender is From: "Alan" Organization: Visionaries.co.uk To: robots@webcrawler.com Date: Wed, 17 Apr 1996 02:48:51 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: word spam Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.31) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Benjamin Franz Wisely Wrote: With regard to: > > >posix AND NOT (posix.pl) > > This still gets 20,000 hits! > Yup. That is because you didn't give me any more information about what > you were *exactly* looking for. > The search engines are powerful enough to find information rapidly and > accurately, but you do have to *specify* the information you are looking > for. It is no different than using a modern library catalog. What is the Quest? To seek the Grail. What is the Grail? It has many forms. 
How can I seek the Unknown? By following the path which will reveal itself to you. Where does such a path begin? Here, at your very doorstep. How shall I know I am on it? That which you seek will guide you.............. Aaah Like they say............. Ask a stupid question.............:-) Alan -- ".....UFO Scatters Crash Debris Across UK Web Site........" ...**.....:..*.*....^...*...*..*.*....*............*.*..*.. ...*..".. M y G o d ......*..:..*..*....*...*...:*.. .^.*..*... It's F u l l o f S t a r s ..*.:....*.. .*.... http://www.visionaries.co.uk/webcat/ufos00.hts :.*..
From owner-robots Tue Apr 16 21:33:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08982; Tue, 16 Apr 96 21:33:06 -0700 From: Message-Id: <9604170426.AA07772@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: word spam In-Reply-To: Your message of "Wed, 17 Apr 96 02:48:50 -0000." <199604170149.CAA26154@earth.ftech.co.uk> Date: Tue, 16 Apr 96 21:26:45 -0700 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Once in a while I need to pipe in and speak for my poor bad-mouthed robot and search engine. It's one of these weeks. For the last question about getting the content of a Perl script and not its output: I don't know, ask the server! Someone published the URL, and the server gave this back. I promise that we don't use out-of-work KGB psychics to get the pages, just regular GET requests. And no crystal balls to guess URLs, just following links... As for the discussion about the "quality" of returns, the distribution of responses has been wonderfully bimodal: those of us who think that what you get should be a strong function of what you put in, and the rest (hmm) who believe in magic. If a query is vague, say "computers", there are indeed tens of thousands of pages matching the query, in the simple sense of containing the word. Now the ranking can help some, but most of the returns will fall in very crowded buckets of similar ranking, say that all pages with "computers" in the title and mentioning "computers" often enough will all compete for top spot. Now, what is the "right answer"? Well, if you believe that the game is to guess the rest of the query in your head, only magic will do. If you think it's some sort of reference page, Yahoo-style (not a criticism, they play a different game, and certainly not to be exhaustive), it requires human intervention, and this is not what search engines are about. So the right answer is indeed to refine the query. What's so hard about it?
--Louis From owner-robots Wed Apr 17 05:46:44 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10142; Wed, 17 Apr 96 05:46:44 -0700 Message-Id: <1.5.4.32.19960417134552.002a65b0@postoffice.ptd.net> X-Sender: terces1@postoffice.ptd.net (Unverified) X-Mailer: Windows Eudora Light Version 1.5.4 (32) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 17 Apr 1996 08:45:52 -0500 To: robots@webcrawler.com From: terces1@postoffice.ptd.net Subject: Re: word spam Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Regarding Louis's statement: >For the last question about getting the content of a Perl script and not its >output: I don't know, ask the server! Someone published the URL, and the >server gave this back. I promise that we don't use out-of-work KGB psychics >to get the pages, just regular GET requests. And no crystal balls to guess >URLs, just following links... What do you mean by "Someone published the URL"? Do you mean that someone has to explictly link to these pages, or that these files are simply located in the data directory of the Web Server? From owner-robots Wed Apr 17 09:08:27 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10985; Wed, 17 Apr 96 09:08:27 -0700 From: Message-Id: <9604171602.AA08339@evil-twins.pa.dec.com> To: robots@webcrawler.com Subject: Re: word spam In-Reply-To: Your message of "Wed, 17 Apr 96 08:45:52 CDT." <1.5.4.32.19960417134552.002a65b0@postoffice.ptd.net> Date: Wed, 17 Apr 96 09:02:46 -0700 X-Mts: smtp Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > What do you mean by "Someone published the URL"? Do you mean that someone > has to explictly link to these pages, or that these files are simply located > in the data directory of the Web Server? Unfortunately, same thing. If someone publishes the directory, AND the directory browsing feature on the server is not disabled, THEN scooter will get a nicely formatted html page containing pointers to every file in the directory, and there is no easy way for me (short of really ugly heuristics) to detect the situation. But I wish I could. Plea to most webmasters: please please please, disable the directory indexing feature, it is rarely needed, and causes robots to pick a lot of junk. --Louis From owner-robots Thu Apr 18 01:52:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA15798; Thu, 18 Apr 96 01:52:23 -0700 From: "Andrey A. Krasov" Organization: BINP RAS, Novosibirsk To: robots@webcrawler.com Date: Thu, 18 Apr 1996 15:51:32 +0700 Subject: Re: word spam Priority: normal X-Mailer: Pegasus Mail v3.31 Message-Id: <2E60203418@csd.inp.nsk.su> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On 17 Apr 96 at 9:02, monier@pa.dec.com wrote: > > > What do you mean by "Someone published the URL"? Do you mean that someone > > has to explictly link to these pages, or that these files are simply located > > in the data directory of the Web Server? > > Unfortunately, same thing. If someone publishes the directory, AND the > directory browsing feature on the server is not disabled, THEN scooter will get > a nicely formatted html page containing pointers to every file in the directory, > and there is no easy way for me (short of really ugly heuristics) to detect the > situation. But I wish I could. Plea to most webmasters: please please please, > disable the directory indexing feature, it is rarely needed, and causes robots > to pick a lot of junk. 
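For webmasters who want to act on the plea quoted above, the switch is server-specific. As one illustration, on Apache-style servers the automatically generated listing for a directory can be turned off with a one-line Options change; the path is an example only, and other servers of the period (CERN, NCSA, Netscape) have their own equivalent settings.

    # Illustrative Apache-style configuration; the directory path is an example.
    # "Options -Indexes" removes automatically generated directory listings, so
    # a request for a URL ending in "/" with no default document gets an error
    # instead of a list of every file in the directory.
    <Directory /usr/local/etc/httpd/htdocs>
        Options -Indexes
    </Directory>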
> You can try to scan for specific string such as "up to previous dirs " or similar one From owner-robots Thu Apr 18 15:05:19 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18384; Thu, 18 Apr 96 15:05:19 -0700 Date: Thu, 18 Apr 1996 15:03:34 -0700 Message-Id: <199604182203.PAA00605@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: Web Robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How hard is it for someone to put a web robot onto there web site? I'd like to put one on mine. Is there anything I should know? Thanks! Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net lllll lllll lll lll lll ll ll llllllllllllllllllll From owner-robots Fri Apr 19 05:28:54 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21856; Fri, 19 Apr 96 05:28:54 -0700 Message-Id: <199604191328.OAA03697@fedro> X-Sender: x8035952@fedro.ugr.es X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 19 Apr 1996 21:03:38 +0100 To: robots@webcrawler.com From: Ricardo Eito Brun Subject: Robots in the client? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm trying to recover exhaustive information about robots which can run in oun client WWW browsers. I have only read something about Arachnidus?. If you can give me some information about some other application or about the performance of such a tools, I would be gratefull. Thanks in advance: From owner-robots Fri Apr 19 05:33:03 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA21866; Fri, 19 Apr 96 05:33:03 -0700 Message-Id: <199604191324.OAA03650@fedro> X-Sender: x8035952@fedro.ugr.es X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 19 Apr 1996 20:59:49 +0100 To: robots@webcrawler.com From: Ricardo Eito Brun Subject: General Information Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm trying to recover exhaustive information about mailing-list, faqs, etc.., about robots, spiders, and someother kind of indexing tool for the WWW (Aliweb, Swish...) If somebody can help me with some adress I would be gratefuly. Thanks in advance: From owner-robots Fri Apr 19 06:25:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22045; Fri, 19 Apr 96 06:25:53 -0700 From: debra@win.tue.nl (Paul De Bra) Message-Id: <199604191328.PAA04067@wsinis10.win.tue.nl> Subject: Re: Robots in the client? To: robots@webcrawler.com Date: Fri, 19 Apr 1996 15:28:02 +0200 (MET DST) In-Reply-To: <199604191328.OAA03697@fedro> from "Ricardo Eito Brun" at Apr 19, 96 09:03:38 pm X-Mailer: ELM [version 2.4 PL23] Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Content-Length: 429 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ricardo Eito Brun: > I'm trying to recover exhaustive information >about robots which can run in oun client WWW browsers. >I have only read something about Arachnidus?. If you >can give me some information about some other application >or about the performance of such a tools, I would be >gratefull. Check out the fish search, integrated into Tuebingen Mosaic. 
(look in ftp://ftp.win.tue.nl/pub/infosystems/www) Paul. From owner-robots Fri Apr 19 08:15:38 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22425; Fri, 19 Apr 96 08:15:38 -0700 Message-Id: <2.2.32.19960419151900.009f25d8@giant.mindlink.net> X-Sender: a07893@giant.mindlink.net X-Mailer: Windows Eudora Pro Version 2.2 (32) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Fri, 19 Apr 1996 08:19:00 -0700 To: robots@webcrawler.com From: Tim Bray Subject: Magic, Intelligence, and search engines Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 09:26 PM 4/16/96 -0700, monier@pa.dec.com wrote: > those of us who think that what you get >should be a strong function of what you put in, and the rest (hmm) who believe >in magic. >... >So the right answer is indeed to refine the query. >What's so hard about it? Right. Hear hear. People have been, since about 1975, saying "wouldn't it be wonderful if search engines were intelligent." And every couple of years, some little venture-cap-funded startup comes along and says "HUZZAH! We've made search engines intelligent!" If you believe in precision/recall [which might be useful if it could be measured] the numbers show a discouraging lack of progress in the last 20 years at making engines intelligent. Not that we don't make progress... but mostly on user interfaces, data structures & algorithms, feedback mechanisms, document strucures, indexing efficiency, distributed search. Why is all this? Because to be intelligent, the software would have to, for an arbitrary web page, be able to discern what it's about. In a multi-lingual fashion, at that. Such software does not currently exist. Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com) From owner-robots Fri Apr 19 13:00:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA00581; Fri, 19 Apr 96 13:00:23 -0700 From: Bonnie Scott Message-Id: <199604192003.QAA15669@elephant-int.prodigy.com> Subject: Re: Robots in the client? To: robots@webcrawler.com Date: Fri, 19 Apr 1996 16:03:18 -0400 (EDT) In-Reply-To: <199604191328.OAA03697@fedro> from "Ricardo Eito Brun" at Apr 19, 96 09:03:38 pm X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There was a program called SIMON which worked with Mosaic for X-Windows. The Fish Search was built into a version of Mosaic by Paul De Bra (debra@win.tue.nl) I never used either, but I did some research myself about a year ago. Bonnie Scott Prodigy Services Company > > I'm trying to recover exhaustive information > about robots which can run in oun client WWW browsers. > I have only read something about Arachnidus?. If you > can give me some information about some other application > or about the performance of such a tools, I would be > gratefull. 
> > Thanks in advance: > > > > From owner-robots Sun Apr 21 05:46:53 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA05524; Sun, 21 Apr 96 05:46:53 -0700 Comments: Authenticated sender is From: "Jakob Faarvang" Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Sun, 21 Apr 1996 14:44:43 +0100 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: default documents Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Message-Id: 12475069900646@cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com How does a robot know what the default document (index.html/default.htm/home.html) is called? I mean, how does it know that, say, http://www.mydomain.com/test/ is the same as http://www.mydomain.com/test/index.html or http://www.mydomain.com/default.htm ? - Jakob From owner-robots Sun Apr 21 11:59:39 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06063; Sun, 21 Apr 96 11:59:39 -0700 X-Sender: dchandler@abilnet.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Darrin Chandler Subject: Re: default documents Date: Sun, 21 Apr 1996 11:59:26 -0700 Message-Id: <19960421185925458.AAA60@defiant.abilnet.com> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 14:44 04/21/96 +0100, you wrote: >How does a robot know what the default document >(index.html/default.htm/home.html) is called? > >I mean, how does it know that, say, http://www.mydomain.com/test/ is >the same as http://www.mydomain.com/test/index.html or >http://www.mydomain.com/default.htm ? > >- Jakob > It doesn't know. I imagine that some robots make assumptions and equate index.html or default.html with a resource ending in '/', but there's nothing in the HTTP spec that guarantees it. The robots I write don't assume this, nor do most of the other HTTP related tools I use. It may be irritating to have different entries in your database for '/' and '/index.html', but it's safer. A given server may have several file names which it uses as default. For instance, given two files '/index.cgi' and '/index.html', the server may give you the .cgi when you ask for '/', and assuming .html would be incorrect even though that resource exists and is published. From owner-robots Sun Apr 21 19:11:36 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07289; Sun, 21 Apr 96 19:11:36 -0700 Message-Id: <199604220211.LAA20343@beagle.mtl.t.u-tokyo.ac.jp> To: robots@webcrawler.com Cc: behrens@mtl.t.u-tokyo.ac.jp Subject: Re: default documents In-Reply-To: Your message of "Sun, 21 Apr 1996 11:59:26 JST." <19960421185925458.AAA60@defiant.abilnet.com> Date: Mon, 22 Apr 1996 11:11:29 +0900 From: Harry Munir Behrens Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In message <19960421185925458.AAA60@defiant.abilnet.com>, Darrin Chandler write s: >At 14:44 04/21/96 +0100, you wrote: >>How does a robot know what the default document >>(index.html/default.htm/home.html) is called? >> >>I mean, how does it know that, say, http://www.mydomain.com/test/ is >>the same as http://www.mydomain.com/test/index.html or >>http://www.mydomain.com/default.htm ? >> >>- Jakob >> > >It doesn't know. I imagine that some robots make assumptions and equate >index.html or default.html with a resource ending in '/', but there's >nothing in the HTTP spec that guarantees it. 
The robots I write don't assume >this, nor do most of the other HTTP related tools I use. It may be >irritating to have different entries in your database for '/' and >'/index.html', but it's safer. A given server may have several file names >which it uses as default. For instance, given two files '/index.cgi' and >'/index.html', the server may give you the .cgi when you ask for '/', and >assuming .html would be incorrect even though that resource exists and is >published. Neither nor: It's not the client nor the robot: It's the server that knows. When setting up your HTTP server, that's one of the configuration parameters. So you can configure your server to look for /what_ever.html when receiving a request for Cheers, Harry "Munir Basha" Behrens Tel.: +81-3-3812-2111 #6752 PhD candidate +81-3-3814-4251 #6763 Tanaka Lab Dept. of Electrical Engineering e-mail: behrens@mtl.t.u-tokyo.ac.jp University of Tokyo From owner-robots Mon Apr 22 09:09:09 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08908; Mon, 22 Apr 96 09:09:09 -0700 From: micah@sequent.uncfsu.edu (Micah A. Williams) Message-Id: <199604221607.MAA07121@sequent.uncfsu.edu> Subject: Re: default documents To: robots@webcrawler.com Date: Mon, 22 Apr 96 12:07:45 EDT In-Reply-To: 12475069900646@cybernet.dk; from "Jakob Faarvang" at Apr 21, 96 2:44 pm X-Mailer: ELM [version 2.3 PL0] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com In the words of Jakob Faarvang, > > How does a robot know what the default document > (index.html/default.htm/home.html) is called? > > I mean, how does it know that, say, http://www.mydomain.com/test/ is > the same as http://www.mydomain.com/test/index.html or > http://www.mydomain.com/default.htm ? > > - Jakob My robot, Pioneer, does not make any assumptions as to what a "default document" may be. My search database therefore has a few "xxxx/" and "xxxx/index.html" pairs laying around that point to the same document. Now if the HTTP protocol had servers make certain provisions for this, it would be different. Like for example, what if, instead of just sending back the document, the server sent the actual true location as a redirect (code 301). Say the robot decides to get the document, "http://www.foo.org/info/" ... the server at foo.org sees that the GET request is in 'short form', consults its internal configuration, and then returns a redirect with the Location set to the full absolute location of its default document, "http://www.foo.org/info/index.html". This being the case, if the robot already had this URL indexed, it would ignored. This may be helpful for robots, but since so many URLs are published in the 'short cut' format, I wonder whether or not dealing with this redirection frequently would slow down normal browsers. -Micah -- ============================================================================ Micah A. Williams | Computer Science | Fayetteville State University micah@sequent.uncfsu.edu | http://sequent.uncfsu.edu/~micah/ Bjork WebPage: http://sequent.uncfsu.edu/~micah/bjork.html Though we do not realize it, we all, in some capacity, work for Keyser Soze. 
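Micah's redirect idea can be pictured from the robot's side as well. A minimal sketch using the libwww-perl (LWP) modules, assuming a server that really does answer a bare directory URL with a 301 pointing at its configured default document; most servers simply serve the file directly, in which case the function falls through and the robot keeps '/' and '/index.html' as separate entries, as Darrin recommends.

# Sketch only: ask the server for its canonical form of a directory URL.
# Assumes libwww-perl (LWP); the redirect behaviour shown is hypothetical.
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;

sub canonical_url {
    my ($url) = @_;
    # simple_request does not follow redirects, so the 301 is visible to us
    my $res = $ua->simple_request(HTTP::Request->new(GET => $url));
    if ($res->code == 301 && defined $res->header('Location')) {
        return $res->header('Location');   # e.g. http://www.foo.org/info/index.html
    }
    return $url;   # no hint from the server; treat '/' as its own entry
}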
============================================================================ From owner-robots Mon Apr 22 13:43:11 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10174; Mon, 22 Apr 96 13:43:11 -0700 Date: Mon, 22 Apr 1996 13:41:20 -0700 Message-Id: <199604222041.NAA07986@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: Callig@listserv.ACNS.NWU.EDU, robots@webcrawler.com From: Jared Williams Subject: Mailing list Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I want to start a mailing list on HTML and web site authoring and announcements. Does anyone already know of such a list? If not, how could I start one? Thanks! Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net lllll lllll lll lll lll ll ll llllllllllllllllllll From owner-robots Mon Apr 22 15:10:07 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA10661; Mon, 22 Apr 96 15:10:07 -0700 From: "Mordechai T. Abzug" Message-Id: <199604222209.SAA14021@xsa04.gl.umbc.edu> Subject: Re: Mailing list To: robots@webcrawler.com Date: Mon, 22 Apr 1996 18:09:52 -0400 (EDT) Cc: Callig@listserv.ACNS.NWU.EDU In-Reply-To: <199604222041.NAA07986@norway.it.earthlink.net> from "Jared Williams" at Apr 22, 96 01:41:20 pm X-Mailer: ELM [version 2.4 PL25] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 664 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "JW" == Jared Williams spake thusly: JW> JW> I want to start a mailing list on HTML and web site authoring and JW> announcements. Does anyone already know of such a list? If not, how could I No, you don't -- such a mailing list would soon have thousands of messages a day. There's already are USENET groups in the comp.infosystems.www.* hierarchy, including comp.infosystems.www.authoring.html, which have immense amounts of traffic. Use these (or should I say wade through these) instead. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu Scanning for viruses. . . Windows detected. Delete? From owner-robots Mon Apr 22 17:53:42 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA11646; Mon, 22 Apr 96 17:53:42 -0700 Message-Id: <317C2A23.5D18@earthlink.net> Date: Mon, 22 Apr 1996 17:53:55 -0700 From: Jared Williams Organization: Web Knitter X-Mailer: Mozilla 2.01 (Win95; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Mailing list References: <199604222209.SAA14021@xsa04.gl.umbc.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mordechai T. Abzug wrote: > > "JW" == Jared Williams spake thusly: > JW> > JW> I want to start a mailing list on HTML and web site authoring and > JW> announcements. Does anyone already know of such a list? If not, how could I > > No, you don't -- such a mailing list would soon have thousands of messages a > day. There's already are USENET groups in the comp.infosystems.www.* > hierarchy, including comp.infosystems.www.authoring.html, which have immense > amounts of traffic. Use these (or should I say wade through these) instead. > > -- > Mordechai T. Abzug > http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu > Scanning for viruses. . . Windows detected. Delete? 
What would you think if I didn't give the opertunity to announce web sites? Just HTML tips? Jared Williams Web Knitter (TM) http://home.earthlink.net/~williams From owner-robots Mon Apr 22 19:15:46 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12158; Mon, 22 Apr 96 19:15:46 -0700 Message-Id: <9604230214.AA18951@marys.smumn.edu> Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable From: Kevin Hoogheem Date: Mon, 22 Apr 96 21:21:09 -0500 To: robots@webcrawler.com Subject: Re: Mailing list References: <199604222209.SAA14021@xsa04.gl.umbc.edu> <317C2A23.5D18@earthlink.net> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There are plenty of mailing list that do that already... I think = you should search around on them and maybe just compile a listing of them all = and see what they are there for. if there is not one that you like then maybe think about it if you = have the time space and will to do it ;)- From owner-robots Mon Apr 22 19:46:14 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12306; Mon, 22 Apr 96 19:46:14 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 22 Apr 1996 19:46:54 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: word spam Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > What started this thread is the observation that smarter indexing >could result in better query results. Search engines that understand the >difference between a text word, a title word and an HTML tag will invariably >return better results for simple queries than one that doesn't. Do you >disagree with this conclusion? Our engine does this. Take a look at this: http://www.verity.com/vlibsearch.html To search for a word in the title and the body, but weight the results higher if it's in the title, you'd use a query like this: [.7](robot title),robot The implicit weight of a term is .5 -- try different weights for the title term and you'll see different results. This query looks for the word robot in the document and the title; if the density of "robot" in the text is equal, those that have "robot" in the title will be ranked higher. It stems, too, so you'll also get documents about robotics, for example. However, I think you'll discover, if you use this quite a bit, that although it is useful, it isn't as great as you might imagine. For one thing, your queries can become quite complex if you want to search on a few terms. On the other hand, the work I'm doing with JavaScript might make it much easier to set weights in titles and such. You can also use the "" syntax to search other HTML zones -- HEAD, BODY, H1, etc. Even wildcards -- "robot h*" will find it in any heading. So... better results, yes, in general, maybe. Weighting title words higher implies putting less emphasis on the document contents, which means you'll decrease recall in some cases (when titles aren't informative). Even if it works, will people use it? Is the difference significant? Will the behavior be unexpected and confuse people who assume they're doing a plain search of text? Should the default behavior be to automatically weight the title and headings higher? Maybe. The real question is how much difference this makes. I'd be curious to hear, especially for real-world search problems. By the way, that URL points to a collection of Web-related documents. 
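Stripped of the query syntax, the weighting Nick describes is simple arithmetic. A toy sketch, not Verity's engine, and the density measure here is a crude stand-in for whatever the real ranking uses: score a document as a weighted sum of how often the term appears in the title and in the body, with the .7 and .5 weights from his example.

# Toy field-weighted ranking in the spirit of [.7](robot title),robot.
# Not Verity's algorithm; just the arithmetic of weighting title hits higher.
sub score_document {
    my ($term, $title, $body) = @_;
    my $title_hits = () = $title =~ /\Q$term\E/gi;
    my $body_hits  = () = $body  =~ /\Q$term\E/gi;
    my $title_score = $title_hits / (1 + word_count($title));
    my $body_score  = $body_hits  / (1 + word_count($body));
    return 0.7 * $title_score + 0.5 * $body_score;
}
sub word_count { my @w = split ' ', $_[0]; return scalar @w; }
# Given equal body density, the document with the term in its title
# comes out ahead, which is the behaviour Nick describes.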
Nick P.S. I'm glad to see Tim and others pointing out that the differences in generic search accuracy among the top engines are relatively small. From owner-robots Mon Apr 22 20:05:51 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12401; Mon, 22 Apr 96 20:05:51 -0700 X-Sender: Mitchell Elster X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Mitchell Elster Subject: RE: Mailing List Date: Mon, 22 Apr 1996 23:16:00 Message-Id: <19960422231600.142e2164.in@BitMaster> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com There is a list that doesn't generate that much traffic called "html_exchange" Check that one out before you create one, you REALLY don't want to start a new one. Mitchell Elster President BitWise Computer Consultants, Inc From owner-robots Mon Apr 22 22:08:44 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12737; Mon, 22 Apr 96 22:08:44 -0700 From: "Mordechai T. Abzug" Message-Id: <199604230508.BAA07766@umbc10.umbc.edu> Subject: Re: Mailing list To: robots@webcrawler.com Date: Tue, 23 Apr 1996 01:08:38 -0400 (EDT) In-Reply-To: <317C2A23.5D18@earthlink.net> from "Jared Williams" at Apr 22, 96 05:53:55 pm X-Mailer: ELM [version 2.4 PL25] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 819 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com "JW" == Jared Williams spake thusly: JW> What would you think if I didn't give the opertunity to announce web JW> sites? Just HTML tips? JW> JW> Jared Williams JW> Web Knitter (TM) JW> http://home.earthlink.net/~williams JW> If you moderated it -- a fearful job! -- it might not have *too* much traffic. If you don't moderate it -- well, let's just say that I'm not sending this message through the list, 'cuz it really has nothing to do with robots. ;> On a sleepy list like robots, the occaisional transgression is OK, but things don't scale well. . . If you really want to, give it a try just to see what'll happen, but I think you'll get swamped. -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu The Magic of Windows: Turns a 486 back into a PC/XT. From owner-robots Tue Apr 23 04:22:23 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13478; Tue, 23 Apr 96 04:22:23 -0700 Message-Id: <199604231222.NAA17878@fedro> X-Sender: x8035952@fedro.ugr.es (Unverified) X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 23 Apr 1996 12:34:06 +0100 To: robots@webcrawler.com From: Ricardo Eito Brun Subject: Re: Robots in the client? Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Thanks for your information: I have tried to locate this two robots; I haven't found anything about Simon, and Fish Search seems to have become a robot in the same sense as Lycos, Infoseek, etc.., of course with less coverage. Could you tell me how to contact with information about SIMON? If possible I would be grateful if you may correct me: I think a personal robot is a kind of search engine that resides in our own personal computer, that's to say, I give the application a set of URL's adress and it gather html files from these sites. After that it builds a local database with these data. Is it all right? 
If it's, when we speak about MOMSpider (freely distributed), can we consider this robot in the same sense and categorize it as a personal robot?. Thank in advance. There was a program called SIMON which worked with Mosaic for X-Windows. The Fish Search was built into a version of Mosaic by Paul De Bra (debra@win.tue.nl) I never used either, but I did some research myself about a year ago. Bonnie Scott Prodigy Services Company > > I'm trying to recover exhaustive information > about robots which can run in oun client WWW browsers. > I have only read something about Arachnidus?. If you > can give me some information about some other application > or about the performance of such a tools, I would be > gratefull. > > Thanks in advance: > > > > From owner-robots Tue Apr 23 05:05:45 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13558; Tue, 23 Apr 96 05:05:45 -0700 Message-Id: <199604231305.OAA18455@fedro> X-Sender: x8035952@fedro.ugr.es (Unverified) X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 23 Apr 1996 13:17:41 +0100 To: robots@webcrawler.com From: Ricardo Eito Brun Subject: About Mother of All Bulletin Boards Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Can anybody give me some information about 'The Mother of all Bulletin Boards'? I have read this was one of the first attempts to generate a searchable index of the Internet, in a kind of 'distributed indexing'. Thank in advance. From owner-robots Tue Apr 23 06:53:14 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13838; Tue, 23 Apr 96 06:53:14 -0700 X-Sender: mcbr@piper.cs.colorado.edu Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 23 Apr 1996 07:56:58 -0600 To: robots@webcrawler.com From: Oliver.McBryan@cs.colorado.edu (Oliver A. McBryan) Subject: Re: About Mother of All Bulletin Boards Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Its at http://wwwmbb.cs.colorado.edu/~mcbryan/bb/summary.html Oliver McBryan; Oliver.McBryan@cs.colorado.edu Phone: 303-6650544 and 303-4923898; Cell: 303-8097804; Fax: 303-4922844 Dept of Computer Science, Univ. of Colorado, Boulder, CO 80309-0430. WWW: http://www.cs.colorado.edu/~mcbryan/Home.html From owner-robots Tue Apr 23 07:43:29 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14039; Tue, 23 Apr 96 07:43:29 -0700 Date: Tue, 23 Apr 1996 07:42:58 -0700 From: Gordon Bainbridge Message-Id: <199604231442.HAA27230@BASISinc.com> To: robots@webcrawler.com Subject: Re: Mailing list Content-Length: 936 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I want to start a mailing list on HTML and web site authoring and > announcements. Does anyone already know of such a list? If not, how could I > start one? > First check out the archives of existing newsgroups and mailing lists. This will give you some idea about volume. One archive is at: http://asknpac.npac.syr.edu/cgi-bin/news/hypermail_home A word of caution: this site is undergoing changes right now; yesterday I was bounced to a chat server when I tried to access it. Today, a mailing list archive I normally read is not there. You can also find JavaScript mailing archives at: http://www.obscure.org/javascript/archives/ This mailing list stopped operation in March, by the way, because it could not deal with the heavy load. Try sending e-mail to some mailing list administrators for advice before attempting one yourself. 
Now let's get back to robots. Gordon Bainbridge Software Engineer BASIS Inc From owner-robots Tue Apr 23 08:20:22 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14191; Tue, 23 Apr 96 08:20:22 -0700 Message-Id: <317CF52C.4DAA@austin.ibm.com> Date: Tue, 23 Apr 1996 10:20:12 -0500 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 3.0b2 (X11; I; AIX 1) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Mailing list References: <199604230508.BAA07766@umbc10.umbc.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Mordechai T. Abzug wrote: well, let's just say that I'm not sending this > message through the list, 'cuz it really has nothing to do with robots. ;> Then how did I get it? Is this the "how to build and administer spiders and robots" mailing list or the "uh, I want to do ___________ with the web" mailing list"? I'd like for someone to say: I coded ____________ to do this part of the program and the result was ____________. This way we can share ideas about how to make more intelligent web clients, you know? I don't want to know who's spider sucks the most (or the least) I just want to know that something I didn't have time to try either worked or failed. -- Rob Turk Unofficially Speaking. It looks like blind screaming hedonism won out. From owner-robots Tue Apr 23 20:50:47 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18461; Tue, 23 Apr 96 20:50:47 -0700 From: mred@neosoft.com Message-Id: <199604240350.WAA08092@sam.neosoft.com> To: robots@webcrawler.com X-Mailer: Post Road Mailer (Green Edition Ver 1.05d) Date: Tue, 23 Apr 1996 22:47:28 CST Subject: Re: default documents Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ** Reply to note from 04/22/96 12:07pm EDT > Like for example, what if, instead of just sending back the > document, the server sent the actual true location as a > redirect (code 301). Say the robot decides to get the > This may be helpful for robots, but since > so many URLs are published in the 'short cut' format, I wonder > whether or not dealing with this redirection frequently > would slow down normal browsers. It's not a question of slowdown, but rather, it's one of undermining symbolic linking in UNIX. -Ed- mred@neosoft.com From owner-robots Tue Apr 23 22:40:41 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18742; Tue, 23 Apr 96 22:40:41 -0700 From: "Michael Carnevali, Student, FHD" Organization: Technical University Darmstadt To: robots@webcrawler.com Date: Wed, 24 Apr 1996 07:40:22 +0200 Subject: Re: Robots in the client? Priority: normal X-Mailer: Pegasus Mail for Windows (v2.23DE) Message-Id: <14FBA463CF3@hrz1.hrz.th-darmstadt.de> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Ricardo Eito Brun wrote: > I think a personal robot is a kind of search engine > that resides in our own personal computer, that's to say, > I give the application a set of URL's adress and it gather > html files from these sites. After that it builds a local > database with these data. Is it all right? Have you ever tried Quarterdeck`s Webcompas? IMHO it isn`t a personal robot, but you can do all the things you explained and it works with a standard Web-Bowser. In fact Webcompas uses other Search Engines to retrieve informations and builds a local database. I will check out the program during the next weeks. 
So, if you have further questions, just send a mail to my private address. Michael Carnevali st002065@hrz1.hrz.th-darmstadt.de From owner-robots Tue Apr 23 23:55:58 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18906; Tue, 23 Apr 96 23:55:58 -0700 From: rvaquero@gugu.usal.es (Jose Raul Vaquero Pulido) Message-Id: <9604240730.AA22872@gugu.usal.es> Subject: search engine To: robots@webcrawler.com Date: Wed, 24 Apr 1996 08:30:20 +0100 (GMT+0100) X-Mailer: ELM [version 2.4 PL22] Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Length: 886 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hi everybody: First, sorry for my bad english. I study the use of the search engine, its effectivite, etc. Now, I am looking for shareware of search engine that running on Unix or Windows platform. Anyone know where can I find shareware about it? Also, I am looking for statistics of use information in the most important search engine (how long do visitors usually stay on its site?, where are visitors coming form and what content are they looking?, etc.). I need desperately some track. Thank for all. *************************************** ** Jose Raul Vaquero Pulido ** ** rvaquero@gugu.usal.es ** ** Universidad de Salamanca ** *************************************** From owner-robots Wed Apr 24 03:22:17 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19181; Wed, 24 Apr 96 03:22:17 -0700 From: "Andrey A. Krasov" Organization: BINP RAS, Novosibirsk To: robots@webcrawler.com Date: Wed, 24 Apr 1996 17:21:55 +0700 Subject: Quiz playing robots ? Priority: normal X-Mailer: Pegasus Mail v3.31 Message-Id: Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Does anybody know about subject From owner-robots Wed Apr 24 04:22:16 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19307; Wed, 24 Apr 96 04:22:16 -0700 Date: Wed, 24 Apr 1996 07:22:32 -0400 Message-Id: <1.5.4.16.19960424072244.2657d3aa@tre.thewild.com> X-Sender: swood@tre.thewild.com X-Mailer: Windows Eudora Light Version 1.5.4 (16) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: "Scott W. Wood" Subject: Re: search engine Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 08:30 AM 4/24/96 +0100, you wrote: >Hi everybody: > First, sorry for my bad english. > I study the use of the search engine, its effectivite, etc. Now, >I am looking for shareware of search engine that running on Unix or >Windows platform. Anyone know where can I find shareware about it? Try looking at the code for Harvester. It is available from the university of Colorado. Scott From owner-robots Wed Apr 24 07:51:16 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA19890; Wed, 24 Apr 96 07:51:16 -0700 Message-Id: <317E3FE0.353C@austin.ibm.com> Date: Wed, 24 Apr 1996 09:51:12 -0500 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 3.0b2 (X11; I; AIX 1) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: search engine References: <9604240730.AA22872@gugu.usal.es> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jose Raul Vaquero Pulido wrote: > > Hi everybody: > First, sorry for my bad english. That's okay! > I study the use of the search engine, its effectivite, etc. 
Now, > I am looking for shareware of search engine that running on Unix or > Windows platform. Anyone know where can I find shareware about it? Check out some info first: http://agents.www.media.mit.edu/groups/agents/ http://thule.mt.cs.cmu.edu:8001/braustubl/ http://thule.mt.cs.cmu.edu:8001/webants/ http://www.cs.bham.ac.uk/~amw/agents http://www.media.mit.edu/ I've been able to find interesting stuff at these web sites. > Also, I am looking for statistics of use information in the most > important search engine (how long do visitors usually stay on its site?, > where are visitors coming form and what content are they looking?,> etc.). You'll have to ask someone else. Here's some popular spider-type URLs: ALTA VISTA: http://www.altavista.digital.com/ EXCITE: http://www.excite.com/ LYCOS http://www.lycos.com/ WEBCRAWLER: http://www.webcrawler.com/ The World Wide Web Worm -- http://guano.cs.colorado.edu/wwww/ YAHOO: http://www.yahoo.com/ http://www.yahoo.com/Computers/World_Wide_Web/ > ** Jose Raul Vaquero Pulido > ** Universidad de Salamanca -- Rob Turk Unofficially Speaking. Every solution breeds new problems. From owner-robots Wed Apr 24 10:50:40 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20840; Wed, 24 Apr 96 10:50:40 -0700 Comments: Authenticated sender is From: "Jakob Faarvang" Organization: Jubii / cybernet.dk To: robots@webcrawler.com Date: Wed, 24 Apr 1996 19:48:47 +0100 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: search engine Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Message-Id: 17514730101831@cybernet.dk Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > From: Rob Turk > http://agents.www.media.mit.edu/groups/agents/ > http://thule.mt.cs.cmu.edu:8001/braustubl/ > http://thule.mt.cs.cmu.edu:8001/webants/ > http://www.cs.bham.ac.uk/~amw/agents > http://www.media.mit.edu/ Great list, Rob! Thanks! - Jakob Faarvang robot developer / sysadmin / etc.. From owner-robots Wed Apr 24 15:23:50 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA22738; Wed, 24 Apr 96 15:23:50 -0700 Date: Wed, 24 Apr 1996 15:22:00 -0700 Message-Id: <199604242222.PAA07419@norway.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: ADV-HTML@UA1VM.UA.EDU, robots@webcrawler.com From: Jared Williams Subject: Try robot... Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com The other day I downloaded the MOMspider software and followed some instruction to turn it into a full fleged web wanderer. Now that I have a web wanderer all I need now is a server to put it on. Does anyone know of any servers that are looking for and are willing to have a web wanderer on there sight? Thanks! Jared Williams Want a NICE SITE? 
Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net lllll lllll lll lll lll ll ll llllllllllllllllllll From owner-robots Wed Apr 24 16:41:25 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24365; Wed, 24 Apr 96 16:41:25 -0700 Date: Wed, 24 Apr 1996 16:40:32 -0700 Message-Id: <199604242340.QAA00545@iceland.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: dateline@nbc.com, ADV-HTML@UA1VM.UA.EDU, brit@worlds.net, Sutter_Kunkel@spiderisland.com, VIRUS-L@LEHIGH.EDU, robots@webcrawler.com From: Jared Williams Subject: Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >From: owner-merchants@mail-it.com >Date: Wed, 24 Apr 1996 15:06:13 -0700 >>To: merchants@mail-it.com >>From: davidlee@idirect.com (Monica Reyes) >>Subject: INTERNET VIRUS WARNING >Apparently-To: merchants-outgoing@atlas.earthlink.net > >> >>From: FSHG@aol.com >>Date: Wed, 24 Apr 1996 16:03:20 -0400 >>To: ecash@digicash.com >>Subject: Fwd: FWD>Virus on the Internet!! >>Sender: owner-ecash@digicash.com >>Precedence: bulk >> >>Please pass along! >>--------------------- >>Forwarded message: >>Subj: Fwd: FWD>Virus on the Internet!! >>Date: 96-04-24 15:23:12 EDT >>From: R0CKE >>To: FBFam,P1CCOLO,DKirch6184,JJSpirit >>To: FSHG >> >> >>--------------------- >>Forwarded message: >>From: Helen_Ross@tbwachiat.com (Helen Ross) >>To: r0cke@aol.com (Ellen Mahoney) >>Date: 96-04-23 10:03:59 EDT >> >>>From HoloGate FWD>Virus on the Internet!!!!!=C9 >>elle, fyi-helle >> >>-------------------------------------- >>Date: 4/22/96 11:11 AM >>From: Kristen Henderson >> >>Killer Internet Email Virus, BEWARE!!!!! >> >>SUBJECT: VIRUSES--IMPORTANT PLEASE READ IMMEDIATELY >> >>There is a computer virus that is being sent across the >>Internet. If you >>receive an email message with the subject line "Good Times", DO >>NOT read >>the message, >>DELETE it immediately. Please read the messages below. >> >>Some miscreant is sending email under the title "Good Times" >>nationwide, >>if you get anything like this, DON'T DOWN LOAD THE FILE! It has a >>virus >>that rewrites >>your hard drive, obliterating anything on it. >> >>Please be careful and forward this mail to anyone you care about. >> >>************************************************************************** >> >> WARNING!!!!!!! INTERNET VIRUS >>************************************************************************** >> >> >> >>The FCC released a warning last Wednesday concerning a matter of major >>importance to any regular user of the Internet. Apparently a >>new computer >>virus has been engineered by a user of AMERICA ON LINE that is >>unparalleled >>in its >>destructive capability. Other more well-known viruses such as >>"Stoned", >>"Airwolf" and "Michaelangelo" pale in comparison to the >>prospects of this >>newest creation by a warped mentality. What makes this virus so >>terrifying, >>said the >>FCC, is the fact that no program needs to be exchanged for a new >>computer >>to be infected. It can be spread through the existing email >>systems of the >>Internet. >>Once a Computer is infected, one of several things can happen. >>If the >>computer contains a hard drive, that will most >>likely be destroyed. >> >> If the program is not stopped, the computer's processor >>will be placed in an nth-complexity infinite binary loop -which can >>severely >>damage the processor if left running that way too long. 
>>Unfortunately, most novice computer users will not realize what is >>happening until it is far too late. Luckily, there is one sure >>means of >>detecting what is now known as the "Good Times" virus. It always >>travels >>to new computers the same way in a text email message with the >>subject line >>reading >>"Good Times". Avoiding infection is easy once the file has been >>received >>simply by NOT READING IT! The act of loading the file into the >>mail server's >>ASCII >>buffer causes the "Good Times" mainline program to initialize >>and execute. >>The >>program is highly intelligent- it will send copies of itself to >>everyone >>whose >>email address is contained in a receive-mail file or a sent-mail >>file, if >>it can find one. It will then proceed to trash the computer it is >>running >>on. The >>bottom line there is - if you receive a file with the subject >>line "Good >>Times", delete it immediately! Do not read it" Rest assured that >>whoever's >>name >>was on the "From" line was surely struck by the virus. Warn your >>friends and >>local >>system users of this newest threat to the Internet! It could >>save them a >>lot of time and money. >> Could you pass this along to your global mailing list as well? >> >>>>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= >>=3D*=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D*=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= >>=3D=3D=3D=3D=3D=3D=3D=3D*=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >>>>U.S.Miracle Communications,Inc. admin@miracle.net Tel: (860) >>>523-5677 >>>>West Hartford, Connecticut, 06107 Fax: (860) >>>523-5805 >>>>for more info http://www.miracle.net or send mail to >>>sales@miracle.net >> >>AN EASY PRECAUTION IS TO SET [EASY OPEN] on your EUDORA SWITCHES to OFF !! >> >>Monica >> Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net lllll lllll lll lll lll ll ll llllllllllllllllllll From owner-robots Wed Apr 24 18:08:18 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA24988; Wed, 24 Apr 96 18:08:18 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 24 Apr 1996 18:08:59 -0700 To: williams@earthlink.net From: narnett@Verity.COM (Nick Arnett) Subject: "Good Times" hoax Cc: robots@webcrawler.com Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jared, The "Good Times" virus is an old hoax. It's a good idea to check the FAQs on this sort of thing before hitting the panic button and sending it to lots of places. Nick P.S. Craig Shergold is fine and doesn't want any more cards. If you don't know what this means, just tuck it away in your memory where it'll be triggered by any Internet references to Craig. 
From owner-robots Wed Apr 24 18:25:02 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25036; Wed, 24 Apr 96 18:25:02 -0700 Message-Id: <9604250123.AA22566@marys.smumn.edu> Mime-Version: 1.0 (NeXT Mail 3.3 v118.2) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit From: Kevin Hoogheem Date: Wed, 24 Apr 96 20:31:02 -0500 To: robots@webcrawler.com Subject: Re: References: <199604242340.QAA00545@iceland.it.earthlink.net> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Retards thats all I have to say ;)(- From owner-robots Wed Apr 24 19:27:19 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25183; Wed, 24 Apr 96 19:27:19 -0700 From: Vince Taluskie Message-Id: <199604250227.WAA07311@kalypso.cybercom.net> Subject: Re: To: robots@webcrawler.com Date: Wed, 24 Apr 1996 22:27:13 -0400 (EDT) In-Reply-To: <199604242340.QAA00545@iceland.it.earthlink.net> from "Jared Williams" at Apr 24, 96 04:40:32 pm X-Mailer: ELM [version 2.4 PL24] Content-Type: text Content-Length: 619 Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Actually, I think that Jared Williams is the best example I've seen of an e-mail virus.... day after day his clueless stuff keeps appearing in my mailbox... Please make it stop! :) Aaaaiiiiieeeeee.... Vince [ Bogus hoax deleted ] -- ___ ____ __ | _ \/ __/| \ Vince Taluskie, at Fidelity Investments Boston, MA | _/\__ \| \ \ Pencom Systems Administration Phone: 617-563-8349 |_| /___/|_|__\ vince@pencom.com Pager: 800-253-5353, #182-6317 -------------------------------------------------------------------------- "We are smart, we make things go" From owner-robots Wed Apr 24 22:13:01 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25524; Wed, 24 Apr 96 22:13:01 -0700 Message-Id: <199604250320.WAA11591@whistler.dorm.net> Comments: Authenticated sender is From: "Andy Warner" To: robots@webcrawler.com Date: Thu, 25 Apr 1996 00:12:56 +0000 Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Subject: Re: Try robot... Priority: normal X-Mailer: Pegasus Mail for Win32 (v2.30) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Those of us who want to play with MOMSpider already have downloaded and installed it. I doubt you'll find any takers. BTW: There was more than one instruction if you did things properly. Untarring it, the make file, etc... Fledged had a "D" in it. The version of there you meant to use is "their". Sigs are supposed to be four (4) lines or less, not 16 like yours. Please read the FAQ. Was it Mark Twain who said: "It is better to keep your mouth shut and let people assume you're and idiot than to open it and remove all doubt." On 24 Apr 96 at 15:22, Jared Williams wrote: The other day I downloaded the MOMspider software and followed some instruction to turn it into a full fleged web wanderer. Now that I have a web wanderer all I need now is a server to put it on. Does anyone know of any servers that are looking for and are willing to have a web wanderer on there sight? 
--------------huge .sig snipped------------- -- Andy Warner andy@andy.net From owner-robots Thu Apr 25 05:36:45 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26113; Thu, 25 Apr 96 05:36:45 -0700 Message-Id: <01BB32F7.2629B8C0@pluto.planets.com.au> From: David Eagles To: "'robots@webcrawler.com'" Subject: RE: "Good Times" hoax Date: Thu, 25 Apr 1996 22:32:47 +-1000 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BB32F7.263159E0" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ------ =_NextPart_000_01BB32F7.263159E0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit The "Good Times" virus is an old hoax. It's a good idea to check the FAQs on this sort of thing before hitting the panic button and sending it to lots of places. P.S. Craig Shergold is fine and doesn't want any more cards. If you don't know what this means, just tuck it away in your memory where it'll be triggered by any Internet references to Craig. *Grin* Good to see jokes of this nature _do_ actually make it right around the world (unlike my mail :-( ) However, I'm a little concerned people on technical mailing lists can be fooled by such obvious junk??? Regards, David
------ =_NextPart_000_01BB32F7.263159E0-- From owner-robots Thu Apr 25 06:41:54 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26250; Thu, 25 Apr 96 06:41:54 -0700 From: "Bill Day" Message-Id: <9604250841.ZM21@i14.msi.umn.edu> Date: Thu, 25 Apr 1996 08:41:43 -0500 In-Reply-To: Vince Taluskie "Re:" (Apr 24, 10:27pm) References: <199604250227.WAA07311@kalypso.cybercom.net> X-Mailer: Z-Mail (3.2.2 10apr95 MediaMail) To: robots@webcrawler.com Subject: Re: Re: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On Apr 24, 10:27pm, Vince Taluskie wrote: > Actually, I think that Jared Williams is the best example I've seen of > an e-mail virus.... day after day his clueless stuff keeps appearing > in my mailbox... Please make it stop! :) > > Aaaaiiiiieeeeee.... > > Vince > > [ Bogus hoax deleted ] > > -- > ___ ____ __ > | _ \/ __/| \ Vince Taluskie, at Fidelity Investments Boston, MA > | _/\__ \| \ \ Pencom Systems Administration Phone: 617-563-8349 > |_| /___/|_|__\ vince@pencom.com Pager: 800-253-5353, #182-6317 > -------------------------------------------------------------------------- > "We are smart, we make things go" >-- End of excerpt from Vince Taluskie You just made my day Vince! :-) -b- -- Bill Day | 612-624-0533 NSF Research Fellow | day@msi.umn.edu Supercomputer Institute | www.msi.umn.edu/~day/ From owner-robots Thu Apr 25 09:21:31 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26751; Thu, 25 Apr 96 09:21:31 -0700 Subject: Re: Magic, Intelligence, and search engines From: YUWONO BUDI To: robots@webcrawler.com Date: Fri, 26 Apr 1996 00:20:55 +0800 (HKT) In-Reply-To: <2.2.32.19960419151900.009f25d8@giant.mindlink.net> from "Tim Bray" at Apr 19, 96 08:19:00 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Content-Length: 1247 Message-Id: <96Apr26.002104hkt.19046-23165+226@uxmail.ust.hk> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > Right. Hear hear. People have been, since about 1975, saying "wouldn't > it be wonderful if search engines were intelligent." And every couple > of years, some little venture-cap-funded startup comes along and says > "HUZZAH! We've made search engines intelligent!" If you believe > in precision/recall [which might be useful if it could be measured] > the numbers show a discouraging lack of progress in the last 20 years > at making engines intelligent. > > Not that we don't make progress... but mostly on user interfaces, > data structures & algorithms, feedback mechanisms, document strucures, > indexing efficiency, distributed search. > > Why is all this? Because to be intelligent, the software would have to, > for an arbitrary web page, be able to discern what it's about. In a > multi-lingual fashion, at that. Such software does not currently exist. I don't think that is the central issue. I'd say, to be truely intelligent, the software would have to be able to understand what the user wants based solely on his/her imprecise (typically, due to the lack of means for expressing his/her information needs) query. Knowing what an arbitrary web page is about is only a small part of the whole enterprise, IMO. -Bud.
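For readers outside the retrieval field, the precision/recall numbers Tim alludes to are the standard textbook measures; nothing here is specific to any engine on this list, and the figures in the example are made up purely for illustration.

# Precision and recall as usually defined (textbook definitions only):
#   precision = relevant documents retrieved / total documents retrieved
#   recall    = relevant documents retrieved / relevant documents in the collection
# e.g. 20 relevant hits in an answer set of 50, out of 80 relevant overall:
my $precision = 20 / 50;   # 0.4
my $recall    = 20 / 80;   # 0.25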
From owner-robots Thu Apr 25 12:03:03 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27583; Thu, 25 Apr 96 12:03:03 -0700 Message-Id: <199604252003.VAA04320@fedro> X-Sender: x8035952@fedro.ugr.es (Unverified) X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 25 Apr 1996 21:30:31 +0100 To: robots@webcrawler.com From: Ricardo Eito Brun Subject: About integrated search engines Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Can anybody give me some adress about robots that performs searches in several index at the same time? I'm only know MetaCrawler, SavvySearch and InfoMarket from IBM. Thank in advance. From owner-robots Thu Apr 25 13:25:32 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27956; Thu, 25 Apr 96 13:25:32 -0700 Date: Thu, 25 Apr 1996 13:25:25 -0700 (PDT) Message-Id: <199604252025.NAA27645@iceland.it.earthlink.net> X-Sender: williams@earthlink.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Jared Williams Subject: [ MERCHANTS ] My Sincerest Apologies Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Hello! This is the "e-mail virus's" response! Sorry for posting that stupid hoax. This morning I just received a letter from the person that put that message in MY box. I've forwarded it all to you because I think that it parrallels my thoughts exactly. So here it is: >Date: Thu, 25 Apr 1996 06:54:52 -0700 >To: merchants@mail-it.com >From: davidlee@idirect.com (Monica Reyes) >Subject: [ MERCHANTS ] My Sincerest Apologies >Sender: owner-merchants@mail-it.com >Reply-To: merchants@mail-it.com > >Good Morning Everyone, > >I have been thoroughly trashed by everyone, including our very own loving >nerds, because the 'GOOD TIMES' - VIRUS is a HOAX. During the night I have >personally replied to hundreds of letters and will admit to this small >merchant community ... I feel like an idiot, a fool, stupid and I accept >that I wasted your time, and I broadcasted trash, garbage, @#%$. > >I have always respected your time, and since the launching of the Merchants' >Forum - I have been very careful not to clutter your mailboxes with constant >mailouts. I thank everyone in this small community who took the time to let >me know ... of this hoax - and sincerely apologise for this slip in judgement. > >My forte is marketing, and I am still at a loss at 'how a computer works'. I >still remember siding with the group that believes that ' micro computers >will never be accepted by business'. Therefore the word 'VIRUS' created a >monstrous image in my mind, and the fact that 'viruses cannot be transmitted >by email' never clicked. > >Yet, I view everyone of you out there as friends. A few years back, I was >caught in a fire on the 12th floor because I thought it was a false alarm. >The time before that, friends in my condo joked around because I'd always >walk the 12 flights of stairs when the fire alarm went off. They also >assured me that the condo was fire resistant. I almost died from the smoke >in the condo, and had to be rescued. From that day, I chose penance, >caution, instead of being smart. If I had to go through this kind of night >1,000 times, I'd gladly do it, than be the reason 'why a friend' will suffer >loss when I could have prevented it. > >To everyone who wanted to unsubscribe, I will be doing it right after this >letter. 
I have stayed up all night to personally reply to ... as many as I >can. I did not have time. Please understand O.K. > >I have taken up enough of your time .... >I apologise ... >Please do not trash me for writing this letter ! > >Your friend > >Monica >end > > >Visit Around The World Showcase > >http://www.spectranet.ca/~lee/agenda.htm - Merchants' Forum >http:// webcom.net/~peak/frames/busdir.htm - Around The World Storefronts >http://www.echo-on.net/~millions/s/l/atwlinks.html - Link Up Your StoreFront >http://www.echo-on.net/~millions/s/c/atwsclas.html - Place A FREE Classified Ad >http://www.globedirect.com/~abelynx/doitgd.htm - Link up Hundreds of Locations >http://www.spectranet.ca/~lee/erotica2.htm - Art Deco Erotic Gallery >http://toronto.ark.com/~abedon/bwerogal.htm - The Most Beautiful Women in >the World >http://webcom.net/~peak/frames/artgdir.htm - The Best Graphic Designers >World Wide >http://www.spectranet.ca/~lee/hotdoor.htm - The Erotic Outlink Gallery >http://www.spectranet.ca/~aaron/astro2.html - Free Numerology Report >http://toronoto.ark.com/~abedon/egalmem.htm - Free Pics >Come visit the Hottest Spot in Cyberspace. >Working Harder to Serve The World Better !! > > > > > > p.s. You'll notice my signature is only 9 lines long now... ;) Jared Williams Want a NICE SITE? Visit Web Knitter (R) http://home.earthlink.net/~williams e-mail: williams@earthlink.net From owner-robots Thu Apr 25 14:56:18 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28360; Thu, 25 Apr 96 14:56:18 -0700 Message-Id: <9604252156.AA28354@webcrawler.com> X-Sender: ulicny@alonzo.limbex.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Brian Ulicny Subject: Re: About integrated search engines Date: Thu, 25 Apr 96 21:54:41 -0700 (PDT) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Quarterdeck's WebCompass does metasearch from the client side. At 09:30 PM 4/25/96 +0100, you wrote: >Can anybody give me some adress about robots that >performs searches in several index at the same >time? I'm only know MetaCrawler, SavvySearch and >InfoMarket from IBM. >Thank in advance. > Brian Ulicny From owner-robots Thu Apr 25 22:20:14 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA02121; Thu, 25 Apr 96 22:20:14 -0700 Date: Fri, 26 Apr 1996 00:20:04 -0500 (CDT) Message-Id: <199604260520.AAA08985@wins0.win.org> X-Sender: kfischer@pop.win.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: kfischer@mail.win.org (Keith D. Fischer) Subject: Re: About integrated search engines X-Mailer: Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm in the process of writing a meta-search engine. Actually the meta-search engine is done; I'm trying to implement increased functionality including that of artificial inteligence techniques. Mostly it is under construction and slow... give it some time and you may be pleased with the results. It searches in parallel, the problem lies in the bandwidth and the server speed. It is written in perl ... soon to be C. Thanks for the interest: http://science.smsu.edu/robot/meta/index.html Keith D. Fischer >Can anybody give me some adress about robots that >performs searches in several index at the same >time? I'm only know MetaCrawler, SavvySearch and >InfoMarket from IBM. >Thank in advance. 
> > > From owner-robots Fri Apr 26 09:51:21 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA03781; Fri, 26 Apr 96 09:51:21 -0700 Message-Id: <3180FEFF.41C6@austin.ibm.com> Date: Fri, 26 Apr 1996 11:51:11 -0500 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 3.0b2 (X11; I; AIX 1) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: Apologies || communal bots References: <199604252025.NAA27645@iceland.it.earthlink.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Jared Williams wrote: > Sorry for posting that stupid hoax. > This morning I just received a letter from the person that put that message > in MY box. I've forwarded it all to you because I think that it parrallels > my thoughts exactly. Can you please get a clue? I don't care about why you send this junk to a discussion list about HTTP-related robots. Just stop doing it! If you're not going to write about internet agents, web bots, spiders, then don't send your message to this mailing list! > p.s. You'll notice my signature is only 9 lines long now... ;) BFD. Please respect my time and not fill up my inbox with crapola. On another Note: I was wondering just now if communal bots could use RPC to log their information in a central site, with multiple bots depositing their findings in a clearinghouse kind of thing. Is this how the Web Ants model works? Would this be a resource saver (in terms of CPU cycles and network bandwidth) or what? -- Rob Turk Unofficially Speaking. If two men agree on everything, you may be sure that one of them is doing the thinking. -- Lyndon Baines Johnson From owner-robots Fri Apr 26 10:50:55 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04056; Fri, 26 Apr 96 10:50:55 -0700 From: Bonnie Scott Message-Id: <199604261754.NAA16875@elephant-int.prodigy.com> Subject: Re: communal bots To: robots@webcrawler.com Date: Fri, 26 Apr 1996 13:54:20 -0400 (EDT) In-Reply-To: <3180FEFF.41C6@austin.ibm.com> from "Rob Turk" at Apr 26, 96 11:51:11 am X-Mailer: ELM [version 2.4 PL24alpha3] Content-Type: text Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > On another Note: > > I was wondering just now if communal bots could use RPC to log their information in a > central site, with multiple bots depositing their findings in a clearinghouse kind of thing. > Is this how the Web Ants model works? Would this > be a resource saver (in terms of CPU cycles and network bandwidth) or what? I'm not sure if Harvest would meet all your needs, but I give it two thumbs up for its levels of indexing that allow multiple applications for data collected once (multiple brokers talking to multiple brokers and gatherers), common format (SOIF; compatibilty with WAIS or Glimpse), and its easy customizability. Would that corporate politics didn't make things so difficult... Could you tell us more about RPC and its advantages? Would I want to use RPC to make the Glimpse or WAIS database available directly? (Any security risks or other disadvantages?) Bonnie Scott Prodigy Services Company From owner-robots Fri Apr 26 11:53:05 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04354; Fri, 26 Apr 96 11:53:05 -0700 Message-Id: <199604261852.OAA20547@play.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: communal bots In-Reply-To: Your message of "Fri, 26 Apr 1996 13:54:20 EDT." 
<199604261754.NAA16875@elephant-int.prodigy.com> Date: Fri, 26 Apr 1996 14:52:51 -0400 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com webhackers, if RPC is coming up, might as well mention CORBA http://www.cilabs.org which is an OOP paradigm for RPC. there's tons of info on RPC on the net and there's a decent book from o'reilly on it. sun initiated rpc, in the open tradition, so if you have access to sunos or solaris you certainly have a lot of info on rpc in the man pages (etc) there. rpc has a trusted hosts model and a network login model. if you're familiar with nfs -- that's built on rpc. it's really very flexible but a lot of systems hackers think it's too slow and rather write their own sockets code tailored to each problem. there's rpc (RPCDEV.ZIP) for OS/2 on hobbes. i thought harvest uses rpc.. -john From owner-robots Fri Apr 26 12:40:50 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA04614; Fri, 26 Apr 96 12:40:50 -0700 Message-Id: <318126B8.446B@austin.ibm.com> Date: Fri, 26 Apr 1996 14:40:40 -0500 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 3.0b2 (X11; I; AIX 1) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: communal bots References: <199604261754.NAA16875@elephant-int.prodigy.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Bonnie Scott wrote: > I'm not sure if Harvest would meet all your needs, but I give it two thumbs up for its levels of indexing that allow multiple applications for data collected o Thanks for the recommendation. I've been using other's Perl scripts and libraries to do link checks on an internal site (a package called webxrefs.pl) that was satisfactory. (I mean that it worked great doing what it was supposed to do, but the people that I had to give its output to were slightly overwhelmed by its output format...I think it told them "too much") > Could you tell us more about RPC and its advantages? Would I want to use RPC to make the Glimpse or WAIS database available directly? (Any security risks or disadvantages?) RPC (Remote Procedure Call) involves piping processes over a network, and I am by no means an expert on the protocol. I'm not sure how Glimpse or WAIS database query results (is that what you're asking?) would benefit. Those tools may already utilize RPC operations. -- Rob Turk Unofficially Speaking. Is it tax freedom day yet? From owner-robots Fri Apr 26 23:58:55 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA06529; Fri, 26 Apr 96 23:58:55 -0700 Message-Id: <3181C636.7FE9@netvision.net.il> Date: Sat, 27 Apr 1996 10:01:10 +0300 From: Frank Smadja X-Mailer: Mozilla 2.01 (WinNT; I) Mime-Version: 1.0 To: robots@webcrawler.com Subject: Re: About integrated search engines References: <199604260520.AAA08985@wins0.win.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Keith D. Fischer wrote: > > I'm in the process of writing a meta-search engine. Actually the meta-search > engine is done; I'm trying to implement increased functionality including > that of artificial inteligence techniques. Mostly it is under construction > and slow... give it some time and you may be pleased with the results. It > searches in parallel, the problem lies in the bandwidth and the server > speed. It is written in perl ... 
soon to be C. Thanks for the interest: > > http://science.smsu.edu/robot/meta/index.html > > Keith D. Fischer > Sounds interesting. Please send us some explanation about what is done and how. Like "more precisely", "less precisely," etc. Frank Smadja smadja@netvision.net.il From owner-robots Sat Apr 27 08:56:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07394; Sat, 27 Apr 96 08:56:24 -0700 Date: Sat, 27 Apr 96 08:56:19 -0700 Message-Id: <9604271556.AA07388@webcrawler.com> X-Sender: Mitchell Elster X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Mitchell Elster Subject: To: ???? Robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I've got a robot I'm thinking of creating. Only I don't care about indexing HTML Docs. I'm looking for people. Any help will be appreciated. Goal: To create a searchable database of e-mail address TTR : Robot would be run from 10:00pm - 12:00am (Midnight) DTC : Email address and full name of recipient(if possible) - Company, etc.. Why : Because one of my clients could use the information, if I can get enough general information about each email account. (down to the city would be great!) And, I feel that by putting the database online could prove usefull to alot of other people and companies. ENV : Windows NT/VB/SQL-SERVER TTR = Times To Run DTC = Data To Collect ENV = Our Working Environment My problem is, I need a hand getting started. What tools out there can help me get started? Is there anyone doing this already? Any help is better than NO help at all! Mitchell Elster President BitWise Computer Consultants, Inc From owner-robots Sat Apr 27 10:21:08 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA07562; Sat, 27 Apr 96 10:21:08 -0700 Message-Id: <2.2.32.19960427171955.002d6054@pop.tiac.net> X-Sender: wadland@pop.tiac.net X-Mailer: Windows Eudora Pro Version 2.2 (32) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 27 Apr 1996 13:19:55 -0400 To: robots@webcrawler.com From: Ken Wadland Subject: Re: To: ???? Robot Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 08:56 AM 4/27/96 -0700, you wrote: > I've got a robot I'm thinking of creating. Only I don't care about indexing >HTML Docs. I'm looking for people. Any help will be appreciated. > > Goal: To create a searchable database of e-mail address Sounds to me like you're trying to collect addresses so you can send out junk mail. We have enough of that already. Whose addresses are you collecting? Are you planning to search just within your company or the entire web? -- Ken Wadland wadland@engsoftware.com From owner-robots Sat Apr 27 20:28:04 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA08763; Sat, 27 Apr 96 20:28:04 -0700 X-Sender: Mitchell Elster X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Mitchell Elster Subject: Re: To: ???? Robot Date: Sat, 27 Apr 1996 23:38:06 Message-Id: <19960427233806.0493ef80.in@BitMaster> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I said: >> Goal: To create a searchable database of e-mail address You Replied: >Sounds to me like you're trying to collect addresses so you can send out >junk mail. We have enough of that already. I say to that: No, my goal is not to produce junk mail! My client is a Private Investigative firm. 
Their goal is to have another avenue in which to generate leads to people they are trying to locate. I Said: >> I've got a robot I'm thinking of creating. Only I don't care about indexing >>HTML Docs. I'm looking for people. Any help will be appreciated. You asked: >Whose addresses are you collecting? Are you planning to search just within >your company or the entire web? >-- Ken Wadland >wadland@engsoftware.com I reply: I'm looking to index only in the U.S.A. starting with the East Coast. For the amount of information that I am looking to index, I plan to start with a 500meg SQL Server database, scaling up to 2-4gig, as necessary. As I have NO idea how many records would be generated (millions I'm sure), I plan to start with only the east coast, monitoring closely the data being collected. Mitchell Elster President BitWise Computer Consultants, Inc From owner-robots Sun Apr 28 01:54:44 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA09075; Sun, 28 Apr 96 01:54:44 -0700 Message-Id: <01BB34A5.5D688640@pax-ca1-15.ix.netcom.com> From: chris cobb To: "'robots@webcrawler.com'" Subject: RE: To: ???? Robot Date: Sun, 28 Apr 1996 01:52:23 -0400 Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="---- =_NextPart_000_01BB34A5.5D7DE300" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com ------ =_NextPart_000_01BB34A5.5D7DE300 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit A rather good email address collection attempt has been done by OKRA: http://okra.ucr.edu/okra ---------- From: Mitchell Elster[SMTP:elsterm@bwcc.com] Sent: Saturday, April 27, 1996 7:38 PM To: robots@webcrawler.com Subject: Re: To: ???? Robot I said>> Goal: To create a searchable database of e-mail address You Replied: >Sounds to me like you're trying to collect addresses so you can send out >junk mail. We have enough of that already. I say to that: No, my goal is not to produce junk mail! My client is a Private Investigative firm. Their goal is to have another avenue in which to generate leads to people they are trying to locate. I Said: >> I've got a robot I'm thinking of creating. Only I don't care about indexing >>HTML Docs. I'm looking for people. Any help will be appreciated. You asked: >Whose addresses are you collecting? Are you planning to search just within >your company or the entire web? >-- Ken Wadland >wadland@engsoftware.com I reply: I'm looking to index only in the U.S.A. starting with the East Coast. For the amount of information that I am looking to index, I plan to start with a 500meg SQL Server database, scaling up to 2-4gig, as necessary. As I have NO idea how many records would be generated (millions I'm sure), I plan to start with only the east coast, monitoring closely the data being collected. 
Mitchell Elster President BitWise Computer Consultants, Inc ------ =_NextPart_000_01BB34A5.5D7DE300-- From owner-robots Mon Apr 29 05:55:21 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12110; Mon, 29 Apr 96 05:55:21 -0700 Date: Mon, 29 Apr 1996 08:55:24 -0400 Message-Id: <1.5.4.16.19960429085546.434f25fe@tre.thewild.com> X-Sender: swood@tre.thewild.com X-Mailer: Windows Eudora Light
Version 1.5.4 (16) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: "Scott W. Wood" Subject: Private Investigator Lists Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Junk mail or no junk mail, I still get worried everytime I see the likes of the Internet White pages, doofus's adding me to chain mails or someone creating bots to collect email. You may have good intentions, but it is just a matter of time before you reach those dire financial straights and decide the names collected are a 'liquidatable asset' From owner-robots Mon Apr 29 08:13:29 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA12488; Mon, 29 Apr 96 08:13:29 -0700 Message-Id: <3184DC90.1CFB@austin.ibm.com> Date: Mon, 29 Apr 1996 10:13:21 -0500 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 3.0b2 (X11; I; AIX 1) Mime-Version: 1.0 To: robots@webcrawler.com Subject: [Fwd: Re: To: ???? Robot] Content-Type: multipart/mixed; boundary="------------3F54FF6ABD" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com This is a multi-part message in MIME format. --------------3F54FF6ABD Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Well, this has to do with robots, in that, what they want to do shouldn't be done with robots. I'm offering this to the developer community for opinions: What do you guys think? --------------3F54FF6ABD Content-Type: message/rfc822 Content-Transfer-Encoding: 7bit Content-Disposition: inline Received: from netmail.austin.ibm.com by toolbox.austin.ibm.com (AIX 4.1/UCB 5.64/4.03-client-2.6) for rturk at ; id AA33872; Mon, 29 Apr 1996 10:04:57 -0500 Received: from toolbox.austin.ibm.com (toolbox.austin.ibm.com [129.35.203.131]) by netmail.austin.ibm.com (8.6.12/8.6.11) with SMTP id KAA58182; Mon, 29 Apr 1996 10:04:55 -0500 Received: from toolbox.austin.ibm.com by toolbox.austin.ibm.com (AIX 4.1/UCB 5.64/4.03-client-2.6) for rturk@austin.ibm.com at austin.ibm.com; id AA34890; Mon, 29 Apr 1996 10:04:51 -0500 Sender: rturk@austin.ibm.com Message-Id: <3184DA93.59E2@austin.ibm.com> Date: Mon, 29 Apr 1996 10:04:51 -0500 From: Rob Turk Organization: IBM Worldwide AIX Support Tools Development X-Mailer: Mozilla 3.0b2 (X11; I; AIX 1) Mime-Version: 1.0 To: Mitchell Elster Cc: rturk@megalith.com Subject: Re: To: ???? Robot References: <19960427233806.0493ef80.in@BitMaster> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Mitchell Elster wrote: > > I said: > >> Goal: To create a searchable database of e-mail address The web is already searchable, and sometimes the documents that comprise it contain tags that refer back to the author of the document. Most of the e-mail addresses found in a given search would simply be web freaks...not the kind of person to be involved in things that P.I.'s care about...though that could make a mighty "fresh" episode of yet-another-cop show. Don't you think? Most of the e-mail-like strings of text in web docs relate to the author of the document. Other ones go to mailing lists or are forwarded to different people based on characteristics of the context of the message. This leads me to believe that searching the web for e-mails would not be the best way to accomplish your client's goals. > > I Said: > >> I've got a robot I'm thinking of creating. Only I don't care about indexing > >>HTML Docs. I'm looking for people. Any help will be appreciated. 
> > I reply: > I'm looking to index only in the U.S.A. starting with the East Coast. For > the amount of information that I am looking to index, I plan to start with a > 500meg SQL Server database, scaling up to 2-4gig, as necessary. As I have NO > idea how many records would be generated (millions I'm sure), I plan to > start with only the east coast, monitoring closely the data being collected. > Okay, the web is this *distributed* network of networks. It has few geographical references that would enable one to say "This occurred in _____________" about any given document. A good many of the documents that would be found if a search were made for one of my old e-mail addresses would refer to Austin, TX. Any particular e-mail address you found on one of those documents that refer to Austin may refer to someone who lives in Austin, TX. Perhaps not. For a P.I., the way to get names is to offer to buy them from websites willing to sell in the market that you're pursuing. See, if people are filling in a form for say, to register for the Annual Private Investigators Ball in Baltimore, Maryland then you'd be able to say "These people are going to be in Maryland (East Coast, remember?) on this or that date." Now, that information you could BROKER to the P.I.'s themselves, or anyone else who wants to know. Kinda scary isn't it? See this kind of thing would lead to a diminished trust for users to fill out CGI's that exact personal information, a commodity that must be respected by merchants in the new economic frontier. If the merchants violate the personal information of their customers, then they will lose customers. That's basic business offline, but for some reason mostly ignored by all the P.I.'s with $$$ in their eyes, and all their insurance-salesperson ilk. But really, you don't need a robot to do what you're asking, although the process of making website CGI's manage visitor information is VERY similar to a robot. -- Rob Turk Unofficially Speaking. --------------3F54FF6ABD-- From owner-robots Mon Apr 29 11:17:24 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13313; Mon, 29 Apr 96 11:17:24 -0700 Message-Id: <199604291817.OAA15227@play.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: To: ???? Robot In-Reply-To: Your message of "Sat, 27 Apr 1996 08:56:19 PDT." <9604271556.AA07388@webcrawler.com> Date: Mon, 29 Apr 1996 14:17:13 -0400 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > I've got a robot I'm thinking of creating. Only I don't care about indexing > HTML Docs. I'm looking for people. Any help will be appreciated. > > Goal: To create a searchable database of e-mail address reGoal: interface to things found at gopher://gopher.utn.umn.edu -john From owner-robots Mon Apr 29 11:40:26 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13414; Mon, 29 Apr 96 11:40:26 -0700 Message-Id: <199604291840.OAA15384@play.cs.columbia.edu> To: robots@webcrawler.com Subject: Re: To: ???? Robot In-Reply-To: Your message of "Sat, 27 Apr 1996 13:19:55 EDT." <2.2.32.19960427171955.002d6054@pop.tiac.net> Date: Mon, 29 Apr 1996 14:40:14 -0400 From: "John D. Pritchard" Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I've got a robot I'm thinking of creating. Only I don't care about indexing > >HTML Docs. I'm looking for people. Any help will be appreciated. 
> > > > Goal: To create a searchable database of e-mail address > > Sounds to me like you're trying to collect addresses so you can send out > junk mail. We have enough of that already. totally, and if you do send junk mail, i think it's pretty well understood that you'll have trouble. this isn't a threat, it's an observation. i hope no one is stupid enough to do unsolicited commercial mail. it's such an industrial, pre-internet thing to do. -john From owner-robots Mon Apr 29 13:02:51 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA13699; Mon, 29 Apr 96 13:02:51 -0700 Date: Mon, 29 Apr 1996 16:05:18 -0400 Message-Id: <9604292005.AA28286@super.mhv.net> X-Sender: tsmith@pop.mhv.net X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Terry Smith Subject: Re: To: ???? Robot] Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com [included ]Mitchell Elster wrote: Goal: To create a searchable database of e-mail address TTR : Robot would be run from 10:00pm - 12:00am (Midnight) DTC : Email address and full name of recipient(if possible) - Company, etc.. Why : Because one of my clients could use the information, if I can get enough general information about each email account. (down to the city would be great!) And, I feel that by putting the database online could prove usefull to alot of other people and companies. [snip] Summary: I'm not in full agreement with what Rob Turk says. I think there's an interesting robot here, but I'm not sure the goal described above will produce *that* robot. I also agree with Rob that it could make a good TV episode. As a side note: Private Investigators are in the business of invading people's privacy. All we can do is hope they have ethics and responsibility. I wouldn't want to start a privacy thread here, but I've included two anecdotes below for your amusement. I see two definitions (not exclusive) of robot: one is the indexer of information as a front end to standard search engines (Lycos, Yahoo etc). The other is a more focused searcher for specific information: what many call an agent. When a PI goes hunting for a missing person (courtesy an unfortunate friend ... child support), one of the first things they do is hit Lexus/Nexus to see if the target has made the news. This picks up marriage announcements, some classifieds, as well as traditional news. You don't always come up lucky, but diligence demands that this be an early stop on the trail. When I use the net to find someone, I start with Deja News: search for their name and permutations that are likely email names. I get back articles which I filter (John would never be posting to alt-rec.music.makers). I then use the who-where type servers, and finally hunt the open web and gopher. Now this would be a neat process to automate -- if only to compress the transfer time. It's an agent. The code is a specific web robot. I imagine a PI firm that built the agent and created a DB could sell this to other PI firms (a la Nexus) for a fixed fee. It's already useful. For example: Mitchell Elster -- (via who-where) --> elsterm@us.net us.net -- (via rs.internic.net) --> Silver Spring, MD 20904-1735 If us.net is considered a small ISP, then we have a name and a city. Not a bad start. As more and more people consider an email address a necessity, this will grow beyond the web-head (Rob Turk's phrase) and into the general population.
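The "permutations that are likely email names" step Terry describes can be sketched in a few lines of Perl. The pattern list below is only a guess at what such permutations might look like, not his actual procedure, and each downstream source (Usenet archive, who-where style directory) would still need its own query step.

#!/usr/bin/perl
# Sketch: generate likely e-mail account names for a person.
# The pattern list is an assumption for illustration only.
my ($first, $last) = @ARGV;
die "usage: $0 first last\n" unless $first && $last;

($first, $last) = (lc $first, lc $last);
my $fi = substr($first, 0, 1);            # first initial
my $li = substr($last, 0, 1);             # last initial

my @guesses = (
    $first, $last,
    "$first$last",  "$first.$last", "${first}_${last}",
    "$fi$last",     "$fi.$last",
    "$first$li",    "$last$fi",
);

# Print each candidate once, e.g. to feed into a directory or news search.
my %seen;
for my $g (@guesses) {
    print "$g\n" unless $seen{$g}++;
}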
------------------------------------------ Fun Anecdotes -------------------------------------- I needed a consultant in another city to do some on-site work for a client. I used Deja News to search .jobs.*. Saw a subject line that caught my eye. Click on the hyperlink, read the text. Thinking it was a good lead, I clicked on the gent's name (a hyperlink in Deja News) thinking it was a mail-to link. It actually gave me his posting summary. In addition to learning about the severity of the problem that sent him to the newsgroups, I had the opportunity to learn more about his sexual interests than was really necessary. When my friend was working with the PI (several years ago), she was asked the names of his best friends. It seems the PI had someone inside the phone company who, for $200 per phone bill, could produce a copy of all long distance calls made. Just as illegal as hacking the Bell system computers, but I know lots more people with $200 than good hackers. Frightening, eh? From owner-robots Mon Apr 29 15:27:35 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14458; Mon, 29 Apr 96 15:27:35 -0700 X-Sender: Mitchell Elster X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Mitchell Elster Subject: Re: To: ???? Robot Date: Mon, 29 Apr 1996 18:38:13 Message-Id: <19960429183813.0dce190e.in@BitMaster> Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Message replying to is quoted at bottom: Unsolicited mail on the internet is a waste of the keystrokes it takes to delete/create it; there is no doubt about that. It's pretty evident from the message traffic, and personal replies, that I've hit a nerve around here. And I concur that most of the points being brought up are valid. A few of the personal replies almost got me thinking of writing a 'bot to sniff out mail directed to XXXXX and delete it before it ever gets there (you know who you are!!), effectively killing their I-Net mail for a bit. But alas, I am not like that. In any case, I have opted to give my client a more viable solution. They are now the proud owners of (3) three PhoneDisk USA software packages. Their quest to invade people using the internet mail world is over for now. I am glad that there are people, like us in this LIST, who can debate the use of 'Bots on the web. Although the revenue generated, had I performed his development effort, would have been useful, it's a job I am willing to turn down. Thanks for all the input!!! -Mitchell Elster President BitWise Computer Consultants, Inc. Microsoft Certified Product Specialist E-Mail: president@bwcc.com or elsterm@bwcc.com At 02:40 PM 4/29/96 -0400, you wrote: > > > >> > I've got a robot I'm thinking of creating. Only I don't care about indexing >> >HTML Docs. I'm looking for people. Any help will be appreciated. >> > >> > Goal: To create a searchable database of e-mail address >> >> Sounds to me like you're trying to collect addresses so you can send out >> junk mail. We have enough of that already. > >totally, and if you do send junk mail, i think it's pretty well understood >that you'll have trouble. this isn't a threat, it's an observation. i >hope no one is stupid enough to do unsolicited commercial mail. it's such >an industrial, pre-internet thing to do.
> >-john > > > From owner-robots Mon Apr 29 15:53:48 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA14555; Mon, 29 Apr 96 15:53:48 -0700 Date: Mon, 29 Apr 1996 18:50:01 -0400 Message-Id: <199604292250.SAA05432@inigo.cybernex.net> X-Sender: sran@bc.cybernex.net (Unverified) X-Mailer: Windows Eudora Version 1.4.4 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: bsran@admin.nj.devry.edu (Bhupinder S. Sran) Subject: Looking for a search engine Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I think this question was raised before, but I seem to have lost the email messages relating to it. I am looking for a search engine that I can put on a company's server to index their local URL's (which are located on several servers). Where can I find one? Are there any public domain engines? Thanks. Bhupinder S. Sran :) :> :-) :> :) :-> :} :] :-) :> :} :> :-) :) :> :) :} :-) :) :> :) :> Bhupinder S. Sran, Professor, CIS Department :> PHONE: 908-634-3460 DeVry Institute, Woodbridge, NJ 07095 :> FAX: 908-634-7614 EMAIL: bsran@admin.nj.devry.edu :> HOME PAGE: http://admin.nj.devry.edu/~bsran :> :) :> :-) :> :) :-> :} :] :-) :> :} :> :-) :) :> :) :} :-) :) :> :) :> From owner-robots Tue Apr 30 10:52:38 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA17627; Tue, 30 Apr 96 10:52:38 -0700 X-Sender: mak@surfski.webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 30 Apr 1996 10:55:13 -0700 To: robots@webcrawler.com From: m.koster@webcrawler.com (Martijn Koster) Subject: Admin: List archive is back Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com After a few weeks of absence while we were adding machines, the list archives are back on-line. During the absence messages were still spooled, so the archive is up-to-date. In this new configuration there will be a slightly longer delay before messages appear, in the order of a few hours. Web access should be faster than before. FYI, mailing list membership seems to be hovering around 400. Happy browsing, -- Martijn Email: m.koster@webcrawler.com WWW: http://info.webcrawler.com/mak/mak.html From owner-robots Tue Apr 30 12:03:48 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA18115; Tue, 30 Apr 96 12:03:48 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 30 Apr 1996 12:04:22 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Admin: List archive is back Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >After a few weeks of absence while we were adding machines, >the list archives are back on-line. During the absence messages >were still spooled, so the archive is up-to-date. I have also kept my scrolling HyperMail-ish archive (the last two weeks' of messages) alive here: http://www.mccmedia.com/robots/ Nick Sounds like you want low cost...? If you're using the Microsoft server, you can download a free version of our engine from www.verity.com. If you're using Netscape, a free version of our engine will be included in the 2.0 release. I'm not sure if it's in the current beta. Otherwise, you might look at Glimpse, which is used in Harvest, the University of Colorado distributed search project. There are lots of others, too, but I have enough trouble keeping track of the commercial ones without trying to keep an eye on the p.d. ones.
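For the do-it-yourself route behind these suggestions, a toy indexer over a fixed list of local URLs might look like the sketch below. The file names (urls.txt and the wordindex DBM file) are made up, the HTML stripping is deliberately crude, and there is no ranking, stemming or robots.txt handling; packaged engines such as Glimpse or the commercial ones do far more.

#!/usr/bin/perl
# Sketch: map each word to the URLs containing it, over a fixed URL list
# ("urls.txt", one URL per line -- a hypothetical file name).
use LWP::Simple qw(get);

my %index;
dbmopen(%index, "wordindex", 0644) or die "can't open wordindex: $!";

open(URLS, "urls.txt") or die "can't read urls.txt: $!";
while (my $url = <URLS>) {
    chomp $url;
    next unless $url;
    my $page = get($url);
    next unless defined $page;
    $page =~ s/<[^>]*>/ /g;                       # crude tag stripping
    my %words = map { lc($_) => 1 }
                ($page =~ /([A-Za-z][A-Za-z0-9'-]{2,})/g);
    for my $w (keys %words) {
        my $list = $index{$w} || '';
        $index{$w} = $list . "$url "              # space-separated URL list
            unless index($list, "$url ") >= 0;
    }
}
close URLS;
dbmclose(%index);
print "Done; ", scalar(keys %index), " distinct words indexed.\n";

A companion CGI script would split a query into words, look each one up in the same DBM file, and intersect (AND) or union (OR) the resulting URL lists to answer simple Boolean searches.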
Nick From owner-robots Tue Apr 30 21:38:44 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20429; Tue, 30 Apr 96 21:38:44 -0700 Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Wed, 1 May 1996 13:40:54 +0900 To: robots@webcrawler.com From: mark@gol.com (Mark Schrimsher) Subject: Re: Looking for a search engine Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Sounds like you want low cost...? If you're using the Microsoft server, >you can download a free version of our engine from www.verity.com. If >you're using Netscape, a free version of our engine will be included in the >2.0 release. I'm not sure if it's in the current beta. Nick: For these built in search engines, what is the interface to them? Can you write your own HTML interface, calling some canned command to trigger the searching? --Mark From owner-robots Tue Apr 30 22:26:57 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA20588; Tue, 30 Apr 96 22:26:57 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Tue, 30 Apr 1996 22:27:36 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: Looking for a search engine Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >For these built in search engines, what is the interface to them? Can you >write your own HTML interface, calling some canned command to trigger the >searching? We use ordinary HTML forms, which opens up new possibilities for custom interfaces with Java and JavaScript, too. For example, I've just finished a first cut at a new JavaScript interface to our engine. It's here: http://www.verity.com/demo/javascript/v3/frames.htm IMPORTANT -- this is only tested with Netscape Navigator Atlas PR2 (the latest beta pre-release of Netscape 3.0). However, early reports indicate that it works with Navigator Gold 2.x. However, there are errors with Nav 2.x, which I'll try to clean up soon... but no guarantees. I'm co-authoring an O'Reilly book on JavaScript and this is one of our examples, so our target is Nav 3.0. This example also uses JavaScript to format the search results, although that can be done by the server, of course. This part of it uses a Perl script on the server (to format the results into a simple script) but expect open standards that would allow JavaScript-based search results to be interoperable. The off-the-self product formats HTML results using template files, though it's easy to intercept them with Perl or whatever, of course. Nick From owner-robots Wed May 1 14:21:58 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA23885; Wed, 1 May 96 14:21:58 -0700 Date: Wed, 1 May 1996 17:09:22 -0400 (EDT) From: Brian Fitzgerald Subject: topical search tool -- help?! To: robots@webcrawler.com Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I'm looking to find a tool for the web that would allow me to search for new sites in apprx. 300 distinct topical areas. I need this tool to autonomously search, and categorize the information such that a human could rifle through it in a timely fashion to find appropriate new links. Am I looking for *a* robot? many robots? a spider? where can i go to find the information that i'm looking for? does this technology exist in the commercial market, or in research circles? 
I'd really appreciate any help that anyone can give. Brian Fitzgerald From owner-robots Wed May 1 15:47:19 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA25212; Wed, 1 May 96 15:47:19 -0700 Message-Id: <9605012247.AA25206@webcrawler.com> X-Sender: ulicny@alonzo.limbex.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Brian Ulicny Subject: Re: topical search tool -- help?! Date: Wed, 1 May 96 22:46:19 -0700 (PDT) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 05:09 PM 5/1/96 -0400, Brian Fitzgerald wrote: >I'm looking to find a tool for the web that would allow me to search for >new sites in apprx. 300 distinct topical areas. I need this tool to >autonomously search, and categorize the information such that a human >could rifle through it in a timely fashion to find appropriate new links. That's pretty much a description of what Quarterdeck's WebCompass does (as well as automatically summarize and produce a list of keywords). Please excuse the plug. You can check out a browse only (non-editable) version at the Limbex site (URL below). It sells commercially for (I think) around $90 now. Best, Brian Ulicny Limbex Corporation 13160 Mindanao Way, Suite 234 Marina Del Rey, CA 90292 USA (310) 309-4281 x4505 (office/vmail) (310) 309 4282 (fax) http://www.limbex.com/ (URL) From owner-robots Wed May 1 19:39:00 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26479; Wed, 1 May 96 19:39:00 -0700 Message-Id: <9605020238.AA26473@webcrawler.com> X-Sender: ulicny@alonzo.limbex.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Brian Ulicny Subject: Re: topical search tool -- help?! Date: Thu, 2 May 96 02:37:04 -0700 (PDT) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com I should have mentioned that I was referring to WebCompass Professional Edition, not WebCompass Personal Edition, which can be downloaded for free from the Quarterdeck Web site (http://www.qdeck.com/). The Personal Edition only does metasearch and summarizes on demand: it doesn't maintain a database of summaries, keywords, etc. WebCompass Professional is the product with the Agent that does the summarization, clustering, etc. At 10:46 PM 5/1/96 -0700, Brian Ulicny wrote: >That's pretty much a description of what Quarterdeck's WebCompass does (as >well as automatically summarize and produce a list of keywords). Please >excuse the plug. > >You can check out a browse only (non-editable) version at the Limbex site >(URL below). >It sells commercially for (I think) around $90 now. > >Best, > >Brian Ulicny >Limbex Corporation >13160 Mindanao Way, Suite 234 >Marina Del Rey, CA 90292 USA >(310) 309-4281 x4505 (office/vmail) >(310) 309 4282 (fax) >http://www.limbex.com/ (URL) > > From owner-robots Wed May 1 20:47:40 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26701; Wed, 1 May 96 20:47:40 -0700 Date: Thu, 2 May 96 12:47:29 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9605020347.AA18249@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: topical search tool -- help?! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >That's pretty much a description of what Quarterdeck's WebCompass does (as >well as automatically summarize and produce a list of keywords). Please >excuse the plug. 
> Looks offhand like a great product, and I imagine we should be seeing more and more such agent-type products. Especially speaking from an internet-wide systems perspective, but also from a product user's perspective, I'm concerned about its dependence on the existing mega-search engines. First, presumably a lot of agents querying the mega-search engines as a backgroud task will increase the load on those engines, yes? Also, I assume that you don't copy the advertisement of the search-engine page over to your WebCompass GUI. This is great for the user, but I wonder if it might cause the mega-search engines to somehow identify agent queries (say by patterns or volume of usage?) and cut them off or otherwise degrade their service. If nothing else, I assume that their advertisers would want to know what percentage of their business is from "eyeballs", and what percentage is from (advertisement- oblivious) agents. PF From owner-robots Wed May 1 21:58:21 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA26998; Wed, 1 May 96 21:58:21 -0700 Date: Thu, 2 May 96 13:58:14 JST From: francis@cactus.slab.ntt.jp (Paul Francis) Message-Id: <9605020458.AA18704@cactus.slab.ntt.jp> To: robots@webcrawler.com Subject: Re: topical search tool -- help?! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com > > I'm looking to find a tool for the web that would allow me to search for > new sites in apprx. 300 distinct topical areas. I need this tool to > autonomously search, and categorize the information such that a human > could rifle through it in a timely fashion to find appropriate new links. > A plug of my own... We have a research project going that we hope in the not too distant future will lead to this kind of functionality. What it does is automatically set up an alternative web topology (fully distributed), whereby the links are topical rather than explicit pointers as in HTML. The resulting topology is then searched from the local site robot-style (at search time). Our hope is that 1) this approach will scale well, 2) by pushing the control of both indexing and searching to the users, we can improve the quality of searches overall, and 3) other interesting applications can be built on the distributed infrastructure. One of the features of our navigator/robot that would help you is that you can search for things that are new since your last search in the topical area of interest. Some drawbacks to the scheme for now are 1) that we first have to make it work, and 2) that a large number of people have to locally install it. We are just about release our alpha version. (Embarassingly, I said about the same thing last February! But this time it is true. We have finished the binaries we plan to release, and are now cleaning up the documentation and just waiting for after the Paris WWW conference, where we are exhibiting.) Anyway, check us out at http://www.ingrid.org. We are in general looking for people to play with our alpha code. 
PF From owner-robots Wed May 1 22:24:47 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA27045; Wed, 1 May 96 22:24:47 -0700 From: Date: Thu, 02 May 96 14:27:04 JST Message-Id: <9605028310.AA831072424@issmtpgw.tyo.hq.jri.co.jp> To: robots@webcrawler.com Subject: cc:Mail SMTPLINK Undeliverable Message Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com User shirai@tyo.hq.jri.co.jp is not defined From owner-robots Thu May 2 07:06:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28115; Thu, 2 May 96 07:06:06 -0700 Date: Thu, 02 May 1996 16:04:54 +0100 (MET) From: Fred Melssen Subject: Indexing a set of URL's To: robots@webcrawler.com Cc: MELSSEN@AZNVX1.AZN.NL Message-Id: <01I48IF6J97C0006DK@AZNVX1.AZN.NL> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII Content-Transfer-Encoding: 7BIT Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com We have a manually crafted list of topic-specific URLs. We maintain and document this list by hand. In order to facilitate public Boolean keyword searching across all URLs, we want to implement a robot for this purpose. This robot has to: - index all HTML documents. We don't know yet what kind of indexing (parsing entire HTML documents...) and result valuation we will need to use; - provide a flexible and scalable interface to the indexed information; The index part is the difficult one.
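Since the URL list is maintained by hand, the robot mostly needs to notice which documents have changed since the last run, and an HTTP conditional GET keeps that cheap. A minimal sketch, with urls.txt and the lastfetch DBM file as invented placeholders and the actual indexing left out:

#!/usr/bin/perl
# Sketch: re-fetch a hand-maintained URL list with conditional GETs so
# unchanged documents cost only a "304 Not Modified" round trip.
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Date qw(time2str);

my $ua = LWP::UserAgent->new;
$ua->agent("listindexer/0.1");             # identify the robot

my %last;                                  # url -> epoch of last fetch
dbmopen(%last, "lastfetch", 0644) or die "can't open lastfetch: $!";

open(URLS, "urls.txt") or die "can't read urls.txt: $!";
while (my $url = <URLS>) {
    chomp $url;
    next unless $url;
    my $req = HTTP::Request->new(GET => $url);
    $req->header('If-Modified-Since' => time2str($last{$url}))
        if $last{$url};
    my $res = $ua->request($req);
    if ($res->code == 304) {
        print "unchanged: $url\n";
    } elsif ($res->is_success) {
        print "changed:   $url (", length($res->content), " bytes)\n";
        $last{$url} = time;
        # ...hand $res->content to the indexer here...
    } else {
        print "error:     $url (", $res->code, ")\n";
    }
    sleep 5;                               # be polite between requests
}
close URLS;
dbmclose(%last);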
We have a Linux 1.3.93 and the following wishes: - System maintenance should be low-demanding: - disk-use should be minimal and efficient - network traffic should be low, and bandwidth minimal - Maintenance (configuring, updating...) should be as minimal as possible. Probably the Webmaster should be able to maintain the robot. - A feature should be kept for the Web page owner(s) to - add URLs to the searchable database. - We want an advanced search query interface where the users have maximum control over the enumeration of search results. The List of Robots (http://webcrawler.com/mak/projects/robots/active.html) enumerates a few engines, which specifically focus on community- or topic-specific collections of HTML objects: Harvest, Peregrinator (sources not available) and HI Search. HARVESTs motivation reflects ours, as it is indexing community- specific collections, rather than locating and indexing all objects that can be found. But I see a possible drawback in choosing Harvest: Our operating system - Linux 1.3.93 - is not supported by Harvest. Configuring the robot, and keeping it in the air WITHOUT much maintenance, looks like a hard job. We want to make a good choice, and your suggestions and discussion are highly appreciated. Thank you. -fred melssen- ------------------------------------------------------------------------ Fred Melssen | Manager Electronic Information Services P.O.Box 9104 | Centre for Pacific Studies | Phone and fax: 6500 HE Nijmegen | University of Nijmegen | 31-024-378.3666 (home) The Netherlands | Email: melssen@aznvx1.azn.nl | 31-024-361.1945 (fax) | http://www.kun.nl/~melssen | PGP key available ------------------------------------------------------------------------ From owner-robots Thu May 2 07:32:28 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA28212; Thu, 2 May 96 07:32:28 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 2 May 1996 07:33:07 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: topical search tool -- help?! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com Paul and others on the list who'll be at the Paris conference who are working on the distributed search problems -- please try to come to Panel 10, on Internet indexing efficiency. This thread is touching many of the issues that we'll be discussing. Nick P.S. Although I didn't copy the list with my reply to the original message here, it is a perfect application for our knowledgebase capability. From owner-robots Thu May 2 09:59:33 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29057; Thu, 2 May 96 09:59:33 -0700 Message-Id: <9605021659.AA29051@webcrawler.com> X-Sender: ulicny@alonzo.limbex.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Brian Ulicny Subject: Re: topical search tool -- help?! Date: Thu, 2 May 96 16:57:51 -0700 (PDT) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 05:09 PM 5/1/96 -0400, Brian Fitzgerald wrote: >I'm looking to find a tool for the web that would allow me to search for >new sites in apprx. 300 distinct topical areas. I need this tool to >autonomously search, and categorize the information such that a human >could rifle through it in a timely fashion to find appropriate new links. 
That's pretty much a description of what Quarterdeck's WebCompass does (as well as automatically summarize and produce a list of keywords). Please excuse the plug. You can check out a browse only (non-editable) version at the Limbex site (URL below). It sells commercially for (I think) around $90 now. Best, Brian Ulicny Limbex Corporation 13160 Mindanao Way, Suite 234 Marina Del Rey, CA 90292 USA (310) 309-4281 x4505 (office/vmail) (310) 309 4282 (fax) http://www.limbex.com/ (URL) Limbex Corporation 13160 Mindanao Way, Suite 234 Marina Del Rey, CA 90292 USA (310) 309-4281 x4505 (office/vmail) (310) 309 4282 (fax) http://www.limbex.com/ (URL) From owner-robots Thu May 2 11:01:15 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29258; Thu, 2 May 96 11:01:15 -0700 Message-Id: <9605021801.AA29252@webcrawler.com> X-Sender: ulicny@alonzo.limbex.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Brian Ulicny Subject: Re: topical search tool -- help?! Date: Thu, 2 May 96 17:56:55 -0700 (PDT) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 02:27 PM 5/2/96 JST, shirai@tyo.hq.jri.co.jp wrote: >>That's pretty much a description of what Quarterdeck's WebCompass does (as >>well as automatically summarize and produce a list of keywords). Please >>excuse the plug. > >Looks offhand like a great product, and I imagine we >should be seeing more and more such agent-type products. > >Especially speaking from an internet-wide systems >perspective, but also from a product user's perspective, >I'm concerned about its dependence on the existing >mega-search engines. First, presumably a lot of >agents querying the mega-search engines as a backgroud >task will increase the load on those engines, yes? Yes, it will. However, if you turn on your WebCompass Agent at night or configure it to work during off-peak hours (an upcoming feature) that makes efficient use of the search engines and your own time. >Also, I assume that you don't copy the advertisement >of the search-engine page over to your WebCompass GUI. >This is great for the user, but I wonder if it might >cause the mega-search engines to somehow identify >agent queries (say by patterns or volume of usage?) >and cut them off or otherwise degrade their service. >If nothing else, I assume that their advertisers would >want to know what percentage of their business is from >"eyeballs", and what percentage is from (advertisement- >oblivious) agents. > Actually, WebCompass Personal (the freely downloadable metasearch/summary on demand tool) _does_ pass through banner ads from the search engines to the user. Our goal is to cooperate with the search engines on this, as we have been. In this way, their advertisers get more exposure, not less, from metasearching. So everybody wins. Best, Brian Ulicny Limbex Corporation 13160 Mindanao Way, Suite 234 Marina Del Rey, CA 90292 USA (310) 309-4281 x4505 (office/vmail) (310) 309 4282 (fax) http://www.limbex.com/ (URL) From owner-robots Thu May 2 11:07:06 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29308; Thu, 2 May 96 11:07:06 -0700 Message-Id: <9605021807.AA29302@webcrawler.com> X-Sender: ulicny@alonzo.limbex.com X-Mailer: Windows Eudora Light Version 1.5.2 Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" To: robots@webcrawler.com From: Brian Ulicny Subject: Re: topical search tool -- help?! 
Date: Thu, 2 May 96 18:05:28 -0700 (PDT) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com At 07:33 AM 5/2/96 -0700, Nick Arnett wrote: >P.S. Although I didn't copy the list with my reply to the original message >here, it is a perfect application for our knowledgebase capability. > Why be so coy? I, for one, would be very interested to hear about Verity's knowledgebase capability. Can you explain it for us? Best, Brian Ulicny From owner-robots Thu May 2 11:44:59 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29453; Thu, 2 May 96 11:44:59 -0700 X-Sender: narnett@hawaii.verity.com Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 2 May 1996 11:45:31 -0700 To: robots@webcrawler.com From: narnett@Verity.COM (Nick Arnett) Subject: Re: topical search tool -- help?! Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com >Why be so coy? I, for one, would be very interested to hear about Verity's >knowledgebase capability. Can you explain it for us? I wasn't trying to be coy, just trying to stay on topic (no pun intended). The list is about robots, not search. I also don't want every message I post to be about a plug for our products and technologies. I've been hoping that someone independent (not one of the search vendors) would start a list specifically about Internet search. Perhaps that should be one of the first things that Mike Schwartz's new W3C working group does. But since you asked... Topics are tree-like (actually graphs) sets of saved queries, which can include one another, so they can be quite complex. We have several customers who are developing systems that feed the documents found by a robot into our engine with a Topic set that describes the subjects they're looking for. Then you can perform searches on each topic to categorize the information. Or there's a real-time way to do it, with our Agent Server developer tools. The Agent Server can manage a few hundred thousand profiles -- queries -- and compare them to a feed of several documents per second. The agents can be set to take various actions -- put a link on a Web page, e-mail the document, etc. Nick From owner-robots Thu May 2 13:21:46 1996 Return-Path: Received: by webcrawler.com (NX5.67f2/NX3.0M) id AA29749; Thu, 2 May 96 13:21:46 -0700 Message-Id: <199605022021.QAA16586@mail.internet.com> Comments: Authenticated sender is From: "Robert Raisch, The Internet Company" Organization: The Internet Company To: robots@webcrawler.com Date: Thu, 2 May 1996 16:20:39 -0400 Subject: Re: topical search tool -- help?! Priority: normal X-Mailer: Pegasus Mail for Windows (v2.31) Sender: owner-robots Precedence: bulk Reply-To: robots@webcrawler.com On 2 May 96 at 17:56, Brian Ulicny wrote: > At 02:27 PM 5/2/96 JST, shirai@tyo.hq.jri.co.jp wrote: > >Also, I assume that you don't copy the advertisement > >of the search-engine page over to your WebCompass GUI. > >This is great for the user, Ummm... Is it really? If the search-engine's business model is based upon sponsorship through advertising (as many of them are or are soon to become), any diminishment in eyeball-time for the advertising -- that which pays for the muscular hardware and clever software -- will be deemed non-productive by those who provide the service and will either cause them to find ways around the abuse, to change to a for-fee basis, or cause them to go out of business. This is simple economics. 
Brian Ulicny replies to comments that agents might not pass advertising through to the user: "Actually, WebCompass Personal ... _does_ pass through banner ads from the search engines to the user." However, if the agent is to act effectively in the service of its master, should it not proactively integrate results from as many sources as it can parse/understand? If an agent is tasked with searching for everything it can on a subject and places its queries with more than one search-engine (services), how would you suggest it might resolve the issue of integrating the advertising on all results-pages from all engines? Brian continues: "Our goal is to cooperate with the search engines on this, as we have been. In this way, their advertisers get more exposure, not less, from metasearching. So everybody wins." Respect to Brian but I sincerely doubt that everyone wins. The required integration of advertising works against the interests of the search-engines. Ads from competitors might potentially appear on the same page, gathered from different sources, and this reduces the perceived value to the advertiser -- which reduces the fee the advertiser is willing to pay. What I believe is being overlooked here is that the page that the search-engine returns is not simply data and should not be treated as such; it is an editorial product -- at least, this is how those who provide the search-engine view it. If you modify the page, reaping the benefits of the search without paying for it, you are working at odds with those who provide this service. If you don't concur or have difficulties with this viewpoint, try replacing the term: 'search-engine' with 'publisher' throughout this reply and you may begin to see it from their viewpoint. I would like to hear from those who