
Re: lynx-dev The traversal limitation


From: David Woolley
Subject: Re: lynx-dev The traversal limitation
Date: Thu, 1 Oct 1998 23:20:57 +0100 (BST)

> 
> RE "well-behaved crawlers" vs "not well-behaved"
> 
> For those of us who know nothing much about robots, and
> whose experience with this idea is that alias "treelynx",
> which seems to do what it advertises -- could someone
> say some more about what makes wget so different?

It's a small program designed for the purpose;

It respects the robots.txt file (try retrieving this from 
a site like www.imdb.com, or www.microsoft.com) which informs robots
where they are not allowed to go;

It fetches the source HTML;

It can relocate URLs to allow the resulting sub-web to be browsed 
locally;

It can mirror a site by only retrieving pages that have changed since
its last visit;

It will automatically retry a failed fetch;

You can set depth limits (dynamic pages are normally blocked by
robots.txt, but I could write a very short CGI script which would put
Lynx into a loop until 2038 - just return a page with a single link
to the script name with /<day-number><process-ID> appended);

etc.
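
To illustrate the depth-limit point, here is a minimal sketch of a depth-limited traversal run against a simulated link trap of the kind described above. Nothing here is taken from wget or Lynx: `trap_links`, `crawl`, and the example.invalid URL are hypothetical names, and the "site" is an in-memory function rather than a real CGI script, so no network access is involved.

```python
from collections import deque

def trap_links(url):
    """Hypothetical stand-in for the looping CGI script: every page
    contains a single link back to the script with extra path appended."""
    return [url + "/1"]

def crawl(start, fetch_links, max_depth):
    """Breadth-first traversal that never follows links found on pages
    that are already max_depth hops away from the start page."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # at the limit: record the page, do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl("http://example.invalid/cgi-bin/loop", trap_links, max_depth=3)
print(len(pages))  # the start page plus three levels below it
```

Without the depth check the queue would never empty, because each fetched page always yields one URL that has not been seen before.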

> 
> (I have not seen any book with robots as subject -- do
> you know of any?  Or with this general subject in mind.)

The original robots were things like Scooter, the program that AltaVista
uses to retrieve pages for indexing. robots.txt was invented to:

- prevent them uselessly indexing dynamic content;
- prevent them indexing unreliable draft material;
- keep them out of private areas;
- protect them from deep, dynamically generated hierarchies;
- prevent them performing expensive operations.
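
As a rough sketch of how a robot might consult such a file, the following uses Python's standard `urllib.robotparser` module on an inline example. The rules, the paths, the agent name `ExampleBot`, and the host example.invalid are all made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering dynamic, draft, and private areas.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /drafts/
Disallow: /private/
"""

def build_parser(text):
    """Parse robots.txt content supplied as a string (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
for path in ("/index.html", "/cgi-bin/search", "/drafts/new.html"):
    allowed = parser.can_fetch("ExampleBot", "http://example.invalid" + path)
    print(path, "allowed" if allowed else "disallowed")
```

A well-behaved robot makes this check before every fetch and simply skips any URL that is disallowed.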

> 
> Vocab: would lynx in treelynx-mode be a "robot"?
 
Yes, but it does not honour robots.txt, so would be treated as hostile
by people like IMDB who don't want to be overloaded by robots and want
to justify their advertising revenue.

The "subscribe to site" mode of IE4 is also a robot, but it is more benign
in that it honours robots.txt and modifies its user agent string to
allow sites to block it easily.

There are a number of much less benign Windows shareware ones, apparently,
which generate misleading user agent strings.
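
A minimal sketch of the server-side check this implies, assuming a site simply looks for a known token in the User-Agent header. The token `ExampleSubscriber` and the sample header strings are invented for illustration and are not the actual strings sent by IE4 or any other product.

```python
# Hypothetical tokens a site has decided to refuse.
BLOCKED_TOKENS = ("ExampleSubscriber",)

def is_blocked(user_agent, blocked_tokens=BLOCKED_TOKENS):
    """Return True if the User-Agent header contains a blocked token
    (case-insensitive substring match)."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocked_tokens)

# A robot that identifies itself honestly is easy to refuse...
print(is_blocked("Mozilla/4.0 (compatible; ExampleSubscriber 1.0)"))
# ...while one that sends a browser-like string passes the same check.
print(is_blocked("Mozilla/4.0 (compatible; SomeBrowser 4.0)"))
```

The check only works against robots that announce themselves, which is why a misleading user agent string defeats it.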

I think there are links to the definition of robots.txt on www.w3.org,
and, if you can find your way around developer.microsoft.com, there 
should be information there.
