[Bug-wget] --page-requisites and robot exclusion issue


From: markk
Subject: [Bug-wget] --page-requisites and robot exclusion issue
Date: Sun, 4 Dec 2011 11:57:31 -0000
User-agent: SquirrelMail/1.4.21

Hi,

I'm using wget 1.13.4. There seems to be a problem with wget
over-zealously obeying robot exclusion when --page-requisites is used,
even when only downloading a single URL.

I attempted to download a single web page, specifying --page-requisites so
that the images, CSS and JavaScript files required by the page would also
be downloaded:
  wget -x -S --page-requisites http://www.example.com/path/file.html

In the HTML page downloaded, there was this line:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

The presence of that line causes wget not to download the page requisites.
(And there is nothing in the log output to indicate that it is ignoring
--page-requisites.)

I think wget should not pay attention to robot exclusion when downloading
page requisites.

Typically, you won't know in advance whether a particular page you're
about to download has a robots meta tag in its HTML source. So to ensure
all requisites are downloaded, you have to specify "-e robots=off" every
time you use --page-requisites.

But in cases where you *are* downloading recursively with
--page-requisites, it would be polite to obey the robots exclusion
standard for everything else by default. You can't do that if you have to
use -e robots=off just to ensure all requisites are downloaded.
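To illustrate the dilemma, a recursive fetch like the following (using
the standard -r and -l options) currently has to disable robots handling
globally, so wget also stops honouring robots.txt for the links it
follows:

  wget -r -l 1 -e robots=off --page-requisites http://www.example.com/path/file.html

Ideally, robot exclusion would still apply to which links are followed,
while page requisites are always fetched.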


Mark
