[Bug-wget] --page-requisites and robot exclusion issue


From: markk
Subject: [Bug-wget] --page-requisites and robot exclusion issue
Date: Sun, 4 Dec 2011 11:57:31 -0000
User-agent: SquirrelMail/1.4.21

Hi,

I'm using wget 1.13.4. There seems to be a problem with wget
over-zealously obeying robot exclusion when --page-requisites is used,
even when only downloading a single URL.

I attempted to download a single web page, specifying --page-requisites so
that the images, CSS and JavaScript files required by the page would also
be downloaded:
  wget -x -S --page-requisites http://www.example.com/path/file.html

In the HTML page downloaded, there was this line:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

The presence of that line causes wget not to download the page requisites.
(And there is nothing in the log output to indicate that it is ignoring
--page-requisites.)

I think wget should not pay attention to robot exclusion when downloading
page requisites.

Typically, you won't know in advance whether a particular page you're
about to download has a robots meta tag in its HTML source. So to ensure
all requisites are downloaded, you have to specify "-e robots=off" every
time you use --page-requisites.

But in cases where you *are* downloading recursively with
--page-requisites, it would be polite to obey the robots exclusion
standard for everything else by default. You can't do that if you have to
use -e robots=off just to ensure all requisites are downloaded.
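To illustrate the dilemma, a recursive fetch like the following (using
the standard -r and -l options) currently has to disable robots handling
globally, so wget also stops honouring robots.txt for the links it
follows:

  wget -r -l 1 -e robots=off --page-requisites http://www.example.com/path/file.html

Ideally, robot exclusion would still apply to which links are followed,
while page requisites are always fetched.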


Mark
