
[Bug-wget] Re: How to ignore errors with time stamping


From: Morten Lemvigh
Subject: [Bug-wget] Re: How to ignore errors with time stamping
Date: Fri, 12 Dec 2008 12:21:18 +0100
User-agent: Thunderbird 2.0.0.18 (X11/20081125)

Andre Majorel wrote:
On 2008-12-12 09:03 +0100, Morten Lemvigh wrote:

No links on a page with a missing Last-modified header are
scanned if the page is already on disk. If I run:

wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML

--08:51:24-- http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
           => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.136.2, 147.67.136.102, 147.67.119.2, ...
Connecting to eur-lex.europa.eu|147.67.136.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9.709 (9.5K) [text/html]
Last-modified header missing -- time-stamps turned off.
08:51:24 (82.42 KB/s) - `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' saved [9709/9709]
[....]

wget will retrieve the page and continue recursively getting all the linked pages, as I would expect.

OK. This is normal.

If I issue this command a second time, all I get is this:

wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
--08:53:18-- http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
           => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.119.2, 147.67.119.102, 147.67.136.2, ...
Connecting to eur-lex.europa.eu|147.67.119.2|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
08:53:18 ERROR 500: Internal Server Error.
FINISHED --08:53:18--
Downloaded: 0 bytes in 0 files

So all the pages linked from this page are ignored too. It's fine
if wget skips the problematic document, but I would prefer wget
to continue the recursive scan.

The first time, the local file doesn't exist so Wget issues a GET
request, which succeeds (200).

The second time, the local file exists so Wget must first check
whether the resource has changed. To that end, it issues a HEAD
request. The server apparently doesn't know when the document was
last modified. It could fulfill the HEAD request without a
Last-modified header. Instead, it rejects it with a 500.

It's not that the missing Last-modified header causes Wget to
"ignore the links". It's that there is no document to scan for
links because, when queried about it, the server replied 500.
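
The sequence described above can be summarised as a small decision
function. This is only a simplified model of the behaviour seen in the
two runs, not Wget's actual code: the function name, its parameters and
the returned labels are made up for illustration, and the real
time-stamping logic also takes other factors (such as file size) into
account.

```python
from typing import Optional


def timestamping_action(local_exists: bool,
                        head_status: int,
                        remote_mtime: Optional[float],
                        local_mtime: Optional[float]) -> str:
    """Simplified model of the -N decision (hypothetical helper)."""
    if not local_exists:
        # First run: nothing on disk, so the document is simply fetched.
        # No HEAD request is made, so head_status is ignored here.
        return "GET"
    if head_status >= 400:
        # Second run: the HEAD check itself failed (e.g. 500), so there
        # is no document to retrieve or to scan for links.
        return "FAIL"
    if remote_mtime is None:
        # Last-modified header missing: time-stamps are turned off and
        # the document is fetched again.
        return "GET"
    if local_mtime is not None and remote_mtime <= local_mtime:
        # The local copy is at least as recent as the remote one.
        return "SKIP"
    return "GET"
```

With the values from the second run (local file present, HEAD answered
with 500), this model returns "FAIL", which matches the log above.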

To work around that kind of brokenness, Wget would have to ignore
the 500 error and fall back on parsing the local file. That should
probably not be made the default behaviour, though.


Ah, I see! Thank you for your answer. I guess I'll just have to script my way around it then...
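
A minimal sketch of what such a script could start from, assuming
Python 3 and its standard library only: it reads the links out of the
copy that is already on disk, so they can still be followed when the
server answers 500. The names LinkCollector and links_from_saved_page
are hypothetical, and only plain <a href="..."> links are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's original URL.
                self.links.append(urljoin(self.base_url, value))


def links_from_saved_page(html_text, base_url):
    """Return the links found in an already-downloaded page."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links
```

The resulting list could be written to a file, one URL per line, and
handed back to wget with -i, so that the linked pages are fetched even
though the index page itself could not be re-checked.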

Regards,
Morten




