From: Ángel González
Subject: Re: [Bug-wget] [Bug-Wget] Issue in recursive retrievals
Date: Sat, 22 Mar 2014 22:21:07 +0100
User-agent: Thunderbird

On 22/03/14 18:10, Darshit Shah wrote:
> There was a case earlier today on the IRC channel that I'd like to
> bring out here.
>
> The user in question was attempting to continue a recursive retrieval.
> The files being downloaded were large binaries. However, Wget still
> happens to load files that have already been downloaded in an attempt
> to find new links. Below is the debug output that the user shared:
>
> (...)
>
> As you can see, Wget receives only an HTTP 416 response with
> Content-Type text/html, but it still loads the complete 2GB file into
> memory, looking for links. Since Wget does not know the filetype at
> this moment, I agree it might be the right thing to do, but according
> to section 7.2.1 of RFC 2616:
>
>     Any HTTP/1.1 message containing an entity-body SHOULD include a
>     Content-Type header field defining the media type of that body. If
>     and only if the media type is not given by a Content-Type field, the
>     recipient MAY attempt to guess the media type via inspection of its
>     content and/or the name extension(s) of the URI used to identify the
>     resource. If the media type remains unknown, the recipient SHOULD
>     treat it as type "application/octet-stream".
>
> Hence, Wget's behaviour seems to go against what the specification mandates.
Well, the text/html content-type in the reply seems to indicate that the file
*is* html, so it makes sense that wget scans it for links (although I suspect
the server is wrong and it isn't).
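
For reference, here is a minimal sketch of the RFC 2616 §7.2.1 rule quoted
above; this is only my own illustration in C, not wget's actual code, and the
function names and the naive sniffing heuristic are assumptions:

    /* Minimal sketch of the RFC 2616 7.2.1 decision: an explicit
     * Content-Type wins; only when it is absent may the recipient guess,
     * and the fallback is application/octet-stream.  Not wget code. */
    #include <stdbool.h>
    #include <stddef.h>

    static bool
    looks_like_html (const char *buf, size_t len)
    {
      /* Deliberately naive guess: HTML usually starts with '<'
       * after leading whitespace. */
      size_t i = 0;
      while (i < len && (buf[i] == ' ' || buf[i] == '\t'
                         || buf[i] == '\r' || buf[i] == '\n'))
        i++;
      return i < len && buf[i] == '<';
    }

    static const char *
    effective_media_type (const char *content_type,
                          const char *body, size_t len)
    {
      if (content_type && *content_type)
        return content_type;                 /* header is authoritative */
      if (looks_like_html (body, len))
        return "text/html";                  /* recipient MAY guess */
      return "application/octet-stream";     /* SHOULD default otherwise */
    }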


> However, I understand that for continuing recursive retrievals, we may
> want to scan all existing files too. Maybe Wget could write a simple
> flat file with the relevant details in case it is aborted? This way it
> would know which files it *should* parse and which ones it shouldn't.
>
> The user reporting this issue had the problem that Wget would block
> for almost 30 seconds on each previously downloaded file while loading
> it into memory, whereas it simply skipped over newly downloaded files,
> giving me the idea that the server did indeed send the right
> Content-Type headers with its HTTP 200 responses.
>
> I'm looking for comments and opinions on how Wget should handle such
> corner cases.
Even worse, I have seen wget trying to parse files bigger than it could load
into memory for links (first trying to mmap(), which failed, then slowly using
read() and realloc(), until it finally crashed…). A simple optimization for
these cases would be to quickly skip the link-scanning if the file looks like
binary data.
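
Something like the following could do it; this is only a rough sketch of the
idea, the function name and the NUL-byte heuristic are my assumptions, not
existing wget code:

    /* Rough sketch of the "looks like binary" check: read the first block
     * of the saved file and treat any NUL byte as a sign of binary data,
     * the same cheap heuristic grep and diff use.  Not existing wget code. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static bool
    file_looks_binary (const char *filename)
    {
      unsigned char buf[8192];
      FILE *fp = fopen (filename, "rb");
      size_t n;

      if (!fp)
        return false;               /* unreadable: let the caller decide */
      n = fread (buf, 1, sizeof buf, fp);
      fclose (fp);

      return memchr (buf, '\0', n) != NULL;
    }

    /* Intended use, before the expensive link extraction:
     *   if (file_looks_binary (local_name))
     *     skip the HTML parse for this file.                             */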


A different issue we could fix for download continuation is to add a parameter
to skip the download of existing files, i.e. if there is already a file with
the name we would use, treat it as the final file we wanted to download and
don't ask the server about it at all.
When continuing downloads of a large number of files, the round trips of
continue-this / 416 can add a significant delay.
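
Roughly like this; the option name and the helper are hypothetical, just to
illustrate the behaviour I mean:

    /* Illustration of the proposed "skip existing files" parameter, with a
     * hypothetical opt.skip_existing flag: if a local file with the target
     * name already exists, treat it as final and make no request at all. */
    #include <stdbool.h>
    #include <sys/stat.h>

    struct options { bool skip_existing; };   /* stand-in for wget's opt */
    static struct options opt = { true };

    static bool
    should_skip_url (const char *local_name)
    {
      struct stat st;

      return opt.skip_existing
             && stat (local_name, &st) == 0   /* file already on disk */
             && st.st_size > 0;               /* and not an empty stub */
    }

    /* In the retrieval loop, before opening any connection:
     *   if (should_skip_url (local_name))
     *     continue;   // no Range request, no 416, no round trip         */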



