
Re: [Bug-wget] wget -crNl inf --- filenames mangled


From: Tim Rühsen
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Mon, 18 Feb 2019 12:00:23 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.1

On 2/18/19 12:02 AM, Andres Valloud wrote:
> Hi, so I ran wget like this:
> 
> wget --no-check-certificate -dcrNl inf $baseUrl/root/pub/mods/2012/ -P
> $baseLocal -o wget-mods-2012.log
> 
> Looking at the log, '1f43' appears (I think) as a consequence of -l inf,
> because .../mods/2012/ has a reference to .../mods/, which leads wget to
> read the entire .../mods/ index.

Use -np / --no-parent if you don't want to ascend to the parent directory.


> According to my understanding of the log file, wget then collects all
> the possible URLs from .../mods/.  It is here that, after what seems
> like thousands of files, a single merge log entry shows '1f43' (some path
> parts elided).

'1f43' is part of a 'chunked' download: it is a chunk-size line from the
HTTP chunked transfer encoding (hexadecimal, 0x1f43 = 8003 bytes). I ran
some tests that print the raw received payload of
/root/pub/mods/index.html. Seeing this in a downloaded file is clearly a
bug.
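
For illustration, here is a minimal Python sketch of how chunked framing
is decoded. It is not wget's code; the function name decode_chunked and
the sample bytes are made up for this example, and chunk extensions and
trailers are ignored. Each chunk starts with its size in hex on its own
line, which is where a value like '1f43' comes from.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (hypothetical helper, no trailers)."""
    body = bytearray()
    pos = 0
    while True:
        # Each chunk begins with "<hex size>[;extensions]\r\n".
        eol = raw.index(b"\r\n", pos)
        size_field = raw[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            break  # a zero-size chunk terminates the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


# Made-up sample: two chunks ("hello", " world") plus the terminator.
sample = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(decode_chunked(sample))

# The size line seen in the affected files, read as hexadecimal.
print(int("1f43", 16))  # 8003
```

If size lines like these end up in the saved file, the framing was
written to disk without being decoded.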

But I can't reproduce it with your command sequence. The different index
files in /root/pub/mods/ all have a size of 997015 bytes here, even after
several retries. Maybe you can send me, via PM, a full working command
sequence that starts from a fresh / clean directory. To further reduce
the downloads, try with -R '*.mp3,*.xm,*.ogg,*.mod,*.it,*.spc'.
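
To find out which already-downloaded files are affected, something like
the following Python sketch could help. It is only a heuristic and rests
on an assumption: it flags lines that consist solely of 1 to 8 hex digits
and end in CRLF, so it can report false positives. The name
find_chunk_markers and the sample data are made up for this example.

```python
import re

# A line holding only hex digits, at start of data or after CRLF, followed by CRLF.
CHUNK_LINE = re.compile(rb"(?:\A|(?<=\r\n))([0-9a-fA-F]{1,8})(?=\r\n)")


def find_chunk_markers(data: bytes):
    """Return (byte offset, hex string) for each suspicious hex-only line."""
    return [(m.start(1), m.group(1).decode("ascii"))
            for m in CHUNK_LINE.finditer(data)]


# Made-up sample resembling the start of the "bad" file.
data = (b"13a\r\n<html>\r\n0\r\n\r\nHTTP/1.1 200 OK\r\n\r\n"
        b"ee3\r\n<table>\r\n1f43\r\n</table>")
for offset, marker in find_chunk_markers(data):
    print(offset, marker, int(marker, 16))
```

To check a real file, read it in binary mode (open(path, "rb").read())
and pass the bytes to find_chunk_markers.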


Regards, Tim

> .../root/pub/mods/index.html?C=N;O=D:
> merge(‘.../root/pub/mods/?C=N;O=D’, ‘lizardking_-_quest.mp31f43’) ->
> .../root/pub/mods/lizardking_-_quest.mp31f43
> appending ‘.../root/pub/mods/lizardking_-_quest.mp31f43’ to urlpos.
> 
> Then I issued the command (some path parts elided)
> 
> wget --no-check-certificate .../root/pub/mods/
> 
> which resulted in a 974kb index.html file that has no occurrences of
> '1f43' (more on this request down below).
> 
> I wondered whether this could be happening because there are .html files
> that *do* have '1f43' already downloaded in the local downloading
> directory.  That is, will wget look at existing files, or will it
> download them from scratch?  But the log file seems to indicate the
> index.html was downloaded from scratch, not examined from disk.
> 
> The "bad" request looks like this (some path parts elided):
> 
> ---request begin---
> GET /root/pub/mods/?C=N;O=D HTTP/1.1^M
> Referer: .../root/pub/mods/^M
> If-Modified-Since: Sun, 10 Feb 2019 02:33:09 GMT^M
> Range: bytes=998575-^M
> User-Agent: Wget/1.20.1 (linux-gnu)^M
> Accept: */*^M
> Accept-Encoding: identity^M
> Host: saphirjd.me^M
> Connection: Keep-Alive^M
> ^M
> ---request end---
> HTTP request sent, awaiting response...
> ---response begin---
> HTTP/1.1 200 OK^M
> Date: Sat, 16 Feb 2019 21:51:21 GMT^M
> Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
> Keep-Alive: timeout=2, max=18^M
> Connection: Keep-Alive^M
> Transfer-Encoding: chunked^M
> Content-Type: text/html;charset=UTF-8^M
> ^M
> ---response end---
> 200 OK
> Length: unspecified [text/html]
> Saving to: ‘.../root/pub/mods/index.html?C=N;O=D’
> 
>      0K .......... .......... .......... .......... ..........  234K
>     50K .......... .......... .......... .......... .......... 11.6M
>    100K .......... .......... .......... .......... .......... 14.4M
>    150K .......... .......... .......... .......... ..........  238K
>    200K .......... .......... .......... .......... ..........  657K
>    250K .......... .......... .......... .......... .......... 11.3M
>    300K .......... .......... .......... .......... .......... 8.44M
>    350K .......... .......... .......... .......... ..........  397K
>    400K .......... .......... .......... .......... ..........  627K
>    450K .......... .......... .......... .......... .......... 2.38M
>    500K .......... .......... .......... .......... .......... 4.47M
>    550K .......... .......... .......... .......... .......... 3.46M
>    600K .......... .......... .......... .......... ..........  477K
>    650K .......... .......... .......... .......... .......... 4.14M
>    700K .......... .......... .......... .......... ..........  717K
>    750K .......... .......... .......... .......... .......... 3.50M
>    800K .......... .......... .......... .......... .......... 3.01M
>    850K .......... .......... .......... .......... .......... 4.40M
>    900K .......... .......... .......... .......... .......... 2.69M
>    950K .......... .......... ...                              68.9K=1.4s
> 
> Last-modified header missing -- time-stamps turned off.
> 2019-02-16 13:51:25 (717 KB/s) - ‘.../root/pub/mods/index.html?C=N;O=D’
> saved [998575]
> 
> Loaded .../root/pub/mods/index.html?C=N;O=D (size 998575).
> 
> 
> The "good" request looks like this:
> 
> ---request begin---
> GET /root/pub/mods/ HTTP/1.1^M
> User-Agent: Wget/1.20.1 (linux-gnu)^M
> Accept: */*^M
> Accept-Encoding: identity^M
> Host: saphirjd.me^M
> Connection: Keep-Alive^M
> ^M
> ---request end---
> HTTP request sent, awaiting response...
> ---response begin---
> HTTP/1.1 200 OK^M
> Date: Sun, 17 Feb 2019 22:42:04 GMT^M
> Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
> Keep-Alive: timeout=2, max=25^M
> Connection: Keep-Alive^M
> Transfer-Encoding: chunked^M
> Content-Type: text/html;charset=UTF-8^M
> ^M
> ---response end---
> 200 OK
> Registered socket 5 for persistent reuse.
> Length: unspecified [text/html]
> Saving to: ‘index.html.1’
> 
>      0K .......... .......... .......... .......... .......... 71.1K
>     50K .......... .......... .......... .......... ..........  221K
>    100K .......... .......... .......... .......... ..........  241K
>    150K .......... .......... .......... .......... ..........  232K
>    200K .......... .......... .......... .......... .......... 4.81M
>    250K .......... .......... .......... .......... .......... 1.64M
>    300K .......... .......... .......... .......... ..........  249K
>    350K .......... .......... .......... .......... .......... 2.49M
>    400K .......... .......... .......... .......... .......... 3.71M
>    450K .......... .......... .......... .......... ..........  258K
>    500K .......... .......... .......... .......... .......... 1.41M
>    550K .......... .......... .......... .......... .......... 1.46M
>    600K .......... .......... .......... .......... .......... 2.32M
>    650K .......... .......... .......... .......... ..........  340K
>    700K .......... .......... .......... .......... .......... 2.19M
>    750K .......... .......... .......... .......... .......... 4.10M
>    800K .......... .......... .......... .......... .......... 2.68M
>    850K .......... .......... .......... .......... .......... 3.17M
>    900K .......... .......... .......... .......... .......... 3.22M
>    950K .......... .......... ...                              2.07M=2.1s
> 
> 2019-02-17 14:42:09 (453 KB/s) - ‘index.html.1’ saved [997015]
> 
> 
> So I examined the "bad" html file.  Unlike the "good" file, the "bad"
> file starts like this (contents enclosed by ====== bars):
> 
> ======================================================================
> 13a
> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
> <html><head>
> <title>416 Requested Range Not Satisfiable</title>
> </head><body>
> <h1>Requested Range Not Satisfiable</h1>
> <p>None of the range-specifier values in the Range
> request-header field overlap the current extent
> of the selected resource.</p>
> </body></html>
> 
> 0
> 
> HTTP/1.1 200 OK
> Date: Sun, 10 Feb 2019 02:33:04 GMT
> Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h
> Keep-Alive: timeout=2, max=24
> Connection: Keep-Alive
> Transfer-Encoding: chunked
> Content-Type: text/html;charset=UTF-8
> 
> ee3
> ======================================================================
> 
> 
> The "13a" and "ee3" characters are present in the file.  This data also
> seems to explain why the file saved to disk is about 1kb larger than the
> file downloaded individually.  It looks like the index.html file saved
> to disk contains (i.e. begins with) garbage from a different request
> that ended in 416.  After that prolog of apparent junk, the file proper
> seems to begin as expected --- but it also has several occurrences of
> '1f43'.
> 
> A vimdiff run on bad.html and good.html shows some order differences,
> seemingly a table replaced with '1f43', and things of that nature.  The
> structure of the differences is not immediately obvious, as there are
> very large sections that differ seemingly because the file was served in
> different order.
> 
> Andres.
> 
> 
> On 2/17/19 12:15, Tim Rühsen wrote:
>> On 16.02.19 23:02, Andres Valloud wrote:
>>> Tim,
>>>
>>> I limited the data from 99gb to 3.3gb, and just to the directory where
>>> I've seen the problem occurs.  The strange string '1f43' appears in this
>>> limited setup.  The '1f43' substring seems to appear deterministically
>>> depending on the file name (I have not checked *every* occurrence by
>>> hand).
>>>
>>> How should I track this down?
>>
>> I'd use -d -olog and leave out -k. If 1f43 still appears, we know it's
>> not because of wget's parsing or conversion. In this case it's from the
>> server... check in which file 1f43 appears and find the request in the
>> log file.
>>
>> Then try to download that file with a single (non-recursive) wget
>> command. Check if 1f43 appears in there. If it doesn't, compare both
>> requests to see the difference.
>>
>> Let us know the results.
>>
>> Regards, Tim
>>
>>>
>>> Andres.
>>>
>>> On 2/14/19 04:03, Tim Rühsen wrote:
>>>> On 2/14/19 12:25 PM, Andres Valloud wrote:
>>>>> Tim,
>>>>>
>>>>> On 2/14/19 02:03, Tim Rühsen wrote:
>>>>>>> I looked at the downloaded html files with grep.  They do contain the
>>>>>>> substring "1f43", seemingly after a ^M character (I did not check every
>>>>>>> single occurrence).  Sometimes, the ^M character is within a file name
>>>>>>> such as this:
>>>>>>>
>>>>>>> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
>>>>>>> 1f43^M
>>>>>>> "
>>>>>>
>>>>>> If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
>>>>>> correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
>>>>>> End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers simply
>>>>>> ignore it. This is nothing that can be addressed with
>>>>>> --restrict-file-names.
>>>>>>
>>>>>> But to make sure, look at the original file by downloading it with
>>>>>> 'wget <URL>'. Does the file have the above '1f43'/^M stuff in it as
>>>>>> well? If so, we can't do much about it.
>>>>>>
>>>>>> If all looks ok in there, please attach both files so we can compare and
>>>>>> possibly reproduce.
>>>>>>
>>>>>> If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
>>>>>> x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
>>>>>> request is coming via Firefox.
>>>>>> curl and wget both have the --user-agent option for this.
>>>>>>
>>>>>> Do you get a different file when using that option?
>>>>>
>>>>> There was one additional detail to make this work.  Instead of placing a
>>>>> request for index.html, I had to ask curl to get just the directory name
>>>>> ending with a slash.  Then the server responded with (essentially)
>>>>> index.html.
>>>>
>>>> A web server might give different content on 'dir', 'dir/' and
>>>> 'dir/index.html'. This is sometimes puzzling and as you can see, 'dir/'
>>>> can't be used as filename - so we use 'dir/index.html' for that. Which
>>>> is not correct if the server serves 'dir/index.php' when we request
>>>> 'dir/'.
>>>>
>>>>>
>>>>> Both curl and wget retrieve index.html contents without '1f43' when
>>>>> asking for just that URL.  vimdiff says the retrieved files are
>>>>> identical.
>>>>
>>>> Try to start with this URL using your original wget command line. You
>>>> could add a quota (-Q) to limit the amount of data, in the hope of
>>>> reproducing your issue with far fewer files and less data to download.
>>>>
>>>>> I am at a loss as to how to explain how the '1f43' problem appears when
>>>>> asking wget to update the mirror of the site (rather than downloading a
>>>>> single file).  I'll look at the log file tomorrow and see if I get more
>>>>> ideas.
>>>>
>>>> Try to reduce the needed amount of data to reproduce it.
>>>>
>>>> Regards, Tim
>>>>
>>
> 


