
Re: [Bug-wget] wget -crNl inf --- filenames mangled


From: Tim Rühsen
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Sun, 17 Feb 2019 21:15:51 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.1

On 16.02.19 23:02, Andres Valloud wrote:
> Tim,
> 
> I limited the data from 99gb to 3.3gb, and just to the directory where
> I've seen the problem occurs.  The strange string '1f43' appears in this
> limited setup.  The '1f43' substring seems to appear deterministically
> depending on the file name (I have not checked *every* occurrence by hand).
> 
> How should I track this down?

I'd use -d -o log and leave out -k. If 1f43 still appears, we know it's
not caused by wget's parsing or link conversion. In that case it comes
from the server... check which file 1f43 appears in and find the
corresponding request in the log file.
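
A minimal sketch of that check, assuming the mirror was written to a local directory and the debug output to a file named log. The function names, the ".html" filter and the paths are illustrative only; they are not part of wget. The first function lists the downloaded files that contain the stray string, the second lists the log lines that mention a given file name, so the matching request can be located by line number.

```python
import os

MARKER = b"1f43"  # the stray string reported in this thread


def files_with_marker(root, marker=MARKER, suffixes=(".html",)):
    """Return sorted paths under root (filtered by suffix) whose bytes contain marker."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue  # skip large media files; only pages are of interest here
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                if marker in fh.read():
                    hits.append(path)
    return sorted(hits)


def log_lines_mentioning(log_path, needle):
    """Return (line_number, line) pairs from the log file that contain needle."""
    matches = []
    with open(log_path, "r", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if needle in line:
                matches.append((number, line.rstrip("\n")))
    return matches
```

Calling files_with_marker("some.url") and then log_lines_mentioning("log", "index.html") would give the affected files and the places in the log to read; both arguments are placeholders for the actual mirror directory and file name.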

Then try to download that file with a single (non-recursive) wget
command. Check if 1f43 appears in there. If it doesn't, compare both
requests to see the difference.
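
Once both copies of the file are on disk, a byte-wise comparison shows where the recursive download starts to deviate from the single download. The following is only a sketch with a hypothetical helper name; it compares file contents, so the two requests themselves still have to be compared by reading the log.

```python
def first_difference(a: bytes, b: bytes, context: int = 20):
    """Return (offset, a_snippet, b_snippet) at the first differing byte,
    or None if both byte strings are identical."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            break
    else:
        # No differing byte within the common length.
        if len(a) == len(b):
            return None
        i = limit  # one input is a prefix of the other
    start = max(0, i - context)
    return i, a[start:i + context], b[start:i + context]
```

Reading both files in binary mode (open(path, "rb").read()) and passing the results to first_difference keeps the ^M characters visible as \r in the returned snippets.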

Let us know the results.

Regards, Tim

> 
> Andres.
> 
> On 2/14/19 04:03, Tim Rühsen wrote:
>> On 2/14/19 12:25 PM, Andres Valloud wrote:
>>> Tim,
>>>
>>> On 2/14/19 02:03, Tim Rühsen wrote:
>>>>> I looked at the downloaded html files with grep.  They do contain the
>>>>> substring "1f43", seemingly after a ^M character (I did not check
>>>>> every single occurrence).  Sometimes, the ^M character is within a
>>>>> file name such as this:
>>>>>
>>>>> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
>>>>> 1f43^M
>>>>> "
>>>>
>>>> If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
>>>> correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
>>>> End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers
>>>> simply ignore it. This is nothing that can be addressed with
>>>> --restrict-file-names.
>>>>
>>>> But to make sure, look at the original file by downloading it with
>>>> 'wget <URL>'. Does the file have the above '1f43'/^M stuff in it as
>>>> well? If so, we can't do much about it.
>>>>
>>>> If all looks ok in there, please attach both files so we can compare
>>>> and possibly reproduce.
>>>>
>>>> If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
>>>> x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
>>>> request is coming from Firefox.
>>>> Both curl and wget have the --user-agent option for this.
>>>>
>>>> Do you get a different file when using that option ?
>>>
>>> There was one additional detail to make this work.  Instead of placing a
>>> request for index.html, I had to ask curl to get just the directory name
>>> ending with a slash.  Then the server responded with (essentially)
>>> index.html.
>>
>> A web server might give different content on 'dir', 'dir/' and
>> 'dir/index.html'. This is sometimes puzzling and as you can see, 'dir/'
>> can't be used as filename - so we use 'dir/index.html' for that. Which
>> is not correct if the server serves 'dir/index.php' when we request
>> 'dir/'.
>>
>>>
>>> Both curl and wget retrieve index.html contents without '1f43' when
>>> asking for just that URL.  vimdiff says the retrieved files are
>>> identical.
>>
>> Try to start with this URL using your original wget command line. You
>> could add a quota (-Q) to limit the amount of data, in the hope of
>> reproducing your issue with far fewer files and less data to download.
>>
>>> I am at a loss as to how to explain how the '1f43' problem appears when
>>> asking wget to update the mirror of the site (rather than downloading a
>>> single file).  I'll look at the log file tomorrow and see if I get more
>>> ideas.
>>
>> Try to reduce the needed amount of data to reproduce it.
>>
>> Regards, Tim
>>
