Re: [Bug-wget] wget -crNl inf --- filenames mangled

From: Andres Valloud
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Thu, 14 Feb 2019 03:25:51 -0800
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.0


On 2/14/19 02:03, Tim Rühsen wrote:
I looked at the downloaded html files with grep.  They do contain the
substring "1f43", seemingly after a ^M character (I did not check every
single occurrence).  Sometimes, the ^M character is within a file name
such as this:

<tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M

If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
End-Of-Line (EOL)). In an HTML file, EOL has no meaning; parsers simply
ignore it. This is not something that can be addressed with --restrict-file-names.
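
For anyone who wants to repeat the grep check above across many downloaded files, here is a minimal sketch in Python. It assumes the stray text is the literal token '1f43' directly after a ^M (optionally with a linefeed in between); the function names `find_cr_token` and `scan_file` are made up for this illustration and are not part of wget.

```python
import re


def find_cr_token(data: bytes, token: bytes = b"1f43"):
    """Return the offset of every CR that is followed by `token`.

    Assumption: an optional LF may sit between the CR and the token.
    """
    pattern = re.compile(rb"\r\n?" + re.escape(token))
    return [match.start() for match in pattern.finditer(data)]


def scan_file(path, token: bytes = b"1f43"):
    """Read a downloaded file as raw bytes and report the matching offsets."""
    with open(path, "rb") as handle:
        return find_cr_token(handle.read(), token)
```

Reading the file in binary mode matters here, since text mode may translate or hide the carriage returns being searched for.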

But to make sure, look at the original file by downloading it with 'wget
<URL>'. Does the file have the above '1f43'/^M stuff in it as well? If
so, we can't do much about it.

If all looks ok in there, please attach both files so we can compare and
possibly reproduce.
If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming via Firefox.
Both curl and wget have a --user-agent option for this.

Do you get a different file when using that option?
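
The same experiment can be scripted. The following is only a sketch using Python's standard urllib rather than curl or wget; the helper names `build_request` and `fetch` are hypothetical, and the URL passed in would be whatever directory URL is being mirrored.

```python
import urllib.request

# The Firefox User-Agent string quoted above.
FIREFOX_UA = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) "
    "Gecko/20100101 Firefox/65.0"
)


def build_request(url: str, user_agent: str = FIREFOX_UA) -> urllib.request.Request:
    """Prepare a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str = FIREFOX_UA) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling `fetch` once with the default and once with a different User-Agent string, then comparing the two results, would show whether the server varies its response.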

There was one additional detail to make this work. Instead of placing a request for index.html, I had to ask curl to get just the directory name ending with a slash. Then the server responded with (essentially) index.html.

Both curl and wget retrieve index.html contents without '1f43' when asking for just that URL. vimdiff says the retrieved files are identical.
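
A scripted equivalent of the vimdiff comparison, as a rough sketch: it assumes the two downloads were saved to separate paths, and the helper names `identical` and `contains_token` are invented for this example.

```python
import filecmp


def identical(path_a, path_b) -> bool:
    """Compare two files byte for byte (shallow=False forces a content check)."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def contains_token(path, token: bytes = b"1f43") -> bool:
    """Report whether the raw bytes of a file contain the stray token."""
    with open(path, "rb") as handle:
        return token in handle.read()
```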

I am at a loss as to how to explain how the '1f43' problem appears when asking wget to update the mirror of the site (rather than downloading a single file). I'll look at the log file tomorrow and see if I get more ideas.

