Re: [Bug-wget] wget -crNl inf --- filenames mangled
From: Tim Rühsen
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Thu, 14 Feb 2019 11:03:19 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0
Hi Andres,
On 2/14/19 6:23 AM, Andres Valloud wrote:
> Hi,
>
> I've run into an issue with wget, I don't know what else to do to debug
> the problem. The use case is mirroring a website with the command
>
> wget -crNl inf https://... -P local/folder
>
> Initially I was running wget 1.17, and that was very slow in the case
> the files were already downloaded. I switched to 1.20.1 (latest), the
> behavior is way faster now. Things progress nicely for about 3 hours,
> but deterministically the file name "greenbrq.669" is transformed into
> "1f43greenbrq.669", this results in a 404 and wget aborts after 5805 files.
>
> Running this with logging turned on results in a 399 megabyte log file.
> Looking at the occurrences of greenbrq.669, I suspect because of -l inf
> the file is found several times. The last time, however, it looks like
> there is an index.html file on the server that has the wrong name. But
> using a web browser to presumably look at said index.html file does not
> result in a link to the wrongly named file, because the file downloads
> fine.
>
> Next I noted that when the 1f43 prefix to greenbrq.669 appears, there is
> a mention of IRI. I suspected that perhaps there was some confusion
> going on with filename encoding, so I provided --no-iri to wget and ran
> the job again. Another 399 megabyte log file was produced, and the
> result was the same. Interestingly, however, the log file has "[IRI"
> entries, even though the --no-iri switch was provided. Is this as
> expected? In both log files, egrep "^.IRI" results in lines that always
> end with "None".
>
> Looking at the log, it looks like the file URL is encountered several
> times. Sometimes it is mentioned with UTF-8, sometimes it isn't.
> Before the first time greenbrq.669 appears with the seemingly bogus 1f43
> prefix, the previous occurrence of greenbrq.669 in the log file is a log
> entry that says "no-follow".
>
> Also looking at the log, there are other files with mangled names,
> except these have 1f43 suffixed to the filename, e.g.: mod.png1f43. A
> quick check shows many of these mangled file names have URL sizes that
> are zero modulo 4 (I did not check *every* mangled file name).
>
> I looked at the downloaded html files with grep. They do contain the
> substring "1f43", seemingly after a ^M character (I did not check every
> single occurrence). Sometimes, the ^M character is within a file name
> such as this:
>
> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
> 1f43^M
> "
If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
End-Of-Line (EOL)). In an HTML file, EOL has no meaning; parsers simply
ignore it. This is not something that can be addressed with
--restrict-file-names.
But to make sure, look at the original file by downloading it with 'wget
<URL>'. Does that file have the above '1f43'/^M stuff in it as well? If
so, we can't do much about it.
If all looks ok in there, please attach both files so we can compare and
possibly reproduce.
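To make that check less error-prone than eyeballing grep output, here is a
minimal Python sketch that reports every short hex token sitting between
carriage returns, such as the "1f43" lines quoted above. The function name,
the regular expression and the 1-8 digit limit are assumptions made for this
illustration; they are not part of wget, and a legitimate line consisting
only of hex digits would be reported too.

```python
import re
import sys

# A CR (optionally followed by LF), a short run of hex digits, then another
# CR (optionally followed by LF). The 1-8 digit limit is an arbitrary choice
# for this sketch, not something taken from wget.
TOKEN = re.compile(rb"\r\n?([0-9A-Fa-f]{1,8})\r\n?")


def find_hex_tokens(data: bytes):
    """Return (byte offset, token) pairs for each CR-delimited hex token."""
    return [(m.start(), m.group(1).decode("ascii")) for m in TOKEN.finditer(data)]


if __name__ == "__main__":
    # Usage: python find_tokens.py file1.html file2.html ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for offset, token in find_hex_tokens(fh.read()):
                print(f"{path}: offset {offset}: {token!r}")
```

Running it on both the mirrored copy and the freshly downloaded single file
shows whether the tokens appear in one, both, or neither.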
>
> (and wget thinks it has to download "mp3ogg.png1f43", as if it had
> ignored ^M and had merged the path with the 1f43 segment) and some
> others, like this:
>
> <td align="right">2014-10-02 20:24 ^M
> 1f43^M
> </td>
>
> I have no idea whether these HTML files are valid or even meaningful. I
> tried using curl to get one of those HTML files with another mechanism,
> but unfortunately the site maintainer does not allow using curl.
If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming from Firefox.
Both curl and wget have the --user-agent option for this.
Do you get a different file when using that option?
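If neither tool is at hand, the same idea can be tried from Python's
standard library. This is only a sketch: build_request and fetch are
hypothetical helper names introduced here, and the URL passed in is whatever
page you want to inspect. The User-Agent string is the one quoted above.

```python
import urllib.request

# User-Agent string quoted in the message above.
FIREFOX_UA = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0"
)


def build_request(url: str, user_agent: str = FIREFOX_UA) -> urllib.request.Request:
    """Create a request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str = FIREFOX_UA, timeout: float = 30.0) -> bytes:
    """Download url with the given User-Agent and return the body as bytes."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read()
```

The bytes returned by fetch() can then be saved and compared with the file
that wget stored for the same URL.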
> There
> are additional restrictions on the web browsers allowed. I can look at
> the website with Safari (which downloads greenbrq.669 properly), and I
> can also ask Safari to save the page where the file greenbrq.669 is
> listed --- the saved file does not have any occurrences of "1f43".
>
> Googling for answers, and especially instances of "1f43", didn't turn up
> anything immediately interesting. However, I found the following, which
> seems somewhat related to the problem.
>
> https://www.win.tue.nl/~aeb/linux/misc/wget.html
>
> Is there any credence to the above report? Just to make sure, doing as
> it said with --restrict-file-names=nocontrol did not eliminate the
> apparently spurious occurrences of "1f43" from the wget log file.
>
> What else can I do to diagnose why this apparent misbehavior is occurring?
>
> Andres.
>
Regards, Tim