[Bug-wget] wget -crNl inf --- filenames mangled


From: Andres Valloud
Subject: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Wed, 13 Feb 2019 21:23:59 -0800
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.0

Hi,

I've run into an issue with wget, and I don't know what else to do to debug it. The use case is mirroring a website with the command

        wget -crNl inf https://... -P local/folder

Initially I was running wget 1.17, which was very slow when the files had already been downloaded. I switched to 1.20.1 (the latest), and it is much faster now. Things progress nicely for about 3 hours, but then, deterministically, the file name "greenbrq.669" is transformed into "1f43greenbrq.669". This results in a 404, and wget aborts after 5805 files.

Running this with logging turned on results in a 399 megabyte log file. Looking at the occurrences of greenbrq.669, I suspect that, because of -l inf, the file is found several times. The last time, however, it looks like there is an index.html file on the server that has the wrong name. But when I use a web browser to look at what is presumably the same index.html file, there is no link to the wrongly named file, because the file downloads fine.

Next I noted that when the 1f43 prefix to greenbrq.669 appears, there is a mention of IRI. I suspected that perhaps there was some confusion going on with filename encoding, so I provided --no-iri to wget and ran the job again. Another 399 megabyte log file was produced, and the result was the same. Interestingly, however, the log file has "[IRI" entries, even though the --no-iri switch was provided. Is this expected? In both log files, egrep "^.IRI" results in lines that always end with "None".

Looking at the log, it looks like the file URL is encountered several times. Sometimes it is mentioned with UTF-8, sometimes it isn't. Before the first time greenbrq.669 appears with the seemingly bogus 1f43 prefix, the previous occurrence of greenbrq.669 in the log file is a log entry that says "no-follow".

Also looking at the log, there are other files with mangled names, except these have 1f43 suffixed to the filename, e.g.: mod.png1f43. A quick check shows that many of these mangled file names have URL lengths that are zero modulo 4 (I did not check *every* mangled file name).
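The quick modulo 4 check could be made exhaustive with something like the sketch below. It is only an illustration: the names MARKER, is_mangled and length_residues are made up here, and it assumes the URLs have already been extracted from the log into a list, since I am not relying on any particular log line format. Because "1f43" is itself 4 characters long, the residue modulo 4 is the same whether the length is taken with or without it.

```python
from collections import Counter

MARKER = "1f43"  # the stray token seen in the log (hypothetical constant name)


def is_mangled(url: str) -> bool:
    """True if the last path segment starts or ends with the marker."""
    name = url.rsplit("/", 1)[-1]
    return name != MARKER and (name.startswith(MARKER) or name.endswith(MARKER))


def length_residues(urls, modulus: int = 4) -> Counter:
    """Tally len(url) % modulus over the mangled URLs only."""
    return Counter(len(u) % modulus for u in urls if is_mangled(u))
```

Calling length_residues on the full list of logged URLs returns a Counter keyed by residue, so a real pattern would show up as nearly all of the count sitting under one key.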

I looked at the downloaded html files with grep. They do contain the substring "1f43", seemingly after a ^M character (I did not check every single occurrence). Sometimes, the ^M character is within a file name such as this:

<tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
1f43^M
"

(wget then thinks it has to download "mp3ogg.png1f43", as if it had ignored the ^M and merged the path with the 1f43 segment). Other times it appears outside any file name, like this:

<td align="right">2014-10-02 20:24  ^M
1f43^M
</td>
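For anyone who wants to repeat the grep check over every downloaded file rather than a sample, here is a minimal sketch. It assumes the pattern is always a short run of hex digits on a line of its own with CR LF on both sides, as in the two excerpts above; the names HEX_LINE, find_hex_lines and scan_tree are made up for illustration. It will also flag legitimate lines that happen to consist only of hex digits.

```python
import re
from pathlib import Path

# A short run of hex digits alone on a line, with CR LF before and after.
# The lookahead leaves the trailing CR LF available to a following match.
HEX_LINE = re.compile(rb"\r\n([0-9A-Fa-f]{1,8})(?=\r\n)")


def find_hex_lines(data: bytes) -> list[tuple[int, str, int]]:
    """Return (byte offset, token, token read as hexadecimal) per match."""
    hits = []
    for m in HEX_LINE.finditer(data):
        token = m.group(1).decode("ascii")
        hits.append((m.start(), token, int(token, 16)))
    return hits


def scan_tree(root: str) -> None:
    """Print every match found in the regular files below root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            for offset, token, value in find_hex_lines(path.read_bytes()):
                print(f"{path}: offset {offset}: {token!r} (= {value} decimal)")
```

Running scan_tree("local/folder") lists each file and byte offset where such a line occurs. Read as hexadecimal, 1f43 is 8003; whether the value means anything is only a guess on my part.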

I have no idea whether these HTML files are valid or even meaningful. I tried using curl to get one of those HTML files with another mechanism, but unfortunately the site maintainer does not allow using curl. There are additional restrictions on the web browsers allowed. I can look at the website with Safari (which downloads greenbrq.669 properly), and I can also ask Safari to save the page where the file greenbrq.669 is listed --- the saved file does not have any occurrences of "1f43".

Googling for answers, and especially for instances of "1f43", didn't turn up anything immediately interesting. However, I found the following, which seems somewhat related to the problem.

https://www.win.tue.nl/~aeb/linux/misc/wget.html

Is there any credence to the above report? Just to make sure, doing as it said with --restrict-file-names=nocontrol did not eliminate the apparently spurious occurrences of "1f43" from the wget log file.

What else can I do to diagnose why this apparent misbehavior is occurring?

Andres.


