[Bug-wget] wget -crNl inf --- filenames mangled

From:	Andres Valloud
Subject:	[Bug-wget] wget -crNl inf --- filenames mangled
Date:	Wed, 13 Feb 2019 21:23:59 -0800
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.0

Hi,

I've run into an issue with wget, and I don't know what else to do to debug the problem. The use case is mirroring a website with the command


        wget -crNl inf https://... -P local/folder

Initially I was running wget 1.17, and that was very slow when the files were already downloaded. I switched to 1.20.1 (latest), and it is much faster now. Things progress nicely for about 3 hours, but deterministically the file name "greenbrq.669" is transformed into "1f43greenbrq.669". This results in a 404, and wget aborts after 5805 files.

Running this with logging turned on results in a 399 megabyte log file. Looking at the occurrences of greenbrq.669, I suspect that because of -l inf the file is found several times. The last time, however, it looks like there is an index.html file on the server that has the wrong name. But using a web browser to look at what is presumably that same index.html file does not result in a link to the wrongly named file, because the file downloads fine.

Next I noted that when the 1f43 prefix to greenbrq.669 appears, there is a mention of IRI. I suspected that perhaps there was some confusion going on with filename encoding, so I provided --no-iri to wget and ran the job again. Another 399 megabyte log file was produced, and the result was the same. Interestingly, however, the log file has "[IRI" entries, even though the --no-iri switch was provided. Is this as expected? In both log files, egrep "^.IRI" results in lines that always end with "None".

Looking at the log, it looks like the file URL is encountered several times. Sometimes it is mentioned with UTF-8, sometimes it isn't. Before the first time greenbrq.669 appears with the seemingly bogus 1f43 prefix, the previous occurrence of greenbrq.669 in the log file is a log entry that says "no-follow".

Also looking at the log, there are other files with mangled names, except these have 1f43 suffixed to the filename, e.g.: mod.png1f43. A quick check shows many of these mangled file names have URL lengths that are zero modulo 4 (I did not check *every* mangled file name).
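The modulo-4 check can be run over every mangled URL instead of a sample. The following is a minimal sketch, assuming the mangled URLs have already been extracted from the log into a list; the function name `length_mod4` and the example URL are hypothetical and only combine names that appear in this message.

```python
def length_mod4(mangled_urls, marker="1f43"):
    """Map each URL (with a trailing marker stripped) to its length modulo 4."""
    results = {}
    for url in mangled_urls:
        # Strip the suspicious suffix, if present, to recover the intended URL.
        original = url[:-len(marker)] if url.endswith(marker) else url
        results[original] = len(original) % 4
    return results


if __name__ == "__main__":
    # Hypothetical URL for illustration; real ones would come from the wget log.
    for url, remainder in length_mod4(["https://some.url/icons/mod.png1f43"]).items():
        print(remainder, url)
```

If the remainders turn out to be spread over 0 through 3, the length pattern is probably a coincidence.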

I looked at the downloaded html files with grep. They do contain the substring "1f43", seemingly after a ^M character (I did not check every single occurrence). Sometimes, the ^M character is within a file name such as this:


<tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
1f43^M
"

(and wget thinks it has to download "mp3ogg.png1f43", as if it had ignored ^M and had merged the path with the 1f43 segment) and some others, like this:


<td align="right">2014-10-02 20:24  ^M
1f43^M
</td>
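To check every occurrence rather than a sample, the downloaded files can be scanned for short hexadecimal tokens that sit alone between CRLF line breaks. The sketch below is a hypothetical helper, not part of wget; the name `find_hex_tokens` and the 8-digit limit are assumptions. A hex number delimited by CRLF is also what a chunk-size line looks like in HTTP/1.1 chunked transfer encoding, and 0x1f43 is 8003 in decimal, but that is only a guess at the cause.

```python
import re

# 1 to 8 hex digits with CRLF immediately before and after.
# The 8-digit cap is an arbitrary assumption to skip long hex strings.
HEX_TOKEN = re.compile(rb"(?<=\r\n)([0-9A-Fa-f]{1,8})(?=\r\n)")


def find_hex_tokens(data: bytes):
    """Return (byte_offset, token_text, decimal_value) for each token found."""
    hits = []
    for match in HEX_TOKEN.finditer(data):
        token = match.group(1).decode("ascii")
        hits.append((match.start(1), token, int(token, 16)))
    return hits


if __name__ == "__main__":
    # Sample modeled on the first HTML fragment quoted above.
    sample = (b'<img src="https://some.url/icons/mp3ogg.png\r\n'
              b'1f43\r\n'
              b'"')
    for offset, token, value in find_hex_tokens(sample):
        print(offset, token, value)
```

Files must be read in binary mode (`open(path, "rb")`) so that the ^M characters are not translated away before matching.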

I have no idea whether these HTML files are valid or even meaningful. I tried using curl to get one of those HTML files with another mechanism, but unfortunately the site maintainer does not allow using curl. There are additional restrictions on the web browsers allowed. I can look at the website with Safari (which downloads greenbrq.669 properly), and I can also ask Safari to save the page where the file greenbrq.669 is listed --- the saved file does not have any occurrences of "1f43".

Googling for answers, and especially instances of "1f43", didn't turn up anything immediately interesting. However, I found the following, which seems somewhat related to the problem.


https://www.win.tue.nl/~aeb/linux/misc/wget.html

Is there any credence to the above report? Just to make sure, doing as it said with --restrict-file-names=nocontrol did not eliminate the apparently spurious occurrences of "1f43" from the wget log file.


What else can I do to diagnose why this apparent misbehavior is occurring?

Andres.


