Re: [Bug-wget] wget -crNl inf --- filenames mangled


From: Tim Rühsen
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Thu, 14 Feb 2019 11:03:19 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0

Hi Andres,

On 2/14/19 6:23 AM, Andres Valloud wrote:
> Hi,
> 
> I've run into an issue with wget, and I don't know what else to do to debug
> the problem.  The use case is mirroring a website with the command
> 
>     wget -crNl inf https://... -P local/folder
> 
> Initially I was running wget 1.17, and that was very slow in the case
> the files were already downloaded.  I switched to 1.20.1 (latest), the
> behavior is way faster now.  Things progress nicely for about 3 hours,
> but deterministically the file name "greenbrq.669" is transformed into
> "1f43greenbrq.669", this results in a 404 and wget aborts after 5805 files.
> 
> Running this with logging turned on results in a 399 megabyte log file.
> Looking at the occurrences of greenbrq.669, I suspect because of -l inf
> the file is found several times.  The last time, however, it looks like
> there is an index.html file on the server that has the wrong name.  But
> using a web browser to presumably look at said index.html file does not
> result in a link to the wrongly named file, because the file downloads
> fine.
> 
> Next I noted that when the 1f43 prefix to greenbrq.669 appears, there is
> a mention of IRI.  I suspected that perhaps there was some confusion
> going on with filename encoding, so I provided --no-iri to wget and ran
> the job again.  Another 399 megabyte log file was produced, and the
> result was the same.  Interestingly, however, the log file has "[IRI"
> entries, even though the --no-iri switch was provided.  Is this as
> expected?  In both log files, egrep "^.IRI" results in lines that always
> end with "None".
> 
> Looking at the log, it looks like the file URL is encountered several
> times.  Sometimes it is mentioned with UTF-8, sometimes it isn't.
> Before the first time greenbrq.669 appears with the seemingly bogus 1f43
> prefix, the previous occurrence of greenbrq.669 in the log file is a log
> entry that says "no-follow".
> 
> Also looking at the log, there are other files with mangled names,
> except these have 1f43 suffixed to the filename, e.g.: mod.png1f43.  A
> quick check shows many of these mangled file names have URL sizes that
> are zero modulo 4 (I did not check *every* mangled file name).
> 
> I looked at the downloaded html files with grep.  They do contain the
> substring "1f43", seemingly after a ^M character (I did not check every
> single occurrence).  Sometimes, the ^M character is within a file name
> such as this:
> 
> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
> 1f43^M
> "

If this is what the HTML file contains, then 'mp3ogg.png1f43' seems
correct. ^M is a Carriage Return (Microsoft uses ^M plus Line Feed for
End-Of-Line, EOL). In an HTML file, EOL has no meaning; parsers simply
ignore it. This is not something --restrict-file-names can address.

But to make sure, look at the original file by downloading it with 'wget
<URL>'. Does that file have the above '1f43'/^M stuff in it as well? If
so, we can't do much about it.

If all looks ok in there, please attach both files so we can compare and
possibly reproduce.
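
A minimal sketch of that check, assuming a POSIX shell and a grep that
accepts -a (GNU and BSD grep do). The function name count_cr_token and the
file name page.html are made up for illustration; '<URL>' stands for the
page you were mirroring.

```shell
# Hypothetical helper, not part of wget: count the lines in FILE that
# consist of exactly TOKEN followed by a carriage return (^M).
# Usage: count_cr_token FILE TOKEN
count_cr_token() {
    cr=$(printf '\r')                 # a literal carriage return
    # -a: treat the file as text, -c: print the number of matching lines,
    # -F: TOKEN is a fixed string, -x: the whole line must match
    grep -a -c -F -x -e "${2}${cr}" "$1"
}

# Example (fetch the one page on its own first, then count):
#   wget -O page.html '<URL>'
#   count_cr_token page.html 1f43
```

A non-zero count for the single-file download would suggest the '1f43'
lines really are in what the server sends; a count of 0 would suggest the
two downloads differ and are worth attaching.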

> 
> (and wget thinks it has to download "mp3ogg.png1f43", as if it had
> ignored ^M and had merged the path with the 1f43 segment) and some
> others, like this:
> 
> <td align="right">2014-10-02 20:24  ^M
> 1f43^M
> </td>
> 
> I have no idea whether these HTML files are valid or even meaningful.  I
> tried using curl to get one of those HTML files with another mechanism,
> but unfortunately the site maintainer does not allow using curl.

If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming via Firefox.
curl and wget both have a --user-agent option for this.

Do you get a different file when using that option?
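
For example, something along these lines; this is only a sketch. The
--user-agent option exists in both tools, but the function names, the
output file names and '<URL>' are placeholders.

```shell
# User-Agent string quoted from the message above.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0'

# Hypothetical wrappers: fetch URL into OUTFILE while presenting $UA.
# Usage: fetch_wget URL OUTFILE   or   fetch_curl URL OUTFILE
fetch_wget() { wget --user-agent="$UA" -O "$2" "$1"; }
fetch_curl() { curl --user-agent "$UA" -o "$2" "$1"; }

# Exit status 0 unless cmp finds the two files byte-for-byte identical
# (a missing file therefore also counts as "differ").
files_differ() { ! cmp -s "$1" "$2"; }

# Example:
#   fetch_wget '<URL>' with-ua.html
#   wget -O without-ua.html '<URL>'
#   files_differ with-ua.html without-ua.html && echo "content differs"
```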

> There
> are additional restrictions on the web browsers allowed.  I can look at
> the website with Safari (which downloads greenbrq.669 properly), and I
> can also ask Safari to save the page where the file greenbrq.669 is
> listed --- the saved file does not have any occurrences of "1f43".
> 
> Googling for answers, and especially instances of "1f43", didn't turn up
> anything immediately interesting.  However, I found the following, which
> seems somewhat related to the problem.
> 
> https://www.win.tue.nl/~aeb/linux/misc/wget.html
> 
> Is there any credence to the above report?  Just to make sure, doing as
> it said with --restrict-file-names=nocontrol did not eliminate the
> apparently spurious occurrences of "1f43" from the wget log file.
> 
> What else can I do to diagnose why this apparent misbehavior is occurring?
> 
> Andres.
> 

Regards, Tim
