Re: [Bug-wget] wget -crNl inf --- filenames mangled


From: Andres Valloud
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Sun, 17 Feb 2019 15:02:22 -0800
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.1

Hi, so I ran wget like this:

wget --no-check-certificate -dcrNl inf $baseUrl/root/pub/mods/2012/ -P $baseLocal -o wget-mods-2012.log

Looking at the log, '1f43' appears (I think) as a consequence of -l inf, because .../mods/2012/ has a reference to .../mods/, which leads wget to read the entire .../mods/ index.

According to my understanding of the log file, wget then collects all the possible URLs from .../mods/. It is here that, after what seems like thousands of files, a single merge log entry shows '1f43' (some path parts elided).

.../root/pub/mods/index.html?C=N;O=D: merge(‘.../root/pub/mods/?C=N;O=D’, ‘lizardking_-_quest.mp31f43’) -> .../root/pub/mods/lizardking_-_quest.mp31f43
appending ‘.../root/pub/mods/lizardking_-_quest.mp31f43’ to urlpos.

Then I issued the command (some path parts elided)

wget --no-check-certificate .../root/pub/mods/

which resulted in a 974kb index.html file that has no occurrences of '1f43' (more on this request down below).

I wondered whether this could be happening because there are .html files that *do* have '1f43' already downloaded in the local downloading directory. That is, will wget look at existing files, or will it download them from scratch? But the log file seems to indicate the index.html was downloaded from scratch, not examined from disk.

The "bad" request looks like this (some path parts elided):

---request begin---
GET /root/pub/mods/?C=N;O=D HTTP/1.1^M
Referer: .../root/pub/mods/^M
If-Modified-Since: Sun, 10 Feb 2019 02:33:09 GMT^M
Range: bytes=998575-^M
User-Agent: Wget/1.20.1 (linux-gnu)^M
Accept: */*^M
Accept-Encoding: identity^M
Host: saphirjd.me^M
Connection: Keep-Alive^M
^M
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK^M
Date: Sat, 16 Feb 2019 21:51:21 GMT^M
Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
Keep-Alive: timeout=2, max=18^M
Connection: Keep-Alive^M
Transfer-Encoding: chunked^M
Content-Type: text/html;charset=UTF-8^M
^M
---response end---
200 OK
Length: unspecified [text/html]
Saving to: ‘.../root/pub/mods/index.html?C=N;O=D’

     0K .......... .......... .......... .......... ..........  234K
    50K .......... .......... .......... .......... .......... 11.6M
   100K .......... .......... .......... .......... .......... 14.4M
   150K .......... .......... .......... .......... ..........  238K
   200K .......... .......... .......... .......... ..........  657K
   250K .......... .......... .......... .......... .......... 11.3M
   300K .......... .......... .......... .......... .......... 8.44M
   350K .......... .......... .......... .......... ..........  397K
   400K .......... .......... .......... .......... ..........  627K
   450K .......... .......... .......... .......... .......... 2.38M
   500K .......... .......... .......... .......... .......... 4.47M
   550K .......... .......... .......... .......... .......... 3.46M
   600K .......... .......... .......... .......... ..........  477K
   650K .......... .......... .......... .......... .......... 4.14M
   700K .......... .......... .......... .......... ..........  717K
   750K .......... .......... .......... .......... .......... 3.50M
   800K .......... .......... .......... .......... .......... 3.01M
   850K .......... .......... .......... .......... .......... 4.40M
   900K .......... .......... .......... .......... .......... 2.69M
   950K .......... .......... ...                              68.9K=1.4s

Last-modified header missing -- time-stamps turned off.
2019-02-16 13:51:25 (717 KB/s) - ‘.../root/pub/mods/index.html?C=N;O=D’ saved [998575]

Loaded .../root/pub/mods/index.html?C=N;O=D (size 998575).


The "good" request looks like this:

---request begin---
GET /root/pub/mods/ HTTP/1.1^M
User-Agent: Wget/1.20.1 (linux-gnu)^M
Accept: */*^M
Accept-Encoding: identity^M
Host: saphirjd.me^M
Connection: Keep-Alive^M
^M
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK^M
Date: Sun, 17 Feb 2019 22:42:04 GMT^M
Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
Keep-Alive: timeout=2, max=25^M
Connection: Keep-Alive^M
Transfer-Encoding: chunked^M
Content-Type: text/html;charset=UTF-8^M
^M
---response end---
200 OK
Registered socket 5 for persistent reuse.
Length: unspecified [text/html]
Saving to: ‘index.html.1’

     0K .......... .......... .......... .......... .......... 71.1K
    50K .......... .......... .......... .......... ..........  221K
   100K .......... .......... .......... .......... ..........  241K
   150K .......... .......... .......... .......... ..........  232K
   200K .......... .......... .......... .......... .......... 4.81M
   250K .......... .......... .......... .......... .......... 1.64M
   300K .......... .......... .......... .......... ..........  249K
   350K .......... .......... .......... .......... .......... 2.49M
   400K .......... .......... .......... .......... .......... 3.71M
   450K .......... .......... .......... .......... ..........  258K
   500K .......... .......... .......... .......... .......... 1.41M
   550K .......... .......... .......... .......... .......... 1.46M
   600K .......... .......... .......... .......... .......... 2.32M
   650K .......... .......... .......... .......... ..........  340K
   700K .......... .......... .......... .......... .......... 2.19M
   750K .......... .......... .......... .......... .......... 4.10M
   800K .......... .......... .......... .......... .......... 2.68M
   850K .......... .......... .......... .......... .......... 3.17M
   900K .......... .......... .......... .......... .......... 3.22M
   950K .......... .......... ...                              2.07M=2.1s

2019-02-17 14:42:09 (453 KB/s) - ‘index.html.1’ saved [997015]
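
A quick way to compare the two requests is to diff their header names. The sketch below copies the request headers from the two log excerpts above (the ^M markers dropped) and lists the headers sent only in the "bad" request. The function name parse_headers is made up for illustration. That Range and If-Modified-Since come from the -c and -N options is my assumption; the log does not say so.

```python
# Request headers copied from the two log excerpts above (^M markers dropped).
BAD_REQUEST = """\
GET /root/pub/mods/?C=N;O=D HTTP/1.1
Referer: .../root/pub/mods/
If-Modified-Since: Sun, 10 Feb 2019 02:33:09 GMT
Range: bytes=998575-
User-Agent: Wget/1.20.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: saphirjd.me
Connection: Keep-Alive
"""

GOOD_REQUEST = """\
GET /root/pub/mods/ HTTP/1.1
User-Agent: Wget/1.20.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: saphirjd.me
Connection: Keep-Alive
"""


def parse_headers(request_text):
    """Return a dict of header name -> value, skipping the request line."""
    headers = {}
    for line in request_text.splitlines()[1:]:
        line = line.rstrip("\r")
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip()] = value.strip()
    return headers


bad = parse_headers(BAD_REQUEST)
good = parse_headers(GOOD_REQUEST)

# Header names present only in the "bad" request.
only_in_bad = sorted(set(bad) - set(good))
print(only_in_bad)
```

Note that the Range start (998575) is the same number as the size reported for the saved "bad" file.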


So I examined the "bad" html file. Unlike the "good" file, the "bad" file starts like this (contents enclosed by ====== bars):

======================================================================
13a
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>416 Requested Range Not Satisfiable</title>
</head><body>
<h1>Requested Range Not Satisfiable</h1>
<p>None of the range-specifier values in the Range
request-header field overlap the current extent
of the selected resource.</p>
</body></html>

0

HTTP/1.1 200 OK
Date: Sun, 10 Feb 2019 02:33:04 GMT
Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h
Keep-Alive: timeout=2, max=24
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8

ee3
======================================================================


The "13a" and "ee3" characters are present in the file. This data also seems to explain why the file saved to disk is about 1kb larger than the file downloaded individually. It looks like the index.html file saved to disk contains (i.e. begins with) garbage from a different request that ended in 416. After that prolog of apparent junk, the file proper seems to begin as expected --- but it also has several occurrences of '1f43'.
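
One possible reading, given the "Transfer-Encoding: chunked" header in both responses: '13a', 'ee3' and '1f43' all parse as hexadecimal numbers, which is the form of the chunk-size lines in HTTP/1.1 chunked transfer encoding. Whether that is what happened here is an assumption on my part; nothing in the logs confirms it. The sketch below is a minimal chunked-body decoder (not wget's code) run on a made-up sample, to show how a size line can land in the middle of a file name if the chunk framing is saved undecoded.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Minimal decoder for an HTTP/1.1 chunked body (trailers ignored)."""
    body = bytearray()
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        # The size line is hexadecimal; drop any ";extension" part.
        size = int(raw[pos:eol].split(b";", 1)[0].strip(), 16)
        pos = eol + 2
        if size == 0:
            break
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


# The tokens seen in the saved file, read as hexadecimal numbers.
sizes = {token: int(token, 16) for token in ("13a", "ee3", "1f43")}
print(sizes)

# Made-up sample: a chunk boundary falls inside a file name.
sample = b"4\r\nicon\r\n5\r\ns.png\r\n0\r\n\r\n"
decoded = decode_chunked(sample)
print(decoded)
```

In the undecoded sample, the file name is followed by CR LF, a hex number, and CR LF, which has the same shape as the "mp3ogg.png^M 1f43^M" excerpt quoted further down.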

A vimdiff run on bad.html and good.html shows some order differences, seemingly a table replaced with '1f43', and things of that nature. The structure of the differences is not immediately obvious, as there are very large sections that differ seemingly because the file was served in different order.

Andres.


On 2/17/19 12:15, Tim Rühsen wrote:
On 16.02.19 23:02, Andres Valloud wrote:
Tim,

I limited the data from 99gb to 3.3gb, and just to the directory where
I've seen the problem occur.  The strange string '1f43' appears in this
limited setup.  The '1f43' substring seems to appear deterministically
depending on the file name (I have not checked *every* occurrence by hand).

How should I track this down?

I'd use -d -olog and leave out -k. If 1f43 still appears, we know it's
not because of wget's parsing or conversion. In this case it's from the
server... check in which file 1f43 appears and find the request in the
log file.
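
For the "check in which file 1f43 appears" step, a small scan of the download directory can list each file and byte offset where the token occurs, with a little surrounding context. This is only a sketch: the function name find_token and the sample file are made up, and it reads each file fully into memory, which may be impractical for very large files.

```python
import os
import tempfile


def find_token(root, token=b"1f43", context=20):
    """Return (path, offset, surrounding bytes) for each occurrence of token."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()
            idx = data.find(token)
            while idx != -1:
                lo = max(0, idx - context)
                hits.append((path, idx, data[lo:idx + len(token) + context]))
                idx = data.find(token, idx + 1)
    return hits


# Demo on a made-up file; point find_token at the real download directory instead.
sample = b'<img src="icons/mp3ogg.png\r\n1f43\r\n">'
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "index.html"), "wb") as fh:
        fh.write(sample)
    hits = find_token(tmp)

for path, offset, snippet in hits:
    print(os.path.basename(path), offset, snippet)
```

The file names it reports can then be matched against the "Saving to:" lines in the debug log to find the corresponding request.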

Then try to download that file with a single (non-recursive) wget
command. Check if 1f43 appears in there. If it doesn't, compare both
requests to see the difference.

Let us know the results.

Regards, Tim


Andres.

On 2/14/19 04:03, Tim Rühsen wrote:
On 2/14/19 12:25 PM, Andres Valloud wrote:
Tim,

On 2/14/19 02:03, Tim Rühsen wrote:
I looked at the downloaded html files with grep.  They do contain the
substring "1f43", seemingly after a ^M character (I did not check every
single occurrence).  Sometimes, the ^M character is within a file name
such as this:

<tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
1f43^M
"

If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers simply
ignore it. This is nothing that can be addressed with
--restrict-file-names.

But to make sure, look at the original file by downloading it with
'wget <URL>'. Does the file have the above '1f43'/^M stuff in it as well?
If so, we can't do much about it.

If all looks ok in there, please attach both files so we can compare and
possibly reproduce.

If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming via Firefox.
curl and wget both have the --user-agent option for this.

Do you get a different file when using that option ?

There was one additional detail to make this work.  Instead of placing a
request for index.html, I had to ask curl to get just the directory name
ending with a slash.  Then the server responded with (essentially)
index.html.

A web server might give different content on 'dir', 'dir/' and
'dir/index.html'. This is sometimes puzzling and, as you can see, 'dir/'
can't be used as a filename - so we use 'dir/index.html' for that, which
is not correct if the server serves 'dir/index.php' when we request
'dir/'.


Both curl and wget retrieve index.html contents without '1f43' when
asking for just that URL.  vimdiff says the retrieved files are
identical.

Try to start with this URL using your original wget command line. You
could add a quota (-Q) to limit the amount of data, in the hope of
reproducing your issue with far fewer files and less data to download.

I am at a loss as to how to explain how the '1f43' problem appears when
asking wget to update the mirror of the site (rather than downloading a
single file).  I'll look at the log file tomorrow and see if I get more
ideas.

Try to reduce the needed amount of data to reproduce it.

Regards, Tim




