Re: [Bug-wget] bad filenames (again)

From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 21 Aug 2015 16:34:36 +0200
On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:
> > > There is a remote site.
> > > Nothing is known about this remote site.
> > 
> > Wrong. Regarding HTTP(S), we exactly know the encoding
> > of each downloaded HTML and CSS document
> > (that's what I call 'remote encoding').
> You are an optimist. In my experience Firefox rarely gets it right.
> Let me find some random site. Say
> http://web2go.board19.com/gopro/go_view.php?id=12345

I try to be an optimist in all situations, yes :-)

> If I go there with Firefox, I get a go board with a lot of mojibake
> around it. Firefox took the encoding to be Unicode. Trying out what
> I have to say in the "Text encoding" menu, it turns out to be
> "Chinese, Traditional".

The server tells us the document is UTF-8.
The document tells us it is UTF-8.
But then some moron (there are a lot of these dudes doing webpage 'design') 
put non-UTF-8 text into the document.
That is like putting plum pudding into a jar labeled 'strawberry jam'. What 
will you do? Go back and return it? Or accept it, saying 'uh oh, my 
strawberry allergy will bite me, but I am a tough guy'?
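
To make the mislabelled jar concrete, here is a minimal Python sketch. It
assumes the page text is really Big5 (my guess at what Firefox calls
"Chinese, Traditional"), and the sample string is made up. Strict UTF-8
decoding rejects the bytes; a lenient decoder yields replacement
characters, which is one form the mojibake can take.

```python
# Hypothetical sample: Traditional Chinese text encoded as Big5,
# but delivered with a label that claims UTF-8.
body = "中文".encode("big5")


def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if data is valid in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


labelled_ok = decodes_as(body, "utf-8")  # what the label on the jar says
actual_ok = decodes_as(body, "big5")     # what is really inside

# A lenient decoder does not fail; it substitutes U+FFFD for bad bytes.
lenient = body.decode("utf-8", errors="replace")

print(labelled_ok, actual_ok, repr(lenient))
```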

*BUT* that is not the point for wget, since wget doesn't mess around with the 
textual content (no conversion takes place). When used recursively, wget will 
extract URLs from the document. *NOT* from the text but from the HTML 
tags/attributes. And *surprise*, all of the links in the document are UTF-8 / 
ASCII (no browser in the world would expect anything else).
And all that matters are the URLs from the HTML attributes.
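
The difference between "URLs in the text" and "URLs in the attributes" can
be sketched in Python as below. This is only an illustration, not wget's
actual parser; the class name LinkExtractor, the sample page, and the
reduced attribute set (href, src) are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect URL-bearing attribute values; text content is ignored."""

    URL_ATTRS = {"href", "src"}  # simplified; a real crawler checks more

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.links.append(value)


# Hypothetical page: one URL appears only in running text, two in attributes.
page = (
    "<html><body>"
    "<p>see http://example.com/in-text for details</p>"
    '<a href="kn%e4ckebr%f6d.jpg">picture</a>'
    '<img src="/img/board.png">'
    "</body></html>"
)

parser = LinkExtractor()
parser.feed(page)
parser.close()
links = parser.links

print(links)  # ['kn%e4ckebr%f6d.jpg', '/img/board.png']
```

The URL that appears only in the paragraph text is never collected, so
whatever encoding mess lives in the visible text does not affect the links.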

> And you say "misconfigured servers", but often one gets a
> Unix or Windows file hierarchy, and several character sets occur.
> The server doesn't know. The sysadmin doesn't know. A university
> machine will have many users with files in several languages
> and character sets.

Trust them, they know. If not, their web site will be heavily broken.
But there is nothing to fix for us.

> Moreover, the character set of a filename is in general unrelated
> to the character set of the contents of the file. That is most clear
> when the file is not a text file. What character set is the filename
> http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file. Content doesn't matter to wget.

Apart from that, if you want to download the above-mentioned file and 
you have a UTF-8 locale, you have to tell wget via --local-encoding what 
encoding the URL is in. But if wget --recursive finds the above URL within an 
HTML attribute, you won't need --local-encoding. Following the rules in 
http://www.w3.org/TR/html4/charset.html#h-5.2.2, wget will know the correct 
encoding and will just do the right thing (after the currently discussed 
change regarding charsets / file naming). Wget2 already does it.

$ wget --local-encoding=iso-8859-1 
--2015-08-21 16:30:05--  
Resolving www.win.tue.nl (www.win.tue.nl)...
Connecting to www.win.tue.nl (www.win.tue.nl)||:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-21 16:30:05 ERROR 404: Not Found.

--2015-08-21 16:30:05--  
Reusing existing connection to www.win.tue.nl:80.
HTTP request sent, awaiting response... 200 OK
Length: 11690 (11K) [image/jpeg]
Saving to: ‘knäckebröd.jpg’

11.42K  --.-KB/s   in 0.002s 

2015-08-21 16:30:05 (6.83 MB/s) - ‘knäckebröd.jpg’ saved [11690/11690]

(This is an old wget that still has the progress bar bug.)
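
Why the encoding of the URL matters for the saved filename can be shown
with a small Python sketch. It only illustrates percent-decoding under two
assumed charsets using the standard urllib.parse module; it is not what
wget does internally, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

escaped = "kn%e4ckebr%f6d.jpg"

# The escapes stand for raw bytes; on their own they carry no charset.
raw = unquote_to_bytes(escaped)  # b'kn\xe4ckebr\xf6d.jpg'

# Interpreted as ISO-8859-1, the bytes give the intended name.
name_latin1 = unquote(escaped, encoding="iso-8859-1")  # 'knäckebröd.jpg'

# Interpreted as UTF-8, the bytes are invalid and get replaced by U+FFFD.
name_utf8 = unquote(escaped, encoding="utf-8", errors="replace")

# The same name, escaped again as UTF-8, is a different URL on the wire.
utf8_escaped = quote(name_latin1, encoding="utf-8")  # 'kn%C3%A4ckebr%C3%B6d.jpg'

print(raw, name_latin1, repr(name_utf8), utf8_escaped)
```

A server that stores the file under the ISO-8859-1 bytes would not
recognise the UTF-8 form of the URL, so the charset has to be known before
the name is converted for the local filesystem.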


