Re: [Bug-wget] bad filenames (again)
From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 21 Aug 2015 16:34:36 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )
On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:
> > > There is a remote site.
> > > Nothing is known about this remote site.
> >
> > Wrong. Regarding HTTP(S), we exactly know the encoding
> > of each downloaded HTML and CSS document
> > (that's what I call 'remote encoding').
>
> You are an optimist. In my experience Firefox rarely gets it right.
> Let me find some random site. Say
> http://web2go.board19.com/gopro/go_view.php?id=12345
I try to be an optimist in all situations, yes :-)
> If I go there with Firefox, I get a go board with a lot of mojibake
> around it. Firefox took the encoding to be Unicode. Trying out what
> I have to say in the "Text encoding" menu, it turns out to be
> "Chinese, Traditional".
The server tells us the document is UTF-8.
The document tells us it is UTF-8.
But then, some moron (there are a lot of these dudes doing webpage 'design')
put non-UTF-8 text into the document.
That is like putting plum pudding into a jar labeled 'strawberry jam'. What
will you do? Go back and return it? Or accept it, saying 'uh oh, my
strawberry allergy will bite me, but I am a tough guy'?
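To make the 'wrong label on the jar' case concrete, here is a minimal Python sketch. The helper name is_valid_utf8 and the sample string are my own assumptions, and Big5 is only my guess at what "Chinese, Traditional" means for that page. It simply checks whether bytes that are labeled UTF-8 really decode as UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same two characters, once honestly encoded as UTF-8 and once as Big5.
honest = "中文".encode("utf-8")
mislabeled = "中文".encode("big5")  # assumed stand-in for the page's real content

print(is_valid_utf8(honest))      # content matches the UTF-8 label
print(is_valid_utf8(mislabeled))  # content contradicts the UTF-8 label
```

A check like this only detects the mismatch. It cannot tell you which encoding the author actually used.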
*BUT* that is not the point for wget, since wget doesn't mess around with the
textual content (no conversion takes place). When used recursively, wget will
extract URLs from the document. *NOT* from the text but from the HTML
tags/attributes. And *surprise*, all of the links in the document are UTF-8 /
ASCII (not a single browser in the world would expect anything else).
And all that matters are the URLs from the HTML attributes.
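As an illustration of 'from the attributes, not from the text', here is a small Python sketch using the standard html.parser module. It is not wget's code. The class name LinkExtractor, the function extract_links and the choice of only href and src are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect URL-bearing attribute values and ignore the text content."""

    URL_ATTRS = {"href", "src"}  # assumption: a small subset, for illustration

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.links.append(value)


def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<p>see http://example.com/in-text <a href="/a.html">x</a><img src="b.jpg"></p>'
print(extract_links(page))
```

The URL written in the paragraph text is never reported. Only the two attribute values are, so mojibake in the visible text does not affect which links get followed.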
> And you say "misconfigured servers", but often one gets a
> Unix or Windows file hierarchy, and several character sets occur.
> The server doesn't know. The sysadmin doesn't know. A university
> machine will have many users with files in several languages
> and character sets.
Trust them, they know. If not, their web site will be heavily broken.
But there is nothing to fix for us.
> Moreover, the character set of a filename is in general unrelated
> to the character set of the contents of the file. That is most clear
> when the file is not a text file. What character set is the filename
>
> http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
Wrong question. It is a JPEG file. Content doesn't matter to wget.
Apart from that, if you want to download the above-mentioned file and
you have a UTF-8 locale, you have to tell wget via --local-encoding what
encoding the URL is in. But if wget --recursive finds the above URL within an HTML
attribute, you won't need --local-encoding. Following the rules in
http://www.w3.org/TR/html4/charset.html#h-5.2.2, wget will know the correct
encoding and will just do the right thing (after the currently discussed
change regarding charsets / file naming). Wget2 already does it.
$ wget --local-encoding=iso-8859-1 'http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg'
--2015-08-21 16:30:05--  http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg
Resolving www.win.tue.nl (www.win.tue.nl)... 131.155.0.177
Connecting to www.win.tue.nl (www.win.tue.nl)|131.155.0.177|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-21 16:30:05 ERROR 404: Not Found.
--2015-08-21 16:30:05--  http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
Reusing existing connection to www.win.tue.nl:80.
HTTP request sent, awaiting response... 200 OK
Length: 11690 (11K) [image/jpeg]
Saving to: ‘knäckebröd.jpg’
knäckebröd.jp 100%[=========================================================================>] 11.42K --.-KB/s in 0.002s
2015-08-21 16:30:05 (6.83 MB/s) - ‘knäckebröd.jpg’ saved [11690/11690]
(Old wget having the progress bar bug.)
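The two request URLs in the transcript differ only in how the same characters are percent-encoded. Here is a rough Python sketch of that conversion, assuming the source encoding is known to be ISO-8859-1. The function name reencode_path is hypothetical, and this is not how wget implements it internally.

```python
from urllib.parse import quote, unquote_to_bytes


def reencode_path(path: str, source_encoding: str) -> str:
    """Turn a percent-encoded path in source_encoding into its UTF-8 form."""
    raw = unquote_to_bytes(path)        # %e4 -> byte 0xE4, and so on
    text = raw.decode(source_encoding)  # interpret the bytes as characters
    # Percent-encode the UTF-8 bytes; keep "/" and "~" readable.
    return quote(text.encode("utf-8"), safe="/~")


original = "/~aeb/linux/lk/kn%e4ckebr%f6d.jpg"
print(reencode_path(original, "iso-8859-1"))
# The decoded last path component is what ends up as the local file name:
print(unquote_to_bytes("kn%e4ckebr%f6d.jpg").decode("iso-8859-1"))
```

Whether the server accepts the UTF-8 form is a different matter: in the transcript it answers 404 for it and 200 for the original bytes.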
Tim