Re: [Bug-wget] bad filenames (again)


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 21 Aug 2015 14:22:22 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:

> > There is a remote site.
> > Nothing is known about this remote site.
>
> Wrong. Regarding HTTP(S), we exactly know the encoding
> of each downloaded HTML and CSS document
> (that's what I call 'remote encoding').

You are an optimist. In my experience Firefox rarely gets it right.
Let me find some random site. Say
http://web2go.board19.com/gopro/go_view.php?id=12345

If I go there with Firefox, I get a go board with a lot of mojibake
around it. Firefox took the encoding to be Unicode. Trying out the
options in the "Text encoding" menu, it turns out to be
"Chinese, Traditional".

> Leaving these misconfigured servers away as a special case

But most of the East Asian servers I meet are misconfigured in this way.
They announce text/html with charset utf-8 but deliver the content in
some other charset.
So the announced charset should be trusted only with caution.
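
As an illustration of that caution, here is a minimal Python sketch (not
wget code; the function name is hypothetical) that checks whether a body
actually decodes under the charset the server announced. Passing the
check does not prove the announcement is right, but failing it shows
the announcement is wrong.

```python
def announced_charset_is_plausible(body: bytes, announced: str) -> bool:
    """Return False if the body cannot be decoded with the announced charset.

    A True result is only a necessary condition: many byte sequences are
    valid in several charsets at once.
    """
    try:
        body.decode(announced, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # LookupError covers charset names that Python does not know.
        return False
    return True


# Big5 (Traditional Chinese) bytes served by a host that claims utf-8.
big5_body = "中文".encode("big5")
print(announced_charset_is_plausible(big5_body, "utf-8"))
print(announced_charset_is_plausible(big5_body, "big5"))
```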

And you say "misconfigured servers", but often one gets a
Unix or Windows file hierarchy, and several character sets occur.
The server doesn't know. The sysadmin doesn't know. A university
machine will have many users with files in several languages
and character sets.

Moreover, the character set of a filename is in general unrelated
to the character set of the contents of the file. That is most clear
when the file is not a text file. What character set is the filename

http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

in? You recognize ISO 8859-1 or similar. My local machine is on UTF-8.
The HTTP headers say "Content-Type: image/jpeg".
How can wget guess?
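
About the best a client can do is a heuristic. The following Python
sketch (hypothetical function name, not what wget does) percent-decodes
the name to raw bytes, tries strict UTF-8 first, and otherwise falls
back to a charset that is purely an assumption, here ISO 8859-1.

```python
from urllib.parse import unquote_to_bytes


def guess_filename(escaped: str, fallback: str = "iso-8859-1") -> tuple[str, str]:
    """Decode a percent-escaped filename, returning (name, charset used).

    The fallback charset is a guess; nothing in the URL or in the
    HTTP headers confirms it.
    """
    raw = unquote_to_bytes(escaped)
    try:
        # Bytes that form valid UTF-8 are rarely anything else.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte to a character, so this cannot fail,
        # which also means it cannot detect a wrong guess.
        return raw.decode(fallback), fallback


name, charset = guess_filename("kn%e4ckebr%f6d.jpg")
print(name, charset)
```

For a name that is really in, say, KOI8-R or Big5, the same code would
silently produce the wrong characters, which is the point of the
question above.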

Andries
