Re: [Bug-wget] bad filenames (again)

From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 21 Aug 2015 14:22:22 +0200
On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:

> > There is a remote site.
> > Nothing is known about this remote site.
> Wrong. Regarding HTTP(S), we exactly know the encoding
> of each downloaded HTML and CSS document
> (that's what I call 'remote encoding').

You are an optimist. In my experience Firefox rarely gets it right.
Let me find some random site. Say

If I go there with Firefox, I get a go board with a lot of mojibake
around it. Firefox took the encoding to be Unicode. Trying the options
in the "Text encoding" menu, the correct one turns out to be
"Chinese, Traditional".

> Leaving these misconfigured servers away as a special case

But most of the East Asian servers I meet are misconfigured in this way.
They announce text/html with charset utf-8 and come with some random
So trusting this announced charset should be done cautiously.
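
To make "cautiously" concrete, here is a minimal Python sketch (not wget code). It treats the announced charset as a first guess only: the body is decoded strictly, and if that fails a few other candidates are tried. The function name decode_cautiously and the fallback list are assumptions made up for this illustration. A successful decode does not prove the guess is right, since several East Asian encodings accept each other's byte sequences.

```python
def decode_cautiously(body: bytes, announced: str,
                      fallbacks=("big5", "gb18030", "euc-jp")):
    """Return (text, charset_used).

    Hypothetical helper: the fallback list and its order are an
    assumption for this example, not something a server announces.
    """
    for charset in (announced, *fallbacks):
        try:
            # Strict decoding: any invalid byte sequence raises.
            return body.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, or a charset name Python does not know.
            continue
    # Last resort: latin-1 maps every byte, so this cannot fail.
    return body.decode("latin-1"), "latin-1"


# A Big5 page that was announced as utf-8.
page = "中文".encode("big5")
text, used = decode_cautiously(page, "utf-8")
print(used, text)
```

The order of the fallbacks matters and is itself a guess; a real tool would want a better heuristic or a user override.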

And you say "misconfigured servers", but often one gets a
Unix or Windows file hierarchy, and several character sets occur.
The server doesn't know. The sysadmin doesn't know. A university
machine will have many users with files in several languages
and character sets.

Moreover, the character set of a filename is in general unrelated
to the character set of the contents of the file. That is most clear
when the file is not a text file. What character set is the filename


in? You recognize ISO 8859-1 or similar. My local machine is on UTF-8.
The HTTP headers say "Content-Type: image/jpeg".
How can wget guess?
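
About the best one can do is a heuristic like the following Python sketch (an illustration, not what wget does). It accepts the raw filename bytes as UTF-8 if they decode strictly, and otherwise falls back to a single-byte charset. The function name guess_filename, the default fallback iso-8859-1, and the sample name are assumptions chosen for this example.

```python
def guess_filename(raw: bytes, fallback: str = "iso-8859-1"):
    """Return (name, charset_guessed) for raw filename bytes.

    Hypothetical helper: this is only a guess. Nothing in the HTTP
    headers says which charset the filename is in.
    """
    try:
        # Non-ASCII text in legacy charsets is rarely valid UTF-8,
        # so a strict decode is a reasonable first test.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; errors="replace" keeps this from raising
        # if a multi-byte fallback is passed instead.
        return raw.decode(fallback, errors="replace"), fallback


# Made-up sample name, encoded the way an ISO 8859-1 machine stores it.
raw = "café.jpg".encode("iso-8859-1")
print(guess_filename(raw))
```

This cannot tell ISO 8859-1 from ISO 8859-15 or from any other single-byte set, which is the point of the question above.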


