
Re: [Bug-wget] bad filenames (again)


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 21 Aug 2015 17:28:09 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
> On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:

> > Let me find some random site. Say
> > http://web2go.board19.com/gopro/go_view.php?id=12345

> The server tells us the document is UTF-8.
> The document tells us it is UTF-8.

And it is not. So - this example establishes that remote character set
information, when present, is often unreliable.

Let me add one more example:

http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html

a famous Danish recipe. The headers say "Content-Type: text/html"
without revealing any character set.

> > Moreover, the character set of a filename is in general unrelated
> > to the character set of the contents of the file. That is most clear
> > when the file is not a text file. What character set is the filename
> > 
> > http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
> 
> Wrong question. It is a JPEG file. Content doesn't matter to wget.

Hmm. I thought the topic of our discussion was filenames and character sets.
Here is a file, and its name is in ISO 8859-1.
When wget saves it, what will the filename be?
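
To make the question concrete, here is a minimal Python sketch (not wget code; the helper name describe_name is made up for illustration) that percent-decodes the last path component of that URL and shows what the raw bytes look like under two candidate character sets.

```python
from urllib.parse import unquote_to_bytes


def describe_name(escaped: str) -> dict:
    """Percent-decode a URL path component and try two interpretations.

    Hypothetical helper for illustration only; it does not reflect
    how wget handles filenames internally.
    """
    raw = unquote_to_bytes(escaped)  # the bytes the server actually uses
    try:
        as_utf8 = raw.decode("utf-8")
    except UnicodeDecodeError:
        as_utf8 = None  # the bytes are not valid UTF-8
    # ISO 8859-1 maps every byte to a character, so this never fails.
    as_latin1 = raw.decode("iso-8859-1")
    return {"raw": raw, "utf8": as_utf8, "latin1": as_latin1}


info = describe_name("kn%e4ckebr%f6d.jpg")
print(info["raw"])     # b'kn\xe4ckebr\xf6d.jpg'
print(info["utf8"])    # None
print(info["latin1"])  # knäckebröd.jpg
```

The bytes are not valid UTF-8, so a program in a UTF-8 locale has to either keep them as they are or guess a source character set before it can write a readable name.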

> If you want to download the above mentioned web page and 
> you have a UTF-8 locale, you have to tell wget via --local-encoding what 
> encoding the URL is.

Are you sure you do not mean --remote-encoding?

But whatever you mean, it is an additional option.
If the wget user already knows the character set, she can of course tell wget.

The discussion is about the situation where the user does not know.

So, that is the situation we are discussing: a remote site, the user
does not know what encoding is used (she will find out after downloading),
and the headers have either no information or wrong information.
Now if one invokes iconv it is likely that garbage will be the result.
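
A small Python sketch of that failure, under an assumed scenario: the name is really UTF-8 but the headers claim ISO 8859-1. The function convert_like_iconv is a hypothetical stand-in for an iconv-style conversion, not an existing API.

```python
def convert_like_iconv(data: bytes, assumed: str, target: str = "utf-8") -> bytes:
    """Convert bytes from an assumed source charset to a target charset.

    Hypothetical stand-in for an iconv-style conversion step.
    """
    return data.decode(assumed, errors="replace").encode(target)


# Assumed scenario: the name is really UTF-8 ...
actual = "rødgrød".encode("utf-8")

# ... but the headers claim ISO 8859-1, so the conversion "succeeds"
# and silently produces garbage.
wrong = convert_like_iconv(actual, "iso-8859-1")
print(wrong.decode("utf-8"))  # rÃ¸dgrÃ¸d

# With the right source charset the bytes come through unchanged.
right = convert_like_iconv(actual, "utf-8")
print(right.decode("utf-8"))  # rødgrød
```

Note that no error is raised in the wrong case: every byte is a valid ISO 8859-1 character, so nothing tells the program that the result is garbage.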

Andries


Here is a Korean example.
http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
The HTTP headers say Content-Type: text/plain; charset=iso-8859-1
(which is incorrect), and an internal header says that this is ISO-2022-KR
(which is also incorrect). In fact the content is in EUC-KR.
That is none of wget's business; we just want to save this file.
The headers say
Content-Disposition: attachment; 
filename="20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf"
This encodes a valid UTF-8 filename, and that name should be used.
So wget should save this file under the name
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
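
As a sketch of the rule suggested here (my illustration in Python, not what wget does; filename_from_escaped is a made-up name): percent-decode the filename, accept the result if the bytes are valid UTF-8, and otherwise leave the escaped string alone.

```python
from urllib.parse import unquote


def filename_from_escaped(value: str) -> str:
    """Percent-decode a filename if, and only if, it decodes as valid UTF-8.

    Hypothetical helper for illustration; the fallback of returning the
    escaped string unchanged is an assumption, not wget behaviour.
    """
    try:
        return unquote(value, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return value


ESCAPED = (
    "20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_"
    "%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809"
    "%EB%8B%A8-%EB%B0%B1_.sgf"
)

name = filename_from_escaped(ESCAPED)
print(name)

# A name whose bytes are ISO 8859-1 is not valid UTF-8 and stays escaped.
print(filename_from_escaped("r%f8dgr%f8d.html"))  # r%f8dgr%f8d.html
```

No charset header is consulted at all; the validity of the byte sequence itself is the only evidence used.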


