bug-wget

Re: [Bug-wget] bad filenames (again)


From: Tim Rühsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 21 Aug 2015 20:54:28 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )

On Friday, 21 August 2015, 17:28:09, Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
> > On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
> > > Let me find some random site. Say
> > > http://web2go.board19.com/gopro/go_view.php?id=12345
> > 
> > The server tells us the document is UTF-8.
> > The document tells us it is UTF-8.
> 
> And it is not. So - this example establishes that remote character set
> information, when present, is often unreliable.
> 
> Let me add one more example,
> 
> http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html
> 
> a famous Danish recipe. The headers say "Content-Type: text/html"
> without revealing any character set.

1. There is no URL to parse in this document, so encoding does not matter 
anyway.

2. If the server AND the document do not explicitly specify the character
encoding, there still is one - namely the default. That used to be ISO-8859-1.
AFAIR, HTML5 might have changed that (too late for me now to look it up).

There is a good diagram - maybe not perfectly up-to-date, but it still shows
roughly how to operate:
http://nikitathespider.com/articles/EncodingDivination.html
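
The fallback order described in point 2 (server header first, then the document itself, then a default) can be sketched roughly as below. This is only an illustration under stated assumptions, not what Wget does internally: the function names are hypothetical, the meta lookup is a crude regex over the first bytes rather than a real HTML parser, and ISO-8859-1 is used as the default simply because it is the historical one mentioned above.

```python
import re

# Assumption: the historical default named in the text above.
DEFAULT_CHARSET = "iso-8859-1"


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    m = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_meta(body):
    """Crude scan of the first 1024 bytes for a <meta ... charset=...> tag."""
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
                  body[:1024], re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None


def guess_charset(content_type, body):
    """Header wins, then the document's own declaration, then the default."""
    return (charset_from_content_type(content_type)
            or charset_from_meta(body)
            or DEFAULT_CHARSET)
```

As the examples in this thread show, whatever this returns is only what the remote side claims; it can still be wrong.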

 
> > > Moreover, the character set of a filename is in general unrelated
> > > to the character set of the contents of the file. That is most clear
> > > when the file is not a text file. What character set is the filename
> > > 
> > > http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
> > 
> > Wrong question. It is a JPEG file. Content doesn't matter to wget.
> 
> Hmm. I thought the topic of our discussion was filenames and character sets.
> Here is a file, and its name is in ISO 8859-1.
> When wget saves it, what will the filename be?
> 
> > If you want to download the above mentioned web page and
> > you have a UTF-8 locale, you have to tell wget via --local-encoding what
> > encoding the URL is.
> 
> Are you sure you do not mean --remote-encoding?

Yes, I am sure. Here are my tests (my locale is UTF-8):

Wrong:
$ wget -nv --remote-encoding=iso-8859-1 http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
2015-08-21 20:09:30 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg [11690/11690] -> "kn�ckebr�d.jpg.1" [1]

Right:
$ wget -nv --local-encoding=iso-8859-1 http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg:
2015-08-21 20:14:18 ERROR 404: Not Found.
2015-08-21 20:14:18 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg [11690/11690] -> "knäckebröd.jpg" [1]
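
To make the byte-level difference concrete, here is a small sketch of the conversion that shows up in the log: the percent-encoded bytes are not valid UTF-8, but read as ISO-8859-1 they give the expected name, and re-encoding that name as UTF-8 yields the kn%C3%A4ckebr%C3%B6d.jpg form that the server answered with 404. The helper names are made up for illustration and this is not Wget's code.

```python
from urllib.parse import quote, unquote_to_bytes


def is_valid_utf8(percent_encoded):
    """True if the unescaped bytes decode cleanly as UTF-8."""
    try:
        unquote_to_bytes(percent_encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8_url_form(percent_encoded, source_encoding):
    """Unescape, decode with the given encoding, re-escape as UTF-8."""
    raw = unquote_to_bytes(percent_encoded)      # b'kn\xe4ckebr\xf6d.jpg'
    text = raw.decode(source_encoding)           # the human-readable name
    return text, quote(text.encode("utf-8"), safe="/~")


original = "kn%e4ckebr%f6d.jpg"
print(is_valid_utf8(original))                   # False
name, converted = to_utf8_url_form(original, "iso-8859-1")
print(converted)                                 # kn%C3%A4ckebr%C3%B6d.jpg
```

The variable `name` holds the string a UTF-8 locale should see as the local filename; the server itself only knows the file under the original ISO-8859-1 bytes.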


> But whatever you mean, it is an additional option.
> If the wget user already knows the character set, she can of course tell
> wget.
> 
> The discussion is about the situation where the user does not know.
> 
> So, that is the situation we are discussing: a remote site, the user
> does not know what encoding is used (she will find out after downloading),
> and the headers have either no information or wrong information.
> Now if one invokes iconv it is likely that garbage will be the result.


> Here a Korean example.
> http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
> The http headers say Content-Type: text/plain; charset=iso-8859-1
> (which is incorrect), an internal header says that this is ISO-2022-KR
> (which is also incorrect), in fact the content is in EUC-KR.
> That is none of wget's business, we want to save this file.
> The headers say
> Content-Disposition: attachment;
> filename="20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf"
> This encodes a valid utf-8 filename, and that name should be used.
> So wget should save this file under the name
> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

This is a different issue. Here we are talking about the encoding of HTTP
headers, specifically the 'filename' value in the Content-Disposition header.
The above is correctly encoded (percent-encoded UTF-8).

The encoding is described in RFC 5987 (Character Set and Language Encoding
for Hypertext Transfer Protocol (HTTP) Header Field Parameters).
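
For reference, RFC 5987 defines an extended parameter value of the form charset'language'percent-encoded-octets, typically sent as filename*=... in Content-Disposition. A minimal decoder might look like the sketch below; `decode_ext_value` is a hypothetical name, and the sketch assumes a well-formed value and skips the validation a real parser would need.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value):
    """Decode an RFC 5987 ext-value such as UTF-8''%e2%82%ac%20rates.

    Assumes the value is well formed: charset, optional language tag,
    then percent-encoded octets, separated by single quotes.
    """
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset, errors="strict")


# Example values taken from the RFC itself.
pound_euro = decode_ext_value("UTF-8''%c2%a3%20and%20%e2%82%ac%20rates")
pound_only = decode_ext_value("iso-8859-1'en'%A3%20rates")
```

Note that the header in the Korean example uses a plain filename="..." parameter with percent-escapes rather than the filename*= form, so a client would additionally have to decide whether to unescape plain values.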

Wget simply does not parse this correctly - it is just not coded in.
That is why support for Content-Disposition in Wget is documented as 
'experimental' (you have to explicitly enable it via --content-disposition).

Again, the server encoding is known. Regarding filename encoding, nothing is
wrong in your example. Wget is just missing some code here (worth opening a
separate bug).


Default Wget behavior:
$ wget -nv http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
2015-08-21 20:20:05 URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] -> "1847B5314CF754B83134B7" [1]


Enabled Content-Disposition support:
$ wget -nv --content-disposition http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
2015-08-21 20:23:50 URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] -> "20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf" [1]

As we see, unescaping and UTF-8 to locale conversion do not take place here.
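
The two missing steps (unescaping, then UTF-8 to locale conversion) could look roughly like this. It is a sketch only: the function names are hypothetical, the header value is assumed to be percent-encoded UTF-8 as in this example, and characters the locale cannot represent are simply replaced, which a real client might handle differently.

```python
import locale
from urllib.parse import unquote

# The filename exactly as Wget saved it in the run above.
HEADER_FILENAME = (
    "20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_"
    "%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809"
    "%EB%8B%A8-%EB%B0%B1_.sgf"
)


def unescape_filename(value, charset="utf-8"):
    """Step 1: turn the percent-escapes back into characters."""
    return unquote(value, encoding=charset, errors="strict")


def to_locale_bytes(name, encoding=None):
    """Step 2: encode the name for the local filesystem.

    Falls back to the process locale when no encoding is given;
    unrepresentable characters are replaced (an assumption of this sketch).
    """
    encoding = encoding or locale.getpreferredencoding(False)
    return name.encode(encoding, errors="replace")


decoded = unescape_filename(HEADER_FILENAME)
local_name = to_locale_bytes(decoded)
```

In a UTF-8 locale, `decoded` is the Korean name quoted earlier and `local_name` is its UTF-8 byte form; in a locale without Hangul the replacement step would lose information.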

Tim
