From: Tim Ruehsen
Subject: Re: [Bug-wget] Problem with ÅÄÖ and wget
Date: Tue, 17 Sep 2013 09:49:37 +0200
User-agent: KMail/4.10.5 (Linux/3.10-3-amd64; KDE/4.10.5; x86_64; ; )

On Tuesday 17 September 2013 00:17:21 Ángel González wrote:
> On 16/09/13 12:50, Tim Ruehsen wrote:
> > Just to have it mentioned:
> > Your download (wget -r http://bmit.se/wget) succeeds, but it shouldn't!
> > IMHO, Wget has a bug here, and your test case succeeds only because of
> > this bug.
> > 
> > Why?
> > Your wget/index.html holds the UTF-8-encoded URL 'teståäöÅÄÖ', but neither
> > the server header (Content-Type: text/html) nor the document itself (META
> > http-equiv ...) defines the charset. That means the charset encoding of
> > index.html should be taken as ISO-8859-1. See [1].
> > Wget should have interpreted the URL 'teståäöÅÄÖ' as ISO-8859-1 and
> > converted it into UTF-8, and that download would then have failed.
> > 
> > Conclusion
> > 1. Be prepared that Wget will change its behaviour sooner or later (make
> > sure you specify/deliver the charset encoding of your documents).
> > 2. Wget will/does have problems with ISO-8859-1 text/html pages if the
> > charset is not specified AND special chars are used.
> > 
> > Can someone prove me wrong?
> 
> I think that in the past, if the document was in ISO-8859-1,
> it would have been legal to give the server the URL *encoded in ISO-8859-1*,
> thus resulting in the same %-encoded URL.

Just to make it clear, we are talking about two different things.
1. What is the encoding of a URL found in a downloaded document?
[1] spells out the steps for determining it.

2. What is the encoding of the URL provided in the GET request?
RFC2616 and RFC3986 agree here:
a. convert the URL into UTF-8
b. percent-encode the octets that are NOT in the 'unreserved' set
(see RFC3986 2.5, last paragraph)
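
A rough sketch of those two steps (my own illustration in Python, not Wget
code; the path 'teståäöÅÄÖ' is taken from the bmit.se example):

  from urllib.parse import quote

  path = "teståäöÅÄÖ"                 # URL text, charset already determined via 1.
  octets = path.encode("utf-8")       # a. convert to UTF-8 octets
  print(quote(octets, safe="-._~"))   # b. %-encode octets outside 'unreserved'
  # prints: test%C3%A5%C3%A4%C3%B6%C3%85%C3%84%C3%96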

To do 2.a., one has to know the 'original' character set.
When we are in recursive mode (as in the bmit.se example), we have to use the
steps in [1] to determine the 'original' charset before we can generate a GET
request.
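
Roughly, the divination in [1] boils down to something like this (a simplified
sketch with a hypothetical helper name, not Wget's implementation; real header
and HTML parsing is more involved):

  import re

  def page_charset(content_type_header, html_text):
      # 1. charset parameter of the HTTP Content-Type header
      m = re.search(r'charset=([^;\s]+)', content_type_header or '', re.I)
      if m:
          return m.group(1)
      # 2. META declaration inside the document itself
      m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html_text, re.I)
      if m:
          return m.group(1)
      # 3. otherwise the HTTP/1.1 default for text/*: ISO-8859-1
      return 'iso-8859-1'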

The reason why Wget works with so many sites is that most sites are either
ASCII or ISO-8859-1. And sites with non-ASCII domain names very often use only
ASCII characters in their URL path/query/fragment. So no problem there.

I know nothing about how servers interpret the GET URL. They might do some
guessing, e.g. falling back to ISO-8859-1 if decoding as UTF-8 fails.
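
Such guessing might look something like this (purely hypothetical, not taken
from any particular server implementation):

  def decode_request_path(raw):
      try:
          return raw.decode('utf-8')
      except UnicodeDecodeError:
          # ISO-8859-1 assigns a character to every byte value, so this never fails
          return raw.decode('iso-8859-1')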

> > [1] http://nikitathespider.com/articles/EncodingDivination.html
> Note that these steps are outdated now (that was written in 2008 at the latest).

Outdated by exactly what? RFC3986 is from 2005 and does not contradict [1].
See my explanation above.

> 
> On 16/09/13 16:29, Tony Lewis wrote:
> > Neither Firefox nor Internet Explorer can navigate that link. Both
> > fail trying to retrieve teståäöÅÄÖ.
> 
> That's strange. I can browse it on Firefox 23. Perhaps its guessing is
> better.

In Firefox 23.0.1 (Debian SID), running in a UTF-8 environment:
When I enter bmit.se/wget, the third link is displayed incorrectly.
This is as expected: following [1], the page should be ISO-8859-1, but it
includes a UTF-8 encoded URL. These UTF-8 bytes are then interpreted as
ISO-8859-1 and converted to UTF-8, and thus display incorrectly.
If you are in an ISO-8859-1 or similar environment (some Windows encodings are
very similar), the link would display correctly, but that is just a lucky
coincidence (the same goes for Wget here).
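
The effect is easy to reproduce outside the browser (just an illustration of
the mismatch, again in Python):

  raw = "teståäöÅÄÖ".encode("utf-8")   # the bytes actually stored in index.html
  print(raw.decode("iso-8859-1"))      # decoded with [1]'s ISO-8859-1 default:
                                       # prints mojibake ('testÃ¥Ã¤Ã¶...') instead
                                       # of the intended characters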

To have it look correct everywhere, bmit.se/wget should either declare the
charset UTF-8 (in the response header or in a META tag) or use only
ISO-8859-1-encoded characters in the page.
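
That is, either send

  Content-Type: text/html; charset=UTF-8

in the response header, or put

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

into the page's <head>.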

[1] http://nikitathespider.com/articles/EncodingDivination.html




