Re: [Bug-wget] Problem with ÅÄÖ and wget
Tue, 17 Sep 2013 09:49:37 +0200
KMail/4.10.5 (Linux/3.10-3-amd64; KDE/4.10.5; x86_64; ; )
On Tuesday 17 September 2013 00:17:21 Ángel González wrote:
> On 16/09/13 12:50, Tim Ruehsen wrote:
> > Just to have it mentioned:
> > Your download (wget -r http://bmit.se/wget) succeeds, but it shouldn't !
> > IMHO, Wget has a bug here and just because of this bug your test case
> > succeeds.
> > Why ?
> > Your wget/index.html holds the UTF-8 encoded URL 'teståäöÅÄÖ', but neither
> > the server header (Content-Type: text/html) nor the document itself (META
> > http- equiv ...) defines the charset. That means the charset encoding of
> > index.html should be ISO-8859-1. See .
> > Wget should have taken the URL 'teståäöÅÄÖ' as ISO-8859-1 and convert it
> > into UTF-8, which would fail to download.
> > Conclusion
> > 1. Be prepared that Wget will change its behaviour sooner or later (make
> > sure you specify / deliver the charset encoding of your documents).
> > 2. Wget will/does have problems with ISO-8859-1 text/html pages if the
> > charset is not specified AND special chars are used.
> > Someone proving me wrong ?
> I think that in the past, if the document was in iso-8859-1, imho
> it would be legal to give the server the url *encoded in iso-8859-1*,
> thus resulting in the same %-encoded url.
Just to make clear, we are talking about two different things.
1. What is the encoding of a URL found in a downloaded document ?
 makes it clear, how the steps are.
2. What is the encoding of the URL provided in the GET request ?
RFC2616 and RFC3986 have the same opinion:
a. convert the URL into UTF-8
b. percent-encode all octets that are not in the 'unreserved' set.
(see RFC3986 2.5, last paragraph)
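
To illustrate the two steps, here is a minimal sketch in Python (not Wget's code; the helper name percent_encode_path_segment is made up for this example). It encodes a string into octets and percent-encodes every octet outside the RFC3986 'unreserved' set, once via UTF-8 and once via ISO-8859-1, so you can see that the two choices give different request URLs.

```python
from urllib.parse import quote

def percent_encode_path_segment(segment: str, charset: str = "utf-8") -> str:
    """Hypothetical helper: encode one path segment for a GET request."""
    # Step a: turn the characters into octets using the given charset.
    # Step b: percent-encode every octet that is not an RFC 3986
    # 'unreserved' character (ALPHA / DIGIT / "-" / "." / "_" / "~").
    # safe="" makes sure that "/" is percent-encoded as well.
    return quote(segment, safe="", encoding=charset)

name = "teståäöÅÄÖ"
utf8_form = percent_encode_path_segment(name)                   # what the RFC asks for
latin1_form = percent_encode_path_segment(name, "iso-8859-1")   # the legacy variant

print(utf8_form)
print(latin1_form)
```

The two printed strings differ for every non-ASCII character, so a server that expects one form will not necessarily find the file when it is given the other.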
To do 2.a. one has to know the 'original' character set.
When we are in recursive mode (as in the bmit.se example) we have to use 
to determine the 'original' charset before we can generate a GET request.
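
A rough sketch of such a charset lookup in Python, under simplifying assumptions: it only checks the Content-Type response header and then a META tag near the start of the document, and falls back to ISO-8859-1 if neither names a charset. The function name guess_charset, the 1024-byte window and the regular expressions are assumptions for this example, not the complete procedure and not what Wget implements.

```python
import re

# Assumed fallback when nothing is declared, as argued in this thread.
DEFAULT_CHARSET = "iso-8859-1"

def guess_charset(content_type, body):
    """Hypothetical helper: return the charset to assume for an HTML page.

    content_type -- value of the Content-Type response header (or None)
    body         -- raw bytes of the document
    """
    # 1. charset parameter of the HTTP Content-Type header
    if content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()

    # 2. a META tag carrying a charset, searched in the first bytes only;
    #    non-ASCII bytes are replaced since only the tag itself matters here
    head = body[:1024].decode("ascii", errors="replace")
    m = re.search(r"""<meta[^>]*charset\s*=\s*["']?([\w.:-]+)""", head, re.I)
    if m:
        return m.group(1).lower()

    # 3. nothing declared
    return DEFAULT_CHARSET
```

For a page like the bmit.se example (header "text/html" without a charset, no META tag), this returns the ISO-8859-1 fallback.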
The reason why Wget works with so many sites is that most sites are either
ASCII or ISO-8859-1. And sites with non-ASCII domain names very often use
only ASCII characters in their URL path/query/fragment. So no problem here.
I know nothing about how servers interpret the GET URL. They might do some
guessing, e.g. falling back to ISO-8859-1 if decoding as UTF-8 fails.
> >  http://nikitathespider.com/articles/EncodingDivination.html
> Note that these steps are outdated now (that was written at most at 2008).
Outdated by what, exactly ? RFC3986 is from 2005 and does not contradict .
See my explanation above.
> On 16/09/13 16:29, Tony Lewis wrote:
> > Neither Firefox nor Internet Explorer can navigate that link. Both
> > fail trying to retrieve testÃ¥Ã¤Ã¶Ã…Ã„Ã–.
> That's strange. I can browse it on Firefox 23. Perhaps its guessing is
In Firefox 23.0.1 (Debian SID), running in a UTF-8 environment:
When I enter bmit.se/wget, the third link is displayed incorrectly.
This is as expected since following , the page should be ISO-8859-1, but
includes a UTF-8 encoded URL. The UTF-8 bytes are then treated as ISO-8859-1
characters and converted to UTF-8, and thus are displayed incorrectly.
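
The effect can be reproduced with a few lines of Python. This is only an illustration, with one assumption to note: it uses the cp1252 codec, because browsers commonly treat pages labelled ISO-8859-1 as windows-1252; with a strict ISO-8859-1 decoder, three of the bytes would turn into invisible control characters instead.

```python
link = "teståäöÅÄÖ"

# The bytes as they are stored in index.html (UTF-8 encoded).
raw = link.encode("utf-8")

# What a client sees when it assumes a Latin-1-like charset for the page:
# every UTF-8 byte becomes one character of its own.
shown = raw.decode("cp1252")
print(shown)

# Converting that misread text to UTF-8 again gives a double-encoded name,
# which no longer matches the bytes of the real file name.
double_encoded = shown.encode("utf-8")
print(double_encoded != raw)
```

The first print shows the same garbled name that Tony reported from his browsers.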
If you are in an ISO-8859-1 or similar environment (some Windows encodings are
very similar), the link would display correctly, but this is just a lucky
coincidence (the same goes for Wget here).
To have it looking correct everywhere, bmit.se/wget should either provide the
charset UTF-8 (in the response header or in a META tag) or should have
ISO-8859-1 characters in the page.