Re: [Bug-wget] Support non-ASCII URLs

From: Tim Rühsen
Subject: Re: [Bug-wget] Support non-ASCII URLs
Date: Sun, 20 Dec 2015 16:26:20 +0100

On Saturday, 19 December 2015, 14:11:20, Eli Zaretskii wrote:
> > Date: Sat, 19 Dec 2015 10:15:03 +0200
> > From: Eli Zaretskii <address@hidden>
> > Cc: address@hidden
> > 
> > > 2. contrib/check-hard fails with
> > > 
> > > FAIL: Test-iri-forced-remote
> > > 
> > > My son has birthday tomorrow, so I am not sure how much time I can spend
> > > on
> > > the weekend on this issue. Maybe Eli or you could have a look ?
> > 
> > I cannot bootstrap the Git repo (too many prerequisites I don't have).
> > Can you or someone else produce a distribution tarball out of Git that
> > I could then build "as usual"?
> > 
> > Also, can you show me the log of the failed test?  Turkish locales
> > have "an issue" with certain upper/lower-case characters, maybe that's
> > the problem.  Or maybe it's something else; looking at the log might
> > give good clues.
> Tim sent me the tarball and the log off-list (thanks!).  I didn't yet
> try to build Wget, but just looking at the test, I guess I don't
> understand its idea.  It has an index.html page that's encoded in
> ISO-8859-15, but Wget is invoked with --remote-encoding=iso-8859-1,
> and the URLs themselves in "my %urls" are all encoded in UTF-8.  How's
> this supposed to work?

According to the wget man page, --remote-encoding just sets the *default*
server encoding. This only comes into play when the HTTP header does not
contain a Content-Type with a charset *and* the HTML page does not contain a
<meta http-equiv="Content-Type"> tag with 'content=... charset=...'.
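
As a rough illustration of that fallback, here is a small Python sketch. It
is not Wget's code: the function name and the regular expressions are made up
for this mail, and checking the HTTP header before the meta tag is my
assumption (the rule above only says that both win over --remote-encoding).

```python
import re


def pick_remote_encoding(content_type_header, html_text, remote_encoding_default):
    """Return the charset to assume for URLs found in a fetched page.

    Hypothetical helper, not part of Wget. Assumption: the HTTP header is
    consulted before the <meta> tag.
    """
    charset_re = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

    # 1. charset from the HTTP Content-Type header, if there is one
    if content_type_header:
        m = charset_re.search(content_type_header)
        if m:
            return m.group(1).lower()

    # 2. charset from <meta http-equiv="Content-Type" content="...; charset=...">
    meta_re = re.compile(
        r"<meta[^>]*http-equiv\s*=\s*[\"']?content-type[\"']?[^>]*>",
        re.IGNORECASE,
    )
    meta = meta_re.search(html_text)
    if meta:
        m = charset_re.search(meta.group(0))
        if m:
            return m.group(1).lower()

    # 3. neither source names a charset: use the --remote-encoding default
    return remote_encoding_default
```

With a page that carries a charset=utf-8 meta tag, the iso-8859-1 default is
never consulted in this sketch; it is returned only for a page with neither a
header charset nor a meta charset.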

'index.html' in this test correctly has a meta tag with charset=utf-8,
and its URLs are encoded in UTF-8.

> Also, I'm not following the logic of overriding Content-type by the
> remote encoding: p1_fran%C3%A7ais.html states "charset=UTF-8", but
> includes a link encoded in ISO-8859-1, and the test seems to expect
> Wget to use the remote encoding in preference to what "charset=" says.

Either the test or the man page is wrong here. I would say the man page is
right, since the behavior it describes makes the most sense to me. In that
case the test is wrong, and so is its comment.

> Does the remote encoding override the encoding for the _contents_ of
> the URL, not just for the URL itself?  That seems to make little sense
> to me: the contents and the name can legitimately be encoded
> differently, I think.

The filenames in %expected_downloaded_files depend on --local-encoding.
Since this is not given on the command line, the test behaves differently
with different settings for LC_ALL ('make check' uses LC_ALL=C;
contrib/check-hard also runs 'make check' with a Turkish UTF-8 locale).
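
To make the dependency concrete, here is a minimal Python sketch (again not
Wget's code; on_disk_name is a made-up helper, and simply encoding the name
with the local charset is a simplification). The same name yields different
bytes, and different percent-escapes, depending on the encoding in effect.

```python
from urllib.parse import quote


def on_disk_name(name, local_encoding):
    """Bytes a file name would have under the given local encoding.

    Hypothetical helper for illustration only.
    """
    return name.encode(local_encoding)


name = "p1_français.html"

print(on_disk_name(name, "utf-8"))         # b'p1_fran\xc3\xa7ais.html'
print(on_disk_name(name, "iso-8859-1"))    # b'p1_fran\xe7ais.html'

# The percent-encoded URL form differs in the same way.
print(quote(name, encoding="utf-8"))       # p1_fran%C3%A7ais.html
print(quote(name, encoding="iso-8859-1"))  # p1_fran%E7ais.html
```

An expected-files list written for one of these encodings cannot match the
other, which is why the test needs the local encoding pinned.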

To fix the test, we should set --local-encoding to some kind of UTF-8 locale
(or something else, but then we have to fix the filenames accordingly).

Regards, Tim

