bug-wget

Re: [Bug-wget] IDN and IRI tests fail on MS-Windows with wget 1.16.1


From: Tim Rühsen
Subject: Re: [Bug-wget] IDN and IRI tests fail on MS-Windows with wget 1.16.1
Date: Sat, 27 Dec 2014 20:05:51 +0100
User-agent: KMail/4.14.2 (Linux/3.16.0-4-amd64; KDE/4.14.2; x86_64; ; )

On Saturday, 27 December 2014, 13:57:21, Tim Rühsen wrote:
> On Saturday, 27 December 2014, 10:39:25, Eli Zaretskii wrote:
> > > From: Tim Rühsen <address@hidden>
> > > Date: Thu, 25 Dec 2014 15:43:27 +0100
> > > 
> > > >      FAIL: Test-idn-headers.px
> > > >      FAIL: Test-idn-meta.px
> > > >    
> > > >    These use EUC_JP encoded file name, but do not state
> > > >    --local-encoding on the wget command line, so the non-ASCII
> > > >    characters get mangled by Windows (because Windows tries to convert
> > > >    non-Unicode non-ASCII strings to the current system codepage).
> > > >    Test-idn-* tests that do state --local-encoding do succeed.  Is it
> > > >    possible that the tests assume something about the local encoding,
> > > >    like that it's UTF-8?
> > > 
> > > Let's start with 'Test-idn-meta'.
> > > No non-ASCII filename will be written to disk, and the Content-type
> > > is stated correctly. --local-encoding sets the encoding used when
> > > reading a local file or the command line, so it shouldn't influence
> > > this test. And I can't reproduce the stated behavior.
> > > 
> > > Please send me the --debug output of this test with and without
> > > --local-encoding given.
> > 
> > The output is attached.  I collected that by redirecting the test
> > script's stderr to a file, I hope that's what you meant.
> > 
> > I noticed that the output says:
> >   converted 'http://<bunch of octal escapes>/' (CP1255) ->
> >   'http://<another bunch of octal escapes>/' (UTF-8)
> > 
> > So I tried to use --local-encoding=EUC-JP, and that made the test
> > succeed.  The third attachment below is from that successful run.
> 
> Thanks, Eli.
> 
> Your tests helped me to reproduce the problem:
> - install (and set) a non-UTF-8 and non-C/POSIX locale
> - use this locale for testing, e.g.:
>   TESTS_ENVIRONMENT="address@hidden" make check TESTS=Test-idn-meta
> 
> What I see in the logs is that Wget has a severe problem.
> When loading a saved (HTML) document, Wget parses it with the local encoding
> instead of the encoding stated by the server (or document). Of course this
> can't work, and this is why your third test works (it sets --local-encoding
> to the real encoding of the document).
> 
> After the 400 server response, Wget loads the document again, now with the
> correct encoding. But Wget 'remembers' some incorrect conversions from the
> first try and thus fails again.
> 
> 
> I would expect Wget to load the document with the correct encoding in the
> first place... but it looks like this 'double loading' was done on
> purpose.
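
To illustrate the mismatch described in the quoted text, here is a minimal Python sketch. It is not Wget code, and the host label is a made-up example rather than the one used by Test-idn-meta. It only shows that EUC-JP bytes decoded with an unrelated locale charset (CP1255, as in Eli's log) give a different string than the same bytes decoded with the charset the document actually declares, so any URL converted to UTF-8 from the first result is already wrong.

```python
# Illustrative sketch only: the host label is a hypothetical example,
# not the one used by Test-idn-meta, and none of this is Wget code.
host = "日本語"

# Bytes as an EUC-JP encoded document would carry them.
raw = host.encode("euc_jp")

# Decode with the wrong (locale) charset, here CP1255 as in Eli's log.
# errors="replace" avoids an exception on bytes CP1255 leaves undefined.
mangled = raw.decode("cp1255", errors="replace")

# Decode with the charset the document actually declares.
correct = raw.decode("euc_jp")

# ascii() keeps the output printable on any console encoding.
print("wrong charset:", ascii(mangled))
print("right charset:", ascii(correct))
print("wrong charset round-trips:", mangled == host)
print("right charset round-trips:", correct == host)
```

Passing the real document encoding as the local encoding corresponds to the second decode, which would explain why the run with --local-encoding=EUC-JP succeeds.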

After having a deeper look into the IRI/IDN design of Wget, I have to correct
myself. IMHO, Wget's IRI support seems to be deeply broken. I guess it needs a
redesign to fix it, and that exceeds the amount of time that I have.

Tim


