bug-wget
Re: [Bug-wget] [PATCH] improved Test-idn-robots.txt


From: Tim Rühsen
Subject: Re: [Bug-wget] [PATCH] improved Test-idn-robots.txt
Date: Wed, 09 Oct 2013 19:44:08 +0200
User-agent: KMail/4.10.5 (Linux/3.10-3-amd64; KDE/4.10.5; x86_64; ; )

Am Dienstag, 8. Oktober 2013, 15:07:51 schrieb Giuseppe Scrivano:
> Tim Rühsen <address@hidden> writes:
> > I added two links/urls to follow in index.html, now there are three in
> > total. All three links/urls point to the same host, but have different
> > host encodings (plain international text, punycoding, percent escaping).
> > 
> > Wget should recognize these three codings as being the same and thus I
> > removed the -H (host spanning) option to verify that.
> > 
> > Now, Wget fails this test, I guess it needs a fix.
> > 
> > Regards, Tim
> > 
> > From 2e6f527121497b3b148496a9a9c774451d2e0017 Mon Sep 17 00:00:00 2001
> > From: Tim Ruehsen <address@hidden>
> > Date: Mon, 7 Oct 2013 23:37:42 +0200
> > Subject: [PATCH] improved Test-idn-robots.px
> > 
> > ---
> > 
> >  tests/ChangeLog          |  5 +++++
> >  tests/Test-idn-robots.px | 27 ++++++++++++++++++++++++++-
> >  2 files changed, 31 insertions(+), 1 deletion(-)
> 
> thanks for your test.  The IRI support is a bit of a mess and I am not
> sure how this issue should be fixed:
> 
> Should we check if the two domains are the same in recur.c (somewhere
> near line 633)?  It means that  we will need to check there for
> different encodings and convert among them.  Another solution would be
> that append_url stores the url in a specific format.
> 
> Probably the latter solution allows us to also deal with page specific
> locales when it is specified.
> 
> Have you already looked into this issue?  Do you have any
> idea/suggestion?

I already solved this issue in Mget, an experimental tool where I put the 
URI/IRI parser into a library. I can offer to contribute code from that 
source to Wget/FSF. Maybe you could take a look and see what fits Wget (since 
Mget does the same as Wget, it should fit).

The code for mget_iri_parse() is in
        https://github.com/rockdaboot/mget/blob/master/libmget/iri.c

Mget 'normalizes' all URIs/IRIs by
- decoding percent encoding
- converting to UTF-8
- parsing into host/path/query etc.
- encoding the host to its ASCII form with toASCII() (libidn2+libunistring or
  libidn) via mget_str_to_ascii(iri->host)

From then on, this ASCII form is used as the host name for directories, DNS, 
HTTP, comparisons etc.
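
To illustrate the normalization steps listed above, here is a minimal Python
sketch. It is not Mget's code: normalize_host() and the example host
bücher.example are hypothetical, the URL is parsed before the host is
percent-decoded (a slightly different order from the list), and Python's
built-in "idna" codec (IDNA 2003) stands in for libidn2/libidn.

```python
from urllib.parse import urlsplit, unquote


def normalize_host(url: str) -> str:
    """Return the ASCII (punycode) form of the host part of `url`.

    Hypothetical helper for illustration only; it is not part of Wget or Mget.
    """
    # Parse the URL and pick out the host (lowercased by urlsplit).
    host = urlsplit(url).hostname or ""
    # Undo percent escaping and interpret the resulting bytes as UTF-8.
    host = unquote(host, encoding="utf-8")
    # toASCII step: non-ASCII labels become "xn--..." labels,
    # plain ASCII labels pass through unchanged.
    return host.encode("idna").decode("ascii").lower()


# Three spellings of the same (hypothetical) host, as in the test case:
# plain international text, punycode, and percent escaping.
urls = [
    "http://bücher.example/robots.txt",
    "http://xn--bcher-kva.example/robots.txt",
    "http://b%C3%BCcher.example/robots.txt",
]
distinct_hosts = {normalize_host(u) for u in urls}
```

If the three spellings are recognized as the same host, distinct_hosts holds
a single entry, so a comparison on the normalized form would not need -H.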

If I can give you a helping hand, contact me.

Regards, Tim


