From: Ángel González
Subject: Re: [Bug-wget] libpsl design [was: Re: Overly permissive hostname matching]
Date: Fri, 21 Mar 2014 21:54:29 +0100
User-agent: Thunderbird

On 21/03/14 21:13, Daniel Kahn Gillmor wrote:
> i've just pushed some cleanup suggestions here:
>
>    https://github.com/rockdaboot/libpsl/pull/1
>
> i see you've pulled them already, thanks!
>
> i've got three more conceptual issues which warrant discussion, rather
> than a patch, though.  If there's a better place to have this discussion
> than this mailing list, i'm happy to move to it, please let me know where.
>
> psl_is_tld() semantics
> ----------------------
>
> the way i see it, we know what it means for psl_is_tld() to return
> "true" -- but "false" could mean either:
>
> (A) "this zone is subordinate to a TLD" (as example.com is to com)
>     or
> (B) "this zone is superior to a TLD" (as uk is to co.uk).  Note that
> "uk" is not a public suffix.

Hmm, actually uk is a public suffix: since it does not match anything explicitly
in the list, it is caught by the implicit last-resort rule '*'.

Also, what would you do with a domain such as his.name?
It is both inferior to a public suffix (.name) and superior to one (forgot.his.name).


I think it should have a different return code, though.
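
A minimal sketch of what distinct return codes could look like. This is not libpsl's API: RULES, Relation, is_public_suffix() and classify() are hypothetical names, the rule set is a toy one built from the examples in this thread, and wildcard and exception rules are left out. Under the implicit '*' rule every single-label name is a public suffix, so a name that is not itself a suffix always sits below one; what remains to distinguish is whether it also sits above one.

```python
from enum import Enum

# Toy rule set taken from the examples in this thread (an assumption, not the
# real effective_tld_names.dat). Wildcard ("*.foo") and exception ("!bar.foo")
# rules are deliberately not handled in this sketch.
RULES = {"com", "co.uk", "name", "forgot.his.name"}


class Relation(Enum):
    SUFFIX = 1           # the name itself is a public suffix
    BELOW = 2            # only subordinate to a public suffix
    BELOW_AND_ABOVE = 3  # subordinate to one suffix and superior to another


def is_public_suffix(domain: str) -> bool:
    # Explicit rule, or the implicit '*' rule that matches any single label.
    return domain in RULES or "." not in domain


def classify(domain: str) -> Relation:
    # Expects a lowercase ASCII name without a trailing dot.
    if is_public_suffix(domain):
        return Relation.SUFFIX
    # Does some explicit rule end in ".<domain>" (e.g. forgot.his.name)?
    above = any(rule.endswith("." + domain) for rule in RULES)
    return Relation.BELOW_AND_ABOVE if above else Relation.BELOW
```

With this toy list, classify("example.com") gives Relation.BELOW, classify("uk") gives Relation.SUFFIX, and classify("his.name") gives Relation.BELOW_AND_ABOVE.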



> IDNA
> ----
>
> I hate to bring this up, because it's a nightmare and i have no good
> answers, but what does this library expect to do about non-ASCII domain
> names?  effective_tld_names.dat contains the limits in unicode, encoded
> as UTF-8, e.g.:
>
> // xn--mgba3a4f16a.ir (<iran>.ir, Persian YEH)
> ایران.ir
>
> should we assume that the input from the user is in a similar form?  do
> we care about locale issues?  what about unicode canonicalization?  what
> if the incoming data is in punycode (the xn--* ascii form) already?
>
> the GNU folks have done the ugly ugly work for us if we're willing to
> link to lgpl'ed libraries:
>
>    https://www.gnu.org/software/libidn/

I would expect the input in punycode, and optionally in UTF-8. This means
a preprocessing step is needed to convert the original list to punycode.
If we are handed an i18n domain, we punycode it with libidn if we are linked
against it, and return an error otherwise.

An application doing this check will presumably already need to deal with
i18n domain names, so I suppose it is able to get the punycode for things like
querying the DNS. And if it can't punycode the name, it doesn't matter so
much that this doesn't work for it ;)

It is disgusting to do a UTF-8 -> punycode -> UTF-8 round trip just to extract
the base domain, though.
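
A rough sketch of the "punycode it or return an error" step, written in Python rather than against libidn. It uses Python's built-in "idna" codec (IDNA 2003) as a stand-in for the libidn conversion discussed above. The function name to_ascii() is hypothetical, and nothing here reflects how libpsl actually handles IDNA.

```python
def to_ascii(domain: str) -> str:
    """Return the lowercase punycode (xn--) form of a domain name.

    Input that is already ASCII (including xn-- labels) passes through
    unchanged apart from lowercasing. Raises ValueError if the name
    cannot be converted.
    """
    try:
        # The codec converts label by label and rejects empty or
        # over-long labels with a UnicodeError.
        return domain.encode("idna").decode("ascii").lower()
    except UnicodeError as exc:
        raise ValueError(f"cannot convert {domain!r} to punycode") from exc
```

For example, to_ascii("münchen.de") returns "xn--mnchen-3ya.de", and a name that is already in that form comes back as is, so the lookup against a punycoded list only ever sees one representation.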

> malformed inputs
> ----------------
>
> What should the library do with malformed inputs?  i'm thinking about
> super-long strings, strings starting with more than one dot, or with
> multiple dots adjacent to each other, strings that don't match whatever
> encoding we're expecting users to send, etc.
>
>         --dkg

Return an error.
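
A minimal sketch of such a check, covering the cases listed above. check_domain() and its error strings are hypothetical, the input is assumed to be in punycode already, and the limits are the usual DNS ones (63 bytes per label, 253 characters for the textual name), not anything libpsl is known to enforce.

```python
from typing import Optional

MAX_NAME = 253   # commonly cited limit for a textual DNS name
MAX_LABEL = 63   # label length limit from RFC 1035


def check_domain(domain: str) -> Optional[str]:
    """Return None if the name looks well formed, else a short reason."""
    if not domain:
        return "empty name"
    if not domain.isascii():
        return "not ASCII (punycode expected)"
    if len(domain) > MAX_NAME:
        return "name too long"
    labels = domain.split(".")
    if labels[-1] == "":
        labels.pop()  # tolerate one trailing dot (fully qualified name)
    for label in labels:
        if label == "":
            return "empty label (leading or adjacent dots)"
        if len(label) > MAX_LABEL:
            return "label too long"
    return None
```

A caller would reject the name whenever the result is not None, for instance for "..example.com", "a..b" or a 300-character string.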



