[Bug-wget] libpsl design [was: Re: Overly permissive hostname matching]


From: Daniel Kahn Gillmor
Date: Fri, 21 Mar 2014 16:13:43 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Icedove/24.3.0

On 03/21/2014 05:03 AM, Tim Ruehsen wrote:
> Maybe you could just open issues (or even better, fork the repo, make your 
> changes and create pull requests). 

i've just pushed some cleanup suggestions here:

  https://github.com/rockdaboot/libpsl/pull/1

i see you've pulled them already, thanks!

i've got three more conceptual issues which warrant discussion, rather
than a patch, though.  If there's a better place to have this discussion
than this mailing list, i'm happy to move to it, please let me know where.

psl_is_tld() semantics
----------------------

the way i see it, we know what it means for psl_is_tld() to return
"true" -- but "false" could mean either:

(A) "this zone is subordinate to a TLD" (as example.com is to com)
   or
(B) "this zone is superior to a TLD" (as uk is to co.uk).  Note that
"uk" is not a public suffix.

libpsl in its current state appears to assume that psl_is_tld("uk")
returns "true" even though "uk" is not a TLD, and is not a public suffix,
and does not meet Ángel's "one domain under which anyone* can register a
subdomain" definition.

perhaps if we invert the sense of the current test it will match more
cleanly.  what about:

psl_is_private(char* d)

so:

 psl_is_private("uk") → false
 psl_is_private("example.com") → true
 psl_is_private("www.example.com") → true
 psl_is_private("a.b.c.example.com") → true
 psl_is_private(".") → false
 psl_is_private("com") → false
 psl_is_private("co.ar") → false
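
A minimal sketch of how the semantics listed above could be checked,
assuming a tiny hard-coded suffix table in place of the real list.  The
names toy_suffixes, toy_is_public_suffix() and toy_is_private() are
hypothetical and are not part of libpsl; wildcard and exception rules,
case folding and IDNA are all ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the public suffix list (assumption: no wildcard or
 * exception rules, ASCII lower case only). */
static const char *const toy_suffixes[] = { "com", "net", "co.uk", "co.ar" };

static bool toy_is_public_suffix(const char *d)
{
    for (size_t i = 0; i < sizeof toy_suffixes / sizeof toy_suffixes[0]; i++)
        if (strcmp(d, toy_suffixes[i]) == 0)
            return true;
    return false;
}

/* Hypothetical psl_is_private(): true only if d lies strictly below a
 * public suffix.  Public suffixes themselves ("com", "co.ar") and zones
 * above them ("uk", ".") are not private. */
bool toy_is_private(const char *d)
{
    assert(d != NULL);
    if (toy_is_public_suffix(d))
        return false;
    /* Try every shorter suffix of d that starts at a label boundary. */
    for (const char *dot = strchr(d, '.'); dot != NULL;
         dot = strchr(dot + 1, '.'))
        if (toy_is_public_suffix(dot + 1))
            return true;
    return false;
}
```

With this table, "uk" falls through both checks and comes out as not
private, which is the case (B) behaviour asked for above.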


the other API that might be relevant would be something like
psl_get_private_zone(char* d), which would return the shortest private
zone that contains d.  so:

 psl_get_private_zone("www.example.com") → "example.com"
 psl_get_private_zone("example.co.uk") → "example.co.uk"
 psl_get_private_zone("a.b.c.d.example.net") → "example.net"
 psl_get_private_zone("com") → ERROR
 psl_get_private_zone("uk") → ERROR

(this is the API supplied by regdom-libs, i think)
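
A rough sketch of the lookup described above, again assuming a toy
suffix table rather than the real list.  zone_suffixes,
zone_is_public_suffix() and toy_get_private_zone() are hypothetical
names, and returning NULL for the ERROR case is only one possible
convention.  The result points into the caller's string, so nothing is
allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the public suffix list (assumption: no wildcard or
 * exception rules, ASCII lower case only). */
static const char *const zone_suffixes[] = { "com", "net", "co.uk", "co.ar" };

static bool zone_is_public_suffix(const char *d)
{
    for (size_t i = 0; i < sizeof zone_suffixes / sizeof zone_suffixes[0]; i++)
        if (strcmp(d, zone_suffixes[i]) == 0)
            return true;
    return false;
}

/* Hypothetical psl_get_private_zone(): walk the suffixes of d from
 * longest to shortest; the first one that is a public suffix is the
 * longest match, and the suffix one label longer is the answer.
 * Returns NULL if d is a public suffix or is not below any of them. */
const char *toy_get_private_zone(const char *d)
{
    assert(d != NULL);
    const char *prev = NULL;  /* suffix one label longer than p */
    for (const char *p = d; p != NULL; ) {
        if (zone_is_public_suffix(p))
            return prev;      /* NULL when d itself is a public suffix */
        prev = p;
        const char *dot = strchr(p, '.');
        p = dot ? dot + 1 : NULL;
    }
    return NULL;              /* no public suffix matched at all */
}
```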

I chose the term "private" in contrast with the "public" from "public
suffix list" -- if folks have a better word to use, i'm happy to swap
something else in.  regdom-libs uses the term "registered", which i
think means "placed in the public registry", which is intelligible to
me, but maybe only because i've thought about this problem way more than
anyone should have to.  i don't know how much sense it would make to
users of the library.

IDNA
----

I hate to bring this up, because it's a nightmare and i have no good
answers, but what does this library expect to do about non-ASCII domain
names?  effective_tld_names.dat contains the limits in unicode, encoded
as UTF-8, e.g.:

// xn--mgba3a4f16a.ir (<iran>.ir, Persian YEH)
ایران.ir

should we assume that the input from the user is in a similar form?  do
we care about locale issues?  what about unicode canonicalization?  what
if the incoming data is in punycode (the xn--* ascii form) already?

the GNU folks have done the ugly ugly work for us if we're willing to
link to lgpl'ed libraries:

  https://www.gnu.org/software/libidn/
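
Whatever conversion library ends up being used, a first step could be
to work out which form the caller handed in.  The sketch below is only
an illustration using the standard C library: enum name_form and
classify_name() are hypothetical names, it does not validate or convert
anything, and it treats any byte >= 0x80 as "not ASCII" without
checking that the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum name_form {
    NAME_ASCII,      /* plain ASCII, no xn-- labels */
    NAME_ACE,        /* ASCII with at least one xn-- (punycode) label */
    NAME_NON_ASCII   /* contains bytes >= 0x80, e.g. UTF-8 */
};

/* Case-insensitive test for the "xn--" prefix; the && chain stops at
 * the terminating NUL, so it never reads past the end of the string. */
static bool label_has_ace_prefix(const char *label)
{
    return (label[0] == 'x' || label[0] == 'X')
        && (label[1] == 'n' || label[1] == 'N')
        && label[2] == '-' && label[3] == '-';
}

enum name_form classify_name(const char *d)
{
    assert(d != NULL);
    for (const char *p = d; *p != '\0'; p++)
        if ((unsigned char)*p >= 0x80)
            return NAME_NON_ASCII;

    bool ace = false;
    for (const char *label = d; label != NULL; ) {
        if (label_has_ace_prefix(label))
            ace = true;
        const char *dot = strchr(label, '.');
        label = dot ? dot + 1 : NULL;
    }
    return ace ? NAME_ACE : NAME_ASCII;
}
```

A caller could then pass NAME_NON_ASCII input through an IDNA
conversion and compare everything in one agreed form.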


malformed inputs
----------------

What should the library do with malformed inputs?  i'm thinking about
super-long strings, strings starting with more than one dot, or with
multiple dots adjacent to each other, strings that don't match whatever
encoding we're expecting users to send, etc.
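
One possible answer, sketched under assumptions that are not settled
anywhere in this thread: reject such inputs up front and tell the
caller why.  enum check_result and check_domain() are hypothetical
names; the 253 and 63 byte limits are borrowed from the usual DNS
limits for a textual name and a single label; ASCII-only input is
assumed; and "." and a single trailing dot are accepted purely as a
policy choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN  253  /* assumed limit for a whole textual name */
#define MAX_LABEL_LEN 63   /* assumed limit for a single label */

enum check_result {
    CHECK_OK,
    CHECK_EMPTY,
    CHECK_TOO_LONG,
    CHECK_EMPTY_LABEL,     /* leading dot or adjacent dots */
    CHECK_LABEL_TOO_LONG,
    CHECK_NOT_ASCII
};

enum check_result check_domain(const char *d)
{
    assert(d != NULL);
    if (d[0] == '\0')
        return CHECK_EMPTY;
    if (strcmp(d, ".") == 0)
        return CHECK_OK;   /* the root zone, accepted as a special case */

    size_t label_len = 0;
    for (size_t i = 0; d[i] != '\0'; i++) {
        /* Bail out early so a super-long string is never fully scanned. */
        if (i >= MAX_NAME_LEN)
            return CHECK_TOO_LONG;
        unsigned char c = (unsigned char)d[i];
        if (c >= 0x80)
            return CHECK_NOT_ASCII;
        if (c == '.') {
            if (label_len == 0)
                return CHECK_EMPTY_LABEL;
            label_len = 0;
        } else if (++label_len > MAX_LABEL_LEN) {
            return CHECK_LABEL_TOO_LONG;
        }
    }
    return CHECK_OK;
}
```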

        --dkg
