bug-wget
Re: [Bug-wget] libpsl design


From: Ángel González
Subject: Re: [Bug-wget] libpsl design
Date: Fri, 21 Mar 2014 23:22:48 +0100
User-agent: Thunderbird

On 21/03/14 22:24, Daniel Kahn Gillmor wrote:
On 03/21/2014 04:54 PM, Ángel González wrote:
On 21/03/14 21:13, Daniel Kahn Gillmor wrote:
i've just pushed some cleanup suggestions here:

    https://github.com/rockdaboot/libpsl/pull/1

i see you've pulled them already, thanks!

i've got three more conceptual issues which warrant discussion, rather
than a patch, though.  If there's a better place to have this discussion
than this mailing list, i'm happy to move to it, please let me know
where.

psl_is_tld() semantics
----------------------

the way i see it, we know what it means for psl_is_tld() to return
"true" -- but "false" could mean either:

(A) "this zone is subordinate to a TLD" (as example.com is to com)
     or
(B) "this zone is superior to a TLD" (as uk is to co.uk).  Note that
"uk" is not a public suffix.
Hmm, actually uk is a public suffix: since it does not match anything
explicitly in the list, it is caught by the implicit last-resort
rule '*'.

Also, what would you do with a domain such as his.name?
It is both inferior to a public suffix (.name) and superior
(forgot.his.name).
hm, the same problem is present for amazonaws.com; it is superior to
s3.amazonaws.com (and 32 other public suffixes), and subordinate to .com

I think it should have a different return code, though.
can you propose a specific API?  the devil is in the details.
I think I will code something this weekend. Maybe just the API, maybe also an implementation.
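
To make the ambiguity concrete, here is a rough sketch of what a richer return value could distinguish. It is written in Python for brevity and is not a proposed libpsl API: the names RULES, IS_SUFFIX, BELOW_SUFFIX, ABOVE_SUFFIX and classify() are made up for illustration, the rule set is a toy subset, and wildcard ('*.foo') and exception ('!foo') rules are ignored.

```python
# Toy rule set for illustration only; the real list is far larger.
RULES = {"com", "name", "co.uk", "forgot.his.name", "s3.amazonaws.com"}

# Hypothetical result bits; they can be combined.
IS_SUFFIX = 1      # the name itself is a public suffix
BELOW_SUFFIX = 2   # some proper parent of the name is a public suffix
ABOVE_SUFFIX = 4   # some listed rule lies strictly below the name


def is_public_suffix(name):
    # Either an explicit rule, or a single label caught by the implicit '*' rule.
    return name in RULES or "." not in name


def classify(name):
    name = name.lower().rstrip(".")
    labels = name.split(".")
    result = 0
    if is_public_suffix(name):
        result |= IS_SUFFIX
    # Proper parents are obtained by dropping one or more leading labels.
    if any(is_public_suffix(".".join(labels[i:])) for i in range(1, len(labels))):
        result |= BELOW_SUFFIX
    if any(rule.endswith("." + name) for rule in RULES):
        result |= ABOVE_SUFFIX
    return result
```

With this toy data, "example.com" yields only BELOW_SUFFIX, "uk" yields IS_SUFFIX and ABOVE_SUFFIX, and both "his.name" and "amazonaws.com" yield BELOW_SUFFIX and ABOVE_SUFFIX together, which is exactly the case a plain true/false cannot express.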


    https://www.gnu.org/software/libidn/
I would expect the input in punycode and optionally in utf-8. This means
a preprocessing step from the original list is needed.
This implies that people wouldn't be able to use effective_tld_names.dat
as distributed, right?  I can see this working for OS-level
distributions (I can preprocess effective_tld_names.dat when
distributing it in publicsuffix for debian), but for regular users it
sounds terrible.
Right. Using utf-8 makes sense for viewing the domain labels (when you understand those glyphs!) but it's terrible for programs. In fact the first thing Mozilla does with the list is to transform it into
a C structure… in punycode.
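
As an illustration of the preprocessing step, a minimal sketch in Python follows. It assumes the list format where each rule is the text up to the first whitespace and comment lines start with "//". The function names are hypothetical, and the conversion relies on Python's built-in "idna" codec (IDNA 2003), which may differ in corner cases from what libidn would produce.

```python
def rule_to_punycode(rule):
    """Convert one rule to its ASCII (punycode) form, keeping '!' and '*' intact."""
    prefix = ""
    if rule.startswith("!"):
        prefix, rule = "!", rule[1:]
    labels = [
        # Wildcard labels are copied through; everything else is converted per label.
        label if label == "*" else label.encode("idna").decode("ascii")
        for label in rule.split(".")
    ]
    return prefix + ".".join(labels)


def preprocess(lines):
    """Yield punycode rules, skipping blank lines and // comments."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("//"):
            continue
        yield rule_to_punycode(fields[0])
```

For example, feeding it the made-up sample lines "// comment", "com", "*.ck", "!www.ck" and "münchen.de" yields "com", "*.ck", "!www.ck" and "xn--mnchen-3ya.de".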

Another option is to cheat and abuse the fact that we will have the punycode
equivalent in a preceding comment.

If we are handed an i18n domain, punycode it with libidn if we are
linked to it, else return an error.
How do you propose we determine that we're handed an i18n domain if
we're not linked to libidn?  just check for any byte other than
printable ascii?
If they are not rfc1035, consider them an i18n domain. It's just a difference between rejection codes, so I don't think it matters much if we treat as an i18n domain something
that is not one (e.g. because it contains a prohibited codepoint).
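
A rough sketch of that decision, in Python for brevity. The label check follows the RFC 1035 preferred syntax as relaxed by RFC 1123 (a leading digit is allowed). The have_idn flag stands in for "linked to libidn", the result strings stand in for return codes, and Python's "idna" codec stands in for the library. All names here are hypothetical.

```python
import re

# One hostname label: letters, digits and hyphens, no leading or trailing
# hyphen, at most 63 characters.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def looks_like_plain_dns(domain):
    name = domain[:-1] if domain.endswith(".") else domain
    return 0 < len(name) <= 253 and all(
        _LABEL.fullmatch(label) for label in name.split(".")
    )


def normalize(domain, have_idn):
    """Return (status, ascii_name); status is 'ok', 'needs_idn' or 'invalid'."""
    if looks_like_plain_dns(domain):
        return ("ok", domain.lower())
    # Anything else is treated as an i18n domain, even if it is merely malformed.
    if not have_idn:
        return ("needs_idn", None)
    try:
        return ("ok", domain.encode("idna").decode("ascii"))
    except UnicodeError:
        # Whatever the converter rejects, e.g. an empty label.
        return ("invalid", None)
```

Note that without IDN support a name such as "a_b.example" is reported as "needs_idn" rather than as malformed. That is the harmless misclassification described above: the name is rejected either way, only the code differs.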


should we do the same thing for psl_load_file() ?
Rejecting the file if it has i18n domains and we're not linked to libidn? No.


If we implement something like psl_get_private_zone(), what form should
the returned name be?
The same provided by the user unless overridden with a flag?
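
One way this could work, sketched in Python under stated assumptions: match against the rules in ASCII form, but cut the answer out of the labels the caller supplied unless a flag asks for the ASCII form. The function names, the as_ascii flag and the toy rule set are all hypothetical. Wildcard and exception rules are ignored, and Python's "idna" codec stands in for libidn.

```python
RULES = {"com", "co.uk", "de"}  # toy ASCII rule set, illustration only


def suffix_label_count(ascii_labels):
    # The longest explicit match wins; otherwise the implicit '*' rule
    # makes the last label the public suffix.
    for i in range(len(ascii_labels)):
        if ".".join(ascii_labels[i:]) in RULES:
            return len(ascii_labels) - i
    return 1


def private_zone(domain, as_ascii=False):
    labels = domain.split(".")
    ascii_labels = [
        label.encode("idna").decode("ascii").lower() for label in labels
    ]
    keep = suffix_label_count(ascii_labels) + 1  # suffix plus one more label
    if keep > len(labels):
        return None  # the input is itself a public suffix
    chosen = ascii_labels if as_ascii else labels
    return ".".join(chosen[-keep:])
```

Here private_zone("www.bücher.de") gives "bücher.de", while passing as_ascii=True gives "xn--bcher-kva.de"; the caller's spelling, including case, is preserved by default.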

Regards



