UTF-8/non-ASCII chars in keys (was Re: [Sks-devel] 1.0.8 patches)


From: Jason Harris
Subject: UTF-8/non-ASCII chars in keys (was Re: [Sks-devel] 1.0.8 patches)
Date: Tue, 19 Oct 2004 18:10:30 -0400
User-agent: Mutt/1.4.2.1i

On Tue, Oct 19, 2004 at 11:33:41AM -0400, Jason Harris wrote:

> This seems to work on pks servers whether they send UTF-8 or not.
> For Noèl Koethe's keys, I can use ALT-h to generate è and get
> back both 0x307D56ED and 0x0986B74D on keyserver.kjsl.com:11371.
> This also works from the iso-8859-1 (assumed) search pages at
> stinkfoot.org (using elinks and lynx, anyway), which returns UTF-8
> results, and at dtype.org, which returns iso-8859-1 (assumed) results.
> 
> On noreply.org, I only get 0x307D56ED, however.  The links:
> 
>   http://keyserver.noreply.org/pks/lookup?search=no%C3%A8l+koethe&fingerprint=on&op=index
>   http://keyserver.kjsl.com:11371/pks/lookup?search=no%C3%A8l+koethe&fingerprint=on&op=index

[self-reply]

Actually, that happened only by luck on pks.  pks uses isalnum(3) in
kd_add_userid_to_wordlist() to tokenize userid strings, which treats
non-ASCII bytes as word separators (see the word list below).  Splitting
on ispunct(3) and isspace(3) instead would seem a better choice in the
presence of non-ASCII characters.

For key 0x0986B74D, No\xe8\x6c K\xf6\x74he <noel koethe.net>, or
Noèl Köthe, pks currently stores the following "words"
from the userid:

 koethe
 net
 no
 noel
 the

With the following changes to kd_add_userid_to_wordlist() in kd_generic.c:

    while (end < userid+userid_len) {
       /* find beginning of word */
       start = end;
-      while ((start < userid+userid_len) && !isalnum(*start))
+      while ((start < userid+userid_len) &&
+            (ispunct(*start) || isspace (*start)))
         start++;
 
       /* find end of word */
       end = start;
-      while ((end < userid+userid_len) && isalnum(*end))
+      while ((end < userid+userid_len) &&
+            (!ispunct(*end) && !isspace (*end)))
         end++;
 
       /* store it if it's > 1 char */

pks stores the following (actual) words from the userid (printed using
hex escapes):

 koethe
 k\f6the
 net
 noel
 no\e8l
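
To make the difference between the two tokenizing rules easier to see,
here is a minimal standalone sketch.  It is not the pks source: the
struct wordlist container, split_userid(), in_word_old() and
in_word_new() are hypothetical names, the lowercasing is an assumption
based on the stored words above, and the (unsigned char) casts are added
here because passing a negative char to the <ctype.h> functions is
undefined.  It also assumes the default "C" locale, where isalnum()
rejects bytes such as 0xe8.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS 16
#define MAX_WORD_LEN 64

/* Hypothetical container for this sketch; pks writes to its word db. */
struct wordlist {
    size_t count;
    char words[MAX_WORDS][MAX_WORD_LEN];
};

/* Old rule: a byte is part of a word only if isalnum() accepts it. */
static int in_word_old(unsigned char c) { return isalnum(c); }

/* Patched rule: part of a word unless punctuation or whitespace. */
static int in_word_new(unsigned char c) { return !ispunct(c) && !isspace(c); }

static void split_userid(const char *userid, size_t userid_len,
                         int (*in_word)(unsigned char), struct wordlist *out)
{
    assert(userid != NULL && out != NULL);

    const char *limit = userid + userid_len;
    const char *start, *end = userid;

    out->count = 0;
    while (end < limit) {
        /* find beginning of word */
        start = end;
        while (start < limit && !in_word((unsigned char)*start))
            start++;

        /* find end of word */
        end = start;
        while (end < limit && in_word((unsigned char)*end))
            end++;

        /* store it, lowercased, if it's > 1 char and fits */
        size_t len = (size_t)(end - start);
        if (len > 1 && len < MAX_WORD_LEN && out->count < MAX_WORDS) {
            char *dst = out->words[out->count++];
            for (size_t i = 0; i < len; i++)
                dst[i] = (char)tolower((unsigned char)start[i]);
            dst[len] = '\0';
        }
    }
}
```

Calling split_userid() on the bytes "No\xe8l K\xf6the <noel koethe.net>"
with in_word_old should yield the words in order of appearance (no, the,
noel, koethe, net), and with in_word_new the raw 8-bit words are kept
whole (no\xe8l, k\xf6the, noel, koethe, net).  Note that the patched
rule also keeps control bytes inside words, since they are neither
punctuation nor whitespace.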

This seems fine, but elinks (using the ISO 8859-1 charset) and lynx send
query strings of:

  http://localhost:11371/pks/lookup?op=index&search=no%C3%A8l&fingerprint=on

instead of:

  http://localhost:11371/pks/lookup?op=index&search=no%e8l&fingerprint=on

which is needed to find no\e8l on 0x0986B74D.  The first query string
does return 0x307D56ED, another of Noèl's keys, however, since the userid
on that key is itself encoded in UTF-8.
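
The mismatch is easier to see at the byte level.  The following is a
rough sketch of percent-decoding a search term, not the decoder pks
actually uses; percent_decode() and hex_value() are hypothetical names,
and treating '+' as a space is an assumption based on the "no%C3%A8l+koethe"
style of query.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Decode %XX escapes (and '+' as a space) from src into dst.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the number of decoded bytes, not counting the trailing NUL.
 */
static size_t percent_decode(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        int hi, lo;

        if (src[0] == '%' &&
            (hi = hex_value((unsigned char)src[1])) >= 0 &&
            (lo = hex_value((unsigned char)src[2])) >= 0) {
            dst[n++] = (char)(hi * 16 + lo);   /* one raw byte per escape */
            src += 3;
        } else {
            dst[n++] = (*src == '+') ? ' ' : *src;
            src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Under these assumptions, "no%C3%A8l" decodes to the five bytes
6e 6f c3 a8 6c, while "no%e8l" decodes to the four bytes 6e 6f e8 6c.
A byte-for-byte lookup in the word db can therefore only match the
stored word no\e8l with the second form.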

Therefore, the above patch seems to tokenize older raw 8-bit userids as
well as UTF-8 userids properly, and to store them in raw (undecoded) form
in worddb.  elinks and lynx, at least, send UTF-8 query strings, which
match newer keys whose userids are encoded in UTF-8.  Older keys can still
be found by sending the old single-byte hex codes (e.g., %e8), when
necessary.

NB:  For full effect, anyone using this patch to more fully support UTF-8
needs to make a keydump and rebuild their pks database(es) from scratch.

I imagine a similar fix is necessary for SKS.

-- 
Jason Harris           |  NIC:  JH329, PGP:  This _is_ PGP-signed, isn't it?
address@hidden _|_ web:  http://keyserver.kjsl.com/~jharris/
          Got photons?   (TM), (C) 2004

Attachment: pgpGF_M7t_C6n.pgp
Description: PGP signature

