pika-dev
Re: [Pika-dev] String hashing in hackerlab?


From: Tom Lord
Subject: Re: [Pika-dev] String hashing in hackerlab?
Date: Mon, 7 Jun 2004 09:52:50 -0700 (PDT)

    > From: Andreas Rottmann <address@hidden>

    > I think it would make sense to move the hash_symbol_name() code to
    > hackerlab proper, calling it something like ustr_hash() and habitate
    > it in either ustr.[hc] or a files ustr-hash.[hc]. 

The latter, I think, but yes.  (But hang on a couple more days.)

    > (BTW: Tom, how's the state of udstr?)

Pretty good.  

The interface wound up being sets of functions like 


  udstr_cv_substr               udstr_cp_substr
  udstr_cv_substr_x             udstr_cp_substr_x
  udstr_cv_substr_fw            udstr_cp_substr_fw
  udstr_cv_substr_fw_x          udstr_cp_substr_fw_x

    The "_cv_" versions take string indexes expressed in
    coding value units.

    The "_cp_" versions take string indexes expressed as
    codepoints.

    The "_x" variants modify their first string argument while
    the non-"_x" variants allocate a new string for the result.
    Thus "substr(a, from, to)" creates a new string but 
    "substr_x(a, from, to)" modifies "a" to contain nothing
    but the indicated substring.

    The "_fw" variants ensure that the result is wide enough
    so that each codepoint it contains fits in a single
    coding value, but no wider.
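
To make the two units concrete, here is a minimal sketch for UTF-8
text.  None of these helpers are hackerlab functions; the names are
hypothetical.  The width rule in min_fixed_width is also an
assumption: it supposes an 8-bit coding value holds codepoints up to
U+00FF and a 16-bit one holds codepoints up to U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: length in coding values.  For UTF-8 a coding
   value is one byte, so this is just the byte count. */
static size_t
utf8_cv_length (const char *s)
{
  return strlen (s);
}

/* Hypothetical helper: length in codepoints.  Every byte that is not
   a continuation byte (10xxxxxx) starts a new codepoint.  Assumes
   well-formed, NUL-terminated UTF-8. */
static size_t
utf8_cp_length (const char *s)
{
  size_t n = 0;

  for (; *s; ++s)
    if (((unsigned char) *s & 0xC0) != 0x80)
      ++n;
  return n;
}

/* Hypothetical helper: the narrowest fixed width, in bytes per coding
   value (1, 2 or 4), at which every codepoint fits in one coding
   value.  This is the property the "_fw" variants are described as
   guaranteeing; the thresholds are an assumption. */
static int
min_fixed_width (const uint32_t *cps, size_t n)
{
  int width = 1;
  size_t i;

  for (i = 0; i < n; ++i)
    {
      if (cps[i] > 0xFFFF)
        return 4;
      if (cps[i] > 0xFF)
        width = 2;
    }
  return width;
}
```

For the UTF-8 bytes of "a", U+00E9 and U+20AC, utf8_cv_length
reports 6 while utf8_cp_length reports 3, which is why a "_cv_" index
and a "_cp_" index into the same string generally differ.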

In those functions, there are many integer parameters measured in
coding values (like string length or substr offsets in _cv_ functions)
and many integer parameters measured in code points (like in _cp_
functions).

I decided to make the interface slightly more verbose but hopefully
less error-prone by passing those integer parameters around in
structures.   There are:

        struct ustr_cv_index 
        {
          ssize_t cv;
        };

and 

        struct ustr_cp_index 
        {
          ssize_t cp;
        };

for coding value and code point unit scalars.


For example:

  t_udstr
  udstr_cv_delete_fw (alloc_limits limits,
                      t_udstr d,
                      ustr_cv_index_t from,
                      ustr_cv_index_t to);
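
A small sketch of why the wrapper structures help.  The two structs
are as given above; the typedefs are assumed from the "_t" names in
the prototype, and cv_index, cp_index and cv_span are hypothetical
helpers invented for this illustration.

```c
#include <assert.h>
#include <sys/types.h>          /* ssize_t (POSIX) */

struct ustr_cv_index
{
  ssize_t cv;
};

struct ustr_cp_index
{
  ssize_t cp;
};

/* Assumed typedefs, inferred from the prototype above. */
typedef struct ustr_cv_index ustr_cv_index_t;
typedef struct ustr_cp_index ustr_cp_index_t;

/* Hypothetical constructors, for illustration only. */
static ustr_cv_index_t
cv_index (ssize_t n)
{
  ustr_cv_index_t i;

  i.cv = n;
  return i;
}

static ustr_cp_index_t
cp_index (ssize_t n)
{
  ustr_cp_index_t i;

  i.cp = n;
  return i;
}

/* Hypothetical stand-in for a "_cv_" function: the number of coding
   values between two coding value offsets. */
static ssize_t
cv_span (ustr_cv_index_t from, ustr_cv_index_t to)
{
  assert (from.cv <= to.cv);
  return to.cv - from.cv;
}
```

With plain integers, passing a codepoint offset where a coding value
offset is expected compiles silently.  Here,
"cv_span (cp_index (1), cp_index (3))" is rejected by the compiler
because the two struct types are incompatible, at the cost of a bit
more typing at each call site.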



I've just been coding up the string functions and presuming that
testing will follow later.  We can focus testing on just the udstr
layer because that will exercise the ustr and uni layers quite well
(and because udstr is the primary purpose of those lower layers).

                CODED

        ref/unref
          (reference counting)
        cv_length
        cp_length
        encoding
        str
        cv_ref/cp_ref
          (code point reference)
        save_generic (heavily parameterized)
        save
        save_n
        save_fw
        save_fw_n
        copy
        copy_fw
        fw_x (make the argument _fw'ish)
        cv_substr
        cv_substr_fw
        cp_substr
        cp_substr_fw
        append
        append_x
        append_fw
        append_fw_x
        delete
        delete_x
        delete_fw
        delete_fw_x


                STILL TO CODE

        cv_normalize
          (find the beginning coding value of a multi-codevalue
           codepoint given a pointer into the sequence)

        cv_inc
        cv_dec
           (increment and decrement a codevalue index to the next
            or previous codepoint)

        cp_to_cv
        cv_to_cp

        cv_substr_x
        cp_substr_x
        cv_substr_fw_x
        cp_substr_fw_x

        cp_delete
        cp_delete_x
        cp_delete_fw
        cp_delete_fw_x

        cv_replace              cp_replace
        cv_replace_x            cp_replace_x
        cv_replace_fw           cp_replace_fw
        cv_replace_fw_x         cp_replace_fw_x

        cv_set                  cp_set
        cv_set_x                cp_set_x
        cv_set_fw               cp_set_fw
        cv_set_fw_x             cp_set_fw_x
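
For the cv_normalize, cv_inc and cv_dec entries in the list above,
here is a rough sketch of what they could look like for UTF-8 data.
This is not the udstr code: the names, the raw byte-buffer signatures
and the clamping at the string ends are all assumptions, and the
input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <sys/types.h>          /* ssize_t (POSIX) */

/* UTF-8 continuation bytes have the form 10xxxxxx. */
static int
is_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Hypothetical cv_normalize: back up from POS to the first coding
   value of the codepoint that contains it.  POS == LEN (one past the
   end) is returned unchanged. */
static ssize_t
utf8_cv_normalize (const unsigned char *s, ssize_t len, ssize_t pos)
{
  assert (0 <= pos && pos <= len);
  while (pos > 0 && pos < len && is_continuation (s[pos]))
    --pos;
  return pos;
}

/* Hypothetical cv_inc: advance POS to the start of the next
   codepoint, stopping at LEN. */
static ssize_t
utf8_cv_inc (const unsigned char *s, ssize_t len, ssize_t pos)
{
  assert (0 <= pos && pos <= len);
  if (pos < len)
    ++pos;
  while (pos < len && is_continuation (s[pos]))
    ++pos;
  return pos;
}

/* Hypothetical cv_dec: move POS back to the start of the previous
   codepoint, stopping at 0. */
static ssize_t
utf8_cv_dec (const unsigned char *s, ssize_t len, ssize_t pos)
{
  assert (0 <= pos && pos <= len);
  if (pos > 0)
    --pos;
  while (pos > 0 && is_continuation (s[pos]))
    --pos;
  return pos;
}
```

On the six bytes encoding "a", U+00E9 and U+20AC, normalizing offset
4 or 5 gives 3, incrementing from 1 gives 3, and decrementing from 6
gives 3.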


The "_set_" functions are trivial (they are front-ends to "replace"
taking a single codepoint rather than a string to designate the
replacement).

I don't think I'll do anything special for the remaining substr_x
functions -- they can just be trivial front-ends to the non-_x forms.

Similarly, I'll just leave the cp_delete family of functions as 
slightly inefficient front-ends to the _cv_ variants.
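
One way such a front-end could look, sketched on a plain
NUL-terminated UTF-8 buffer rather than on a t_udstr.  All three
functions are hypothetical and their signatures are assumptions; the
point is only the shape: convert both codepoint offsets to coding
value offsets, then hand off to the coding value version.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>          /* ssize_t (POSIX) */

/* Hypothetical cp_to_cv: walk forward CP codepoints from the start
   and return the matching byte offset (clamped to LEN).  Assumes
   well-formed UTF-8. */
static ssize_t
utf8_cp_to_cv (const char *s, ssize_t len, ssize_t cp)
{
  ssize_t cv = 0;

  assert (cp >= 0);
  while (cp > 0 && cv < len)
    {
      ++cv;
      /* Skip the continuation bytes (10xxxxxx) of this codepoint. */
      while (cv < len && ((unsigned char) s[cv] & 0xC0) == 0x80)
        ++cv;
      --cp;
    }
  return cv;
}

/* Hypothetical in-place coding value delete: remove bytes [FROM, TO)
   and return the new length.  Requires s[len] to be the NUL. */
static ssize_t
utf8_cv_delete_x (char *s, ssize_t len, ssize_t from, ssize_t to)
{
  assert (0 <= from && from <= to && to <= len);
  /* The "+ 1" moves the terminating NUL along with the tail. */
  memmove (s + from, s + to, (size_t) (len - to) + 1);
  return len - (to - from);
}

/* Hypothetical codepoint front-end: two linear scans to translate
   the offsets, then the coding value routine does the work. */
static ssize_t
utf8_cp_delete_x (char *s, ssize_t len, ssize_t from_cp, ssize_t to_cp)
{
  ssize_t from = utf8_cp_to_cv (s, len, from_cp);
  ssize_t to = utf8_cp_to_cv (s, len, to_cp);

  return utf8_cv_delete_x (s, len, from, to);
}
```

The inefficiency is the pair of scans from the start of the string;
the second scan could resume where the first stopped, but for a
front-end that is probably not worth the extra code.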

The replace family of functions deserves some special attention.

-t





