Re: [Pika-dev] String hashing in hackerlab?
From: Tom Lord
Subject: Re: [Pika-dev] String hashing in hackerlab?
Date: Mon, 7 Jun 2004 09:52:50 -0700 (PDT)
> From: Andreas Rottmann <address@hidden>
> I think it would make sense to move the hash_symbol_name() code to
> hackerlab proper, calling it something like ustr_hash() and habitate
> it in either ustr.[hc] or a files ustr-hash.[hc].
The latter, I think, but yes. (But hang on a couple more days.)
> (BTW: Tom, how's the state of udstr?)
Pretty good.
The interface wound up being sets of functions like:
    udstr_cv_substr         udstr_cp_substr
    udstr_cv_substr_x       udstr_cp_substr_x
    udstr_cv_substr_fw      udstr_cp_substr_fw
    udstr_cv_substr_fw_x    udstr_cp_substr_fw_x
The "_cv_" versions take string indexes expressed in
coding value units.
The "_cp_" versions take string indexes expressed as
codepoints.
The "_x" variants modify their first string argument while
the non-"_x" variants allocate a new string for the result.
Thus "substr(a, from, to)" creates a new string but
"substr_x(a, from, to)" modifies a to contain nothing
but the indicated substring.
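To make the "_x" convention concrete, here is a minimal sketch using plain NUL-terminated C strings rather than t_udstr. The names sketch_substr and sketch_substr_x are hypothetical stand-ins, not hackerlab functions, and the half-open [from, to) range is an assumption, since the message does not say whether "to" is inclusive.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the udstr convention, using plain C
   strings.  Indexes are byte offsets and the range [from, to) is
   half-open; both choices are assumptions made for this sketch. */

/* Non-"_x" form: allocates and returns a new string; `a` is untouched.
   The caller frees the result.  Returns NULL if allocation fails. */
char *
sketch_substr (const char *a, size_t from, size_t to)
{
  size_t len = strlen (a);
  char *result;

  assert (from <= to && to <= len);
  result = malloc (to - from + 1);
  if (!result)
    return NULL;
  memcpy (result, a + from, to - from);
  result[to - from] = '\0';
  return result;
}

/* "_x" form: modifies `a` in place so that it holds nothing but the
   indicated substring.  Returns `a` for convenience. */
char *
sketch_substr_x (char *a, size_t from, size_t to)
{
  size_t len = strlen (a);

  assert (from <= to && to <= len);
  memmove (a, a + from, to - from);   /* regions may overlap */
  a[to - from] = '\0';
  return a;
}
```

With a buffer holding "hackerlab", sketch_substr (buf, 0, 6) returns a fresh "hacker" and leaves the buffer alone, while sketch_substr_x (buf, 6, 9) rewrites the buffer itself to "lab".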
The "_fw" variants ensure that the result is wide enough
so that each codepoint it contains fits in a single
coding value, but no wider.
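One way to picture the "_fw" rule is as a scan for the widest codepoint in the result. The sketch below is only an illustration: sketch_fw_width is a hypothetical helper, and it assumes that an 8-bit coding value can hold U+0000..U+00FF and a 16-bit coding value U+0000..U+FFFF. The message does not say which encodings udstr actually picks between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of hackerlab.  Returns the narrowest
   coding-value width in bytes (1, 2, or 4) such that every codepoint
   in cps[0..n) fits in a single coding value.  Assumes 8-bit values
   cover U+0000..U+00FF and 16-bit values cover U+0000..U+FFFF. */
int
sketch_fw_width (const uint32_t *cps, size_t n)
{
  int width = 1;
  size_t i;

  assert (cps != NULL || n == 0);
  for (i = 0; i < n; ++i)
    {
      if (cps[i] > 0xFFFF)
        return 4;               /* cannot get any wider; stop early */
      if (cps[i] > 0xFF)
        width = 2;
    }
  return width;
}
```

Under those assumptions a pure ASCII string stays at one byte per codepoint, a string containing U+03BB needs two, and one containing U+1F600 needs four.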
In those functions, there are many integer parameters measured in
coding values (like string length or substr offsets in _cv_ functions)
and many integer parameters measured in code points (like in _cp_
functions).
I decided to make the interface slightly more verbose but hopefully
less error prone by passing those integer parameters around in
structures. There are:
    struct ustr_cv_index
    {
      ssize_t cv;
    };
and
    struct ustr_cp_index
    {
      ssize_t cp;
    };
for coding value and code point unit scalars.
For example:
    t_udstr
    udstr_cv_delete_fw (alloc_limits limits,
                        t_udstr d,
                        ustr_cv_index_t from,
                        ustr_cv_index_t to);
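The payoff of wrapping the integers is that the compiler rejects a codepoint index passed where a coding-value index is expected. A minimal sketch follows; the two structs are taken from the message, but the typedefs are only a guess at how the "_t" names in the prototype are defined, and sketch_cv, sketch_cp, and sketch_cv_span are hypothetical helpers invented for illustration.

```c
#include <assert.h>
#include <sys/types.h>          /* ssize_t (POSIX) */

struct ustr_cv_index
{
  ssize_t cv;
};

struct ustr_cp_index
{
  ssize_t cp;
};

/* Assumed definitions of the "_t" names used in the prototype. */
typedef struct ustr_cv_index ustr_cv_index_t;
typedef struct ustr_cp_index ustr_cp_index_t;

/* Hypothetical constructors, to keep call sites short. */
ustr_cv_index_t
sketch_cv (ssize_t n)
{
  ustr_cv_index_t i;
  i.cv = n;
  return i;
}

ustr_cp_index_t
sketch_cp (ssize_t n)
{
  ustr_cp_index_t i;
  i.cp = n;
  return i;
}

/* Hypothetical function that accepts only coding-value offsets and
   returns the number of coding values between them. */
ssize_t
sketch_cv_span (ustr_cv_index_t from, ustr_cv_index_t to)
{
  assert (from.cv <= to.cv);
  return to.cv - from.cv;
}

/* sketch_cv_span (sketch_cp (1), sketch_cv (4)) does not compile:
   the two struct types are incompatible, so the unit mix-up is
   caught at compile time instead of silently miscounting. */
```

With bare ssize_t parameters the same mistake would compile cleanly, which is the error the more verbose interface is meant to prevent.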
I've just been coding up the string functions and presuming that
testing will follow later. We can focus testing on just the udstr
layer because that will exercise the ustr and uni layers quite well
(and because udstr is the primary purpose of those lower layers).
CODED
ref/unref
(reference counting)
cv_length
cp_length
encoding
str
cv_ref/cp_ref
(code point reference)
save_generic (heavily parameterized)
save
save_n
save_fw
save_fw_n
copy
copy_fw
fw_x (make the argument _fw'ish)
cv_substr
cv_substr_fw
cp_substr
cp_substr_fw
append
append_x
append_fw
append_fw_x
delete
delete_x
delete_fw
delete_fw_x
STILL TO CODE
cv_normalize
(find the beginning coding value of a multi-codevalue
codepoint given a pointer into the sequence)
cv_inc
cv_dec
(increment and decrement a codevalue index to the next
or previous codepoint)
cp_to_cv
cv_to_cp
cv_substr_x
cp_substr_x
cv_substr_fw_x
cp_substr_fw_x
cp_delete
cp_delete_x
cp_delete_fw
cp_delete_fw_x
cv_replace cp_replace
cv_replace_x cp_replace_x
cv_replace_fw cp_replace_fw
cv_replace_fw_x cp_replace_fw_x
cv_set cp_set
cv_set_x cp_set_x
cv_set_fw cp_set_fw
cv_set_fw_x cp_set_fw_x
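For the cv_normalize, cv_inc, and cv_dec entries in the list above, here is a rough sketch of what they might look like for one particular encoding. It assumes well-formed UTF-8 held in a byte array, which is only one of the encodings udstr would have to handle, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* True for a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int
is_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Back up from `pos` to the first coding value of the codepoint that
   contains it.  Requires pos < len of the (well-formed) sequence. */
size_t
sketch_cv_normalize (const unsigned char *s, size_t pos)
{
  while (pos > 0 && is_continuation (s[pos]))
    --pos;
  return pos;
}

/* Advance to the first coding value of the next codepoint.  Never
   moves past `len`; at the end it returns `len` unchanged. */
size_t
sketch_cv_inc (const unsigned char *s, size_t len, size_t pos)
{
  assert (pos <= len);
  if (pos < len)
    ++pos;
  while (pos < len && is_continuation (s[pos]))
    ++pos;
  return pos;
}

/* Step back to the first coding value of the previous codepoint.
   At offset 0 it returns 0 unchanged. */
size_t
sketch_cv_dec (const unsigned char *s, size_t pos)
{
  if (pos > 0)
    --pos;
  return sketch_cv_normalize (s, pos);
}
```

For the bytes of "a", U+00E9, U+20AC, "z" (offsets 0, 1, 3, 6), normalizing offset 5 yields 3, incrementing from 1 yields 3, and decrementing from 6 yields 3.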
The "_set_" functions are trivial (they are front-ends to "replace"
taking a single codepoint rather than a string to designate the
replacement).
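As an illustration of "set" being a thin front-end to "replace", here is a sketch over plain UTF-8 C strings. All names and signatures are hypothetical; in particular the (from, to, codepoint) parameter list for "set" is a guess, and the encoder does no validation beyond a range assertion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocating "replace": returns a new string in which
   bytes [from, to) of `s` are replaced by repl[0..repl_len).
   The caller frees the result.  Returns NULL if allocation fails. */
char *
sketch_cv_replace (const char *s, size_t from, size_t to,
                   const char *repl, size_t repl_len)
{
  size_t len = strlen (s);
  char *out;

  assert (from <= to && to <= len);
  out = malloc (len - (to - from) + repl_len + 1);
  if (!out)
    return NULL;
  memcpy (out, s, from);                                 /* prefix */
  memcpy (out + from, repl, repl_len);                   /* replacement */
  memcpy (out + from + repl_len, s + to, len - to + 1);  /* suffix + NUL */
  return out;
}

/* Encode one codepoint as UTF-8 into out[0..4); returns the byte
   count.  Does not reject surrogates. */
static size_t
sketch_encode_utf8 (uint32_t c, char *out)
{
  if (c < 0x80)
    {
      out[0] = (char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (char) (0xC0 | (c >> 6));
      out[1] = (char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      out[0] = (char) (0xE0 | (c >> 12));
      out[1] = (char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (char) (0x80 | (c & 0x3F));
      return 3;
    }
  out[0] = (char) (0xF0 | (c >> 18));
  out[1] = (char) (0x80 | ((c >> 12) & 0x3F));
  out[2] = (char) (0x80 | ((c >> 6) & 0x3F));
  out[3] = (char) (0x80 | (c & 0x3F));
  return 4;
}

/* Hypothetical "set": the same as replace, except the replacement is
   a single codepoint instead of a string. */
char *
sketch_cv_set (const char *s, size_t from, size_t to, uint32_t c)
{
  char buf[4];
  size_t n;

  assert (c != 0 && c <= 0x10FFFF);
  n = sketch_encode_utf8 (c, buf);
  return sketch_cv_replace (s, from, to, buf, n);
}
```

All the real work sits in the replace routine; the set wrapper only encodes the codepoint and forwards it.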
I don't think I'll do anything special for the remaining substr_x
functions -- they can just be trivial front-ends to the non-_x forms.
Similarly, I'll just leave the cp_delete family of functions as
slightly inefficient front ends to the _cv_ variants.
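A sketch of why a cp_delete front end is simple but "slightly inefficient": each codepoint index has to be converted to a coding-value offset by scanning from the start of the string before the _cv_ routine can do the work. This assumes well-formed, NUL-terminated UTF-8 and in-place ("_x" style) deletion; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert a codepoint index to a byte offset by scanning from the
   start of the string: O(n) per call.  An index past the end maps
   to the offset of the terminating NUL. */
size_t
sketch_cp_to_cv (const char *s, size_t cp)
{
  size_t cv = 0;

  while (s[cv] != '\0' && cp > 0)
    {
      ++cv;
      /* Skip UTF-8 continuation bytes (10xxxxxx). */
      while (((unsigned char) s[cv] & 0xC0) == 0x80)
        ++cv;
      --cp;
    }
  return cv;
}

/* Delete bytes [from, to) of `s` in place. */
void
sketch_cv_delete_x (char *s, size_t from, size_t to)
{
  size_t len = strlen (s);

  assert (from <= to && to <= len);
  memmove (s + from, s + to, len - to + 1);   /* +1 moves the NUL too */
}

/* Front end: translate codepoint indexes to byte offsets, then let
   the coding-value routine do the actual deletion. */
void
sketch_cp_delete_x (char *s, size_t from, size_t to)
{
  sketch_cv_delete_x (s, sketch_cp_to_cv (s, from), sketch_cp_to_cv (s, to));
}
```

The two conversions each walk the string from the beginning, so the wrapper costs two extra passes on top of the deletion itself.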
The replace family of functions deserves some special attention.
-t