[Pika-dev] Re: string work
From: Tom Lord
Subject: [Pika-dev] Re: string work
Date: Sun, 25 Jan 2004 17:39:44 -0800 (PST)
> From: "Jose A. Ortega Ruiz" <address@hidden>
> i'm back from a weekend abroad, and a little bit overwhelmed by the
> sheer amount of mails regarding pika strings that i've found in my
> inbox. after a quick perusal, it seems like Matthew is already working
> on pika strings, advancing at his usual quick pace, right?
He's working on the boilerplate code for wrapping a hackerlab t_udstr
as a Scheme object. I suspect (intending no disrespect to Matthew)
that this code will need to be revised as Unicode string facilities
come on-line. It's still a useful thing to do just to lay out the
framework and to help with bootstrapping.
> hm. given the amount of time i can devote to pika, i would need a week
> to read all the relevant postings, then enter into hackerlab, then
> catch Matthew's code... by that time, it seems like i'd be of little
> use, wouldn't i?
> tom, i think i really need a quiet corner of pika to hack on (at a
> calm, slow pace) for the time being; otherwise, i'll become a
> bottleneck. right now, i can hardly follow your discussions on the
> list(s). i'd need to know better the pika innards. to that end, i'd
> like to hack slowly on something complex enough, but probably out of
> the critical path, so that i can take my time. what do you think? is
> there such a corner?
> or am i panicking without reason? :)
How do you feel about working on libhackerlab, for now, rather than
libscm directly? That would complement Matthew's work but also is
something that can proceed mostly separately at its own pace. In
fact, this work includes working on the guts of the standard Scheme
string procedures --- the needed functions are those that will allow
Pika and Pika applications to be first-class Unicode-using programs;
you'd be implementing the engine for things like STRING-APPEND and
STRING-SET!.
Below is what's needed in ./src/hackerlab/strings.
* Minor annoyance: C aliasing rules safety
It's probably worth fixing this early, before lots of new code comes
to depend on a mistake I made.
The string functions use `uni_string' parameters. That type is
intended to be a way to pass pointers to strings which might be:
ASCII or iso8859-1 t_uchar *
UTF-8 t_uchar *
UTF-16 t_uint16 *
It's defined in src/hackerlab/unicode/unicode.h as:
typedef struct uni__undefined_struct * uni_string;
which is a case of my "old school" coding habits showing through.
It should instead be defined as a union type:
union uni_string
{
t_uchar * iso8859_1;
t_uchar * utf8;
t_uint16 * utf16;
t_uint32 * utf32;
};
and uses updated to reflect that.
For convenience, there should be functions (which are optionally
inlined) like:
union uni_string
uni_string_utf8 (t_uchar * utf8_data)
{
union uni_string answer;
answer.utf8 = utf8_data;
return answer;
}
(Such functions are a stand-in for cast-to-union which is not
portable enough to use.)
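To illustrate how callers might read through the union instead of casting an opaque pointer, here is a minimal sketch. The typedefs stand in for the hackerlab ones, and `enum toy_encoding' and `toy_coding_value' are hypothetical names invented for this example, not existing libhackerlab API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the hackerlab typedefs (assumed widths). */
typedef unsigned char t_uchar;
typedef uint16_t t_uint16;
typedef uint32_t t_uint32;

union uni_string
{
  t_uchar * iso8859_1;
  t_uchar * utf8;
  t_uint16 * utf16;
  t_uint32 * utf32;
};

/* Hypothetical subset of `enum uni_encoding_scheme'. */
enum toy_encoding { toy_iso8859_1, toy_utf16, toy_utf32 };

/* Constructor in the style proposed above, for the UTF-16 member. */
static union uni_string
uni_string_utf16 (t_uint16 * utf16_data)
{
  union uni_string answer;
  answer.utf16 = utf16_data;
  return answer;
}

/* Fetch the coding value at `index', reading through the union
 * member that matches `enc'.  No pointer casts are involved, so the
 * access always uses the type the data was stored with.
 */
t_uint32
toy_coding_value (union uni_string str, enum toy_encoding enc, size_t index)
{
  switch (enc)
    {
    case toy_iso8859_1: return str.iso8859_1[index];
    case toy_utf16:     return str.utf16[index];
    case toy_utf32:     return str.utf32[index];
    }
  assert (!"unknown encoding");
  return 0;
}
```

A caller would wrap its buffer once with the matching constructor and pass the union by value from then on.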
* utf-32
Notice that procedures like `udstr_save' take parameters of type
`enum uni_encoding_scheme'. Encodings are defined for iso8859_1,
utf8, and variations on utf16.
Support is also needed for the encoding `uni_utf32' which is a
32-bits-per-character encoding using the native byte order.
* fixed width routines
Pika should ultimately not use `udstr_save' or similar functions,
at least in many circumstances.
Rather, it should use `udstr_save_fw' (which needs to be written).
`udstr_save_fw' returns the narrowest `t_udstr' in which all
characters have the same width representation.
For example, if `udstr_save_fw' is passed a UTF-16 string, but all
of the characters in that string are ASCII characters, then the
`t_udstr' returned should use the encoding `uni_iso8859_1' and
use one byte per character.
On the other hand, if `udstr_save_fw' is passed a UTF-8 string, but
not all of the characters are ASCII, then the `t_udstr' returned
should use the encoding `uni_utf16' or `uni_utf32'.
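The width selection that `udstr_save_fw' has to perform can be sketched roughly as follows, working on already-decoded code points. `enum fw_encoding', `narrowest_fw_encoding' and the `ONE_BYTE_MAX' threshold are hypothetical. In particular, the text above only talks about ASCII, so treating the whole iso8859-1 range as one-byte is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed-width members of
 * `enum uni_encoding_scheme'.
 */
enum fw_encoding
{
  fw_iso8859_1,   /* 1 byte per character  */
  fw_utf16,       /* 2 bytes per character, no surrogate pairs */
  fw_utf32        /* 4 bytes per character */
};

/* Assumption: every iso8859-1 character fits the one-byte form.
 * Use 0x7F instead if only ASCII should stay narrow.
 */
#define ONE_BYTE_MAX 0xFF

/* Return the narrowest encoding in which every code point in
 * `chars' occupies exactly one coding value.
 */
enum fw_encoding
narrowest_fw_encoding (const uint32_t * chars, size_t len)
{
  enum fw_encoding answer = fw_iso8859_1;
  size_t i;

  assert (chars != NULL || len == 0);

  for (i = 0; i < len; ++i)
    {
      if (chars[i] > 0xFFFF)
        return fw_utf32;          /* cannot get any wider */
      if (chars[i] > ONE_BYTE_MAX)
        answer = fw_utf16;
    }
  return answer;
}
```

A real `udstr_save_fw' would decode its input first and then copy the characters into a buffer of the chosen width.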
* illegal coding sequences, uni_bogus32, and 23-bit characters
What happens if `udstr_save_fw' is passed, for example, a
purportedly UTF-8 string --- but the string is not valid UTF-8?
libhackerlab should be modified to use a 22-bit extended version of
Unicode. If the high-order bit, bit 21, of a character is set,
then the character is a "bogus" character.
For example, when scanning a UTF-8 string, an illegal byte is
interpreted to represent a bogus character -- the scan returns that
byte ORed with (1<<21). (In the current code it returns a generic
substitution character but the addition of "bogus characters" will
allow a scan to work on ill-formed strings without discarding
information.)
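As a rough sketch of such a scan (not the existing libhackerlab scanner), the function below decodes one UTF-8 character and, on any ill-formed input, consumes a single byte and returns it ORed with (1<<21). The names `BOGUS_BIT' and `scan_utf8_bogus' are made up for the example, and the policy of consuming exactly one byte per error is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOGUS_BIT ((uint32_t)1 << 21)

/* Decode one character starting at str[*pos] and advance *pos.
 * An illegal byte is consumed on its own and returned ORed with
 * BOGUS_BIT, so no information from the input is discarded.
 */
uint32_t
scan_utf8_bogus (const unsigned char * str, size_t len, size_t * pos)
{
  unsigned char lead;
  uint32_t c = 0;
  uint32_t min = 0;
  size_t extra = 0;
  size_t i;

  assert (str != NULL && pos != NULL && *pos < len);

  lead = str[*pos];

  if (lead < 0x80)
    {
      *pos += 1;
      return lead;
    }
  else if ((lead & 0xE0) == 0xC0)
    { extra = 1; c = lead & 0x1F; min = 0x80; }
  else if ((lead & 0xF0) == 0xE0)
    { extra = 2; c = lead & 0x0F; min = 0x800; }
  else if ((lead & 0xF8) == 0xF0)
    { extra = 3; c = lead & 0x07; min = 0x10000; }
  else
    goto bogus;                   /* stray continuation or invalid lead */

  if (len - *pos <= extra)
    goto bogus;                   /* truncated sequence */

  for (i = 1; i <= extra; ++i)
    {
      unsigned char b = str[*pos + i];
      if ((b & 0xC0) != 0x80)
        goto bogus;               /* not a continuation byte */
      c = (c << 6) | (b & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values past U+10FFFF. */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    goto bogus;

  *pos += extra + 1;
  return c;

 bogus:
  *pos += 1;
  return (uint32_t)lead | BOGUS_BIT;
}
```

Because each bogus character carries the original byte in its low bits, the ill-formed input can be reproduced exactly when the string is written back out.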
The encoding `uni_bogus32' is used for `t_udstr' strings that
contain bogus characters. In this encoding, each character is
represented by 23 bits.
If bits 21 and 22 are 0, then the character is a normal Unicode
codepoint.
If bit 21 is 1, then the character is a "bogus character".
The use for bit 22 is explained further below.
* buckybits and 28 bit characters
A tiny bit of machinery should be moved from Pika into libhackerlab
for dealing with buckybits.
Strings (`t_udstr') can contain characters with buckybits set.
libhackerlab represents these as `uni_utf32' strings (or `uni_bogus32' if the
string also contains bogus characters).
* Character Properties and The Like
For the purpose of character-property functions, such as:
/* src/hackerlab/unidata/case-db-inlines.h
*/
unidata_to_upper (t_unicode c);
a character with non-0 buckybits should be treated analogously to
its non-buckybit version. In other words, mask out the buckybits,
compute the function on the underlying codepoint, and then put the
buckybits back on the result.
Bogus characters should be treated like unassigned Unicode
codepoints. For example, unidata_to_upper (bogus_char) returns
its argument.
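The mask/compute/restore rule might look like the sketch below. The bit layout is an assumption: the message does not say where the buckybits live, so this example guesses bits 23-27 (which would give 28-bit characters). `toy_to_upper' is an ASCII-only stand-in for `unidata_to_upper', and `to_upper_with_bucky' is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, not stated in the message:
 *   bits 0-20   code point
 *   bit  21     bogus flag
 *   bit  22     boundary flag (uni_bogus32 only)
 *   bits 23-27  buckybits
 */
#define BOGUS_BIT   ((uint32_t)1 << 21)
#define BUCKY_MASK  ((uint32_t)0x1F << 23)

/* ASCII-only stand-in for unidata_to_upper. */
static uint32_t
toy_to_upper (uint32_t c)
{
  return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* Apply the case mapping to the underlying code point and put the
 * buckybits back on the result.  Bogus characters map to themselves,
 * like unassigned code points.
 */
uint32_t
to_upper_with_bucky (uint32_t c)
{
  uint32_t bucky = c & BUCKY_MASK;
  uint32_t base = c & ~BUCKY_MASK;

  assert ((c >> 28) == 0);        /* nothing above the buckybits */

  if (base & BOGUS_BIT)
    return c;

  return toy_to_upper (base) | bucky;
}
```

The same wrapper shape should work for any other character-property function that maps a code point to a code point.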
* hackerlab versions of standard Scheme string procedures
Scheme defines standard procedures like STRING-REF, STRING-SET!,
STRING-APPEND, etc.
Analogous functions need to be provided for `t_udstr' -- these will
be used to implement the Pika functions.
Note that some of those functions should be `_fw' versions. For
example, perhaps libhackerlab should have:
ustr_set (uni_string str, int coding_value_index, t_unicode value)
but that isn't what Pika should use. Pika should use:
ustr_set_fw (uni_string str,
int coding_value_index,
t_unicode value)
where the _fw variant will be doing things like promoting a
`uni_iso8859_1' string to a `uni_utf16' string if it takes 16 bits
to represent `value'.
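A minimal sketch of the promotion step, under several assumptions: `struct fw_string', `fw_string_set' and the helpers are hypothetical, the string owns a malloc'd buffer, and the function takes a pointer to the string so that it can replace the buffer (the `ustr_set_fw' signature above takes a `uni_string', so the real interface would have to differ). The sketch only widens and never narrows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-width string: `width' bytes per character
 * (1, 2 or 4), `len' characters, malloc'd `data'.
 */
struct fw_string
{
  size_t width;
  size_t len;
  void * data;
};

/* Bytes needed to store `value' as a single coding value. */
static size_t
width_needed (uint32_t value)
{
  return (value <= 0xFF) ? 1 : (value <= 0xFFFF) ? 2 : 4;
}

static uint32_t
fw_get (const struct fw_string * s, size_t i)
{
  switch (s->width)
    {
    case 1:  return ((const uint8_t *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

static void
fw_put (void * data, size_t width, size_t i, uint32_t value)
{
  switch (width)
    {
    case 1:  ((uint8_t *)data)[i] = (uint8_t)value; break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)value; break;
    default: ((uint32_t *)data)[i] = value; break;
    }
}

/* Store `value' at `index', first widening the whole string if
 * `value' does not fit the current width.
 * Returns 0 on success, -1 on allocation failure.
 */
int
fw_string_set (struct fw_string * s, size_t index, uint32_t value)
{
  size_t need = width_needed (value);

  assert (s != NULL && index < s->len);

  if (need > s->width)
    {
      void * wider = malloc (s->len * need);
      size_t i;

      if (!wider)
        return -1;
      for (i = 0; i < s->len; ++i)
        fw_put (wider, need, i, fw_get (s, i));
      free (s->data);
      s->data = wider;
      s->width = need;
    }
  fw_put (s->data, s->width, index, value);
  return 0;
}
```

Promotion costs a full copy, but it keeps STRING-REF a constant-time array index, which is the point of the fixed-width representation.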
* Bit 22 in `uni_bogus32'
The routines described above should preserve the following invariant
for `uni_bogus32' strings:
If bit 21 is 0 (the character is not bogus), then bit 22 is 0.
If bit 21 is 1, and there is a character to the left in which
bit 22 is 0 or else there is no character to the left, then
bit 22 is 1.
Otherwise, bit 22 is 0.
That implies that in a substring of bogus characters in a
`uni_bogus32' string, bit 22 alternates between 1 and 0
(assuming the string has been manipulated only by the functions
described above).
And it implies that in a maximal length substring of bogus
characters, the first one has bit 22 set to 1.
STRING-REF (and its hackerlab equivalent) should mask out bit 22.
STRING-SET! (and its hackerlab equivalent) should update the bit 22s
of adjacent characters as necessary.
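The invariant and the two rules just stated can be sketched on a plain array of 32-bit characters. `BOUNDARY_BIT', `bogus32_ref' and `bogus32_set' are hypothetical names. The sketch assumes the string satisfied the invariant before the call and handles only the strictly alternating case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOGUS_BIT     ((uint32_t)1 << 21)
#define BOUNDARY_BIT  ((uint32_t)1 << 22)   /* hypothetical name for bit 22 */

/* What STRING-REF would return: the character with bit 22 masked out. */
uint32_t
bogus32_ref (const uint32_t * str, size_t len, size_t index)
{
  assert (str != NULL && index < len);
  return str[index] & ~BOUNDARY_BIT;
}

/* What STRING-SET! would do: store `value', then repair bit 22 from
 * `index' rightwards.  Bit 22 of a character depends only on whether
 * that character is bogus and on bit 22 of its left neighbour, so the
 * repair can stop at the first character that needs no change.
 */
void
bogus32_set (uint32_t * str, size_t len, size_t index, uint32_t value)
{
  size_t i;

  assert (str != NULL && index < len);

  str[index] = value & ~BOUNDARY_BIT;

  for (i = index; i < len; ++i)
    {
      uint32_t c = str[i] & ~BOUNDARY_BIT;
      int left_clear = (i == 0) || !(str[i - 1] & BOUNDARY_BIT);
      uint32_t fixed = ((c & BOGUS_BIT) && left_clear)
                         ? (c | BOUNDARY_BIT)
                         : c;

      if (i > index && fixed == str[i])
        break;                    /* nothing further right can change */
      str[i] = fixed;
    }
}
```

In the worst case a single store flips bit 22 across an entire run of bogus characters to its right, so the update is linear in the length of that run.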
Later, we'll add functions that cause bit 22 to not alternate that
way (and the updates performed by functions like STRING-SET! should
take this into account).
For example, I think we'll want Pika to have a string append
procedure, distinct from STRING-APPEND, that preserves combining
character sequence boundaries.
In other words, suppose that I have a string which is just:
"A"
and another string which is the ill-formed combining character
sequence:
"\U+0301"
where that's my way of typing in ASCII a string containing just
U+0301 ("COMBINING ACUTE ACCENT").
Now suppose we have a procedure GRAPHEME-LENGTH which reports,
essentially, the number of combining character sequences in a
string. Then:
(grapheme-length "A") => 1
(grapheme-length "\U+0301") => 1
and given:
(define s (string-append "A" "\U+0301"))
we get:
(string-length s) => 2
(grapheme-length s) => 1
however, and this is where bit 22 comes in:
(define s2 (string-grapheme-append "A" "\U+0301"))
(string-length s2) => 2
(grapheme-length s2) => 2
(string=? s s2) => #t
-t