pika-dev

[Pika-dev] other char/string things


From: Tom Lord
Subject: [Pika-dev] other char/string things
Date: Sat, 24 Jan 2004 17:13:27 -0800 (PST)


It's been called to my attention that I forgot to specify a "control"
buckybit.  Oops.

Just for the record, what we eventually need to tweak things toward
is having:

        C-

as a character-name prefix that sets the control buckybit.

Thus, 

        (char->integer #\C-a)

is some large integer -- not U+0001.

Additionally, character _names_ (not bucky-like prefixes) are needed 
for ASCII control characters:


        ctl-a           == U+0001
        ctl-[           == U+001B
        ctl-@           == U+0000
        ctl-space       == U+0000

etc.
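
To make the difference between the two spellings concrete, here is a minimal C sketch. It assumes the codepoint sits in the low 21 bits and that the control buckybit is the first bit above them; the real bit position is not given here. `BUCKY_CONTROL`, `with_control_bucky` and `ctl_name_to_codepoint` are hypothetical names used only for illustration, not Pika functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: codepoints occupy the low 21 bits and the control
   buckybit is the first bit above them.  The actual position of the
   bit is not specified in the message. */
#define CODEPOINT_BITS 21
#define BUCKY_CONTROL  (UINT32_C(1) << CODEPOINT_BITS)

/* #\C-a style: keep the codepoint and set the control buckybit. */
static uint32_t with_control_bucky(uint32_t codepoint)
{
  return codepoint | BUCKY_CONTROL;
}

/* #\ctl-a style: map the part of the name after "ctl-" to an ASCII
   control codepoint.  Returns -1 for suffixes this sketch does not
   handle. */
static int32_t ctl_name_to_codepoint(const char *suffix)
{
  if (strcmp(suffix, "space") == 0)
    return 0x00;                      /* ctl-space == U+0000 */
  if (strlen(suffix) != 1)
    return -1;

  unsigned char c = (unsigned char) suffix[0];
  if (c >= 'a' && c <= 'z')
    c = (unsigned char) (c - 'a' + 'A');  /* ctl-a is the same as ctl-A */
  if (c >= '@' && c <= '_')
    return c ^ 0x40;                  /* '@' -> 0x00, 'A' -> 0x01, '[' -> 0x1B */
  return -1;
}
```

Under these assumptions `with_control_bucky('a')` is 0x200061, a large integer, while `ctl_name_to_codepoint("a")` is plain 0x01.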

We currently have four buckybits and adding control will make five.

I think that there is a total _possible_ number of 8 buckybits,
because 32 == 21 + 8 + 3, 3 == log_2(8), and 2*sizeof (t_scm_word)
is 8 on a 32-bit machine.  In other words, with 3 tag bits and
21 codepoint bits,

        tag-bits + codepoint + buckybits == 32
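
The bit budget can be sketched in C as follows. Only the three field widths come from the arithmetic above; the field order (tag lowest, then codepoint, then buckybits on top) and the names `pack_char`, `char_tag`, `char_codepoint` and `char_buckybits` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths from the message: 3 + 21 + 8 == 32. */
#define TAG_BITS       3
#define CODEPOINT_BITS 21
#define BUCKY_BITS     8

_Static_assert(TAG_BITS + CODEPOINT_BITS + BUCKY_BITS == 32,
               "a character must fill exactly one 32-bit word");

/* ASSUMPTION: tag in the lowest bits, codepoint next, buckybits on
   top.  The message does not say how the fields are ordered. */
#define CODEPOINT_SHIFT TAG_BITS
#define BUCKY_SHIFT     (TAG_BITS + CODEPOINT_BITS)

#define TAG_MASK        ((UINT32_C(1) << TAG_BITS) - 1)
#define CODEPOINT_MASK  ((UINT32_C(1) << CODEPOINT_BITS) - 1)
#define BUCKY_MASK      ((UINT32_C(1) << BUCKY_BITS) - 1)

/* Combine the three fields into one 32-bit word. */
static uint32_t pack_char(uint32_t tag, uint32_t codepoint, uint32_t bucky)
{
  assert(tag <= TAG_MASK);
  assert(codepoint <= CODEPOINT_MASK);
  assert(bucky <= BUCKY_MASK);
  return tag | (codepoint << CODEPOINT_SHIFT) | (bucky << BUCKY_SHIFT);
}

/* Extract each field again. */
static uint32_t char_tag(uint32_t w)
{
  return w & TAG_MASK;
}

static uint32_t char_codepoint(uint32_t w)
{
  return (w >> CODEPOINT_SHIFT) & CODEPOINT_MASK;
}

static uint32_t char_buckybits(uint32_t w)
{
  return (w >> BUCKY_SHIFT) & BUCKY_MASK;
}
```

The largest Unicode codepoint, U+10FFFF, fits in 21 bits, which is why nothing wider is needed for the codepoint field.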


Of those 8 possible buckybits, I'd like to reserve 2 for purely
internal use and to make use of these in uni_utf32 strings.  The
purpose of these extra bits in UTF32 strings is to represent
ill-formed and unrepresentable sequences of Unicode characters.  (See
enclosed.  There are variations on that idea using just 1 bit and
variations using the four values of 2 bits differently -- we can work
that out as it comes up, which won't be for a while.)

So that will leave one unallocated bit in characters.


-t






    > From: Tom Lord <address@hidden>
    > [To: gnu-arch-users]

    [....]

    >     > Just decide how many ISO 10646 planes you want to support, and use
    >     > the appropriate number of bits (21 is fine).  Use an additional bit
    >     > to squeeze in 256 code positions you might want to use to represent
    >     > invalid UTF-8 input data (so you have round-trip capability even for
    >     > binary files accidentally interpreted as UTF-8).

    > I'm not giving UTF-8 that kind of privileged role in Pika.

    > However, it's a fascinating idea and I thank you for it.   It solves a
    > nasty little problem I was facing.

    > Let's suppose that I use up two buckybits purely internally to
    > represent "ill-formed-characters".   That is to say: users would have
    > 6 buckybits, not 8, and there's two bits per character for internal
    > use.

    > I don't actually need 2 bits --- I just need a bit more than 1.5 and
    > current hw isn't too good at fractional (let alone irrationally
    > fractional) bits yet.

    > Now I can have a string like:

    >   <00 codepoint><00 codepoint><01 bogus><10 bogus><10 bogus><00 codepoint>
    >                                         ^
    >                                         |
    >                                         X

    > in which <01 bogus> and <10 bogus><10 bogus> are ill-formed combining
    > character sequences that should be treated as distinct graphemes by
    > procedures like GRAPHEME-LENGTH and GRAPHEME-REF.

    > Now if I insert a string of the form:

    >         <01 bogus>

    > at point X in that string, then the result is:

    >   <00 cp><00 cp><01 bogus><10 bogus><01 bogus><01 bogus><00 cp>
    >                 \        /\        /\                  /
    >                  \      /  \      /  \                /
    >                   modified  insertion   modified by
    >                   by                     insertion
    >                   insertion

    > In other words, such an insertion has to change adjacent characters to 
    > preserve the "bogus grapheme" boundaries.

    > The upshot of this is that I can pun a single string as both a
    > sequence of codepoints and a sequence of (possibly ill-formed)
    > combining sequences -- and that is, btw, sufficient to provide the
    > round-tripping ability you were after not only for UTF-8 but for
    > UTF-16 and UTF-32 as well.   Total win -- thanks again.



