[Pika-dev] Re: string work

From:	Tom Lord
Subject:	[Pika-dev] Re: string work
Date:	Sun, 25 Jan 2004 17:39:44 -0800 (PST)

    > From: "Jose A. Ortega Ruiz" <address@hidden>

    > i'm back from a weekend abroad, and a little bit overwhelmed by the
    > sheer amount of mails regarding pika strings that i've found in my
    > inbox. after a quick perusal, it seems like Matthew is already working
    > on pika strings, advancing at his usual quick pace, right?

He's working on the boilerplate code for wrapping a hackerlab t_udstr
as a Scheme object.  I suspect (intending no disrespect to Matthew)
that this code will need to be revised as Unicode string facilities
come on-line.  It's still a useful thing to do just to lay out the
framework and to help with bootstrapping.


    > hm. given the amount of time i can devote to pika, i would need a week
    > to read all the relevant postings, then enter into hackerlab, then
    > catch Matthew's code... by that time, it seems like i'd be of little
    > use, wouldn't i?

    > tom, i think i really need a quiet corner of pika to hack on (at a
    > calm, slow pace) for the time being; otherwise, i'll become a
    > bottleneck. right now, i can hardly follow your discussions on the
    > list(s). i'd need to know better the pika innards. to that end, i'd
    > like to hack slowly on something complex enough, but probably out of
    > the critical path, so that i can take my time. what do you think? is
    > there such a corner?

    > or am i panicking without reason? :)

How do you feel about working on libhackerlab, for now, rather than
libscm directly?  That would complement Matthew's work but also is
something that can proceed mostly separately at its own pace.  In
fact, this work includes working on the guts of the standard Scheme
string procedures --- the needed functions are those that will allow
Pika and Pika applications to be first-class Unicode-using programs;
you'd be implementing the engine for things like STRING-APPEND and
STRING-SET!.


Below is what's needed in ./src/hackerlab/strings.   

* Minor annoyance: C aliasing rules safety

  It's probably worth fixing this early, before lots of new code comes
  to depend on a mistake I made.

  The string functions use `uni_string' parameters.  That type is
  intended to be a way to pass pointers to strings which might be:

        ASCII or iso8859-1              t_uchar *
        UTF-8                           t_uchar *
        UTF-16                          t_uint16 *

   It's defined in src/hackerlab/unicode/unicode.h as:

        typedef struct uni__undefined_struct * uni_string;

   which is a case of my "old school" coding habits showing through.

   It should instead be defined as a union type:

        union uni_string
        {
          t_uchar * iso8859_1;
          t_uchar * utf8;
          t_uint16 * utf16;
          t_uint32 * utf32;
        };

  and its uses updated to reflect that.

  For convenience, there should be functions (which are optionally
  inlined) like:

        union uni_string
        uni_string_utf8 (t_uchar * utf8_data)
        {
          union uni_string answer;

          answer.utf8 = utf8_data;
          return answer;
        }

  (Such functions are a stand-in for cast-to-union which is not
  portable enough to use.)


* utf-32

  Notice that procedures like `udstr_save' take parameters of type
  `enum uni_encoding_scheme'.   Encodings are defined for iso8859_1,
  utf8, and variations on utf16.

  Support is also needed for the encoding `uni_utf32' which is a
  32-bits-per-character encoding using the native byte order.



* fixed width routines

  Pika should ultimately not use `udstr_save' or similar functions,
  at least in many circumstances.

  Rather, it should use `udstr_save_fw' (which needs to be written).

  `udstr_save_fw' returns the narrowest `t_udstr' in which all
  characters have the same width representation.

  For example, if `udstr_save_fw' is passed a UTF-16 string, but all
  of the characters in that string are ASCII characters, then the
  `t_udstr' returned should use the encoding `uni_iso8859_1' and 
  use one byte per character.

  On the other hand, if `udstr_save_fw' is passed a UTF-8 string, but
  not all of the characters are ASCII, then the `t_udstr' returned
  should use the encoding `uni_utf16' or `uni_utf32'.
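
  The width choice itself can be sketched roughly as below.  This is
  only an illustration: `narrowest_fixed_width' is a hypothetical
  helper, not part of libhackerlab, and it works on already-decoded
  code points rather than on a `t_udstr'.  It also assumes that
  anything up to U+00FF fits the one-byte `uni_iso8859_1' form; the
  text above only speaks about the ASCII case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libhackerlab function): returns the
 * number of bytes per character (1, 2, or 4) of the narrowest
 * fixed-width representation that can hold every code point in
 * `cps'.  Assumption: values up to 0xFF fit the one-byte form.
 */
static int
narrowest_fixed_width (const uint32_t * cps, size_t len)
{
  int width = 1;
  size_t i;

  for (i = 0; i < len; ++i)
    {
      if (cps[i] > 0xFFFF)
        return 4;               /* needs a 32-bit encoding */
      if (cps[i] > 0xFF)
        width = 2;              /* needs at least 16 bits */
    }
  return width;
}
```

  A real `udstr_save_fw' would decode its input first and then pick
  the encoding from a result like this one.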



* illegal coding sequences, uni_bogus32, and 23-bit characters

  What happens if `udstr_save_fw' is passed, for example, a
  purportedly UTF-8 string --- but the string is not valid UTF-8?

  libhackerlab should be modified to use a 22-bit extended version of
  Unicode.  If the high-order bit, bit 21, of a character is set,
  then the character is a "bogus" character.

  For example, when scanning a UTF-8 string, an illegal byte is
  interpreted to represent a bogus character -- the scan returns that
  byte ORed with (1<<21).  (In the current code it returns a generic
  substitution character but the addition of "bogus characters" will
  allow a scan to work on ill-formed strings without discarding
  information.)

  The encoding `uni_bogus32' is used for `t_udstr' strings that
  contain bogus characters.  In this encoding, each character is
  represented by 23 bits.

  If bits 21 and 22 are 0, then the character is a normal Unicode
  codepoint.

  If bit 21 is 1, then the character is a "bogus character".

  The use for bit 22 is explained further below.
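
  To make the bogus-character idea concrete, here is a minimal sketch
  of a UTF-8 scan that keeps illegal bytes instead of substituting
  for them.  The names `BOGUS_BIT', `is_bogus' and `scan_utf8' are
  made up for this example and are not the existing libhackerlab
  scanner; the policy of consuming exactly one byte per illegal
  sequence is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define BOGUS_BIT ((uint32_t)1 << 21)   /* bit 21 marks a bogus character */

static int
is_bogus (uint32_t c)
{
  return (c & BOGUS_BIT) != 0;
}

/* Hypothetical scanner: decodes one character starting at s[*pos]
 * (the caller guarantees *pos < len) and advances *pos.  An illegal
 * byte is consumed on its own and returned ORed with BOGUS_BIT, so
 * no information is lost.
 */
static uint32_t
scan_utf8 (const unsigned char * s, size_t len, size_t * pos)
{
  /* smallest code point that legitimately needs 0..3 continuation bytes */
  static const uint32_t min_value[] = { 0, 0x80, 0x800, 0x10000 };
  unsigned char b = s[*pos];
  uint32_t c;
  size_t need, i;

  if (b < 0x80)
    {
      *pos += 1;
      return b;
    }
  else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; need = 1; }
  else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; need = 2; }
  else if ((b & 0xF8) == 0xF0) { c = b & 0x07; need = 3; }
  else
    {
      /* stray continuation byte or invalid lead byte */
      *pos += 1;
      return b | BOGUS_BIT;
    }

  for (i = 1; i <= need; ++i)
    {
      if (*pos + i >= len || (s[*pos + i] & 0xC0) != 0x80)
        {
          /* truncated or malformed sequence: the lead byte is bogus */
          *pos += 1;
          return b | BOGUS_BIT;
        }
      c = (c << 6) | (s[*pos + i] & 0x3F);
    }

  /* reject overlong forms, surrogates, and out-of-range values */
  if (c < min_value[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    {
      *pos += 1;
      return b | BOGUS_BIT;
    }

  *pos += need + 1;
  return c;
}
```

  Masking a bogus result with 0xFF recovers the original byte, which
  is what lets an ill-formed string be written back out unchanged.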



* buckybits and 28 bit characters

  A tiny bit of machinery should be moved from Pika into libhackerlab
  for dealing with buckybits.

  Strings (`t_udstr') can contain characters with buckybits set.
  Such strings are represented as `uni_utf32' strings (or as
  `uni_bogus32' if the string also contains bogus characters).



* Character Properties and The Like

  For the purpose of character-property functions, such as:

        /* src/hackerlab/unidata/case-db-inlines.h 
         */
        unidata_to_upper (t_unicode c);

  a character with non-0 buckybits should be treated analogously to
  its buckybit-free version.   In other words, mask out the buckybits,
  compute the function on the underlying codepoint, and then put the
  buckybits back on the result.

  Bogus characters should be treated like unassigned Unicode
  codepoints.   For example, unidata_to_upper (bogus_char) returns
  its argument.
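
  A rough sketch of that rule follows.  Everything here is
  illustrative: `toy_to_upper' is an ASCII-only stand-in for
  `unidata_to_upper', and `BUCKY_MASK' assumes the buckybits sit in
  bits 23..27 of a 28-bit character, which this message does not
  actually specify.

```c
#include <stdint.h>

#define BOGUS_BIT   ((uint32_t)1 << 21)
/* Assumption: buckybits occupy bits 23..27 of a 28-bit character. */
#define BUCKY_MASK  ((uint32_t)0x1F << 23)

/* ASCII-only stand-in for unidata_to_upper. */
static uint32_t
toy_to_upper (uint32_t c)
{
  return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* Hypothetical wrapper: strip the buckybits, map the underlying
 * codepoint, then put the buckybits back.  Bogus characters are
 * returned unchanged, like unassigned codepoints.
 */
static uint32_t
to_upper_with_bucky (uint32_t c)
{
  uint32_t bucky = c & BUCKY_MASK;
  uint32_t base = c & ~BUCKY_MASK;

  if (base & BOGUS_BIT)
    return c;
  return toy_to_upper (base) | bucky;
}
```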




* hackerlab versions of standard Scheme string procedures

  Scheme defines standard procedures like STRING-REF, STRING-SET!,
  STRING-APPEND, etc.

  Analogous functions need to be provided for `t_udstr' -- these will
  be used to implement the Pika functions.

  Note that some of those functions should be `_fw' versions.   For
  example, perhaps libhackerlab should have:

        ustr_set (uni_string str, int coding_value_index, t_unicode value)

  but that isn't what Pika should use.   Pika should use:

        ustr_set_fw (uni_string str, 
                     int coding_value_index, 
                     t_unicode value)

  where the _fw variant will be doing things like promoting a
  `uni_iso8859_1' string to a `uni_utf16' string if it takes 16 bits
  to represent `value'.
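
  The promotion step might look something like the sketch below.
  `struct fw_string' and the `fw_*' functions are hypothetical and
  are not the real `t_udstr' layout or API; the sketch indexes by
  character with a `size_t' and only ever widens, never narrows.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-width string; not the real t_udstr layout. */
struct fw_string
{
  int width;            /* bytes per character: 1, 2, or 4 */
  size_t len;           /* number of characters */
  void * data;          /* malloc'd buffer of len * width bytes */
};

static int
width_for (uint32_t c)
{
  return (c > 0xFFFF) ? 4 : (c > 0xFF) ? 2 : 1;
}

static uint32_t
fw_ref (const struct fw_string * s, size_t i)
{
  switch (s->width)
    {
    case 1:  return ((const uint8_t *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

static void
fw_store (void * data, int width, size_t i, uint32_t c)
{
  switch (width)
    {
    case 1:  ((uint8_t *)data)[i] = (uint8_t)c; break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)c; break;
    default: ((uint32_t *)data)[i] = c; break;
    }
}

/* Stores c at index i, widening the whole string first if c does
 * not fit the current width.  Returns 0 on success, -1 on a bad
 * index or allocation failure.
 */
static int
fw_set (struct fw_string * s, size_t i, uint32_t c)
{
  int need = width_for (c);

  if (i >= s->len)
    return -1;

  if (need > s->width)
    {
      void * wider = malloc (s->len * (size_t)need);
      size_t k;

      if (!wider)
        return -1;
      for (k = 0; k < s->len; ++k)
        fw_store (wider, need, k, fw_ref (s, k));
      free (s->data);
      s->data = wider;
      s->width = need;
    }

  fw_store (s->data, s->width, i, c);
  return 0;
}
```

  Because the buffer can be reallocated, a set operation of this kind
  has to be able to update the string object itself, not just the
  bytes it points to.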



* Bit 22 in `uni_bogus32'

  The routines described above should preserve the following invariant
  for `uni_bogus32' strings:

        If bit 21 is 0 (the character is not bogus), then bit 22 is 0.

        If bit 21 is 1, and there is a character to the left in which 
        bit 22 is 0 or else there is no character to the left, then
        bit 22 is 1.

        Otherwise, bit 22 is 0.


  That implies that in a substring of bogus characters in a
  `uni_bogus32' string, bit 22 alternates between 1 and 0 
  (assuming the string has been manipulated only by the functions
  described above).

  And it implies that in a maximal length substring of bogus
  characters, the first one has bit 22 set to 1.

  STRING-REF (and its hackerlab equivalent) should mask out bit 22.

  STRING-SET! (and its hackerlab equivalent) should update the bit 22s
  of adjacent characters as necessary.
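
  As a sketch of the invariant, assuming a `uni_bogus32' string is
  just an array of 32-bit values: the helper names below are
  hypothetical, and recomputing bit 22 over the whole string is only
  the simplest way to restore the invariant, not necessarily how the
  real routines should do it.

```c
#include <stddef.h>
#include <stdint.h>

#define BOGUS_BIT  ((uint32_t)1 << 21)
#define BIT_22     ((uint32_t)1 << 22)

/* Hypothetical helper: re-establish the bit 22 invariant for the
 * whole string.  Bit 22 is set on a bogus character exactly when
 * there is no character to its left or the character to its left
 * has bit 22 clear; otherwise it is cleared.
 */
static void
fix_bit_22 (uint32_t * s, size_t len)
{
  size_t i;

  for (i = 0; i < len; ++i)
    {
      /* s[i - 1] has already been fixed by the previous iteration */
      int left_clear = (i == 0) || !(s[i - 1] & BIT_22);

      if ((s[i] & BOGUS_BIT) && left_clear)
        s[i] |= BIT_22;
      else
        s[i] &= ~BIT_22;
    }
}

/* What a STRING-REF equivalent would hand back: bit 22 masked out. */
static uint32_t
bogus32_ref (const uint32_t * s, size_t i)
{
  return s[i] & ~BIT_22;
}
```

  Note that changing one character in the middle of a run of bogus
  characters can flip bit 22 on every bogus character to its right in
  that run, so an incremental update has to keep going until it
  reaches a character whose bit does not change.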

  Later, we'll add functions that cause bit 22 to not alternate that
  way (and the updates performed by functions like STRING-SET! should
  take this into account).

  For example, I think we'll want Pika to have a string append
  procedure, distinct from STRING-APPEND, that preserves combining
  character sequence boundaries.

  In other words, suppose that I have a string which is just:

        "A"

  and another string which is the ill-formed combining character
  sequence:

        "\U+0301"

  where that's my way of typing in ASCII a string containing just
  U+0301 ("COMBINING ACUTE ACCENT").

  Now suppose we have a procedure GRAPHEME-LENGTH which reports,
  essentially, the number of combining character sequences in a
  string.  Then:

        (grapheme-length "A") => 1
        (grapheme-length "\U+0301") => 1

  and given:

        (define s (string-append "A" "\U+0301"))

  we get:

        (string-length s) => 2

        (grapheme-length s) => 1

  however, and this is where bit 22 comes in:

        (define s2 (string-grapheme-append "A" "\U+0301"))

        (string-length s2) => 2

        (grapheme-length s2) => 2

        (string=? s s2) => #t

-t