Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
From: Camm Maguire
Date: Sat, 01 Nov 2014 15:26:44 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
Greetings, and thanks so much! I think we are converging...
1) The proposal under consideration is due to Carl: gcl's lisp
character would still be governed by char-code-limit==256, i.e. be
equivalent to a uint8_t, and aref/aset would work the same for all
types of arrays. This lisp character has no correspondence to a unicode
character other than the overlap in the ascii range. gcl would then
provide, on top of these primitives, something like (unichar s i) to
get unicode codepoints from utf8-encoded strings. These accessors are
not random access, but their results can be cached. So
(code-char #xa0) != no-break-space.
2) To the extent that anyone constructs unicode strings from individual
codepoints, the use of these routines will ensure that the utf8 output
is correct. An improperly formatted utf8-encoding will serve as valid
input into aref, but not #'unichar, or whatever we call it.
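To make points 1) and 2) concrete, here is a minimal sketch of what
such an accessor might look like. The name unichar-at and its
two-value return are assumptions for illustration only, not anything
gcl provides, and it works on a vector of octets rather than a gcl
string so that it behaves the same in any Common Lisp. It has to walk
the encoding from a known sequence boundary, which is why it is not
random access, and it signals an error on ill-formed input that aref
would happily return.

```lisp
;; Hypothetical helper, not part of gcl: decode the UTF-8 sequence that
;; starts at index I of OCTETS, a vector of integers in [0,255].
;; Returns two values: the codepoint and the index of the next sequence.
(defun unichar-at (octets i)
  (let* ((b0 (aref octets i))
         ;; Number of continuation octets, taken from the lead octet.
         (n (cond ((< b0 #x80) 0)
                  ((= (logand b0 #xe0) #xc0) 1)
                  ((= (logand b0 #xf0) #xe0) 2)
                  ((= (logand b0 #xf8) #xf0) 3)
                  (t (error "Invalid UTF-8 lead octet ~d at index ~d" b0 i))))
         ;; Payload bits carried by the lead octet itself.
         (code (case n
                 (0 b0)
                 (1 (logand b0 #x1f))
                 (2 (logand b0 #x0f))
                 (t (logand b0 #x07)))))
    (when (>= (+ i n) (length octets))
      (error "Truncated UTF-8 sequence at index ~d" i))
    ;; Each continuation octet must look like 10xxxxxx and adds 6 bits.
    (loop for k from (1+ i) to (+ i n)
          do (let ((b (aref octets k)))
               (unless (= (logand b #xc0) #x80)
                 (error "Invalid UTF-8 continuation octet ~d at index ~d" b k))
               (setq code (logior (ash code 6) (logand b #x3f)))))
    ;; Reject overlong forms, surrogates, and values past U+10FFFF.
    (when (or (< code (aref #(0 #x80 #x800 #x10000) n))
              (<= #xd800 code #xdfff)
              (> code #x10ffff))
      (error "Ill-formed UTF-8 sequence at index ~d" i))
    (values code (+ i n 1))))
```

With this sketch, (unichar-at #(195 168) 0) returns 232 and 2, while
(unichar-at #(232) 0) signals an error, since a lone octet 232
announces a three-octet sequence that never arrives.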
3) There appear to be two meanings of alphabetic in the Hyperspec, which
are not the same. The first is a graphic character with case, the
second is a designator for constituent characters which do not separate
tokens.
alphabetic n., adj. 1. adj. (of a character) being one of the standard
characters A through Z or a through z, or being any
implementation-defined character that has case, or being some other
graphic character defined by the implementation to be
alphabetic[1]. 2. a. n. one of several possible constituent traits of a
character. For details, see Section 2.1.4.1 (Constituent Characters) and
Section 2.2 (Reader Algorithm). b. adj. (of a character) being a
character that has syntax type constituent in the current readtable and
that has the constituent trait alphabetic[2a]. See Figure 2-8.
Defining octets >=128 as alpha-char-p means they are not used as token
separators, at least in gcl's default reader. This makes sense for the
input of a pair of octets representing no-break-space, as presumably
this is a constituent character too. It makes less sense under the
first meaning, which would suggest that each octet of the
no-break-space pair has a distinct counterpart of the opposite case. So
there appears to be a bit of an ambiguity here. Are there any
non-constituent unicode codepoints in the non-ascii range? (Assuming
yes, but probably not important.)
4) I think the dominant consideration here is the most probable forms
of input and output. Files, terminals, even cut-paste from emacs
buffers all transfer valid utf8-encoded byte sequences into GCL, which
GCL then interns, prints, and string-compares correctly. Asking the
unusual user who might want to set strings directly via their unicode
codepoints to use a setf on unichar instead of aref, or better yet a
unicode-char which outputs a string for concatenation, seems a small
price to pay.
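As a rough illustration of the encoding direction, the sketch below
turns one codepoint into its utf8 octets. The name unicode-octets is
hypothetical, and it returns a list of integers rather than a string to
stay portable across implementations; under the proposal above, the
analogous gcl routine would presumably return the same octets as lisp
characters in a string ready for concatenation.

```lisp
;; Hypothetical helper, not part of gcl: return the UTF-8 encoding of
;; codepoint CODE as a list of octets (integers in [0,255]).
(defun unicode-octets (code)
  (flet ((tail (shift)
           ;; Continuation octet 10xxxxxx carrying 6 bits of CODE.
           (logior #x80 (logand (ash code (- shift)) #x3f))))
    (cond ((or (< code 0) (> code #x10ffff) (<= #xd800 code #xdfff))
           (error "~d is not a Unicode scalar value" code))
          ;; ASCII range: one octet, identical to the codepoint.
          ((< code #x80) (list code))
          ;; Up to U+07FF: two octets.
          ((< code #x800)
           (list (logior #xc0 (ash code -6)) (tail 0)))
          ;; Up to U+FFFF: three octets.
          ((< code #x10000)
           (list (logior #xe0 (ash code -12)) (tail 6) (tail 0)))
          ;; Up to U+10FFFF: four octets.
          (t
           (list (logior #xf0 (ash code -18)) (tail 12) (tail 6) (tail 0))))))
```

Here (unicode-octets 232) gives (195 168) and (unicode-octets #xa0)
gives (194 160), so every octet outside the ascii range is >=128 and
none of them on its own means what code-char of the codepoint would
suggest.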
Just thoughts...
Take care,
Raymond Toy <address@hidden> writes:
>>>>>> "Matt" == Matt Kaufmann <address@hidden> writes:
>
> Matt> I saw your question and was curious, so I looked into it a bit:
> >>> To your knowledge, is there any objection to defining alpha-char-p as
> >>> including code-char's >= 128?
>
> Matt> I see that SBCL 1.2.2 is OK with that, for example:
>
> Matt> * (code-char 232)
>
> Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE
> Matt> * (alpha-char-p (code-char 232))
>
> Matt> T
> Matt> *
>
> Matt> In fact, that alpha-char-p call also returns T in (versions of)
> Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.
>
> Try (code-char #xa0). This is the unicode character
> no-break-space. This has no case and would presumably not be
> alpha-char-p. I think there are quite a few characters that would not
> be (from cmucl):
>
> (count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char k))))
> 63
>
>
> I think there is some confusion here, at least for me. If gcl uses
> 8-bit code-units and utf-8 strings, what exactly is (code-char 232)?
> You can store that into a utf-8 string but it won't be
> #\latin_small_letter_e_with_grave because that would be encoded as two
> octets in a utf-8 string: 195 168.
>
> I think it's perfectly legal for gcl to say everything above 128 is
> alpha-char-p. I think, however, that people will just get confused
> that no such characters can be stored into a string and processed
> correctly as utf-8 without a bit of work.
>
> But perhaps this is just how 8-bit chars and utf-8 strings have
> to work.
>
> I think 16-bit chars with utf-16 or 32-bit chars with utf-32 are far
> easier to explain.
>
> K.I.S.S?
>
> --
> Ray
--
Camm Maguire address@hidden
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah