Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
From: Raymond Toy
Subject: Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 09:45:47 -0700
User-agent: Gnus/5.101 (Gnus v5.10.10) XEmacs/21.5-b34 (darwin)
>>>>> "Matt" == Matt Kaufmann <address@hidden> writes:
Matt> I saw your question and was curious, so I looked into it a bit:
>>> To your knowledge, is there any objection to defining alpha-char-p as
>>> including code-char's >= 128?
Matt> I see that SBCL 1.2.2 is OK with that, for example:
Matt> * (code-char 232)
Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE
Matt> * (alpha-char-p (code-char 232))
Matt> T
Matt> *
Matt> In fact, that alpha-char-p call also returns T in (versions of)
Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.
Try (code-char #xa0). This is the Unicode character NO-BREAK SPACE.
It has no case and presumably would not be alpha-char-p. I think
quite a few characters in the 128-255 range would not be (from cmucl):
(count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char k))))
63
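To see which codes those are rather than just how many, something like the following sketch should work. The name non-alpha-codes is made up for illustration, and the result is implementation-dependent, so I'm not quoting any output here.

```lisp
;; Hypothetical helper (not part of any implementation): collect the
;; character codes in [START, END] whose characters are not alpha-char-p.
;; CODE-CHAR may return NIL for a code with no character; such codes
;; are collected as non-alphabetic too.
(defun non-alpha-codes (start end)
  (loop for k from start to end
        for ch = (code-char k)
        unless (and ch (alpha-char-p ch))
          collect k))

;; Example call; what comes back depends on the Lisp you run it in.
(non-alpha-codes 128 255)
```

In an implementation that follows Unicode for this range, I'd expect #xa0 (160) to show up in that list.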
I think there is some confusion here, at least for me. If gcl uses
8-bit code-units and utf-8 strings, what exactly is (code-char 232)?
You can store that into a utf-8 string, but it won't be
#\latin_small_letter_e_with_grave, because that character is encoded
as two octets in a utf-8 string: 195 168.
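The two-octet encoding mentioned above can be checked by hand with a small encoder. This is only a minimal sketch: utf8-octets is a name invented for this example, it is not anything gcl provides, and it does no validation (surrogates and out-of-range values are not rejected).

```lisp
;; Hypothetical helper: return the UTF-8 encoding of CODE-POINT as a
;; list of octets. Assumes a non-negative integer up to #x10FFFF and
;; performs no validation.
(defun utf8-octets (code-point)
  (cond ((< code-point #x80)
         ;; One octet: plain ASCII.
         (list code-point))
        ((< code-point #x800)
         ;; Two octets: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ash code-point -6))
               (logior #x80 (logand code-point #x3F))))
        ((< code-point #x10000)
         ;; Three octets: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ash code-point -12))
               (logior #x80 (logand (ash code-point -6) #x3F))
               (logior #x80 (logand code-point #x3F))))
        (t
         ;; Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         (list (logior #xF0 (ash code-point -18))
               (logior #x80 (logand (ash code-point -12) #x3F))
               (logior #x80 (logand (ash code-point -6) #x3F))
               (logior #x80 (logand code-point #x3F))))))

(utf8-octets 232)   ; => (195 168)
(utf8-octets #xa0)  ; => (194 160)
```

So with 8-bit chars, the single character with code 232 and the utf-8 text for U+00E8 are different things: the latter occupies two string elements, with codes 195 and 168.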
I think it's perfectly legal for gcl to say everything at or above 128
is alpha-char-p. I think, however, that people will be confused when
they find that such characters can't be stored into a string and
processed correctly as utf-8 without a bit of work.
But perhaps this is simply how 8-bit chars and utf-8 strings have to
work.
I think 16-bit chars with utf-16 or 32-bit chars with utf-32 are far
easier to explain.
K.I.S.S?
--
Ray