Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
From: Matt Kaufmann
Date: Sat, 1 Nov 2014 12:37:16 -0500
Hi --
I think you and Camm know more about this than I do, but to answer
your question, below is what I get in GCL 2.6.12. Except, I don't
know how mailers handle high characters of the sort GCL printed in the
output from (string (code-char 232)) below, so although that string
was printed using a single character, here I show it as four
characters (that visually appear just like the one-character version).
>(code-char 232)
#\\350
>(string (code-char 232))
"\350"
>
Interestingly, your (count nil (loop ...)) form also evaluates to 63
in CCL, CLISP, and SBCL, but it evaluates to 32 in Allegro CL and 66
in LispWorks. It seems to me that the HyperSpec documentation allows
for these differences.
I've pasted in an sbcl log below in case it's illuminating somehow.
sloth:~% sbcl
This is SBCL 1.2.2, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* sb-impl::*default-external-format*
:UTF-8
* (code-char 232)
#\LATIN_SMALL_LETTER_E_WITH_GRAVE
* (string (code-char 232))
"è"
* (length *)
1
* (setq sb-impl::*default-external-format* :iso-8859-1)
:ISO-8859-1
* (code-char 232)
#\LATIN_SMALL_LETTER_E_WITH_GRAVE
* (string (code-char 232))
"è"
* (length *)
1
*
-- Matt
From: Raymond Toy <address@hidden>
Date: Sat, 01 Nov 2014 09:45:47 -0700
>>>>> "Matt" == Matt Kaufmann <address@hidden> writes:
Matt> I saw your question and was curious, so I looked into it a bit:
>>> To your knowledge, is there any objection to defining alpha-char-p as
>>> including code-char's >= 128?
Matt> I see that SBCL 1.2.2 is OK with that, for example:
Matt> * (code-char 232)
Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE
Matt> * (alpha-char-p (code-char 232))
Matt> T
Matt> *
Matt> In fact, that alpha-char-p call also returns T in (versions of)
Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.
Try (code-char #xa0). This is the Unicode character
no-break-space. This has no case and would presumably not be
alpha-char-p. I think there are quite a few characters that would not
be (from cmucl):
(count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char k))))
63
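[Editorial sketch, not part of the original message.] Since the count differs between implementations, listing the codes themselves can show where they disagree. The helper below is a minimal portable sketch; the name non-alpha-codes is hypothetical, and the result depends on the implementation it runs in.

```lisp
;; Hypothetical helper: collect the codes in [LO, HI] whose character
;; is not ALPHA-CHAR-P in the running implementation.
(defun non-alpha-codes (&optional (lo 128) (hi 255))
  (loop for k from lo upto hi
        ;; Guard against codes outside the implementation's range;
        ;; CODE-CHAR may also return NIL if no such character exists.
        for c = (and (< k char-code-limit) (code-char k))
        unless (and c (alpha-char-p c))
          collect k))

;; Same number as the (count nil (loop ...)) form above.
(length (non-alpha-codes))
```

Comparing the lists returned by two implementations (for example with set-difference) shows exactly which codes they classify differently.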
I think there is some confusion here, at least for me. If gcl uses
8-bit code-units and utf-8 strings, what exactly is (code-char 232)?
You can store that into a utf-8 string but it won't be
#\latin_small_letter_e_with_grave because that would be encoded as two
octets in a utf-8 string: 195 168.
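[Editorial sketch, not part of the original message.] The two-octet claim can be checked with a small hand-written encoder. This is a minimal sketch assuming the argument is a valid Unicode code point; utf8-octets is a hypothetical name, not a GCL or CMUCL function, and it does not reject surrogates or out-of-range values.

```lisp
;; Hypothetical helper: return the UTF-8 octets for a code point as a list.
(defun utf8-octets (code)
  (cond ((< code #x80)                       ; 1 octet: 0xxxxxxx
         (list code))
        ((< code #x800)                      ; 2 octets: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ash code -6))
               (logior #x80 (logand code #x3F))))
        ((< code #x10000)                    ; 3 octets
         (list (logior #xE0 (ash code -12))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))
        (t                                   ; 4 octets
         (list (logior #xF0 (ash code -18))
               (logior #x80 (logand (ash code -12) #x3F))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))))

(utf8-octets 232)   ; => (195 168)
(utf8-octets #xa0)  ; => (194 160)
```

So with 8-bit characters, a utf-8 string holding this letter has length 2, and neither element is the character with code 232.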
I think it's perfectly legal for gcl to say everything above 128 is
alpha-char-p. I think, however, that people will just get confused
that no such characters can be stored into a string and processed
correctly as utf-8 without a bit of work.
But perhaps this is just how 8-bit chars and utf-8 strings have
to work.
I think 16-bit chars with utf-16 or 32-bit chars with utf-32 are far
easier to explain.
K.I.S.S?
--
Ray