
Re: [Gcl-devel] utf8 and emacs text/string multibyte representation


From: Matt Kaufmann
Subject: Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
Date: Sat, 1 Nov 2014 12:37:16 -0500

Hi --

I think you and Camm know more about this than I do, but to answer
your question, below is what I get in GCL 2.6.12.  One caveat: I don't
know how mailers handle high characters of the sort GCL printed in the
output from (string (code-char 232)) below.  GCL printed that string
using a single character, but here I show it as four characters (which
look just like the one-character version).

>(code-char 232)

#\\350

>(string (code-char 232))

"\350"

>

Interestingly, your (count nil (loop ...)) form also evaluates to 63
in CCL, CLISP, and SBCL, but it evaluates to 32 in Allegro CL and 66
in LispWorks.  It seems to me that the HyperSpec documentation allows
for these differences.
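
As a cross-check from outside Lisp, here is a minimal Python sketch of
the same count.  It assumes that Python's str.isalpha (which follows
the Unicode letter categories) is a fair stand-in for alpha-char-p;
that is only an assumption, and no Lisp is obliged to agree with it.
The helper name count_non_alpha is made up for this illustration.

```python
def count_non_alpha(lo=128, hi=255):
    """Count code points in [lo, hi] that str.isalpha() does not
    classify as letters (an assumed analogue of alpha-char-p)."""
    return sum(1 for k in range(lo, hi + 1) if not chr(k).isalpha())


# Same range as the (count nil (loop ...)) form: 128 through 255.
print(count_non_alpha())

# Code 232 (e with grave) is a letter under this definition.
print(chr(232).isalpha())
```

With a current Python 3 this prints 63 and True, so a purely
Unicode-based definition agrees with the CCL, CLISP, and SBCL result.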

I've pasted in an sbcl log below in case it's illuminating somehow.

sloth:~% sbcl
This is SBCL 1.2.2, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.
* sb-impl::*default-external-format*

:UTF-8
* (code-char 232)

#\LATIN_SMALL_LETTER_E_WITH_GRAVE
* (string (code-char 232))

"è"
* (length *)

1
* (setq sb-impl::*default-external-format* :iso-8859-1)

:ISO-8859-1
* (code-char 232)

#\LATIN_SMALL_LETTER_E_WITH_GRAVE
* (string (code-char 232))

"è"
* (length *)

1
* 
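
In the log above, the character and the string length are the same
under both external formats, which suggests the external format only
matters when the string is turned into octets.  The sketch below
shows that distinction in Python, used here only as a neutral
illustration; it says nothing about GCL or SBCL internals, and the
helper name octets is hypothetical.

```python
def octets(text, encoding):
    """Return the list of byte values produced by encoding text."""
    return list(text.encode(encoding))


e_grave = chr(232)                     # one character, code point 232
print(len(e_grave))                    # length counted in characters
print(octets(e_grave, "utf-8"))        # two octets in UTF-8
print(octets(e_grave, "iso-8859-1"))   # one octet in ISO-8859-1

# The single octet 232 is not a complete UTF-8 sequence by itself.
try:
    bytes([232]).decode("utf-8")
except UnicodeDecodeError as err:
    print("octet 232 alone is not valid UTF-8:", err.reason)
```

The UTF-8 result is [195, 168] and the ISO-8859-1 result is [232].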

-- Matt
   From: Raymond Toy <address@hidden>
   Date: Sat, 01 Nov 2014 09:45:47 -0700

   >>>>> "Matt" == Matt Kaufmann <address@hidden> writes:

       Matt> I saw your question and was curious, so I looked into it a bit:
       >>> To your knowledge, is there any objection to defining alpha-char-p as
       >>> including code-char's >= 128?

       Matt> I see that SBCL 1.2.2 is OK with that, for example:

       Matt> * (code-char 232)

       Matt> #\LATIN_SMALL_LETTER_E_WITH_GRAVE
       Matt> * (alpha-char-p (code-char 232))

       Matt> T
       Matt> * 

       Matt> In fact, that alpha-char-p call also returns T in (versions of)
       Matt> Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.

   Try (code-char #xa0). This is the unicode character
   no-break-space. This has no case and would presumably not be
   alpha-char-p. I think there are quite a few characters that would not
   be (from cmucl):

   (count nil (loop for k from 128 upto 255 collect (alpha-char-p (code-char k))))
   63


   I think there is some confusion here, at least for me. If gcl uses
   8-bit code-units and utf-8 strings, what exactly is (code-char 232)? 
   You can store that into a utf-8 string but it won't be
   #\latin_small_letter_e_with_grave because that would be encoded as two
   octets in a utf-8 string: 195 168.

   I think it's perfectly legal for gcl to say everything above 128 is
   alpha-char-p. I think, however, that people will just get confused
   that no such characters can be stored into a string and processed
   correctly as utf-8 without a bit of work.

   But perhaps this is just how 8-bit chars and utf-8 strings have to
   work.

   I think 16-bit chars with utf-16 or 32-bit chars with utf-32 are far
   easier to explain.

   K.I.S.S?

   --
   Ray