From: Matt Kaufmann
Subject: Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
Date: Sat, 1 Nov 2014 10:18:27 -0500

I saw your question and was curious, so I looked into it a bit:
>> To your knowledge, is there any objection to defining alpha-char-p as
>> including code-char's >= 128?
I see that SBCL 1.2.2 is OK with that, for example:
* (code-char 232)
#\LATIN_SMALL_LETTER_E_WITH_GRAVE
* (alpha-char-p (code-char 232))
T
*
In fact, that alpha-char-p call also returns T in (versions of)
Allegro CL, CCL, CLISP, CMU CL, LispWorks, and SBCL.
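For anyone who wants to repeat the comparison over more than one character, here is a minimal sketch in portable Common Lisp. The function name alphabetic-codes-in-range is hypothetical (it is not part of any implementation), and the result depends entirely on the implementation and its character repertoire, so no expected output is shown.

```lisp
;; Hypothetical helper, for illustration only: collect every character
;; code in [START, END) whose character satisfies ALPHA-CHAR-P.
(defun alphabetic-codes-in-range (start end)
  "Return the codes in [START, END) whose characters are alphabetic."
  (loop for code from start below (min end char-code-limit)
        for ch = (code-char code)   ; CODE-CHAR may return NIL
        when (and ch (alpha-char-p ch))
          collect code))

;; Example call; the answer is implementation-defined for codes >= 128:
;; (alphabetic-codes-in-range 128 256)
```

Running the same call in two implementations and comparing the lists should show where they disagree about characters with codes >= 128.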
Next, I checked the CL HyperSpec
http://www.lispworks.com/documentation/HyperSpec/Body/f_alpha_.htm#alpha-char-p
and found this for alpha-char-p:
    Returns true if character is an alphabetic[1] character; otherwise,
    returns false.
I followed the link to "alphabetic"
http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_a.htm#alphabetic
and found this as the first definition, which seems to justify the
above return value of T.
    adj. (of a character) being one of the standard characters A through
    Z or a through z, or being any implementation-defined character that
    has case, or being some other graphic character defined by the
    implementation to be alphabetic[1].
[By the way, ACL2 has this wrong! So I'm glad you asked. I'll fix
that....]
-- Matt
From: Camm Maguire <address@hidden>
Date: Sat, 01 Nov 2014 10:50:48 -0400
Cc: Raymond Toy <address@hidden>, address@hidden
Greetings!
Carl Shapiro <address@hidden> writes:
> On Fri, Oct 31, 2014 at 11:20 AM, Camm Maguire <address@hidden> wrote:
>
> > It really appears that unicode refers more to a glyph than anything
> > else. If we follow your suggestions, and leave characters 8-bit, aref
> > random O(1) access, is there any utility to providing unicode functions
> > #'glyph-length or some such in a common lisp implementation?
>
> Yes, a Common Lisp character is a UTF-8 code unit. As such, (length "א")
> would return 2 in GCL whereas it returns 1 in CMUCL.
>
> For iterating across strings in ways other than by UTF-8 code unit, you
> will want to provide iterators for iterating by code point, by glyph,
> and so forth.
>
> In theory, something like CL-UNICODE would provide that but I think it's
> really lacking in a number of important ways. GCL being what it is, you
> could link against ICU and use their functions to start with.
>
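The code-unit versus code-point distinction above can be sketched in portable Common Lisp. This is only an illustration, not GCL's or ICU's API: the function names are hypothetical, the input is given as an explicit vector of octets so the sketch does not depend on how an implementation represents characters, and well-formed UTF-8 is assumed (no validation is done).

```lisp
;; All names below are hypothetical helpers for illustration.

(defun utf8-continuation-octet-p (octet)
  "True if OCTET has the form 10xxxxxx, i.e. it continues a sequence."
  (= (logand octet #xC0) #x80))

(defun utf8-code-point-count (octets)
  "Count code points in OCTETS by counting the non-continuation octets."
  (count-if (complement #'utf8-continuation-octet-p) octets))

(defun map-utf8-code-points (function octets)
  "Call FUNCTION on each code point decoded from the vector OCTETS."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    ;; Number of continuation octets that follow LEAD.
                    (extra (cond ((< lead #x80) 0)
                                 ((< lead #xE0) 1)
                                 ((< lead #xF0) 2)
                                 (t 3)))
                    ;; Payload bits of the lead octet.
                    (code (if (zerop extra)
                              lead
                              (logand lead (ash #x3F (- extra))))))
               ;; Each continuation octet contributes six more bits.
               (loop repeat extra
                     do (incf i)
                        (setf code (logior (ash code 6)
                                           (logand (aref octets i) #x3F))))
               (incf i)
               (funcall function code)))))

;; "א" is U+05D0, encoded in UTF-8 as the two octets D7 90.
(defparameter *alef-octets* (vector #xD7 #x90))

;; (length *alef-octets*)                 ; => 2, code units
;; (utf8-code-point-count *alef-octets*)  ; => 1, code points
```

Iterating by glyph (grapheme cluster) needs Unicode property tables and is not attempted here; that is the part a library such as ICU would supply.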
Thanks so much for these tips. They certainly seem to illuminate the
path forward. I can't see how we could do better than ICU.
To your knowledge, is there any objection to defining alpha-char-p as
including code-char's >= 128?
Take care,
--
Camm Maguire address@hidden
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah