Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
From: David Kastrup
Subject: Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 19:41:22 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux)

"Stephen J. Turnbull" <address@hidden> writes:
> Eli Zaretskii writes:
>
> > > Been discussing this elsewhere, and its come to my attention that not
> > > only do all unicode code-points not fit into UTF-16, but all unicode
> > > characters don't fit into unicode code-points :-). Presumably this is
> > > why emacs expanded to 22bits?
> >
> > Not sure what you mean here. All Unicode characters do fit into the
> > Unicode codepoint space. Emacs extends that codepoint space beyond 22
> > bits because it needs to support cultures which don't want unification
> > yet.
>
> I suppose he means grapheme complexes, such as various accented
> characters that can be constructed from composing characters but do
> not have precomposed forms in Unicode. As you say, that's not why
> Emacs extended the code space.
>
> > > Did you consider leaving aref, char-code and code-char alone and writing
> > > unicode functions on top of these, i.e. unicode-length!=length, as
> > > opposed to making aref itself do this translation under the hood,
> > > thereby violating the expectation of O(1) access, (which is certainly
> > > offered in other kinds of arrays, though it is questionable whether real
> > > users actually expect this for strings)?
>
> Actually, originally Emacs allowed you to treat text (buffers and
> strings) either as sequences of characters or arrays of bytes, and
> this was a real bug-breeder (and why XEmacs chose the pain of the
> incompatible separation of integer type from character type).
>
> I'm not sure if the feature is present in modern Emacs, but at the
> very least the usage is so rare today that I'm unaware of any.

string-as-unibyte and string-as-multibyte most certainly are available
for going from one representation to the other. But the commands that
work on unibyte and on multibyte strings are the same. The same holds
for buffers. I have no idea whether this is a vector for creating
inconsistent multibyte content. I could imagine it to be, but the same
could be said of user-created CCL programs for code conversion.
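
As a rough illustration of the character view versus the byte view of
the same text, and of how byte-level manipulation can leave content that
is no longer a consistent multibyte sequence, here is a minimal sketch.
It is an analogy only: Python's str and bytes stand in for multibyte and
unibyte text, plain UTF-8 stands in for the internal encoding, and the
helper names are hypothetical. It does not use or model the actual Emacs
functions.

```python
# Analogy only: str ~ multibyte (character) view, bytes ~ unibyte (byte) view.
# Helper names are hypothetical and not part of any Emacs or GCL API.

def char_and_byte_lengths(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


def is_consistent_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "na\u00efve"          # five characters; U+00EF takes two bytes in UTF-8
raw = sample.encode("utf-8")   # byte view of the same text
truncated = raw[:3]            # byte-level slice that cuts U+00EF in half

print(char_and_byte_lengths(sample))
print(is_consistent_utf8(raw))
print(is_consistent_utf8(truncated))
```

The point of the sketch is only that the same operations (slicing,
taking a length) mean different things depending on which view you hold,
so an operation that is harmless on bytes can produce an invalid
sequence once the result is reinterpreted as characters.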
--
David Kastrup