help-gnu-emacs
RE: [Solved] RE: Differences between identical strings in Emacs lisp


From: Jürgen Hartmann
Subject: RE: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Wed, 8 Apr 2015 13:01:16 +0200

Thank you, Eli Zaretskii, for your explanations:

>> [About mapping between unibyte and multibyte strings]
>>
>> First I thought that some hidden decoding based on some charsets or coding
>> systems occurs.
>
> Actually, some sort of "decoding" does occur, albeit perhaps not in
> the use cases you tried -- Emacs will sometimes silently convert
> unibyte characters to their locale-dependent multibyte equivalents.

On which occasions is such a conversion done? Does this have anything to do
with the charset that is individually defined in language-info-alist for
nearly every language environment?

> This whole area of unibyte strings is replete with dwim-ish hacks and
> kludges, all in an attempt to do what the user expects.  Thus the
> confusion and the advice to stay away from that gray area.

Sounds like the well-known design conflict between "behaving smart" and
"being straight".

>> [About "\x3FFFBA\x3FFFBB\x3FFFBC"]
>
> It's a "unibyte string", which, by definition, contains raw bytes.
>
> But it is actually better to say that the raw bytes there are \272 etc.
> and not \x3FFFBA etc.  The latter is just the representation Emacs uses
> for the former; Emacs goes out of its way not to show that internal
> representation to the user.
>
>> ...
>>
>> Ah, that seems to be the key: raw bytes are not characters.
>
> Exactly.

Great! Lesson learned.

>> [About raw bytes]
>
> They _are_ a special "character set", but only in the very technical
> sense of "character set" in Emacs.  By their nature and their
> properties in Emacs, they are not characters.
>
>> [About characters and raw bytes in unibyte context]
>
> Raw bytes are only those whose value is above 127, so A is a
> character.
>
> For subtle technical reasons (or maybe by some historical accident), a
> pure-ASCII string is a unibyte string, although it contains
> characters, not raw bytes.  So having a unibyte string does not yet
> mean you have raw bytes in it.

It seems that all my related observations that puzzled me before can be well
explained by the strict distinction between characters and raw bytes and the
mapping between the latter's integer representations in the range
[0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
multibyte context.
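
For the record, here is a minimal sketch of that mapping as I understand it.
The helper my-raw-byte-to-eight-bit is a hypothetical name of my own, not an
Emacs function; it assumes the offset between the two ranges is the constant
#x3FFF00, which is what the ranges above imply. The other calls are standard
Emacs functions.

```elisp
;; Hypothetical helper (not part of Emacs): map a raw byte in
;; [#x80..#xFF] to its code in the multibyte range [#x3FFF80..#x3FFFFF].
(defun my-raw-byte-to-eight-bit (byte)
  "Return the multibyte-context code for raw BYTE (128..255)."
  (unless (and (>= byte #x80) (<= byte #xFF))
    (error "Not a raw byte: %S" byte))
  (+ byte #x3FFF00))

;; A pure-ASCII string is unibyte although it holds characters.
(multibyte-string-p "abc")                     ; => nil

;; A unibyte string holding the single raw byte \272 (= #xBA).
(let* ((uni   (unibyte-string #xBA))
       (multi (string-to-multibyte uni)))
  (list (multibyte-string-p uni)               ; => nil
        (= (aref uni 0) #xBA)                  ; => t
        (multibyte-string-p multi)             ; => t
        (= (aref multi 0) #x3FFFBA)            ; => t
        ;; The built-in conversion agrees with the hypothetical helper.
        (= (unibyte-char-to-multibyte #xBA)
           (my-raw-byte-to-eight-bit #xBA))    ; => t
        ;; ...and the mapping is reversible.
        (= (multibyte-char-to-unibyte #x3FFFBA) #xBA)))  ; => t

;; ASCII codes are characters in both contexts and map to themselves.
(unibyte-char-to-multibyte ?A)                 ; => 65
```

So the same raw byte shows up as 0xBA when read from the unibyte string and
as 0x3FFFBA when read from its multibyte counterpart.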

> By far the only valid use case where you need to manipulate unibyte
> strings of raw bytes is if you need to encode or decode strings by
> calling encode-coding-region and its ilk.  E.g., an application that
> needs to send base64-encoded text needs first to encode it using
> whatever coding-system is appropriate, which produces unibyte text
> containing raw bytes, and then call base64-encode-region to produce
> the final result.  And similarly for decoding such stuff.  You will
> see examples of this in Gnus and Rmail, for example.
>
>> So, thank you very much for your enlightening answers.
>
> You are welcome.
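
The encode-then-base64 sequence described above might look roughly like the
following sketch. It uses the string variants encode-coding-string and
base64-encode-string rather than the region functions named in the quote, and
it assumes utf-8 as the default coding system; the two function names
prefixed with my- are hypothetical and only serve the illustration.

```elisp
;; Hypothetical helpers (not part of Emacs).
(defun my-text-to-base64 (text &optional coding)
  "Encode TEXT with CODING (default utf-8), then base64-encode the bytes."
  ;; Step 1: characters -> unibyte string of raw bytes.
  (let ((bytes (encode-coding-string text (or coding 'utf-8))))
    ;; Step 2: raw bytes -> base64 text, without line breaks.
    (base64-encode-string bytes t)))

(defun my-base64-to-text (b64 &optional coding)
  "Reverse of `my-text-to-base64': base64-decode B64, then decode CODING."
  ;; base64-decode-string yields a unibyte string of raw bytes,
  ;; which decode-coding-string turns back into characters.
  (decode-coding-string (base64-decode-string b64) (or coding 'utf-8)))

;; "\u00fc" is u-umlaut; its UTF-8 encoding is the two bytes \303\274.
(my-text-to-base64 "\u00fc")                   ; => "w7w="
(my-base64-to-text "w7w=")                     ; => u-umlaut again
```

The intermediate unibyte string is the only place where raw bytes appear;
everything before the first step and after the last one is ordinary
multibyte text.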

Thank you very much.

Juergen

                                          
