help-gnu-emacs

[Solved] RE: Differences between identical strings in Emacs lisp


From: Jürgen Hartmann
Subject: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Tue, 7 Apr 2015 15:55:48 +0200

Thank you Pascal Bourguignon for your explanation:

> ...
> 
>     (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
>     --> (nil t)
> 
> string-equal (and therefore string=) don't ignore the multibyte property
> of a string.

So it's all about the multibyte property?

> You can use:
> 
>     (mapcar 'string-as-unibyte  (list "\xBA" (concat '(#xBA))))
>     --> ("\272" "\302\272")
> 
> to see the difference.

I see: "\xBA" stays as it is--a unibyte string containing the raw byte
\272--while the multibyte string (concat '(#xBA)) gets converted into its
UTF-8 unibyte form.
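
The same difference can be made visible as lists of integers. This is only a
sketch: it assumes that encode-coding-string with the utf-8 coding system is
an acceptable stand-in for string-as-unibyte, which newer Emacs versions
appear to mark as obsolete.

```elisp
;; Sketch: look at both strings as lists of integers.
;; Assumption: `encode-coding-string' with `utf-8' stands in for
;; `string-as-unibyte' on the multibyte string.

;; The unibyte constant holds the single byte 186 (octal 272).
(append "\xBA" nil)
;; --> (186)

;; The multibyte string, encoded as UTF-8, becomes the two bytes
;; 194 186 (octal 302 272).
(append (encode-coding-string (concat '(#xBA)) 'utf-8) nil)
;; --> (194 186)
```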

> Now, it's hard to say how to "solve" this problem, basically, you asked
> for it: "\xBA" is not a valid way to write a string containing masculine
> ordinal.

It seems that one can use "\u00BA" to achieve this in a string constant; it
evaluates to a multibyte string containing the integer 186:

   "\u00BA"
   --> "º"

   (multibyte-string-p "\u00BA")
   --> t

   (append "\u00BA" ())
   --> (186)

I found it very surprising that it is not only the escape sequences
(characters) in the string constant that determine its multibyte property,
but also the other way round: the sequence \x yields
different results depending on the multibyte property of the string constant
it is used in. For example, the constant "\x3FFFBA" is a unibyte string
containing the integer 186:

   "\x3FFFBA"
   --> "\272"

   (multibyte-string-p "\x3FFFBA")
   --> nil

   (append "\x3FFFBA" ())
   --> (186)

The constant "\x3FFFBA Ä", on the other hand, is a multibyte string in which
the sequence \x3FFFBA yields the integer 4194234:

   "\x3FFFBA Ä"
   --> "\272 Ä"

   (multibyte-string-p "\x3FFFBA Ä")
   --> t

   (append "\x3FFFBA Ä" ())
   --> (4194234 32 196)

This seems to be an undocumented feature.
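
The two integers 186 and 4194234 are related: as far as I understand it,
4194234 (#x3FFFBA) is the character that a multibyte string uses to hold the
raw byte #xBA. The following sketch illustrates this with string-to-multibyte
and string-to-unibyte; it is meant as an illustration of the mapping, not as
an explanation of what the reader does internally.

```elisp
;; 4194234 is just another way to write #x3FFFBA.
(= #x3FFFBA 4194234)
;; --> t

;; Converting the unibyte string to multibyte maps the byte 186 to the
;; raw-byte character 4194234 ...
(append (string-to-multibyte "\xBA") nil)
;; --> (4194234)

;; ... and converting back to unibyte restores the original byte.
(append (string-to-unibyte (string-to-multibyte "\xBA")) nil)
;; --> (186)
```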

> I guess you could extract back the bytes, and recreate the string
> correctly:
> 
>     (map 'string 'identity (map 'list 'identity "\xBA"))
>     --> "º"
> 
>     (string= (map 'string 'identity (map 'list 'identity "\xBA"))
>              (concat '(#xBA)))
>     --> t

So reassembling the string by means of map 'string results in a string
containing the same integer as "\xBA", namely 186, but as a multibyte string
with the corresponding interpretation of its contents?
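
The same reassembly can be sketched without map (which needs the cl
package), using only append and concat. The variable name rebuilt is just
chosen for this illustration.

```elisp
;; Sketch: take the integers out of the unibyte string and build a new
;; string from them.  `rebuilt' is a hypothetical name.
(let ((rebuilt (concat (append "\xBA" nil))))
  (list (multibyte-string-p rebuilt)           ; the new string is multibyte
        (string= rebuilt (concat '(#xBA)))     ; equal to the multibyte string
        (string= rebuilt "\xBA")))             ; not equal to the unibyte one
;; --> (t t nil)
```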

In this respect it is interesting to compare another pair of strings: "A" and
(substring "AÄ" 0 1). Both of them contain the same integer, namely 65, and are
printed as "A"--they only differ in their multibyte property: the former is
a unibyte string, the latter multibyte:

   "A"
   --> "A"

   (multibyte-string-p "A")
   --> nil

   (append "A" ())
   --> (65)

and

   (substring "AÄ" 0 1)
   --> "A"

   (multibyte-string-p (substring "AÄ" 0 1))
   --> t

   (append (substring "AÄ" 0 1) ())
   --> (65)

The point is that they compare equal in spite of their different multibyte
property:

   (string= "A" (substring "AÄ" 0 1))
   --> t

So, as you said before: "string-equal (and therefore string=) don't ignore
the multibyte property of a string". But it seems that it is not this
property per se that makes the difference, but the differing interpretation
of the strings' contents as a result of this property.
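
To separate the two notions of equality, one can compare the integer contents
directly and set the result next to that of string=. The function
my-same-codes-p below is a hypothetical helper written for this illustration,
not something Emacs provides.

```elisp
;; `my-same-codes-p' is a hypothetical helper, not an Emacs function.
(defun my-same-codes-p (a b)
  "Return non-nil if strings A and B hold the same sequence of integers."
  (equal (append a nil) (append b nil)))

;; ASCII content: same integers, and string= agrees.
(list (my-same-codes-p "A" (substring "AÄ" 0 1))
      (string= "A" (substring "AÄ" 0 1)))
;; --> (t t)

;; Content 186: same integers, but string= sees different strings.
(list (my-same-codes-p "\xBA" (concat '(#xBA)))
      (string= "\xBA" (concat '(#xBA))))
;; --> (t nil)
```

This fits what newer versions of the lisp manual appear to say about string=:
a unibyte and a multibyte string are equal only if they contain the same
character codes and all of these codes are ASCII.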

> (On the other hand, one might argue that having both unibyte and
> multibyte strings in a lisp implementation is not a good idea, and
> there's the opportunity for a big refactoring and simplification).
>
> ...

At least it makes it hard to keep the concepts clear.

To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
above. They have the same integer content, only differ in their multibyte
property and compare equal.

If we just change their integer values--in both strings alike--from 65 to
186, we get the pair "\xBA" and (concat '(#xBA)) that we also discussed
before. Here, too, the only difference lies in the multibyte property, while
the integer values are the same. But this time the strings compare unequal.

One might say that this is not surprising, because this time the integers are
interpreted as different characters. But this would be in contradiction to
the definition of the term character according to which a character actually
_is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

Do we reach the limit of the definition of what a character is?

But this gets pretty philosophical. For practical purposes you helped me
a lot, and I think that I have a better feeling for this topic now.

Thank you very much.

Jürgen

                                          
