Philipp Stephani <address@hidden> wrote on Sun, 22 Nov 2015 at 10:25:
Eli Zaretskii <address@hidden> wrote on Sat, 21 Nov 2015 at 14:23:
> From: Philipp Stephani <address@hidden>
> Date: Sat, 21 Nov 2015 12:11:45 +0000
> Cc: address@hidden, address@hidden, address@hidden
>
> No, we cannot, or rather should not. It is unreasonable to expect
> external modules to know the intricacies of the internal
> representation. Most Emacs hackers don't.
>
> Fine with me, but how would we then represent Emacs strings that are not valid
> Unicode strings? Just raise an error?
No need to raise an error. Strings that are returned to modules
should be encoded into UTF-8. That encoding already takes care of
these situations: it either produces the UTF-8 encoding of the
equivalent Unicode characters, or outputs raw bytes.
Then we should document this situation and give module authors a way to detect it. For example, what happens if a sequence of such raw bytes happens to be a valid UTF-8 sequence? Is there a way for module code to detect this?
I've thought a bit more about this issue, and in the following I'll attempt to derive the desired behavior from first principles without referring to internal Emacs functions.
- There are two sets of functions for creating and reading strings, unibyte and multibyte. If a string of the wrong type is passed, a signal is raised. This way the two types are clearly separated.
- The behavior of the unibyte API is uncontroversial and has no failure modes apart from generic ones such as wrong type, argument out of range, OOM.
- The multibyte API should use an extension of UTF-8 to encode Emacs strings. The extension is the obvious one already in use in multiple places.
- There should be a one-to-one mapping between Emacs multibyte strings and encoded module API strings. Therefore non-shortest forms, illegal code unit sequences, and code unit sequences that would encode values outside the range of Emacs characters are illegal and raise a signal. Likewise, such sequences will never be returned from Emacs.
I think this is a relatively simple and unsurprising approach. It allows encoding the documented Emacs character space while still being fully compatible with UTF-8 and not resorting to undocumented Emacs internals.
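To make the validation rule in the last bullet concrete, here is a minimal sketch in C. It rests on several assumptions that are not stated above: that the "obvious" extension is the original UTF-8 bit layout continued to 5-byte sequences, that the largest Emacs character is 0x3FFFFF (the value of `(max-char)`), and that surrogate code points are accepted because they are part of the Emacs character space. The names `ext_utf8_decode_one` and `ext_utf8_is_valid` are hypothetical and not part of any Emacs module API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: largest Emacs character code, as returned by (max-char). */
#define EXT_UTF8_MAX_CHAR 0x3FFFFFu

/* Decode one character from s[0..len).  On success, store the code point
   in *cp and return the number of bytes consumed (1 to 5).  Return 0 for
   an illegal, truncated, non-shortest, or out-of-range sequence.  */
size_t
ext_utf8_decode_one (const unsigned char *s, size_t len, uint32_t *cp)
{
  /* Smallest code point that genuinely needs n bytes, indexed by n.  */
  static const uint32_t min_for_len[6]
    = { 0, 0, 0x80, 0x800, 0x10000, 0x200000 };
  size_t n, i;
  uint32_t c;

  if (len == 0)
    return 0;

  /* Classify the lead byte and extract its payload bits.  */
  if (s[0] < 0x80)
    { n = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0)
    { n = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0)
    { n = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0)
    { n = 4; c = s[0] & 0x07; }
  else if ((s[0] & 0xFC) == 0xF8)
    { n = 5; c = s[0] & 0x03; }
  else
    return 0;               /* stray continuation byte, or 0xFC..0xFF */

  if (n > len)
    return 0;               /* truncated sequence */

  for (i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;           /* not a continuation byte */
      c = (c << 6) | (s[i] & 0x3F);
    }

  if (c < min_for_len[n])
    return 0;               /* non-shortest form */
  if (c > EXT_UTF8_MAX_CHAR)
    return 0;               /* outside the assumed Emacs character range */

  *cp = c;
  return n;
}

/* Return 1 if s[0..len) consists only of acceptable sequences, else 0.  */
int
ext_utf8_is_valid (const unsigned char *s, size_t len)
{
  size_t pos = 0;

  assert (s != NULL || len == 0);
  while (pos < len)
    {
      uint32_t cp;
      size_t n = ext_utf8_decode_one (s + pos, len - pos, &cp);
      if (n == 0)
        return 0;
      pos += n;
    }
  return 1;
}
```

Under these assumptions, every character in 0..0x3FFFFF has exactly one accepted byte sequence, which is what the one-to-one mapping requires. If raw bytes are represented as the characters 0x3FFF80..0x3FFFFF, they would arrive as 5-byte sequences and could not be confused with ordinary UTF-8 text.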