Re: Dynamic loading progress

Philipp Stephani <address@hidden> wrote on Sun, 22 Nov 2015 at 10:25:

Eli Zaretskii <address@hidden> wrote on Sat, 21 Nov 2015 at 14:23:
> From: Philipp Stephani <address@hidden>
> Date: Sat, 21 Nov 2015 12:11:45 +0000
> Cc: address@hidden, address@hidden, address@hidden
>
> No, we cannot, or rather should not. It is unreasonable to expect
> external modules to know the intricacies of the internal
> representation. Most Emacs hackers don't.
>
> Fine with me, but how would we then represent Emacs strings that are not valid
> Unicode strings? Just raise an error?

No need to raise an error. Strings that are returned to modules
should be encoded into UTF-8. That encoding already takes care of
these situations: it either produces the UTF-8 encoding of the
equivalent Unicode characters, or outputs raw bytes.

Then we should document such situations and give module authors a way to detect them. For example, what happens if a sequence of such raw bytes happens to be a valid UTF-8 sequence? Is there a way for module code to detect this situation?

I've thought a bit more about this issue, and in the following I'll attempt to derive the desired behavior from first principles without referring to internal Emacs functions.

There are two kinds of Emacs strings, unibyte and multibyte. https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html and https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Codes.html agree that multibyte strings are sequences of integers (let's avoid the overloaded and vague term "characters") in the range 0 to #x3FFFFF (inclusive). It is also clear that within the subset of that range corresponding to the Unicode codespace, the integers are interpreted as Unicode code points. Given that new APIs should use UTF-8, the following approach looks reasonable to me:

- There are two sets of functions for creating and reading strings, unibyte and multibyte. If a string of the wrong type is passed, a signal is raised. This way the two types are clearly separated.
- The behavior of the unibyte API is uncontroversial and has no failure modes apart from generic ones such as wrong type, argument out of range, OOM.

- The multibyte API should use an extension of UTF-8 to encode Emacs strings. The extension is the obvious one already in use in multiple places.

- There should be a one-to-one mapping between Emacs multibyte strings and encoded module API strings. Therefore non-shortest forms, ill-formed code unit sequences, and code unit sequences that would encode values outside the range of Emacs characters are rejected and raise a signal. Likewise, such sequences will never be returned from Emacs.

I think this is a relatively simple and unsurprising approach. It allows encoding the documented Emacs character space while still being fully compatible with UTF-8 and not resorting to undocumented Emacs internals.
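
To make the multibyte rules above concrete, here is a minimal sketch in Python. It is not Emacs code and not the module API. It assumes that the "obvious" extension means continuing the original UTF-8 bit pattern to five-byte sequences (lead bytes #xF8 to #xFB) so that every integer up to #x3FFFFF has exactly one encoding. The names `encode`, `decode` and `MAX_EMACS_CHAR` are made up for illustration, and `ValueError` stands in for raising a signal. Surrogate code points get no special treatment here, which is also an assumption.

```python
MAX_EMACS_CHAR = 0x3FFFFF  # upper bound of the documented integer range

# (sequence length, lead-byte marker, lead-byte payload mask,
#  smallest value that needs this length). Assumed extension of UTF-8.
_FORMS = [
    (1, 0x00, 0x7F, 0x0),
    (2, 0xC0, 0x1F, 0x80),
    (3, 0xE0, 0x0F, 0x800),
    (4, 0xF0, 0x07, 0x10000),
    (5, 0xF8, 0x03, 0x200000),
]


def encode(ints):
    """Encode a sequence of integers using the shortest form for each."""
    out = bytearray()
    for cp in ints:
        if not 0 <= cp <= MAX_EMACS_CHAR:
            raise ValueError(f"value out of range: {cp:#x}")
        # Walk from the longest form down; the first match is the shortest
        # form able to hold cp.
        for length, marker, _, minimum in reversed(_FORMS):
            if cp >= minimum:
                break
        tail = []
        for _ in range(length - 1):
            tail.append(0x80 | (cp & 0x3F))
            cp >>= 6
        out.append(marker | cp)
        out.extend(reversed(tail))
    return bytes(out)


def decode(data):
    """Decode strictly; anything that encode() cannot produce is rejected."""
    ints = []
    i = 0
    while i < len(data):
        lead = data[i]
        for length, marker, payload, minimum in _FORMS:
            if (lead & ~payload & 0xFF) == marker:
                break
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"ill-formed sequence at offset {i}")
        cp = lead & payload
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"non-shortest form at offset {i}")
        if cp > MAX_EMACS_CHAR:
            raise ValueError(f"value out of range at offset {i}")
        ints.append(cp)
        i += length
    return ints
```

For values inside the Unicode codespace the output is byte-for-byte ordinary UTF-8. Under the stated assumption, `encode([0x3FFFFF])` yields the five bytes F8 8F BF BF BF, which can never collide with the encoding of a Unicode code point, and `decode(b"\xc0\x80")` fails because that is a non-shortest form of 0.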

From:	Philipp Stephani
Subject:	Re: Dynamic loading progress
Date:	Sun, 22 Nov 2015 14:56:12 +0000