
Re: `write-region' writes different bytes than passed to it?


From: Eli Zaretskii
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 23 Dec 2018 17:20:32 +0200

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 22 Dec 2018 23:58:07 +0100
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
> 
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2.  Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
> >
> 
> Is that documented somewhere?

Which part(s)?

> Or, in other words, what are the semantics of
> 
> (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
> 
> ?
>
> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.
> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range
> 4. STRING is a multibyte string with at least one element that is not
> a Unicode scalar value

You are actually asking what code conversion does in these cases, so
let's limit the discussion to that part.  write-region is not really
relevant here.
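
For concreteness, the same question can be stated in terms of
encode-coding-string, which is (roughly) the operation underneath.  A
minimal sketch of one of the well-defined cases (your case 2); the
literal and the result noted in the comment are illustrative:

  ;; A multibyte string of valid characters encodes to its UTF-8
  ;; bytes, returned as a unibyte string.
  (encode-coding-string "ä" 'utf-8-unix)   ; should yield "\303\244" (unibyte)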

One technicality before I answer the question: there are no "Unicode
scalar values" in Emacs strings and buffers.  The internal
representation is a multibyte one, so any non-ASCII character, be it a
valid Unicode character or a raw byte, is always stored as a multibyte
sequence.  So let's please use a less confusing wording, like
"strictly valid UTF-8 sequence" or something to that effect.

> My example is an instance of (3). I admit I haven't read the entire
> Emacs Lisp reference manual, but quite some parts of it, and I
> couldn't find a description of the cases (3) and (4). Naively there
> are a couple options:
> - Signal an error. That would seem appropriate as such strings can't
> be encoded as UTF-8. However, evidently Emacs doesn't do this.
> - For case 3, write the bytes in STRING, ignoring the coding system. I
> had expected this to happen, but apparently it isn't the case either.

IMO, doing encoding on unibyte strings invokes undefined behavior,
since encoding is only defined for multibyte strings.  Admittedly, we
don't say that explicitly (we could if that's deemed important), but
the entire description in "Coding System Basics" makes no sense
without this assumption, and even hints at it indirectly:

     The coding system ‘raw-text’ is special in that it prevents character
  code conversion, and causes the buffer visited with this coding system
  to be a unibyte buffer.  For historical reasons, you can save both
  unibyte and multibyte text with this coding system.

The last sentence implicitly tells you that coding systems other than
raw-text (with the exception of no-conversion, described in the very
next paragraph) can only be meaningfully used when writing multibyte
text.

Since this is undefined behavior, Emacs can do anything that best
suits the relevant use cases.  What it actually does is convert raw
bytes from their internal two-byte representation to a single byte.
Emacs jumps through many hoops to avoid exposing the internal
multibyte representation of raw bytes outside of buffers and strings,
and this is one of those hoops.  That's because exposing that internal
representation is considered to be corruption of the original byte
stream, and is not generally useful.
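
In concrete terms, and hedging a bit on the details (the file name
below is only for illustration), that means a snippet like the
following should end up writing the original single byte F2, not the
internal pair C1 B2:

  ;; The string contains one raw byte, F2, stored internally as C1 B2.
  (let ((coding-system-for-write 'utf-8-unix))
    (write-region (string-to-multibyte "\xF2") nil "/tmp/raw-byte.bin"))
  ;; Expected file contents: the single byte F2.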

Signaling an error in this situation is also not useful, because it
turns out many Lisp programs did this kind of thing in the past (Gnus
is a notable example), and undoubtedly quite a few still do.

Emacs handles this case the way it does because many years of bitter
experience have taught us that this best suits the use cases we want
to support.  In particular, signaling errors when encountering invalid
UTF-8 sequences is a bad idea in a text-editing application, where
users expect an arbitrary byte stream to pass unscathed from input to
output.  This is why Emacs is decades ahead of other similar systems,
such as Guile, which still throw exceptions in such cases (and claim
that they are "correct").

> > I'm not sure that single use case is important enough to change
> > something that was working like that since Emacs 23.  Who knows how
> > many more important use cases this will break?
>
> It's important for correctness and for actually describing what "encoding" 
> does.

So does labeling this as undefined behavior, which is what it is.  We
don't really need to describe undefined behavior in detail, because
Lisp programs shouldn't do that.

> Do we expect users to explicitly put the byte sequences for the
> (undocumented) internal encoding into unibyte strings? Shouldn't we
> rather expect that users want to write unibyte strings as is, in all
> cases?

To avoid the undefined behavior, a Lisp program should never try to
encode a unibyte string with anything other than no-conversion or
raw-text (the latter also allows the application to convert EOL
format, if that is desired).  IOW, you should have used either
raw-text-unix or no-conversion in your example, not utf-8.
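
Something along these lines, that is (a sketch; the file name is
illustrative):

  ;; Write a unibyte string byte-for-byte.
  (let ((coding-system-for-write 'raw-text-unix))
    (write-region "\xC1\xB2" nil "/tmp/bytes.bin"))
  ;; Expected file contents: exactly the two bytes C1 B2.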

> > Oh, indeed, especially since it sounds to me like the problem is in the
> > original code (if you don't want to change bytes, then use a `binary`
> > encoding rather than utf-8).
> 
> That wouldn't work with multibyte strings, right? Because they need to
> be encoded.

You can detect when a string is a unibyte string with
multibyte-string-p, if your application needs to handle both unibyte
and multibyte strings.  For unibyte strings, use only raw-text or
no-conversion.
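
For instance, a hypothetical wrapper along these lines (the name, and
the choice of utf-8 for the multibyte branch, are assumptions made for
the sake of the sketch):

  (defun my-write-string (string file)
    "Write STRING to FILE, choosing a coding system by its type."
    ;; Unibyte strings go out byte-for-byte; multibyte strings are
    ;; encoded as UTF-8 (an assumption of this sketch, not a rule).
    (let ((coding-system-for-write
           (if (multibyte-string-p string) 'utf-8-unix 'raw-text-unix)))
      (write-region string nil file)))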

> > Exactly: I think we should try and get rid of those heuristics
> > (progressively).  Actually, it's already what we've been doing since
> > Emacs-20, tho "lately" the progression in this respect has slowed
> > down I think.
> 
> I'd definitely welcome any simplification in this area. There seems to
> be a lot of incidental complexity and undocumented corner cases here.

AFAIK, all of those heuristics are in the undefined behavior
department.  Lisp programs are well advised to stay away from that.
If Lisp programs do stay away, they will never need to deal with the
complexity and the undocumented corner cases.

We keep the current behavior for backward compatibility, and for this
reason I would suggest avoiding changes in this area unless we have a
very good reason for a change.  What was the reason you needed to
write something like the original snippet?


