help-gnu-emacs

Re: `write-region' writes different bytes than passed to it?


From: Eli Zaretskii
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 10 Feb 2019 22:05:19 +0200

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sun, 10 Feb 2019 20:06:57 +0100
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
> 
> > > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > > representation of a raw-byte F2.  Raw bytes are always converted to
> > > > their single-byte values on output, regardless of the encoding you
> > > > request.
> > > >
> > >
> > > Is that documented somewhere?
> >
> > Which part(s)?
> 
> All of it? ;)
> Basically, "what is the behavior of write-region".

Like I said, write-region is not relevant here, encoding is.
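
A minimal sketch of the raw-byte point quoted above, using only
encode-coding-string, with no file I/O involved.  It assumes the
behavior described in this thread: a raw byte such as F2 is a single
character, stored internally as the two bytes C1 B2, and emitted as
one byte on encoding.

```elisp
;; Build a multibyte string holding exactly one raw-byte character (#xF2).
(let* ((raw (string-to-multibyte (unibyte-string #xF2)))
       (encoded (encode-coding-string raw 'utf-8)))
  (list (multibyte-string-p raw)      ; is the input multibyte?
        (length raw)                  ; number of characters
        (string-bytes raw)            ; bytes in the internal representation
        (multibyte-string-p encoded)  ; is the encoded result multibyte?
        (append encoded nil)))        ; encoded bytes as a list of integers
;; => (t 1 2 nil (242))   ; 242 is #xF2
```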

> > One technicality before I answer the question: there are no "Unicode
> > scalar values" in Emacs strings and buffers.  The internal
> > representation is a multibyte one, so any non-ASCII character, be it a
> > valid Unicode character or a raw byte, is always stored as a multibyte
> > sequence.  So let's please use a less confusing wording, like
> > "strictly valid UTF-8 sequence" or something to that effect.
> 
> I don't think we should change the terminology. Emacs multibyte
> strings are sequences of integers

No, they are not.  They are sequences of bytes (as evidenced by the
"multibyte" part) which represent sequences of Unicode codepoints.
The latter are scalar integers.  But these scalars are not explicitly
present in the multibyte representation.
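
The distinction can be seen from Lisp.  This is only an illustrative
sketch: the string below has two characters but occupies five bytes
internally, and aref computes the codepoint from those bytes on
access.

```elisp
;; Two characters: U+00E9 (2 bytes internally) and U+20AC (3 bytes internally).
(let ((s (string #xE9 #x20AC)))
  (list (length s)        ; number of characters
        (string-bytes s)  ; number of bytes actually stored
        (aref s 1)))      ; codepoint of the second character
;; => (2 5 8364)   ; 8364 is #x20AC
```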

> > IMO, doing encoding on unibyte strings invokes undefined behavior,
> > since encoding is only defined for multibyte strings.
> 
> That is very unfortunate. Is there any hope we can get out of that situation?

Unlikely.

> But in this question there is never any internal representation

Yes, there is: you have managed to use one of the few loopholes to
create such a byte sequence.

> > Signaling an error in this situation is also not useful, because it
> > turns out many Lisp programs did this kind of thing in the past (Gnus
> > is a notable example), and undoubtedly quite a few still do.
> 
> Well, if the behavior is unspecified, then signaling an error would
> absolutely be a legal (and even expected) behavior.

It's possible, but not useful, so we don't do that.

> I'm not saying that Emacs should necessarily start signaling errors when
> visiting files with invalid UTF-8 sequences. That it degrades
> gracefully in this case is very welcome and user-friendly.
> But visiting a file can't result in a call to write-region with a
> unibyte string, right?

Why not?  Of course it can: imagine that you modify some part of the
file's text that doesn't include raw undecoded bytes, then write the
result to a file.  You will expect that portions of text you didn't
modify remain intact, right?
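
A rough sketch of that scenario with strings standing in for the file
and the buffer (the byte values are made up for illustration): decode
bytes containing an invalid UTF-8 byte, change only the valid part,
and encode again.

```elisp
(let* ((bytes (unibyte-string ?a #xF2 ?b))          ; #xF2 is not valid UTF-8 here
       (text (decode-coding-string bytes 'utf-8))   ; roughly what visiting does
       (modified (concat "X" (substring text 1)))   ; edit only the leading "a"
       (written (encode-coding-string modified 'utf-8)))
  (append written nil))                             ; bytes that would be written
;; => (88 242 98)   ; "X", the untouched raw byte #xF2, "b"
```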

> > > It's important for correctness and for actually describing what 
> > > "encoding" does.
> >
> > So does labeling this as undefined behavior, which is what it is.  We
> > don't really need to describe undefined behavior in detail, because
> > Lisp programs shouldn't do that.
> 
> Rather than describing it in detail, it should be removed. Unspecified
> behavior makes a programming system hard to use and reason about.

It cannot be removed.  Raw bytes that cannot be decoded are a fact of
life, removing them will make Emacs a lame duck.

> > To avoid the undefined behavior, a Lisp program should never try to
> > encode a unibyte string with anything other than no-conversion or
> > raw-text (the latter also allows the application to convert EOL
> > format, if that is desired).  IOW, you should have used either
> > raw-text-unix or no-conversion in your example, not utf-8.
> 
> If Lisp code shouldn't try that, then the encoding functions should
> signal an error on such cases.

Signaling an error is not useful, so Emacs should not do that.

> > You can detect when a string is a unibyte string with
> > multibyte-string-p, if your application needs to handle both unibyte
> > and multibyte strings.  For unibyte strings, use only raw-text or
> > no-conversion.
> 
> I get that, but this is too subtle and nontrivial.

Then try not to write code that could bump into these subtleties.  You
shouldn't need that.
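
For code that does have to accept both kinds of strings, the advice
quoted above could be sketched as follows.  The function name
my-write-string-to-file is hypothetical, and utf-8-unix for multibyte
text is just one possible choice; only the unibyte branch follows
directly from the advice.

```elisp
(defun my-write-string-to-file (string file)
  "Write STRING to FILE, choosing the coding system by string type.
Hypothetical helper: unibyte strings are written verbatim with
`no-conversion'; multibyte strings are encoded as UTF-8."
  (let ((coding-system-for-write
         (if (multibyte-string-p string)
             'utf-8-unix        ; assumption: the caller wants UTF-8 text
           'no-conversion)))    ; bytes go out unchanged
    ;; A string as the first argument makes `write-region' write that
    ;; string instead of buffer text; the non-nil, non-t, non-string
    ;; VISIT argument suppresses the "Wrote file" message.
    (write-region string nil file nil 'silent)))
```

With this, (my-write-string-to-file (unibyte-string #xC1 #xB2) "out.bin")
never asks for utf-8 on a unibyte string, so it stays out of the
undefined-behavior territory discussed here.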

> > AFAIK, all of those heuristics are in the undefined behavior
> > department.  Lisp programs are well advised to stay away from that.
> > If Lisp programs do stay away, they will never need to deal with the
> > complexity and the undocumented corner cases.
> 
> You can't tell programmers to stay away from something.

No, but I can advise them.

> Either it should work as documented or signal an error. Silently
> doing the wrong thing is the worst choice.

It doesn't do the wrong thing, it does the right thing: it stays out
of the hair of programmers who might need to write such stuff
(assuming they know what they are doing).

> > What was the reason you needed to write something like the
> > original snippet?
> 
> I'm writing a function to write an arbitrary string to a file. This
> should be trivial, but as you can see, it isn't.

It wasn't a string, it was a sequence of bytes that cannot be
interpreted as a text string.


