help-gnu-emacs

Re: `write-region' writes different bytes than passed to it?


From: Stefan Monnier
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 23 Dec 2018 23:27:52 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.

Not sure what you mean by "unicode scalar values", but a multibyte
string is a sequence of chars, i.e. a sequence of char codes (integers).
And utf-8 is a way to encode a sequence of integer char codes into
a sequence of bytes.

So your sample code will pretty much always write the utf-8
representation of the multibyte string.
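
As a minimal sketch of that claim, the snippet below writes a multibyte
string to a temporary file and reads the raw bytes back.  The helper name
`demo-bytes-written' is hypothetical (it is not part of Emacs), and
`utf-8-unix' is chosen here only to rule out end-of-line conversion.

```elisp
(defun demo-bytes-written (string coding)
  "Write STRING to a temp file under CODING; return the file's bytes as a list.
Hypothetical helper for illustration only."
  (let ((file (make-temp-file "write-region-demo")))
    (unwind-protect
        (progn
          ;; Bind the coding system only around the write itself.
          (let ((coding-system-for-write coding))
            (write-region string nil file nil 'silent))
          ;; Read the file back without any decoding, into a unibyte buffer,
          ;; so that each element of the result is one byte (0..255).
          (with-temp-buffer
            (set-buffer-multibyte nil)
            (insert-file-contents-literally file)
            (append (buffer-string) nil)))
      (delete-file file))))

;; A multibyte string holding the char codes #xE9 and #x20AC.
(demo-bytes-written (string #xE9 #x20AC) 'utf-8-unix)
;; => (195 169 226 130 172), i.e. C3 A9 followed by E2 82 AC
```

The result is the utf-8 encoding of the two integers, independent of what
those char codes are taken to mean.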

[ The only exception is when the multibyte string contains chars in the
  eight-bit charset, because those are supposed to stand for raw bytes.
  This exception is used to make sure that if you read a file using
  the utf-8 coding-system and the file's content is not valid utf-8,
  writing the buffer will still generate the exact same byte sequence.  ]
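
The round trip described above can be sketched with strings instead of
files.  This assumes that `decode-coding-string' and `encode-coding-string'
treat invalid bytes the same way file I/O does; the function name
`demo-utf8-round-trip' is made up for this example.

```elisp
(defun demo-utf8-round-trip (raw)
  "Decode unibyte string RAW as utf-8, re-encode it, and report the result.
Return a list (DECODED-IS-MULTIBYTE FIRST-CHAR-CHARSET ENCODED-BYTES).
Hypothetical helper for illustration only."
  (let* ((decoded (decode-coding-string raw 'utf-8-unix))
         (encoded (encode-coding-string decoded 'utf-8-unix)))
    (list (multibyte-string-p decoded)
          ;; Charset of the first decoded char; for an invalid byte this
          ;; is expected to be `eight-bit'.
          (char-charset (aref decoded 0))
          ;; The re-encoded bytes, as a list of integers.
          (append encoded nil))))

;; The byte \377 (#xFF) can never occur in valid utf-8.
(demo-utf8-round-trip "\377abc")
;; => (t eight-bit (255 97 98 99))
```

The invalid byte survives as a raw-byte char in the decoded string and
comes back out unchanged when encoding.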

> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range

I don't think this case is clearly documented, indeed.

I believe what happens currently is that Emacs looks at the byte
sequence in the unibyte string as if it was the internal representation
of a multibyte string.  Changing behavior (e.g. by simply outputting the
bytes unchanged like I suggested) will likely affect some code out there
somewhere.  I think it'd be a good change, tho, because I think that any
code thus affected is likely buggy and needs to be fixed anyway (and
actually that change might be the fix the code needs).
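
Since this case is not clearly documented, the simplest way to see what a
given Emacs does is to probe it.  The sketch below uses a hypothetical
helper, `demo-unibyte-write-probe', and deliberately shows no expected
output, because the result is exactly the behavior in question and may
differ between Emacs versions.

```elisp
(defun demo-unibyte-write-probe (bytes coding)
  "Write unibyte string BYTES under CODING; return (INPUT-BYTES OUTPUT-BYTES).
Hypothetical helper for illustration only."
  (let ((file (make-temp-file "unibyte-probe")))
    (unwind-protect
        (progn
          (let ((coding-system-for-write coding))
            (write-region bytes nil file nil 'silent))
          ;; Read the file back literally so no decoding hides the bytes.
          (with-temp-buffer
            (set-buffer-multibyte nil)
            (insert-file-contents-literally file)
            (list (append bytes nil)
                  (append (buffer-string) nil))))
      (delete-file file))))

;; "\303\251" is a unibyte string holding the two bytes C3 A9.
;; Compare the two lists in the result to see whether the bytes
;; were written unchanged or reinterpreted.
(demo-unibyte-write-probe "\303\251" 'utf-8-unix)
```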

What makes this question a bit more tricky is that when a string is all
ASCII, Emacs tends to choose rather arbitrarily between unibyte
and multibyte.  But if we decide that coding-system doesn't affect
unibyte strings, then we get into trouble with

    (let ((coding-system-for-write 'ebcdic-int)) (write-region STRING ...))

since for a purely ASCII string, we still need to do a conversion,
so we'd need to be more careful about the distinction between unibyte and
multibyte ASCII strings.
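
To make the unibyte/multibyte ASCII distinction concrete, here is a small
sketch.  The function name `demo-ascii-representations' is hypothetical.
Whether `string-to-multibyte' really returns a multibyte string for pure
ASCII input, and whether the `ebcdic-int' coding system is defined, are
assumptions that may depend on the Emacs version and build, so the code
reports rather than presumes them.

```elisp
(defun demo-ascii-representations ()
  "Describe the unibyte and multibyte forms of the ASCII string \"abc\".
Hypothetical helper for illustration only."
  (let* ((uni "abc")                        ; ASCII-only literal
         (multi (string-to-multibyte uni)))
    (list :uni-multibyte-p (multibyte-string-p uni)
          :multi-multibyte-p (multibyte-string-p multi)
          ;; The two strings compare equal whatever their representation.
          :string= (string= uni multi)
          ;; Encode both forms only if the coding system is available.
          :ebcdic (when (coding-system-p 'ebcdic-int)
                    (list (append (encode-coding-string uni 'ebcdic-int) nil)
                          (append (encode-coding-string multi 'ebcdic-int) nil))))))

(demo-ascii-representations)
```

If the two byte lists under `:ebcdic' are identical and differ from
(97 98 99), then the conversion is applied to ASCII text regardless of
representation, which is the behavior that would have to be preserved.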

Maybe we should just drop support for coding systems that aren't
supersets of ASCII and be done with it, but I'm not sure we're ready to
do that.


        Stefan



