help-gnu-emacs

Re: Text copied from *grep* buffer has NUL (0x00) characters


From: R. Diez
Subject: Re: Text copied from *grep* buffer has NUL (0x00) characters
Date: Sun, 9 May 2021 23:13:36 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1


EZ> That's not the same.  the warning you saw is triggered by a failure to
EZ> convert to the external encoding, so it consumes no extra CPU cycles.

But it could be, from my (admittedly naive) point of view:

(convert-to-external-encoding  but-with-some-extra-flag-to-warn-about-NUL-chars)
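
As a rough illustration of that idea, here is a minimal sketch in Python rather than Emacs Lisp. The function name encode_external and the flag warn_on_nul are made up for this example and are not Emacs functions or options. The check is one extra linear pass over the text, so it is not free, but it only runs when the flag is set.

```python
import warnings


def encode_external(text, encoding="utf-8", warn_on_nul=False):
    """Encode text for writing to a file, optionally warning about NUL chars.

    Hypothetical helper for illustration only; it does not mirror any
    actual Emacs function.
    """
    if warn_on_nul:
        # str.find returns -1 when the text contains no NUL character.
        index = text.find("\x00")
        if index != -1:
            warnings.warn(f"text contains a NUL character at index {index}")
    # The conversion itself is unchanged; NUL is a valid character and
    # encodes without error.
    return text.encode(encoding)
```

With the flag off, the function behaves like a plain conversion; with it on, the caller gets a warning but the data is still written as is.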


EZ> Null bytes will not fail anything, so you should test for them
EZ> explicitly (and in some encodings, like UTF-16, they are necessary and
EZ> cannot be avoided).

I didn't know that about UTF-16, and I could not find any information about it
either. Why is a NUL char necessary in UTF-16 but not in UTF-8?

Or do you mean that UTF-16 text tends to have many interleaved zero bytes? In that case, I would have thought that the problem would be the 16-bit NUL character, I mean 0x0000. That is the character to watch out for in UTF-16.
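
The difference between zero bytes and the 16-bit NUL character can be shown with a small Python sketch. The helper name utf16_nul_code_units is made up for this example, and it assumes UTF-16 data without a BOM and with a known byte order.

```python
def utf16_nul_code_units(data, byteorder="little"):
    """Count 16-bit code units equal to 0x0000 in BOM-less UTF-16 data.

    Hypothetical helper for illustration; byteorder must match the data.
    """
    units = [
        int.from_bytes(data[i:i + 2], byteorder)
        for i in range(0, len(data) - 1, 2)
    ]
    return units.count(0)


plain = "AB".encode("utf-16-le")         # b"A\x00B\x00"
with_nul = "A\x00B".encode("utf-16-le")  # b"A\x00\x00\x00B\x00"

# Every ASCII character has a zero high byte in UTF-16, so zero bytes
# show up even though the text contains no NUL character.
print(plain.count(0), utf16_nul_code_units(plain))        # 2 0
print(with_nul.count(0), utf16_nul_code_units(with_nul))  # 4 1

# The same text in UTF-8 has no zero bytes at all.
print("AB".encode("utf-8").count(0))                      # 0
```

So a byte-level search for 0x00 reports false positives on UTF-16 data; only aligned 0x0000 code units correspond to an actual NUL character.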

Encodings like UTF-16, which always need more than one byte per character, are uncommon and won't work with many text editors or tools like 'grep', and most people will expect problems with them anyway. So I wouldn't worry too much about them.

The NUL char issue (the unexpected problems I talked about), which you are likely to run into sooner or later, will probably only affect the popular, single-byte-oriented formats like ASCII, ISO/IEC 8859-1 and UTF-8.
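
For these formats a plain byte scan is enough, because a 0x00 byte can only stand for the NUL character itself (UTF-8 multi-byte sequences use only bytes of 0x80 and above). A minimal Python sketch, with the made-up helper name find_nul_bytes:

```python
def find_nul_bytes(data):
    """Return (line, column) pairs for every 0x00 byte in the data.

    Lines are 1-based, columns are 0-based byte offsets within the line.
    Hypothetical helper for illustration; assumes "\n" line endings.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line):
            if byte == 0:
                hits.append((line_no, col))
    return hits


print(find_nul_bytes("hello\nwor\x00ld\n".encode("utf-8")))  # [(2, 3)]
print(find_nul_bytes("h\u20acllo \u00e9".encode("utf-8")))   # []
```

Reporting the line and column makes it easy to jump to the stray character and delete it.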


SM> I do think there's a real plain bug here, tho, if you change your
SM> "recipe" to `uft-8` instead `utf-8-with-signature`: take a utf-8 text
SM> file (in a UTF-8 locale), add a NUL byte to it, save, close, and
SM> re-open: you now get a unibyte buffer showing the bytes rather than
SM> the chars.
SM>
SM> Emacs should generally try and warn you when saving a file with a coding
SM> system different than the one it would guess when later re-opening the file.
SM> The problem doesn't show up with `utf-8-with-signature` because
SM> apparently the BOM is given more weight than the NUL byte in determining
SM> which coding system to use.

Thanks for pointing that out.

That is why I think that NUL may be a valid character, perfectly fine in theory, but in practice it easily trips up even Emacs itself. I would therefore make Emacs smarter and have it warn about NUL, either on paste or on save.

There may be one more quirk in this area, because my text file had somehow lost 
the UTF-8 BOM too, and I only edit it with Emacs.

I cannot invest more time into this issue at the moment. I hope these posts 
provide enough information if somebody is interested in the future.

Regards,
  rdiez

