Text copied from *grep* buffer has NUL (0x00) characters
From: R. Diez
Subject: Text copied from *grep* buffer has NUL (0x00) characters
Date: Sun, 9 May 2021 11:19:38 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1
Hi all:
I have been using encoding utf-8-with-signature-dos for years with my main notes.txt file, because it is very portable. Even ancient versions of
Windows Notepad honour the UTF-8 BOM correctly.
Recently, my notes.txt became corrupt a few times. I started seeing ^M characters at the end of each line, and other text editors started complaining
about invalid UTF-8 sequences inside.
I thought my network connection was unreliable, or maybe my local disk, or Emacs had a bug. Restoring the notes.txt file wasn't easy, because it was
not obvious what was wrong with it. I couldn't find a command-line tool that would easily replace any invalid UTF-8 sequences with their hex code
equivalents, but I must admit that I did not actually invest much time looking. After all, I have automated backups.
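For what it is worth, the kind of tool described above can be approximated in a few lines of Python. This is only a minimal sketch, not an existing utility: the function name show_invalid_utf8 is made up for illustration, and it relies on the standard "backslashreplace" error handler, which renders each undecodable byte as a \xNN escape. Note that a NUL byte is valid UTF-8, so it passes through unchanged and is not flagged here.

```python
def show_invalid_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, rendering each undecodable byte as \\xNN.

    Hypothetical helper for illustration only. Valid sequences (including
    a NUL byte, which is valid UTF-8) are decoded normally.
    """
    return data.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    # Sample data with one valid multibyte character and one stray 0xFF byte.
    sample = b"caf\xc3\xa9 ok\r\nbad \xff byte\r\n"
    print(show_invalid_utf8(sample))
```

The output can then be searched for "\x" to locate the damaged spots in the file.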
Yesterday, I remembered exactly what I had done last: I had copied text from
the *grep* buffer after using 'rgrep'.
After some investigation, it turns out Emacs' default "Grep Command" is "grep --color -nH --null -e ", which includes option "--null". This means that
grep is embedding an ASCII NUL character (a binary 0x00) after the filenames.
This is what an rgrep text search occurrence looks like in the *grep* buffer:
./some/file.txt:123:some text line
The first ':' is actually a binary null, but the *grep* buffer hides this fact.
If you copy that text line to an Emacs text file buffer, it then looks like
this:
./some/file.txt^@123:some text line
The ^@ is how Emacs displays the binary null. In my preliminary testing, I could not reproduce the kind of text file "corruption" I had seen
before, but other text editors started complaining again about an invalid UTF-8 sequence or the like.
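Since the ^@ is easy to overlook inside Emacs, one way to check a file from the outside is to scan it for NUL bytes. The following Python sketch assumes the file content has already been read as bytes; find_nul_bytes is a hypothetical name, and the reported columns are byte offsets within the line, not character positions.

```python
from typing import List, Tuple


def find_nul_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (line, column) pairs, both 1-based, for every NUL byte.

    Hypothetical helper for illustration. Lines are split on LF, so with
    DOS line endings the trailing CR simply counts as part of the line.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte == 0:
                hits.append((line_no, col))
    return hits


if __name__ == "__main__":
    # The example line from above, with the NUL that grep --null emits.
    pasted = b"./some/file.txt\x00123:some text line\r\n"
    print(find_nul_bytes(pasted))
```

For a real file, the bytes could come from open(path, "rb").read().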
For example, the MATE Desktop text editor, Pluma, complained about an "incomplete multibyte sequence in input". Pluma refuses to open short files with
embedded NUL characters because it cannot detect the character encoding, or because it claims that it looks like a binary file. The merge tool 'Meld' also
complained about invalid characters.
I would say that Emacs has two issues here:
1) If a text file encoding is utf-8-with-signature-dos, I do not think that it
is a good idea for Emacs to allow binary zeros without any warning.
A character sequence like ^@ is easy to miss in the middle of long text lines,
as it is not coloured in red and does not have any other visible hint.
A 0x00 may well be a valid UTF-8 character, but it is probably going to cause problems in many places. This kind of problem is not new; see also
"modified UTF-8".
I think that I have seen warnings from Emacs before about characters that could not be encoded in the current buffer encoding. I would welcome such a
warning for binary zeros.
2) Copying a character that is displayed as ":" in a *grep* buffer should not silently deliver a NUL character instead. That's just unexpected and prone to
problems down the line.
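To illustrate the behaviour asked for in point 2, here is a small Python sketch of the substitution on already-copied text. It is not how Emacs works internally, and grep_lines_to_text is a hypothetical name; it only assumes that the first NUL on each line is the file name terminator produced by "--null".

```python
def grep_lines_to_text(text: str) -> str:
    """Replace the first NUL on each line with ':'.

    Hypothetical helper for illustration: with grep --null, the first NUL
    on a match line terminates the file name. Lines without a NUL are
    returned unchanged.
    """
    return "\n".join(
        line.replace("\x00", ":", 1) for line in text.split("\n")
    )


if __name__ == "__main__":
    copied = "./some/file.txt\x00123:some text line"
    print(grep_lines_to_text(copied))
```

Only the first NUL per line is replaced, so any further NUL in the matched text itself would remain visible as a sign that the source line really contains one.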
Regards,
rdiez