help-gnu-emacs

Text copied from *grep* buffer has NUL (0x00) characters


From: R. Diez
Subject: Text copied from *grep* buffer has NUL (0x00) characters
Date: Sun, 9 May 2021 11:19:38 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

Hi all:

I have been using the utf-8-with-signature-dos encoding for years for my main notes.txt file, because it is very portable. Even ancient versions of Windows Notepad honour the UTF-8 BOM correctly.

Recently, my notes.txt became corrupt a few times. I started seeing ^M characters at the end of each line, and other text editors started complaining about invalid UTF-8 sequences inside.

I thought my network connection was unreliable, or maybe my local disk, or Emacs had a bug. Restoring the notes.txt file wasn't easy, because it was not obvious what was wrong with it. I couldn't find a command-line tool that would easily replace any invalid UTF-8 sequences with their hex code equivalents, but I must admit that I did not actually invest much time looking. After all, I have automated backups.
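
For what it is worth, the kind of tool described above can be sketched in a few lines of Python. This is only a minimal sketch, not an existing utility: the function names are hypothetical, and it relies on the standard "backslashreplace" error handler, which turns every undecodable byte into a \xNN escape. Note that it leaves NUL characters alone, because 0x00 is valid UTF-8.

```python
def escape_invalid_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8; each undecodable byte becomes a \\xNN escape."""
    return data.decode("utf-8", errors="backslashreplace")


def escape_file(path: str) -> str:
    """Hypothetical helper: read a file as raw bytes and escape invalid UTF-8."""
    with open(path, "rb") as f:
        return escape_invalid_utf8(f.read())


# Valid text passes through unchanged; stray bytes show up as hex escapes.
print(escape_invalid_utf8(b"ok \xff\xfe end"))
```

Something like escape_file("notes.txt") would then make the offending bytes visible and easy to search for.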

Yesterday, I remembered exactly what I had done last: I had copied text from the *grep* buffer after using 'rgrep'.

After some investigation, it turns out Emacs' default "Grep Command" is "grep --color -nH --null -e ", which includes option "--null". This means that grep is embedding an ASCII NUL character (a binary 0x00) after the filenames.

This is what an rgrep text search occurrence looks like in the *grep* buffer:

./some/file.txt:123:some text line

The first ':' is actually a binary null, but the *grep* buffer hides this fact.

If you copy that text line to an Emacs text file buffer, it then looks like this:

./some/file.txt^@123:some text line

The ^@ is how Emacs displays the binary null. In my preliminary testing, I could not reproduce the kind of text file "corruption" I had seen before, but other text editors started complaining again about an invalid UTF-8 sequence or the like.
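
To find such stray characters after the fact, a small Python sketch like the following could help. The function names are made up for illustration, and the sample string simply mimics the pasted line shown above.

```python
def find_nuls(text: str):
    """Return 1-based (line, column) pairs for every NUL character in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\x00":
                hits.append((lineno, col))
    return hits


def nul_to_colon(text: str) -> str:
    """Replace each NUL with ':' to match what the *grep* buffer displays."""
    return text.replace("\x00", ":")


# The line as it arrives when pasted from the *grep* buffer.
pasted = "./some/file.txt\x00123:some text line\n"
print(find_nuls(pasted))
print(nul_to_colon(pasted), end="")
```

Replacing the NUL with ':' is an assumption about what the user wanted to paste; it matches what the *grep* buffer shows on screen.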

For example, the MATE Desktop text editor, Pluma, complained about an "incomplete multibyte sequence in input". Pluma refuses to open short files with embedded NUL characters because it cannot detect the character encoding, or because it claims that the file looks like a binary file. The merge tool 'Meld' also complained about invalid characters.

I would say that Emacs has two issues here:


1) If a text file encoding is utf-8-with-signature-dos, I do not think that it is a good idea for Emacs to allow binary zeros without any warning.

A character sequence like ^@ is easy to miss in the middle of long text lines, as it is not coloured in red and does not have any other visible hint.

A 0x00 byte may well be valid UTF-8 (it encodes U+0000), but it is probably going to cause problems in many places. This kind of problem is not new; see also "modified UTF-8".

I think that I have seen warnings from Emacs before about characters that could not be encoded in the current buffer encoding. I would welcome such a warning for binary zeros.


2) Copying text from a *grep* buffer that looks like ":" should not suddenly deliver a NUL character instead. That's just unexpected and prone to problems down the line.
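
Regarding the first point, a short Python check illustrates why a NUL slips past any UTF-8 validation, and what "modified UTF-8" does differently. This is just an illustration using Python's strict decoder; the helper names are hypothetical and it says nothing about how Emacs works internally.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if data decodes cleanly with a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def has_nul(data: bytes) -> bool:
    """The extra check a 'warn about binary zeros' feature would need."""
    return b"\x00" in data


# U+0000 encodes to the single byte 0x00, so validation alone never flags it.
line = b"./some/file.txt\x00123:some text line\r\n"
print(is_strict_utf8(line), has_nul(line))

# Modified UTF-8 writes U+0000 as the two bytes 0xC0 0x80 instead, which
# keeps 0x00 out of the data but is rejected by a strict decoder.
print(is_strict_utf8(b"\xc0\x80"))
```

In other words, an encoding check and a NUL check are two separate tests, and only the second one would have caught my problem.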


Regards,
  rdiez

