Text copied from *grep* buffer has NUL (0x00) characters

help-gnu-emacs

From:	R. Diez
Subject:	Text copied from grep buffer has NUL (0x00) characters
Date:	Sun, 9 May 2021 11:19:38 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

Hi all:

I have been using the encoding utf-8-with-signature-dos for years with my main notes.txt file, because it is very portable. Even ancient versions of Windows Notepad honour the UTF-8 BOM correctly.

Recently, my notes.txt became corrupt a few times. I started seeing ^M characters at the end of each line, and other text editors started complaining about invalid UTF-8 sequences inside.

I thought my network connection was unreliable, or maybe my local disk, or Emacs had a bug. Restoring the notes.txt file wasn't easy, because it was not obvious what was wrong with it. I couldn't find a command-line tool that would easily replace any invalid UTF-8 sequences with their hex code equivalents, but I must admit that I did not actually invest much time looking. After all, I have automated backups.
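
A minimal sketch of that kind of replacement, assuming Python 3 is at hand. The function name show_invalid_utf8 is made up for illustration; the sketch relies only on the standard "backslashreplace" error handler of bytes.decode. Note that a NUL byte is valid UTF-8, so this approach leaves it untouched and would not have flagged it.

```python
def show_invalid_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, rendering every undecodable byte as \\xNN."""
    # "backslashreplace" keeps valid sequences as they are and turns each
    # invalid byte into a visible escape such as \xff.
    return data.decode("utf-8", errors="backslashreplace")


# Illustrative input: valid text with one stray 0xFF byte in the middle.
sample = b"caf\xc3\xa9 \xff end"
print(show_invalid_utf8(sample))
```

To check a real file, its contents could be read in binary mode (open(path, "rb")) and passed to the function.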


Yesterday, I remembered exactly what I had done right before the latest corruption: I had copied text from the *grep* buffer after using 'rgrep'.

After some investigation, it turns out Emacs' default "Grep Command" is "grep --color -nH --null -e ", which includes option "--null". This means that grep is embedding an ASCII NUL character (a binary 0x00) after the filenames.


This is what an rgrep text search occurrence looks like in the *grep* buffer:

./some/file.txt:123:some text line

The first ':' is actually a binary null, but the *grep* buffer hides this fact.
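
To make the line format concrete, here is a rough Python sketch that splits one line of "grep -nH --null" output at the NUL and also produces the ':' form that the *grep* buffer displays. The helper names split_grep_null_line and as_displayed are hypothetical, and the sketch assumes the line has exactly the shape shown above (file name, NUL, line number, ':', text).

```python
from typing import Tuple


def split_grep_null_line(line: str) -> Tuple[str, int, str]:
    """Split 'FILE<NUL>LINENO:TEXT' into (file name, line number, text)."""
    filename, sep, rest = line.partition("\0")
    if not sep:
        raise ValueError("no NUL separator after the file name")
    lineno, sep, text = rest.partition(":")
    if not sep:
        raise ValueError("no ':' after the line number")
    return filename, int(lineno), text


def as_displayed(line: str) -> str:
    """Return the line with the first NUL shown as ':', as it looks on screen."""
    return line.replace("\0", ":", 1)


raw = "./some/file.txt\x00123:some text line"
print(split_grep_null_line(raw))
print(as_displayed(raw))
```

The NUL separator is unambiguous because a file name may itself contain ':' but cannot contain NUL, which is why the raw separator is convenient for parsing and awkward once it ends up in a text file.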

If you copy that text line to an Emacs text file buffer, it then looks like 
this:

./some/file.txt^@123:some text line

The ^@ is the representation for the binary null. With my preliminary testing, I could not reproduce the kind of text file "corruption" I had seen before, but other text editors started complaining again about an invalid UTF-8 sequence or the like.
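
Since ^@ is easy to overlook, a small check outside the editor can help. The following Python sketch reports the line and column of every NUL character in a piece of text. The name find_nul_positions is invented for this example, and lines are split on "\n" only, so a trailing "\r" from DOS line endings stays at the end of each line and does not shift the columns.

```python
from typing import List, Tuple


def find_nul_positions(text: str) -> List[Tuple[int, int]]:
    """Return 1-based (line, column) pairs for every NUL character in text."""
    positions = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        col = line.find("\0")
        while col != -1:
            positions.append((lineno, col + 1))
            col = line.find("\0", col + 1)
    return positions


# Illustrative content: a pasted *grep* line in the middle of a DOS-style file.
notes = "first line\r\n./some/file.txt\x00123:some text line\r\nlast line\r\n"
print(find_nul_positions(notes))
```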

For example, the MATE Desktop text editor, Pluma, complained about an "incomplete multibyte sequence in input". Pluma refuses to open short files with embedded NUL characters because it cannot detect the character encoding, or because it claims that it looks like a binary file. The merge tool 'Meld' also complained about invalid characters.


I would say that Emacs has 2 issues here:


1) If a text file encoding is utf-8-with-signature-dos, I do not think that it 
is a good idea for Emacs to allow binary zeros without any warning.

A character sequence like ^@ is easy to miss in the middle of long text lines, 
as it is not coloured in red and does not have any other visible hint.

A 0x00 may well be a valid UTF-8 character, but it is probably going to cause problems in many places. This kind of problem is not new; see also "modified UTF-8".
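
The point about NUL being valid can be checked with a short Python sketch. The helper name is_strict_utf8 is hypothetical. The second test uses the two-byte sequence C0 80, which is how "modified UTF-8" encodes U+0000 to avoid embedded zero bytes; a strict UTF-8 decoder such as Python's rejects it as an overlong encoding.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A plain NUL byte is an ordinary, valid one-byte UTF-8 sequence.
print(is_strict_utf8(b"a\x00b"))
# The "modified UTF-8" form of NUL is not valid standard UTF-8.
print(is_strict_utf8(b"a\xc0\x80b"))
```

So a validity check alone does not catch the NUL; tools that complain about it are reacting to the zero byte itself, not to a malformed sequence.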

I think that I have seen warnings from Emacs before about characters that could not be encoded in the current buffer encoding. I would welcome such a warning for binary zeros.

2) Copying text from a *grep* buffer that looks like ":" should not suddenly deliver a NUL character instead. That's just unexpected and prone to problems down the line.



Regards,
  rdiez
