bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents

bug-gnu-emacs

From:	ynyaaa
Subject:	bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
Date:	Sun, 06 Oct 2019 02:18:08 +0900
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/26.3 (windows-nt)

Eli Zaretskii <eliz@gnu.org> writes:
> I don't think this is a bug.  Changing the multibyte-ness of a buffer
> really does change the contents.  You should only do that where it
> makes sense.

Sometimes I find broken utf-8 text on the Internet.
Some characters are split into surrogate pairs, and each surrogate
is encoded as if it were a normal BMP character.

The utf-8 coding system does not decode such sequences.
Changing the multibyte-ness converts them to surrogate characters,
and an encode-decode round trip through utf-16be then yields the
intended characters.

Suppose the character is #x10000;
the corresponding pair is (#xD800 #xDC00).
The mis-encoded sequence is:
  (encode-coding-string "\xD800\xDC00" 'utf-8)
  => "\355\240\200\355\260\200"
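
For reference, those six bytes are what the ordinary 3-byte UTF-8
layout produces when it is applied to each surrogate separately.
A minimal sketch of that arithmetic; `my-utf8-3byte' is a hypothetical
helper name used only for illustration, not an Emacs function:

```elisp
;; Hypothetical helper (not part of Emacs): pack a code point CP in
;; the range #x0800..#xFFFF into the 3-byte UTF-8 layout
;; 1110xxxx 10xxxxxx 10xxxxxx, without rejecting surrogates.
(defun my-utf8-3byte (cp)
  "Return the three UTF-8 style bytes for code point CP as a list."
  (list (logior #xE0 (ash cp -12))                ; top 4 bits
        (logior #x80 (logand (ash cp -6) #x3F))   ; middle 6 bits
        (logior #x80 (logand cp #x3F))))          ; low 6 bits

(my-utf8-3byte #xD800) ; => (237 160 128), i.e. \355\240\200
(my-utf8-3byte #xDC00) ; => (237 176 128), i.e. \355\260\200
```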

It is not decoded by utf-8:
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
                        'utf-8)
  => "\355\240\200\355\260\200"

When the multibyte-ness is changed, the sequence is converted into
surrogate characters:
  (with-temp-buffer
    (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    (buffer-string))
  => "\xD800\xDC00"

The surrogate pair can then be converted into the original character:
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
                        'utf-16be)
  => "\x10000"
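
The utf-16be round trip amounts to the usual surrogate-pair
arithmetic. A minimal sketch, assuming HIGH is in #xD800..#xDBFF and
LOW is in #xDC00..#xDFFF; `my-combine-surrogates' is a hypothetical
name, not an existing Emacs function:

```elisp
;; Hypothetical helper (not part of Emacs): combine a UTF-16 surrogate
;; pair into the supplementary-plane code point it stands for.
;; Assumes HIGH is a high surrogate and LOW a low surrogate; no
;; range checking is done.
(defun my-combine-surrogates (high low)
  "Return the code point encoded by the surrogate pair HIGH, LOW."
  (+ #x10000
     (ash (- high #xD800) 10)   ; upper 10 bits of the offset
     (- low #xDC00)))           ; lower 10 bits of the offset

(my-combine-surrogates #xD800 #xDC00) ; => 65536  (#x10000)
(my-combine-surrogates #xD83D #xDE00) ; => 128512 (#x1F600)
```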
