help-gnu-emacs

Re: Parsing of multibyte strings from process output


From: Helmut Eller
Subject: Re: Parsing of multibyte strings from process output
Date: Tue, 08 May 2018 13:00:13 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

On Tue, May 08 2018, Michael Albinus wrote:

> Hi,
>
> I call a local process ("gio list ...", to name it), which returns utf8
> multibyte codes like
>
> --8<---------------cut here---------------start------------->8---
> standard::symlink-target=/home/albinus/tmp/\xc2\x9abung
> --8<---------------cut here---------------end--------------->8---
>
> The bytes "\xc2\x9a" stand for the multibyte char ?\x9a.

The UTF-8 byte sequence \xc2\x9a encodes U+009A, which is a C1 control
character.

Maybe the byte sequence \xc3\x9c would make a better example, as it
corresponds to Ü (U+00DC, LATIN CAPITAL LETTER U WITH DIAERESIS).

> However, I
> don't know how to parse it that I could retrieve it. All what I have
> tried returns always the *two* characters ?\xc2 ?\x9a, multibyte
> encoded. How could I get just the multibyte character ?\x9a from this?

You could use (set-process-coding-system <proc> 'utf-8) if you know that
all the output of the process is indeed utf-8 encoded.
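
A minimal sketch of that approach.  The function name my-utf8-demo is
made up for illustration, and the example assumes a `printf' executable
on PATH that understands octal escapes; it stands in for the real
command here:

```elisp
;;; -*- lexical-binding: t -*-
(defun my-utf8-demo ()
  "Run a child process and return its output decoded as UTF-8.
Hypothetical helper; assumes an external `printf' program."
  (let* ((buf (generate-new-buffer " *utf8-demo*"))
         (proc (make-process
                :name "utf8-demo"
                :buffer buf
                ;; printf emits the bytes C3 9C followed by "bung".
                :command (list "printf" "\\303\\234bung")
                :connection-type 'pipe
                ;; Keep the default "finished" message out of BUF.
                :sentinel #'ignore)))
    ;; Decode everything the process writes as UTF-8.
    (set-process-coding-system proc 'utf-8 'utf-8)
    ;; Keep reading until no more output arrives from PROC.
    (while (accept-process-output proc 1))
    (prog1 (with-current-buffer buf (buffer-string))
      (kill-buffer buf))))
```

If printf behaves as assumed, (my-utf8-demo) should return a string
whose first character is the single character Ü rather than two
separate characters.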

Alternatively, you could use 'binary as the coding system and manually
call decode-coding-string on the parts that are utf-8 encoded.  However,
keep in mind that "raw bytes" in multibyte strings have char codes in
the range #x3FFF00..#x3FFFFF.
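
To make the raw-byte range concrete, here is a small sketch (the name
my-raw-byte-demo is hypothetical).  It takes a two-byte unibyte string,
converts it with string-to-multibyte so that each byte becomes a "raw
byte" character, and then decodes the result:

```elisp
(defun my-raw-byte-demo ()
  "Show how undecoded bytes look inside a multibyte string.
Hypothetical helper, for illustration only."
  (let* ((bytes "\xc3\x9c")                    ; unibyte literal, two bytes
         (multi (string-to-multibyte bytes)))  ; same bytes as raw-byte chars
    (list (multibyte-string-p multi)
          ;; Char codes of the two raw bytes, printed in hex.
          (mapcar (lambda (c) (format "#x%X" c)) multi)
          ;; Decoding the raw bytes yields one character.
          (decode-coding-string multi 'utf-8))))

;; (my-raw-byte-demo) => (t ("#x3FFFC3" "#x3FFF9C") "Ü")
```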

If you want even more confusion: you could set up the process so that it
generates unibyte strings and then use decode-coding-string to create
the multibyte string.
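
A rough sketch of that variant, under the same assumptions as before
(a `printf' program for the test command; my-collect-and-decode is a
made-up name).  The filter only collects the undecoded chunks, and the
decoding happens once at the end, because a multi-byte UTF-8 sequence
may be split across two chunks:

```elisp
;;; -*- lexical-binding: t -*-
(defun my-collect-and-decode (command)
  "Run COMMAND, a list of strings, and decode its whole output as UTF-8.
Hypothetical helper; the process output is collected undecoded first."
  (let* ((chunks '())
         (proc (make-process
                :name "raw-demo"
                :command command
                :connection-type 'pipe
                ;; No decoding: the filter sees the bytes as they arrive.
                :coding 'no-conversion
                :sentinel #'ignore
                :filter (lambda (_proc chunk) (push chunk chunks)))))
    ;; Keep reading until no more output arrives from PROC.
    (while (accept-process-output proc 1))
    ;; Join the chunks in arrival order, then decode in one step.
    (decode-coding-string (apply #'concat (nreverse chunks)) 'utf-8)))

;; Example call, assuming printf understands octal escapes:
;; (my-collect-and-decode '("printf" "\\303\\234bung"))
```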

> I know that (decode-coding-string "/home/albinus/tmp/\xc2\x9a\ bung" 'utf-8)
> does what I want. But here, the string is a string *constant*, which
> allows to write characters in hex syntax. When I read the string from
> the output buffer (after including the trailing "\ "), this does not work.

Remember, if a hexadecimal or octal escape sequence occurs in a string
literal, then the string automatically becomes a unibyte string:

  (multibyte-string-p "\xc3\x9c") => nil

Also consider these examples:

  (decode-coding-string "\xc3\x9c" 'utf-8) => "Ü"
  (decode-coding-string (string #xc3 #x9c) 'utf-8) => "Ã\234"
  (decode-coding-string (string #x3FFFc3 #x3FFF9c) 'utf-8) => "Ü"

Helmut

