Re: Parsing of multibyte strings from process output
From: Helmut Eller
Subject: Re: Parsing of multibyte strings from process output
Date: Tue, 08 May 2018 13:00:13 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)
On Tue, May 08 2018, Michael Albinus wrote:
> Hi,
>
> I call a local process ("gio list ...", to name it), which returns utf8
> multibyte codes like
>
> --8<---------------cut here---------------start------------->8---
> standard::symlink-target=/home/albinus/tmp/\xc2\x9abung
> --8<---------------cut here---------------end--------------->8---
>
> The bytes "\xc2\x9a" stand for the multibyte char ?\x9a.
The UTF-8 byte sequence \xc2\x9a decodes to U+009A, which is a C1
control character. The byte sequence \xc3\x9c might make a better
example, as it corresponds to Ü (LATIN CAPITAL LETTER U WITH DIAERESIS).
> However, I
> don't know how to parse it that I could retrieve it. All what I have
> tried returns always the *two* characters ?\xc2 ?\x9a, multibyte
> encoded. How could I get just the multibyte character ?\x9a from this?
You could use (set-process-coding-system <proc> 'utf-8) if you know that
all output of the process is indeed utf-8 encoded.
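A minimal sketch of that first option. The helper name
my-start-utf8-process is made up for illustration, and the program and
its arguments are left to the caller; only start-process and
set-process-coding-system are real Emacs functions here.

```elisp
(defun my-start-utf8-process (name buffer program &rest args)
  "Start PROGRAM with ARGS and treat all of its output as UTF-8.
NAME and BUFFER are passed on to `start-process'.  This is a
hypothetical helper, not part of Emacs."
  (let ((proc (apply #'start-process name buffer program args)))
    ;; Second argument: coding system used to decode process output.
    ;; Third argument: coding system used to encode input sent to it.
    (set-process-coding-system proc 'utf-8 'utf-8)
    proc))
```

This only makes sense when every byte the process writes is UTF-8 text;
otherwise the second option is the safer one.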
Alternatively, you could use 'binary as coding system and manually call
decode-coding-string on the parts that are utf-8 encoded. However, keep
in mind that "raw bytes" in multibyte strings have char codes in the
range #x3FFF80..#x3FFFFF.
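A rough sketch of the manual approach, assuming the line was read with
the 'binary coding system into a multibyte buffer, so that the UTF-8
bytes show up as raw-byte characters. The function name my-decode-value
and the choice to split at the first "=" are assumptions made only for
this example.

```elisp
(defun my-decode-value (line)
  "Return the UTF-8 decoded text after the first \"=\" in LINE.
LINE is assumed to have been read with the `binary' coding system.
Return nil if LINE contains no \"=\".  Hypothetical helper."
  (when (string-match "=" line)
    ;; Only the value part is decoded; the key is plain ASCII.
    (decode-coding-string (substring line (match-end 0)) 'utf-8)))

;; Simulated input: the two UTF-8 bytes of "Ü" as raw-byte characters.
(my-decode-value
 (concat "standard::symlink-target=/home/albinus/tmp/"
         (string #x3FFFC3 #x3FFF9C)
         "bung"))
```

With that simulated input the call should give "/home/albinus/tmp/Übung".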
If you want even more confusion: you could set up the process so that it
generates unibyte strings and then use decode-coding-string to create
the multibyte string.
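A small sketch of the unibyte route, using a temporary unibyte buffer
as a stand-in for wherever the process output ends up. The helper name
my-decode-unibyte-buffer is invented for this example, and the input
string merely simulates process output.

```elisp
(defun my-decode-unibyte-buffer (bytes)
  "Insert BYTES into a unibyte buffer and decode its contents as UTF-8.
BYTES is a unibyte string standing in for raw process output.
Hypothetical helper."
  (with-temp-buffer
    ;; A unibyte buffer stores bytes, not characters.
    (set-buffer-multibyte nil)
    (insert bytes)
    ;; `buffer-string' returns a unibyte string here, which
    ;; `decode-coding-string' turns into a multibyte string.
    (decode-coding-string (buffer-string) 'utf-8)))

;; The "\ " ends the hex escape so that the "b" of "bung" is not
;; read as another hex digit.
(my-decode-unibyte-buffer "\xc3\x9c\ bung")
```

The last form should evaluate to "Übung".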
> I know that (decode-coding-string "/home/albinus/tmp/\xc2\x9a\ bung" 'utf-8)
> does what I want. But here, the string is a string *constant*, which
> allows to write characters in hex syntax. When I read the string from
> the output buffer (after including the trailing "\ "), this does not work.
Remember, if a string literal contains hexadecimal or octal escape
sequences that all denote values below 256, and no other non-ASCII
characters, then the string automatically becomes a unibyte string:
  (multibyte-string-p "\xc3\x9c") => nil
Also consider these examples:
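  ;; Unibyte literal: the two bytes are decoded as one UTF-8 sequence.
  (decode-coding-string "\xc3\x9c" 'utf-8) => "Ü"
  ;; Multibyte string of the characters U+00C3 and U+009C, not of bytes.
  (decode-coding-string (string #xc3 #x9c) 'utf-8) => "Ã\234"
  ;; Multibyte string of raw-byte characters: decoded like the bytes.
  (decode-coding-string (string #x3FFFc3 #x3FFF9c) 'utf-8) => "Ü"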
Helmut