From: Bruno Haible
Subject: Re: [bug-libunistring] u32_conv_from_encoding doesn't always respect iconveh_escape_sequence
Date: Thu, 11 Jul 2024 17:54:20 +0200

Hi,
Rob Browning wrote:
>
> This call appears to end up taking the "indirectly" "?" path in
> mem_cd_iconveh_internal, even though it asks for
> iconveh_escape_sequence:
>
> size_t u32len = 0;
> const char *str = "\xb5"; // 0xB5 = U+00B5 MICRO SIGN in ISO-8859-1
> uint32_t *u32 = u32_conv_from_encoding ("utf8", iconveh_escape_sequence,
>                                         str, strlen(str), NULL, NULL,
>                                         &u32len);
This is expected: for iconveh_escape_sequence to produce a \unnnn
escape sequence, it would have to know which Unicode character it is. You
gave "\xb5", which is not valid UTF-8, with no indication that it is meant
as ISO-8859-1. It could just as well be U+013E (in ISO-8859-2), U+2563
(in KOI8-R), and so on.
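To make the ambiguity concrete, here is a minimal sketch that feeds the
same byte through u32_conv_from_encoding under three declared source
encodings (assuming the system iconv recognizes these encoding names;
link with -lunistring):

  #include <stdio.h>
  #include <stdlib.h>
  #include <uniconv.h>

  /* Convert the single byte 0xB5 from FROMCODE to UCS-4 and print
     the resulting code point.  */
  static void
  show (const char *fromcode)
  {
    size_t len = 0;
    uint32_t *u32 = u32_conv_from_encoding (fromcode, iconveh_error,
                                            "\xb5", 1, NULL, NULL, &len);
    if (u32 != NULL && len == 1)
      printf ("%-10s -> U+%04X\n", fromcode, (unsigned int) u32[0]);
    free (u32);
  }

  int
  main (void)
  {
    show ("ISO-8859-1");  /* U+00B5 MICRO SIGN */
    show ("ISO-8859-2");  /* U+013E LATIN SMALL LETTER L WITH CARON */
    show ("KOI8-R");      /* U+2563 BOX DRAWINGS DOUBLE VERTICAL AND LEFT */
    return 0;
  }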
But you can try specifying "autodetect_utf8" as the source encoding; it
accepts both UTF-8 and ISO-8859-1 input (the choice is made for the entire
string, not on a character-by-character basis).
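As a minimal sketch of that variant: because the lone 0xB5 byte is not
valid UTF-8, the ISO-8859-1 fallback applies and the call yields U+00B5
instead of failing over to "?":

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniconv.h>

  int
  main (void)
  {
    const char *str = "\xb5";   /* invalid as UTF-8 */
    size_t u32len = 0;
    uint32_t *u32 = u32_conv_from_encoding ("autodetect_utf8",
                                            iconveh_escape_sequence,
                                            str, strlen (str),
                                            NULL, NULL, &u32len);
    if (u32 != NULL && u32len == 1)
      printf ("U+%04X\n", (unsigned int) u32[0]);   /* prints U+00B5 */
    free (u32);
    return 0;
  }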
> Also seen via:
>
> guile -c '(write (program-arguments)) (newline)' $'\xb5'
>
> in a UTF-8 locale, which (eventually) makes a similar call.
I think this is the same situation: it tries to convert a lone byte to
Unicode, with no indication of the character set in which that byte is to
be interpreted. You can't hope for anything better than a "?".
Bruno