From: Bruno Haible
Subject: Re: [bug-libunistring] u32_conv_from_encoding doesn't always respect iconveh_escape_sequence
Date: Thu, 11 Jul 2024 17:54:20 +0200

Hi,

Rob Browning wrote:
> 
> This call appears to end up taking the "indirectly" "?" path in
> mem_cd_iconveh_internal, even though it asks for
> iconveh_escape_sequence:
> 
>   size_t u32len = 0;
>   const char *str = "\xb5"; // ISO-8859-1 micro sign (µ)
>   uint32_t *u32 = u32_conv_from_encoding ("utf8", iconveh_escape_sequence,
>                                           str, strlen(str),
>                                           NULL, NULL, &u32len);

This is expected, because for iconveh_escape_sequence to produce a \unnnn
escape sequence, it would need to know which Unicode character it is. You gave
"\xb5", which is not valid UTF-8, with no indication that it is ISO-8859-1. The
same byte could also be U+013E (in ISO-8859-2), U+2563 (in KOI8-R), and so on.
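
For illustration, here is a minimal, untested sketch (link against
libunistring, e.g. with -lunistring) that converts the same byte while
naming the source encoding explicitly. Once the caller states the encoding
there is no ambiguity, and different encodings yield different code points:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniconv.h>

  int
  main (void)
  {
    const char *str = "\xb5";
    const char *encodings[] = { "ISO-8859-1", "ISO-8859-2" };
    for (size_t i = 0; i < 2; i++)
      {
        size_t u32len = 0;
        /* The caller names the source encoding, so the byte is unambiguous.  */
        uint32_t *u32 = u32_conv_from_encoding (encodings[i],
                                                iconveh_escape_sequence,
                                                str, strlen (str),
                                                NULL, NULL, &u32len);
        if (u32 != NULL && u32len == 1)
          printf ("%s: U+%04X\n", encodings[i], (unsigned) u32[0]);
        free (u32);
      }
    return 0;
  }

This should print U+00B5 for ISO-8859-1 and U+013E for ISO-8859-2.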

But you can try passing "autodetect_utf8" as the source encoding; it accepts
both UTF-8 and ISO-8859-1 encoded input (for the entire string, not on a
character-by-character basis).
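
For example, an untested variant of your snippet above, changing only the
fromcode argument; here the input is not valid UTF-8, so "autodetect_utf8"
should fall back to ISO-8859-1 and yield U+00B5 instead of "?":

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniconv.h>

  int
  main (void)
  {
    const char *str = "\xb5";   /* not valid UTF-8 on its own */
    size_t u32len = 0;
    uint32_t *u32 = u32_conv_from_encoding ("autodetect_utf8",
                                            iconveh_escape_sequence,
                                            str, strlen (str),
                                            NULL, NULL, &u32len);
    if (u32 != NULL && u32len == 1)
      printf ("U+%04X\n", (unsigned) u32[0]);   /* expected: U+00B5 */
    free (u32);
    return 0;
  }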

> Also seen via:
> 
>   guile -c '(write (program-arguments)) (newline)' $'\xb5'
> 
> in a UTF-8 locale, which (eventually) makes a similar call.

I think this is the same situation: it tries to convert a lone byte to
Unicode, with no indication of which character set it should be interpreted
in. You can't hope for anything better than a "?".

Bruno
