Re: From wchar_t to char32_t
From: Bruno Haible
Subject: Re: From wchar_t to char32_t
Date: Sun, 02 Jul 2023 22:18:58 +0200
Paul Eggert wrote:
> On 2023-07-02 06:33, Bruno Haible wrote:
> > + else if (bytes == (size_t) -3)
> > + bytes = 0;
>
> Why is this sort of thing needed?
I tried to explain it in
https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html .
Basically, since ISO C 23 says that mbrtoc32() can return (size_t) -3,
I want to write future-proof code by handling this case, even though
currently no implementation produces this return code and I consider
it unlikely that any implementation ever will.
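To make the two quoted lines concrete, here is a minimal sketch of a decoding
loop that handles every mbrtoc32() return value. The helper name count_c32 is
hypothetical and this is not the actual gnulib or diffutils code; it only
illustrates where the (size_t) -3 branch sits. It assumes a libc that provides
<uchar.h>.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: counts the char32_t values decoded from the
   N bytes starting at S, in the current locale's encoding.
   Returns (size_t) -1 on an invalid or incomplete sequence.
   Simplification: a character still pending after the last input
   byte has been consumed is not drained.  */
static size_t
count_c32 (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t count = 0;

  while (n > 0)
    {
      char32_t c;
      size_t bytes = mbrtoc32 (&c, s, n, &state);

      if (bytes == (size_t) -1 || bytes == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete input */
      else if (bytes == (size_t) -3)
        bytes = 0;              /* a character was stored, no input consumed */
      else if (bytes == 0)
        bytes = 1;              /* a null character; treated as one byte */

      count++;
      s += bytes;
      n -= bytes;
    }
  return count;
}
```

Without the (size_t) -3 branch, the value would fall through to the pointer
arithmetic and advance s by an enormous amount.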
> I thought that (size_t) -3 was
> possible only after a low surrogate, which is possible when decoding
> valid UTF-16 to Unicode
The mbrtoc16() function returns (size_t) -3 when it stores a low surrogate
(as second char16_t after the first one was a high surrogate), right.
But we don't use the mbrtoc16() function, and I don't plan to use it, ever.
The mbrtoc32() function MUST return (size_t) -1 and set errno to EILSEQ when
the input is a UTF-8 byte sequence whose value would be a surrogate.
See https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling
> but not when decoding valid UTF-8 to Unicode.
When decoding valid or invalid UTF-8 through mbrtoc32(), (size_t) -3
can never occur.
> When can we get (size_t) -3 in a real-world system?
It can/could occur if all of the following conditions are met:
* The locale encoding is BIG5-HKSCS, e.g. on a glibc system the
  zh_HK.BIG5-HKSCS locale.
* The input is one of the 4 characters in that encoding that map to
  a sequence of two Unicode characters:

      input        maps to
      ---------    -------------
      0x88 0x62    U+00CA U+0304
      0x88 0x64    U+00CA U+030C
      0x88 0xA3    U+00EA U+0304
      0x88 0xA5    U+00EA U+030C

* glibc is changed so that, in this case, mbrtoc32() does not work
  identically to mbrtowc().
* The other glibc bug that causes gnulib to override mbrtoc32 gets fixed:
  https://sourceware.org/bugzilla/show_bug.cgi?id=19932
  https://sourceware.org/bugzilla/show_bug.cgi?id=29511
I consider this unlikely. It is more likely that glibc's behaviour does not
change, or that the zh_HK.BIG5-HKSCS locale becomes unsupported.
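For anyone who wants to check a given system, the conditions above can be
probed with a small sketch like the following. The function name probe_minus3
is hypothetical, and the code calls the libc mbrtoc32() directly, not any
gnulib override. The caller is assumed to append one extra ASCII byte after
the two-byte input so that some input remains for the call that could, in
principle, return (size_t) -3.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical probe: decodes the LEN bytes at INPUT in the locale
   LOCALE_NAME and prints each mbrtoc32 result.  Returns the number of
   calls that returned (size_t) -3, or -1 if the locale is unavailable.
   Note that it changes the process-wide locale.  */
static int
probe_minus3 (const char *locale_name, const char *input, size_t len)
{
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  mbstate_t state;
  memset (&state, 0, sizeof state);
  int minus3 = 0;

  /* Cap the number of calls so that an odd implementation cannot
     make this loop forever.  */
  for (int calls = 0; calls < 8 && len > 0; calls++)
    {
      char32_t c = 0;
      size_t ret = mbrtoc32 (&c, input, len, &state);
      printf ("ret = %ld, c = U+%04lX\n", (long) ret, (unsigned long) c);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        break;                  /* invalid or incomplete input */
      if (ret == (size_t) -3)
        {
          minus3++;             /* character stored, no input consumed */
          continue;
        }
      if (ret == 0)
        ret = 1;                /* a null character; treated as one byte */
      input += ret;
      len -= ret;
    }
  return minus3;
}
```

A call such as probe_minus3 ("zh_HK.BIG5-HKSCS", "\x88\x62" "x", 3) would be
the interesting case; given what is said above about current implementations,
I would expect it to report no (size_t) -3 returns today, or -1 if the locale
is not installed.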
> If (size_t) -3 is possible, I suppose I should change diffutils to take
> this into account, as bleeding-edge diffutils/src/side.c treats (size_t)
> -3 as meaning the next input byte is an encoding error, which is
> obviously wrong.
If you want the diffutils code to be future-proof, yes.
> The simplest way to fix this would be for diffutils to
> go back to using wchar_t,
?? We are talking about 2 lines of code, which lead to 2 instructions at
run time. If you want to micro-optimize execution time, you could
conditionally disable these 2 lines, until we know that the problem will
actually occur with glibc.
> although I don't know what the downsides of
> that would be (diffutils doesn't care about Unicode; all it cares is
> about is character classes and print widths).
With plain wchar_t (as opposed to char32_t), character classes and print widths
of non-BMP characters come out wrong on Cygwin, native Windows, and 32-bit AIX [1].
Bruno
[1] https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00102.html