From: Paul Eggert
Subject: Re: From wchar_t to char32_t
Date: Tue, 11 Jul 2023 15:14:31 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0

On 7/2/23 13:18, Bruno Haible wrote:
> Paul Eggert wrote:
>> When can we get (size_t) -3 in a real-world system?
>
> It can/could occur if all of the following conditions are met:
>   * The locale encoding is BIG5-HKSCS, e.g. on a glibc system the
>     zh_HK.BIG5-HKSCS locale.
>   * The input is one of the 4 characters in that encoding that map to
>     a sequence of two Unicode characters:
>
>       input       maps to
>       -----       -------
>       0x88 0x62   U+00CA U+0304
>       0x88 0x64   U+00CA U+030C
>       0x88 0xA3   U+00EA U+0304
>       0x88 0xA5   U+00EA U+030C
> ...

I looked into this some more and unfortunately don't understand the above. Could you explain a bit more?
<http://www.nits.org.cn/index/article/4034> says that the official mapping table for GB 18030-2022 and BMP is here:
http://www.nits.org.cn/cmsfile/download/134

and this contains the following (nonconsecutive) lines:

  5746 8862
  5749 8864
  57BC 88A3
  57BE 88A5

which, if I understand things correctly, means the four two-byte sequences that you mention should convert to the following four Unicode characters:

  坆 U+5746 CJK IDEOGRAPH-5746
  坉 U+5749 CJK IDEOGRAPH-5749
  垼 U+57BC CJK IDEOGRAPH-57BC
  垾 U+57BE CJK IDEOGRAPH-57BE

without mbrtoc32 having to return (size_t) -3.

Perhaps there was a problem with an earlier version of GB 18030 that has been fixed in the 2022 edition?
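
For context, here is a minimal sketch of how a caller might cope with every special return value of the C11 function mbrtoc32, including (size_t) -3. In C11, (size_t) -3 means that the stored code point came from bytes consumed by an earlier call, so the input pointer must not advance. The helper name decode_mbs is hypothetical and is not part of gnulib or glibc. Whether a given libc ever returns (size_t) -3 for a particular locale is exactly the open question in this thread, so the sketch only shows the defensive handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decode the NUL-terminated multibyte string S
   (in the current locale's encoding) into at most CAP code points
   stored in OUT.  Returns the number of code points written, or
   (size_t) -1 if the input is invalid or incomplete.  */
size_t decode_mbs(const char *s, char32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);      /* initial conversion state */

    const char *p = s;
    size_t remaining = strlen(s) + 1;     /* include the terminating NUL */
    size_t count = 0;

    while (count < cap) {
        char32_t c;
        size_t r = mbrtoc32(&c, p, remaining, &state);

        if (r == (size_t) -1 || r == (size_t) -2)
            return (size_t) -1;           /* invalid or incomplete sequence */

        if (r == (size_t) -3) {
            /* C comes from bytes consumed by a previous call:
               store it, but do not advance the input pointer.  */
            out[count++] = c;
            continue;
        }

        if (r == 0)
            break;                        /* decoded the terminating NUL */

        out[count++] = c;
        p += r;
        remaining -= r;
    }
    return count;
}
```

With this shape, an input such as 0x88 0x62 in a locale that maps it to two code points would yield U+00CA from the first call and U+0304 from a second call that returns (size_t) -3, provided the libc implements the mapping that way.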