bug-gnulib
Re: From wchar_t to char32_t


From: Paul Eggert
Subject: Re: From wchar_t to char32_t
Date: Tue, 11 Jul 2023 15:14:31 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0

On 7/2/23 13:18, Bruno Haible wrote:
> Paul Eggert wrote:
>> When can we get (size_t) -3 in a real-world system?
>
> It can/could occur if all of the following conditions are met:
>
>    * The locale encoding is BIG5-HKSCS, e.g. on a glibc system the
>      zh_HK.BIG5-HKSCS locale.
>
>    * The input is one of the 4 characters in that encoding that map to
>      a sequence of two Unicode characters:
>
>         input         maps to
>         -----         -------
>       0x88 0x62    U+00CA U+0304
>       0x88 0x64    U+00CA U+030C
>       0x88 0xA3    U+00EA U+0304
>       0x88 0xA5    U+00EA U+030C
>       ...
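
For concreteness, here is a rough sketch of what a caller has to do if mbrtoc32 can return (size_t) -3, i.e. if one multibyte character can expand to two char32_t values. The helper name count_char32_units is hypothetical, and the code assumes that a null character occupies exactly one byte in the locale encoding; it is an illustration, not what gnulib does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the char32_t values that mbrtoc32 produces
   for the N bytes starting at S, in the current locale.
   Returns (size_t) -1 if the input is invalid or ends mid-character.  */
size_t
count_char32_units (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  assert (s != NULL);
  memset (&state, 0, sizeof state);   /* start in the initial state */

  while (n > 0)
    {
      char32_t c;
      size_t r = mbrtoc32 (&c, s, n, &state);

      if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;           /* invalid or incomplete input */

      count++;                        /* C now holds one char32_t */

      if (r == (size_t) -3)
        continue;                     /* C came from bytes consumed by an
                                         earlier call; nothing to skip */
      if (r == 0)
        r = 1;                        /* assumption: NUL is one byte */

      s += r;
      n -= r;
    }
  return count;
}
```

In a locale where no character expands this way, the (size_t) -3 branch is never taken and the loop behaves like an ordinary mbrtowc loop.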

I looked into this some more and unfortunately don't understand the above. Could you explain a bit more?

<http://www.nits.org.cn/index/article/4034> says that the official mapping table for GB 18030-2022 and BMP is here:

http://www.nits.org.cn/cmsfile/download/134

and this contains the following (nonconsecutive) lines:

  5746  8862
  5749  8864
  57BC  88A3
  57BE  88A5

which, if I understand things correctly, means the four two-byte sequences that you mention should convert to the following four Unicode characters:

  坆 U+5746 CJK IDEOGRAPH-5746
  坉 U+5749 CJK IDEOGRAPH-5749
  垼 U+57BC CJK IDEOGRAPH-57BC
  垾 U+57BE CJK IDEOGRAPH-57BE

without mbrtoc32 having to return (size_t) -3.
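
To show how I am reading those table lines, here is a small sketch that splits one line into a code point and a two-byte sequence. It assumes that each line is two hexadecimal fields, Unicode code point first and encoded bytes second, as in the excerpt above; I have not checked the rest of the file against that assumption, and the names struct mapping and parse_mapping_line are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for one table line such as "5746  8862".  */
struct mapping
{
  unsigned int code_point;      /* e.g. 0x5746 */
  unsigned char bytes[2];       /* e.g. { 0x88, 0x62 } */
};

/* Parse LINE into *OUT.  Return 0 on success, -1 if LINE does not
   consist of two hex fields or the second field exceeds two bytes.  */
int
parse_mapping_line (const char *line, struct mapping *out)
{
  unsigned int cp, enc;

  assert (line != NULL && out != NULL);

  if (sscanf (line, "%x %x", &cp, &enc) != 2)
    return -1;
  if (enc > 0xFFFF)
    return -1;                  /* only two-byte sequences handled here */

  out->code_point = cp;
  out->bytes[0] = enc >> 8;     /* lead byte */
  out->bytes[1] = enc & 0xFF;   /* trail byte */
  return 0;
}
```

Under this reading, the line "5746  8862" yields code point U+5746 and the byte pair 0x88 0x62, which is a single Unicode character rather than a two-character sequence.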

Perhaps there was a problem with an earlier version of GB 18030 that has been fixed in the 2022 edition?


