bug-gnulib
Re: From wchar_t to char32_t


From: Bruno Haible
Subject: Re: From wchar_t to char32_t
Date: Fri, 30 Jun 2023 22:50:06 +0200

I did:
>       * lib/mbiter.h: Include <uchar.h> instead of <wchar.h>.
>       (mbiter_multi_next): Use mbrtoc32 instead of mbrtowc.
>       * lib/mbuiter.h: Include <uchar.h> instead of <wchar.h>.
>       (mbuiter_multi_next): Use mbrtoc32 instead of mbrtowc.
>       * lib/mbfile.h (mbfile_multi_getc): Use mbrtoc32 instead of mbrtowc.

There's a small difference between mbrtowc and mbrtoc32: While the return values
(size_t)(-1) and (size_t)(-2) have the same meaning in both, mbrtoc32 (in theory)
has an additional possible return value, (size_t)(-3). This adds one case to the
rule for computing the number of consumed bytes.

In mbrtowc:

       Return value       Consumed bytes
       ------------       --------------
       small n > 0        n
       0                  1

In mbrtoc32:

       Return value       Consumed bytes
       ------------       --------------
       small n > 0        n
       0                  1
       (size_t)(-3)       0
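
The two tables can be condensed into one small function. This is only a
sketch: consumed_bytes is a hypothetical helper, not part of gnulib, and it
treats the error returns (size_t)(-1) and (size_t)(-2) as "nothing to
consume" by returning false.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a gnulib function): map a return value RET of
   mbrtowc or mbrtoc32 to the number of input bytes consumed.
   Returns false for the error returns (size_t)(-1) and (size_t)(-2),
   in which case *CONSUMED is left untouched.  */
static bool
consumed_bytes (size_t ret, size_t *consumed)
{
  if (ret == (size_t) -1 || ret == (size_t) -2)
    /* Invalid or incomplete multibyte sequence.  */
    return false;
  if (ret == (size_t) -3)
    /* mbrtoc32 only: the previous multibyte sequence produced an
       additional 32-bit wide character; no new bytes were read.  */
    *consumed = 0;
  else if (ret == 0)
    /* The null character was decoded; it occupies one byte.  */
    *consumed = 1;
  else
    *consumed = ret;
  return true;
}
```

For mbrtowc the (size_t)(-3) branch is simply never taken, so the same
function covers both tables.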

The patch below thus fixes the uses of mbrtoc32.

I said "in theory". This situation occurs if and only if there is a character
in the locale's encoding that corresponds to a sequence of two or more Unicode
characters. To find which encodings have this property, do

  $ ls -1 glibc/iconvdata/*.precomposed
  glibc/iconvdata/BIG5HKSCS.precomposed
  glibc/iconvdata/EUC-JISX0213.precomposed
  glibc/iconvdata/SHIFT_JISX0213.precomposed
  glibc/iconvdata/TCVN5712-1.precomposed
  glibc/iconvdata/TSCII.precomposed

The encodings EUC-JISX0213, SHIFT_JISX0213, TSCII are not used as the locale
encoding of any locale on any system (see localcharset.h). TCVN5712-1 was
used as a locale encoding until 2012-05-21 (see glibc/localedata/SUPPORTED).

The only system that still has a locale with BIG5-HKSCS encoding is glibc,
AFAIK. But since in glibc, mbrtoc32 is identical to mbrtowc (except for the
private internal state), mbrtoc32 cannot return (size_t)(-3) either.
(Although perhaps glibc will get fixed to handle the zh_HK.BIG5-HKSCS locale
better? Or perhaps this locale will be dropped, like the TCVN5712-1 locale
before?)

So, for the moment, no mbrtoc32() implementation returns (size_t)(-3). But
IMO, in order to be future-proof, we should include the code to handle this
case; especially since it's only 2 lines of code.
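
To illustrate what such future-proof handling looks like outside of gnulib's
iterators, here is a minimal sketch of a decoding loop. count_c32 is a
hypothetical function, not gnulib code; it assumes a platform that provides
<uchar.h> and decodes in whatever locale is currently set.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not a gnulib function): count the char32_t values
   produced by decoding the NUL-terminated string S in the current locale.
   Returns (size_t) -1 if S contains an invalid or incomplete sequence.  */
static size_t
count_c32 (const char *s)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);

  const char *p = s;
  /* Include the terminating NUL, so that the loop ends on a return value
     of 0 and any pending (size_t)(-3) results are delivered before it.  */
  size_t avail = strlen (s) + 1;
  size_t count = 0;

  for (;;)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, p, avail, &state);
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      if (ret == 0)
        /* Reached the terminating null character.  */
        break;
      if (ret == (size_t) -3)
        /* An additional char32_t from the previous multibyte sequence;
           no input bytes were consumed by this call.  */
        ret = 0;
      p += ret;
      avail -= ret;
      count++;
    }
  return count;
}
```

Without the (size_t)(-3) branch, the pointer would be advanced by a huge
value as soon as some implementation started returning it.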


2023-06-30  Bruno Haible  <bruno@clisp.org>

        Accommodate a difference between mbrtowc and mbrtoc32.
        * lib/mbiter.h (mbiter_multi_next): Handle the mbrtoc32 return value
        (size_t)(-3).
        * lib/mbuiter.h (mbuiter_multi_next): Likewise.
        * lib/mbfile.h (mbfile_multi_getc): Likewise.

diff --git a/lib/mbfile.h b/lib/mbfile.h
index 7c6d70fcae..716ab3fc89 100644
--- a/lib/mbfile.h
+++ b/lib/mbfile.h
@@ -183,6 +183,10 @@ mbfile_multi_getc (struct mbchar *mbc, struct mbfile_multi *mbf)
               assert (mbf->buf[0] == '\0');
               assert (mbc->wc == 0);
             }
+          else if (bytes == (size_t) -3)
+            /* The previous multibyte sequence produced an additional 32-bit
+               wide character.  */
+            bytes = 0;
           mbc->wc_valid = true;
           break;
         }
diff --git a/lib/mbiter.h b/lib/mbiter.h
index 93bad990a1..fadefe104b 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -163,6 +163,10 @@ mbiter_multi_next (struct mbiter_multi *iter)
               assert (*iter->cur.ptr == '\0');
               assert (iter->cur.wc == 0);
             }
+          else if (iter->cur.bytes == (size_t) -3)
+            /* The previous multibyte sequence produced an additional 32-bit
+               wide character.  */
+            iter->cur.bytes = 0;
           iter->cur.wc_valid = true;
 
           /* When in the initial state, we can go back treating ASCII
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 02e3190f1c..954e11f635 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -172,6 +172,10 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
               assert (*iter->cur.ptr == '\0');
               assert (iter->cur.wc == 0);
             }
+          else if (iter->cur.bytes == (size_t) -3)
+            /* The previous multibyte sequence produced an additional 32-bit
+               wide character.  */
+            iter->cur.bytes = 0;
           iter->cur.wc_valid = true;
 
           /* When in the initial state, we can go back treating ASCII





