Re: From wchar_t to char32_t
From: Bruno Haible
Subject: Re: From wchar_t to char32_t
Date: Fri, 30 Jun 2023 22:50:06 +0200
I did:
> * lib/mbiter.h: Include <uchar.h> instead of <wchar.h>.
> (mbiter_multi_next): Use mbrtoc32 instead of mbrtowc.
> * lib/mbuiter.h: Include <uchar.h> instead of <wchar.h>.
> (mbuiter_multi_next): Use mbrtoc32 instead of mbrtowc.
> * lib/mbfile.h (mbfile_multi_getc): Use mbrtoc32 instead of mbrtowc.
There's a small difference between mbrtowc and mbrtoc32: While the return values
(size_t)(-1) and (size_t)(-2) have the same meaning, mbrtoc32 (in theory) has
an additional possible return value, (size_t)(-3). This adds one case to the
rule for computing the number of consumed bytes.
In mbrtowc:
  Return value    Consumed bytes
  ------------    --------------
  small n > 0     n
  0               1

In mbrtoc32:
  Return value    Consumed bytes
  ------------    --------------
  small n > 0     n
  0               1
  (size_t)(-3)    0
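To make the mbrtoc32 table concrete, here is a minimal sketch of a loop that applies it. This is not the code from the patch: consumed_bytes and count_chars32 are hypothetical helpers made up for illustration, and the sketch assumes a platform that provides <uchar.h>. It passes the terminating NUL to mbrtoc32, so that any pending (size_t)(-3) results are delivered before the loop ends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: map a non-error mbrtoc32 return value to the
   number of input bytes that the call consumed.  The caller must have
   handled (size_t)(-1) and (size_t)(-2) already.  */
static size_t
consumed_bytes (size_t ret)
{
  assert (ret != (size_t) -1 && ret != (size_t) -2);
  if (ret == (size_t) -3)
    /* The previous multibyte sequence produced an additional 32-bit
       wide character; no new input was consumed.  */
    return 0;
  if (ret == 0)
    /* A null wide character was stored; it occupies one byte.  */
    return 1;
  return ret;
}

/* Hypothetical helper: count the char32_t characters in the
   NUL-terminated multibyte string S, in the current locale's encoding.
   Return (size_t)(-1) if S contains an invalid or incomplete sequence.  */
static size_t
count_chars32 (const char *s)
{
  size_t n = strlen (s) + 1;    /* include the terminating NUL */
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);
  for (;;)
    {
      char32_t wc;
      size_t ret = mbrtoc32 (&wc, s, n, &state);
      size_t consumed;

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      if (ret == 0)
        /* Reached the terminating NUL.  */
        return count;
      consumed = consumed_bytes (ret);
      count++;
      s += consumed;
      n -= consumed;
    }
}
```

With an mbrtowc-style rule, a return value of (size_t)(-3) would be treated as a huge byte count, and the pointer would be advanced far past the end of the input.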
The patch below thus fixes the uses of mbrtoc32.
I said "in theory". This situation occurs if and only if there is a character
in the locale's encoding that corresponds to a sequence of two or more Unicode
characters. To find out which encodings have this property, run
$ ls -1 glibc/iconvdata/*.precomposed
glibc/iconvdata/BIG5HKSCS.precomposed
glibc/iconvdata/EUC-JISX0213.precomposed
glibc/iconvdata/SHIFT_JISX0213.precomposed
glibc/iconvdata/TCVN5712-1.precomposed
glibc/iconvdata/TSCII.precomposed
The encodings EUC-JISX0213, SHIFT_JISX0213, TSCII are not used as the locale
encoding of any locale on any system (see localcharset.h). TCVN5712-1 was
used as a locale encoding until 2012-05-21 (see glibc/localedata/SUPPORTED).
The only system that still has a locale with BIG5-HKSCS encoding is glibc,
AFAIK. But since in glibc, mbrtoc32 is identical to mbrtowc (except for the
private internal state), mbrtoc32 cannot return (size_t)(-3) either.
(Although glibc may get fixed to handle the zh_HK.BIG5-HKSCS locale better?
Or maybe this locale will be dropped, as the TCVN5712-1 locale was before?)
So, for the moment, no mbrtoc32() implementation returns (size_t)(-3). But
IMO, in order to be future-proof, we should include the code to handle this
case; especially since it's only 2 lines of code.
2023-06-30 Bruno Haible <bruno@clisp.org>
Accommodate a difference between mbrtowc and mbrtoc32.
* lib/mbiter.h (mbiter_multi_next): Handle the mbrtoc32 return value
(size_t)(-3).
* lib/mbuiter.h (mbuiter_multi_next): Likewise.
* lib/mbfile.h (mbfile_multi_getc): Likewise.
diff --git a/lib/mbfile.h b/lib/mbfile.h
index 7c6d70fcae..716ab3fc89 100644
--- a/lib/mbfile.h
+++ b/lib/mbfile.h
@@ -183,6 +183,10 @@ mbfile_multi_getc (struct mbchar *mbc, struct mbfile_multi *mbf)
assert (mbf->buf[0] == '\0');
assert (mbc->wc == 0);
}
+ else if (bytes == (size_t) -3)
+ /* The previous multibyte sequence produced an additional 32-bit
+ wide character. */
+ bytes = 0;
mbc->wc_valid = true;
break;
}
diff --git a/lib/mbiter.h b/lib/mbiter.h
index 93bad990a1..fadefe104b 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -163,6 +163,10 @@ mbiter_multi_next (struct mbiter_multi *iter)
assert (*iter->cur.ptr == '\0');
assert (iter->cur.wc == 0);
}
+ else if (iter->cur.bytes == (size_t) -3)
+ /* The previous multibyte sequence produced an additional 32-bit
+ wide character. */
+ iter->cur.bytes = 0;
iter->cur.wc_valid = true;
/* When in the initial state, we can go back treating ASCII
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 02e3190f1c..954e11f635 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -172,6 +172,10 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
assert (*iter->cur.ptr == '\0');
assert (iter->cur.wc == 0);
}
+ else if (iter->cur.bytes == (size_t) -3)
+ /* The previous multibyte sequence produced an additional 32-bit
+ wide character. */
+ iter->cur.bytes = 0;
iter->cur.wc_valid = true;
/* When in the initial state, we can go back treating ASCII