bug-gnulib

From wchar_t to char32_t


From: Bruno Haible
Subject: From wchar_t to char32_t
Date: Mon, 19 Jun 2023 20:05:15 +0200

For many years, processing multibyte strings required the mbrtowc family of
functions and the 'wchar_t' type.

The major limitation of this API is that on Windows platforms (Cygwin as well
as native Windows) and in 32-bit mode on AIX, a 'wchar_t' is limited to 16 bits,
and this causes all sorts of bugs with characters outside the Unicode BMP.
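
To illustrate why a 16-bit 'wchar_t' is a problem: a code point outside the
BMP does not fit in one 16-bit unit and has to be split into a UTF-16
surrogate pair, so code that processes one 'wchar_t' at a time sees two
halves instead of one character. A minimal sketch of the arithmetic follows;
the function name 'to_utf16' is made up for this illustration and is not part
of any library mentioned here.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Stores the UTF-16 encoding of the code point CP in UNITS and returns
   the number of 16-bit units used (1 or 2).
   Assumes CP is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).  */
int
to_utf16 (uint32_t cp, uint16_t units[2])
{
  if (cp < 0x10000)
    {
      /* Inside the BMP: one unit is enough.  */
      units[0] = (uint16_t) cp;
      return 1;
    }
  /* Outside the BMP: split the 20 remaining bits into two surrogates.  */
  cp -= 0x10000;
  units[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  units[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

For U+1F600 this yields the two units 0xD83D 0xDE00, whereas a BMP character
such as U+4E2D stays a single unit.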

Before 2010, we thought that this would only affect rarely used Chinese
characters. But nowadays, emoji are in Unicode, outside the BMP, and
are frequently used on the web. So, supporting characters outside the BMP
has become more important.

In 2011, ISO C added the 'char32_t' type as a "32-bit wide character" type.
By now, many OSes have this type and the corresponding mbrtoc32 function.
Elements of this type are actual Unicode code points. ISO C 11 only hinted
at this, but ISO C 23 actually requires it. All platforms that have the
mbrtoc32 function fulfil this requirement, and Gnulib's substitute
(module 'mbrtoc32') does so as well.
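
As a rough sketch of what using mbrtoc32 looks like, here is a loop that
counts the characters of a multibyte string. It uses only the ISO C
<uchar.h> interface; the function name 'count_characters' is invented for
this example, and the caller is assumed to have selected the desired locale
with setlocale beforehand.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Illustrative helper, not a Gnulib function.
   Returns the number of characters in the multibyte string S, decoded
   according to the current locale, or (size_t) -1 if S contains an
   invalid or incomplete multibyte sequence.  */
size_t
count_characters (const char *s)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);   /* initial conversion state */

  size_t remaining = strlen (s);
  size_t count = 0;
  while (remaining > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, remaining, &state);
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;           /* invalid or incomplete sequence */
      count++;
      if (ret == (size_t) -3)
        continue;                     /* another char32_t, no input consumed */
      if (ret == 0)
        ret = 1;                      /* a null character occupies one byte */
      s += ret;
      remaining -= ret;
    }
  return count;
}
```

Each iteration produces one full code point in 'c', regardless of how many
bytes it occupied in the input.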

In particular, on glibc systems: since glibc 2.24, mbrtoc32 is identical to
mbrtowc. And the Gnulib convenience functions for char32_t characters
just delegate to the corresponding glibc functions for wchar_t wide characters.

So, we are now in a position to support characters outside the BMP across
GNU programs, in a portable and maintainable way.

I added some documentation a month ago:
https://www.gnu.org/software/gnulib/manual/html_node/Strings-and-Characters.html

The migration from wchar_t to char32_t can be done by writing 'char32_t'
instead of 'wchar_t', and replacing function names according to this table
(grouped by ISO C section number; '--' means there is no char32_t
counterpart):

  wchar_t             char32_t       notes
  -------             --------       -----
  7.31.2
  *wprintf            --             rarely used
  *wscanf             --             rarely used
  7.31.3
  fgetwc              --             rarely used, see "The wchar_t mess"
  fputwc              --             rarely used
  7.31.4.1
  wcsto{f,d,ld}       --             rarely used
  wcsto{l,ll,ul,ull}  --             rarely used
  7.31.4.2
  wcscpy              u32_strcpy
  wcsncpy             u32_strncpy
  wmemcpy             u32_cpy
  wmemmove            u32_move
  7.31.4.3
  wcscat              u32_strcat
  wcsncat             u32_strncat
  7.31.4.4
  wcscmp              u32_strcmp
  wcscoll             u32_strcoll
  wcsncmp             u32_strncmp
  wcsxfrm             --             rarely used
  wmemcmp             u32_cmp
  7.31.4.5/6
  wcschr              u32_strchr
  wcscspn             u32_strcspn
  wcspbrk             u32_strpbrk
  wcsrchr             u32_strrchr
  wcsspn              u32_strspn
  wcsstr              u32_strstr
  wcstok              u32_strtok
  wmemchr             u32_chr
  7.31.4.7
  wcslen              u32_strlen
  wmemset             u32_set
  7.31.5
  wcsftime            --             rarely used
  7.31.6.1
  btowc               btoc32
  wctob               c32tob
  7.31.6.2
  mbsinit             mbsinit
  7.31.6.3
  mbrlen              --             rarely used, use mbrtoc32 instead
  mbrtowc             mbrtoc32
  wcrtomb             c32rtomb
  7.31.6.4
  mbsrtowcs           mbsrtoc32s
  wcsrtombs           c32srtombs
  7.32.2.1
  iswalnum            c32isalnum
  iswalpha            c32isalpha
  iswblank            c32isblank
  iswcntrl            c32iscntrl
  iswdigit            c32isdigit
  iswgraph            c32isgraph
  iswlower            c32islower
  iswprint            c32isprint
  iswpunct            c32ispunct
  iswspace            c32isspace
  iswupper            c32isupper
  iswxdigit           c32isxdigit
  7.32.2.2
  iswctype            --             rarely used
  wctype              --             rarely used
  7.32.3.1
  towlower            c32tolower
  towupper            c32toupper
  7.32.3.2
  towctrans           --             rarely used
  wctrans             --             rarely used
  POSIX
  wcwidth             c32width
  wcswidth            c32swidth


Paul has already started this migration, in diffutils:
https://git.savannah.gnu.org/gitweb/?p=diffutils.git;a=commitdiff;h=a2e301b52cc5bdb44540aa66860dc59fa1fa5a89

In Gnulib, the following areas will need migration:

* lib/mbchar.h
  lib/mbiter.h
  lib/mbuiter.h
  Draft patch attached.

* lib/dfa.c
  lib/localeinfo.h
  lib/localeinfo.c
  Needs to be carefully done, so as to not break gawk.

* lib/regcomp.c
  lib/regexec.c
  lib/regex_internal.h
  lib/regex_internal.c
  Needs to be done in a way that is acceptable to glibc upstream.

* lib/fnmatch.c
  Likewise.

* lib/exclude.c

* lib/nstrftime.c

* lib/quotearg.c


Bruno

Attachment: mbchar-migration.diff