Re: doc: New chapter "Strings and Characters"
From: Bruno Haible
Subject: Re: doc: New chapter "Strings and Characters"
Date: Mon, 19 Jun 2023 19:29:36 +0200
Hi Paul,
Thanks for the feedback.
> > +The @posixheader{ctype.h} API, that was designed only with unibyte
> > +encodings in mind, is useless nowadays; it does not work in
> > +multibyte locales.
>
> It's still useful, even in multibyte locales, when dealing with data
> that is inherently unibyte. Perhaps prepend "for general text
> processing" to the sentence. Similarly for the later occurrence of
> "useless and obsolete".
>
>
> > +While UTF-8 is the most common multibyte encoding, GB18030 is there as
> > +well and will not go away within decades, because it is a Chinese
> > +government standard, last revised in 2022.
>
> Again, let's not focus on GB18030 to the exclusion of other national
> encodings.
I still need to mention GB18030 as the worst-case example, to explain why
strchr() and similar functions may be problematic. BIG5 is not _that_ bad.
DEC-HANYU and ISO-IR-165, which are also that bad, are not supported as
locale encodings in glibc.
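
To make the strchr() concern concrete, here is a minimal sketch. It relies on one fact about GB18030: the second and fourth bytes of a four-byte character lie in the range 0x30..0x39, that is, they coincide with the ASCII digits. The helper name first_byte_match and the sample byte sequence are made up for illustration, and no locale is set, since strchr() compares bytes regardless of the locale.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: byte offset of the first
   byte in S that equals C, or -1 if there is none.  This is exactly what
   strchr() computes; it knows nothing about character boundaries.  */
static long
first_byte_match (const char *s, char c)
{
  const char *p = strchr (s, c);
  return p != NULL ? (long) (p - s) : -1;
}

/* Sample string, assumed to be interpreted in a GB18030 locale:
     offset 0     'A'
     offset 1..4  the bytes 0x81 0x30 0x81 0x30, which have the form of
                  a single GB18030 four-byte character
     offset 5     the ASCII digit '0'
   The literal is split before the final "0" so that the digit is not
   absorbed into the preceding hexadecimal escape.  */
static const char sample[] = "A\x81\x30\x81\x30" "0";
```

With these definitions, first_byte_match (sample, '0') yields 2, an offset in the middle of the four-byte character, although the only real '0' character sits at offset 5. A search that respects character boundaries, for example Gnulib's mbschr, is needed to avoid this kind of false match.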
> > +For complex string processing, the provided strings functions may not be
>
> strings -> string
Done as follows:

2023-06-19  Bruno Haible  <bruno@clisp.org>

	doc: Corrections to the "Strings and Characters" chapter.
	Suggested by Paul Eggert.
	* doc/strings.texi: Corrections: GB18030 is rarely used nowadays.
	<ctype.h> functions can be useful for specific data.

diff --git a/doc/strings.texi b/doc/strings.texi
index 131221f583..cbed6533c4 100644
--- a/doc/strings.texi
+++ b/doc/strings.texi
@@ -76,7 +76,7 @@
``unibyte locale'', otherwise of a ``multibyte locale''.
It is important to realize that the majority of Unix installations
-nowadays use UTF-8 or GB18030 as locale encoding; therefore, the
+nowadays use UTF-8 as locale encoding; therefore, the
majority of users are using multibyte locales.
Three important facts to remember are:
@@ -89,8 +89,8 @@
@itemize @bullet
@item
The @posixheader{ctype.h} API, that was designed only with unibyte
-encodings in mind, is useless nowadays; it does not work in
-multibyte locales.
+encodings in mind, is useless nowadays for general text processing; it
+does not work in multibyte locales.
@item
The @posixfunc{strlen} function does not return the number of characters
in a string. Nor does it return the number of screen columns occupied
@@ -107,9 +107,9 @@
@emph{Multibyte does not imply UTF-8 encoding.}
@end cartouche
-While UTF-8 is the most common multibyte encoding, GB18030 is there as
-well and will not go away within decades, because it is a Chinese
-government standard, last revised in 2022.
+While UTF-8 is the most common multibyte encoding, GB18030 is also a
+supported locale encoding on GNU systems (mostly because it is a Chinese
+government standard, last revised in 2022).
@cartouche
@emph{Searching for a character in a string is not the same as searching
@@ -184,7 +184,7 @@
@node Iterating through strings
@subsubsection Iterating through strings
-For complex string processing, the provided strings functions may not be
+For complex string processing, the provided string functions may not be
enough, and what you need is a way to iterate through a string while
processing each (possibly multibyte) character in turn. Gnulib provides
two modules for this purpose. Both iterate through the string in
@@ -604,7 +604,8 @@
program that runs only in unibyte locales.
ISO C and POSIX standardized an API for characters of type @code{char},
-in @code{<ctype.h>}. This API is nowadays useless and obsolete.
+in @code{<ctype.h>}. This API is nowadays useless and obsolete, when it
+comes to general text processing.
The important lessons to remember are: