advice about encodings
From: Bruno Haible
Subject: advice about encodings
Date: Tue, 10 May 2011 13:25:06 +0200
User-agent: KMail/1.9.9
Hi Karl and others,
The most recent update of standards.texi has an update in the section
"Writing Robust Programs":
Whenever possible, try to make programs work properly with
sequences of bytes that represent multibyte characters, using encodings
such as UTF-8 and others.
+ You can use libiconv to deal with a wide
+ range of encodings.
I don't think this advice is technically adequate, and would therefore suggest
that it be expanded with more details, or removed.
1) What is an adequate recommendation?
Many programs need to deal with character data in the encoding of the
locale. For these programs, input should be treated as a sequence of
multibyte characters.
- As much processing of multibyte characters as possible should be based
on the function mbrtowc() and the wide character functions of <wchar.h>
and <wctype.h>. Gnulib contains portability fixes for these functions
from POSIX, as well as multibyte-safe analogs of the <string.h>
functions.
- Processing of multibyte characters that requires Unicode properties, on
the other hand, such as line breaking, word bounds determination, and
similar, should be done with GNU libunistring.
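To illustrate the first point, here is a minimal sketch of a loop based on mbrtowc() and <wctype.h>. The function name mb_count_chars and its alpha_out parameter are made up for this example; it assumes a NUL-terminated string in the locale's encoding and that the caller has already called setlocale (LC_ALL, "").

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, for illustration only.
   Returns the number of multibyte characters in S, or (size_t) -1 if S
   contains an invalid or incomplete multibyte sequence.  If ALPHA_OUT is
   not NULL, it receives the number of alphabetic characters seen.  */
size_t
mb_count_chars (const char *s, size_t *alpha_out)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* initial conversion state */

  size_t remaining = strlen (s);
  size_t count = 0;
  size_t alpha = 0;

  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        return (size_t) -1;           /* invalid or truncated sequence */
      if (n == 0)
        break;                        /* NUL character; not expected here */

      if (iswalpha ((wint_t) wc))
        alpha++;

      s += n;
      remaining -= n;
      count++;
    }

  if (alpha_out != NULL)
    *alpha_out = alpha;
  return count;
}
```

The point of the sketch is that the loop advances by whatever length mbrtowc() reports, never by a fixed number of bytes, so the same code works in UTF-8, EUC-JP, or single-byte locales.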
Some programs need to deal with character data where the character encoding
depends on the document or network protocol session, not on the locale.
These programs should use the POSIX iconv() facility. It is implemented
in glibc. For non-glibc platforms, the most reliable and complete
implementation is GNU libiconv. Note that the iconv() function is tricky
to use without programming errors. For this reason, Gnulib contains a
couple of convenience modules ('striconv', 'striconveh', 'xstriconv') that
contain easy-to-use and well-tested wrappers around iconv().
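For illustration, here is a minimal sketch of a direct iconv() call, which also shows why the wrappers are attractive. The function name convert_string is made up for this example and is not a Gnulib API. It assumes a NUL-terminated input in an encoding without embedded NUL bytes (so not UTF-16 or UTF-32) and a caller-supplied output buffer.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Converts the NUL-terminated string SRC from encoding FROM to encoding TO
   and stores the NUL-terminated result in DST, which has room for DST_SIZE
   bytes.  Returns 0 on success, -1 on failure with errno set
   (E2BIG: DST too small; EILSEQ/EINVAL: invalid or truncated input).  */
int
convert_string (const char *from, const char *to,
                const char *src, char *dst, size_t dst_size)
{
  assert (from != NULL && to != NULL && src != NULL && dst != NULL);

  if (dst_size == 0)
    {
      errno = E2BIG;
      return -1;
    }

  /* Note the argument order: target encoding first.  */
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return -1;

  char *in = (char *) src;            /* iconv() is not const-correct */
  size_t in_left = strlen (src);
  char *out = dst;
  size_t out_left = dst_size - 1;     /* reserve room for the final NUL */

  int result = 0;
  if (iconv (cd, &in, &in_left, &out, &out_left) == (size_t) -1)
    result = -1;
  /* A second call with a NULL input flushes any pending shift sequence
     of a stateful target encoding.  */
  else if (iconv (cd, NULL, NULL, &out, &out_left) == (size_t) -1)
    result = -1;

  int saved_errno = errno;
  iconv_close (cd);
  errno = saved_errno;

  if (result == 0)
    *out = '\0';
  return result;
}
```

Even this small version has to handle the argument order of iconv_open(), the four in/out pointers, the flush call, and errno preservation. A production version would also need to grow the output buffer on E2BIG, which is the kind of detail the Gnulib wrappers take care of.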
2) Is there a need to deal with a wide range of encodings?
To support "a wide range of encodings" was important until ca. 2003.
Nowadays UTF-8 is by far the most widely used encoding, and support
for KOI8-R, ISO-8859-2, EUC-JP, etc. is important only for a few programs,
such as web browsers. XML, for example, requires processors to support only
two encodings, UTF-8 and UTF-16 [1][2].
3) What are the dangers of leaving the current text as-is?
- People would think that installing GNU libiconv on a glibc system is
recommended. It is actually useless.
- People would think that they need to fiddle with iconv() instead of
doing multibyte processing via mbrtowc(). Using iconv() for that is
overkill and leads to unnecessarily complex code.
- People would think that they need to fiddle with iconv() directly,
not knowing about the services of GNU libunistring and the Gnulib
convenience wrappers around iconv().
I don't know what the intent of the current wording is, and therefore I
cannot suggest an alternative wording.
Bruno
[1] XML 1.0 <http://www.w3.org/TR/2008/REC-xml-20081126/#charsets>
[2] XML 1.1 <http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets>
--
In memoriam Siegfried Rädel <http://en.wikipedia.org/wiki/Siegfried_Rädel>