advice about encodings

bug-standards

From:	Bruno Haible
Subject:	advice about encodings
Date:	Tue, 10 May 2011 13:25:06 +0200
User-agent:	KMail/1.9.9

Hi Karl and others,

The most recent update of standards.texi has an update in the section
"Writing Robust Programs":

  Whenever possible, try to make programs work properly with
  sequences of bytes that represent multibyte characters, using encodings
  such as UTF-8 and others.
+ You can use libiconv to deal with a wide
+ range of encodings.

I don't think this advice is technically adequate, and would therefore suggest
that it be expanded with more details, or removed.

1) What is an adequate recommendation?

   Many programs need to deal with character data in the encoding of the
   locale. For these programs, input should be treated as a sequence of
   multibyte characters.
   - As much processing of multibyte characters as possible should be based
     on the function mbrtowc() and the wide character functions of <wchar.h>
     and <wctype.h>. Gnulib contains portability fixes for these functions
     from POSIX, as well as multibyte-safe analogs of the <string.h>
     functions.
   - Processing of multibyte characters that requires Unicode properties, on
     the other hand, such as line breaking, word boundary determination, and
     similar tasks, should be done with GNU libunistring.

   Some programs need to deal with character data where the character encoding
   depends on the document or network protocol session, not on the locale.
   These programs should use the POSIX iconv() facility. It is implemented
   in glibc. For non-glibc platforms, the most reliable and complete
   implementation is GNU libiconv. Note that the iconv() function is tricky
   to use without programming errors. For this reason, Gnulib contains a
   couple of convenience modules ('striconv', 'striconveh', 'xstriconv') that
   contain easy-to-use and well-tested wrappers around iconv().
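
To illustrate why iconv() is tricky to use directly, here is a rough sketch
of a whole-string conversion. The function name convert_string is
hypothetical and is not the Gnulib API; the sketch also assumes that the
target encoding terminates strings with a single NUL byte (as UTF-8 does).
Even this simple case has to grow the output buffer on E2BIG and make a final
call to flush any pending shift sequence.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string SRC from
   FROM_CODE to TO_CODE.  Returns a malloc()ed NUL-terminated string,
   or NULL on failure (unknown encoding, invalid input, out of memory).  */
char *
convert_string (const char *src, const char *from_code, const char *to_code)
{
  iconv_t cd = iconv_open (to_code, from_code);  /* note: target comes first */
  if (cd == (iconv_t) -1)
    return NULL;

  char *in = (char *) src;       /* iconv() takes char **, input is not modified */
  size_t in_left = strlen (src);
  size_t cap = in_left + 16;
  size_t used = 0;
  int flushing = 0;
  char *buf = malloc (cap);

  while (buf != NULL)
    {
      char *out = buf + used;
      size_t out_left = cap - used - 1;          /* keep one byte for the NUL */
      size_t r = flushing
                 ? iconv (cd, NULL, NULL, &out, &out_left)
                 : iconv (cd, &in, &in_left, &out, &out_left);
      used = (size_t) (out - buf);

      if (r != (size_t) -1)
        {
          if (flushing)
            break;                               /* conversion complete */
          flushing = 1;                          /* input consumed; flush state */
        }
      else if (errno == E2BIG)
        {
          char *bigger = realloc (buf, cap * 2); /* output full: grow and retry */
          if (bigger == NULL)
            free (buf);
          else
            cap *= 2;
          buf = bigger;
        }
      else
        {
          free (buf);                            /* EILSEQ or EINVAL: bad input */
          buf = NULL;
        }
    }

  if (buf != NULL)
    buf[used] = '\0';
  iconv_close (cd);
  return buf;
}
```

A call such as convert_string ("caf\xe9", "ISO-8859-1", "UTF-8") is expected
to yield the UTF-8 bytes for "café" where the platform provides that
converter; the caller frees the result. The Gnulib wrappers mentioned above
exist to spare programs from rewriting this kind of loop.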
   
2) Is there a need to deal with a wide range of encodings?

   Supporting "a wide range of encodings" was important until ca. 2003.
   Nowadays UTF-8 is by far the most widely used encoding, and support
   for KOI8-R, ISO-8859-2, EUC-JP, etc. is important for only a few programs,
   such as web browsers. XML, for example, requires processors to support
   only two encodings, UTF-8 and UTF-16 [1][2].

3) What are the dangers of leaving the current text as-is?

   - People would think that installing GNU libiconv on a glibc system is
     recommended. It is actually useless.
   - People would think that they need to fiddle with iconv() instead of
     doing multibyte processing via mbrtowc(). That is overkill and leads
     to unnecessarily complex code.
   - People would think that they need to fiddle with iconv() directly,
     not knowing about the services of GNU libunistring and the Gnulib
     convenience wrappers around iconv().

I don't know what the intent of the current wording is, so I cannot
suggest an alternate wording.

Bruno

[1] XML 1.0 <http://www.w3.org/TR/2008/REC-xml-20081126/#charsets>
[2] XML 1.1 <http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets>

-- 
In memoriam Siegfried Rädel <http://en.wikipedia.org/wiki/Siegfried_Rädel>
