bug-glibc

iconv 2.2.4 doesn't handle UTF-8 BOM


From: Alexander Dupuy
Subject: iconv 2.2.4 doesn't handle UTF-8 BOM
Date: Fri, 21 Mar 2003 04:41:49 -0500

While iconv 2.2.4 (and/or libiconv 1.7) will eat a zero-width
no-break space (U+FEFF, aka Byte Order Mark, or BOM) at the beginning
of UTF-16* input (and output a BOM for UTF-16* output), it doesn't
ignore an initial BOM in UTF-8 data.  While use of a BOM for UTF-8
encoding isn't as common (since there are no byte-ordering issues for
8-bit data), some applications and operating systems use a BOM for
UTF-8 to distinguish it from other 8-bit character data in the default
locale (I have heard rumors that Mac OS X does this).
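
To make the byte-level picture concrete, here is a small sketch in
Python (not iconv itself; the helper name bom_bytes is made up for
illustration) showing what U+FEFF turns into under each encoding.  It
only demonstrates the encodings, and says nothing about how any
particular iconv build behaves.

```python
import codecs

BOM = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE, used as the byte order mark


def bom_bytes(encoding: str) -> bytes:
    """Return the bytes that U+FEFF becomes in the given encoding.

    Hypothetical helper, for illustration only.
    """
    return BOM.encode(encoding)


utf8_bom = bom_bytes("utf-8")         # three bytes, same in any byte order
utf16le_bom = bom_bytes("utf-16-le")  # little-endian byte pair
utf16be_bom = bom_bytes("utf-16-be")  # big-endian byte pair

print(utf8_bom.hex(), utf16le_bom.hex(), utf16be_bom.hex())
# prints: efbbbf fffe feff
```

The UTF-8 form is the same three bytes regardless of platform, which
is why it carries no byte-order information and serves only as an
encoding signature.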

The Unicode website documents that a BOM may occur in any Unicode
transformation format (http://www.unicode.org/faq/utf_bom.html#23) and
explicitly notes that if you really want a zero-width no-break space at
the start of your data stream, you should double it.  (Of course, even
that's not good enough, since GNU iconv will eat BOM anywhere in UTF-16,
but that's another issue, and I'm not complaining about it.)

While I have no position on whether iconv should eat BOM anywhere in
UTF-8 data (I'm inclined to say no, but don't feel very strongly about
it), it certainly seems that iconv should at least eat a BOM at the
start of the input being converted.  Prepending a BOM to UTF-8 (or
UTF-7) output would probably be a bad idea, since many other
applications, like iconv currently, would just choke on the UTF-8 BOM.
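
The behavior requested above could be sketched roughly as follows.
This is Python rather than a patch to iconv, and the function names
strip_leading_utf8_bom and utf8_to_text are hypothetical; it assumes
the whole input is available as one byte string.  It drops exactly one
leading BOM, leaves any later U+FEFF alone, and never adds a BOM on
output.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def strip_leading_utf8_bom(data: bytes) -> bytes:
    """Drop a single BOM at the very start of the input, if present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def utf8_to_text(data: bytes) -> str:
    """Decode UTF-8 after removing at most one leading BOM.

    U+FEFF anywhere else in the data is kept as an ordinary character.
    """
    return strip_leading_utf8_bom(data).decode("utf-8")


plain = utf8_to_text(UTF8_BOM + b"abc")               # BOM eaten
doubled = utf8_to_text(UTF8_BOM + UTF8_BOM + b"abc")  # second U+FEFF survives
interior = utf8_to_text(b"ab" + UTF8_BOM + b"c")      # interior U+FEFF kept
reencoded = plain.encode("utf-8")                     # no BOM prepended
```

With this approach, doubling the BOM keeps one zero-width no-break
space at the start of the decoded text, in line with the Unicode FAQ
advice cited earlier.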

@alex




