Re: UTF-8 BOM parse error

bug-gnu-utils

From:	Bruno Haible
Subject:	Re: UTF-8 BOM parse error
Date:	Mon, 13 Sep 2004 13:41:01 +0200
User-agent:	KMail/1.5

David Necas wrote:
> Gettext version: 0.14.1
>
> Problem: msgfmt (and probably other gettext tools) print an
> unhelpful error
>
>     somefile.po:1:2: parse error
>
> when a PO file starts with UTF-8 BOM (0xef 0xbb 0xbf).

This behaviour is correct. The so-called "UTF-8 BOM" is specified in
the document that defines UTF-8, namely RFC 3629
(http://www.faqs.org/rfcs/rfc3629.html):
   "It is important to understand that the character U+FEFF appearing at
    any position other than the beginning of a stream MUST be interpreted
    with the semantics for the zero-width non-breaking space, and MUST
    NOT be interpreted as a signature."

The PO file format only allows for ASCII white space characters, not for
U+FEFF.
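To make the connection concrete: the three bytes from the report are just the UTF-8 encoding of U+FEFF, and that character is not ASCII white space. A small check in Python, as a sketch only; the name `is_ascii_whitespace` and the exact set of characters it accepts are assumptions for illustration and are not taken from the gettext sources:

```python
# U+FEFF, the character that editors write as a "UTF-8 BOM".
BOM_CHAR = "\ufeff"

# Assumed set of ASCII white space characters, for illustration only.
ASCII_WHITESPACE = {" ", "\t", "\n", "\r", "\f", "\v"}


def is_ascii_whitespace(ch):
    """Return True if `ch` is one of the assumed ASCII white space characters."""
    return ch in ASCII_WHITESPACE


# The byte sequence 0xef 0xbb 0xbf from the report is U+FEFF encoded as UTF-8.
bom_bytes = BOM_CHAR.encode("utf-8")
```

Under these assumptions, a parser that skips only ASCII white space has no reason to skip these three bytes, so it reports them as a parse error.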

The Unix Unicode FAQ (http://www.cl.cam.ac.uk/~mgk25/unicode.html) also
says:
   "Linux/Unix does not use any BOMs and signatures. They would break
    far too many existing ASCII syntax conventions"

> What makes it worse is that any UTF-8-capable text editor or
> viewer does not show the BOM (or at least should not show),
> so one gazes at the file wondering what could be wrong with
> the comment on its first line...

Yes. I once saw the same problem with an XML file.
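Because the editor hides the BOM, looking at the raw bytes is the quickest way to confirm it. A minimal sketch in Python; the function name `has_utf8_bom` is hypothetical, and it only checks the first three bytes of the file:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        # codecs.BOM_UTF8 is the constant b"\xef\xbb\xbf".
        return f.read(3) == codecs.BOM_UTF8
```

Calling `has_utf8_bom("somefile.po")` on the file from the report would be expected to return True.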

The problem is not the file formats that do not allow U+FEFF to be ignored.
The problem is the editors that insert the "UTF-8 BOM".

> Do not use tab characters. Their effect is not predictable.

Do not use UTF-8 BOM. Its effect is predictable: it causes hassles.
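If an editor has already written the BOM, it can be removed before the file is passed to msgfmt. A minimal sketch in Python; the helper names `strip_utf8_bom` and `strip_bom_in_place` are hypothetical, and the file is assumed to be small enough to read into memory:

```python
import codecs


def strip_utf8_bom(data):
    """Return `data` without a leading UTF-8 BOM; other bytes are untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path):
    """Rewrite the file at `path` without its leading BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Only a BOM at the very start of the file is removed. A U+FEFF anywhere else is left alone, in line with the RFC 3629 wording quoted above.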

Bruno

Current Thread

UTF-8 BOM parse error, David Necas (Yeti), 2004/09/11
- Re: UTF-8 BOM parse error, Bruno Haible <=
