Re: UTF-8 BOM parse error
From: Bruno Haible
Subject: Re: UTF-8 BOM parse error
Date: Mon, 13 Sep 2004 13:41:01 +0200
User-agent: KMail/1.5
David Necas wrote:
> Gettext version: 0.14.1
>
> Problem: msgfmt (and probably other gettext tools) print an
> unhelpful error
>
> somefile.po:1:2: parse error
>
> when a PO file starts with UTF-8 BOM (0xef 0xbb 0xbf).
This behaviour is correct. The so-called "UTF-8 BOM" is specified in
the document that defines UTF-8, namely RFC 3629
(http://www.faqs.org/rfcs/rfc3629.html):
"It is important to understand that the character U+FEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a signature."
The PO file format allows only ASCII white space characters, not
U+FEFF.
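To make this concrete, here is a small Python sketch (not part of
gettext) showing that the three bytes from the report decode to U+FEFF
and that this character is not ASCII white space. The set
ASCII_WHITESPACE and the helper starts_with_utf8_bom are names made up
for this illustration; the set is an assumed approximation, not the
list used by the PO parser.

```python
import codecs

# The three bytes named in the report: 0xef 0xbb 0xbf.
bom_bytes = codecs.BOM_UTF8
decoded = bom_bytes.decode("utf-8")  # a single character, U+FEFF

# Assumption: an illustrative set of ASCII white space characters,
# not the exact set accepted by the PO grammar.
ASCII_WHITESPACE = frozenset(" \t\n\r\f\v")


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    print(bom_bytes.hex(), "U+%04X" % ord(decoded))
    print("ASCII white space?", decoded in ASCII_WHITESPACE)
```

Reading the first bytes of a file in binary mode and passing them to
starts_with_utf8_bom is enough to find the character that the editor
hides.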
The Unix Unicode FAQ (http://www.cl.cam.ac.uk/~mgk25/unicode.html) also
says:
"Linux/Unix does not use any BOMs and signatures. They would break
far too many existing ASCII syntax conventions"
> What makes it worse is that any UTF-8-capable text editor or
> viewer does not show the BOM (or at least should not show),
> so one gazes at the file wondering what could be wrong with
> the comment on its first line...
Yes. I once saw the same problem with an XML file.
The problem is not the file formats that refuse to ignore U+FEFF.
The problem is the editors that insert the "UTF-8 BOM".
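If an editor has already written the BOM, one workaround is to remove
it before running msgfmt. The following is a minimal Python sketch, not
a gettext feature; strip_utf8_bom and strip_bom_in_file are
hypothetical helper names. It removes only a single leading BOM and
leaves any later U+FEFF alone, in line with the RFC 3629 text quoted
above.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without one leading UTF-8 BOM; other bytes are untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Rewrite the file without its leading BOM.

    Returns True if a BOM was removed, False if the file was left as is.
    """
    p = Path(path)
    original = p.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False
    p.write_bytes(cleaned)
    return True
```

For example, strip_bom_in_file("somefile.po") would rewrite the file
from the report only if it starts with 0xef 0xbb 0xbf.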
> Do not use tab characters. Their effect is not predictable.
Do not use UTF-8 BOM. Its effect is predictable: it causes hassles.
Bruno