bug-gnu-utils

Re: iso-codes .pot msgid strings contain non-ASCII characters


From: Alastair McKinstry
Subject: Re: iso-codes .pot msgid strings contain non-ASCII characters
Date: Tue, 25 Apr 2006 09:36:52 +0100
User-agent: Debian Thunderbird 1.0.7 (X11/20051017)

retitle 330990 iso-codes UTF-8 entries are displayed incorrectly
tags 330990 wontfix
thanks

I think this solution from Bruno is the right one. I will look at the code
that uses iso-codes and implement or recommend a fix. I am retitling the bug
and leaving it open so that developers can see the bugfix for _their_ code.

I think the underlying problem is in gettext(), though: it should not be
returning UTF-8 text when the locale is EUC-JIS, etc. Unfortunately, all the
solutions I can think of at the moment (e.g. adding another field to the
header to identify the charset of the msgid strings) add too much complexity
to be worth implementing for this rare corner case, viz. msgid string in
UTF-8, locale in an incompatible charset, and no msgstr defined. Defining two
charsets in the .po files would really bite people who edit the .po files
with vi, etc. I'd be interested in seeing solutions, though.


Regards
Alastair


Bruno Haible wrote:

Paul Eggert wrote:
the GNU gettext manual says:

     Note that the MSGID argument to `gettext' is not subject to
  character set conversion.  Also, when `gettext' does not find a
  translation for MSGID, it returns MSGID unchanged - independently of
  the current output character set.  It is therefore recommended that all
  MSGIDs be US-ASCII strings.

This recommendation is directed to the "normal" use of xgettext, i.e.
extraction of the msgids from source code. The other issue - not mentioned
in the GNU gettext manual, but quite important - is that source code should
be viewable in different encodings, and when you convert some source code
from ISO-8859-1 to UTF-8 (or vice versa), the behaviour of the program
should remain the same.

The situation for iso-codes is different, because
 - It is not extracted from source code; the use of XML files for the
   list of country/location names greatly reduces the possible problems
   when these files would be stored in a different encoding (thanks to
   the encoding declaration present in XML files).
 - There are quite a number of languages/countries/locations in the world
   which cannot be written in ASCII, such as Norwegian Bokmål, Côte
   d'Ivoire, etc.

Therefore I think it's actually OK for iso-codes to use UTF-8 as encoding
of the msgids.

The only remaining problem is in the C code: A program running in, say, an
EUC-JP locale, needs to be a little careful when accessing the message
catalog: not just

     country_translation = dgettext ("iso-codes", country_english_utf8);

but

     country_translation = dgettext ("iso-codes", country_english_utf8);
     if (country_translation == country_english_utf8)
       {
         /* Not found in the message catalog. Use the English name, converted
            to the correct encoding.  */
         country_translation =
           iconv_string (country_translation, "UTF-8", locale_charset ());
       }

You can find code that is a little better than this (it takes care of
transliteration, a non-canonicalized locale_charset() result, etc.) in
propername.c at

http://cvs.savannah.gnu.org/viewcvs/*checkout*/gettext/gettext-tools/lib/propername.c?content-type=text%2Fplain&rev=1.1&root=gettext
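
For readers without access to propername.c, the fallback can be sketched with plain POSIX iconv(). This is only an illustration under stated assumptions, not the code from propername.c: the names utf8_to_charset and country_name_for_locale are hypothetical, nl_langinfo (CODESET) stands in for locale_charset (), no transliteration is attempted, and the target charset is assumed to be a byte-oriented multibyte encoding so that a single NUL byte terminates the result.

```c
#include <iconv.h>
#include <langinfo.h>
#include <libintl.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated UTF-8 string UTF8 to
   TO_CHARSET.  Returns a malloc'ed string, or NULL when the conversion is
   unsupported or fails.  */
static char *
utf8_to_charset (const char *utf8, const char *to_charset)
{
  iconv_t cd = iconv_open (to_charset, "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;

  size_t inleft = strlen (utf8);
  /* Generous bound: room for escape sequences of stateful encodings.  */
  size_t outsize = 4 * inleft + 16;
  char *out = malloc (outsize);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inptr = (char *) utf8;   /* iconv() does not modify the input */
  char *outptr = out;
  size_t outleft = outsize - 1;  /* keep one byte for the terminator */

  size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (res != (size_t) -1)
    /* Emit any final shift sequence for stateful encodings.  */
    res = iconv (cd, NULL, NULL, &outptr, &outleft);
  iconv_close (cd);

  if (res == (size_t) -1)
    {
      free (out);
      return NULL;
    }
  *outptr = '\0';
  return out;
}

/* Hypothetical wrapper around the lookup shown above.  Always returns a
   malloc'ed string (or NULL if out of memory) that the caller must free.  */
static char *
country_name_for_locale (const char *country_english_utf8)
{
  const char *translation = dgettext ("iso-codes", country_english_utf8);
  if (translation != country_english_utf8)
    return strdup (translation);  /* found in the message catalog */

  /* Not found: convert the English UTF-8 name to the locale's charset,
     falling back to the unconverted name if that is not possible.  */
  char *converted =
    utf8_to_charset (country_english_utf8, nl_langinfo (CODESET));
  return converted != NULL ? converted : strdup (country_english_utf8);
}
```

The program is assumed to have called setlocale (LC_ALL, "") at startup; otherwise nl_langinfo (CODESET) reports the charset of the "C" locale. Returning a freshly allocated string in every case keeps ownership simple, at the cost of one copy when a translation exists.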

In other words, UTF-8 is the current de-facto standard encoding. I would leave
the iso-codes PO files in that encoding, and keep the support of other encodings
purely in the C code that uses the .mo files.

Can the format of the XML country list be extended to contain two
spellings, one in UTF-8, one ASCII-ized?  Then the algorithm wouldn't
need to transcode.

It would be possible to add ASCII-ized versions to the country list, but as
Bruno points out, transliteration in glibc is good enough, so it would not
be necessary. However, using ASCII msgid strings here would hurt the
usability of the lists a lot.

A better example of this is the ISO-3166-2 province list: there is a lot of
accented Latin script in province names. Look at the Polish names, for
example.

Imagine you are a Japanese user using EUC-JIS, entering a list of Polish
addresses into a database. The DB helpfully provides you with a drop-down
list of territory names from ISO-3166-2, saving you from having to type in
all those strange accents (and from making errors). The list of Polish
territories is unlikely to be translated into Japanese, and it wouldn't be
useful if it were: you want your addresses in Latin script so that a Polish
postman can deliver your parcels. Now, if the msgid strings were ASCII
transliterations, a Japanese user would have a hard time matching a province
name on a piece of paper, with accents, to its transliterated version: those
rules are fairly arcane if Polish or English is not your native language.
So we do want the accents to be intact in the presented list.

The transliteration in glibc and libiconv is good enough.

Bruno



