Re: iso-codes .pot msgid strings contain non-ASCII characters

bug-gnu-utils

From:	Alastair McKinstry
Subject:	Re: iso-codes .pot msgid strings contain non-ASCII characters
Date:	Tue, 25 Apr 2006 09:36:52 +0100
User-agent:	Debian Thunderbird 1.0.7 (X11/20051017)

retitle 330990 iso-codes UTF-8 entries are displayed incorrectly
tags 330990 wontfix
thanks

I think this solution from Bruno is the right one. I will look at the code
that uses iso-codes and implement or recommend a fix; I am retitling the bug
and leaving it open so developers can see the fix needed in _their_ code.

I think the underlying problem is in gettext(), though: it should not be
returning UTF-8 text when the locale is EUC-JP, etc. Unfortunately all the
solutions I can think of at the moment (e.g. adding another field to the
header to identify the charset of the msgid strings, etc.) add too much
complexity to be worth implementing to solve this rare corner case: viz. a
msgid string in UTF-8, a locale in an incompatible charset, and no msgstr
defined. Defining two charsets in the .po files would really bite people
who edit the .po files with vi, etc.
I'd be interested in seeing solutions, though.


Regards
Alastair


Bruno Haible wrote:

Paul Eggert wrote:

the GNU gettext manual says:

     Note that the MSGID argument to `gettext' is not subject to
  character set conversion.  Also, when `gettext' does not find a
  translation for MSGID, it returns MSGID unchanged - independently of
  the current output character set.  It is therefore recommended that all
  MSGIDs be US-ASCII strings.


This recommendation is directed to the "normal" use of xgettext, i.e.
extraction of the msgids from source code. The other issue - not mentioned
in the GNU gettext manual, but quite important - is that source code should
be viewable in different encodings, and when you convert some source code
from ISO-8859-1 to UTF-8 (or vice versa), the behaviour of the program
should remain the same.

The situation for iso-codes is different, because
 - It is not extracted from source code; the use of XML files for the
   list of country/location names greatly reduces the possible problems
   when these files would be stored in a different encoding (thanks to
   the encoding declaration present in XML files).
 - There are quite a number of languages/countries/locations in the world
   which cannot be written in ASCII, such as Norwegian Bokmål, Côte
   d'Ivoire, etc.

Therefore I think it's actually OK for iso-codes to use UTF-8 as encoding
of the msgids.

The only remaining problem is in the C code: A program running in, say, an
EUC-JP locale, needs to be a little careful when accessing the message
catalog: not just

     country_translation = dgettext ("iso-codes", country_english_utf8);

but

     country_translation = dgettext ("iso-codes", country_english_utf8);
     if (country_translation == country_english_utf8)
       {
         /* Not found in the message catalog. Use the English name, converted
            to the correct encoding.  */
         country_translation =
           iconv_string (country_translation, "UTF-8", locale_charset ());
       }

You can find code that is a little better than this (it cares about
transliteration, a non-canonicalized locale_charset() result, etc.) in
propername.c at

 
http://cvs.savannah.gnu.org/viewcvs/*checkout*/gettext/gettext-tools/lib/propername.c?content-type=text%2Fplain&rev=1.1&root=gettext
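
For readers who want something compilable, the fallback described above can
be sketched with plain POSIX iconv(3). This is only an illustration under
stated assumptions, not the code from propername.c: the helper names
copy_string, utf8_to_charset and iso_codes_name are hypothetical,
nl_langinfo (CODESET) stands in for gnulib's locale_charset (), the
"//TRANSLIT" suffix is assumed to be understood by the iconv in use (glibc
and GNU libiconv accept it), and the output buffer size is a heuristic
rather than a guaranteed bound.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <libintl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: malloc()ed copy of S, or NULL if out of memory.  */
static char *
copy_string (const char *s)
{
  size_t n = strlen (s) + 1;
  char *copy = malloc (n);
  if (copy != NULL)
    memcpy (copy, s, n);
  return copy;
}

/* Hypothetical helper: convert the NUL-terminated UTF-8 string S to the
   charset TOCODE.  Returns a malloc()ed string, or NULL on failure.  */
static char *
utf8_to_charset (const char *s, const char *tocode)
{
  char translit[128];
  iconv_t cd = (iconv_t) -1;
  int n;

  assert (s != NULL && tocode != NULL);

  /* Prefer transliteration if the iconv implementation offers it;
     otherwise fall back to a strict conversion.  */
  n = snprintf (translit, sizeof translit, "%s//TRANSLIT", tocode);
  if (n > 0 && (size_t) n < sizeof translit)
    cd = iconv_open (translit, "UTF-8");
  if (cd == (iconv_t) -1)
    cd = iconv_open (tocode, "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;

  size_t inleft = strlen (s);
  size_t outleft = 4 * inleft + 16;   /* heuristic size, an assumption */
  char *out = malloc (outleft + 1);
  char *inptr = (char *) s;           /* iconv() does not modify the input */
  char *outptr = out;

  /* Convert, then flush any pending shift sequence.  */
  if (out == NULL
      || iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    {
      free (out);
      out = NULL;
    }
  else
    *outptr = '\0';

  iconv_close (cd);
  return out;
}

/* Hypothetical wrapper: look NAME_UTF8 up in the "iso-codes" domain.  If
   there is no translation, dgettext() returns its argument unchanged, so
   the English UTF-8 name is converted to the locale's charset instead.
   Always returns a malloc()ed string (or NULL if out of memory).  */
static char *
iso_codes_name (const char *name_utf8)
{
  const char *translation = dgettext ("iso-codes", name_utf8);
  char *converted;

  if (translation != name_utf8)
    return copy_string (translation);  /* found in the message catalog */

  converted = utf8_to_charset (name_utf8, nl_langinfo (CODESET));
  /* If the conversion fails, fall back to the unconverted UTF-8 name.  */
  return converted != NULL ? converted : copy_string (name_utf8);
}
```

A caller would normally run setlocale (LC_ALL, "") once at startup so that
nl_langinfo (CODESET) reflects the user's locale, and would free () the
string returned by iso_codes_name () when done with it.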

In other words, UTF-8 is the current de-facto standard encoding. I would leave
the iso-codes PO files in that encoding, and keep the support of other encodings
purely in the C code that uses the .mo files.

Can the format of the XML country list be extended to contain two
spellings, one in UTF-8, one ASCII-ized?  Then the algorithm wouldn't
need to transcode.

It would be possible to add ASCII-ized versions to the country list; as
Bruno points out, transliteration in glibc is good enough and it would not
be necessary.

However, using ASCII msgid strings here would badly hurt the usability of
the lists.

A better example of this is the ISO-3166-2 province list: there are lots of
accented Latin-script characters in province names: look at the Polish
names, for example.

Imagine you are a Japanese user using EUC-JP, and entering a list of Polish
addresses into a database. The DB helpfully provides you with a drop-down
list of territory names from ISO-3166-2, saving you from having to type in
all those strange accents (and making errors). The list of Polish
territories is unlikely to be translated into Japanese, and wouldn't be
useful if it were: you want your addresses in Latin script so a Polish
postman can deliver your parcels. Now, if the msgid strings were ASCII
transliterations, a Japanese user would have a hard time matching up the
province name from a piece of paper, with accents, to a transliterated
version: those rules are fairly arcane if Polish / English are not your
native languages.

So we do want the accents to be intact in the presented list.

The transliteration in glibc and libiconv is good enough.

Bruno
