From: John Darrington
Subject: i18n
Date: Fri, 17 Mar 2006 21:52:59 +0800
User-agent: Mutt/1.5.4i
On Sun, Mar 12, 2006 at 07:57:44PM -0800, Ben Pfaff wrote:
> thinking about internationalization. I've been reworking the
> PostScript driver to better support i18n, and I've been thinking
> about how to better support it in PSPP in general. I'll likely
> check in a new PostScript driver that does a better job, but it's
> hard to say when. I've been reading the Unicode standard and
> documentation from Adobe and others, trying to learn as much as I
> can about these issues.
Some thoughts on internationalisation, which are only slightly
coherent at the moment, but I thought needed to be aired anyway.
Please forgive me for using this list as scrap paper.
0. Data strings that need internationalisation include:
* String Variable Data.
* Variable Names.
* Value Labels.
* Variable Labels.
* File Labels.
* Document Text.
1. If the system file format had been properly defined, it would
have stored the encoding used for its strings somewhere in the
file. The fact of the matter is that it doesn't.
2. Therefore, we have to a) make a reasonable guess as to what a
system file's encoding is; and b) ensure that reasonable behaviour
ensues if that assumption is incorrect. We have to bear in mind
that PSPP can deal with more than one system file at the same time
(e.g. through the MATCH FILES command), and these could have been
written in different encodings.
2a might be achieved by i) using the LC_CTYPE environment variable;
ii) using the value set by SET LOCALE; or iii) introducing
an optional subcommand to the GET command to specify the locale.
2b might be achieved by heuristics, using a library such as unac
(http://home.gna.org/unac/unac.en.html), or, if all else fails, by
replacing unknown byte sequences with "....".
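To make point 2 a little more concrete, here is a rough sketch in Python (not PSPP's C, purely for brevity). The function names and the order in which guesses are tried are assumptions for illustration, not anything PSPP does today; the `declared` argument stands in for whatever SET LOCALE or a GET subcommand might supply, and U+FFFD plays the role of the "...." placeholder.

```python
import locale


def guess_candidates(declared=None):
    """Return encodings to try, most trusted first (hypothetical helper).

    `declared` is an encoding the user gave explicitly; the locale's
    preferred encoding is used as the fallback guess.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    candidates.append(locale.getpreferredencoding(False))
    return candidates


def decode_sysfile_string(raw, candidates):
    """Decode bytes read from a file that does not record its encoding.

    Returns (text, encoding_used). If every guess fails, ASCII bytes
    are kept, each other byte becomes U+FFFD, and encoding_used is None.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next one.
            continue
    return raw.decode("ascii", errors="replace"), None
```

A caller would do something like `decode_sysfile_string(raw, guess_candidates("latin-1"))`. Note that single-byte encodings such as latin-1 accept any byte sequence, so a wrong guess of that kind decodes "successfully" into nonsense; this is where heuristics would still be needed.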
3. At some level within PSPP we need to decide on an interface where
all strings will have a common encoding. For instance, one
possibility would be to decide that all strings contained within
the dictionary would be utf8. In this case, we'd need to convert
all string data to utf8 within the struct variable (except short_name).
Whilst that's feasible, casefiles cannot possibly (in the
current system) have this invariant, because the system files which
implement them may not in fact be utf8 and converting a casefile
doesn't scale.
An alternative would be to decide that it is the responsibility of
the user interface and output subsystem to convert to utf8. In
which case, both these entities need to know the encoding of the
data they receive. Since (as in the case of MATCH FILES)
variables can come from different system sources, each variable
within a dictionary may have a different encoding. Thus it may be
desirable to add an encoding property to struct variable.
4. However, when writing a system file, it would be sensible to
convert all variables to a common encoding first.
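As an illustration of points 3 and 4 together, here is a small Python sketch of a variable that carries its own encoding, is converted for display only at the output boundary, and is recoded to one common encoding before writing. The `Variable` class and both function names are hypothetical stand-ins, not PSPP's actual struct variable, and only the name and label are modelled.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    name: bytes
    label: bytes
    encoding: str  # the proposed per-variable encoding property


def to_display(var):
    """Convert a variable's strings to text at the UI/output boundary."""
    return var.name.decode(var.encoding), var.label.decode(var.encoding)


def recode_for_writing(variables, target="utf-8"):
    """Return copies of the variables, all stored in one common encoding.

    Raises UnicodeEncodeError if `target` cannot represent a character,
    which a real implementation would have to handle somehow.
    """
    recoded = []
    for v in variables:
        name = v.name.decode(v.encoding).encode(target)
        label = v.label.decode(v.encoding).encode(target)
        recoded.append(Variable(name, label, target))
    return recoded
```

With this shape, a dictionary built by MATCH FILES can hold variables from differently encoded sources side by side, and the conversion cost is paid only for dictionary strings, not for every case in a casefile.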
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.