Re: i18n
From: John Darrington
Subject: Re: i18n
Date: Mon, 20 Mar 2006 10:36:32 +0800
User-agent: Mutt/1.5.9i

On Sun, Mar 19, 2006 at 05:26:47PM -0800, Ben Pfaff wrote:
>
> > > I don't know about the unac library.  What are its advantages
> > > over iconv?
> >
> > Iconv is only useful if we know the source encoding.  If we don't
> > know it, we have to guess.  If we guess wrong, then iconv will fail.
> > Also, it won't convert between encodings where data would be lost.
> > Unac, on the other hand, is a (more) robust but lossy tool.  For
> > example, given the character 0xe1 (a with acute accent) in ISO-8859-1,
> > it'll convert it to 'a' in ASCII.  I don't know how it would handle
> > converting from Japanese characters to ASCII ....
> I do not understand how unac could remove accents from text
> without knowing the source encoding.  I don't see any indication
> that it can do so, now that I have read the unac manpage from the
> webpage you pointed out.  In fact, the first argument to the
> unac_string() function is the name of the source encoding, and
> unac is documented to use iconv internally to convert to UTF-16.
> (Why would we want to remove accents, by the way?)
Ideally we wouldn't.  I've only looked very briefly at the unac web
page.  As I understood it, it was supposed to convert a string from an
arbitrary encoding into a reasonable approximation of that string
which could be represented in plain ASCII.  Perhaps I need to read
the web page more closely.
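
To make the strict-versus-lossy distinction concrete, here is a small
stand-alone C sketch (nothing to do with PSPP's own code) that converts
the ISO-8859-1 byte 0xe1 to ASCII, first strictly and then with the
"//TRANSLIT" suffix, which approximates roughly what unac would do.
"//TRANSLIT" is a glibc / GNU libiconv extension, so treat that part as
an assumption about the iconv implementation in use:

#include <errno.h>
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

static void
try_convert (const char *tocode, const char *in)
{
  iconv_t cd = iconv_open (tocode, "ISO-8859-1");
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return;
    }

  char *inp = (char *) in;
  size_t inleft = strlen (in);
  char outbuf[64];
  char *outp = outbuf;
  size_t outleft = sizeof outbuf - 1;

  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    printf ("%-16s: failed (%s)\n", tocode, strerror (errno));
  else
    {
      *outp = '\0';
      printf ("%-16s: \"%s\"\n", tocode, outbuf);
    }
  iconv_close (cd);
}

int
main (void)
{
  setlocale (LC_ALL, "");              /* glibc's transliteration tables
                                          are locale-dependent.  */
  const char in[] = "v\xe1r";          /* 0xe1 = a-acute in ISO-8859-1. */
  try_convert ("ASCII", in);           /* Strict: fails with EILSEQ.  */
  try_convert ("ASCII//TRANSLIT", in); /* Lossy: "var" (or "v?r").  */
  return 0;
}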
> > So we agree then that casefile data must not be meddled with.
> > However, this also means that both (a) the keys in Value Labels and
> > (b) the Missing Values must be left verbatim.  Otherwise, they'll
> > no longer match.  And this has a rather unfortunate consequence: the
> > dictionary cannot be guaranteed to have a consistent encoding.
> > Hence my suggestion of a per-variable encoding attribute.
> This sounds like a mess.  Any reference to more than one string
> variable will have to deal with coding translation.  The most
> obvious place where this happens is in string expressions,
> e.g. consider the CONCAT function especially.  I'm sure we'll get
> confused when we have to fix up code all over to do that.  I bet
> that our users will get even more confused.
True. I hadn't considered that.
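
To spell out (mostly for my own benefit) why mixed encodings bite here:
the same text encoded two different ways doesn't even agree at the byte
level, so a byte-wise CONCAT of two differently-encoded variables yields
something that is valid in neither encoding.  A throwaway C illustration
(the byte values are simply the standard ISO-8859-1 and UTF-8 encodings
of e-acute):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  const char latin1[] = "caf\xe9";     /* "cafe'" in ISO-8859-1: 63 61 66 e9 */
  const char utf8[]   = "caf\xc3\xa9"; /* "cafe'" in UTF-8: 63 61 66 c3 a9 */

  /* What a byte-wise CONCAT over per-variable encodings would produce:
     bytes from two different encodings glued together verbatim.  */
  char mixed[sizeof latin1 + sizeof utf8];
  strcpy (mixed, latin1);
  strcat (mixed, utf8);

  for (const char *p = mixed; *p != '\0'; p++)
    printf ("%02x ", (unsigned char) *p);
  putchar ('\n');   /* 63 61 66 e9 63 61 66 c3 a9: the lone 0xe9 is invalid
                       UTF-8, and 0xc3 0xa9 is junk read as ISO-8859-1.  */
  return 0;
}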
> Let me elaborate.  Here is the plan that I envision:
>
>   i. PSPP adopts a single locale that defaults to the system locale
>      but can be changed with SET LOCALE.  (I'll call this the "PSPP
>      locale".)
>
>  ii. All string data in all casefiles and dictionaries is in the
>      PSPP locale, or at least we make that assumption.
>
> iii. The GET command assumes by default that data read in is in
>      the PSPP locale.  If the user provides a LOCALE subcommand
>      specifying something different, then missing values and
>      value label keys are converted as the dictionary is read, and
>      string case data is converted "on the fly" as data is read
>      from the file.  We can also provide a NOCONVERT subcommand
>      (with a better name, I hope) that flags string variables
>      that are not to be converted.
>
>  iv. The SAVE command assumes by default that data written out is
>      to be in the PSPP locale.  If the user provides a LOCALE
>      subcommand specifying something different, then we convert
>      string data, etc., as we write it, and again exceptions can
>      be accommodated.
>
>   v. Users who want accurate translations, as in your survey
>      example, choose a reasonable PSPP locale, e.g. something based
>      on UTF-8.
>
>  vi. We look into the possibility of tagging system files with a
>      locale.  The system file format is extensible enough that
>      this would really just be a matter of testing whether SPSS
>      will complain loudly about our extension records or just
>      silently ignore them.
I think there is no ideal solution to this problem.  Your proposal
might be as good as any other, and it is certainly simpler than what I
had suggested.  However, I'm worried about what happens if our
assumption at (ii) turns out to be wrong.  We need to ensure some
sensible behaviour (hence my idea of unac).
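
For what it's worth, here is roughly how I imagine step (iii)'s "on the
fly" conversion could be written around iconv, falling back to a
verbatim copy when the conversion cannot be done, which is one possible
"sensible behaviour" if assumption (ii) is wrong.  This is only a
sketch; convert_string() and its parameters are invented for the
example, not anything that exists in PSPP:

#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Converts the SIZE bytes at IN from encoding FROM to encoding TO.
   Returns a newly allocated, NUL-terminated string.  If the conversion
   cannot be performed, returns a verbatim copy instead of failing.  */
static char *
convert_string (const char *to, const char *from,
                const char *in, size_t size)
{
  iconv_t cd = iconv_open (to, from);
  if (cd != (iconv_t) -1)
    {
      size_t out_size = size * 4 + 1;   /* Generous upper bound.  */
      char *out = malloc (out_size);
      char *inp = (char *) in;
      char *outp = out;
      size_t inleft = size;
      size_t outleft = out_size - 1;

      if (out != NULL
          && iconv (cd, &inp, &inleft, &outp, &outleft) != (size_t) -1)
        {
          *outp = '\0';
          iconv_close (cd);
          return out;
        }

      free (out);
      iconv_close (cd);
    }

  /* Fallback: pass the bytes through untouched rather than lose data.  */
  char *copy = malloc (size + 1);
  if (copy != NULL)
    {
      memcpy (copy, in, size);
      copy[size] = '\0';
    }
  return copy;
}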
Regarding (vi), I don't think SPSS would complain (at least not
loudly) about unrecognised records.  But all hell might break loose if
we commandeered an unused record type for this purpose and a later
version of SPSS chose to use it for another purpose.

Incidentally, SPSS V14 writes system files with a Type 7, Subtype 16
record.  I haven't been able to determine the purpose of this record.
Perhaps it specifies the encoding?
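
On the extensibility point, the reason unknown records tend to be
ignored is that every Type 7 record starts with the same self-describing
header, so a reader can simply skip the data of any subtype it does not
understand.  A small sketch of that layout (illustrative only, not
lifted from PSPP's reader, and ignoring byte-order issues):

#include <stdint.h>
#include <stdio.h>

struct sfm_extension_header
  {
    int32_t rec_type;   /* Always 7. */
    int32_t subtype;    /* E.g. 16 for the record mentioned above. */
    int32_t size;       /* Size of each data element, in bytes. */
    int32_t count;      /* Number of data elements. */
  };

/* Skips the data of an unrecognised extension record whose header has
   already been read into H.  Returns 0 on success, -1 on seek error.  */
static int
skip_extension_data (FILE *file, const struct sfm_extension_header *h)
{
  long data_bytes = (long) h->size * h->count;
  return fseek (file, data_bytes, SEEK_CUR) == 0 ? 0 : -1;
}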
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.