
Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names


From: Müller, Andre
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Tue, 18 Feb 2014 15:17:26 +0000

> -----Original Message-----
> From: Ben Pfaff [mailto:address@hidden
> Sent: Monday, February 17, 2014 00:07
> To: Müller, Andre
> Cc: address@hidden
> Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> umlauts in variable names
> 
> On Mon, Feb 10, 2014 at 04:29:11PM +0000, Müller, Andre wrote:
> > > -----Original Message-----
> > > From: Ben Pfaff [mailto:address@hidden
> > > Sent: Monday, February 10, 2014 17:16
> > > To: Müller, Andre
> > > Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> > > umlauts in variable names
> > >
> > > On Mon, Feb 10, 2014 at 12:26:36PM +0000, Müller, Andre wrote:
> > > > So I learn the .sav-file has no internal markers for the codepage used --
> > > > which in turn explains a lot of the codepage woes I have seen.
> > > > Thus, I will have to add a codepage-heuristic to my export-tool.
> > >
> > > It's only the very old SPSS files that lack an indication of codepage.
> > > This causes problems for a surprising number of PSPP users, so I'm
> > > working to add some codepage analysis to PSPP as well.
> >
> > Oh dear, that's work I'd hate to do for the general case.
> > I do have the advantage of a limited set of failure cases (~2k as my current estimate)
> > and a strong tendency for them to be from Western Europe,
> > so I can check the "file -bi" state of the output and check for umlaut presence.
> >
> > Most of the errors will go rather unnoticed, as the non-us-ascii chars are not
> > in the "functional" parts but only in the labels. There I find non-us-ascii chars
> > replaced with "?".
> >
> > Nevertheless: That work is much appreciated, and I'm looking forward to being
> > able to throw my lousy heuristics away.
> 
> I committed this work, in the form of a new option to SYSFILE INFO that,
> instead of outputting the system file dictionary, outputs an analysis of
> the string data in the dictionary.
> 
> This would be better if it were easily accessible through the GUI, but I
> guess that can be added later if necessary.
> 
> Example of use:
>         SYSFILE INFO FILE='ZA4209.sav' ENCODING='DETECT'.

Hi Ben,

I have tried SYSFILE INFO, and it works quite well.
For now, I have piped some examples of uncommon codepages through it,
and it does well for SHIFT_JIS and IBM850 (or similar), for example.

The broken files I have, which actually contain entries in more than one codepage,
are not a valid test, but even then I found at least some of the codepages they
contain among the suggestions.
That's nice.

Another rather unfair test case is a failure to identify a source file in DIN_66003
coding, but that really is to be expected -- DIN_66003 is a 7-bit-safe codepage for
German, where äöüÄÖÜß§ take the place of us-ascii's {|}[\]~@, respectively. An evil
solution for problems long gone.
I think it's sane not to try to handle 7-bit non-ascii codings, so that's just to
let you know.
Really, I cannot think of any way of handling them short of looking at oddities in
character counts or success rates of matches against dictionaries.
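For reference, the substitution itself is mechanical, so once a file is known to be
DIN_66003 the conversion is easy; only the detection is hard. A rough sketch, assuming
GNU sed in a UTF-8 locale and placeholder filenames:

# Replace the DIN 66003 national-use positions with their German characters,
# i.e. @ [ \ ] { | } ~ become § Ä Ö Ü ä ö ü ß.
sed -e 's/@/§/g; s/\[/Ä/g; s/\\/Ö/g; s/]/Ü/g' \
    -e 's/{/ä/g; s/|/ö/g; s/}/ü/g; s/~/ß/g' din66003.txt > utf8.txt

(If a given iconv build happens to list DIN_66003 among its charsets, iconv -f
DIN_66003 -t UTF-8 should do the same thing.)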

I will keep testing datasets with these and hopefully can happily say goodbye 
to my bash-hackery:

locale -m | while read codepage; do
    echo -e "GET FILE='source.sav' ENCODING='$codepage' \nDISPLAY DICTIONARY" > psy
    echo "Now: $codepage" >> cp_catastrophe
    /usr/bin/pspp -b -O format=csv -O separator="    " -O quote='"' psy \
        | grep -e "whatever" -e "are troublemakers" >> cp_catastrophe
done
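
Once the DETECT option is available in my build, the whole loop above should boil
down to a single run along these lines (just a sketch: 'source.sav' is a placeholder,
and I have not yet checked what the analysis output looks like in csv form):

echo "SYSFILE INFO FILE='source.sav' ENCODING='DETECT'." > detect.sps
/usr/bin/pspp -b -O format=csv detect.sps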

Best,
Andre
