bug-gnu-pspp

Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts


From: Müller, Andre
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Tue, 18 Feb 2014 18:37:37 +0000

> -----Original Message-----
> From: Ben Pfaff [mailto:address@hidden]
> Sent: Tuesday, February 18, 2014 18:33
> To: Müller, Andre
> Cc: address@hidden
> Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> umlauts in variable names
> 
> On Tue, Feb 18, 2014 at 03:17:26PM +0000, Müller, Andre wrote:
> > Another rather unfair testcase is a failure to identify a source file
> > in DIN_66003 coding, but that really is to be expected -- DIN_66003 is
> > a 7-bit-safe codepage for German, where äöüÄÖÜß§ take the place
> > of us-ascii's {|}[\]~@, respectively. An evil solution for problems
> > long gone.  I think it's sane to not try and handle 7-bit non-ascii
> > codings, so that's just to let you know.  Really I cannot think of any
> > way of handling them short of looking at oddities in character counts
> > or success rates with matches against dictionaries.
> 
> The code that I wrote doesn't really identify encodings at all.
> Instead, it just tries to recode all the strings in the file from each
> of several possible encodings to UTF-8.  That means that it's easy to
> add more encodings, including DIN_66003.  The encodings that I chose are
> fairly arbitrary: I took them from the list at
> http://encoding.spec.whatwg.org/.  I can add DIN_66003; no problem.  Are
> there other encodings I should add?

Yes, I found that by "reading" your code... with reading in quotes because of
my utter lack of C knowledge. At least I can read the commentary, and it's
quite thorough.
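
As an illustration only (this is not PSPP's code), the approach described in
the quoted message, trying to recode the same bytes from each of several
candidate encodings, might be sketched in Python as below. The function name
`candidate_decodings` and the list of candidate encodings are assumptions made
up for this example.

```python
def candidate_decodings(raw, encodings=("utf-8", "cp1252", "cp850", "iso-8859-15")):
    """Return {encoding: decoded text} for every encoding that can decode raw.

    Hypothetical helper for illustration; the candidate list is arbitrary.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding, so it is ruled out.
            continue
    return results


# "Müller" as an old MS-DOS program using IBM850 (cp850) would have stored it.
raw = "Müller".encode("cp850")
for name, text in candidate_decodings(raw).items():
    # repr() makes unprintable characters from a wrong guess visible.
    print(f"{name}: {text!r}")
```

A successful decode only means the bytes are valid in that encoding, not that
the guess is right, so someone still has to look at the results and pick the
one that reads correctly.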

In any case, I did miss one codepage in my first tests: IBM850.
That is the predecessor to windows-1252, also known as MS-DOS Latin 1.
To my surprise, it is not listed on the encoding.spec.whatwg.org page.
I think it would be a worthwhile addition.

More worthwhile than the really strange and old DIN_66003, which would show
up as a match every time the file is actually pure US-ASCII.
Nevertheless, it obviously has been used, so you may want to add it.
I leave that up to you; it may be opening a can of worms.
DIN_66003 is just the German variant of ISO_646, and there is a whole bunch
of national variants of it: https://en.wikipedia.org/wiki/ISO/IEC_646
That may end up in a list from hell for each dataset coded in plain US-ASCII.
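
To make the ambiguity concrete, here is a small Python sketch. As far as I
know the Python standard library has no DIN_66003 codec, so the translation
table is written by hand; the names `DIN_66003` and `decode_din_66003` are
assumptions for this example only.

```python
# DIN 66003 differs from US-ASCII in these eight code positions only.
DIN_66003 = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")


def decode_din_66003(raw):
    """Decode 7-bit bytes as DIN 66003 using the hand-written table above."""
    return raw.decode("ascii").translate(DIN_66003)


print(decode_din_66003(b"M}ller"))  # Müller
print(decode_din_66003(b"AGE"))     # AGE
print(decode_din_66003(b"x[1]"))    # xÄ1Ü
```

Any 7-bit file decodes without error under both US-ASCII and DIN_66003, and a
file that avoids those eight characters decodes identically under both. The
same holds for every other national ISO_646 variant, so none of them can be
ruled out by a recoding attempt alone.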

That's all I know for now, but I am still digging through my pile and expect
that I have missed a few more oddities; I will have to run more thorough tests
later on.

So, if I identify any more codepages not covered by SYSFILE INFO, I will let
you know.

Many thanks,
Andre



