[Koha-devel] Encoding Problems and Solutions (IMPORTANT!)


From: Joshua Ferraro
Subject: [Koha-devel] Encoding Problems and Solutions (IMPORTANT!)
Date: Sat, 20 May 2006 05:15:33 -0700
User-agent: Mutt/1.4.1i

Hi all,

I've spent the better part of two days working on the encoding
problems that several of the Koha developers have been experiencing
and I think I've met with some success.

First, software I'm running:

Perl 5.8.4
XML::SAX 0.14
MARC::Record 2.0RC1
MARC::File::XML 0.83
MARC::Charset 0.95

There were two problems. The first happens when you have MARC21
records with wrongly encoded characters. To be clear, I mean
characters that are not valid under the MARC-8 encoding defined for
MARC21. The LOC has defined all of the valid MARC-8 characters and
mapped them to UTF-8 here:

http://www.loc.gov/marc/specifications/codetables.xml
(warning, very large file)

This is the code table that MARC::Charset uses to convert MARC-8
encoded records into UTF-8 ... and that's what MARC::File::XML uses
to convert a binary MARC file into MARCXML.

By default, if MARC::Charset encounters a character that isn't in
that code table, it drops the whole subfield and emits a warning like
this:

no mapping found at position 8 in Price : � 7.99;    Inv.#  B 476913;    Date   
06/03/98; Supplier : Dawson UK;  Recd 20/03/98;  Contents : 1. The problem :    
 1. Don't bargain over positions;  2. The method :     2. Separate the people 
from the problem;     3. Focus on interests, not positions;     4. Invent 
options for mutual gain;     5. Insist on using objective criteria;  3. Yes, 
but :     6. What if they are more powerful?     7. What if they won't play?    
 8. What if they use dirty tricks?  4. In conclusion;  5. Ten questions people 
ask about getting to yes; g0=ASCII_DEFAULT g1=EXTENDED_LATIN at 
/usr/local/share/perl/5.8.4/MARC/Charset.pm line 197.

I don't know if you can see it or not, but before the 7.99 in the
above dump is a \x9C character, which is an invalid MARC-8 character.
The temporary solution is to add the following:

use MARC::Charset;
MARC::Charset->ignore_errors(1);

to any script or module that you expect to encounter wrong encodings.
This way, it will just drop the offending character rather than the
whole subfield (it still emits the warning, though). I'm not 100%
happy with that solution, but I can live with it until someone has
a better suggestion.
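
For anyone who wants to find such characters before converting, here
is a minimal sketch in core Perl (it does not use MARC::Charset). It
assumes that the only bytes MARC-8 defines in the 0x80-0x9F range are
0x88, 0x89, 0x8D and 0x8E, and the helper name suspect_offsets is made
up for illustration. Run on the start of the subfield from the warning
above, it reports offset 8, the same position MARC::Charset complained
about.

```perl
use strict;
use warnings;

# Assumption: within 0x80-0x9F, MARC-8 only defines these four bytes
# (non-sorting begin/end, joiner, non-joiner).
my %defined_c1 = map { $_ => 1 } (0x88, 0x89, 0x8D, 0x8E);

# Hypothetical helper: zero-based offsets of suspect bytes in a string.
sub suspect_offsets {
    my ($bytes) = @_;
    my @offsets;
    while ($bytes =~ /([\x80-\x9F])/g) {
        my $offset = pos($bytes) - 1;    # pos() points just past the match
        push @offsets, $offset unless $defined_c1{ ord $1 };
    }
    return @offsets;
}

my $subfield = "Price : \x9C 7.99;    Inv.#  B 476913;";
printf "suspect byte at offset %d\n", $_ for suspect_offsets($subfield);
```

This only flags candidates for a closer look; it is not a full MARC-8
validator, since it ignores escape sequences and the G0/G1 sets.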

The second problem is the more serious one ... several of us have
tried to pass UTF-8 encoded XML records in to the new_from_xml() 
method and had the parser crash if there were 'combining characters' 
in the record. The problem seems to be with the PurePerl parser ...
as soon as I installed XML::SAX::Expat it went away.
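
To check what XML::SAX can see on a given system, something along
these lines may help. It is only a sketch: XML::SAX->parsers is the
module's own call for listing registered parsers, while
installed_sax_parsers and has_native_parser are names invented here.

```perl
use strict;
use warnings;

# Names of the SAX parsers registered with XML::SAX, or an empty list
# if XML::SAX itself is not installed.
sub installed_sax_parsers {
    return () unless eval { require XML::SAX; 1 };
    return map { $_->{Name} } @{ XML::SAX->parsers };
}

# Hypothetical helper: true if anything other than PurePerl is present.
sub has_native_parser {
    my @names = @_;
    return scalar grep { $_ ne 'XML::SAX::PurePerl' } @names;
}

my @parsers = installed_sax_parsers();
print "Registered SAX parsers: @parsers\n";
print has_native_parser(@parsers)
    ? "A non-PurePerl parser is available.\n"
    : "Only PurePerl (or nothing) found; consider XML::SAX::Expat.\n";
```

If several parsers are registered and the wrong one is picked up,
XML::SAX::ParserFactory documents a $XML::SAX::ParserPackage variable
that can be set to "XML::SAX::Expat" to force the choice. I have not
verified how each MARC::File::XML version interacts with it.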

I've attached a script to this email that you can use to test your
system to make sure things are set up correctly. The script attempts
to convert a binary MARC record in either UTF-8 or MARC-8 encoding, to
MARC-XML (encoded as UTF-8) and then back to binary MARC (as UTF-8).
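
A quick way to tell which encoding a binary MARC21 file claims to use
is position 9 of the 24-byte leader: 'a' means UCS/Unicode (UTF-8 in
practice) and a blank means MARC-8. The sketch below is not part of the
attached script, and leader_encoding is a hypothetical helper name. It
only reads the flag; it does not check that the data actually matches
it.

```perl
use strict;
use warnings;

# Hypothetical helper: report the encoding declared in Leader/09.
sub leader_encoding {
    my ($record_bytes) = @_;
    return 'unknown' if length($record_bytes) < 24;    # leader is 24 bytes
    my $flag = substr($record_bytes, 9, 1);
    return $flag eq 'a' ? 'UTF-8'
         : $flag eq ' ' ? 'MARC-8'
         :                'unknown';
}

# Usage: perl leader.pl out.utf8.mrc
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $leader = '';
    read $fh, $leader, 24;    # first record's leader only
    close $fh;
    print "$ARGV[0]: ", leader_encoding($leader), "\n";
}
```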

You can test the first problem with the following record:

http://liblime.com/public/badencoding.mrc

And the second with this one:

http://liblime.com/public/combiningchar.mrc

Run the script like this:

$ ./roundtrip.pl badencoding.mrc out.utf8.mrc dump

Compare the original MARC record with out.utf8.mrc ... the only
difference should be the encoding, plus possibly some missing
characters if your records had bad encoding. (You can edit roundtrip.pl
to turn the ignore_errors flag on or off.) And don't forget to install
XML::SAX::Expat.

So ... now that we've got the MARC21 encoding problems out of the
way, we need to look at UNIMARC. Mike Rylander has already done some
work on this, but I think that because of the second problem above,
we've not had success testing thus far.

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
address@hidden |Full Demos at http://liblime.com/koha |1(888)KohaILS

Attachment: roundtrip.pl
Description: Text document

