[Koha-devel] encoding of z3950 imported records

From: Zbigniew Bomert OP
Subject: [Koha-devel] encoding of z3950 imported records
Date: Thu Apr 29 03:29:03 2004
User-agent: Mutt/1.5.6i


First, thank you all for this software; I am happy trying to use Koha.

Lately, Benedykt Barszcz reported a problem with encoding of records imported 
through z3950. This is a general problem.

1. In fact, the records are imported in ISO-8859-1 encoding. They are
converted from MARC-8/ANSEL or ISO5426/ISO6937 to ISO-8859-1 in the
subroutine 'char_decode' in Biblio.pm. This will not work for libraries
holding books in languages that need different charsets. The universal
solution would be to store data in UTF-8. Then all localized templates
should be converted to UTF-8 encoding as well.
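
To illustrate the limitation described in point 1, here is a minimal sketch
using Perl's core Encode module. It is not taken from Koha's 'char_decode';
the sample string and variable names are only illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Lodz" with Polish diacritics, as a Perl character string
# (illustrative sample, not taken from a real record).
my $title = "\x{0141}\x{00f3}d\x{017a}";

# ISO-8859-1 has no L-stroke or z-acute, so Encode substitutes '?'
# for them when no CHECK argument is given.
my $latin1 = encode('ISO-8859-1', $title);

# UTF-8 can represent every character of the sample.
my $utf8 = encode('UTF-8', $title);

printf "latin1: %d bytes, lossy\n", length $latin1;
printf "utf8:   %d bytes, lossless\n", length $utf8;
```

Only the o-acute survives the ISO-8859-1 conversion, which is why a single
Latin-1 store cannot serve a collection with Polish (or other non-Western)
titles.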

2. In 'char_decode' a string is converted to ISO-8859-1 separately for
UNIMARC and MARC21. LoC uses MARC21 with ANSEL encoding; European
libraries mostly use UNIMARC with ISO5426 encoding, but that is not
always the case. The Polish National Library's z3950 server in fact
responds with MARC21 records in ISO5426 encoding. Happily, it seems to me
that the two encodings don't overlap, so it is safe to do the
char-decoding for both at once.

For Polish characters the code (producing UTF-8 output) could look like this:

           s/(\xe2|\xc2)c/\xc4\x87/gm ;
           s/(\xe2|\xc2)C/\xc4\x86/gm ;
           s/(\xe2|\xc2)n/\xc5\x84/gm ;
           s/(\xe2|\xc2)N/\xc5\x83/gm ;
           s/(\xe2|\xc2)o/\xc3\xb3/gm ;
           s/(\xe2|\xc2)O/\xc3\x93/gm ;
           s/(\xe2|\xc2)s/\xc5\x9b/gm ;
           s/(\xe2|\xc2)S/\xc5\x9a/gm ;
           s/(\xe2|\xc2)z/\xc5\xba/gm ;
           s/(\xe2|\xc2)Z/\xc5\xb9/gm ;
           s/(\xf1|\xce)a/\xc4\x85/gm ;
           s/(\xf1|\xce)A/\xc4\x84/gm ;
           s/(\xf1|\xce)e/\xc4\x99/gm ;
           s/(\xf1|\xce)E/\xc4\x98/gm ;
           # ł (lowercase), then Ł (uppercase)
           s/(\xb1|\xf8)/\xc5\x82/gm ;
           s/(\xa1|\xe8)/\xc5\x81/gm ;
           s/(\xe7|\xc7)z/\xc5\xbc/gm ;
           s/(\xe7|\xc7)Z/\xc5\xbb/gm ;

For letters with acute:
           s/(\xe2|\xc2)a/\xc3\xa1/gm ;
           s/(\xe2|\xc2)A/\xc3\x81/gm ;
           s/(\xe2|\xc2)e/\xc3\xa9/gm ;
           s/(\xe2|\xc2)E/\xc3\x89/gm ;

and so on.
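
The list of substitutions above could also be written as a lookup table
applied in a single pass. The following is only a sketch under assumptions:
the name decode_combining and the table layout are hypothetical and not part
of Koha, and the byte values are copied from the substitutions above, with
uppercase Ł mapped to \xc5\x81 (the UTF-8 encoding of U+0141).

```perl
use strict;
use warnings;

# Which diacritic each combining byte stands for (byte values from the list above).
my %mark_kind = (
    "\xe2" => 'acute',  "\xc2" => 'acute',
    "\xf1" => 'ogonek', "\xce" => 'ogonek',
    "\xe7" => 'dot',    "\xc7" => 'dot',
);

# Diacritic + base letter -> UTF-8 bytes of the precomposed character.
my %utf8_for = (
    acute => {
        c => "\xc4\x87", C => "\xc4\x86", n => "\xc5\x84", N => "\xc5\x83",
        o => "\xc3\xb3", O => "\xc3\x93", s => "\xc5\x9b", S => "\xc5\x9a",
        z => "\xc5\xba", Z => "\xc5\xb9", a => "\xc3\xa1", A => "\xc3\x81",
        e => "\xc3\xa9", E => "\xc3\x89",
    },
    ogonek => { a => "\xc4\x85", A => "\xc4\x84", e => "\xc4\x99", E => "\xc4\x98" },
    dot    => { z => "\xc5\xbc", Z => "\xc5\xbb" },
);

# Single bytes that are complete letters: l-stroke, lowercase and uppercase.
my %standalone = (
    "\xb1" => "\xc5\x82", "\xf8" => "\xc5\x82",
    "\xa1" => "\xc5\x81", "\xe8" => "\xc5\x81",
);

# Hypothetical helper: convert "mark + letter" pairs and standalone bytes
# in one left-to-right pass. Unknown pairs are left untouched.
# The // operator requires Perl 5.10 or later.
sub decode_combining {
    my ($bytes) = @_;
    $bytes =~ s{([\xe2\xc2\xf1\xce\xe7\xc7])([A-Za-z])|([\xa1\xb1\xe8\xf8])}{
        defined $3
            ? $standalone{$3}
            : ($utf8_for{ $mark_kind{$1} }{$2} // "$1$2")
    }ge;
    return $bytes;
}

# Prints the UTF-8 bytes for "Lodz" with Polish diacritics.
print decode_combining("\xa1\xc2od\xe2z"), "\n";
```

Doing everything in one pass means the UTF-8 bytes already emitted are never
scanned again. With separate substitutions the order matters: the UTF-8 form
of a-acute is \xc3\xa1, so running the \xa1 rule after the acute rules would
corrupt it.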

3. To see the correct characters in search results, one should also add
char-decoding of the title and author in cgi-bin/z3950/search.pl.
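
As a rough sketch of point 3, assuming each search hit is a hash reference
with 'title' and 'author' keys (the real data structure in search.pl may
differ, and decode_field below is a hypothetical stand-in for the actual
decoder, carrying just one rule):

```perl
use strict;
use warnings;

# Hypothetical stand-in for the decoder; it applies only the
# "acute + o" rule so that the example stays short.
sub decode_field {
    my ($bytes) = @_;
    return $bytes unless defined $bytes;
    $bytes =~ s/[\xe2\xc2]o/\xc3\xb3/g;
    return $bytes;
}

# Assumed shape of the search results: one hash reference per hit.
my @results = (
    { title => "Kr\xe2ol", author => "W\xc2ojcik", isbn => '0000000000' },
);

# Decode only the fields that are displayed as text.
for my $hit (@results) {
    $hit->{$_} = decode_field($hit->{$_}) for qw(title author);
}

print "$results[0]{title} / $results[0]{author}\n";
```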

After those changes I can correctly import records from the National
Library, and even Polish records from LoC.

Zbigniew Bomert OP
