[Bug-ocrad] Request about adding more characters
From: Donald Rogers
Subject: [Bug-ocrad] Request about adding more characters
Date: Mon, 10 Jan 2005 08:32:37 +1300
User-agent: Mozilla Thunderbird 0.8 (X11/20041020)
I have recently started using ocrad for OCR of English texts. I am
impressed with it - partly because it handles UTF-8 text. IMHO any OCR
program that does not handle Unicode characters is useless.
I would like to use ocrad for OCR of Esperanto texts. What is involved
with adding the recognition of extra characters to ocrad? I have looked
up the Unicode values of all the accented Esperanto letters and here
they are in the format used in file ucs.h:
Unicode characters for Esperanto:
CCCIRCU = 0x0108, // latin capital letter c with circumflex
SCCIRCU = 0x0109, // latin small letter c with circumflex
CGCIRCU = 0x011C, // latin capital letter g with circumflex
SGCIRCU = 0x011D, // latin small letter g with circumflex
CHCIRCU = 0x0124, // latin capital letter h with circumflex
SHCIRCU = 0x0125, // latin small letter h with circumflex
CJCIRCU = 0x0134, // latin capital letter j with circumflex
SJCIRCU = 0x0135, // latin small letter j with circumflex
CSCIRCU = 0x015C, // latin capital letter s with circumflex
SSCIRCU = 0x015D, // latin small letter s with circumflex
CUBREVE = 0x016C, // latin capital letter u with breve
SUBREVE = 0x016D, // latin small letter u with breve
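As a rough check of these values, here is a minimal C++ sketch that puts them in an enum and encodes them as UTF-8. The namespace `Esperanto` and the function `utf8_encode` are names made up for this illustration and are not taken from the ocrad source. Note that the Unicode charts place C with circumflex at U+0108/U+0109 (U+010C/U+010D are C with caron), which is what this sketch uses.

```cpp
#include <string>

// Hypothetical namespace for illustration; not part of ocrad.
namespace Esperanto {
enum {
  CCCIRCU = 0x0108,  // latin capital letter c with circumflex
  SCCIRCU = 0x0109,  // latin small letter c with circumflex
  CGCIRCU = 0x011C,  // latin capital letter g with circumflex
  SGCIRCU = 0x011D,  // latin small letter g with circumflex
  CHCIRCU = 0x0124,  // latin capital letter h with circumflex
  SHCIRCU = 0x0125,  // latin small letter h with circumflex
  CJCIRCU = 0x0134,  // latin capital letter j with circumflex
  SJCIRCU = 0x0135,  // latin small letter j with circumflex
  CSCIRCU = 0x015C,  // latin capital letter s with circumflex
  SSCIRCU = 0x015D,  // latin small letter s with circumflex
  CUBREVE = 0x016C,  // latin capital letter u with breve
  SUBREVE = 0x016D   // latin small letter u with breve
};
}

// Encode a code point below 0x0800 as UTF-8 (one or two bytes).
// Larger or negative values are outside the scope of this sketch
// and yield an empty string.
std::string utf8_encode(const int code) {
  std::string s;
  if (code < 0) return s;
  if (code < 0x80) {
    s += static_cast<char>(code);
  } else if (code < 0x800) {
    s += static_cast<char>(0xC0 | (code >> 6));    // leading byte
    s += static_cast<char>(0x80 | (code & 0x3F));  // continuation byte
  }
  return s;
}
```

All twelve letters fall in the two-byte range, so `utf8_encode(Esperanto::CCCIRCU)` gives the bytes C4 88.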
I noticed in the ocrad source code that there are already some
characters with breves and some with circumflexes. Would it be a big job
for you to add the extra 12 characters?
The Esperanto letters are also in ISO-8859-3. I can send you a list of
their codes in this set too if you wish. I could also send a file or two
of scanned Esperanto text in, say, PBM format, with the 12 letters: ĈĜĤĴŜŬ
ĉĝĥĵŝŭ.
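For reference, here is a minimal C++ sketch of the ISO-8859-3 codes for the 12 letters, paired with their Unicode values. The struct `Latin3_entry`, the table `esperanto_latin3` and the function `latin3_to_ucs` are hypothetical names for this illustration only. The table covers just the Esperanto letters, not the whole character set.

```cpp
// Hypothetical lookup entry: ISO-8859-3 byte and its Unicode code point.
struct Latin3_entry {
  unsigned char latin3;
  int ucs;
};

// The 12 accented Esperanto letters in ISO-8859-3 (Latin-3).
const Latin3_entry esperanto_latin3[12] = {
  { 0xC6, 0x0108 },  // capital c with circumflex
  { 0xE6, 0x0109 },  // small c with circumflex
  { 0xD8, 0x011C },  // capital g with circumflex
  { 0xF8, 0x011D },  // small g with circumflex
  { 0xA6, 0x0124 },  // capital h with circumflex
  { 0xB6, 0x0125 },  // small h with circumflex
  { 0xAC, 0x0134 },  // capital j with circumflex
  { 0xBC, 0x0135 },  // small j with circumflex
  { 0xDE, 0x015C },  // capital s with circumflex
  { 0xFE, 0x015D },  // small s with circumflex
  { 0xDD, 0x016C },  // capital u with breve
  { 0xFD, 0x016D }   // small u with breve
};

// Return the Unicode value for an ISO-8859-3 byte if it is one of the
// 12 Esperanto letters, or -1 for any other byte.
int latin3_to_ucs(const unsigned char ch) {
  for (int i = 0; i < 12; ++i)
    if (esperanto_latin3[i].latin3 == ch) return esperanto_latin3[i].ucs;
  return -1;
}
```

In this set each small letter sits 0x20 above its capital for the 0xC0-0xDF range, and 0x10 above for Ĥ and Ĵ.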
Donald Rogers
New Zealand