Aspell Now Has Full UTF-8 Support

aspell-announce

From:	Kevin Atkinson
Subject:	Aspell Now Has Full UTF-8 Support
Date:	Thu, 18 Mar 2004 02:42:16 -0500 (EST)
Aspell now fully supports spell checking documents in UTF-8.  In addition,
Aspell now has support for accepting all input and printing all output in
UTF-8 or any other encoding that Aspell supports.  The fact that Aspell is
still 8-bit internally can now be made completely transparent to the end
user.  Previous versions of Aspell supported Unicode to some extent;
however, word lists still had to be in an 8-bit character set.
Furthermore, spell checking documents in an encoding that is different from
the internal encoding was problematic.  This has all changed now.

With this change Aspell can now support any language that has no more than
220 distinct characters, including different capitalizations and accents,
_even if_ there is not an existing 8-bit encoding that supports the
language.  All one has to do is create a new character data file, which
is a fairly simple task.  The internal encoding never has to be seen by
the end user, including the word list author, since not even the word list
has to be in the same encoding that Aspell uses.
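
As a rough illustration of the idea only (this is not Aspell's character
data file format, which is not described in this announcement), the
following Python sketch assigns a private 8-bit code to each letter a word
list uses and converts between that code and UTF-8.  The names
`build_charset`, `to_internal`, and `from_internal`, and the choice to
leave the lowest 36 byte values unused, are assumptions made for the
example.

```python
MAX_LETTERS = 220  # limit quoted in the announcement


def build_charset(letters):
    """Assign one byte value to each distinct letter (hypothetical layout)."""
    distinct = sorted(set(letters))
    if len(distinct) > MAX_LETTERS:
        raise ValueError("more than 220 distinct characters")
    first_slot = 256 - MAX_LETTERS  # assumption: low byte values left unused
    to_byte = {ch: first_slot + i for i, ch in enumerate(distinct)}
    to_char = {value: ch for ch, value in to_byte.items()}
    return to_byte, to_char


def to_internal(utf8_bytes, to_byte):
    """UTF-8 input -> one byte per character."""
    return bytes(to_byte[ch] for ch in utf8_bytes.decode("utf-8"))


def from_internal(internal, to_char):
    """One byte per character -> UTF-8 output."""
    return "".join(to_char[value] for value in internal).encode("utf-8")


to_byte, to_char = build_charset("абвгдеёжз")
word = "где".encode("utf-8")          # 6 bytes in UTF-8
packed = to_internal(word, to_byte)   # 3 bytes internally
```

The user and the word list author only ever see the UTF-8 side; the
one-byte-per-character form stays inside the program.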

Full UTF-8 support was added with 0.51-20040219.  The next snapshot,
0.51-20040227, fixed a few bugs, while the latest, 0.60-20040317, uses a
new, simpler format for the character data files.

Aspell snapshots can be downloaded from ftp://alpha.gnu.org/gnu/aspell/.  


Notes on 8-bit Characters
*************************

There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity. While many parts of my code can fairly easily be converted
to some sort of wide character, as my code is clean, other parts cannot
be.

   One of the reasons is that in many, many places I use a direct lookup
to find out various information about characters. With 8-bit characters
this is very feasible because there are only 256 of them. With 16-bit
wide characters this will waste a LOT of space. With 32-bit characters
this is just plain impossible. Converting the lookup tables to some
other form, while certainly possible, will degrade performance
significantly.
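
   A minimal sketch of what such a direct lookup might look like, written
in Python rather than taken from Aspell's source.  The flag names, the
`build_table` helper, and the use of ISO-8859-1 as the 8-bit character
set are assumptions for illustration.

```python
IS_LETTER = 1
IS_UPPER = 2


def build_table(encoding="iso-8859-1"):
    """One flag byte for each of the 256 possible character values."""
    table = bytearray(256)
    for value in range(256):
        ch = bytes([value]).decode(encoding)
        flags = 0
        if ch.isalpha():
            flags |= IS_LETTER
        if ch.isupper():
            flags |= IS_UPPER
        table[value] = flags
    return table


TABLE = build_table()


def is_letter(byte_value):
    """A single array index, no searching or hashing."""
    return bool(TABLE[byte_value] & IS_LETTER)


# Size of one such table at one byte per entry.
SIZE_8_BIT = 2 ** 8     # 256 bytes
SIZE_16_BIT = 2 ** 16   # 65,536 bytes
SIZE_32_BIT = 2 ** 32   # 4,294,967,296 bytes
```

Each property table costs 256 bytes with 8-bit characters, 64 KiB with
16-bit characters, and 4 GiB with 32-bit characters, which is the
progression the paragraph above describes.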

   Furthermore, some of my algorithms rely on words consisting only of
a small number of distinct characters (often around 30 when case and
accents are not considered). When a word can contain any Unicode
character this number becomes several thousand. In
order for these algorithms to still be used some sort of limit will
need to be placed on the possible characters the word can contain. If I
impose that limit, I might as well use some sort of 8-bit character
set, which will automatically place the limit on what the characters can
be.
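
   To see why alphabet size matters, consider a generic technique that is
not necessarily what Aspell does: enumerating every string one edit away
from a word.  The hypothetical function below counts the candidates
(duplicates included) for a word of a given length and alphabet size.

```python
def edit1_candidates(word_length, alphabet_size):
    """Upper bound on strings one edit away from a word."""
    n, k = word_length, alphabet_size
    deletions = n
    transpositions = n - 1
    replacements = n * (k - 1)
    insertions = (n + 1) * k
    return deletions + transpositions + replacements + insertions


small = edit1_candidates(5, 30)     # 334
large = edit1_candidates(5, 5000)   # 55004
```

For a five-letter word, going from 30 possible characters to 5,000
multiplies the work by roughly 165.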

   There is also the issue of how I should store the word lists in
memory. As a string of 32-bit wide characters? That uses up 4
times more memory than 8-bit characters would, and for languages that
can fit within an 8-bit character set that is, in my view, a gross waste
of memory. So maybe I should store them in some variable-width format
such as UTF-8. Unfortunately, way, way too many of my algorithms will
simply not work with variable-width characters without significant
modification, which will very likely degrade performance. So the
solution is to work with the characters as 32-bit wide characters and
then convert them to a shorter representation when storing them in the
lookup tables. Now that can lead to an inefficiency. I could also use
16-bit wide characters; however, that may not be good enough to hold all
of future versions of Unicode, and it has the same problems.
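
   The size difference is easy to check.  The sketch below, with a
hypothetical `storage_bytes` helper and ISO-8859-1 standing in for the
8-bit character set, measures the same three words in each
representation.

```python
def storage_bytes(words):
    """Bytes needed to hold the words in three representations."""
    return {
        "8-bit": sum(len(w.encode("iso-8859-1")) for w in words),
        "utf-8": sum(len(w.encode("utf-8")) for w in words),
        "32-bit": sum(len(w.encode("utf-32-le")) for w in words),
    }


# "caf\u00e9" is "café" and "na\u00efve" is "naïve": 13 characters in total.
sizes = storage_bytes(["caf\u00e9", "na\u00efve", "word"])
# sizes == {"8-bit": 13, "utf-8": 15, "32-bit": 52}
```

UTF-8 is nearly as compact as the 8-bit form here, but the accented
letters take two bytes each, so the n-th character is no longer at the
n-th byte.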

   In response to the space wasted by storing word lists in some sort
of wide format, someone asked:

     Since hard drives are cheaper and cheaper, you could store
     dictionary in a usable (uncompressed) form and use it directly
     with memory mapping. Then the efficiency would directly depend on
     the disk caching method, and only the used part of the
     dictionaries would really be loaded into memory. You would no more
     have to load plain dictionaries into main memory, you'll just want
     to compute some indexes (or something like that) after mapping.

   However, the fact of the matter is that most of the dictionary will
be read into memory anyway if it is available. If it is not available
then there would be a good deal of disk swapping. Making characters
32 bits wide will increase the chance of more disk swaps. So the
bottom line is that it will be cheaper to convert the characters from
something like UTF-8 into some sort of wide character. I could also use
some sort of disk-based lookup table such as the Berkeley Database.
However, this will *definitely* degrade performance.

   The bottom line is that keeping Aspell 8-bit internally is a very
well thought out decision that is not likely to change any time soon.
Feel free to challenge me on it, but don't expect me to change my mind
unless you can bring up some point that I have not thought of before
and quite possibly a patch that cleanly converts Aspell to Unicode
internally without a serious performance loss OR a serious memory usage
increase.

Languages Which Aspell can Support
**********************************

Even though Aspell will remain 8-bit internally it should still be
able to support any written language not based on a logographic
script.  The only logographic writing systems in current use are those
based on hànzi, which includes Chinese, Japanese, and sometimes Korean.

Languages with 220 or Fewer Unique Symbols
==========================================

Aspell 0.60 should be able to support the following languages as, to
the best of my knowledge, they all contain 220 or fewer symbols and can
thus fit within an 8-bit character set.  If an existing character set
does not exist then a new one can be invented.  This is true even if
the script is not yet supported by Unicode, as the private use area can
be used.

Code   Language Name             Script               Dictionary   Gettext
                                                      Available    Translation

ab     Abkhazian                 Cyrillic             -            -
ae     Avestan                   Avestan              -            -
af     Afrikaans                 Latin                Yes          -
an     Aragonese                 Latin                -            -
ar     Arabic                    Arabic               -            -
as     Assamese                  Bengali              -            -
ay     Aymara                    Latin                -            -
az     Azerbaijani               Arabic               -            -
az                               Cyrillic             -            -
az                               Latin                -            -

ba     Bashkir                   Cyrillic             -            -
be     Belarusian                Cyrillic             -            Yes
bg     Bulgarian                 Cyrillic             Yes          -
bh     Bihari                    Devanagari           -            -
bn     Bengali                   Bengali              -            -
bo     Tibetan                   Tibetan              -            -
br     Breton                    Latin                Yes          -
bs     Bosnian                   Latin                -            -

ca     Catalan/Valencian         Latin                Yes          -
ce     Chechen                   Cyrillic             -            -
ch     Chamorro                  Latin                -            -
co     Corsican                  Latin                -            -
cr     Cree                      Canadian Syllabics   -            -
cr                               Latin                -            -
cs     Czech                     Latin                Yes          -
cv     Chuvash                   Cyrillic             -            -
cy     Welsh                     Latin                Yes          -

da     Danish                    Latin                Yes          -
de     German                    Latin                Yes          -
dv     Divehi                    Dhives Akuru         -            -
dz     Dzongkha                  Tibetan              -            -

el     Greek                     Greek                Yes          -
en     English                   Latin                Yes          -
eo     Esperanto                 Latin                Yes          -
es     Spanish                   Latin                Yes          Incomplete
et     Estonian                  Latin                -            -
eu     Basque                    Latin                -            -

fa     Persian                   Arabic               -            -
fi     Finnish                   Latin                -            -
fj     Fijian                    Latin                -            -
fo     Faroese                   Latin                Yes          -
fr     French                    Latin                Yes          Yes
fy     Frisian                   Latin                -            -

ga     Irish                     Latin                Yes          Yes
gd     Scottish Gaelic           Latin                -            -
gl     Gallegan                  Latin                Yes          -
gn     Guarani                   Latin                -            -
gu     Gujarati                  Gujarati             -            -
gv     Manx                      Latin                -            -

ha     Hausa                     Latin                -            -
he     Hebrew                    Hebrew               -            -
hi     Hindi                     Devanagari           -            -
hr     Croatian                  Latin                Yes          -
hu     Hungarian                 Latin                -            -
hy     Armenian                  Armenian             -            -

ia     Interlingua (IALA)        Latin                -            -
id     Indonesian                Arabic               -            -
id                               Latin                Yes          -
io     Ido                       Latin                -            -
is     Icelandic                 Latin                Yes          -
it     Italian                   Latin                Yes          -
iu     Inuktitut                 Canadian Syllabics   -            -
iu                               Latin                -            -

ja     Japanese                  Latin                -            -
jv     Javanese                  Javanese             -            -
jv                               Latin                -            -

ka     Georgian                  Georgian             -            -
kk     Kazakh                    Cyrillic             -            -
kl     Kalaallisut/Greenlandic   Latin                -            -
km     Khmer                     Khmer                -            -
kn     Kannada                   Kannada              -            -
ko     Korean                    Hangeul              -            -
kr     Kanuri                    Latin                -            -
ks     Kashmiri                  Arabic               -            -
ks                               Devanagari           -            -
ku     Kurdish                   Arabic               -            -
ku                               Cyrillic             -            -
ku                               Latin                -            -
kv     Komi                      Cyrillic             -            -
kw     Cornish                   Latin                -            -
ky     Kirghiz                   Arabic               -            -
ky                               Cyrillic             -            -
ky                               Latin                -            -

la     Latin                     Latin                -            -
lo     Lao                       Lao                  -            -
lt     Lithuanian                Latin                -            -
lv     Latvian                   Latin                -            -

mi     Maori                     Latin                Yes          -
mk     Makasar                   Lontara/Makasar      -            -
ml     Malayalam                 Latin                -            -
ml                               Malayalam            -            -
mn     Mongolian                 Cyrillic             -            -
mn                               Mongolian            -            -
mo     Moldavian                 Cyrillic             -            -
mr     Marathi                   Devanagari           -            -
ms     Malay                     Arabic               -            -
ms                               Latin                Yes          -
mt     Maltese                   Latin                -            -
my     Burmese                   Myanmar              -            -

nb     Norwegian Bokmal          Latin                -            -
ne     Nepali                    Devanagari           -            -
nl     Dutch                     Latin                Yes          Yes
nn     Norwegian Nynorsk         Latin                -            -
no     Norwegian                 Latin                Yes          -
nv     Navajo                    Latin                -            -

oc     Occitan/Provencal         Latin                -            -
oj     Ojibwa                    Ojibwe               -            -
or     Oriya                     Oriya                -            -
os     Ossetic                   Cyrillic             -            -

pa     Punjabi                   Gurmukhi             -            -
pi     Pali                      Devanagari           -            -
pi                               Sinhala              -            -
pl     Polish                    Latin                Yes          -
ps     Pushto                    Arabic               -            -
pt     Portuguese                Latin                Yes          Yes

qu     Quechua                   Latin                -            -

rm     Raeto-Romance             Latin                -            -
ro     Romanian                  Latin                Yes          -
ru     Russian                   Cyrillic             Yes          -

sa     Sanskrit                  Devanagari           -            -
sa                               Sinhala              -            -
sd     Sindhi                    Arabic               -            -
se     Northern Sami             Latin                -            -
sk     Slovak                    Latin                Yes          -
sl     Slovenian                 Latin                Yes          -
sn     Shona                     Latin                -            -
so     Somali                    Latin                -            -
sq     Albanian                  Latin                -            -
sr     Serbian                   Cyrillic             -            Yes
sr                               Latin                -            -
su     Sundanese                 Latin                -            -
sv     Swedish                   Latin                Yes          -
sw     Swahili                   Latin                -            -

ta     Tamil                     Tamil                -            -
te     Telugu                    Telugu               -            -
tg     Tajik                     Arabic               -            -
tg                               Cyrillic             -            -
tg                               Latin                -            -
tk     Turkmen                   Arabic               -            -
tk                               Cyrillic             -            -
tk                               Latin                -            -
tl     Tagalog                   Latin                -            -
tl                               Tagalog              -            -
tr     Turkish                   Arabic               -            -
tr                               Latin                -            -
tt     Tatar                     Cyrillic             -            -
ty     Tahitian                  Latin                -            -

ug     Uighur                    Arabic               -            -
ug                               Cyrillic             -            -
ug                               Latin                -            -
ug                               Uyghur               -            -
uk     Ukrainian                 Cyrillic             Yes          -
ur     Urdu                      Arabic               -            -
uz     Uzbek                     Cyrillic             -            -
uz                               Latin                -            -

vi     Vietnamese                Latin                -            -
vo     Volapuk                   Latin                -            -

wa     Walloon                   Latin                -            Incomplete

yi     Yiddish                   Hebrew               -            -
yo     Yoruba                    Latin                -            -

zu     Zulu                      Latin                -            -

Languages in Which the Exact Script Used is Unknown
===================================================

Aspell can most likely support any of the following languages; however,
I am unsure what script they are written in.  Most of them are probably
written in Latin but I am not sure.  If you have any information about
these languages please email me at <address@hidden>.

Code Language Name

aa   Afar
ak   Akan
av   Avaric

bi   Bislama
bm   Bambara

cu   Old Slavonic

ee   Ewe

ff   Fulah

ho   Hiri Motu
ht   Haitian Creole
hz   Herero

ie   Interlingue
ig   Igbo
ii   Sichuan Yi
ik   Inupiaq

kg   Kongo
ki   Kikuyu/Gikuyu
kj   Kwanyama

lb   Luxembourgish
lg   Ganda
li   Limburgan
ln   Lingala
lu   Luba-Katanga

mg   Malagasy
mh   Marshallese

na   Nauru
nd   North Ndebele
ng   Ndonga
nr   South Ndebele
ny   Nyanja

rn   Rundi
rw   Kinyarwanda

sc   Sardinian
sg   Sango
si   Sinhalese
sm   Samoan
ss   Swati
st   Southern Sotho

tn   Tswana
to   Tonga
ts   Tsonga
tw   Twi

ve   Venda

wo   Wolof

xh   Xhosa

za   Zhuang

The Ethiopic Script
===================

Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it.  The idea is to split
each character into two parts based on the matrix representation.  The
first 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'.  The combined character will then be mapped with the upper
bits coming first.  Thus each Ethiopic syllable will have the form
`11?????? 10000???'.  By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable.  This encoding of
the syllables is far more useful to Aspell than if they were stored in
UTF-8 or UTF-16.  In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional
modifications.  However, additional improvements may be possible by
taking advantage of the consonant-vowel structure of this encoding.

   In fact, the split consonant-vowel representation may prove to be so
useful that it may be beneficial to encode other syllabaries in this
fashion, even if they have fewer than 220 characters.

   The code to break up a syllable into its consonant-vowel parts does
not exist as of Aspell 0.60.  However, it will be fairly easy to add
it as part of the Unicode normalization process once that is written.
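
   A minimal Python sketch of the split described in this section, for
illustration only and not code from Aspell.  It assumes that the "first 3
bits" and "next 6 bits" are the low 9 bits of the Unicode code point
within the Ethiopic block (U+1200 to U+137F), where each row of eight
code points shares a consonant.  The function names are hypothetical.

```python
ETHIOPIC_FIRST = 0x1200
ETHIOPIC_LAST = 0x137F


def split_syllable(ch):
    """Return the two-byte form `11?????? 10000???` for one character."""
    cp = ord(ch)
    if not ETHIOPIC_FIRST <= cp <= ETHIOPIC_LAST:
        raise ValueError("not in the Ethiopic block")
    vowel = 0x80 | (cp & 0x07)               # low 3 bits  -> 10000???
    consonant = 0xC0 | ((cp >> 3) & 0x3F)    # next 6 bits -> 11??????
    return bytes([consonant, vowel])         # upper bits come first


def join_syllable(pair):
    """Inverse of split_syllable."""
    consonant, vowel = pair
    return chr(ETHIOPIC_FIRST | ((consonant & 0x3F) << 3) | (vowel & 0x07))


la = split_syllable("\u1208")   # b"\xc1\x80"
lu = split_syllable("\u1209")   # b"\xc1\x81"
```

The two results share the first byte (same consonant row) and differ only
in the second (vowel column), which is the property the text expects the
suggestion strategy to benefit from.  Note that the block also contains
punctuation and numerals, which this sketch does not treat specially.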

The Thai Script
===============

The Thai script presents a different problem for Aspell.  The problem
is not that there are more than 220 unique symbols, but that there are
no spaces between words.  This means that there is no easy way to split
a sentence into individual words.  However, it is still possible to
spell check Thai; it is just a lot more difficult.  I will be happy to
work with someone who is interested in adding Thai support to Aspell,
but it is not likely something I will do in the foreseeable future.
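
One common baseline for unsegmented text, which Aspell does not
implement, is greedy longest-match against a word list.  The sketch below
uses Latin letters without spaces as a stand-in for Thai; the `segment`
function and the tiny word list are hypothetical.

```python
def segment(text, words):
    """Greedy longest-match segmentation; returns None if it gets stuck."""
    longest = max(map(len, words))
    result, pos = [], 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in words:
                result.append(piece)
                pos += size
                break
        else:
            return None  # no dictionary word starts here
    return result


WORDS = {"the", "cat", "cats", "sat", "at"}
ok = segment("thecatsat", WORDS)    # ['the', 'cats', 'at']
bad = segment("thecatzat", WORDS)   # None
```

The first result shows the ambiguity ("the cats at" versus "the cat
sat"), and the second shows that a single misspelling can leave the
segmenter unable to say where the bad word begins or ends.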

Languages which use Hànzi Characters
====================================

Hànzi characters are used to write Chinese, Japanese, and Korean, and were
once used to write Vietnamese.  Each hànzi character represents a
syllable of a spoken word and also has a meaning.  Since there are
around 3,000 of them in common usage, it is unlikely that Aspell will
ever be able to support spell checking languages written using hànzi.
However, I am not even sure if these languages need spell checking since
hànzi characters are generally not entered directly.  Furthermore,
even if Aspell could spell check hànzi, the existing suggestion strategy
will not work well at all, and thus a completely new strategy will need
to be developed.

Japanese
========

Modern Japanese is written in a mixture of hiragana, katakana, kanji,
and sometimes romaji.  Hiragana and katakana are both syllabaries unique
to Japan, kanji is a modified form of hànzi, and romaji uses the Latin
alphabet.  With some work, Aspell should be able to check the non-kanji
part of Japanese text.  However, based on my limited understanding of
Japanese, hiragana is often used at the end of kanji.  Thus if Aspell
were to simply separate out the hiragana from kanji it would end up with
a lot of word endings which are not proper words and will thus be
flagged as misspellings.
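
The problem can be seen with a small Python sketch that splits text into
runs by script.  This is an illustration, not Aspell code; the function
names are hypothetical, and the code point ranges are the basic Unicode
blocks only.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by Unicode block (basic blocks only)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_runs(text):
    """Group consecutive characters that share a script."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


runs = script_runs("食べる")
# [('kanji', '食'), ('hiragana', 'べる')]
```

The single verb comes apart into a kanji stem and a hiragana ending, and
the ending on its own is not a word a dictionary would contain.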

Languages Written in Multiple Scripts
=====================================

Aspell should be able to check text written in the same language but in
multiple scripts with some work.  If the number of unique symbols in
both scripts is less than 220 then a special character set can be used
to allow both scripts to be encoded in the same dictionary.  However,
this may not be the most efficient solution.  An alternate solution is
to store each script in its own dictionary and allow Aspell to choose
the correct dictionary based on which script the given word is written
in.  Aspell currently does not support this mode of spell checking;
however, it is something that I hope to eventually support.
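
The second approach could look roughly like the following Python sketch.
It is not an existing Aspell feature.  Using the first word of a
character's Unicode name as its script is a rough heuristic assumed here
for brevity, and the function names and word sets are hypothetical.

```python
import unicodedata


def script_of_word(word):
    """Script guessed from the Unicode name of the first letter."""
    for ch in word:
        if ch.isalpha():
            return unicodedata.name(ch).split()[0]
    return None


def pick_dictionary(word, dictionaries):
    """Return the dictionary registered for the word's script, if any."""
    return dictionaries.get(script_of_word(word))


# Serbian is written in both Cyrillic and Latin.
DICTIONARIES = {
    "CYRILLIC": {"српски"},
    "LATIN": {"srpski"},
}

found = "српски" in pick_dictionary("српски", DICTIONARIES)   # True
```

Each dictionary can then keep its own 8-bit character set, so neither
script has to share the 220 available slots with the other.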

-- 
http://kevin.atkinson.dhs.org