Re: [aspell-devel] Language Info Needed for Aspell
From: Kevin Patrick Scannell
Subject: Re: [aspell-devel] Language Info Needed for Aspell
Date: Tue, 23 Mar 2004 15:51:41 -0600
User-agent: KMail/1.6.1
Dear Kevin,
I'm pleased to hear you are trying to extend aspell
support as widely as possible. I'm hoping I can contribute
in a substantial way here. I have some web crawling software
available that targets particular languages:
http://borel.slu.edu/crubadan/
It "bootstraps" a model of the target language based on
previously seen texts and rarely makes mistakes if provided
with sufficient "seed" texts.
As you can see on the status page, I've built up text corpora for
quite a few languages. Part of the crawler is a module that ranks
words by the likelihood that they are correctly spelled words in the
target language. The highest-frequency words make the cut, of course;
n-gram statistics are also computed, which are a good way of
disqualifying the foreign (mostly English) words that sneak in.
In cases where I can find a dictionary, I can check any suspect
words manually.
This is also, I should say, an excellent way of improving
existing word lists. I've been in contact with the Breton
and Welsh maintainers already.
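
To illustrate the n-gram idea, here is a minimal sketch in Python. It is
not the crawler's actual code: the function names, the choice of character
trigrams, the add-one smoothing, and the tiny seed list are all assumptions
made for the example. It trains trigram counts on words already trusted to
be in the target language, then ranks candidates so that the least
"native-looking" ones come first for manual checking.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_ngram_model(seed_words, n=3):
    """Count character n-grams over words trusted to be in the language."""
    counts = Counter()
    for word in seed_words:
        counts.update(char_ngrams(word, n))
    return counts


def score(word, counts, n=3):
    """Average log-probability per n-gram, with add-one smoothing.

    Higher (closer to zero) means the word looks more like the seed text.
    """
    grams = char_ngrams(word, n)
    if not grams:
        return float("-inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen n-grams
    log_prob = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return log_prob / len(grams)


def rank_suspects(candidates, counts, n=3):
    """Sort candidates from most suspect (lowest score) to least."""
    return sorted(candidates, key=lambda w: score(w, counts, n))


# Hypothetical seed list; a real model would use a large crawled corpus.
seed = ["taigh", "taighean", "baile", "caileag", "oidhche"]
model = train_ngram_model(seed)
suspects = rank_suspects(["taigh", "wxyz", "baile"], model)
```

In practice this score would be combined with word frequency, and only the
lowest-ranked words would need to be looked up in a dictionary by hand.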
The upshot is that I should be able to package up reasonably
clean wordlists for Manx Gaelic (gv), Scottish Gaelic (gd),
Cebuano (ceb, though I think "proc" chokes on 3-letter
ISO-639 codes), and Setswana (tn).
I've been contacted about starting Bambara (bm) as well.
The Walloon ispell dictionary has a Makefile target that
builds and installs an aspell dictionary, so that should
be easy enough.
Perhaps in future, if speakers of small languages contact you
about creating spellcheckers from scratch, you can direct
them to me.
I should mention that the crawler works out of the box for ISO-8859
character sets but takes some effort for UTF-8...
-Kevin