aspell-devel

Re: [aspell-devel] Language Info Needed for Aspell


From: Kevin Patrick Scannell
Subject: Re: [aspell-devel] Language Info Needed for Aspell
Date: Tue, 23 Mar 2004 15:51:41 -0600
User-agent: KMail/1.6.1

Dear Kevin, 

 I'm pleased to hear you are trying to extend aspell
 support as widely as possible.  I'm hoping I can contribute
 in a substantial way here.  I have some web crawling software
 available that targets particular languages:

 http://borel.slu.edu/crubadan/

 It "bootstraps" a model of the target language based on 
 previously seen texts and rarely makes mistakes if provided
 with sufficient "seed" texts.

 As you can see on the status page, I've built up text corpora for quite a
 few languages.  Part of the crawler is a module that ranks words by the
 likelihood that they are correctly spelled words in the target language.
 The highest-frequency words make the cut, of course.  N-gram statistics
 are also calculated, which are a good way of disqualifying the foreign
 (mostly English) words that sneak in.  In the cases where I can find a
 dictionary, I can check any suspect words manually.
 This is also, I should say, an excellent way of improving 
 existing word lists.  I've been in contact with the Breton 
 and Welsh maintainers already.
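
 The ranking idea above can be sketched roughly as follows.  This is
 only an illustration, not the crawler's actual code: it assumes
 character trigrams with add-one smoothing, and the names
 train_ngram_model, ngram_score and rank_candidates are hypothetical.
 Words are kept only if they occur often enough in the corpus, and are
 then ordered by how closely their letter sequences match a model
 trained on trusted "seed" words.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with boundary marks."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_ngram_model(seed_words, n=3):
    """Count character n-grams over trusted words in the target language."""
    counts = Counter()
    for word in seed_words:
        counts.update(char_ngrams(word, n))
    return counts


def ngram_score(word, model, n=3):
    """Average log-probability of the word's n-grams (add-one smoothing).

    Higher (closer to zero) means the word looks more like the seed words.
    """
    grams = char_ngrams(word, n)
    if not grams:
        return float("-inf")
    total = sum(model.values())
    vocab = len(model) + 1  # one extra slot for unseen n-grams
    log_prob = sum(math.log((model[g] + 1) / (total + vocab)) for g in grams)
    return log_prob / len(grams)


def rank_candidates(corpus_tokens, model, min_freq=2, n=3):
    """Drop rare tokens, then sort the rest from most to least plausible."""
    freqs = Counter(token.lower() for token in corpus_tokens)
    scored = [
        (word, freq, ngram_score(word, model, n))
        for word, freq in freqs.items()
        if freq >= min_freq
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

 In practice the words at the bottom of such a ranking would be the
 ones to check by hand against a dictionary; any score threshold would
 have to be tuned per language.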

 The upshot is that I should be able to package up reasonably
 clean wordlists for Manx Gaelic (gv), Scottish Gaelic (gd),
 Cebuano (ceb, though I think "proc" chokes on 3-letter
 ISO-639 codes), and Setswana (tn).

 I've been contacted about starting Bambara (bm) as well.

 The Walloon ispell dictionary has a Makefile target that
 builds and installs an aspell dictionary, so that should
 be easy enough.

 Perhaps in future, if you have speakers of small languages
 contacting you about creating spellcheckers from scratch, you can
 direct them to me.

 I should mention that the crawler works out of the box for ISO-8859
 character sets but takes some effort for UTF-8.

 -Kevin




