aspell-devel

Re: [aspell-devel] Big wordlist and affix lexicons


From: Børre Gaup
Subject: Re: [aspell-devel] Big wordlist and affix lexicons
Date: Sat, 25 Nov 2006 12:27:06 +0100
User-agent: KMail/1.9.5

On Saturday, 25 November 2006 00:12, Kevin Atkinson wrote:
> On Fri, 24 Nov 2006, Børre Gaup wrote:
> > I work in a project which is going to make spellcheckers for Northern and
> > Lule Sami, among others a high-quality Aspell spell checker.
> >
> > We use Xerox two-level morphological tools to make fullform word lists.
> > The Northern Sami fullform word list is now about 24GB. The word list can
> > be broken down into word forms covering a single stem + inflection and
> > other endings. Each word can have up to 16000 unique endings, and the set
> > of inflectional endings a word can have varies. We thus have several such
> > sets of inflectional endings. The exact number needed for Aspell is not
> > yet known, but the present Xerox-based lexicons have more than 150 such
> > sets.
>
> It sounds like hunspell might be a better choice since it supports twofold
> affix stripping.  I would very much like to incorporate many of hunspell's
> features into Aspell but I simply don't have the time.  I would greatly
> appreciate any help in this area.
>
The problem is that hunspell is not as ubiquitous as aspell. As far as I have
seen, hunspell is not commonly used, but aspell is used both in Linux and in
Mac OS X (through Cocoaspell). Hunspell is _intended_ to replace myspell in
openoffice.org (according to its homepage).

What features in hunspell would you specifically like to have in aspell?

> > We made an affix file containing the 16000 unique endings one of our
> > words had, and that file alone became 1.5 MB. Our calculations tell us
> > that if we continue in this vein for all our words, we will end up with
> > an affix file that can be as big as 50MB.
> >
> > As far as we understand there are 52 available affix classes for the
> > affix file. It is probable that we would need more affix classes than the
> > existing 52. Is it possible to increase this number?
>
> More like around 200 since you can use any 8-bit symbol.
>
Ok, then that misunderstanding is cleared away.
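
To make the difference between the two counts concrete, here is a rough
sketch. The figure 52 is what you get if only the ASCII letters can be used
as flags. The rule I use for "any 8-bit symbol" is my own assumption (a
Latin-1 style encoding, no control characters, no whitespace, and no '/'
because I take it to separate a word from its flags); it is not taken from
the Aspell sources, so the exact count may well differ.

```python
import string

# 52 flags: what you get if only the ASCII letters a-z and A-Z are allowed.
ascii_letter_flags = set(string.ascii_letters.encode("ascii"))


def is_candidate_flag(byte: int) -> bool:
    """Assumed rule, not Aspell's actual one: accept any single byte that
    is not a control character, not whitespace, and not '/'."""
    if byte < 0x21 or byte == 0x7F:   # ASCII control characters and space
        return False
    if 0x80 <= byte <= 0xA0:          # C1 controls and no-break space (Latin-1)
        return False
    return byte != ord("/")           # assumed word/flag separator


eight_bit_flags = {b for b in range(256) if is_candidate_flag(b)}

print("ASCII letters only:", len(ascii_letter_flags))
print("8-bit candidates:  ", len(eight_bit_flags))
```

Under these assumptions the second count comes out a little below 200, which
is in line with "around 200".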

> > If that is not possible, we will probably end up with a very big
> > wordlist, amounting to some gigabytes. How well will aspell tackle a
> > wordlist of that size?
>
> Well Aspell should do just fine if it will all fit in memory.  All bets
> are off if it doesn't.
:)
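
For reference, a back-of-the-envelope calculation with the affix file figures
quoted earlier in this message (16000 endings in a 1.5 MB file, and a
projected 50 MB in total). I assume decimal megabytes here, and the variable
names are only for illustration.

```python
MB = 10**6  # assumption: decimal megabytes

endings_in_sample = 16_000           # unique endings of the one word we tried
sample_affix_file_bytes = 1.5 * MB   # size of the affix file for that word
projected_affix_file_bytes = 50 * MB # our estimate for all words

# Average size of one ending's entry in the affix file.
bytes_per_ending = sample_affix_file_bytes / endings_in_sample

# How many sets of the sample's size the projected file corresponds to.
equivalent_sets = projected_affix_file_bytes / sample_affix_file_bytes

print(f"{bytes_per_ending:.2f} bytes per ending")
print(f"{equivalent_sets:.1f} sets of the sample's size")
```

Even the projected affix file is small compared with a fullform wordlist of
some gigabytes, which is the case where fitting in memory becomes the
question.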
-- 
Børre Gaup
Prošeaktamielbargi - Project worker
tel(W): +47 77 64 59 64
tel(GSM): +47 41 08 03 64
e-mail:address@hidden
http://divvun.no/english.html



