aspell-devel

Re: [aspell-devel] Big wordlist and affix lexicons


From: Kevin Atkinson
Subject: Re: [aspell-devel] Big wordlist and affix lexicons
Date: Fri, 24 Nov 2006 16:12:04 -0700 (MST)

On Fri, 24 Nov 2006, Børre Gaup wrote:

I work on a project that is going to build spell checkers for Northern and Lule
Sami, including a high-quality Aspell spell checker.

We use Xerox two-level morphological tools to make fullform word lists. The
Northern Sami fullform word list is now about 24GB. The word list can be
broken down into word forms covering a single stem + inflection and other
endings. Each word can have up to 16000 unique endings, and the set of
inflectional endings a word can have varies. We thus have several such sets
of inflectional endings. The exact number needed for Aspell is not yet known,
but the present Xerox-based lexicons have more than 150 such sets.
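To make the stem-plus-ending-set idea concrete, here is a minimal sketch of how a compact lexicon (stems tagged with a set of endings) expands into a full-form word list. Everything in it is a placeholder: the stems, the endings, and the flag letters `A` and `B` are invented for illustration, and the data layout is not Aspell's actual affix-file syntax.

```python
# Toy data only: the stems, endings, and flag letters are placeholders,
# not real Sami morphology and not Aspell's affix-file format.
ENDING_SETS = {
    "A": ["", "s", "ed", "ing"],
    "B": ["", "s"],
}

# Each stem points at the one ending set it may combine with.
STEMS = {
    "walk": "A",
    "talk": "A",
    "jump": "A",
    "tree": "B",
    "book": "B",
}


def expand(stems, ending_sets):
    """Yield every full form: each stem joined with each ending in its set."""
    for stem, flag in stems.items():
        for ending in ending_sets[flag]:
            yield stem + ending


full_forms = list(expand(STEMS, ENDING_SETS))

# Size of the compact representation: one entry per stem plus one per ending.
compact_entries = len(STEMS) + sum(len(e) for e in ENDING_SETS.values())

print(len(full_forms), "full forms from", compact_entries, "compact entries")
```

The gap between the two counts widens quickly as the ending sets grow, which is why sets with thousands of endings make a full-form list so much larger than the stem list.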

It sounds like hunspell might be a better choice, since it supports twofold affix stripping. I would very much like to incorporate many of hunspell's features into Aspell, but I simply don't have the time. I would greatly appreciate any help in this area.

We made an affix file containing the 16000 unique endings one of our words
had, and that file alone became 1.5 MB. Our calculations tell us that if we
continue in this vein for all our words, we will end up with an affix file
that can be as big as 50MB.
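As a rough check on those figures, the arithmetic can be sketched as below. It uses only the numbers quoted above and assumes decimal megabytes; the per-ending size is an average, not a measurement of any actual affix rule.

```python
# Figures quoted in the message; decimal megabytes are an assumption.
ENDINGS = 16_000
AFFIX_FILE_BYTES = 1.5 * 10**6
PROJECTED_BYTES = 50 * 10**6

# Average space each ending takes in the test affix file.
bytes_per_ending = AFFIX_FILE_BYTES / ENDINGS

# How many ending sets of that same size would fill the projected file.
equivalent_sets = PROJECTED_BYTES / AFFIX_FILE_BYTES

print(f"{bytes_per_ending:.2f} bytes per ending")
print(f"{equivalent_sets:.1f} sets of 16000 endings in 50 MB")
```

So the 50MB projection corresponds to roughly 33 ending sets as large as the one tested, at a little under 100 bytes per ending.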

As far as we understand there are 52 available affix classes for the affix
file. It is probable that we would need more affix classes than the existing
52. Is it possible to increase this number?

More like around 200 since you can use any 8-bit symbol.
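A back-of-the-envelope count shows where a figure of that order comes from. The sketch below assumes a Latin-1 style 8-bit encoding and excludes control characters, spaces, and the `/` that separates a word from its flags in the word list. Which bytes Aspell actually accepts as flags is not stated in this thread, so the exclusion set is an assumption.

```python
import string

# The 52 classes mentioned in the question: ASCII upper and lower case letters.
letters = string.ascii_letters


def candidate_flags(reserved="/"):
    """Return the visible 8-bit characters that could serve as flags.

    Assumes Latin-1: 0x21-0x7E are visible ASCII, 0xA1-0xFF are visible
    high characters. Control codes, space, and no-break space are left out.
    The `reserved` set is an assumption, not Aspell's documented rule.
    """
    ascii_part = range(0x21, 0x7F)
    high_part = range(0xA1, 0x100)
    return [
        chr(b)
        for part in (ascii_part, high_part)
        for b in part
        if chr(b) not in reserved
    ]


flags = candidate_flags()
print(len(letters), "letter flags versus", len(flags), "8-bit candidates")
```

Under these assumptions the count comes out a little under 200, in line with the estimate above.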

If that is not possible, we will probably end up with a very big word list,
amounting to some gigabytes. How well will Aspell handle a word list of that
size?

Well, Aspell should do just fine if it all fits in memory. All bets are off if it doesn't.