aspell-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[aspell] Affix Compression


From: Kevin Atkinson
Subject: [aspell] Affix Compression
Date: Wed, 14 Jun 2000 05:41:22 -0400 (EDT)

If anyone was interested in implementing affix compression here is a nice
write up by the author of Ispell.

But before Aspell can even use Affix compression it will need an alternate
strategy of suggesting words that is not based on based
"soundslike" words.  I plan on implanting a special dictionary that when
asked for all words with X soundslike it will simply return all words that
are equal to X after being stripped of the capitalization and
accents.  This will be possible to do without using an extra lookup table
because the words will be indexes based in there lowercase, accent striped
equivalent.

---------- Forwarded message ----------
Date: Wed, 14 Jun 2000 00:12:20 -0700
From: Geoff Kuenning <address@hidden>
To: address@hidden
Subject: Re: Affix Compression Again

> I release that you are rather busy however if you have the time I would
> appreciate it if you could give me an English overview of how affix
> compression works.   For example how does looking the words up in the
> dictionary work, does it strip the word to its base name, if so what does
> it do if there could be more than one base name?

As it turns out, I had to implement it in both directions.  The "add
affixes" direction, which is the way people think of it when they
write affix files, was originally created for generating suggestions
and for use by munchlist.  However, it's also turned out to be very
useful for all of those people who want to turn an ispell dictionary
back into a raw word list.  I'll describe that one first.

For simplicity, I'll describe everything in terms of suffixes, which
is what existed first.  It should be obvious how to extend things to
prefixes, with one exception that I'll cover below.

Adding an affix to a word is fairly easy.  First, you check the
contexts.  That's easy and quick because the contexts are encoded as
flag arrays, so checking them is a simple lookup.  For example,
consider flag T from english.aff:

flag T:
    E           >       ST              # As in late > latest
    [^AEIOU]Y   >       -Y,IEST         # As in dirty > dirtiest
    [AEIOU]Y    >       EST             # As in gray > grayest
    [^EY]       >       EST             # As in small > smallest


Suppose we have "dirty/T".  First I check to make sure the word is at
least 2 characters long (1 more than the "strip" length -- you can't
wind up with a null root after stripping!).  Then I check the
contexts; I think I go backwards for convenience.

The context characters are encoded as bits within an array of bytes,
named "conds".  There are 256 entries in the array, one for each 8-bit
character (in some cases there can be more than 256 entries, but I'm
simplifying a bit).  In any entry, bit 0 is set if the corresponding
character is a legal context for the last character in the word.
(Actually, only the entries for uppercase characters matter, but I
maintain them all so that the code will run fast.)  Similarly, bit 1
is set if the character is legal for the next-to-last character, and
so forth.  That limits me to 8 context characters, but 8 seems to be
enough for everybody.

So, for example, the second flag-table entry for "T" will state that
there are two context characters.  I use the "y" (converted to
uppercase) to index into the 256-byte "conds" array and pick up an
8-bit value.  Since bit 0 is set, I know that "Y" is legal as the
final character of a string for this suffix entry.  I then do the same
thing for the "t" of "dirty"; this time I look at bit 1.  Note that in
this case, bit 0 will be set only in the entry for "y", while bit 1
will be set in every entry except for the vowels.

Having checked the context, the rest is easy.  Since I know the length
of what I need to strip, and the contex matches, I just delete that
number of characters.  Then I tack on the "append" string (it's called
"affix" in the flagent struct) and I'm done--except for
capitalization, which is a horrible mess I'll discuss later.
Actually, the delete/tack on part can be done with a single strcpy for
suffixes; prefixes are a bit trickier but really just take a bit of
careful programming.

Whew!  The "remove affixes" direction, which is used when you need to
verify the spelling of a word, works in almost the same way.  The big
difference is that since the context rules apply to the root, you
can't check them until after you've made the transformation.  Again,
the first thing I do is to check the length of the word against the
affix lengths to make sure that what I'm about to do is sensible.
Then I string-compare the "append" string with the tail of the word.
If they match, I strip off the "append" string (by just trimming the
right number of characters) and add on the "strip" string.  Again,
that can be done with a single strcpy for suffixes.

Finally, after doing the suffix transformation, I can check the
context using the same rules as before.  Again, I have to correct
capitalization in a separate step.

Prefixes work the same way except that bit 0 of the "conds" array
corresponds to the first character of the string.

Oh, I missed one of your questions: "what does it do if there could be
more than one base name?"  The answer is that I try all
possibilities.  I just loop through the flag table, looking for a
hit.  If I recall correctly, there's an argument to the function that
tells it whether to stop on the first success, or return all matches.
That's used in suggestion generation, I think.

If you've been thinking about performance as you read this, you may
have spotted a nasty little problem: when you're working in the
"remove affixes" direction, you have to spend a lot of time in strcmp.
Yup.  In fact, that turned out to be a killer, and I had to come up
with a clever data structure to help out.  Unfortunately, I don't
remember what I did!  Lemme look...hmmm, looks like there's a
tree-like table of pointers to flag entries, indexed by the characters
of the word.  For example, the entry for "T" would point to a list of
flag entries describing words that could end in "T", including
"dirtiest".  If there are a lot of suffixes ending in T, the table
would contain another level of indexing, so you'd wind up following
the "S" entry to find all words that end in "ST", and so forth.  This
tree is built dynamically in lookup.c when the affix table is read in,
and is used in tgood.c (search for "flagptr", which is the struct used
for the tree).  Man, I have a convoluted mind when I need to.  Anyway,
you can look at the table by turning on the DUMPINDEX compile-time
option.

The one thing I've been postponing talking about is capitalization,
because it's so tricky.  You probably know that ispell divides the
world up into four categories all-caps (IBM), capitalized (Kevin),
followcase (practically every dot-com nowadays) and lowercase.  To
make things fun, there's a "promotion" sequence: lowercase words can
be capitalized, and any word can be all-caps (as in titles for
example).  The result is that you need to maintain the capitalization
when you manipulate affixes.  Here, from memory, is how it's done.

The first thing you need is the root word, because that's where the
information is kept about which category the word falls into.  If it's
all-caps, life is simple: just force everything into uppercase.
Otherwise, we need to do what the user wants.  All of this only
matters in "add affixes" mode, because the "remove" mode always wants
to look up an uppercase version of the word.

To properly add affixes, you need to know what the original word
looked like.  There are two cases: generation from a raw dictionary
(the -e switch) and generation of suggestions for a misspelled word.

Working from a raw dictionary, I use the following rules:

Lowercase word: the root and all affixes become lowercase
Capitalized word: suffixes become lowercase.  Prefixes are
    capitalized, and the first letter of the root is forced to
    lowercase.
All-caps word: the root and all affixes become uppercase
Followcase word: the prefix is forced to the case of the first
    character of the root, and the suffix is forced to the case of the
    last character of the root.  That doesn't always do the right
    thing, but munchlist makes sure that a followcase word doesn't get
    an affix flag unless the capitalization works out OK; otherwise
    the "expanded" versions will get listed separately in the
    dictionary.  Also, with followcase words there can be more than
    one legal case choice, so I might have to handle several options
    and produce the word multiple times.

Incidentally, you can get a good idea of what ispell will do in these
situations by typing into the standard input of "ispell -ee"...hmm, in
fact I see that the rules above are incorrect, because giving it
"ITcorp/U" gives "Unitcorp" instead of "UNITcorp" as I clamed -- but I
don't know if that's a bug or a feature!  I think it's a bug.

Finally, for generating suggestions, the rules above are followed, but
the original misspelled word is used as a prototype.  If it was
all-caps, then the suggestion is generated in the same form.  If the
prototype was capitalized, the suggestion is generated capitalized
unless the root was marked as all-caps or followcase -- in those
situations, the rules listed with the root are followed just as if we
had been working from a raw dictionary.  Finally, if the prototype was
all lowercase, then the raw-dictionary rules are followed.

My code isn't nearly as clean as the above description.  In
particular, I should have written the raw-dictionary code separately,
and then called it from the prototype cde when appropriate.  But
that's not the way it grew; you know how *that* goes.

> It would help me out a lot so that when implementing it myself I don't
> make the same mistakes that you make.

Me?  Mistakes?  Surely you jest... :-)
-- 
    Geoff Kuenning   address@hidden   http://www.cs.hmc.edu/~geoff/

"Du kannst dem Leben nicht mehr Tage geben, aber den Tag mehr Leben."
        -- source unknown, attribution requested
(You can't give your life more days, but you can give your days more life.)




reply via email to

[Prev in Thread] Current Thread [Next in Thread]