[aspell-devel] How Aspell Works (Repost)

aspell-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[aspell-devel] How Aspell Works (Repost)

From:	Kevin Atkinson
Subject:	[aspell-devel] How Aspell Works (Repost)
Date:	Mon, 13 Nov 2006 17:30:55 -0700 (MST)
Reposting so that the correctly format version gets in the archive.


Date: Fri, 30 Sep 2005 06:21:54 -0600 (MDT)
From: Kevin Atkinson <address@hidden>
Subject: How Aspell Works: Part 1: The Compiled Dictionary Format

The details of how Aspell really works is a mystery to most everyone but me.
So I thought it is finally time to fully explain Aspell's core algorithms
and data structures.  In doing so I hope people will appreciate the amount
of cleverness I put into it, and perhapses be able to improve it even more
in the future.  However, doing so will take some time.  Thus I will be
explaining it in small segments.  This information will eventually be
included in the Aspell source in one form or another.

In this part I will explain how the data is laid out in the compiled
dictionary for Aspell 0.60.  Source file: readonly_ws.cpp

Aspell's main word list is laid out as follows:

* header
* jump table for editdist 1
* jump table for editdist 2
* data block
* hash table

There is nothing particular interesting about the header.  Just a bunch of
meta information.  I will get back to the jump tables.

Words in the data block are grouped based on the soundslike.  Each group is
as follows:
    <8 bit: offset to next item><8 bit: soundslike
size><soundslike><null><words>
Each word group is as follows:
    <8 bit: flags><8 bit: offset to next word><8 bit: word size><word><null>
    [<affix info><null>]
The offset for the final word in each group points to the next word in the
following group.  The offset for the final word and soundslike group in the
dictionary is 0.

I made some previsions for additional info to be stored with the word but
for simplicity I will leave it out.  If soundslike data is not used than the
soundslike block it not used.

This format make it easy to iterate over the data without using the hash
table.

Each soundslike group can be a maximum of about 256 bytes.  If this limit is
reached than the soundslike group is split.  Using 2 bytes for the
soundslike offset would of solved this problem however
    1) 256 is normally sufficient, thus I would of wasted some space by
       using an extra byte.
    2) More importantly, Using 2 bytes means I would of had to worry about
       alignment issues.

The soundslike groups are sorted in more or less alphabetic order.

The hash table is a simple open address hash table.  The key is the
dictionary word in all lowercase form with any accents removed (what is
known as the "clean" form of the word).  The value stored in the table is an
32 bit offset to the beginning of the word.  An integer offset is used
rather than a pointer so that 1) the complied dictionary can be mmaped in
which makes loading the dictionary very fast and so that the memory can be
shared between processed, and 2) on 64 bit platforms using pointers would of
doubled the size of the hash table.

Additional information on the word can be derived from the offset:
    word size: offset - 1
    offset to next word: offset - 2
    flags: offset - 3
I use helper functions for getting this information.  Doing it this way
rather than having a data structure is slightly evil, I admit, but I have my
reasons.

In the next part I will explain how Aspell uses the jump tables to quickly
search the list for all soundslike with an edit-distance of 1 or 2.



Date: Sun, 2 Oct 2005 06:01:36 -0600 (MDT)
From: Kevin Atkinson <address@hidden>
To: address@hidden
Subject: How Aspell Works: Part 2: Quickly Finding Similar Soundslike

[Including grammatical corrections from follow up reply]

In order for Aspell to find suggestions for a misspelled word Aspell 1)
creates a list of candidate words, 2) scores them, and 3) returns the most
likely candidates.  One of the ways Aspell finds candidate words is to look
for all words with a soundslike which is of a small edit distance from the
soundslike of the original word.  The edit distance is the total number of
deletions, insertions, exchanges, or adjacent swaps needed to make one
string equivalent to the other. The exact distance chosen is either 1 or 2
depending on a number of factors.  In this part I will focus on how Aspell
find all such soundslike efficiently and how the jump tables play a key
role.

In order two find all possible soundslike within a fixed edit distance one
of two things can be tried.  1) Use trial an error by trying all possible
edits and then seeing if they are in the dictionary, and 2) Scan the
dictionary for possible soundslikes.  Before Aspell 0.60 Aspell used the
first method when the edit distance was one and the second otherwise.
Aspell now uses the second method for both methods as I was able to make the
scan very efficient, therefore saving the space of having to store a
separate hash table for the soundslike.

The naive way to scan the list for all possible soundslike is to compute the
edit-distance of every soundslike in the dictionary and keep the ones within
the threshold.  This is exactly what Aspell did prior to 0.60. When a fast
enough edit distance function is used this method turns out not to be
unbearably slow, at least for English.  For other languages, with large word
lists and no soundslike, it can be slow due to the number of items scanned.

Aspell uses a special edit distance function which gives up if the distance
is larger than the threshold, thus making it very fast.  The basic algorithm
is as follows:
   limit_edit_distance(A,B,limit) = ed(A,B,0)
     where ed(A,B,d) = d                              if A & B is empty.
                     = infinity                       if d > limit
                     = ed(A[2..],B[2..], d)           if A[1] == B[1]
                     = min ( ed(A[2..],B[2..], d+1),
                             ed(A,     B[2..], d+1),
                             ed(A[2..],B,      d+1) ) otherwise
However the algorithm used also allows for swaps and is not recursive.
Specialized version are provided for an edit distance of one and two. The
running time is asymptotically bounded above by (3^l)*n where l is the limit
and n is the maximum of strlen(A),strlen(B). Based on my informal tests,
however, the n does not really matter and the running time is more like
(3^l).  For complete details on this algorithm see the files leditdist.hpp
and leditdist.cpp in the source distribution under modules/speller/default.

By exploiting the properties of limit_edit_distance is possible to avoid
having to look at many of the soundslikes in the dictionary.
Limit_edit_distance is efficient because in many cases, it doesn't have to
look at the entire word before it can determine that it isn't within the
given threshold.  By having it return the last position looked at, "p", it
is possible to avoid having to look at similar soundslike which are not
within the threshold.  That is if two soundslike are the same up to the
position "p" than nether of them are within the given threshold.

Aspell 0.60 exploits this property by using jump tables.  Each entry in the
jump table contains two fields: the first N letters of a soundslike, and an
offset.  The entries are sorted in lexicographic order based on the raw byte
value.  Aspell maintains two jump tables.  The first one contains the first
2 letter of a soundslike and the offset points into the second jump table.
The second one contains the first 3 letters of a soundslike where the offset
points to the location of the soundslike in the data block.  The soundslike
in the datablock are sorted so that a linear scan can be used to find all
soundslike with the same prefix.  If limit_edit_distance stops before
reaching the end of a "soundslike" in the one of the jump tables than it is
possible to skip all the soundslike in the data block with the same prefix.

Thus, the scan for all soundslike within a given edit distance goes
something like this:

   1) Compare the entry in the first jump table using limit_edit_distance.
      If limit_edit_distance scanned passed the end of the word than go the
      first entry in the second jump table with the same prefix.
      Otherwise go to the next entry in the first jump table and repeat.

   2) Compare the entry in the second jump table.  If limit_edit_distance
      passed the end of the word than go the first soundslike in the data
      block with this prefix.  Otherwise if the first two letters of the
      next entry are the same as the current one go it and repeat.  If the
      first two letters are not the same than go to the next entry in the
      first jump table and repeat step 1.

   3) Compare the soundslike in the data block.  If the edit distance
      is within the target distance add to the candidate list, otherwise
      don't.  Let N be the position where limit_edit_distance stopped
      (starting at 0).  If N is less than 6 skip over any soundslike that
      have the same first N + 1 letters.  If, after skipping over
      any similar soundslike, the next soundslike does not have the same
      first three letters go to the next entry in the second jump table
      and repeat step 2.  Otherwise repeat this step with the next
      soundslike.

The part of skipping over soundslike with the first N + 1 letters in step 3
was added in Aspell 0.60.3.  The function responsible for most of this is
ReadOnlyDict::SoundslikeElements::next found in readonly_ws.cpp

In the next part I will describe how Aspell deals with soundslike lookup
when affix compression is involved.
[Prev in Thread]
Current Thread
[Next in Thread]
[aspell-devel] How Aspell Works (Repost), Kevin Atkinson <=
Prev by Date: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Next by Date: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Previous by thread: [aspell-devel] Am interested in working for Aspell in Tamil
Next by thread: [aspell-devel] Culinary terms
Index(es):
- Date
- Thread