[aspell-devel] How Aspell Works: Part 2: Quickly Finding Similar Soundslike

From:	Kevin Atkinson
Subject:	[aspell-devel] How Aspell Works: Part 2: Quickly Finding Similar Soundslike
Date:	Sun, 2 Oct 2005 06:01:36 -0600 (MDT)

In order for Aspell to find suggestions for a misspelled word, Aspell 1) creates a list of candidate words, 2) scores them, and 3) returns the most likely candidates. One of the ways Aspell finds candidate words is to look for all words with a soundslike which is of a small edit distance from the soundslike of the original word. The edit distance is the total number of deletions, insertions, exchanges, or adjacent swaps needed to make one string equivalent to the other. The exact distance chosen is either 1 or 2 depending on a number of factors. In this part I will focus on how Aspell finds all such soundslike efficiently and how the jump tables play a key role.
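As an illustration of the edit distance described above, here is a minimal Python sketch of the textbook dynamic-programming computation (the "optimal string alignment" variant, in which an adjacent swap counts as a single edit). It is not Aspell's implementation, and the name `edit_distance` is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Count deletions, insertions, exchanges and adjacent swaps (cost 1 each)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every letter of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert every letter of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # exchange (or match)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]
```

With this definition `edit_distance("KT", "TK")` is 1 (one swap) rather than 2 (two exchanges).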

In order to find all possible soundslike within a fixed edit distance one of two things can be tried: 1) use trial and error by trying all possible edits and then seeing if they are in the dictionary, or 2) scan the dictionary for possible soundslike. Before Aspell 0.60 Aspell used the first method when the edit distance was one and the second otherwise. Aspell now uses the second method for both distances as I was able to make the scan very efficient, therefore saving the space of having to store a separate hash table for the soundslike.
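To make the first (trial and error) method concrete, here is a small Python sketch that generates every string one edit away from a soundslike and keeps those found in a set, which stands in for the hash table mentioned above. The function names and the uppercase-ASCII alphabet are assumptions for illustration only, not Aspell's code.

```python
import string

def edits1(s, alphabet=string.ascii_uppercase):
    """All strings reachable from s with a single edit (s itself included)."""
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    # Exchanging a letter with itself yields s again, i.e. distance 0.
    exchanges = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return deletes | swaps | exchanges | inserts

def candidates_by_trial(s, soundslikes):
    """Keep the generated edits that exist in the set of known soundslike."""
    return sorted(e for e in edits1(s) if e in soundslikes)

known = {"KT", "KAT", "KTS", "TK", "DG"}   # hypothetical dictionary
print(candidates_by_trial("KT", known))    # ['KAT', 'KT', 'KTS', 'TK']
```

The number of generated strings grows quickly with the edit distance, which is one reason a dictionary scan can be preferable at distance two.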

The naive way to scan the list for all possible soundslike is to compute the edit distance between the target and every soundslike in the dictionary and keep the ones within the threshold. This is exactly what Aspell did prior to 0.60. When a fast enough edit distance function is used this method turns out not to be unbearably slow, at least for English. For other languages, with large word lists and no soundslike, it can be slow due to the number of items scanned.

Aspell uses a special edit distance function which gives up if the distance is larger than the threshold, thus making it very fast. The basic algorithm is as follows:

  limit_edit_distance(A,B,limit) = ed(A,B,0)
    where ed(A,B,d) = d                              if A and B are both empty
                    = infinity                       if d > limit
                    = ed(A[2..],B[2..], d)           if A[1] == B[1]
                    = min ( ed(A[2..],B[2..], d+1),
                            ed(A,     B[2..], d+1),
                            ed(A[2..],B,      d+1) ) otherwise

However, the algorithm used also allows for swaps and is not recursive. Specialized versions are provided for an edit distance of one and two. The running time is asymptotically bounded above by (3^l)*n where l is the limit and n is the maximum of strlen(A), strlen(B). Based on my informal tests, however, the n does not really matter and the running time is more like (3^l). For complete details on this algorithm see the files leditdist.hpp and leditdist.cpp in the source distribution under modules/speller/default.
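For readers who prefer code, the basic recurrence above can be transcribed almost literally into Python. This sketch is recursive and does not handle swaps, so it corresponds to the simplified description rather than to the real code in leditdist.cpp. The handling of the case where only one string is exhausted is an assumption added to make the recurrence well defined.

```python
import math

def limit_edit_distance(a: str, b: str, limit: int) -> float:
    """Edit distance between a and b, or math.inf once it exceeds limit."""
    def ed(i, j, d):
        if d > limit:
            return math.inf                  # give up early
        if i == len(a) and j == len(b):
            return d                         # both strings consumed
        if i == len(a):                      # assumption: only b has letters left
            return ed(i, j + 1, d + 1)
        if j == len(b):                      # assumption: only a has letters left
            return ed(i + 1, j, d + 1)
        if a[i] == b[j]:
            return ed(i + 1, j + 1, d)       # letters match, no cost
        return min(ed(i + 1, j + 1, d + 1),  # exchange
                   ed(i, j + 1, d + 1),      # skip a letter of b
                   ed(i + 1, j, d + 1))      # skip a letter of a
    return ed(0, 0, 0)
```

Each mismatch branches three ways and at most `limit` mismatches are followed on any path, which is where the 3^l factor in the running time comes from.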

By exploiting the properties of limit_edit_distance it is possible to avoid having to look at many of the soundslike in the dictionary. limit_edit_distance is efficient because in many cases it doesn't have to look at the entire word before determining that it isn't within the given threshold. By having it return the last position looked at, "p", it is possible to avoid having to look at similar soundslike which are not within the threshold. That is, if a soundslike is rejected after looking only as far as position "p", then any other soundslike that is the same up to position "p" is also not within the given threshold.
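The following Python sketch illustrates the idea of returning the last position looked at. It extends a simple recursive bounded edit distance (no swaps, so not the real Aspell function) to also report the highest index of the dictionary string that was inspected. The name `limit_edit_distance_pos` and the example soundslike are hypothetical.

```python
import math

def limit_edit_distance_pos(a, b, limit):
    """Return (distance, p). distance is math.inf when above limit.

    p is the highest index of b that was inspected; p == len(b) means
    the end of b was reached.
    """
    far = -1

    def ed(i, j, d):
        nonlocal far
        if d > limit:
            return math.inf          # give up without reading any further
        far = max(far, j)            # b[j] (or the end of b) is inspected next
        if i == len(a) and j == len(b):
            return d
        if i == len(a):
            return ed(i, j + 1, d + 1)
        if j == len(b):
            return ed(i + 1, j, d + 1)
        if a[i] == b[j]:
            return ed(i + 1, j + 1, d)
        return min(ed(i + 1, j + 1, d + 1),
                   ed(i, j + 1, d + 1),
                   ed(i + 1, j, d + 1))

    return ed(0, 0, 0), far

dist, p = limit_edit_distance_pos("KT", "MNSTR", 1)
print(dist, p)   # inf 1: rejected after reading only "MN"

# Every soundslike that also starts with "MN" must be rejected as well,
# because the function would make exactly the same decisions on it.
same_prefix = ["MNT", "MNTL", "MNK"]
all_rejected = all(limit_edit_distance_pos("KT", s, 1)[0] == math.inf
                   for s in same_prefix)
```

Because the function is deterministic and only ever read positions 0 through p, a sorted scan can skip straight past every entry sharing those first p + 1 letters.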

Aspell 0.60 exploits this property by using jump tables. Each entry in the jump table contains two fields: the first N letters of a soundslike, and an offset. The entries are sorted in lexicographic order based on the raw byte value. Aspell maintains two jump tables. The first one contains the first 2 letters of a soundslike and the offset points into the second jump table. The second one contains the first 3 letters of a soundslike where the offset points to the location of the soundslike in the data block. The soundslike in the data block are sorted so that a linear scan can be used to find all soundslike with the same prefix. If limit_edit_distance stops before reaching the end of a "soundslike" in one of the jump tables then it is possible to skip all the soundslike in the data block with the same prefix.
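A rough picture of the two tables, as a Python sketch: the data block is modelled as a sorted list of byte strings, and each jump table as a list of (prefix, offset) pairs. The real on-disk layout in Aspell is different. The names here are hypothetical, and the treatment of soundslike shorter than three letters (the whole string is used as the prefix) is an assumption.

```python
def build_jump_tables(soundslikes):
    """Return (jump1, jump2, data) for an iterable of soundslike strings."""
    data = sorted(s.encode() for s in soundslikes)   # raw byte order
    jump1, jump2 = [], []
    for offset, s in enumerate(data):
        p3 = s[:3]
        if not jump2 or jump2[-1][0] != p3:
            p2 = s[:2]
            if not jump1 or jump1[-1][0] != p2:
                # First table: 2-letter prefix -> index into second table.
                jump1.append((p2, len(jump2)))
            # Second table: 3-letter prefix -> index into the data block.
            jump2.append((p3, offset))
    return jump1, jump2, data

jump1, jump2, data = build_jump_tables(
    ["MNT", "KT", "TK", "MNSTR", "KTN", "MNTL"])
print(jump1)   # [(b'KT', 0), (b'MN', 2), (b'TK', 4)]
print(jump2)   # [(b'KT', 0), (b'KTN', 1), (b'MNS', 2), (b'MNT', 3), (b'TK', 5)]
```

Rejecting the single first-table entry "MN" is enough to skip "MNSTR", "MNT" and "MNTL" without touching the data block.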

Thus, the scan for all soundslike within a given edit distance goes something like this:


  1) Compare the entry in the first jump table using limit_edit_distance.
     If limit_edit_distance scanned past the end of the entry then go to
     the first entry in the second jump table with the same prefix.
     Otherwise go to the next entry in the first jump table and repeat.

  2) Compare the entry in the second jump table.  If limit_edit_distance
     scanned past the end of the entry then go to the first soundslike in
     the data block with this prefix.  Otherwise, if the first two letters
     of the next entry are the same as the current one, go to it and
     repeat.  If the first two letters are not the same then go to the
     next entry in the first jump table and repeat step 1.

  3) Compare the soundslike in the data block.  If the edit distance
     is within the target distance add it to the candidate list, otherwise
     don't.  Let N be the position where limit_edit_distance stopped
     (starting at 0).  If N is less than 6 skip over any soundslike that
     have the same first N + 1 letters.  If, after skipping over
     any similar soundslike, the next soundslike does not have the same
     first three letters go to the next entry in the second jump table
     and repeat step 2.  Otherwise repeat this step with the next
     soundslike.

The part about skipping over soundslike with the same first N + 1 letters in step 3 was added in Aspell 0.60.3. The function responsible for most of this is ReadOnlyDict::SoundslikeElements::next found in readonly_ws.cpp.
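The three steps can be sketched in Python as follows. This is a simplified model and not a port of ReadOnlyDict::SoundslikeElements::next: the bounded edit distance is a plain recursive version without swaps, the tables are small hand-written lists, the "N less than 6" condition from step 3 is omitted, and all names are hypothetical.

```python
import math

def limit_edit_distance_pos(a, b, limit):
    """Return (distance or math.inf, highest index of b inspected)."""
    far = -1
    def ed(i, j, d):
        nonlocal far
        if d > limit:
            return math.inf
        far = max(far, j)                    # index len(b) means "end of b"
        if i == len(a) and j == len(b):
            return d
        if i == len(a):
            return ed(i, j + 1, d + 1)
        if j == len(b):
            return ed(i + 1, j, d + 1)
        if a[i] == b[j]:
            return ed(i + 1, j + 1, d)
        return min(ed(i + 1, j + 1, d + 1),
                   ed(i, j + 1, d + 1),
                   ed(i + 1, j, d + 1))
    return ed(0, 0, 0), far

# Hand-built example: sorted data block plus the two jump tables.
data = [b"KT", b"KTN", b"MNSTR", b"MNT", b"MNTL", b"TK"]
jump2 = [(b"KT", 0), (b"KTN", 1), (b"MNS", 2), (b"MNT", 3), (b"TK", 5)]
jump1 = [(b"KT", 0), (b"MN", 2), (b"TK", 4)]

def scan(target, jump1, jump2, data, limit):
    found = []
    for k, (p2, start2) in enumerate(jump1):
        end2 = jump1[k + 1][1] if k + 1 < len(jump1) else len(jump2)
        # Step 1: stopped before the end of the 2-letter entry -> skip it.
        if limit_edit_distance_pos(target, p2, limit)[1] < len(p2):
            continue
        for m in range(start2, end2):
            p3, start = jump2[m]
            end = jump2[m + 1][1] if m + 1 < len(jump2) else len(data)
            # Step 2: same test on the 3-letter entry.
            if limit_edit_distance_pos(target, p3, limit)[1] < len(p3):
                continue
            # Step 3: linear scan of the data block for this prefix.
            i = start
            while i < end:
                s = data[i]
                dist, n = limit_edit_distance_pos(target, s, limit)
                i += 1
                if dist <= limit:
                    found.append(s)
                elif n < len(s):
                    skip = s[:n + 1]         # same first N + 1 letters
                    while i < end and data[i].startswith(skip):
                        i += 1
    return found

print(scan(b"KT", jump1, jump2, data, 1))   # [b'KT', b'KTN']
```

With a limit of 1 the entry "MN" is rejected in step 1, so none of the three soundslike beginning with "MN" is ever compared.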

In the next part I will describe how Aspell deals with soundslike lookup when affix compression is involved.
