gnumed-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gnumed-devel] re: Soundex


From: Tim Churches
Subject: Re: [Gnumed-devel] re: Soundex
Date: 22 Aug 2004 09:15:30 +1000

On Sun, 2004-08-22 at 15:33, sjtan wrote:
> >
> >
> >- dBASE and FoxBASE features a SOUNDEX() function that would have 
> >helped the above scenario, is it available to Postgres? Matches could 
> >be included automatically or only to expand a failed search.
> >
> >  
> >
> 
> http://www.archives.gov/research_room/genealogy/census/soundex.html
> 
> and
> 
> http://lists.gnu.org/archive/html/gnumed-devel/2003-01/msg00022.html
> 
> 
> interesting rules.

Open sourced Python code for various phonetic indexing functions such as
Soundex, and for various string similarity comparator functions, such as
edit (Levenshtein) distance, can be found in the Febrl proababilistic
record linkage project at
http://datamining.anu.edu.au/projects/linkage.html - in particular see
Section 9.2 of the Febrl manual (also downloadable from SourceForge -
see above URL). There are several better functions than Soundex
available, such as Metaphone or Double Metaphone, NYSIIS or mod_soundex.
It is often helpful to use more than one phonetic index (eg Double
metaphone plus NYSIIS, or even these functions on a reversed version of
the name - to get around the lack of robustness which these phonetic
indexing functions to initial letter errors) and then form the union of
the candidate record sets retrieved by each indexed look-up, and then
use a similarity comparator function to winnow the list down to a more
manageable set. The Winkler comparator is widely thought to be bet best
available for record linkage purposes, and that fact probably translates
to record look-up too. It is less expensive to calculate than the
Levenshtein distance.

Alternatively, if you wish to do an indexed look-up of a more general
comparison, there are several choices described in the CS literature,
particularly the work by Chilean researchers Baeza-Yates and Navarro -
they have described methods which can be implemented using SQL queries.

However, another alternative is to use a technique we we have dubbed
"n-gram indexes" (since we developed the method for our record linkage
project). We still haven't written a definitive paper on it, but it is
implemented in the Febrl software and described in the manual, and there
is a paper describing it relative retrieval performance - see
http://datamining.anu.edu.au/publications/2003/kdd03-6pages.pdf

I plan to work on an improved implementation of this technique (in
Python of course) over the next several months for use in our public
health data collection systems (where case/patient look-up and
deduplication is vital, but where we have hundreds of thousands or
millions of records) - when this work is complete you might want to
evaluate it for use in GNUmed. It might be overkill for general practice
databases with a few thousand patients, but the technique is
conceptually simple and elegant and unlike teh phonetic indexing
functions, makes no assumptions about name or string morphology and
phonetics - thus it works equally well with alphabetic names from any
culture, including Pinying Chinese names. It takes a set-theoretic
approach, and the faster, built-in set data type in Python 2.4 improves
its speed considerably.

-- 

Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B  EB37 7891 46A9 EAF9 93D0







reply via email to

[Prev in Thread] Current Thread [Next in Thread]