...another alternative is to use a technique we have dubbed
"n-gram indexes" (since we developed the method for our record linkage
project). We still haven't written a definitive paper on it, but it is
implemented in the Febrl software and described in the manual, and there
is a paper describing its relative retrieval performance - see
http://datamining.anu.edu.au/publications/2003/kdd03-6pages.pdf
I plan to work on an improved implementation of this technique (in
Python of course) over the next several months for use in our public
health data collection systems (where case/patient look-up and
deduplication are vital, but where we have hundreds of thousands or
millions of records) - when this work is complete you might want to
evaluate it for use in GNUmed. It might be overkill for general practice
databases with a few thousand patients, but the technique is
conceptually simple and elegant and, unlike the phonetic indexing
functions, makes no assumptions about name or string morphology and
phonetics - thus it works equally well with alphabetic names from any
culture, including Pinyin Chinese names. It takes a set-theoretic
approach, and the faster, built-in set data type in Python 2.4 improves
its speed considerably.
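
To make the set-theoretic idea concrete, here is a minimal sketch of an
n-gram index in Python. It is not the Febrl implementation: the names
NGramIndex and ngrams, the use of the Dice coefficient for scoring, and
the default threshold of 0.5 are all assumptions chosen for illustration.
It uses set comprehensions, so it needs a more recent Python than 2.4.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return the set of character n-grams of a lower-cased string."""
    s = text.lower()
    if len(s) < n:
        # Strings shorter than n are kept whole so they stay searchable.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


class NGramIndex:
    """Hypothetical inverted index from n-grams to sets of record ids."""

    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> set of record ids
        self.grams = {}                   # record id -> set of n-grams

    def add(self, record_id, name):
        grams = ngrams(name, self.n)
        self.grams[record_id] = grams
        for g in grams:
            self.postings[g].add(record_id)

    def search(self, query, threshold=0.5):
        """Return (record_id, score) pairs, best matches first."""
        q = ngrams(query, self.n)
        if not q:
            return []
        # Candidates: the union of the posting sets of the query's n-grams.
        candidates = set()
        for g in q:
            candidates |= self.postings.get(g, set())
        results = []
        for rid in candidates:
            r = self.grams[rid]
            # Dice coefficient on the two n-gram sets (an assumed scoring).
            score = 2.0 * len(q & r) / (len(q) + len(r))
            if score >= threshold:
                results.append((rid, score))
        results.sort(key=lambda pair: (-pair[1], pair[0]))
        return results


index = NGramIndex()
index.add(1, "Smith")
index.add(2, "Smyth")
index.add(3, "Jones")
print(index.search("smith"))  # [(1, 1.0), (2, 0.5)]
```

Nothing here depends on how a name is pronounced: "Smyth" is found for
"smith" only because the two strings share the bigrams "sm" and "th".
All the real work is done by set union and intersection, which is why a
fast built-in set type matters for this approach.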