[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
re: [Gnumed-devel] record matching
From: |
Tim Churches |
Subject: |
re: [Gnumed-devel] record matching |
Date: |
22 Aug 2004 20:01:11 +1000 |
On Mon, 2004-08-23 at 15:02, sjtan wrote:
> >
> >
> >I think , not sure, that each record's keying fields are compared to
> >the m * k sublists that are generated ( m is the blocking key count,
> >k is the average number of
> >sublists that are generated for a given threshold value) . The
> >comparison function might run as a simple string comparison on each
> >pair, starting on the next
> >character after the last match, and return true if it makes it through
> >the whole sublist. If a match occurs, each sublist is associated with a
> >list of candidate
> >record numbers for comparison , and the record that matches is added as
> >a candidate.
> >Then for the m * k sublists, the sum of the decreasing sequence j-1 to
> >1 comparisons are made for each j the size of the candidate list in a
> >sublist.
> >
> >
> this is a pretty horrible paraphrasing, trying to work out the O
> complexity of actually running the comparisons
> generated by Bigram indexing.
> the example was 'ba', 'ax', 'xt', 'te', 'er' from 'baxter' a block
> key , so there are 5 bigrams. a threshold value of 0.8 , means
> 0.8 x the number of bigrams for a block key , in this 0.8 x 5 = 4,
> so all various inorder combinations of the 4 of the 5 bigrams are
> used for comparison.
> ba ax xt te , ba xt te er, ba ax te er, ax xt te er
No, you are correct about "combinations" but combinations are by
definition unordered with no repeats (as opposed to permutations which
are ordered and permit repeats) - so if the threshold is 4 out 5 then
there are rCn combinations to check where rCn = n! / r! * (n - r)! (from
memory - I think that is correct).
>
> below is an procedure to determine the bigram sequences ; you can change
> the value of t and threshold to get different sequences.
> It seems to resemble the logic of inductive proof,
> e.g. prove the sum of the sequence(1..n) is = n (n+1) / 2 ; first
> prove for n =1, assume it's true for n, then prove for n=n'+1 by
> converging LHS and RHS.
> The methodology is to define the known basic case, and then guess the logic.
>
>
> def select( start, n , seq):
> if n == 0:
> return []
> if n == 1:
> ll = []
> for i in range (start, len(seq)):
> ll.append ( [seq[i]] )
> return ll
> ll = []
> for i in range( start, len(seq) - n + 1):
> ll2 = select ( i + 1, n -1, seq)
> for l in ll2:
> ll.append ( [ seq[i] ] + l )
> return ll
>
> if __name__=='__main__':
> t = [ 'ba', 'ax', 'xt', 'te', 'er']
> threshold = 0.6
> n = int(threshold * len(t) + 0.00001)
> print n
> l = select ( 0, n, t)
> print l
>
Um, OK, I think. We use a similar recursive function (in pure Python) to
create the combinations, or the Probstat module, which is written in C,
which is faster (but speed of calculating the combinations doesn't
matter much for a single "bigram index" look-up). Anyway, its all there
in the Febrl code which is freely downloadable and re-usable.
Of course, having formed candidate record lists by doing bigram (or
other) indexed look-ups on each field (separate lookups for surnane,
given name etc), you then need to calculate a set of weighted sums (the
"match weights") and choose the highest of these, or a set above a
threshold, as the candidate matches. You can choose the weights
deterministically or you can use a probabilistic approach to choose
statistically optimal weights. Both methods are implemented, in Python,
in Febrl. However, Febrl is intended for batch matching of many records
to many records, but the single record look-up problem is just a
degenerate case of that.
As I said, all this may be overkill for a small GP database.
OK, digging about on my hard disc I found this pure Python function to
calculate rCn combinations:
def sublist_comb(in_list, length):
"""Routine to recursively compute all combinations of sub-lists
of the given 'in_list' of length 'len'.
Based on Gagan Saksena's routines from Activestate Python
cookbook, see:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66465
and modified by Ole Nielsen, MSI ANU, November 2002
"""
sub_lists = []
in_list_len = len(in_list) # Number of elements in in_list
if (length == 0):
return [[]]
else:
for i in range(in_list_len):
sub = sublist_comb(in_list[i+1:], length-1)
for l in sub:
l.insert(0, in_list[i])
sub_lists += sub
return sub_lists
print sublist_comb(['ba','ax','xt','te','er'],4)
--
Tim C
PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B EB37 7891 46A9 EAF9 93D0