[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
re: [Gnumed-devel] record matching
From: |
sjtan |
Subject: |
re: [Gnumed-devel] record matching |
Date: |
Mon, 23 Aug 2004 15:02:49 +1000 |
User-agent: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616 |
I think , not sure, that each record's keying fields are compared to
the m * k sublists that are generated ( m is the blocking key count,
k is the average number of
sublists that are generated for a given threshold value) . The
comparison function might run as a simple string comparison on each
pair, starting on the next
character after the last match, and return true if it makes it through
the whole sublist. If a match occurs, each sublist is associated with a
list of candidate
record numbers for comparison , and the record that matches is added as
a candidate.
Then for the m * k sublists, the sum of the decreasing sequence j-1 to
1 comparisons are made for each j the size of the candidate list in a
sublist.
this is a pretty horrible paraphrasing, trying to work out the O
complexity of actually running the comparisons
generated by Bigram indexing.
the example was 'ba', 'ax', 'xt', 'te', 'er' from 'baxter' a block
key , so there are 5 bigrams. a threshold value of 0.8 , means
0.8 x the number of bigrams for a block key , in this 0.8 x 5 = 4,
so all various inorder combinations of the 4 of the 5 bigrams are
used for comparison.
ba ax xt te , ba xt te er, ba ax te er, ax xt te er
below is an procedure to determine the bigram sequences ; you can change
the value of t and threshold to get different sequences.
It seems to resemble the logic of inductive proof,
e.g. prove the sum of the sequence(1..n) is = n (n+1) / 2 ; first
prove for n =1, assume it's true for n, then prove for n=n'+1 by
converging LHS and RHS.
The methodology is to define the known basic case, and then guess the logic.
def select( start, n , seq):
if n == 0:
return []
if n == 1:
ll = []
for i in range (start, len(seq)):
ll.append ( [seq[i]] )
return ll
ll = []
for i in range( start, len(seq) - n + 1):
ll2 = select ( i + 1, n -1, seq)
for l in ll2:
ll.append ( [ seq[i] ] + l )
return ll
if __name__=='__main__':
t = [ 'ba', 'ax', 'xt', 'te', 'er']
threshold = 0.6
n = int(threshold * len(t) + 0.00001)
print n
l = select ( 0, n, t)
print l
- re: [Gnumed-devel] record matching,
sjtan <=