[Gnumed-devel] record matching : thinking out loud
From: sjtan
Subject: [Gnumed-devel] record matching : thinking out loud
Date: Sun, 22 Aug 2004 22:23:22 +1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616
Thanks to Tim Churches for the references.

The problem seems to be: how to link misspelt records that refer to
the same identity/locality.

The paper says that for n records, without grouping records for
comparison, n^2 comparisons must be made, which for large n (100,000
to millions) is expensive and time consuming (10^5 records becomes
10^10 comparisons, and the comparison function may not be simple).
The class of algorithms that deals with this problem is called
'blocking', meaning records are scanned in one pass (n operations)
and put into m blocks, with up to n/m records in each block. Then
(n/m)^2 comparisons are done for each of the m blocks, giving
m * (n/m)^2 = n^2 / m comparisons in total. Even better, a cluster
with c nodes could parallelize block processing, giving n^2 / (m * c)
comparisons per node. So for, say, a cluster of 20 computers and
10,000 blocks, m * c = 200,000 = 2 x 10^5, and 10^5 records can now
be processed for the cost of 10^6 non-simple comparisons in total
(5 x 10^4 per node), instead of 10^10.
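Just to check that arithmetic with a quick Python sketch:

    # Rough comparison counts for naive vs. blocked record linkage.
    n = 10 ** 5  # records
    m = 10 ** 4  # blocks
    c = 20       # cluster nodes

    naive = n ** 2                # 10^10 pairwise comparisons
    blocked = m * (n // m) ** 2   # n^2 / m = 10^6 comparisons in total
    per_node = blocked // c       # 5 x 10^4 comparisons per node

    print(naive, blocked, per_node)  # 10000000000 1000000 50000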
The paper talks about Standard Blocking, "Sort/Window" (actually it's
Sorted Neighbourhood) Blocking, and Bigram Blocking, a subtype of
which is Canopy Clustering with a Term Frequency / Inverse Document
Frequency (TF-IDF) threshold.
This is what I could glean:
- Standard blocking uses a key, e.g. a fixed-length prefix
combination of the first 4 letters of the surname and the first 4
letters of the firstname, to make the blocks (e.g. select id_identity
from names order by prefixNames(lastnames, 4, firstnames, 4)).
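A minimal sketch of that keying step in Python (prefix_key and the
record layout here are my own invention for illustration, not GNUmed
code):

    from collections import defaultdict

    def prefix_key(lastname, firstname, k=4):
        # Blocking key: first k letters of surname + first k of firstname.
        return (lastname[:k] + firstname[:k]).lower()

    def standard_blocking(records, k=4):
        # records: iterable of (id_identity, lastname, firstname) tuples.
        blocks = defaultdict(list)
        for rec_id, last, first in records:
            blocks[prefix_key(last, first, k)].append(rec_id)
        return blocks  # pairwise comparisons then run within each block only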
- Sorted Neighbourhood sorts on a key and then selects with a
fixed-length window (select id_identity from names order by
prefixNames(lastnames, 4, firstnames, 4) limit w offset i*w).
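A Python sketch of the same idea; note that the classic Sorted
Neighbourhood slides a window of size w one record at a time over the
sorted list, whereas the limit/offset query above gives disjoint
chunks and so misses pairs that straddle a chunk boundary:

    def sorted_neighbourhood_pairs(records, key, w):
        # records: list of (id_identity, lastname, firstname) tuples;
        # key: e.g. lambda r: (r[1][:4] + r[2][:4]).lower()
        ordered = sorted(records, key=key)
        for i, rec in enumerate(ordered):
            # Compare each record with the next w - 1 records only.
            for other in ordered[i + 1 : i + w]:
                yield rec[0], other[0]  # candidate pair of ids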
- Bigram Index blocking takes each key and chops it into overlapping
2-character strings, or bigrams ('baxter' becomes 'ba', 'ax', 'xt',
'te', 'er'), and uses a threshold value between 0.0 and 1.0 which is
multiplied by the total length of each list of bigrams (the above has
5 bigrams, so if the threshold is 0.8, 5 x 0.8 = 4, which will be the
length of the sublists to find). The set of sublists of length n
(threshold x original length) is then found. The sublists are the
in-order combinations of n bigrams that can be generated from the
original key's bigrams (e.g. 'ba', 'xt', 'te', 'er' would be valid
for threshold 0.8 on the original 5-bigram list, but 'xt', 'ba',
'te', 'er' is out of order, so isn't valid).
I think, though I'm not sure, that each record's keying fields are
compared to the m * k sublists that are generated (m is the blocking
key count, k is the average number of sublists generated for a given
threshold value). The comparison function might run as a simple
string comparison on each bigram, starting at the next character
after the last match, and return true if it makes it through the
whole sublist. Each sublist is associated with a list of candidate
record numbers for comparison, and when a match occurs the matching
record is added as a candidate. Then, for each of the m * k sublists
whose candidate list has size j, the decreasing sequence of
(j-1) + (j-2) + ... + 1 = j(j-1)/2 pairwise comparisons is made.
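Here's my reading of the indexing step as a Python sketch; it's based
only on the description above, so treat with suspicion (thresholds
near 1.0 keep the number of sublists per key manageable):

    from itertools import combinations

    def bigrams(key):
        # 'baxter' -> ['ba', 'ax', 'xt', 'te', 'er']
        return [key[i:i + 2] for i in range(len(key) - 1)]

    def threshold_sublists(key, threshold=0.8):
        # All in-order combinations of round(threshold * count) bigrams;
        # itertools.combinations preserves the original ordering.
        grams = bigrams(key)
        n = int(round(threshold * len(grams)))
        return [tuple(sub) for sub in combinations(grams, n)]

    def bigram_index(records, threshold=0.8):
        # Map each sublist to the candidate record ids that share it.
        index = {}
        for rec_id, key in records:
            for sub in threshold_sublists(key, threshold):
                index.setdefault(sub, []).append(rec_id)
        return index  # each bucket is then compared pairwise, as above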
- Canopy clustering with TF-IDF blocking is not intuitive from what
the paper describes; it says it chooses a record at random as a
blocking key, adds records within a loose threshold function, and
then removes records, including the chosen random key, which are
within a tight threshold function; the threshold function has
something to do with term frequency / inverse document frequency,
which is unexplained. The result is the block for comparison.
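For what it's worth, here is the canopy loop as I understand it; a
sketch only, with similarity() standing in for the unexplained TF-IDF
measure and loose/tight as thresholds on it (loose <= tight):

    import random

    def canopies(record_ids, similarity, loose, tight):
        # similarity(a, b) returns a value in [0.0, 1.0]; records at
        # least 'loose' similar to the centre join its canopy, records
        # at least 'tight' similar stop being candidates for centres.
        pool = set(record_ids)
        blocks = []
        while pool:
            centre = random.choice(list(pool))  # the randomized blocking key
            canopy = {r for r in pool if similarity(centre, r) >= loose}
            blocks.append(canopy)               # one block for comparison
            pool -= {r for r in canopy if similarity(centre, r) >= tight}
            pool.discard(centre)                # guarantee termination
        return blocks

Note that a record can land in more than one canopy, so the resulting
blocks overlap.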
There was a description of the Winkler/Jaro function, which I found
with Google; it is a string comparison function with the formula
s = 1/3 * (c/m + c/n + (c - t)/c), where m and n are the lengths of
the 2 strings to compare, c is the number of common characters, and
t is the number of out-of-order transpositions of characters. This
function is the average of three terms which are each between 0.0
and 1.0, so s is in 0.0 to 1.0. For identical strings, c = m = n and
t = 0, so s = 1. Finding t sounds involved: it means comparing pairs
of characters from the sequence of common characters in each string
and counting how many are out of order. The result of the function
is then compared to a threshold for matching, e.g. 0.9, and then
used to decide whether to link the records or not.
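A direct transcription of the formula as a Python sketch; the rule
that 'common' characters must sit within half the longer string's
length of each other comes from the usual definition of Jaro, not
from the paper:

    def jaro(s1, s2):
        # s = 1/3 * (c/m + c/n + (c - t)/c); c = common characters,
        # t = transpositions (half the count of common characters that
        # are out of order between the two strings).
        m, n = len(s1), len(s2)
        if m == 0 or n == 0:
            return 0.0
        window = max(max(m, n) // 2 - 1, 0)  # max distance for a common char
        match1, match2 = [False] * m, [False] * n
        c = 0
        for i, ch in enumerate(s1):
            lo, hi = max(0, i - window), min(n, i + window + 1)
            for j in range(lo, hi):
                if not match2[j] and s2[j] == ch:
                    match1[i] = match2[j] = True
                    c += 1
                    break
        if c == 0:
            return 0.0
        # Walk the common characters of both strings in order and count
        # the positions where they disagree; t is half that count.
        k = t = 0
        for i in range(m):
            if match1[i]:
                while not match2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    t += 1
                k += 1
        t //= 2
        return (c / m + c / n + (c - t) / c) / 3.0

    # e.g. treat the pair as a match only if jaro(a, b) >= 0.9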
Any clarifications appreciated.
BTW, here's another use case: results come in to be viewed on the
EHR in use at a practice I work in, but sometimes the result filer
can't match a result to a patient, and prompts with a list of
identities, from which the user chooses one to file against. It might
do this each time, and doesn't remember the patient chosen between
result batches. This might be a good thing, in case one's choice is
wrong for a session. Sometimes a result comes in for another
practice's patient, and there is nowhere to put it, but the EHR
allows one to delete the result.