[Gnumed-devel] record matching : thinking out loud
From: sjtan
Subject: [Gnumed-devel] record matching : thinking out loud
Date: Sun, 22 Aug 2004 22:23:22 +1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616
Thanks to Tim Churches for the references.

The problem seems to be: how to link misspelt records that refer to
the same identity/locality.

The paper says that for n records, without grouping records for
comparison, n^2 comparisons must be made, which for large n (100,000
to millions) is expensive and time consuming (10^5 records becomes
10^10 comparisons, and the comparison function may not be simple).
The class of algorithms that deals with this problem is called
'blocking', meaning records are scanned in one pass (n operations)
and put into m blocks, with up to n/m records in each block. Then
(n/m)^2 comparisons are done for each of the m blocks, giving
m * (n/m)^2 = n^2 / m comparisons in total. Even better, a cluster
with c nodes could parallelize block processing, giving n^2 / (m * c)
comparisons per node. So for, say, a cluster of 20 computers and
10,000 blocks, m * c = 200,000 = 2 x 10^5, and 10^5 records can now
be processed for the cost of 10^6 non-simple comparisons in total
(5 x 10^4 per node), instead of 10^10.
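Just to check that arithmetic with a quick Python sketch:

    # Rough comparison counts for naive vs. blocked record linkage.
    n = 10 ** 5  # records
    m = 10 ** 4  # blocks
    c = 20       # cluster nodes

    naive = n ** 2                # 10^10 pairwise comparisons
    blocked = m * (n // m) ** 2   # n^2 / m = 10^6 comparisons in total
    per_node = blocked // c       # 5 x 10^4 comparisons per node

    print(naive, blocked, per_node)  # 10000000000 1000000 50000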
The paper talks about Standard Blocking, "Sort/Window" (actually it's
Sorted Neighbourhood) Blocking, and Bigram Blocking, a subtype of
which is Canopy Clustering with a Term Frequency / Inverse Document
Frequency (TF-IDF) threshold.
This is what I could glean:
- Standard blocking uses a key, e.g. a fixed-length prefix
combination of the first 4 letters of the surname and the first 4
letters of the firstname, to make the blocks (e.g. select id_identity
from names order by prefixNames(lastnames, 4, firstnames, 4)).
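A minimal sketch of that keying step in Python (prefix_key and the
record layout here are my own invention for illustration, not GNUmed
code):

    from collections import defaultdict

    def prefix_key(lastname, firstname, k=4):
        # Blocking key: first k letters of surname + first k of firstname.
        return (lastname[:k] + firstname[:k]).lower()

    def standard_blocking(records, k=4):
        # records: iterable of (id_identity, lastname, firstname) tuples.
        blocks = defaultdict(list)
        for rec_id, last, first in records:
            blocks[prefix_key(last, first, k)].append(rec_id)
        return blocks  # pairwise comparisons then run within each block only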
- Sorted Neighbourhood sorts on a key and then selects with a
fixed-length window (select id_identity from names order by
prefixNames(lastnames, 4, firstnames, 4) limit w offset i*w).
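A Python sketch of the same idea; note that the classic Sorted
Neighbourhood slides a window of size w one record at a time over the
sorted list, whereas the limit/offset query above gives disjoint
chunks and so misses pairs that straddle a chunk boundary:

    def sorted_neighbourhood_pairs(records, key, w):
        # records: list of (id_identity, lastname, firstname) tuples;
        # key: e.g. lambda r: (r[1][:4] + r[2][:4]).lower()
        ordered = sorted(records, key=key)
        for i, rec in enumerate(ordered):
            # Compare each record with the next w - 1 records only.
            for other in ordered[i + 1 : i + w]:
                yield rec[0], other[0]  # candidate pair of ids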
- Bigram Index blocking takes each key and chops it into overlapping
2-character strings, or bigrams ('baxter' becomes 'ba', 'ax', 'xt',
'te', 'er'), and uses a threshold value between 0.0 and 1.0 which is
multiplied by the total length of each list of bigrams (the above has
5 bigrams, so if the threshold is 0.8, 5 x 0.8 = 4, which will be the
length of the sublists to find). The set of sublists of length n
(threshold x original length) is then found. The sublists are the
in-order combinations of n bigrams that can be generated from the
original key's bigrams (e.g. 'ba', 'xt', 'te', 'er' would be valid
for threshold 0.8 on the original 5-bigram list, but 'xt', 'ba',
'te', 'er' is out of order, so isn't valid).
I think, though I'm not sure, that each record's keying fields are
compared to the m * k sublists that are generated (m is the blocking
key count, k is the average number of sublists generated for a given
threshold value). The comparison function might run as a simple
string comparison on each bigram, starting at the next character
after the last match, and return true if it makes it through the
whole sublist. Each sublist is associated with a list of candidate
record numbers for comparison, and when a match occurs the matching
record is added as a candidate. Then, for each of the m * k sublists
whose candidate list has size j, the decreasing sequence of
(j-1) + (j-2) + ... + 1 = j(j-1)/2 pairwise comparisons is made.
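Here's my reading of the indexing step as a Python sketch; it's based
only on the description above, so treat with suspicion (thresholds
near 1.0 keep the number of sublists per key manageable):

    from itertools import combinations

    def bigrams(key):
        # 'baxter' -> ['ba', 'ax', 'xt', 'te', 'er']
        return [key[i:i + 2] for i in range(len(key) - 1)]

    def threshold_sublists(key, threshold=0.8):
        # All in-order combinations of round(threshold * count) bigrams;
        # itertools.combinations preserves the original ordering.
        grams = bigrams(key)
        n = int(round(threshold * len(grams)))
        return [tuple(sub) for sub in combinations(grams, n)]

    def bigram_index(records, threshold=0.8):
        # Map each sublist to the candidate record ids that share it.
        index = {}
        for rec_id, key in records:
            for sub in threshold_sublists(key, threshold):
                index.setdefault(sub, []).append(rec_id)
        return index  # each bucket is then compared pairwise, as above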
- Canopy clustering with TF-IDF blocking is not intuitive from what
the paper describes; it says it chooses a record at random as a
blocking key, adds records within a loose threshold function, and
then removes records, including the chosen random key, which are
within a tight threshold function; the threshold function has
something to do with term frequency / inverse document frequency,
which is unexplained. The result is the block for comparison.
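For what it's worth, here is the canopy loop as I understand it; a
sketch only, with similarity() standing in for the unexplained TF-IDF
measure and loose/tight as thresholds on it (loose <= tight):

    import random

    def canopies(record_ids, similarity, loose, tight):
        # similarity(a, b) returns a value in [0.0, 1.0]; records at
        # least 'loose' similar to the centre join its canopy, records
        # at least 'tight' similar stop being candidates for centres.
        pool = set(record_ids)
        blocks = []
        while pool:
            centre = random.choice(list(pool))  # the randomized blocking key
            canopy = {r for r in pool if similarity(centre, r) >= loose}
            blocks.append(canopy)               # one block for comparison
            pool -= {r for r in canopy if similarity(centre, r) >= tight}
            pool.discard(centre)                # guarantee termination
        return blocks

Note that a record can land in more than one canopy, so the resulting
blocks overlap.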
There was a description of the Winkler/Jaro function, which I found
with Google; it is a string comparison function with the formula
s = 1/3 * (c/m + c/n + (c - t)/c), where m and n are the lengths of
the 2 strings to compare, c is the number of common characters, and
t is the number of out-of-order transpositions of characters. This
function is the average of three terms which are each between 0.0
and 1.0, so s is in 0.0 to 1.0. For identical strings, c = m = n and
t = 0, so s = 1. Finding t sounds involved: it means comparing pairs
of characters from the sequence of common characters in each string
and counting how many are out of order. The result of the function
is then compared to a threshold for matching, e.g. 0.9, and then
used to decide whether to link the records or not.
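A direct transcription of the formula as a Python sketch; the rule
that 'common' characters must sit within half the longer string's
length of each other comes from the usual definition of Jaro, not
from the paper:

    def jaro(s1, s2):
        # s = 1/3 * (c/m + c/n + (c - t)/c); c = common characters,
        # t = transpositions (half the count of common characters that
        # are out of order between the two strings).
        m, n = len(s1), len(s2)
        if m == 0 or n == 0:
            return 0.0
        window = max(max(m, n) // 2 - 1, 0)  # max distance for a common char
        match1, match2 = [False] * m, [False] * n
        c = 0
        for i, ch in enumerate(s1):
            lo, hi = max(0, i - window), min(n, i + window + 1)
            for j in range(lo, hi):
                if not match2[j] and s2[j] == ch:
                    match1[i] = match2[j] = True
                    c += 1
                    break
        if c == 0:
            return 0.0
        # Walk the common characters of both strings in order and count
        # the positions where they disagree; t is half that count.
        k = t = 0
        for i in range(m):
            if match1[i]:
                while not match2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    t += 1
                k += 1
        t //= 2
        return (c / m + c / n + (c - t) / c) / 3.0

    # e.g. treat the pair as a match only if jaro(a, b) >= 0.9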
Any clarifications appreciated.
BTW, here's another use case: results come in to be viewed on the
EHR in use at a practice I work in, but sometimes the result filer
can't match a result to a patient, and prompts with a list of
identities, from which the user chooses one to file against. It might
do this each time, and doesn't remember the patient chosen between
result batches. This might be a good thing, in case one's choice is
wrong for a session. Sometimes a result comes in for another
practice's patient, and there is nowhere to put it, but the EHR
allows one to delete the result.