Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Using Fuzzy Lookups for Record Linkage


Using Fuzzy Lookups for Record Linkage

Author
Message
bnordberg
bnordberg
Old Hand
Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)

Group: General Forum Members
Points: 311 Visits: 569
Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/bnordberg/3105.asp



Peter Malloy
Peter Malloy
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 3
Thanks for this quick review of the fuzzy logic transform in SSIS. Very helpful for patient identification/matching in health care organizations like ours. We're a community health center in Boston that has lots of patient matching needs.
Mike Dougherty-384281
Mike Dougherty-384281
Old Hand
Old Hand (342 reputation)Old Hand (342 reputation)Old Hand (342 reputation)Old Hand (342 reputation)Old Hand (342 reputation)Old Hand (342 reputation)Old Hand (342 reputation)Old Hand (342 reputation)

Group: General Forum Members
Points: 342 Visits: 944
"I will have to generate a work list for someone to review"

And what is the confidence factor of the reviewer? (since this is likely to be an unpleasant task that may end up being an intern or a temp)

It's an interesting concept, but I would feel much better seeing it used for marketing analysis/data mining rather than patient records. When my new doctor's office requested medical files from my old doctor's office, they recieved information for a patient that was not me (but had the same first and last name) - I assumed it was human error. The intentional use of 'fuzzy' record matching at the machine/DB level seems unwise.

I would make a stronger argument for information integrity at entry rather than risk "fuzziness" as a systemic means of mitigating sloppy data. Of course that's a different topic and a considerable amount of work in itself.

.. and not to be completely down on the author of this article - the screenshots and examples make it obvious there is considerable power available with this technique. (a feature I did not know, and would likely have tried to home-grow if needed) I feel this is the same kind of power as using index hints or NoLock directives: if used properly it's a great asset, if used improperly it's a great liability.
Randy Farmer
Randy Farmer
Forum Newbie
Forum Newbie (3 reputation)Forum Newbie (3 reputation)Forum Newbie (3 reputation)Forum Newbie (3 reputation)Forum Newbie (3 reputation)Forum Newbie (3 reputation)Forum Newbie (3 reputation)Forum Newbie (3 reputation)

Group: General Forum Members
Points: 3 Visits: 14

It is worth mentioning that these "Fuzzy" features require an Enterprise SQL edition.

http://msdn2.microsoft.com/en-us/library/ms143761.aspx


Barry Ezell
Barry Ezell
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 6
In answer to Mr. Dougherty, it is always preferable to create correct record linkages from the start. However, fuzzy matching as described here is sometimes a necessity, especially when two recordsets come from completely different sources but have some overlap in the entities represented. A classic example would be finding common patients in two different vaccination campaigns with no collaboration. Matching through inner joins will both miss true matches and cause duplicates in the output.

I very much enjoyed this article. The SQL Server 2005 fuzzy record linking implementation is a good start and will suffice for many users. My company, Balance Engines, has been doing similar reconciliations for the cellular and oil industries for five years. Other factors that our software addresses are fuzzy lookups that aren't based on string "edit distance" such as custom date and Double comparisons and rules-based match score modifications. The review process mentioned in the article is absolutely essential because fuzzy comparisons are always an imperfect form of machine learning. However, they can be extremely useful in comparing large recordsets that would be cost prohibitive to review in a completely manual way. By eliminating all but a few percent of sure matches or non-matches, the labor required to review the rest can be made affordable.

I love talking about record matching issues. Feel free to contact me at bezell balanceengines.com
bnordberg
bnordberg
Old Hand
Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)

Group: General Forum Members
Points: 311 Visits: 569

I agree with Mr Dougherty, in a perfect world we would not want or need to do this. Unfortunately when you get patients in the ER, getting a correct spelling of their name is second to stopping the bleeding. I've been using probabilistic techniques for years - particularly with EMS (emergency medical services), where the patient may not even be able to give a name.

I found that once we get below .80 the matches are of variable use. But it depends on the fields you have to match with, and the discriminating power of each within the set.

Hyphenated last names (for example Nordberg vs Halbe-Nordberg) are the worst as they produce very weak matches, but the match is evident when human reviewed - hence the need for human intervention.





Aaron N. Cutshall
Aaron N. Cutshall
SSCommitted
SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)

Group: General Forum Members
Points: 1621 Visits: 967
Brian,

It's an interesting approach and one that I wish I could use, however this doesn't allow me to establish conditional matching as well. Therefore, I'm attempting to do record grouping within T-SQL. While I can identify the matches easily enough for a single record, I'm having trouble doing so in bulk. I cannot seem to create unique group identifiers for each set of matches.

If I take each record one at a time and compare against all records, I get a reasonably accurate match rate, but it takes almost 24 hours to group 90,000 records (and that's just my sample size!!). I currently have 23 various tests with a decreasing confidence level to determine if records should be grouped together. I considered using a RANK() OVER partition method, but the ORDER BY portion is not precise enough to account for allowable variances (such as month and day being reversed in a DOB).

Does anyone have any suggestions?

Aaron


"...when ye are in the service of your fellow beings ye are only in the service of your God." -- Mosiah 2:17
frecal
frecal
SSC Rookie
SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)

Group: General Forum Members
Points: 44 Visits: 287
Hi not sure if you will get this post or not but here goes.
I have started working on a Linkage Project and was mucking around with the Fuzzy Matching in sql 2008. Do you know if there is a way of sort of fuzzy matching on date ie birthdate which is a common field in different data sets. I guess it would do a check to see how far it was out by and allocate a score.
thanks
Fred
bnordberg
bnordberg
Old Hand
Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)Old Hand (311 reputation)

Group: General Forum Members
Points: 311 Visits: 569
Yes date of birth is always good to use. And it will allocate a proximity score. Unfortunatly DOB's can differ significantly with simple typos e.g: 5/11/1972 vs 5/11/1927 or the more common 5/11/1972 vs 11/5/1972... So make sure you have other, more reliable variables to include (gender, names ...)



frecal
frecal
SSC Rookie
SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)

Group: General Forum Members
Points: 44 Visits: 287
thanks for your reply. why is that Date fields will only display for exact matches?
I cant set it to fuzzy matching.
the fields i use are firstname, surname, sex and Date of birth. but unfortunately the Date of birth only looks at exact matches.
any ideas?

thanks again
Fred
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search