﻿<?xml version='1.0' encoding='UTF-8'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>SQLServerCentral / Discuss Content Posted by Brian Nordberg / Article Discussions / Article Discussions by Author  / Using Fuzzy Lookups for Record Linkage / Latest Posts</title><generator>InstantForum.NET v2.9.0</generator><description>SQLServerCentral</description><link>http://www.sqlservercentral.com/Forums/</link><webMaster>notifications@sqlservercentral.com</webMaster><lastBuildDate>Mon, 20 May 2013 10:09:01 GMT</lastBuildDate><ttl>20</ttl><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Hello Brian, et al,Thank you for time in this article. It was very helpful.For posterity's sake, I did want to point out what appears to be a small, but important, wrinkle in the settings provided in your example that I think are worth a mentioning.In your example, the Similarity Threshold at the component level on the Advanced Tab is set to .52. However, the Minimum Similarity at the join level for the MPI_PT_dob field (date of birth) is set to .2.According to MSDN:[i]To satisfy the similarity that is specified at the component level, all rows must have a similarity across all matches that is greater than or equal to the similarity threshold that is specified at the component level. That is, you cannot specify a very close match at the component level unless the matches at the row or join level are equally close.[/i][u][url=http://msdn.microsoft.com/en-us/library/ms137786.aspx]Fuzzy Lookup Transformation[/url][/u]Now what I got from that was if a .3 similarity match is made at the join level, which is higher than the threshold you have set of .2, Fuzzy Lookup will still not return the record because the similarity at the component level is .52. I did a quick test and this is, in fact, the case.It appears that each similarity at the join level acts like an additional filter on each field at the join level to further refine the lookup's results, after the similarity at the component level has already reduced the matches for the entire row.</description><pubDate>Fri, 13 Aug 2010 16:40:16 GMT</pubDate><dc:creator>Langston Montgomery</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Hi Brianwhat did you mean by &amp;gt;Can you script them out? I know it is error prone such as Mary Anne is Anne the middle ... Then once you script them to their own field find the person that made the database and give them a good tutorial on atomicity of fields! regardsFred</description><pubDate>Tue, 14 Apr 2009 17:23:25 GMT</pubDate><dc:creator>frecal</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>I think thats where my problem is..as you said if Mary Jane is in one firstname field and the comparing Firstname has Only Mary. it doesnt seem to match it. I think I need to muck around with the weights.</description><pubDate>Tue, 14 Apr 2009 17:20:43 GMT</pubDate><dc:creator>frecal</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>We'd probably have to time travel back to the 70's to chat with the db designers of most hospital EMR systems. Plus all these old MUMPS/CACHE "databases" that make data retrieval real fun! I'm attempting to parse dosage's from SIG/Instruction fields from our medication tables - there are &amp;gt;25,000 different textual ways of telling someone how to take a drug!My only other recommendation would be to ensure your thresholds/weights are low for first name and higher for the other fields.B</description><pubDate>Tue, 14 Apr 2009 08:31:15 GMT</pubDate><dc:creator>bnordberg</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>[quote][b]bnordberg (4/14/2009)[/b][hr]...find the person that made the database and give them a good tutorial on atomicity of fields![/quote]If only it were that easy!!  I deal with data all the time that is very messy because folks don't care as long as it prints on paper OK.  They don't consider the benefits of putting data into the appropriate fields.  I get data from Electronic Medical Record systems and the majority of important data is entered into comment fields simply because it's easier to enter that way and as long as the printed paper has the info for the doc to read, it's OK in their minds.  Of course, that wrecks havoc on any attempt to utilize the data for any other purpose!</description><pubDate>Tue, 14 Apr 2009 08:04:26 GMT</pubDate><dc:creator>Aaron N. Cutshall</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>No I have not seen that, it could be that it is dragging the match weight down enough that it is below your threshold for a match? Regardless they will lower your scores. Can you script them out? I know it is error prone such as Mary Anne is Anne the middle ... Then once you script them to their own field find the person that made the database and give them a good tutorial on atomicity of fields! B</description><pubDate>Tue, 14 Apr 2009 07:44:20 GMT</pubDate><dc:creator>bnordberg</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Hi Brianthanks for your help with this..I have a querie i am trying to work out not sure if you have come across an issue like this. I am comparing Surname, given name and Date Of birth. seems like if the given name has a middle name included in the reference field and not in the source given name field it doesnt even seem to compare the other fields as it gives a 0 confidence.is there a particulare setting i should be using?</description><pubDate>Tue, 14 Apr 2009 00:36:46 GMT</pubDate><dc:creator>frecal</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>excellent idea thanks.was wondering how it compares when the dates are slightly different. i guess as you said as long as we have other variables like firstname, surname and gender should get a good result</description><pubDate>Sat, 11 Apr 2009 06:35:24 GMT</pubDate><dc:creator>frecal</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>I'm going to quote my own article "If your data columns are anything aside from strings (datetime, int, etc.) the Mapping Type will only allow exact."  I dislike this fact, but its true. So do a view of your data with a convert(varchar, datetimefield,101)B</description><pubDate>Fri, 10 Apr 2009 08:52:19 GMT</pubDate><dc:creator>bnordberg</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>thanks for your reply. why is that Date fields will only display for exact matches?I cant set it to fuzzy matching.the fields i use are firstname, surname, sex and Date of birth. but unfortunately the Date of birth only looks at exact matches. any ideas? thanks againFred</description><pubDate>Thu, 09 Apr 2009 23:41:11 GMT</pubDate><dc:creator>frecal</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Yes date of birth is always good to use. And it will allocate a proximity score. Unfortunatly DOB's can differ significantly with simple typos e.g:  5/11/1972   vs 5/11/1927  or the more common 5/11/1972 vs 11/5/1972... So make sure you have other, more reliable variables to include (gender, names ...)</description><pubDate>Thu, 09 Apr 2009 07:55:03 GMT</pubDate><dc:creator>bnordberg</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Hi not sure if you will get this post or not but here goes.I have started working on a Linkage Project and was mucking around with the Fuzzy Matching in sql 2008. Do you know if there is a way of sort of fuzzy matching on date ie birthdate which is a common field in different data sets. I guess it would do a check to see how far it was out by and allocate a score. thanksFred</description><pubDate>Wed, 08 Apr 2009 23:47:21 GMT</pubDate><dc:creator>frecal</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Brian,It's an interesting approach and one that I wish I could use, however this doesn't allow me to establish conditional matching as well.  Therefore, I'm attempting to do record grouping within T-SQL.  While I can identify the matches easily enough for a single record, I'm having trouble doing so in bulk.  I cannot seem to create unique group identifiers for each set of matches.If I take each record one at a time and compare against all records, I get a reasonably accurate match rate, but it takes almost 24 hours to group 90,000 records (and that's just my sample size!!).  I currently have 23 various tests with a decreasing confidence level to determine if records should be grouped together.  I considered using a RANK() OVER partition method, but the ORDER BY portion is not precise enough to account for allowable variances (such as month and day being reversed in a DOB).Does anyone have any suggestions?Aaron</description><pubDate>Wed, 18 Feb 2009 14:47:49 GMT</pubDate><dc:creator>Aaron N. Cutshall</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>&lt;P&gt;I agree with Mr Dougherty, in a perfect world we would not want or need to do this. Unfortunately when you get patients in the ER, getting a correct spelling of their name is second to stopping the bleeding. I've been using probabilistic techniques for years - particularly with EMS (emergency medical services), where the patient may not even be able to give a name. &lt;/P&gt;&lt;P&gt;I found that once we get below .80 the matches are of variable use. But it depends on the fields you have to match with, and the discriminating power of each within the set. &lt;/P&gt;&lt;P&gt;Hyphenated last names (for example Nordberg vs Halbe-Nordberg) are the worst as they produce very weak matches, but the match is evident when human reviewed - hence the need for human intervention.&lt;/P&gt;</description><pubDate>Thu, 19 Jul 2007 13:44:00 GMT</pubDate><dc:creator>bnordberg</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>In answer to Mr. Dougherty, it is always preferable to create correct record linkages from the start.  However, fuzzy matching as described here is sometimes a necessity, especially when two recordsets come from completely different sources but have some overlap in the entities represented.  A classic example would be finding common patients in two different vaccination campaigns with no collaboration.  Matching through inner joins will both miss true matches and cause duplicates in the output.  I very much enjoyed this article.  The SQL Server 2005 fuzzy record linking implementation is a good start and will suffice for many users.  My company, Balance Engines, has been doing similar reconciliations for the cellular and oil industries for five years.  Other factors that our software addresses are fuzzy lookups that aren't based on string "edit distance" such as custom date and Double comparisons and rules-based match score modifications.  The review process mentioned in the article is absolutely essential because fuzzy comparisons are always an imperfect form of machine learning.  However, they can be extremely useful in comparing large recordsets that would be cost prohibitive to review in a completely manual way.  By eliminating all but a few percent of sure matches or non-matches, the labor required to review the rest can be made affordable.I love talking about record matching issues.  Feel free to contact me at bezell &lt;at&gt; balanceengines.com</description><pubDate>Thu, 19 Jul 2007 08:15:00 GMT</pubDate><dc:creator>Barry Ezell</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>&lt;P&gt;It is worth mentioning that these "Fuzzy" features require an Enterprise SQL edition.&lt;/P&gt;&lt;P&gt;&lt;A href="http://msdn2.microsoft.com/en-us/library/ms143761.aspx"&gt;http://msdn2.microsoft.com/en-us/library/ms143761.aspx&lt;/A&gt;&lt;/P&gt;</description><pubDate>Thu, 19 Jul 2007 07:31:00 GMT</pubDate><dc:creator>Randy Farmer</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>"I will have to generate a work list for someone to review"And what is the confidence factor of the reviewer?  (since this is likely to be an unpleasant task that may end up being an intern or a temp)It's an interesting concept, but I would feel much better seeing it used for marketing analysis/data mining rather than patient records.  When my new doctor's office requested medical files from my old doctor's office, they recieved information for a patient that was not me (but had the same first and last name) - I assumed it was human error.  The intentional use of 'fuzzy' record matching at the machine/DB level seems unwise.I would make a stronger argument for information integrity at entry rather than risk "fuzziness" as a systemic means of mitigating sloppy data.  Of course that's a different topic and a considerable amount of work in itself... and not to be completely down on the author of this article - the screenshots and examples make it obvious there is considerable power available with this technique. (a feature I did not know, and would likely have tried to home-grow if needed)  I feel this is the same kind of power as using index hints or NoLock directives: if used properly it's a great asset, if used improperly it's a great liability.</description><pubDate>Thu, 19 Jul 2007 07:28:00 GMT</pubDate><dc:creator>Mike Dougherty-384281</dc:creator></item><item><title>RE: Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Thanks for this quick review of the fuzzy logic transform in SSIS.  Very helpful for patient identification/matching in health care organizations like ours.  We're a community health center in Boston that has lots of patient matching needs.</description><pubDate>Thu, 19 Jul 2007 07:11:00 GMT</pubDate><dc:creator>Peter Malloy</dc:creator></item><item><title>Using Fuzzy Lookups for Record Linkage</title><link>http://www.sqlservercentral.com/Forums/Topic382130-396-1.aspx</link><description>Comments posted here are about the content posted at &lt;A HREF="http://www.sqlservercentral.com/columnists/bnordberg/3105.asp"&gt;http://www.sqlservercentral.com/columnists/bnordberg/3105.asp&lt;/A&gt;</description><pubDate>Mon, 16 Jul 2007 16:39:00 GMT</pubDate><dc:creator>bnordberg</dc:creator></item></channel></rss>