Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Using Fuzzy Lookups for Record Linkage


Using Fuzzy Lookups for Record Linkage

Author
Message
bnordberg
bnordberg
SSC Veteran
SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)

Group: General Forum Members
Points: 287 Visits: 569
I'm going to quote my own article "If your data columns are anything aside from strings (datetime, int, etc.) the Mapping Type will only allow exact." I dislike this fact, but its true. So do a view of your data with a convert(varchar, datetimefield,101)
B



frecal
frecal
SSC Rookie
SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)

Group: General Forum Members
Points: 44 Visits: 287
excellent idea thanks.
was wondering how it compares when the dates are slightly different. i guess as you said as long as we have other variables like firstname, surname and gender should get a good result
frecal
frecal
SSC Rookie
SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)

Group: General Forum Members
Points: 44 Visits: 287
Hi Brian
thanks for your help with this..I have a querie i am trying to work out not sure if you have come across an issue like this.
I am comparing Surname, given name and Date Of birth. seems like if the given name has a middle name included in the reference field and not in the source given name field it doesnt even seem to compare the other fields as it gives a 0 confidence.
is there a particulare setting i should be using?
bnordberg
bnordberg
SSC Veteran
SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)

Group: General Forum Members
Points: 287 Visits: 569
No I have not seen that, it could be that it is dragging the match weight down enough that it is below your threshold for a match?
Regardless they will lower your scores. Can you script them out? I know it is error prone such as Mary Anne is Anne the middle ... Then once you script them to their own field find the person that made the database and give them a good tutorial on atomicity of fields!

B



Aaron N. Cutshall
Aaron N. Cutshall
SSCommitted
SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)SSCommitted (1.6K reputation)

Group: General Forum Members
Points: 1596 Visits: 964
bnordberg (4/14/2009)
...find the person that made the database and give them a good tutorial on atomicity of fields!

If only it were that easy!! I deal with data all the time that is very messy because folks don't care as long as it prints on paper OK. They don't consider the benefits of putting data into the appropriate fields. I get data from Electronic Medical Record systems and the majority of important data is entered into comment fields simply because it's easier to enter that way and as long as the printed paper has the info for the doc to read, it's OK in their minds. Of course, that wrecks havoc on any attempt to utilize the data for any other purpose!


"...when ye are in the service of your fellow beings ye are only in the service of your God." -- Mosiah 2:17
bnordberg
bnordberg
SSC Veteran
SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)SSC Veteran (287 reputation)

Group: General Forum Members
Points: 287 Visits: 569
We'd probably have to time travel back to the 70's to chat with the db designers of most hospital EMR systems. Plus all these old MUMPS/CACHE "databases" that make data retrieval real fun! I'm attempting to parse dosage's from SIG/Instruction fields from our medication tables - there are >25,000 different textual ways of telling someone how to take a drug!
My only other recommendation would be to ensure your thresholds/weights are low for first name and higher for the other fields.
B



frecal
frecal
SSC Rookie
SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)

Group: General Forum Members
Points: 44 Visits: 287
I think thats where my problem is..as you said if Mary Jane is in one firstname field and the comparing Firstname has Only Mary. it doesnt seem to match it. I think I need to muck around with the weights.
frecal
frecal
SSC Rookie
SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)SSC Rookie (44 reputation)

Group: General Forum Members
Points: 44 Visits: 287
Hi Brian

what did you mean by >Can you script them out? I know it is error prone such as Mary Anne is Anne the middle ... Then once you script them to their own field find the person that made the database and give them a good tutorial on atomicity of fields!

regards

Fred
Langston Montgomery
Langston Montgomery
SSC Rookie
SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)

Group: General Forum Members
Points: 42 Visits: 508
Hello Brian, et al,

Thank you for time in this article. It was very helpful.

For posterity's sake, I did want to point out what appears to be a small, but important, wrinkle in the settings provided in your example that I think are worth a mentioning.

In your example, the Similarity Threshold at the component level on the Advanced Tab is set to .52. However, the Minimum Similarity at the join level for the MPI_PT_dob field (date of birth) is set to .2.

According to MSDN:

To satisfy the similarity that is specified at the component level, all rows must have a similarity across all matches that is greater than or equal to the similarity threshold that is specified at the component level. That is, you cannot specify a very close match at the component level unless the matches at the row or join level are equally close.

Fuzzy Lookup Transformation

Now what I got from that was if a .3 similarity match is made at the join level, which is higher than the threshold you have set of .2, Fuzzy Lookup will still not return the record because the similarity at the component level is .52. I did a quick test and this is, in fact, the case.

It appears that each similarity at the join level acts like an additional filter on each field at the join level to further refine the lookup's results, after the similarity at the component level has already reduced the matches for the entire row.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search