Using Fuzzy Lookups for Record Linkage

  • We'd probably have to time travel back to the 70's to chat with the db designers of most hospital EMR systems. Plus all these old MUMPS/CACHE "databases" that make data retrieval real fun! I'm attempting to parse dosage's from SIG/Instruction fields from our medication tables - there are >25,000 different textual ways of telling someone how to take a drug!

    My only other recommendation would be to ensure your thresholds/weights are low for first name and higher for the other fields.


  • I think thats where my problem you said if Mary Jane is in one firstname field and the comparing Firstname has Only Mary. it doesnt seem to match it. I think I need to muck around with the weights.

  • Hi Brian

    what did you mean by >Can you script them out? I know it is error prone such as Mary Anne is Anne the middle ... Then once you script them to their own field find the person that made the database and give them a good tutorial on atomicity of fields!



  • Hello Brian, et al,

    Thank you for time in this article. It was very helpful.

    For posterity's sake, I did want to point out what appears to be a small, but important, wrinkle in the settings provided in your example that I think are worth a mentioning.

    In your example, the Similarity Threshold at the component level on the Advanced Tab is set to .52. However, the Minimum Similarity at the join level for the MPI_PT_dob field (date of birth) is set to .2.

    According to MSDN:

    To satisfy the similarity that is specified at the component level, all rows must have a similarity across all matches that is greater than or equal to the similarity threshold that is specified at the component level. That is, you cannot specify a very close match at the component level unless the matches at the row or join level are equally close.

    Fuzzy Lookup Transformation

    Now what I got from that was if a .3 similarity match is made at the join level, which is higher than the threshold you have set of .2, Fuzzy Lookup will still not return the record because the similarity at the component level is .52. I did a quick test and this is, in fact, the case.

    It appears that each similarity at the join level acts like an additional filter on each field at the join level to further refine the lookup's results, after the similarity at the component level has already reduced the matches for the entire row.

Viewing 4 posts - 16 through 18 (of 18 total)

You must be logged in to reply to this topic. Login to reply