• In answer to Mr. Dougherty, it is always preferable to create correct record linkages from the start. However, fuzzy matching as described here is sometimes a necessity, especially when two recordsets come from completely different sources but have some overlap in the entities represented. A classic example would be finding common patients in two different vaccination campaigns with no collaboration. Matching through inner joins will both miss true matches and cause duplicates in the output.

    I very much enjoyed this article. The SQL Server 2005 fuzzy record linking implementation is a good start and will suffice for many users. My company, Balance Engines, has been doing similar reconciliations for the cellular and oil industries for five years. Other factors that our software addresses are fuzzy lookups that aren't based on string "edit distance" such as custom date and Double comparisons and rules-based match score modifications. The review process mentioned in the article is absolutely essential because fuzzy comparisons are always an imperfect form of machine learning. However, they can be extremely useful in comparing large recordsets that would be cost prohibitive to review in a completely manual way. By eliminating all but a few percent of sure matches or non-matches, the labor required to review the rest can be made affordable.

    I love talking about record matching issues. Feel free to contact me at bezell balanceengines.com