• I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned-up variants once, then run Jaro-Winkler only on pairs that pass a WHERE clause (first two characters match, first two characters of the double metaphone match, etc.). Comparing everything to everything is naturally a Cartesian [O(n^2)] operation, which gets very slow, very fast, so to speak.
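
A rough sketch of that shape, assuming dbo.DoubleMetaphone and dbo.JaroWinkler are user-supplied CLR or scalar functions (neither ships with SQL Server), and with made-up table, column, and threshold values:

```sql
-- Precompute the expensive pieces once and index them.
SELECT p.PersonId,
       dbo.DoubleMetaphone(p.LastName)  AS MetaLast,   -- assumed UDF
       UPPER(LTRIM(RTRIM(p.LastName)))  AS CleanLast
INTO   #Candidates
FROM   dbo.Person AS p;

CREATE CLUSTERED INDEX IX_Candidates_Keys
    ON #Candidates (MetaLast, CleanLast);

-- Only score pairs whose blocking keys agree, instead of the full cross join.
SELECT a.PersonId, b.PersonId,
       dbo.JaroWinkler(a.CleanLast, b.CleanLast) AS Similarity
FROM   #Candidates AS a
JOIN   #Candidates AS b
  ON   LEFT(a.MetaLast, 2)  = LEFT(b.MetaLast, 2)
 AND   LEFT(a.CleanLast, 2) = LEFT(b.CleanLast, 2)
 AND   a.PersonId < b.PersonId                         -- skip self/duplicate pairs
WHERE  dbo.JaroWinkler(a.CleanLast, b.CleanLast) >= 0.90;  -- assumed cutoff
```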

    Agreed. I try to avoid the Cartesian joins during matching as much as I can, and generally require certain demographics to match exactly (an inner join on DOB + Gender instead of a cross join) to limit the result set, and then also limit the number of records I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean-up process anyway.
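
    For example, something like this, where the source/master table names, the Processed flag, and the batch size are all placeholders, and dbo.JaroWinkler is again an assumed user-defined function:

    ```sql
    -- Block on exact demographic matches, then fuzzy-match names within each block.
    SELECT TOP (50000)                        -- work through a manageable batch at a time
           s.SourceId,
           m.MasterId,
           dbo.JaroWinkler(s.LastName, m.LastName) AS Similarity
    FROM   dbo.Source AS s
    INNER JOIN dbo.Master AS m
      ON   m.DOB    = s.DOB                   -- demographics must match exactly,
     AND   m.Gender = s.Gender                -- which keeps the pair count far below a cross join
    WHERE  s.Processed = 0                    -- placeholder flag for batching
      AND  dbo.JaroWinkler(s.LastName, m.LastName) >= 0.85;   -- assumed cutoff
    ```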

    Adam Sottosanti