RE: Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL

SSC-Insane

Points: 20039

June 18, 2009 at 1:19 pm

#1012008

I agree that parsing is critical. On the other hand, how to parse is also very much an "it depends" question.

In general, for my only slightly imperfect matching I tend to combine split fields (addr1, addr2, first name, last name), clean them up (nested REPLACE() for SQL), remove extraneous elements (professional designations, etc.), remove all remaining spaces, and then compare them. With at least some matching on another selective field (license number, for instance), this produces high quality results.

For later, fuzzier passes... well, that's where splitting pieces apart, cross-matching, Jaro-Winkler, Double Metaphone, and so on come in.

As far as splitting pieces apart, for the ' ' + addr1 + ' ' + addr2 + ' ' (etc.), I have code based on U.S. addresses that tries to find all leading digits, all suite/apartment indicators (whether they be first or last), whatever's in the middle, and the first "word" of whatever's in the middle. These are then used in various combinations with City+State and/or Zip code on the address portion of the matching.

In pure T-SQL, splitting the pieces apart is fairly resource heavy.