RE: Roll Your Own SSIS Fuzzy Matching / Grouping (Jaro - Winkler)

SSCrazy

Points: 2146

March 6, 2009 at 9:18 pm

Our Toolkit provides the means to automate this process significantly, however there would always be some manual intervention. The objective is to limit the number of records sent thru the Fuzzy matching , and provide a list on near matches for use in a review process.

Actually there is a method to the Fuzzy madness. I suggest you follow a methodology that encompasses the following: Profiling, Cleansing, Parsing/Standardization ,Matching, Enrichment, Monitoring:

Profiling – Detect any anomalies, via pattern analysis, in the name field, such as To#m, or spaces in name. I have an earlier post on this using the script component http://www.sqlservercentral.com/articles/ETL/63792/ as well as components in the Melissa Data SSIS Total Data Quality Toolkit(TDQ-IT).

Cleansing – Remove theses anomalies and spaces. Same post as above.

Parsing / Standardization – Use a SSIS (Equality)Lookup or Melissa SSIS Normalize to get the “correct name” and eliminate simple equality matches. Basically you will create a “lookup” table that will serve as you set of Standardization Rules. You can take the “DDL for NameLookup”, (Send me and email) and populate it with three ‘first_name” entries of Thomas a Tom and a Terrance all pointing to a first_name_normalized value of “Thomas” all with a unique ‘name_group_id”. Tom and Thomas are already in the sample. This will handle simple translations of names. The next step will handle similar names. There are many source for creating a comprehensive table of names. In the US we have a Census list available of the most common names in each state. Another technique is to use a data profiling process known column domain profiling or value distribution. This process will create a set of unique values or first names and automate the population of your name NameLookup table.(Available in DQT Toolkit or use a SSIS Sort). Obvioulsy this approach is aqnostic and can be used for any domain , lookup or cross reference situation.

Matching – Now that we have the simple transformations or equality matches you will need to match “similar” names , such as “Tommy” or “Tomas” We take the remaining records and run them thru a Fuzzy Match process (Microsoft Fuzzy Lookup or Melissa Jaro-Winkler for Names or n-Gram for general string matching) and split the records in three groups (Match, Near Match and No Match). The only group needing “Manual Review” would be the “Near Match”.

The methodology for “manual intervention” would be to update the “Standardization Rules” table with these values. This process can also be further automated by automatically inserting the “Near Match” unique values into the NameLookup or “Standardization Rules” table and have the reviewer enter the proper “first_name_normalized”. You can also log the number of matches and statistics on the scores to help manage you matching process.

Enrichment – You may want to include gender identification for the name. You can add this to you tables or use the (Toolkit SSIS Name Component)

Monitoring – As you perform each step of the process you can collect and log the various matched or exceptions and table entries and “Score” your progress in automating the match or grouping process. I will have a video for this on the soon.

I also have a brief video on JumpstartTV using the SSIS Fuzzy Lookup. http://www.jumpstarttv.com/applied-fuzzy-lookup_73.aspx demonstrating this approach.

[font="Comic Sans MS"]Ira Warren Whiteside[/font]