Roll Your Own SSIS Fuzzy Matching / Grouping (Jaro - Winkler)

  • Comments posted to this topic are about the item Roll Your Own SSIS Fuzzy Matching / Grouping (Jaro - Winkler)

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • If anyone wants the sample tables, insert scripts and SSIS Package send me an email ira.whiteside@actualitybusinessintelligence.com

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • Thanks for the article....

    Have you done any comparison of your results against the SSIS Fuzzy Lookup Data Flow Task?

    Its just that I am working with this problem at the moment as I have records to add into a database as part of a ETL suite and I have about 12% of the records having a fuzzy lookup match using the SSIS Fuzzy Lookup task.

    Facts are stubborn things, but statistics are more pliable - Mark Twain
    Carolyn
    SQLServerSpecialists[/url]

  • The performance should be the same for the first time you run. However the Microsoft SSIS Fuzzy Lookup can create an index to use for the subsequent runs.

    I also have a SQL Server Stored Procedure version(Upcoming Article), and if you like I can send you the source code for SSIS.

    Also all of our original ABI SSIS Components will soon be available through Melissa Data. http://www.melissadata.com/dqt/total-data-quality-integration.htm

    Melissa Data SSIS Total Data Quality Toolkit(TDQ-IT) offers a wide range of SSIS data transformation and cleansing functionality including profiling, parsing, cleansing, fuzzy matching and monitoring functionality built right in to SSIS.

    We have created several short videos.

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • Will this toolkit be able to find the master record in a fuzzy match say you have a Thomas a Tom and a Terrance will it identify or make a call on which is the correct name. If it does make the decision how does it arrive at it?

    I currently see this as a manual intervention type action, where I pull the records out and put a user interface in place for a user to make this decision, but I would prefer a automated solution if it is possible.

    Facts are stubborn things, but statistics are more pliable - Mark Twain
    Carolyn
    SQLServerSpecialists[/url]

  • Our Toolkit provides the means to automate this process significantly, however there would always be some manual intervention. The objective is to limit the number of records sent thru the Fuzzy matching , and provide a list on near matches for use in a review process.

    Actually there is a method to the Fuzzy madness. I suggest you follow a methodology that encompasses the following: Profiling, Cleansing, Parsing/Standardization ,Matching, Enrichment, Monitoring:

    Profiling – Detect any anomalies, via pattern analysis, in the name field, such as To#m, or spaces in name. I have an earlier post on this using the script component http://www.sqlservercentral.com/articles/ETL/63792/ as well as components in the Melissa Data SSIS Total Data Quality Toolkit(TDQ-IT).

    Cleansing – Remove theses anomalies and spaces. Same post as above.

    Parsing / Standardization – Use a SSIS (Equality)Lookup or Melissa SSIS Normalize to get the “correct name” and eliminate simple equality matches. Basically you will create a “lookup” table that will serve as you set of Standardization Rules. You can take the “DDL for NameLookup”, (Send me and email) and populate it with three ‘first_name” entries of Thomas a Tom and a Terrance all pointing to a first_name_normalized value of “Thomas” all with a unique ‘name_group_id”. Tom and Thomas are already in the sample. This will handle simple translations of names. The next step will handle similar names. There are many source for creating a comprehensive table of names. In the US we have a Census list available of the most common names in each state. Another technique is to use a data profiling process known column domain profiling or value distribution. This process will create a set of unique values or first names and automate the population of your name NameLookup table.(Available in DQT Toolkit or use a SSIS Sort). Obvioulsy this approach is aqnostic and can be used for any domain , lookup or cross reference situation.

    Matching – Now that we have the simple transformations or equality matches you will need to match “similar” names , such as “Tommy” or “Tomas” We take the remaining records and run them thru a Fuzzy Match process (Microsoft Fuzzy Lookup or Melissa Jaro-Winkler for Names or n-Gram for general string matching) and split the records in three groups (Match, Near Match and No Match). The only group needing “Manual Review” would be the “Near Match”.

    The methodology for “manual intervention” would be to update the “Standardization Rules” table with these values. This process can also be further automated by automatically inserting the “Near Match” unique values into the NameLookup or “Standardization Rules” table and have the reviewer enter the proper “first_name_normalized”. You can also log the number of matches and statistics on the scores to help manage you matching process.

    Enrichment – You may want to include gender identification for the name. You can add this to you tables or use the (Toolkit SSIS Name Component)

    Monitoring – As you perform each step of the process you can collect and log the various matched or exceptions and table entries and “Score” your progress in automating the match or grouping process. I will have a video for this on the soon.

    I also have a brief video on JumpstartTV using the SSIS Fuzzy Lookup. http://www.jumpstarttv.com/applied-fuzzy-lookup_73.aspx demonstrating this approach.

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • Another approach would be to use SSIS Data Mining capabilities of Analysis Services http://www.microsoft.com/events/podcasts/default.aspx?topic=Topic-202f8d31-bbff-4d7e-a3f5-c4f19f935a09&audience=Audience-b046181f-3333-4c19-977e-c230ed48d9c0&seriesID=&pageId=x492, this is an excellent video by Donald Farmer. This would still require most of the Profiling, Cleansing, Parsing/Standardization ,Matching, Enrichment, Monitoring steps. You would replace the Matching(Fuzzy) step with the Data Mining Predicative capabilities of Analysis Services.

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • I've started to give this a go this afternoon, however, I've ran into problems creating the data flow diagram.

    When I click on the data type box on for one of the output columns the script compoment just closes and will not let me change anything! - I'm trying to change it to DT_STR length of 10, as specified in the article.

    I'll admit I'm a newbie to SSIS, so I did some reading before attempting this, but essentially this is my first SSIS project.

    Can anyone give me some guidance, or point me in the direction of the correct forum area if this is not it...

    My use of it is to standardise jobtitles for the record...

  • Please send me an email to iwhiteside@iwhiteside.com. We can resolve it. I'll post an answer once we have the specifics

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • Can you please post source here or please respond to my email early asking for sources...

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply