Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Roll Your Own SSIS Fuzzy Matching / Grouping (Jaro - Winkler) Expand / Collapse
Author
Message
Posted Friday, February 20, 2009 11:31 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, December 1, 2014 10:40 PM
Points: 42, Visits: 294
Comments posted to this topic are about the item Roll Your Own SSIS Fuzzy Matching / Grouping (Jaro - Winkler)

Ira Warren Whiteside
Post #661970
Posted Wednesday, March 4, 2009 8:35 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, December 1, 2014 10:40 PM
Points: 42, Visits: 294
If anyone wants the sample tables, insert scripts and SSIS Package send me an email ira.whiteside@actualitybusinessintelligence.com

Ira Warren Whiteside
Post #668860
Posted Thursday, March 5, 2009 9:40 AM
UDP Broadcaster

UDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP Broadcaster

Group: General Forum Members
Last Login: 2 days ago @ 2:55 AM
Points: 1,466, Visits: 3,036
Thanks for the article....

Have you done any comparison of your results against the SSIS Fuzzy Lookup Data Flow Task?

Its just that I am working with this problem at the moment as I have records to add into a database as part of a ETL suite and I have about 12% of the records having a fuzzy lookup match using the SSIS Fuzzy Lookup task.


Facts are stubborn things, but statistics are more pliable - Mark Twain
Carolyn
SQLServerSpecialists
Post #669356
Posted Thursday, March 5, 2009 9:56 AM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, December 1, 2014 10:40 PM
Points: 42, Visits: 294
The performance should be the same for the first time you run. However the Microsoft SSIS Fuzzy Lookup can create an index to use for the subsequent runs.

I also have a SQL Server Stored Procedure version(Upcoming Article), and if you like I can send you the source code for SSIS.

Also all of our original ABI SSIS Components will soon be available through Melissa Data. http://www.melissadata.com/dqt/total-data-quality-integration.htm


Melissa Data SSIS Total Data Quality Toolkit(TDQ-IT) offers a wide range of SSIS data transformation and cleansing functionality including profiling, parsing, cleansing, fuzzy matching and monitoring functionality built right in to SSIS.

We have created several short videos.


Ira Warren Whiteside
Post #669373
Posted Thursday, March 5, 2009 12:41 PM
UDP Broadcaster

UDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP BroadcasterUDP Broadcaster

Group: General Forum Members
Last Login: 2 days ago @ 2:55 AM
Points: 1,466, Visits: 3,036
Will this toolkit be able to find the master record in a fuzzy match say you have a Thomas a Tom and a Terrance will it identify or make a call on which is the correct name. If it does make the decision how does it arrive at it?

I currently see this as a manual intervention type action, where I pull the records out and put a user interface in place for a user to make this decision, but I would prefer a automated solution if it is possible.



Facts are stubborn things, but statistics are more pliable - Mark Twain
Carolyn
SQLServerSpecialists
Post #669547
Posted Friday, March 6, 2009 9:18 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, December 1, 2014 10:40 PM
Points: 42, Visits: 294
Our Toolkit provides the means to automate this process significantly, however there would always be some manual intervention. The objective is to limit the number of records sent thru the Fuzzy matching , and provide a list on near matches for use in a review process.

Actually there is a method to the Fuzzy madness. I suggest you follow a methodology that encompasses the following: Profiling, Cleansing, Parsing/Standardization ,Matching, Enrichment, Monitoring:

Profiling – Detect any anomalies, via pattern analysis, in the name field, such as To#m, or spaces in name. I have an earlier post on this using the script component http://www.sqlservercentral.com/articles/ETL/63792/ as well as components in the Melissa Data SSIS Total Data Quality Toolkit(TDQ-IT).

Cleansing – Remove theses anomalies and spaces. Same post as above.

Parsing / Standardization – Use a SSIS (Equality)Lookup or Melissa SSIS Normalize to get the “correct name” and eliminate simple equality matches. Basically you will create a “lookup” table that will serve as you set of Standardization Rules. You can take the “DDL for NameLookup”, (Send me and email) and populate it with three ‘first_name” entries of Thomas a Tom and a Terrance all pointing to a first_name_normalized value of “Thomas” all with a unique ‘name_group_id”. Tom and Thomas are already in the sample. This will handle simple translations of names. The next step will handle similar names. There are many source for creating a comprehensive table of names. In the US we have a Census list available of the most common names in each state. Another technique is to use a data profiling process known column domain profiling or value distribution. This process will create a set of unique values or first names and automate the population of your name NameLookup table.(Available in DQT Toolkit or use a SSIS Sort). Obvioulsy this approach is aqnostic and can be used for any domain , lookup or cross reference situation.

Matching – Now that we have the simple transformations or equality matches you will need to match “similar” names , such as “Tommy” or “Tomas” We take the remaining records and run them thru a Fuzzy Match process (Microsoft Fuzzy Lookup or Melissa Jaro-Winkler for Names or n-Gram for general string matching) and split the records in three groups (Match, Near Match and No Match). The only group needing “Manual Review” would be the “Near Match”.

The methodology for “manual intervention” would be to update the “Standardization Rules” table with these values. This process can also be further automated by automatically inserting the “Near Match” unique values into the NameLookup or “Standardization Rules” table and have the reviewer enter the proper “first_name_normalized”. You can also log the number of matches and statistics on the scores to help manage you matching process.

Enrichment – You may want to include gender identification for the name. You can add this to you tables or use the (Toolkit SSIS Name Component)

Monitoring – As you perform each step of the process you can collect and log the various matched or exceptions and table entries and “Score” your progress in automating the match or grouping process. I will have a video for this on the soon.

I also have a brief video on JumpstartTV using the SSIS Fuzzy Lookup. http://www.jumpstarttv.com/applied-fuzzy-lookup_73.aspx demonstrating this approach.


Ira Warren Whiteside
Post #670801
Posted Friday, March 6, 2009 9:38 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, December 1, 2014 10:40 PM
Points: 42, Visits: 294
Another approach would be to use SSIS Data Mining capabilities of Analysis Services http://www.microsoft.com/events/podcasts/default.aspx?topic=Topic-202f8d31-bbff-4d7e-a3f5-c4f19f935a09&audience=Audience-b046181f-3333-4c19-977e-c230ed48d9c0&seriesID=&pageId=x492, this is an excellent video by Donald Farmer. This would still require most of the Profiling, Cleansing, Parsing/Standardization ,Matching, Enrichment, Monitoring steps. You would replace the Matching(Fuzzy) step with the Data Mining Predicative capabilities of Analysis Services.

Ira Warren Whiteside
Post #670806
Posted Monday, March 30, 2009 9:58 AM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Monday, March 31, 2014 9:19 AM
Points: 314, Visits: 677
I've started to give this a go this afternoon, however, I've ran into problems creating the data flow diagram.

When I click on the data type box on for one of the output columns the script compoment just closes and will not let me change anything! - I'm trying to change it to DT_STR length of 10, as specified in the article.

I'll admit I'm a newbie to SSIS, so I did some reading before attempting this, but essentially this is my first SSIS project.

Can anyone give me some guidance, or point me in the direction of the correct forum area if this is not it...

My use of it is to standardise jobtitles for the record...
Post #686265
Posted Monday, March 30, 2009 11:05 AM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, December 1, 2014 10:40 PM
Points: 42, Visits: 294
Please send me an email to iwhiteside@iwhiteside.com. We can resolve it. I'll post an answer once we have the specifics

Ira Warren Whiteside
Post #686322
Posted Tuesday, August 20, 2013 11:13 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Tuesday, August 20, 2013 11:12 AM
Points: 1, Visits: 4
Can you please post source here or please respond to my email early asking for sources...
Post #1486375
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse