Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Advanced SSIS Fuzzy Matching via Record Linkage Methodology Expand / Collapse
Author
Message
Posted Thursday, July 5, 2012 12:06 AM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, August 18, 2014 1:48 PM
Points: 42, Visits: 290
Comments posted to this topic are about the item Advanced SSIS Fuzzy Matching via Record Linkage Methodology

Ira Warren Whiteside
Post #1325187
Posted Thursday, July 5, 2012 2:27 AM
SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Monday, August 18, 2014 6:23 AM
Points: 422, Visits: 740
A very interesting article, I'm looking forward to the follow ups as we're heading in to a fairly big matching job (merging 1/4 million education records to 100k social care records for our county).
A couple of questions (which you might be covering in the future). Does fuzzy matching work in the same way as fuzzy grouping? I had assumed that there was some blocking going on in the background on any exact matches defined but the tests in the article imlies not.
Also is there a way of getting the blocking to work in SSIS with a much larger number of blocks e.g. one for every date of birth or post code? We could probably run the fuzzy matching in a loop but that would loose the advantage of the blocks running in parallel.

Thanks
Barney
Post #1325251
Posted Thursday, July 5, 2012 12:37 PM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: 2 days ago @ 12:26 PM
Points: 371, Visits: 963
Very interesting.

Cheers
Post #1325639
Posted Thursday, July 5, 2012 3:30 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, August 12, 2014 2:26 AM
Points: 107, Visits: 703
Excellent article....

Regards,

Basit A. Farooq (MSC Computing, MCITP SQL Server 2005 & 2008, MCDBA SQL Server 2000)

http://basitaalishan.com
Post #1325746
Posted Friday, July 6, 2012 2:07 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, June 9, 2014 2:48 PM
Points: 3, Visits: 25
[Thank you for an article.
It is indeed interesting but, in practice a bit simplistic.
Matching zip codes, phones are fine as long as it is entered accurately or it is a part of current address of the person. (Assuming that person is being matched).
If for example I moved to another locations (address, zip code) then grouping would be negative effort.
Sometimes, it is hard to match person by address. You don’t know it is address change or totally different person.
Post #1326312
Posted Tuesday, July 10, 2012 7:59 PM


Mr or Mrs. 500

Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500

Group: General Forum Members
Last Login: Wednesday, August 13, 2014 12:47 PM
Points: 587, Visits: 147
Nice article Ira, kinda familiar too.


JOEL-145858, don't take the content of the article out of context. This is one of MANY methods that you can use in matching, but it doesn't represent a complete matching solution. You would obviously find as many exact matches as possible first, as any kind of fuzzy matching is cost-prohibitive comparatively. Then, using the model Ira outlined, you can perform fuzzy matching against what remains unmatched. And using geographic elements for blocking and fuzzy matching only serves as one example; The same model would fit other elements. Example:

Demographic - Block on First 2 letters of last name, year of birth / fuzzy match on FirstName, LastName, DOB
Demo / Geo - Block on FirstName, DOB, State / Fuzzy Match on FirstName, LastName, Address, City, Zip

The whole point of the parallel blocks is to minimize your comparison set, and thereby the number of potential combinations. In matching solutions I have done using this exact method, multiple iterations of this model with different criteria served our matching needs very well.

P.S - the method is also highly scalable if you have the processing power and memory on your SSIS Box.


Joshua T. Lewis
Post #1328008
Posted Tuesday, July 10, 2012 8:19 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, August 18, 2014 1:48 PM
Points: 42, Visits: 290
Mr DeusExDatum is exactly "spot on" and I highly recommend that you follow him, he has very deep and extensive experience and I have had the pleasure in working with him.

The reason I focus on methodology first in my article and a specific SSIS technique was to emphasis the entire process of record linkage, obviously you would get all the exact matches out of the mix first and cycle thru the remaing with Fuzzy and utilitize various blocking criteria and even develop multiple scoring.

Mr DeusExDatum has implemented this method in real world solution.

If I am able to do a Part Duex of this topic I will seek Mr DeusExDatum contribution.


Ira Warren Whiteside
Post #1328011
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse