Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Advanced SSIS Fuzzy Matching via Record Linkage Methodology


Advanced SSIS Fuzzy Matching via Record Linkage Methodology

Author
Message
Ira Warren Whiteside
Ira Warren Whiteside
SSC Journeyman
SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)

Group: General Forum Members
Points: 76 Visits: 296
Comments posted to this topic are about the item Advanced SSIS Fuzzy Matching via Record Linkage Methodology

Ira Warren Whiteside
BarneyL
BarneyL
Mr or Mrs. 500
Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)Mr or Mrs. 500 (515 reputation)

Group: General Forum Members
Points: 515 Visits: 971
A very interesting article, I'm looking forward to the follow ups as we're heading in to a fairly big matching job (merging 1/4 million education records to 100k social care records for our county).
A couple of questions (which you might be covering in the future). Does fuzzy matching work in the same way as fuzzy grouping? I had assumed that there was some blocking going on in the background on any exact matches defined but the tests in the article imlies not.
Also is there a way of getting the blocking to work in SSIS with a much larger number of blocks e.g. one for every date of birth or post code? We could probably run the fuzzy matching in a loop but that would loose the advantage of the blocks running in parallel.

Thanks
Barney
jfogel
jfogel
SSC-Addicted
SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)SSC-Addicted (493 reputation)

Group: General Forum Members
Points: 493 Visits: 1157
Very interesting.

Cheers
Basit Farooq
Basit Farooq
SSC-Enthusiastic
SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)SSC-Enthusiastic (157 reputation)

Group: General Forum Members
Points: 157 Visits: 805
Excellent article....

Regards,

Basit A. Farooq (MSC Computing, MCITP SQL Server 2005 & 2008, MCDBA SQL Server 2000)

http://basitaalishan.com
JOEL-145858
JOEL-145858
Forum Newbie
Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)

Group: General Forum Members
Points: 5 Visits: 25
[Thank you for an article.
It is indeed interesting but, in practice a bit simplistic.
Matching zip codes, phones are fine as long as it is entered accurately or it is a part of current address of the person. (Assuming that person is being matched).
If for example I moved to another locations (address, zip code) then grouping would be negative effort.
Sometimes, it is hard to match person by address. You don’t know it is address change or totally different person.
DeusExDatum
DeusExDatum
Mr or Mrs. 500
Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)Mr or Mrs. 500 (598 reputation)

Group: General Forum Members
Points: 598 Visits: 187
Nice article Ira, kinda familiar too. :-)


JOEL-145858, don't take the content of the article out of context. This is one of MANY methods that you can use in matching, but it doesn't represent a complete matching solution. You would obviously find as many exact matches as possible first, as any kind of fuzzy matching is cost-prohibitive comparatively. Then, using the model Ira outlined, you can perform fuzzy matching against what remains unmatched. And using geographic elements for blocking and fuzzy matching only serves as one example; The same model would fit other elements. Example:

Demographic - Block on First 2 letters of last name, year of birth / fuzzy match on FirstName, LastName, DOB
Demo / Geo - Block on FirstName, DOB, State / Fuzzy Match on FirstName, LastName, Address, City, Zip

The whole point of the parallel blocks is to minimize your comparison set, and thereby the number of potential combinations. In matching solutions I have done using this exact method, multiple iterations of this model with different criteria served our matching needs very well.

P.S - the method is also highly scalable if you have the processing power and memory on your SSIS Box.

Joshua T. Lewis
Ira Warren Whiteside
Ira Warren Whiteside
SSC Journeyman
SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)SSC Journeyman (76 reputation)

Group: General Forum Members
Points: 76 Visits: 296
Mr DeusExDatum is exactly "spot on" and I highly recommend that you follow him, he has very deep and extensive experience and I have had the pleasure in working with him.

The reason I focus on methodology first in my article and a specific SSIS technique was to emphasis the entire process of record linkage, obviously you would get all the exact matches out of the mix first and cycle thru the remaing with Fuzzy and utilitize various blocking criteria and even develop multiple scoring.

Mr DeusExDatum has implemented this method in real world solution.

If I am able to do a Part Duex of this topic I will seek Mr DeusExDatum contribution.

Ira Warren Whiteside
ralphpawne
ralphpawne
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 0
For fuzzy matching tools, I've come across DataMatch by Data Ladder, which is an excellent fuzzy matching and record linkage tool used across business and would work really well for this situation. They offer a complimentary trial for new users.

In fact, an independent verified evaluation was done of the software comparing it to major software tools by IBM and SAS. There was a study done at Curtin University Centre for Data Linkage in Australia that simulated the matching of 4.4 Million records. It identified what providers had in terms of accuracy (Number of matches found vs available. Number of false matches)

1. DataMatch Enterprise, Highest Accuracy (>95%), Very Fast, Low Cost
2. IBM Quality Stage , high accuracy (>90%), Very Fast, High Cost (>$100K)
3. SAS Data Flux, Medium Accuracy (>85%), Fast, High Cost (>100K)

Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search