Advanced SSIS Fuzzy Matching via Record Linkage Methodology

Ira Warren Whiteside

SSCrazy

Points: 2146
July 5, 2012 at 12:06 am

#258839

Comments posted to this topic are about the item Advanced SSIS Fuzzy Matching via Record Linkage Methodology
Ira Warren Whiteside


BarneyL Ten Centuries Points: 1044 · Answer 1

A very interesting article. I'm looking forward to the follow-ups, as we're heading into a fairly big matching job (merging 1/4 million education records to 100k social care records for our county).

A couple of questions (which you might be covering in the future). Does fuzzy matching work in the same way as fuzzy grouping? I had assumed that some blocking was going on in the background on any exact matches defined, but the tests in the article imply not.

Also, is there a way of getting the blocking to work in SSIS with a much larger number of blocks, e.g. one for every date of birth or post code? We could probably run the fuzzy matching in a loop, but that would lose the advantage of the blocks running in parallel.

Thanks

Barney

jfogel SSCarpal Tunnel Points: 4155 · Answer 2

July 5, 2012 at 12:37 pm

#1509300

Very interesting.

Cheers

Basit Farooq Ten Centuries Points: 1151 · Answer 3

Excellent article....

Regards,

Basit A. Farooq (MSC Computing, MCITP SQL Server 2005 & 2008, MCDBA SQL Server 2000)

http://basitaalishan.com

JOEL-145858 SSC Enthusiast Points: 154 · Answer 4

Thank you for the article.

It is indeed interesting but, in practice, a bit simplistic.

Matching on zip codes and phone numbers is fine as long as they are entered accurately or are part of the person's current address (assuming it is a person being matched).

If, for example, I moved to another location (new address, new zip code), then grouping on those fields would be counterproductive.

Sometimes it is hard to match a person by address. You don't know whether it is an address change or a totally different person.

DeusExDatum Ten Centuries Points: 1144 · Answer 5

Nice article Ira, kinda familiar too. 🙂

JOEL-145858, don't take the content of the article out of context. This is one of MANY methods that you can use in matching, but it doesn't represent a complete matching solution. You would obviously find as many exact matches as possible first, as any kind of fuzzy matching is comparatively expensive. Then, using the model Ira outlined, you can perform fuzzy matching against what remains unmatched. Using geographic elements for blocking and fuzzy matching serves only as one example; the same model would fit other elements. Examples:

Demographic - Block on First 2 letters of last name, year of birth / fuzzy match on FirstName, LastName, DOB

Demo / Geo - Block on FirstName, DOB, State / Fuzzy Match on FirstName, LastName, Address, City, Zip

The whole point of the parallel blocks is to minimize your comparison set, and thereby the number of potential combinations. In matching solutions I have done using this exact method, multiple iterations of this model with different criteria served our matching needs very well.
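To make the blocking idea concrete outside of SSIS, here is a minimal Python sketch of the "Demographic" example above: block on the first 2 letters of the last name plus year of birth, then fuzzy match on FirstName, LastName and DOB only within each block. The sample rows, field names, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration; this is not how the SSIS Fuzzy Lookup computes its score.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample rows; the field names are assumptions for illustration.
records = [
    {"id": 1, "first": "Jon", "last": "Smith", "dob": "1980-03-14"},
    {"id": 2, "first": "John", "last": "Smith", "dob": "1980-03-14"},
    {"id": 3, "first": "Jane", "last": "Smythe", "dob": "1980-07-02"},
    {"id": 4, "first": "Maria", "last": "Lopez", "dob": "1975-11-30"},
]

def block_key(rec):
    """Blocking key: first 2 letters of last name + year of birth."""
    return (rec["last"][:2].upper(), rec["dob"][:4])

def similarity(a, b):
    """Average character similarity over FirstName, LastName, DOB (0.0 to 1.0)."""
    fields = ("first", "last", "dob")
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields
    ]
    return sum(scores) / len(scores)

def match_within_blocks(recs, threshold=0.85):
    """Compare only records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[block_key(rec)].append(rec)

    matches = []
    for members in blocks.values():
        # Pairwise comparison is confined to the block, not the whole set.
        for a, b in combinations(members, 2):
            score = similarity(a, b)
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 3)))
    return matches

matches = match_within_blocks(records)
print(matches)
```

With these four rows, records 1, 2 and 3 share a block and record 4 sits alone, so only 3 pairs are scored instead of the 6 you would get by comparing everything against everything. The saving grows quickly as the record count goes up.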

P.S. The method is also highly scalable if you have the processing power and memory on your SSIS box.

Josh Lewis

Ira Warren Whiteside SSCrazy Points: 2146 · Answer 6

Mr DeusExDatum is exactly "spot on" and I highly recommend that you follow him. He has very deep and extensive experience, and I have had the pleasure of working with him.

The reason I focused on methodology first in my article, and then on a specific SSIS technique, was to emphasize the entire process of record linkage. Obviously you would get all the exact matches out of the mix first, then cycle through the remaining records with fuzzy matching, utilizing various blocking criteria and even developing multiple scoring schemes.
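As a rough illustration of that cycle, the Python sketch below runs an exact pass first and then successive fuzzy passes, each with its own blocking key and threshold, always working only on what earlier passes left unmatched. Everything in it is an assumption made for the example: the sample rows, the field names, the pass definitions, the thresholds, and `difflib.SequenceMatcher` as a stand-in similarity score. It is not the SSIS package from the article.

```python
from difflib import SequenceMatcher

# Hypothetical source and reference rows; field names are assumptions.
source = [
    {"id": "s1", "first": "Ann", "last": "Lee", "dob": "1990-01-05", "zip": "30301"},
    {"id": "s2", "first": "Robert", "last": "Browne", "dob": "1982-06-21", "zip": "30305"},
    {"id": "s3", "first": "Cathrine", "last": "Diaz", "dob": "1979-09-09", "zip": "30309"},
    {"id": "s4", "first": "Omar", "last": "Khan", "dob": "1988-12-01", "zip": "30311"},
]
reference = [
    {"id": "r1", "first": "Ann", "last": "Lee", "dob": "1990-01-05", "zip": "30301"},
    {"id": "r2", "first": "Robert", "last": "Brown", "dob": "1982-06-21", "zip": "30318"},
    {"id": "r3", "first": "Catherine", "last": "Dias", "dob": "1979-09-09", "zip": "30309"},
]

FIELDS = ("first", "last", "dob")

def score(a, b):
    """Average character similarity across the compared fields (0.0 to 1.0)."""
    return sum(
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS
    ) / len(FIELDS)

# Each pass: (name, blocking key function, minimum score to accept a link).
# The first pass blocks on every compared field, so it only finds exact matches.
passes = [
    ("exact", lambda r: tuple(r[f].lower() for f in FIELDS), 1.0),
    ("dob+zip", lambda r: (r["dob"], r["zip"]), 0.85),
    ("last2+year", lambda r: (r["last"][:2].lower(), r["dob"][:4]), 0.85),
]

def run_passes(src, ref, pass_defs):
    links = {}  # source id -> (reference id, pass name, score)
    remaining_ref = {r["id"]: r for r in ref}
    for name, key_fn, threshold in pass_defs:
        # Index only the still-unmatched reference rows by this pass's key.
        blocks = {}
        for r in remaining_ref.values():
            blocks.setdefault(key_fn(r), []).append(r)
        for s in src:
            if s["id"] in links:
                continue  # already linked by an earlier pass
            candidates = [
                c for c in blocks.get(key_fn(s), []) if c["id"] in remaining_ref
            ]
            if not candidates:
                continue
            best = max(candidates, key=lambda c: score(s, c))
            best_score = score(s, best)
            if best_score >= threshold:
                links[s["id"]] = (best["id"], name, round(best_score, 3))
                del remaining_ref[best["id"]]
    return links

links = run_passes(source, reference, passes)
for source_id, link in links.items():
    print(source_id, link)
```

Recording which pass produced each link is one simple way to keep separate scores per blocking criterion, so the looser passes can be reviewed more carefully than the exact one.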

Mr DeusExDatum has implemented this method in a real-world solution.

If I am able to do a Part Deux of this topic, I will seek Mr DeusExDatum's contribution.

Ira Warren Whiteside

ralphpawne Valued Member Points: 63 · Answer 7

For fuzzy matching tools, I've come across DataMatch by Data Ladder, which is an excellent fuzzy matching and record linkage tool used across businesses and would work really well for this situation. They offer a complimentary trial for new users.

In fact, an independent, verified evaluation compared the software with major tools from IBM and SAS. A study at the Curtin University Centre for Data Linkage in Australia simulated the matching of 4.4 million records and rated each provider on accuracy (number of matches found versus available, and number of false matches):

1. DataMatch Enterprise, Highest Accuracy (>95%), Very Fast, Low Cost

2. IBM Quality Stage, High Accuracy (>90%), Very Fast, High Cost (>$100K)

3. SAS Data Flux, Medium Accuracy (>85%), Fast, High Cost (>$100K)
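For anyone wanting to score their own matching runs along the same two lines (matches found versus available, and false matches), a minimal Python sketch follows. It assumes you have a hand-verified set of true pairs to compare against. The function name, the sample pairs, and the labels "recall" and "precision" are my own choices for illustration and are not taken from the study.

```python
def linkage_quality(found_pairs, true_pairs):
    """Compare matcher output with a verified truth set.

    Pairs must use a consistent ordering, e.g. (smaller id, larger id).
    """
    found, truth = set(found_pairs), set(true_pairs)
    correct = found & truth        # matches found that are real
    false_matches = found - truth  # pairs linked that should not be
    missed = truth - found         # real matches the run did not find
    return {
        # Share of available matches that were found.
        "recall": len(correct) / len(truth) if truth else 0.0,
        # Share of reported matches that are real.
        "precision": len(correct) / len(found) if found else 0.0,
        "false_matches": len(false_matches),
        "missed": len(missed),
    }

# Hypothetical example: 4 real matches exist, the run reports 3 pairs.
truth = [(1, 2), (3, 4), (5, 6), (7, 8)]
found = [(1, 2), (3, 4), (5, 9)]

report = linkage_quality(found, truth)
print(report)
```

In this made-up run, 2 of the 4 available matches were found and 1 of the 3 reported pairs is a false match, which is the kind of trade-off a single "accuracy" percentage hides.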