Advanced SSIS Fuzzy Matching via Record Linkage Methodology

  • Comments posted to this topic are about the item Advanced SSIS Fuzzy Matching via Record Linkage Methodology

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • A very interesting article, I'm looking forward to the follow ups as we're heading in to a fairly big matching job (merging 1/4 million education records to 100k social care records for our county).

    A couple of questions (which you might be covering in the future). Does fuzzy matching work in the same way as fuzzy grouping? I had assumed that there was some blocking going on in the background on any exact matches defined but the tests in the article imlies not.

    Also is there a way of getting the blocking to work in SSIS with a much larger number of blocks e.g. one for every date of birth or post code? We could probably run the fuzzy matching in a loop but that would loose the advantage of the blocks running in parallel.

    Thanks

    Barney

  • Very interesting.

    Cheers

  • Excellent article....

    Regards,

    Basit A. Farooq (MSC Computing, MCITP SQL Server 2005 & 2008, MCDBA SQL Server 2000)

    http://basitaalishan.com
  • [Thank you for an article.

    It is indeed interesting but, in practice a bit simplistic.

    Matching zip codes, phones are fine as long as it is entered accurately or it is a part of current address of the person. (Assuming that person is being matched).

    If for example I moved to another locations (address, zip code) then grouping would be negative effort.

    Sometimes, it is hard to match person by address. You donโ€™t know it is address change or totally different person.

  • Nice article Ira, kinda familiar too. ๐Ÿ™‚

    JOEL-145858, don't take the content of the article out of context. This is one of MANY methods that you can use in matching, but it doesn't represent a complete matching solution. You would obviously find as many exact matches as possible first, as any kind of fuzzy matching is cost-prohibitive comparatively. Then, using the model Ira outlined, you can perform fuzzy matching against what remains unmatched. And using geographic elements for blocking and fuzzy matching only serves as one example; The same model would fit other elements. Example:

    Demographic - Block on First 2 letters of last name, year of birth / fuzzy match on FirstName, LastName, DOB

    Demo / Geo - Block on FirstName, DOB, State / Fuzzy Match on FirstName, LastName, Address, City, Zip

    The whole point of the parallel blocks is to minimize your comparison set, and thereby the number of potential combinations. In matching solutions I have done using this exact method, multiple iterations of this model with different criteria served our matching needs very well.

    P.S - the method is also highly scalable if you have the processing power and memory on your SSIS Box.

    Josh Lewis

  • Mr DeusExDatum is exactly "spot on" and I highly recommend that you follow him, he has very deep and extensive experience and I have had the pleasure in working with him.

    The reason I focus on methodology first in my article and a specific SSIS technique was to emphasis the entire process of record linkage, obviously you would get all the exact matches out of the mix first and cycle thru the remaing with Fuzzy and utilitize various blocking criteria and even develop multiple scoring.

    Mr DeusExDatum has implemented this method in real world solution.

    If I am able to do a Part Duex of this topic I will seek Mr DeusExDatum contribution.

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • For fuzzy matching tools, I've come across DataMatch by Data Ladder, which is an excellent fuzzy matching and record linkage tool used across business and would work really well for this situation. They offer a complimentary trial[/url] for new users.

    In fact, an independent verified evaluation was done of the software comparing it to major software tools by IBM and SAS. There was a study done at Curtin University Centre for Data Linkage in Australia that simulated the matching of 4.4 Million records. It identified what providers had in terms of accuracy (Number of matches found vs available. Number of false matches)

    1.DataMatch Enterprise, Highest Accuracy (>95%), Very Fast, Low Cost

    2.IBM Quality Stage , high accuracy (>90%), Very Fast, High Cost (>$100K)

    3.SAS Data Flux, Medium Accuracy (>85%), Fast, High Cost (>100K)

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply