Identify possible duplicates in a table (finding approximate matches)

  • Hi,

    I'm having some difficulties figuring out how to do this and would greatly appreciate any assistance anyone can offer.

    We're in the process of trying to clean up some of our old data. We have a Customer table that contains some obvious duplicates and some not-so-obvious duplicates. For example, there might be a record with a CustomerName of "Bobs Bar and Grill" and another record with a CustomerName of "Bob'sBar and Gril". The Address might also be identical, and perhaps one digit was mistyped for the Phone field, but they are likely duplicates that we would want to identify.

    CREATE TABLE #TempCustomers (CustomerPk INT, CustomerName VARCHAR(100), Address VARCHAR(100), Phone VARCHAR(100))

    INSERT INTO #TempCustomers (CustomerPk, CustomerName, Address, Phone) VALUES (1, 'Bobs Bar and Grill', '123 East Street','183-384-3847')

    INSERT INTO #TempCustomers (CustomerPk, CustomerName, Address, Phone) VALUES (2, 'Bob''sBar and Gril', '123 East Street','183-334-3847')

    INSERT INTO #TempCustomers (CustomerPk, CustomerName, Address, Phone) VALUES (4, 'Bill''s Bikes', '485 West Avenue','776-183-3811')

    --What I want to see

    SELECT

    CustomerPk1 = 1,

    CustomerName1 = 'Bobs Bar and Grill',

    Address1 = '123 East Street',

    Phone1 = '183-384-3847',

    CustomerPk2 = 2,

    CustomerName2 = 'Bob''sBar and Gril',

    Address2 = '123 East Street',

    Phone2 = '183-334-3847',

    PercentageLikelihoodOfDuplicate = .93 --fudging this number

    UNION

    SELECT

    CustomerPk2 = 2,

    CustomerName2 = 'Bob''sBar and Gril',

    Address2 = '123 East Street',

    Phone2 = '183-334-3847',

    CustomerPk1 = 3,

    CustomerName1 = 'Bill''s Bikes',

    Address1 = '485 West Avenue',

    Phone1 = '776-183-3811',

    PercentageLikelihoodOfDuplicate = .08 --fudging this number

    DROP TABLE #TempCustomers

    What I'm hoping to do is find a SQL function that will compare several column values for each record in the table and identify the likelihood of duplicates. Then we can filter out those with a low probability and send those with the higher probability to our users to correct.

    Any ideas on how this might be accomplished?

  • In my opinion, pure T-SQL is not a good tool for this. T-SQL expects exact and precise specification, and how do you specify "sort of matches". And even if you do come up with something, the result is probably going to be a very nasty and complex (and slow!) query.

    One thing you might want to look into if you want to do it in T-SQL is the SOUNDEX function. It is designed to mostly catch phonetic near-matches (intended to be used when people say their name by telephone and you don't know the exact spelling and perhaps even misheard), and is allegedly not even very good at that, but it might be one step in the right direction for you.

    However, I also recommend looking into SSIS. In an SSIS package, you can use components that do what's known as "fuzzy matching". I must admit that I don't know much about these components except that they exist, but I would definitely investigate this if I were you.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo,

    Thanks for the response! I marked it as the correct solution because I know I've seen SSIS used for this purpose, but I actually ended up dumping the data into Excel and using the free Fuzzy Lookup add-in because it seemed like a quicker and easier solution for a time-sensitive issue.

  • tarr94 (3/18/2016)


    Hugo,

    Thanks for the response! I marked it as the correct solution because I know I've seen SSIS used for this purpose, but I actually ended up dumping the data into Excel and using the free Fuzzy Lookup add-in because it seemed like a quicker and easier solution for a time-sensitive issue.

    I'm curious... Where did you get the add-on from?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply