Match all duplicate IDs in a single row for review

  • Hello. I hope someone can help me with a problem matching duplicate records in SQL Server 2008. I want to identify potential duplicate records so they can be reviewed. To make this easier for the review process, I would like to list out all the possible duplicates in a single row.

    Sample Data: Customer Table

    ID, FirstName, LastName, State

    1, Amy, Adams, CA

    2, Bob, Bonds, NY

    3, Charlie, Conway, OR

    4, Derek, Davis,

    5, Amy, Adams, MI

    6, Emily, Evans, NY

    7, Eric, Evans, NJ

    8, Amy, Adams, CA

    9, C, Conway, OR

    I would like to flag potential duplicates if they have the same FirstName and LastName OR the same LastName and State

    In this example, records 1,5,8 are possible duplicates and records 3 and 9 are possible duplicates.

    I would like the output to look something like this:

    ID, MatchID1, MatchID2, MatchIDN

    1, 5, 8

    3, 9

    This is my sql to try and get the desired result:

    SELECT Customer.ID, Customer_1.ID INTO Results

    FROM Customer INNER JOIN Customer AS Customer_1 ON Customer.LastName = Customer_1.LastName

    WHERE (((Customer_1.ID)<>[Customer].[ID]) AND ((Customer_1.FirstName)=[Customer].[FirstName]) AND ((Customer_1.LastName)=[Customer].[LastName])) OR (((Customer_1.ID)<>[Customer].[ID]) AND ((Customer_1.LastName)=[Customer].[LastName]) AND ((Customer_1.State)=[Customer].[State]));

    This is the result

    Customer_ID, Customer_1_ID

    8, 1

    5, 1

    9, 3

    8, 5

    1, 5

    5, 8

    1, 8

    3, 9

    I have 2 problems:

    1. I think I need to use a cursor to get all possible duplicates for a given customer record in case there is more than 1 duplicate as is the case with Amy Adams in my example

    2. With this approach, my result set has duplicate values in different fields. I only want one record that shows all possible duplicates, e.g., I want one result row that shows that customer records 3 and 9 are possible duplicates instead of 1 row that says record 9 is a possible duplicate of record 3 and another row that says record 3 is a duplicate of record 9.

    I think the way to do this is to get all the possible combinations and then go through each row to re-order the MatchID in ascending order so the lowest Customer ID is in MatchID1 and the next lowest is in MatchID2, etc. Then I could look for duplicate records based on the re-ordered MatchID values.

    Does this approach make sense? If so, how would I write the queries to address problems 1 & 2 above?

    Any help is greatly appreciated.

  • The following code doesn't get you to your final result the way you want it pivoted, but it gets close:

    WITH Customer

    AS (

    SELECT

    *

    FROM

    ( VALUES ( 1, 'Amy', 'Adams', 'CA'),

    ( 2, 'Bob', 'Bonds', 'NY'), ( 3, 'Charlie', 'Conway', 'OR'),

    ( 4, 'Derek', 'Bonds', 'NY'), ( 5, 'Amy', 'Adams', 'MI'),

    ( 6, 'Emily', 'Evans', 'NY'), ( 7, 'Eric', 'Evans', 'NJ'),

    ( 8, 'Amy', 'Adams', 'CA'), ( 9, 'C', 'Conway', 'OR') ) AS C (ID, FirstName, LastName, State)

    ),

    AddHashes

    AS (

    SELECT

    *,

    HASHBYTES('SHA2_256', Customer.FirstName + Customer.LastName) AS NameHash,

    HASHBYTES('SHA2_256', Customer.LastName + Customer.State) AS LastNameStateHash

    FROM

    Customer

    ),

    Dupes

    AS (

    SELECT

    C1.ID AS OriginalID,

    C2.ID AS MatchID,

    ROW_NUMBER() OVER (PARTITION BY C1.ID + C2.ID,

    CASE WHEN C1.NameHash = C2.NameHash AND

    C1.LastNameStateHash = C2.LastNameStateHash

    THEN C1.NameHash +

    C1.LastNameStateHash

    WHEN C1.NameHash = C2.NameHash

    THEN C1.NameHash

    ELSE C1.LastNameStateHash

    END ORDER BY C1.ID) AS RowNo,

    C1.NameHash,

    C1.LastNameStateHash

    FROM

    AddHashes AS C1

    JOIN AddHashes AS C2

    ON C1.ID <> C2.ID AND

    (

    (C1.NameHash = C2.NameHash) OR

    (C1.LastNameStateHash = C2.LastNameStateHash)

    )

    )

    SELECT

    *

    FROM

    Dupes

    WHERE

    Dupes.RowNo = 1;

    It returns:

    OriginalIDMatchID

    1 5

    1 8

    2 4

    3 9

    5 8

    I actually think this is a better result because it may be that 5 & 8 is the actual duplicate and 1 & 5 and 1 & 8 aren't really duplicates so I'd want to see it this way instead of the way you want it, but you know your requirements better than I do.

    I made 1 change to your example data to make 2 & 4 a match so I could figure out to make the RowNo work when the sum of OriginalID and MatchID is the same across different combinations and the hashing made that possible. I didn't have to hash, but hashing is simpler than typing all the column names a bunch of times.

    There is likely a better way to solve this problem, I just don't have it.

  • staugust1 (2/27/2016)


    ...OR the same LastName and State...

    To be honest, if you have very much data at all, that could return a huge number of potential duplicates. If someone told me that were a requirement, I'd tell them that they really need to clean their pipe before the smoke from it. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply