Match all duplicate IDs in a single row for review

Question

Post reply

Match all duplicate IDs in a single row for review

staugust1

Newbie

Points: 5
More actions
February 27, 2016 at 11:32 am

#399780

Hello. I hope someone can help me with a problem matching duplicate records in SQL Server 2008. I want to identify potential duplicate records so they can be reviewed. To make this easier for the review process, I would like to list out all the possible duplicates in a single row.
Sample Data: Customer Table
ID, FirstName, LastName, State
1, Amy, Adams, CA
2, Bob, Bonds, NY
3, Charlie, Conway, OR
4, Derek, Davis,
5, Amy, Adams, MI
6, Emily, Evans, NY
7, Eric, Evans, NJ
8, Amy, Adams, CA
9, C, Conway, OR
I would like to flag potential duplicates if they have the same FirstName and LastName OR the same LastName and State
In this example, records 1,5,8 are possible duplicates and records 3 and 9 are possible duplicates.
I would like the output to look something like this:
ID, MatchID1, MatchID2, MatchIDN
1, 5, 8
3, 9
This is my sql to try and get the desired result:
SELECT Customer.ID, Customer_1.ID INTO Results
FROM Customer INNER JOIN Customer AS Customer_1 ON Customer.LastName = Customer_1.LastName
WHERE (((Customer_1.ID)<>[Customer].[ID]) AND ((Customer_1.FirstName)=[Customer].[FirstName]) AND ((Customer_1.LastName)=[Customer].[LastName])) OR (((Customer_1.ID)<>[Customer].[ID]) AND ((Customer_1.LastName)=[Customer].[LastName]) AND ((Customer_1.State)=[Customer].[State]));
This is the result
Customer_ID, Customer_1_ID
8, 1
5, 1
9, 3
8, 5
1, 5
5, 8
1, 8
3, 9
I have 2 problems:
1. I think I need to use a cursor to get all possible duplicates for a given customer record in case there is more than 1 duplicate as is the case with Amy Adams in my example
2. With this approach, my result set has duplicate values in different fields. I only want one record that shows all possible duplicates, e.g., I want one result row that shows that customer records 3 and 9 are possible duplicates instead of 1 row that says record 9 is a possible duplicate of record 3 and another row that says record 3 is a duplicate of record 9.
I think the way to do this is to get all the possible combinations and then go through each row to re-order the MatchID in ascending order so the lowest Customer ID is in MatchID1 and the next lowest is in MatchID2, etc. Then I could look for duplicate records based on the re-ordered MatchID values.
Does this approach make sense? If so, how would I write the queries to address problems 1 & 2 above?
Any help is greatly appreciated.

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Jack Corbett SSC Guru Points: 184397 More actions · Answer 1

The following code doesn't get you to your final result the way you want it pivoted, but it gets close:

WITH Customer

AS (

SELECT

*

FROM

( VALUES ( 1, 'Amy', 'Adams', 'CA'),

( 2, 'Bob', 'Bonds', 'NY'), ( 3, 'Charlie', 'Conway', 'OR'),

( 4, 'Derek', 'Bonds', 'NY'), ( 5, 'Amy', 'Adams', 'MI'),

( 6, 'Emily', 'Evans', 'NY'), ( 7, 'Eric', 'Evans', 'NJ'),

( 8, 'Amy', 'Adams', 'CA'), ( 9, 'C', 'Conway', 'OR') ) AS C (ID, FirstName, LastName, State)

),

AddHashes

AS (

SELECT

*,

HASHBYTES('SHA2_256', Customer.FirstName + Customer.LastName) AS NameHash,

HASHBYTES('SHA2_256', Customer.LastName + Customer.State) AS LastNameStateHash

FROM

Customer

),

Dupes

AS (

SELECT

C1.ID AS OriginalID,

C2.ID AS MatchID,

ROW_NUMBER() OVER (PARTITION BY C1.ID + C2.ID,

CASE WHEN C1.NameHash = C2.NameHash AND

C1.LastNameStateHash = C2.LastNameStateHash

THEN C1.NameHash +

C1.LastNameStateHash

WHEN C1.NameHash = C2.NameHash

THEN C1.NameHash

ELSE C1.LastNameStateHash

END ORDER BY C1.ID) AS RowNo,

C1.NameHash,

C1.LastNameStateHash

FROM

AddHashes AS C1

JOIN AddHashes AS C2

ON C1.ID <> C2.ID AND

(

(C1.NameHash = C2.NameHash) OR

(C1.LastNameStateHash = C2.LastNameStateHash)

)

SELECT

*

FROM

Dupes

WHERE

Dupes.RowNo = 1;

It returns:

OriginalIDMatchID

1 5

1 8

2 4

3 9

5 8

I actually think this is a better result because it may be that 5 & 8 is the actual duplicate and 1 & 5 and 1 & 8 aren't really duplicates so I'd want to see it this way instead of the way you want it, but you know your requirements better than I do.

I made 1 change to your example data to make 2 & 4 a match so I could figure out to make the RowNo work when the sum of OriginalID and MatchID is the same across different combinations and the hashing made that possible. I didn't have to hash, but hashing is simpler than typing all the column names a bunch of times.

There is likely a better way to solve this problem, I just don't have it.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Jeff Moden SSC Guru Points: 1004683 More actions · Answer 2

staugust1 (2/27/2016)
...OR the same LastName and State...

To be honest, if you have very much data at all, that could return a huge number of potential duplicates. If someone told me that were a requirement, I'd tell them that they really need to clean their pipe before the smoke from it. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)