February 27, 2016 at 11:32 am
Hello. I hope someone can help me with a problem matching duplicate records in SQL Server 2008. I want to identify potential duplicate records so they can be reviewed. To make this easier for the review process, I would like to list out all the possible duplicates in a single row.
Sample Data: Customer Table
ID, FirstName, LastName, State
1, Amy, Adams, CA
2, Bob, Bonds, NY
3, Charlie, Conway, OR
4, Derek, Davis,
5, Amy, Adams, MI
6, Emily, Evans, NY
7, Eric, Evans, NJ
8, Amy, Adams, CA
9, C, Conway, OR
I would like to flag potential duplicates if they have the same FirstName and LastName OR the same LastName and State
In this example, records 1,5,8 are possible duplicates and records 3 and 9 are possible duplicates.
I would like the output to look something like this:
ID, MatchID1, MatchID2, MatchIDN
1, 5, 8
3, 9
This is my sql to try and get the desired result:
SELECT Customer.ID, Customer_1.ID INTO Results
FROM Customer INNER JOIN Customer AS Customer_1 ON Customer.LastName = Customer_1.LastName
WHERE (((Customer_1.ID)<>[Customer].[ID]) AND ((Customer_1.FirstName)=[Customer].[FirstName]) AND ((Customer_1.LastName)=[Customer].[LastName])) OR (((Customer_1.ID)<>[Customer].[ID]) AND ((Customer_1.LastName)=[Customer].[LastName]) AND ((Customer_1.State)=[Customer].[State]));
This is the result
Customer_ID, Customer_1_ID
8, 1
5, 1
9, 3
8, 5
1, 5
5, 8
1, 8
3, 9
I have 2 problems:
1. I think I need to use a cursor to get all possible duplicates for a given customer record in case there is more than 1 duplicate as is the case with Amy Adams in my example
2. With this approach, my result set has duplicate values in different fields. I only want one record that shows all possible duplicates, e.g., I want one result row that shows that customer records 3 and 9 are possible duplicates instead of 1 row that says record 9 is a possible duplicate of record 3 and another row that says record 3 is a duplicate of record 9.
I think the way to do this is to get all the possible combinations and then go through each row to re-order the MatchID in ascending order so the lowest Customer ID is in MatchID1 and the next lowest is in MatchID2, etc. Then I could look for duplicate records based on the re-ordered MatchID values.
Does this approach make sense? If so, how would I write the queries to address problems 1 & 2 above?
Any help is greatly appreciated.
February 27, 2016 at 3:29 pm
The following code doesn't get you to your final result the way you want it pivoted, but it gets close:
WITH Customer
AS (
SELECT
*
FROM
( VALUES ( 1, 'Amy', 'Adams', 'CA'),
( 2, 'Bob', 'Bonds', 'NY'), ( 3, 'Charlie', 'Conway', 'OR'),
( 4, 'Derek', 'Bonds', 'NY'), ( 5, 'Amy', 'Adams', 'MI'),
( 6, 'Emily', 'Evans', 'NY'), ( 7, 'Eric', 'Evans', 'NJ'),
( 8, 'Amy', 'Adams', 'CA'), ( 9, 'C', 'Conway', 'OR') ) AS C (ID, FirstName, LastName, State)
),
AddHashes
AS (
SELECT
*,
HASHBYTES('SHA2_256', Customer.FirstName + Customer.LastName) AS NameHash,
HASHBYTES('SHA2_256', Customer.LastName + Customer.State) AS LastNameStateHash
FROM
Customer
),
Dupes
AS (
SELECT
C1.ID AS OriginalID,
C2.ID AS MatchID,
ROW_NUMBER() OVER (PARTITION BY C1.ID + C2.ID,
CASE WHEN C1.NameHash = C2.NameHash AND
C1.LastNameStateHash = C2.LastNameStateHash
THEN C1.NameHash +
C1.LastNameStateHash
WHEN C1.NameHash = C2.NameHash
THEN C1.NameHash
ELSE C1.LastNameStateHash
END ORDER BY C1.ID) AS RowNo,
C1.NameHash,
C1.LastNameStateHash
FROM
AddHashes AS C1
JOIN AddHashes AS C2
ON C1.ID <> C2.ID AND
(
(C1.NameHash = C2.NameHash) OR
(C1.LastNameStateHash = C2.LastNameStateHash)
)
)
SELECT
*
FROM
Dupes
WHERE
Dupes.RowNo = 1;
It returns:
OriginalIDMatchID
1 5
1 8
2 4
3 9
5 8
I actually think this is a better result because it may be that 5 & 8 is the actual duplicate and 1 & 5 and 1 & 8 aren't really duplicates so I'd want to see it this way instead of the way you want it, but you know your requirements better than I do.
I made 1 change to your example data to make 2 & 4 a match so I could figure out to make the RowNo work when the sum of OriginalID and MatchID is the same across different combinations and the hashing made that possible. I didn't have to hash, but hashing is simpler than typing all the column names a bunch of times.
There is likely a better way to solve this problem, I just don't have it.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
February 27, 2016 at 4:14 pm
staugust1 (2/27/2016)
...OR the same LastName and State...
To be honest, if you have very much data at all, that could return a huge number of potential duplicates. If someone told me that were a requirement, I'd tell them that they really need to clean their pipe before the smoke from it. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 3 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply