SQLServerCentral Article

Duplicate Detection and Management

,

Intro

Inevitably, every DBA is eventually asked to delete, merge, or otherwise report on duplicate data.  There are many tools at our disposal to make this task both simple and efficient to complete.  Read on for simple steps and examples to effectively deal with duplicate data.

Problem

You’re hard at work one day and the HR director runs over to your desk in a hurry.  Somehow, an employee has copied thousands of resource records by accident.  Most of sales, HR, and marketing are unable to work effectively until this is resolved.  How do you identify and remove the newly-created duplicates without harming existing data?

Solution

In order to simulate this, we will create a simple table in AdventureWorks, insert some data that represents the original data set, and then insert copies of some of it, which represents the duplication that we are tasked with cleaning up:

USE AdventureWorks
GO
-- Create table to use for our example
CREATE TABLE Resources
(
       id INT IDENTITY(1,1),
       first_name NVARCHAR(50),
       middle_namwe NVARCHAR(50),
       last_name NVARCHAR(50),
       email_address NVARCHAR(50),
       email_promotion INT
)
-- Populate our table with an initial data set
INSERT INTO dbo.Resources
        ( first_name ,
          middle_namwe ,
          last_name ,
          email_address ,
         email_promotion
        )
SELECT
       FirstName,
       MiddleName,
       LastName,
       EmailAddress,
       EmailPromotion
FROM Person.Person
INNER JOIN Person.EmailAddress
ON Person.EmailAddress.BusinessEntityID = Person.Person.BusinessEntityID
-- Now create 7500 duplicate records from our data set
INSERT INTO dbo.Resources
        ( first_name ,
          middle_namwe ,
          last_name ,
          email_address ,
          email_promotion
        )
SELECT TOP 7500
    first_name ,
    middle_namwe ,
    last_name ,
    email_address ,
    email_promotion
FROM dbo.Resources;

The steps we need to take are:

  1. Determine which columns can be used when determining duplicate data.  This is critical to avoiding mistakes later. The HR director tells us that we can identify duplicates based on first name, last name, and email address.
  2. Write a query to view the duplicate data and verify that everything looks correct.  We will do this using a common table expression (CTE):
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_namwe ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
SELECT
       *
FROM CTE_DUPLICATE_RESOURCES;

The ROW_NUMBER() will number all rows based on their occurrence.  The PARTITION BY columns are the duplicate criteria that we determined in our first step.  The ORDER BY column determines how to order the resulting data.  We use id here as this will order our results by their creation, but a date or other time-sensitive column could be used in similar situations.

  1. Locate the duplicates to delete, verify the data, and delete them:
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_namwe ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
SELECT
       *
FROM CTE_DUPLICATE_RESOURCES
WHERE row_num > 1;

By adding the filter of row_num > 1, we are now only seeing the duplicate rows.  Since this looks good, we can just change the SELECT into a DELETE and complete the cleanup:

WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_namwe ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
DELETE
FROM CTE_DUPLICATE_RESOURCES
WHERE row_num > 1;

Merging Data

Sometimes we will have to do more than just delete duplicates.  Instead of removing them, we will be asked to merge them, thereby maintaining certain columns of data from the records we are about to delete.  This can be accomplished in a manner similar to above by taking our existing duplicate detection and working it into an update prior to our deletion.

Let’s say that the HR Director tells us that the new records have some valuable data in them, and before being deleted, we need to ensure that the email promotion and middle name are copied back to the original records:

WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_name ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
UPDATE dbo.Resources
       SET email_promotion = CTE_DUPLICATE_RESOURCES.email_promotion,
           middle_name = CTE_DUPLICATE_RESOURCES.middle_name
FROM CTE_DUPLICATE_RESOURCES
INNER JOIN dbo.Resources
ON CTE_DUPLICATE_RESOURCES.first_name = dbo.Resources.first_name
AND CTE_DUPLICATE_RESOURCES.last_name = dbo.Resources.last_name
AND CTE_DUPLICATE_RESOURCES.email_address = dbo.Resources.email_address
WHERE CTE_DUPLICATE_RESOURCES.row_num > 1
AND CTE_DUPLICATE_RESOURCES.id > dbo.Resources.id

This revised version of our CTE from above will join back to Resources to match the original record with the duplicate record and then update it to reflect the newer email promotion and middle name.  The inequality on id ensures that the rows returned in the CTE are newer than the original rows.  As with before, a date or other ordered column could be used to separate duplicate from original and join each accurately.

Once this is complete, we can run the deletion from above and our work here is finished!

Summary

The key to duplicate detection is knowing with certainty how to detect duplicates and mapping that knowledge into the columns of data that we will be working with.  Based on that information, the ROW_NUMBER column that we create in our CTE will define how to organize our data prior to performing our cleanup.  Once this is complete, we can manipulate that data using the TSQL that we have written, including deletion of duplicates, updating existing data, and creating new records.

Rate

4.56 (32)

You rated this post out of 5. Change rating

Share

Share

Rate

4.56 (32)

You rated this post out of 5. Change rating