Duplicate Detection and Management

By Edward Pollack

Intro

Sooner or later, every DBA is asked to delete, merge, or otherwise report on duplicate data.  There are many tools at our disposal to make this task both simple and efficient.  Read on for straightforward steps and examples for dealing with duplicate data effectively.

Problem

You’re hard at work one day when the HR director rushes over to your desk.  Somehow, an employee has accidentally copied thousands of resource records.  Most of sales, HR, and marketing are unable to work effectively until this is resolved.  How do you identify and remove the newly created duplicates without harming the existing data?

Solution

To simulate this, we will create a simple table in AdventureWorks, insert a data set that represents the original data, and then insert copies of some of it to represent the duplication we are tasked with cleaning up:

USE AdventureWorks

GO
-- Create a table to use for our example
CREATE TABLE dbo.Resources
(
       id INT IDENTITY(1,1),
       first_name NVARCHAR(50),
       middle_name NVARCHAR(50),
       last_name NVARCHAR(50),
       email_address NVARCHAR(50),
       email_promotion INT
);

-- Populate our table with an initial data set
INSERT INTO dbo.Resources
        ( first_name ,
          middle_name ,
          last_name ,
          email_address ,
          email_promotion
        )
SELECT
       FirstName,
       MiddleName,
       LastName,
       EmailAddress,
       EmailPromotion
FROM Person.Person
INNER JOIN Person.EmailAddress
ON Person.EmailAddress.BusinessEntityID = Person.Person.BusinessEntityID;

-- Now create 7500 duplicate records from our data set
INSERT INTO dbo.Resources
        ( first_name ,
          middle_name ,
          last_name ,
          email_address ,
          email_promotion
        )
SELECT TOP 7500
    first_name ,
    middle_name ,
    last_name ,
    email_address ,
    email_promotion
FROM dbo.Resources;

The steps we need to take are:

  1. Determine which columns identify a duplicate.  This is critical to avoiding mistakes later.  The HR director tells us that we can identify duplicates based on first name, last name, and email address.
  2. Write a query to view the duplicate data and verify that everything looks correct.  We will do this using a common table expression (CTE):
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_name ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
SELECT
       *
FROM CTE_DUPLICATE_RESOURCES;

ROW_NUMBER() numbers the rows within each group of matching records.  The PARTITION BY columns are the duplicate criteria that we determined in our first step.  The ORDER BY column determines how the rows within each group are numbered.  We use id here because it orders the rows by their creation, but a date or other time-sensitive column could be used in similar situations.
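
For example, if a table tracked the time each row was created, that timestamp could stand in for id.  The sketch below assumes a hypothetical created_date column that our Resources table does not actually have, purely to illustrate the pattern:

-- A minimal sketch, assuming a hypothetical created_date column on dbo.Resources.
-- The earliest row in each group gets row_num = 1 and is treated as the original.
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY created_date) AS row_num,
              first_name,
              middle_name,
              last_name,
              email_address,
              email_promotion
       FROM dbo.Resources
)
SELECT
       *
FROM CTE_DUPLICATE_RESOURCES;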

  3. Locate the duplicates, verify the data, and delete them:
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_name ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
SELECT
       *
FROM CTE_DUPLICATE_RESOURCES
WHERE row_num > 1;

By adding the filter row_num > 1, we now see only the duplicate rows.  Since this looks correct, we can change the SELECT into a DELETE and complete the cleanup:

WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_name ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
DELETE
FROM CTE_DUPLICATE_RESOURCES
WHERE row_num > 1;
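
As a quick sanity check, we can re-run the duplicate detection as a count; assuming the DELETE above succeeded, it should return zero:

-- Sanity check: count any remaining duplicates.  Expect 0 after the DELETE above.
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num
       FROM dbo.Resources
)
SELECT
       COUNT(*) AS remaining_duplicates
FROM CTE_DUPLICATE_RESOURCES
WHERE row_num > 1;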

Merging Data

Sometimes we have to do more than just delete duplicates.  Instead of simply removing them, we may be asked to merge them, preserving certain columns of data from the records we are about to delete.  This can be accomplished in a manner similar to the above by working our existing duplicate detection into an update that runs prior to the deletion.

Let’s say that the HR Director tells us that the new records have some valuable data in them, and before being deleted, we need to ensure that the email promotion and middle name are copied back to the original records:

WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name ,
              middle_name ,
              last_name ,
              email_address ,
              email_promotion
       FROM dbo.Resources
)
UPDATE dbo.Resources
       SET email_promotion = CTE_DUPLICATE_RESOURCES.email_promotion,
           middle_name = CTE_DUPLICATE_RESOURCES.middle_name
FROM CTE_DUPLICATE_RESOURCES
INNER JOIN dbo.Resources
ON CTE_DUPLICATE_RESOURCES.first_name = dbo.Resources.first_name
AND CTE_DUPLICATE_RESOURCES.last_name = dbo.Resources.last_name
AND CTE_DUPLICATE_RESOURCES.email_address = dbo.Resources.email_address
WHERE CTE_DUPLICATE_RESOURCES.row_num > 1
AND CTE_DUPLICATE_RESOURCES.id > dbo.Resources.id;

This revised version of our CTE joins back to Resources to match each original record with its duplicate and then updates the original to reflect the newer email promotion and middle name.  The inequality on id ensures that the rows returned by the CTE are newer than the original rows.  As before, a date or other ordered column could be used to separate duplicates from originals and join them accurately.

Once this is complete, we can run the deletion from above and our work here is finished!
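
If you would rather not risk the update succeeding while the delete fails (or vice versa), the two statements can be wrapped in a single transaction.  This is a minimal sketch of that approach, not something the scenario strictly requires:

-- A minimal sketch: run the merge and the cleanup as one unit of work,
-- so a failure part-way through leaves the table untouched.
BEGIN TRANSACTION;

-- Copy the newer email promotion and middle name back to the original rows
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              id,
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,
              first_name,
              middle_name,
              last_name,
              email_address,
              email_promotion
       FROM dbo.Resources
)
UPDATE dbo.Resources
       SET email_promotion = CTE_DUPLICATE_RESOURCES.email_promotion,
           middle_name = CTE_DUPLICATE_RESOURCES.middle_name
FROM CTE_DUPLICATE_RESOURCES
INNER JOIN dbo.Resources
ON CTE_DUPLICATE_RESOURCES.first_name = dbo.Resources.first_name
AND CTE_DUPLICATE_RESOURCES.last_name = dbo.Resources.last_name
AND CTE_DUPLICATE_RESOURCES.email_address = dbo.Resources.email_address
WHERE CTE_DUPLICATE_RESOURCES.row_num > 1
AND CTE_DUPLICATE_RESOURCES.id > dbo.Resources.id;

-- Remove the duplicate rows now that their data has been merged
WITH CTE_DUPLICATE_RESOURCES AS
(
       SELECT
              ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num
       FROM dbo.Resources
)
DELETE
FROM CTE_DUPLICATE_RESOURCES
WHERE row_num > 1;

COMMIT TRANSACTION;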

Summary

The key to duplicate detection is knowing with certainty what makes a row a duplicate and mapping that definition onto the columns of data that we will be working with.  Based on that information, the ROW_NUMBER column that we create in our CTE defines how to organize our data prior to performing the cleanup.  Once that is in place, we can manipulate the data using the T-SQL we have written, including deleting duplicates, updating existing data, and creating new records.

