Removing Duplicates

Sioban Krzywicki, 2015-01-23 (first published: 2013-10-08)

A confounding yet common problem with data is duplicated information. It could be in a poorly designed table, in imported data or other situations, but the problem remains the same, how do you tell your system to delete all but one copy of the row? Fortunately, SQL Server has a tool that makes it easy: ROW_NUMBER().

In duplicate data, each row is the same. While you can look at it and decide which row you’d want to delete, the problem is how you can tell SQL Server which row that is. ROW_NUMBER() lets you add a column that differentiates between the rows and you can delete all but the desired ones.

Lets say you have a table

CREATE TABLE DuplicateRow(
FName varchar(30),
LName varchar(30),
JobTitle varchar(30),
Age tinyint
)

We’ll keep it simple for this example. Now let’s say you query the table and find that when you were doing the inserts you were a little too enthusiastic in hitting “execute”

FName	LName	JobTitle	Age
Mike	Klarm	Manager	37
Mike	Klarm	Manager	37
Mike	Klarm	Manager	37

You don’t want him in there three times, but if you tell SQL Server to DELETE based on any of the data you have available, it’ll delete all three records and you’ll have to insert again. To keep one row and delete the others, you’d use ROW_NUMBER(). We’ll start with a SELECT so you can see what’s happening.

SELECT
  ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
, FName
, LName
, JobTitle
, Age
FROM DuplicateRow

This gives you

R	FName	LName	JobTitle	Age
1	Mike	Klarm	Manager	37
2	Mike	Klarm	Manager	37
3	Mike	Klarm	Manager	37

And now you can just delete any rows where R > 1

DELETE q
    FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
              , FName
              , LName
              , JobTitle
              , Age
            FROM DuplicateRow
         ) q
    WHERE R > 1

you don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.

If there are other rows in the database, the first instance of every set of data will be numbered 1 and will not be deleted, but every record that is a duplicate will be deleted.

Let’s look at ROW_NUMBER() just a little more closely. All of the magic happens in the OVER() clause that follows ROW_NUMBER(), specifically the “partition by” section. This is where you list the columns that you expect to be duplicated. Since we want to delete any rows where every single column is duplicated, we list all of them here. “Order by” doesn’t make much difference here since all the rows are the same, but you do need a value here for it to work.

If your data is a little different, you can use ROW_NUMBER to get rid of different combinations of data. Let’s say Mike got a promotion last year and someone added a row instead of updating it.

FName	LName	JobTitle	Age
Mike	Klarm	Manager	37
Mike	Klarm	Manager	37
Mike	Klarm	Manager	37
Mike	Klarm	Operator	32
Mike	Klarm	Operator	32
Mike	Klarm	Intern	22

Depending on what you put in your “partition by” statement, you can decide what to delete. If you want to keep one of each set, use the same query as above and you’ll get

R	FName	LName	JobTitle	Age
1	Mike	Klarm	Manager	37
2	Mike	Klarm	Manager	37
3	Mike	Klarm	Manager	37
1	Mike	Klarm	Operator	32
2	Mike	Klarm	Operator	32
1	Mike	Klarm	Intern	22

It will number each row, starting at one for each unique combination of data listed in the “partition by” statement. If you then run the above DELETE statement, you’ll be left with

FName	LName	JobTitle	Age
Mike	Klarm	Manager	37
Mike	Klarm	Operator	32
Mike	Klarm	Intern	22

But let’s say you only want to keep the most recent record. You could run the above query and also delete any record where the Age <> 37, but that doesn’t work as well if you have a lot of data in the table, you’d have to specify just Mike Klarm to keep from deleting anyone who isn’t 37.

An easier way is to modify your “partition by” clause.

SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName ORDER BY Age DESC ) R
      , FName
      , LName
      , JobTitle
      , Age
    FROM DuplicateRow

You’ll end up with

R	FName	LName	JobTitle	Age
1	Mike	Klarm	Manager	37
2	Mike	Klarm	Manager	37
3	Mike	Klarm	Manager	37
4	Mike	Klarm	Operator	32
5	Mike	Klarm	Operator	32
6	Mike	Klarm	Intern	22

And when you delete where R > 1 you’ll be left with the most recent record. You can use the “partition by” clause to get many different sortings on data that’s only partially duplicated.

Listing Duplicate Values by Group

by Additional Articles

Database Journal

duplicate data

Without question, one of the most common tasks performed by Database Administrators (DBAs) is identifying and weeding out duplicate values in tables. Despite the inordinate number of queries written by other DBAs to locate duplicate values in their database tables, the real challenge is in locating a useable SQL statement to go by. Rob Gravelle presents a few solutions that will save you some time down the road.

2016-12-15

6,301 reads

Convert Rows into Columns

by M Patrick Dillon

SQLServerCentral.com

Rows to Columns

Convert the rows of a SELECT statement into a predetermined number of columns.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

4.16 (58)

You rated this post out of 5. Change rating

2015-12-11 (first published: 2014-07-17)

27,130 reads

Discuss

Troubleshooting ASP and ADO Errors

by Additional Articles

Microsoft MSDN

Stored Procedures

My ASP file doesn’t access my database." "I can’t connect to my database from my code." "I’m having problems calling and debugging stored procedures." These are some of the problems I hear every day as a Microsoft® developer support engineer.

2001-07-20

2,008 reads

Using Stored Procedures in SQL Server

by Additional Articles

Microsoft MSDN

Stored Procedures

If you’re a programmer, you know that SQL is becoming more and more prevalent. Here’s a guide to one of its basic building blocks the stored procedure.

2001-07-09

4,417 reads

Using XP_FILEEXIST

by Brian Knight

SQLServerCentral.com

Stored Procedures

XP_FILEEXIST gives you the ability to find files and directories. Find out how else you can use it in this article.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

4.67 (6)

You rated this post out of 5. Change rating

2001-06-08

20,117 reads

Removing Duplicates

Rate

Share

Categories

Share

Rate

Removing Duplicates

Rate

Share

Categories

Share

Rate

Related content

Listing Duplicate Values by Group

Convert Rows into Columns

Troubleshooting ASP and ADO Errors

Using Stored Procedures in SQL Server

Using XP_FILEEXIST