SQLServerCentral Article

Removing Duplicates

,

A confounding yet common problem with data is duplicated information. It could be in a poorly designed table, in imported data or other situations, but the problem remains the same, how do you tell your system to delete all but one copy of the row? Fortunately, SQL Server has a tool that makes it easy: ROW_NUMBER().

In duplicate data, each row is the same. While you can look at it and decide which row you’d want to delete, the problem is how you can tell SQL Server which row that is. ROW_NUMBER() lets you add a column that differentiates between the rows and you can delete all but the desired ones.

Lets say you have a table

CREATE TABLE DuplicateRow(
FName varchar(30),
LName varchar(30),
JobTitle varchar(30),
Age tinyint
)

We’ll keep it simple for this example. Now let’s say you query the table and find that when you were doing the inserts you were a little too enthusiastic in hitting “execute”

FName LName JobTitle Age
Mike Klarm Manager 37
Mike Klarm Manager 37
Mike Klarm Manager 37

You don’t want him in there three times, but if you tell SQL Server to DELETE based on any of the data you have available, it’ll delete all three records and you’ll have to insert again. To keep one row and delete the others, you’d use ROW_NUMBER(). We’ll start with a SELECT so you can see what’s happening.

SELECT
  ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
, FName
, LName
, JobTitle
, Age
FROM DuplicateRow

This gives you

R FName LName JobTitle Age
1 Mike Klarm Manager 37
2 Mike Klarm Manager 37
3 Mike Klarm Manager 37

And now you can just delete any rows where R > 1

DELETE q
    FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
              , FName
              , LName
              , JobTitle
              , Age
            FROM DuplicateRow
         ) q
    WHERE R > 1

you don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.

If there are other rows in the database, the first instance of every set of data will be numbered 1 and will not be deleted, but every record that is a duplicate will be deleted.

Let’s look at ROW_NUMBER() just a little more closely. All of the magic happens in the OVER() clause that follows ROW_NUMBER(), specifically the “partition by” section. This is where you list the columns that you expect to be duplicated. Since we want to delete any rows where every single column is duplicated, we list all of them here. “Order by” doesn’t make much difference here since all the rows are the same, but you do need a value here for it to work.

If your data is a little different, you can use ROW_NUMBER to get rid of different combinations of data. Let’s say Mike got a promotion last year and someone added a row instead of updating it.

FName LName JobTitle Age
Mike Klarm Manager 37
Mike Klarm Manager 37
Mike Klarm Manager 37
Mike Klarm Operator 32
Mike Klarm Operator 32
Mike Klarm Intern 22

Depending on what you put in your “partition by” statement, you can decide what to delete. If you want to keep one of each set, use the same query as above and you’ll get

R FName LName JobTitle Age
1 Mike Klarm Manager 37
2 Mike Klarm Manager 37
3 Mike Klarm Manager 37
1 Mike Klarm Operator 32
2 Mike Klarm Operator 32
1 Mike Klarm Intern 22

It will number each row, starting at one for each unique combination of data listed in the “partition by” statement. If you then run the above DELETE statement, you’ll be left with

FName LName JobTitle Age
Mike Klarm Manager 37
Mike Klarm Operator 32
Mike Klarm Intern 22

But let’s say you only want to keep the most recent record. You could run the above query and also delete any record where the Age <> 37, but that doesn’t work as well if you have a lot of data in the table, you’d have to specify just Mike Klarm to keep from deleting anyone who isn’t 37.

An easier way is to modify your “partition by” clause.

SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName ORDER BY Age DESC ) R
      , FName
      , LName
      , JobTitle
      , Age
    FROM DuplicateRow

You’ll end up with

R FName LName JobTitle Age
1 Mike Klarm Manager 37
2 Mike Klarm Manager 37
3 Mike Klarm Manager 37
4 Mike Klarm Operator 32
5 Mike Klarm Operator 32
6 Mike Klarm Intern 22

And when you delete where R > 1 you’ll be left with the most recent record. You can use the “partition by” clause to get many different sortings on data that’s only partially duplicated.

Rate

4.61 (183)

You rated this post out of 5. Change rating

Share

Share

Rate

4.61 (183)

You rated this post out of 5. Change rating