Blog Post

Removing Identical Duplicate Rows

,

Deleting duplicate rows out of a table can be tricky.  A brute force way to do this is with a TOP 1 and a cursor, but clearly there are set-based ways to accomplish this.  Using the ROW_NUMBER function with an appropriate OVER is fast and simple to understand.


This is a simple lab you can run to test it out.  


use sandboxdb

go 

drop table wdupholder

go 

create table wdupholder

( text1 varchar(10) not null

, text2 varchar(10) not null

)  

go 

insert into wdupholder (text1, text2) values ('test1','test1')

,('test1','test1')

,('test2','test2')

,('test2','test2')

,('test3','test3')

,('test3','test3')

,('test3','test3')

,('test3','test3') 

go 

select * from wdupholder

Note that we have a number of rows that are completely identical to eachother - no way to tell them apart.

delete a from 

(select DupRank = row_number() OVER (Partition By text1, text2 order by text1 asc) 

from wdupholder) a

where duprank > 1

go 

select * from wdupholder

go 

drop table wdupholder

go

In that case, we had no way to tell the duplicate records apart.  In this next scenario, our job is easier.  We have duplicate records but we have an integer column that may have been some sort of autoincrement or serialized value.  In your database, this may be a [createddate] datetime field that could be used to differentiate duplicate fields. 


We want to delete all but the lowest value for the column [int1] of any set of duplicate rows.  This is easy to do by using the integer column in the ORDER BY of the OVER.  We could delete all but the most recent values, based on [int1], by changing the ORDER BY to desc.


You should be able to apply this simple lab to your environment.  

create table wdupholder

( text1 varchar(10) not null

, text2 varchar(10) not null

, int1 int null

) 

go 

insert into wdupholder (text1, text2, int1) values ('test1','test1','1')

,('test1','test1','2')

,('test2','test2','3')

,('test2','test2','4')

,('test3','test3','5')

,('test3','test3','6')

,('test3','test3','7')

,('test3','test3',null) 

go 

select * from wdupholder

go 

delete a from 

(select DupRank = row_number() OVER (Partition By text1, text2 order by int1 asc) 

from wdupholder ) a 

where duprank > 1

go 

select * from wdupholder

Note the behavior of the NULL value in the int1 column, and be aware of NULL values in your own dataset.


Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating