SQLServerCentral is supported by Redgate
 
Eliminating Duplicate Rows using The PARTITION BY clause



Suresh Kumar Maganti
SSC-Enthusiastic (159 reputation)

Group: General Forum Members
Points: 159 Visits: 125
Comments posted to this topic are about the item Eliminating Duplicate Rows using The PARTITION BY clause


Kindest Regards,

M Suresh Kumar

David Lean
SSC Rookie (42 reputation)

Group: General Forum Members
Points: 42 Visits: 129
Nice article and a nice solution. But I would caution anyone against blindly using this in production: it is only the first part of the process.

Problem 1: You need to fix Declarative Referential Integrity (DRI) for the rows you delete.
Invariably, the duplicate rows you are about to remove will be referenced by foreign keys in other tables. When you delete these rows you must keep a copy of the keys of the rows you deleted and map them to the key of the "duplicate" row you kept. That way you have a lookup table you can use to correct all the foreign key references.
Even if you fix up the foreign key references first, it is good to keep them "just in case" you overlooked fixing up a table (which is really easy to do on a complex schema, or where DRI isn't perfect).
The optimum way to get a list of the rows you deleted is to use the OUTPUT clause on the DELETE statement (see Example B of the OUTPUT Clause (Transact-SQL) article in Books Online http://msdn.microsoft.com/en-us/library/ms177564.aspx )
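A minimal sketch of that OUTPUT-based approach. The mapping table `Dupe_Key_Map` and its column names are made up for illustration, and it assumes `Emp_Details` has been given an identity column `i` (as a later post in this thread suggests); NULLs in the grouping columns would need extra care in the join.

```sql
-- Hypothetical mapping table: which deleted key maps to which kept key.
create table Dupe_Key_Map (
    Old_Key int not null,  -- key of the duplicate row being deleted
    New_Key int not null   -- key of the matching row being kept
);

-- Delete all but the lowest-keyed row of each duplicate group, recording
-- the old-key -> kept-key mapping via the OUTPUT ... INTO clause.
delete d
output deleted.i, k.New_Key into Dupe_Key_Map (Old_Key, New_Key)
from Emp_Details d
join (select Emp_Name, Company, Join_Date, Resigned_Date, min(i) as New_Key
      from Emp_Details
      group by Emp_Name, Company, Join_Date, Resigned_Date) k
  on  k.Emp_Name      = d.Emp_Name
  and k.Company       = d.Company
  and k.Join_Date     = d.Join_Date
  and k.Resigned_Date = d.Resigned_Date
where d.i <> k.New_Key;
```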

Problem 2: Fixing DRI creates more duplicates.
Often the child table has multiple duplicate rows that point to duplicated parent rows. As you remove duplicate parent rows and change the foreign keys in the child table, you may discover that what you thought was "3 groups of duplicate child rows, each with 2 rows" becomes "1 group of 6 duplicate rows all pointing to the same parent row".
Thus you get a chicken-and-egg situation. Generally I've found it best to clean the parent, keep a list, use it to fix the child, then dedupe the child, keep a list, etc. This means you only have to touch each table once.
But beware: duplicate rows and dirty data are commonly found in the same databases. Often when you tidy up the data, especially if you set bad values to NULL, you may produce even more duplicate rows. So this can sometimes become an iterative process.
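The fix-the-child step could be sketched as below, assuming the keep/delete lookup table the post describes has been populated as `Dupe_Key_Map(Old_Key, New_Key)`, and that the child table and its foreign key column are `Child_Table.Emp_Id` (all names made up here):

```sql
-- Repoint child rows from each deleted parent key
-- to the key of the surviving duplicate.
update c
set    c.Emp_Id = m.New_Key
from   Child_Table c
join   Dupe_Key_Map m
  on   m.Old_Key = c.Emp_Id;
```

After this runs, the child table may itself contain newly visible duplicates, which is exactly the chicken-and-egg effect described above.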

The most common causes of duplicate rows:
1. Many-to-many reference tables that don't make the two foreign keys the primary key, but use an identity column instead.
2. Using identity columns for your keys and having no other alternate key with a unique constraint on the table.
3. Poor error handling and/or no transactions to protect you if your batch jobs fail.
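For cause 1, once the table is deduplicated the fix is to make the pair of foreign keys the key, so the duplicates cannot recur. A sketch with a made-up junction table:

```sql
-- Hypothetical many-to-many table between employees and projects:
-- make the two foreign keys the primary key instead of relying on
-- an identity column, so duplicate pairs are rejected at insert time.
alter table Emp_Project
    add constraint PK_Emp_Project primary key (Emp_Id, Project_Id);
```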
Have fun, Dave
nageshp
SSC Veteran (253 reputation)

Group: General Forum Members
Points: 253 Visits: 454
Good article. It helps us a lot.

Thanks,
Nagesh.
Md. Marufuzzaman
Grasshopper (20 reputation)

Group: General Forum Members
Points: 20 Visits: 2
Hi:
This is simply outstanding....

Regards,
Md. Marufuzzaman
saravanan vediyappan
Forum Newbie (7 reputation)

Group: General Forum Members
Points: 7 Visits: 53
It's a really good one that avoids creating an identity column, a WHILE loop, etc. Cool, it works for me :-)
tommyh
SSCommitted (1.6K reputation)

Group: General Forum Members
Points: 1560 Visits: 2000
Okay, this might come on a bit strong, but anyway.

Are you kidding me?

Having a table without anything to uniquely identify a row... okay, that's bad.

Okay, you say, let's create a primary key on all the columns. I wouldn't. All the values in those columns could change (and be NULL), and adding foreign keys from other tables to this (salary, maybe) would require a lot of redundant data... that could change. So no.

Now, since this is Emp_Details, there should be something like an Employee table, hopefully with something like an EmpID. Lacking that table, a

alter table Emp_Details add i integer identity
go
alter table Emp_Details add constraint PK_Emp_Details primary key clustered (i)


At least then you have something unique to identify a row.

Now, to fix the duplicates, all you need is:

delete from Emp_Details
where i not in (select Min(i)
from Emp_Details e2
group by e2.Emp_name, e2.Company, e2.Join_Date, e2.Resigned_Date)


Now I admit that this IS SLOWER. It is, however, shorter than:

delete from a
from (select Emp_Name, Company, Join_Date, Resigned_Date
,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
,Resigned_Date
order by Emp_Name, Company, Join_Date
,Resigned_Date) RowNumber
from Emp_Details) a
where a.RowNumber > 1



And personally, I think it's easier to read. Now add a unique index to this table as well, and this becomes a one-time thing because the problem won't happen again. Then performance might not be that important (depending on the number of rows in the table, of course).
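The unique index mentioned above might look like the sketch below (the index name is made up). Note that a SQL Server unique index treats NULLs as equal, which here is what you want: two rows that both have a NULL Resigned_Date but match on the other columns would still be rejected as duplicates.

```sql
-- Prevent the duplicate problem from recurring once the table is clean.
create unique index UX_Emp_Details_NoDupes
    on Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date);
```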
jun.merencilla
SSC Rookie (28 reputation)

Group: General Forum Members
Points: 28 Visits: 100
Good article, but I would not do this, simply because there is a more straightforward way.


select distinct *
into #Emp_Details
from Emp_Details

truncate table Emp_Details

insert into Emp_Details
select * from #Emp_Details



Though some of you may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.
David Lean
SSC Rookie (42 reputation)

Group: General Forum Members
Points: 42 Visits: 129

Interesting. I had assumed that his table had a PK, but he was ignoring it, as I'm often forced to. Why? Because it is an identity column: we know it is unique, but it is useless. Why is it useless for detecting duplicate rows? Because if you insert the same row multiple times, it will blindly add a new PK value.

Sometimes that is what you want, and other times things are broken.
E.g. for a grocery orders database, 6 consecutive rows each saying "Bottle of milk" is very likely. But for a toll booth solution, if the database shows the same car going through a toll booth 3 times in the same minute, those are likely to be duplicate rows.

Thus it is difficult to have a "one size fits all" solution. Clearly you could have a unique composite key on nearly every column, but that would effectively duplicate the entire table in the index; so unless it is a clustered index, it would be impractical. For the car example above, perhaps a unique index on an alternate key of number plate and time (to minute precision) would be sufficient.
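One way that alternate key could be sketched, with made-up table and column names: a persisted computed column truncating the crossing time to the minute, plus a unique constraint on plate and minute.

```sql
-- The dateadd/datediff-from-a-fixed-base idiom truncates a datetime
-- to minute precision.
alter table Toll_Crossing
    add Crossing_Minute as
        dateadd(minute, datediff(minute, 0, Crossing_Time), 0) persisted;

-- Two rows for the same plate within the same minute are now rejected.
alter table Toll_Crossing
    add constraint UQ_Toll_Plate_Minute unique (Number_Plate, Crossing_Minute);
```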
David Lean
SSC Rookie (42 reputation)

Group: General Forum Members
Points: 42 Visits: 129
jun.merencilla (9/22/2010)
Good article, but I would not do this, simply because there is a more straightforward way.


select distinct *
into #Emp_Details
from Emp_Details

truncate table Emp_Details

insert into Emp_Details
select * from #Emp_Details



Though some of you may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.


Not really practical for production:
1. You need to take the whole database offline while the table's contents vanish.
2. If it is a large table and you are using log-reader-based technology (i.e. replication, mirroring, Change Data Capture, log shipping, incremental loads into your SSAS cubes, etc.), this will fill your log and create a load on the downstream destinations.
3. You need to drop any Declarative Referential Integrity constraints and other things that schema binding will prevent, so you also need to do all that schema work before a TRUNCATE TABLE.

So perhaps, instead of TRUNCATE, consider using the MERGE command to remove those duplicate rows.
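One way that MERGE suggestion might look, assuming an identity column `i` has been added to Emp_Details so that individual rows can be addressed (a sketch, not the article's code):

```sql
-- Keep the lowest-keyed row of each duplicate group; the
-- NOT MATCHED BY SOURCE branch deletes every other row.
merge Emp_Details as t
using (select min(i) as Keep_Id
       from Emp_Details
       group by Emp_Name, Company, Join_Date, Resigned_Date) as s
on t.i = s.Keep_Id
when not matched by source then delete;
```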
Condorman
SSC Rookie (45 reputation)

Group: General Forum Members
Points: 45 Visits: 51
For readability's sake, if nothing else, I would probably wrap the RowNumber in a CTE and delete from there. It might even be more efficient, especially in the second example, where the PARTITION BY clause is not used and we are forced to execute the select statement twice.

Unfortunately, I don't have a SQL Server in front of me right now, so I'm not able to test that theory. It may well only hold for smaller recordsets.


;WITH cteDuplicateRows as (
    select
        Emp_Name
        , Company
        , Join_Date
        , Resigned_Date
        , RowNumber = ROW_NUMBER() over (
            partition by Emp_Name, Company, Join_Date, Resigned_Date
            order by Emp_Name, Company, Join_Date, Resigned_Date
        )
    from Emp_Details
)
delete cteDuplicateRows
where RowNumber > 1

go
