
Eliminating Duplicate Rows using The PARTITION BY clause
Posted Tuesday, September 21, 2010 9:49 PM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Thursday, June 27, 2013 9:51 AM
Points: 141, Visits: 113
Comments posted to this topic are about the item Eliminating Duplicate Rows using The PARTITION BY clause


Kindest Regards,

M Suresh Kumar

Post #990866
Posted Tuesday, September 21, 2010 11:20 PM
SSC Rookie


Group: General Forum Members
Last Login: Tuesday, August 27, 2013 4:57 AM
Points: 28, Visits: 124
Nice article and a nice solution. But I would caution anyone against blindly using this in production: it is only the first part of the process.

Problem 1: You need to fix Declarative Referential Integrity (DRI) for the rows you delete.
Invariably the duplicate rows you are about to remove will be referenced by foreign keys in another table. When you delete these rows you must keep a copy of the keys of the rows you deleted and map them to the key of the “duplicate” row you kept. That way you have a lookup table you can use to correct all the foreign key references.
Even if you fix up the foreign key references first, it is good to keep the list “just in case” you overlooked fixing up a table (which is really easy to do on a complex schema, or where DRI isn't perfect).
The optimal way to get a list of the rows you deleted is to use the OUTPUT clause on the DELETE statement (see Example B of the OUTPUT Clause (Transact-SQL) article in Books Online: http://msdn.microsoft.com/en-us/library/ms177564.aspx ).
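A minimal sketch of that capture step, assuming the identity column i suggested later in this thread and a hypothetical mapping table Deleted_Key_Map (both names are mine, not from the article):

-- Hypothetical mapping table: deleted key -> key of the surviving row.
CREATE TABLE Deleted_Key_Map (Old_Key int NOT NULL, Kept_Key int NOT NULL);

-- Delete every row except the lowest i in each duplicate group, and
-- record each deleted key alongside the key of the row that survives.
-- Note: equality joins skip groups where a column is NULL; those need
-- extra handling.
DELETE d
OUTPUT deleted.i, k.Kept_Key INTO Deleted_Key_Map (Old_Key, Kept_Key)
FROM Emp_Details d
JOIN (SELECT MIN(i) AS Kept_Key, Emp_Name, Company, Join_Date, Resigned_Date
      FROM Emp_Details
      GROUP BY Emp_Name, Company, Join_Date, Resigned_Date) k
  ON  k.Emp_Name      = d.Emp_Name
  AND k.Company       = d.Company
  AND k.Join_Date     = d.Join_Date
  AND k.Resigned_Date = d.Resigned_Date
WHERE d.i > k.Kept_Key;

The Deleted_Key_Map rows can then drive UPDATE statements against any child tables that referenced the deleted keys.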

Problem 2: Fixing DRI creates more duplicates.
Often the child table has multiple duplicate rows that point to duplicated parent rows. As you remove duplicate parent rows and repoint the foreign keys in the child table, you may discover that what you thought was “3 groups of duplicate child rows, each with 2 rows” becomes “1 group of 6 duplicate rows all pointing to the same parent row”.
Thus you get a chicken-and-egg situation. Generally I've found it best to clean the parent, keep a list, use it to fix the child; then dedupe the child, keep a list, and so on. This means you only have to touch each table once.
But beware: duplicate rows and dirty data are commonly found in the same databases. Often when you tidy up the data, especially if you set bad values to NULL, you may produce even more duplicate rows. So this can sometimes become an iterative process.

The most common causes of duplicate rows:
1. Many-to-many reference tables that don't make the two foreign keys the primary key but use an identity column instead (see the sketch after this list).
2. Using identity columns for your keys and having no other alternate key with a unique constraint on the table.
3. Poor error handling and/or no transactions to protect you if your batch jobs fail.
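As an illustration of causes 1 and 2, a hypothetical many-to-many table (the Employee and Project tables are assumed, not from the article) that prevents duplicate links by design:

-- The composite primary key on the two foreign keys makes a duplicate
-- link a constraint violation instead of a silent extra row.
CREATE TABLE Emp_Project
(
    Emp_ID     int NOT NULL REFERENCES Employee (Emp_ID),
    Project_ID int NOT NULL REFERENCES Project (Project_ID),
    CONSTRAINT PK_Emp_Project PRIMARY KEY (Emp_ID, Project_ID)
);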
Have fun, Dave
Post #990883
Posted Wednesday, September 22, 2010 12:09 AM
SSC Veteran


Group: General Forum Members
Last Login: Wednesday, August 27, 2014 10:15 AM
Points: 251, Visits: 449
Good article. Helps us a lot.

Thanks,
Nagesh.
Post #990890
Posted Wednesday, September 22, 2010 12:33 AM
Grasshopper


Group: General Forum Members
Last Login: Wednesday, September 22, 2010 10:25 AM
Points: 18, Visits: 2
Hi:
This is simply outstanding....

Regards,
Md. Marufuzzaman
Post #990891
Posted Wednesday, September 22, 2010 12:50 AM
Forum Newbie


Group: General Forum Members
Last Login: Sunday, February 9, 2014 7:39 PM
Points: 5, Visits: 53
It's a really good one that avoids creating an identity column, WHILE loops, etc. Cool, it works for me.
Post #990894
Posted Wednesday, September 22, 2010 12:50 AM
UDP Broadcaster


Group: General Forum Members
Last Login: Sunday, June 29, 2014 11:26 PM
Points: 1,481, Visits: 1,960
Okay, this might come off a bit strong, but anyway:

Are you kidding me?

Having a table without anything to uniquely identify a row... okay, that's bad.

Okay, you say, let's create a primary key on all the columns. I wouldn't. All values in those columns could change (and be NULL), and adding an FK from other tables to this (Salary, maybe) would require a LOT of redundant data... that could change. So no.

Now, since this is an Emp_Details table, there should be something like an Employee table, hopefully with something like an EmpID. Lacking that table:
ALTER TABLE Emp_Details ADD i integer IDENTITY;
GO
ALTER TABLE Emp_Details ADD CONSTRAINT PK_Emp_Details PRIMARY KEY CLUSTERED (i);

At least then you have something unique to identify a row.

Now, to fix the duplicates, all you need is:
DELETE FROM Emp_Details
WHERE i NOT IN (SELECT MIN(i)
                FROM Emp_Details e2
                GROUP BY e2.Emp_Name, e2.Company, e2.Join_Date, e2.Resigned_Date);

Now, I admit that this IS slower. It is, however, shorter than:
DELETE a
FROM (SELECT Emp_Name, Company, Join_Date, Resigned_Date,
             ROW_NUMBER() OVER (PARTITION BY Emp_Name, Company, Join_Date, Resigned_Date
                                ORDER BY Emp_Name, Company, Join_Date, Resigned_Date) AS RowNumber
      FROM Emp_Details) a
WHERE a.RowNumber > 1;

And personally, I think it's easier to read. Now add a unique index to this table as well, and this becomes a one-time fix, because the problem won't happen again. Then performance might not be that important (depending on the number of rows in the table, of course).
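A minimal sketch of that unique index (assuming the duplicates have already been removed, since creating it will otherwise fail):

-- Note: SQL Server treats NULLs as equal for uniqueness, so two
-- otherwise-identical rows with NULL Resigned_Date still collide,
-- which is exactly what we want here.
CREATE UNIQUE INDEX UQ_Emp_Details
    ON Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date);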

Post #990895
Posted Wednesday, September 22, 2010 1:44 AM
SSC Rookie


Group: General Forum Members
Last Login: Monday, March 3, 2014 9:03 PM
Points: 26, Visits: 100
Good article, but I would not do this, simply because there is a more straightforward way:

SELECT DISTINCT *
INTO #Emp_Details
FROM Emp_Details;

TRUNCATE TABLE Emp_Details;

INSERT INTO Emp_Details
SELECT * FROM #Emp_Details;

Though some of you may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.
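A slightly safer sketch of the same idea: wrapping the copy-truncate-reload in a transaction means a failure part-way through rolls everything back instead of leaving the table empty (TRUNCATE TABLE is fully transactional in SQL Server). This assumes the table has no identity column; with one, you would also need SET IDENTITY_INSERT.

BEGIN TRANSACTION;

SELECT DISTINCT *
INTO #Emp_Details
FROM Emp_Details;

-- Fails outright if any foreign keys reference Emp_Details.
TRUNCATE TABLE Emp_Details;

INSERT INTO Emp_Details
SELECT * FROM #Emp_Details;

COMMIT TRANSACTION;

DROP TABLE #Emp_Details;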
Post #990915
Posted Wednesday, September 22, 2010 2:36 AM
SSC Rookie


Group: General Forum Members
Last Login: Tuesday, August 27, 2013 4:57 AM
Points: 28, Visits: 124

Interesting. I had assumed that his table had a PK, but that he was ignoring it, as I'm often forced to. Why? Because it is an identity column: we know it is unique, but it is useless. Why is it useless for detecting duplicate rows? Because if you insert the same row multiple times, it will blindly add a new PK value each time.

Sometimes that is what you want, and other times things are broken.
E.g.: for a grocery orders database, 6 consecutive rows each saying "Bottle of milk" are very likely legitimate. But for a toll booth solution, if the database shows the same car going through a toll booth 3 times in the same minute, they are likely to be duplicate rows.

Thus it is difficult to have a "one size fits all" solution. Clearly you could have a unique composite key on nearly every column, but that would effectively duplicate the entire table in the index; so unless it is a clustered index, it would be impractical. For the car example above, perhaps a unique index on an alternate key of number plate and time (to one-minute precision) would be sufficient.
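A sketch of that alternate key, on a hypothetical Toll_Events table (the table and column names are illustrative): persist the event time truncated to the minute as a computed column, and index it together with the plate.

CREATE TABLE Toll_Events
(
    Event_ID     int IDENTITY PRIMARY KEY,
    Number_Plate varchar(10) NOT NULL,
    Event_Time   datetime    NOT NULL,
    -- Event time truncated to the minute; PERSISTED so it is indexable.
    Event_Minute AS DATEADD(minute, DATEDIFF(minute, 0, Event_Time), 0) PERSISTED
);

-- The same plate can only appear once per minute.
CREATE UNIQUE INDEX UQ_Toll_Events_Plate_Minute
    ON Toll_Events (Number_Plate, Event_Minute);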
Post #990934
Posted Wednesday, September 22, 2010 2:57 AM
SSC Rookie


Group: General Forum Members
Last Login: Tuesday, August 27, 2013 4:57 AM
Points: 28, Visits: 124
jun.merencilla (9/22/2010)
Good article, but I would not do this, simply because there is a more straightforward way:

SELECT DISTINCT *
INTO #Emp_Details
FROM Emp_Details;

TRUNCATE TABLE Emp_Details;

INSERT INTO Emp_Details
SELECT * FROM #Emp_Details;

Though some of you may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.


Not really practical for production:
1. You need to take the whole database offline while the table's contents vanish.
2. If it is a large table and you are using log-reader-based technology (i.e. replication, mirroring, Change Data Capture, log shipping, incremental loads into your SSAS cubes, etc.), this will fill your log and create a load on the downstream destinations.
3. You need to drop any Declarative Referential Integrity constraints and anything else that schema binding will prevent, so you also need to do all that schema work before a TRUNCATE TABLE.

So perhaps, instead of TRUNCATE, consider using the MERGE command to remove those duplicate rows.
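A minimal sketch of that MERGE-based alternative, assuming the identity column i suggested earlier in the thread: keep the lowest i in each duplicate group and delete every row that is not one of the keepers.

MERGE Emp_Details AS target
USING (SELECT MIN(i) AS i
       FROM Emp_Details
       GROUP BY Emp_Name, Company, Join_Date, Resigned_Date) AS keepers
   ON target.i = keepers.i
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

Because this only touches the duplicate rows rather than rewriting the whole table, it sidesteps the logging and DRI problems that come with TRUNCATE.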
Post #990941
Posted Wednesday, September 22, 2010 3:38 AM
SSC Rookie


Group: General Forum Members
Last Login: Friday, February 21, 2014 11:42 AM
Points: 43, Visits: 46
For readability's sake, if nothing else, I would probably wrap the ROW_NUMBER in a CTE and delete from there. It might even be more efficient, especially in the second example, where the PARTITION BY clause is not used and we are forced to execute the SELECT statement twice.

Unfortunately, I don't have a SQL Server in front of me right now, so I'm not able to test that theory. It may well only hold for smaller recordsets.

;WITH cteDuplicateRows AS (
    SELECT
        Emp_Name
      , Company
      , Join_Date
      , Resigned_Date
      , RowNumber = ROW_NUMBER() OVER (
            PARTITION BY Emp_Name, Company, Join_Date, Resigned_Date
            ORDER BY Emp_Name, Company, Join_Date, Resigned_Date
        )
    FROM Emp_Details
)
DELETE cteDuplicateRows
WHERE RowNumber > 1;

Post #990969