Eliminating Duplicate Rows using The PARTITION BY clause

  • Comments posted to this topic are about the item Eliminating Duplicate Rows using The PARTITION BY clause


    Kindest Regards,

    M Suresh Kumar

  • Nice article & nice solution. But I would caution anyone against blindly using this in production. It is only the first part of the process.

    Problem 1: You need to fix Declarative Referential Integrity (DRI) for the rows deleted.

    Invariably the duplicate rows you are about to remove will be referenced by foreign keys in another table. When you delete these rows you must keep a copy of the keys of the rows you deleted & map them to the key of the “duplicate” row you kept. That way you have a lookup table you can use to correct all the foreign key references.

    Even if you fix up the foreign key references first, it is good to keep that mapping “just in case” you overlooked fixing up a table (which is really easy to do on a complex schema or where DRI isn’t perfect).

    The optimum way to get a list of the rows you deleted is to use the OUTPUT clause in the DELETE statement (see Example B of the OUTPUT Clause (Transact-SQL) article in Books Online: http://msdn.microsoft.com/en-us/library/ms177564.aspx ).
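
    For example, a minimal sketch of that idea (not the article's code; it assumes Emp_Details has a hypothetical surrogate key column Emp_ID, and the column datatypes are guesses):

    -- Capture the keys of the rows being deleted so they can be mapped to the
    -- surviving "duplicate" row afterwards. @DeletedRows and Emp_ID are hypothetical.
    DECLARE @DeletedRows TABLE
    (
        Emp_ID        int,
        Emp_Name      varchar(100),
        Company       varchar(100),
        Join_Date     datetime,
        Resigned_Date datetime
    );

    DELETE FROM Emp_Details
    OUTPUT deleted.Emp_ID, deleted.Emp_Name, deleted.Company, deleted.Join_Date, deleted.Resigned_Date
    INTO @DeletedRows
    WHERE Emp_ID NOT IN (SELECT MIN(Emp_ID)
                         FROM Emp_Details
                         GROUP BY Emp_Name, Company, Join_Date, Resigned_Date);

    -- Join @DeletedRows back to the surviving rows on the duplicate-defining columns
    -- to build the deleted-key-to-kept-key lookup described above.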

    Problem 2: Fixing DRI creates more duplicates.

    Often the child table has multiple duplicate rows that point to duplicated parent rows. As you remove duplicate parent rows & change the foreign keys in the child table, you may discover that what you thought was “3 groups of duplicate child rows, each with 2 rows” becomes “1 group of 6 duplicate rows all pointing to the same parent row”.

    Thus you get a chicken & egg situation. Generally I’ve found it best to clean the parent, keep a list, and use it to fix the child (sketched below). Then dedupe the child, keep a list, etc. This means you only have to touch each table once.
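
    A minimal sketch of that fix-up step, assuming the mapping was saved to a hypothetical permanent table Emp_Key_Map (Deleted_Emp_ID, Kept_Emp_ID) and a hypothetical child table Emp_Salary with an Emp_ID foreign key:

    -- Repoint child rows from deleted parent keys to the kept parent key.
    -- Emp_Key_Map and Emp_Salary are hypothetical names used only for illustration.
    UPDATE s
    SET    s.Emp_ID = m.Kept_Emp_ID
    FROM   Emp_Salary AS s
           INNER JOIN Emp_Key_Map AS m
                   ON m.Deleted_Emp_ID = s.Emp_ID;

    -- After this, dedupe Emp_Salary itself (it may now contain new duplicate groups),
    -- keep its own deleted-to-kept mapping, and move on to its children.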

    But beware: duplicate rows & dirty data are commonly found in the same databases. Often when you tidy up the data, especially if you set bad values to NULL, you may produce even more duplicate rows. So this can sometimes become an iterative process.

    The most common causes of duplicate rows:

    1. Many-to-many reference tables that don’t make the two foreign keys the primary key but use an identity column instead (see the sketch after this list).

    2. Using identity columns for your keys & having no alternate key with a unique constraint on the table.

    3. Poor error handling and/or no transactions to protect you if your batch jobs fail.
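
    A minimal illustration of cause 1 (the table and column names here are hypothetical): making the two foreign key columns the primary key means the same pairing can never be inserted twice, whereas an identity-only primary key would happily allow it.

    -- Emp_ID and Project_ID would be foreign keys to their parent tables; declaring
    -- the pair as the primary key blocks duplicate pairings at the source.
    CREATE TABLE Emp_Project
    (
        Emp_ID     int NOT NULL,
        Project_ID int NOT NULL,
        CONSTRAINT PK_Emp_Project PRIMARY KEY CLUSTERED (Emp_ID, Project_ID)
    );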

    Have fun, Dave

  • Good article. Helps us a lot.

    Thanks,

    Nagesh.

  • Hi:

    This is simply outstanding....

    Regards,

    Md. Marufuzzaman

  • It's a really good one that avoids creating an identity column, a WHILE loop, etc. Cool, it works for me :-)

  • Okay, this might come on a bit strong, but anyway.

    Are you kidding me?

    Having a table without anything to uniquely identify a row... okay, that's bad.

    Okay, you say, let's create a primary key on all the columns. I wouldn't. All values in those columns could change (and be NULL), and adding a foreign key from other tables to this (Salary, maybe) would require a lot of redundant data... that could change. So no.

    Now, since this is Emp_Details, there should be something like an Employee table with, hopefully, something like an EmpID. Lacking that table, something like

    alter table Emp_Details add i integer identity

    go

    alter table Emp_Details add constraint PK_Emp_Details primary key clustered (i)

    at least gives you something unique to identify a row.

    Now, to fix the duplicates, all you need would be:

    delete from Emp_Details

    where i not in (select Min(i)

    from Emp_Details e2

    group by e2.Emp_name, e2.Company, e2.Join_Date, e2.Resigned_Date)

    Now I admit that this IS SLOWER. It is, however, shorter than

    delete from a

    from (select Emp_Name, Company, Join_Date, Resigned_Date

    ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date

    order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber

    from Emp_Details) a

    where a.RowNumber > 1

    And personally I think it's easier to read. Now add a unique index to this table as well and this is a one-time thing, because the problem won't happen again. And then performance might not be that important (depending on the number of rows in the table, of course).
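
    For instance, a minimal sketch of that unique constraint (column names taken from the examples above; whether these columns really define uniqueness is a business decision):

    -- Prevent the duplicates from reappearing once the table is clean. Note that a
    -- UNIQUE constraint in SQL Server treats NULLs as equal, so only one row per
    -- NULL combination is allowed; whether that is acceptable is a design call.
    ALTER TABLE Emp_Details
    ADD CONSTRAINT UQ_Emp_Details_NaturalKey
        UNIQUE (Emp_Name, Company, Join_Date, Resigned_Date);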

  • Good article, but I would not do this, simply because there is a more straightforward way.

    select distinct *

    into #Emp_Details

    from Emp_Details

    truncate table Emp_Details

    insert into Emp_Details

    select * from #Emp_Details

    Though some of you guys may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.

  • Interesting. I had assumed that his table had a PK but he was ignoring it, as I'm often forced to. Why? Because it is an identity column: we know it is unique, but it is useless. Why is it useless in detecting duplicate rows? Because if you insert the same row multiple times it will blindly add a new PK value.

    Sometimes it is what you want & other times things are broken.

    e.g.: For a grocery orders database, 6 consecutive rows each saying "Bottle of milk" is very likely. But for a toll booth solution, if the database shows the same car going through a toll booth 3 times in the same minute, they are likely to be duplicate rows.

    Thus it is difficult to have a "one size fits all" solution. Clearly you could have a unique composite key on nearly every column, but that would effectively duplicate the entire table in the index, so unless it is a clustered index it would be impractical. For the car example above, perhaps a unique index on an alternate key of number plate & time (to minute precision) would be sufficient.
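
    A minimal sketch of that last idea (the table and column names, TollCrossings, NumberPlate and CrossingTime, are hypothetical):

    -- Round the crossing time down to the minute in a computed column, then make
    -- plate + minute unique so a duplicate crossing within the same minute is rejected.
    ALTER TABLE TollCrossings
    ADD CrossingMinute AS DATEADD(minute, DATEDIFF(minute, 0, CrossingTime), 0) PERSISTED;

    CREATE UNIQUE INDEX UX_TollCrossings_Plate_Minute
        ON TollCrossings (NumberPlate, CrossingMinute);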

  • jun.merencilla (9/22/2010)


    Good article, but I would not do this, simply because there is a more straightforward way.

    select distinct *

    into #Emp_Details

    from Emp_Details

    truncate table Emp_Details

    insert into Emp_Details

    select * from #Emp_Details

    Though some of you guys may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.

    Not really practical for production.

    1. You need to take the whole database offline while the table's contents have vanished.

    2. If it is a large table & you are using log-reader-based technology (i.e. Replication, Mirroring, Change Data Capture, Log Shipping, incremental loads into your SSAS cubes, etc.) this will fill your log & create a load on the downstream destinations.

    3. You need to drop any Declarative Referential Integrity constraints & anything else that schema binding will prevent, so all of that schema work has to happen before a TRUNCATE TABLE.

    So perhaps, instead of a truncate, consider using the MERGE command to remove those duplicate rows.
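
    A minimal sketch of what that MERGE might look like (again assuming a hypothetical surrogate key column Emp_ID on Emp_Details):

    -- Delete every row whose ROW_NUMBER within its duplicate group is greater than 1,
    -- without truncating the table. Emp_ID is a hypothetical surrogate key.
    MERGE INTO Emp_Details AS target
    USING (
        SELECT Emp_ID,
               ROW_NUMBER() OVER (PARTITION BY Emp_Name, Company, Join_Date, Resigned_Date
                                  ORDER BY Emp_ID) AS RowNumber
        FROM Emp_Details
    ) AS source
        ON target.Emp_ID = source.Emp_ID
    WHEN MATCHED AND source.RowNumber > 1 THEN
        DELETE;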

  • For readability's sake if nothing else, I would probably wrap the RowNumber in a CTE and delete from there. It might even be more efficient, especially in the second example where the PARTITION BY clause is not used and we are forced to execute the select statement twice.

    Unfortunately, I don't have a SQL Server in front of me right now, so I'm not able to test that theory. It may only hold for smaller recordsets.

    ;

    WITH cteDuplicateRows as (

    select

    Emp_Name

    , Company

    , Join_Date

    , Resigned_Date

    , RowNumber = ROW_NUMBER() over (

    partition by Emp_Name, Company, Join_Date, Resigned_Date

    order by Emp_Name, Company, Join_Date, Resigned_Date

    )

    from Emp_Details

    )

    Delete cteDuplicateRows

    where RowNumber > 1

  • A minor quibble on the ROW_NUMBER() examples: the ORDER BY clause is pointless since it repeats the PARTITION BY expression list. The ORDER BY list controls the order within each partition, not the whole result set, so none of the expressions in the PARTITION BY list should appear in the ORDER BY list since all rows in a partition share the same values for those expressions.

    Since ROW_NUMBER() requires an ORDER BY in its OVER clause you can't simply remove it, but you could adapt the examples to ORDER BY something that is guaranteed to be unique within a partition, e.g. if Emp_Details has a surrogate key column Emp_ID which is unique you could change your first example to

    select Emp_Name

    ,Company

    ,Join_Date

    ,Resigned_Date

    ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date

    order by Emp_ID) RowNumber

    from Emp_Details

  • It can be re-written like this:

    WITH CTE AS

    (

    SELECT Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() OVER (ORDER BY Emp_Name, Company, Join_Date, Resigned_Date) AS ROWNUM

    FROM #Emp_Details

    )

    DELETE FROM CTE

    WHERE ROWNUM NOT IN (SELECT MAX(ROWNUM) FROM CTE

    GROUP BY Emp_Name, Company, Join_Date, Resigned_Date)

  • Good article.

    Not sure that on larger datasets and/or busier servers one would want to DELETE/TRUNCATE records. The risk with DELETE/TRUNCATE is that you change your original content, and thus you are unable to recover easily should you discover a problem with the query (that never happens, right? 😉). Also, DELETE will generate log records, which can be quite substantial on larger datasets.

    Why not roll forward with a SELECT INTO statement? It preserves the original dataset in the process. Once the results are satisfactory, the old table can be swapped out without the users ever experiencing an outage.
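
    A minimal sketch of that roll-forward approach (the new table name and the sp_rename swap are assumptions, not the article's code; indexes, constraints and permissions would have to be re-created on the new table):

    -- Build a deduplicated copy alongside the original, leaving Emp_Details untouched.
    SELECT Emp_Name, Company, Join_Date, Resigned_Date
    INTO   Emp_Details_Clean
    FROM  (SELECT Emp_Name, Company, Join_Date, Resigned_Date,
                  ROW_NUMBER() OVER (PARTITION BY Emp_Name, Company, Join_Date, Resigned_Date
                                     ORDER BY Emp_Name) AS RowNumber
           FROM Emp_Details) AS d
    WHERE  RowNumber = 1;

    -- Once the results check out, swap the tables; the original stays around for recovery.
    EXEC sp_rename 'Emp_Details', 'Emp_Details_Old';
    EXEC sp_rename 'Emp_Details_Clean', 'Emp_Details';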

    my 2 cents

  • This appears to be one of those articles that creates a problem based on poor design and purports to correct it using some bloated functionality of SQL Server. If the table were properly designed with uniqueness constraints, the problem would not exist. Allowing duplicate data into the table in the first place is the problem, not fixing it after the fact. The more likely scenario and the one I have seen most often comes from importing data from poorly designed databases or poorly trained users.

    One problem I recall in particular involved importing about 7,000 addresses from a client whose data entry people had cleverly but clumsily forced data into the table in violation of uniqueness constraints. Apparently, the data entry people, upon receiving some message that was totally cryptic to them about "duplicate keys", simply changed a small piece of information to force the data into the table. For example, one address appeared multiple times by simply changing "Street" to "St.", "Str.", "St", "Str", "ST" and even "S t r e e t". That is the kind of duplication that is a problem, not the oversimplified example of this article.

    It seems to me we have entered a phase of SQL Server functions in search of problems to justify their cost and footprint, no matter how contrived those problems may be.

  • @trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.
