Eliminating Duplicate Rows using The PARTITION BY clause

  • Comments posted to this topic are about the item Eliminating Duplicate Rows using The PARTITION BY clause


    Kindest Regards,

    M Suresh Kumar

  • Nice article & nice solution. But I would caution anyone against blindly using this in production. It is only the first part of the process.

    Problem 1: You need to fix Declarative Referential Integrity (DRI) for the rows deleted.

    Invariably the duplicate rows you are about to remove will be referenced by foreign keys in another table. When you delete these rows you must keep a copy of the keys of the rows you deleted & map them to the key of the “duplicate” row you kept. That way you have a lookup table you can use to correct all the foreign key references.

    Even if you fix up the foreign key references first, it is good to keep that mapping “just in case” you overlooked fixing up a table (which is really easy to do on a complex schema or where DRI isn’t perfect).

    The optimum way to get a list of the rows you deleted is to use the OUTPUT clause in the DELETE statement (see Example B of the OUTPUT Clause (Transact-SQL) article in Books Online: http://msdn.microsoft.com/en-us/library/ms177564.aspx ).
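
    For example, a minimal sketch of that idea (not the article's code; it assumes Emp_Details has a hypothetical surrogate key column Emp_ID, and the column datatypes are guesses):

    -- Capture the keys of the rows being deleted so they can be mapped to the
    -- surviving "duplicate" row afterwards. @DeletedRows and Emp_ID are hypothetical.
    DECLARE @DeletedRows TABLE
    (
        Emp_ID        int,
        Emp_Name      varchar(100),
        Company       varchar(100),
        Join_Date     datetime,
        Resigned_Date datetime
    );

    DELETE FROM Emp_Details
    OUTPUT deleted.Emp_ID, deleted.Emp_Name, deleted.Company, deleted.Join_Date, deleted.Resigned_Date
    INTO @DeletedRows
    WHERE Emp_ID NOT IN (SELECT MIN(Emp_ID)
                         FROM Emp_Details
                         GROUP BY Emp_Name, Company, Join_Date, Resigned_Date);

    -- Join @DeletedRows back to the surviving rows on the duplicate-defining columns
    -- to build the deleted-key-to-kept-key lookup described above.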

    Problem 2: Fixing DRI creates more duplicates.

    Often the child table has multiple duplicate rows that point to duplicated parent rows. As you remove duplicate parent rows & change the foreign keys in the child table, you may discover that what you thought was “3 groups of duplicate child rows, each with 2 rows” becomes “1 group of 6 duplicate rows all pointing to the same parent row”.

    Thus you get a chicken & egg situation. Generally I’ve found it best to clean the parent, keep a list, and use it to fix the child (sketched below). Then dedupe the child, keep a list, etc. This means you only have to touch each table once.
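
    A minimal sketch of that fix-up step, assuming the mapping was saved to a hypothetical permanent table Emp_Key_Map (Deleted_Emp_ID, Kept_Emp_ID) and a hypothetical child table Emp_Salary with an Emp_ID foreign key:

    -- Repoint child rows from deleted parent keys to the kept parent key.
    -- Emp_Key_Map and Emp_Salary are hypothetical names used only for illustration.
    UPDATE s
    SET    s.Emp_ID = m.Kept_Emp_ID
    FROM   Emp_Salary AS s
           INNER JOIN Emp_Key_Map AS m
                   ON m.Deleted_Emp_ID = s.Emp_ID;

    -- After this, dedupe Emp_Salary itself (it may now contain new duplicate groups),
    -- keep its own deleted-to-kept mapping, and move on to its children.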

    But beware: duplicate rows & dirty data are commonly found in the same databases. Often when you tidy up the data, especially if you set bad values to NULL, you may produce even more duplicate rows. So this can sometimes become an iterative process.

    The most common causes of duplicate rows:

    1. Many-to-many reference tables that don’t make the two foreign keys the primary key but use an identity column instead (see the sketch after this list).

    2. Using identity columns for your keys & having no alternate key with a unique constraint on the table.

    3. Poor error handling and/or no transactions to protect you if your batch jobs fail.
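
    A minimal illustration of cause 1 (the table and column names here are hypothetical): making the two foreign key columns the primary key means the same pairing can never be inserted twice, whereas an identity-only primary key would happily allow it.

    -- Emp_ID and Project_ID would be foreign keys to their parent tables; declaring
    -- the pair as the primary key blocks duplicate pairings at the source.
    CREATE TABLE Emp_Project
    (
        Emp_ID     int NOT NULL,
        Project_ID int NOT NULL,
        CONSTRAINT PK_Emp_Project PRIMARY KEY CLUSTERED (Emp_ID, Project_ID)
    );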

    Have fun, Dave

  • Good article. Helps us a lot.

    Thanks,

    Nagesh.

  • Hi:

    This is simply outstanding....

    Regards,

    Md. Marufuzzaman

  • It's a really good one that avoids creating an identity column, a WHILE loop, etc. Cool, it works for me :-)

  • Okay, this might come on a bit strong, but anyway.

    Are you kidding me?

    Having a table without anything to uniquely identify a row... okay, that's bad.

    Okay, you say, let's create a primary key on all the columns. I wouldn't. All values in those columns could change (and be NULL), and adding a foreign key from other tables to this (Salary, maybe) would require a lot of redundant data... that could change. So no.

    Now, since this is Emp_Details, there should be something like an Employee table with, hopefully, something like an EmpID. Lacking that table, something like

    alter table Emp_Details add i integer identity

    go

    alter table Emp_Details add constraint PK_Emp_Details primary key clustered (i)

    at least gives you something unique to identify a row.

    Now, to fix the duplicates, all you need would be:

    delete from Emp_Details

    where i not in (select Min(i)

    from Emp_Details e2

    group by e2.Emp_name, e2.Company, e2.Join_Date, e2.Resigned_Date)

    Now I admit that this IS SLOWER. It is, however, shorter than

    delete from a

    from (select Emp_Name, Company, Join_Date, Resigned_Date

    ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date

    order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber

    from Emp_Details) a

    where a.RowNumber > 1

    And personally I think it's easier to read. Now add a unique index to this table as well and this is a one-time thing, because the problem won't happen again. And then performance might not be that important (depending on the number of rows in the table, of course).
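
    For instance, a minimal sketch of that unique constraint (column names taken from the examples above; whether these columns really define uniqueness is a business decision):

    -- Prevent the duplicates from reappearing once the table is clean. Note that a
    -- UNIQUE constraint in SQL Server treats NULLs as equal, so only one row per
    -- NULL combination is allowed; whether that is acceptable is a design call.
    ALTER TABLE Emp_Details
    ADD CONSTRAINT UQ_Emp_Details_NaturalKey
        UNIQUE (Emp_Name, Company, Join_Date, Resigned_Date);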

  • Good article, but I would not do this, simply because there is a more straightforward way.

    select distinct *

    into #Emp_Details

    from Emp_Details

    truncate table Emp_Details

    insert into Emp_Details

    select * from #Emp_Details

    Though some of you guys may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.

  • Interesting. I had assumed that his table had a PK but he was ignoring it, as I'm often forced to. Why? Because it is an identity column: we know it is unique, but it is useless. Why is it useless in detecting duplicate rows? Because if you insert the same row multiple times it will blindly add a new PK value.

    Sometimes it is what you want & other times things are broken.

    e.g.: For a grocery orders database, 6 consecutive rows each saying "Bottle of milk" is very likely. But for a toll booth solution, if the database shows the same car going through a toll booth 3 times in the same minute, they are likely to be duplicate rows.

    Thus it is difficult to have a "one size fits all" solution. Clearly you could have a unique composite key on nearly every column, but that would effectively duplicate the entire table in the index, so unless it is a clustered index it would be impractical. For the car example above, perhaps a unique index on an alternate key of number plate & time (to minute precision) would be sufficient.
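
    A minimal sketch of that last idea (the table and column names, TollCrossings, NumberPlate and CrossingTime, are hypothetical):

    -- Round the crossing time down to the minute in a computed column, then make
    -- plate + minute unique so a duplicate crossing within the same minute is rejected.
    ALTER TABLE TollCrossings
    ADD CrossingMinute AS DATEADD(minute, DATEDIFF(minute, 0, CrossingTime), 0) PERSISTED;

    CREATE UNIQUE INDEX UX_TollCrossings_Plate_Minute
        ON TollCrossings (NumberPlate, CrossingMinute);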

  • jun.merencilla (9/22/2010)


    Good article, but I would not do this, simply because there is a more straightforward way.

    select distinct *

    into #Emp_Details

    from Emp_Details

    truncate table Emp_Details

    insert into Emp_Details

    select * from #Emp_Details

    Though some of you guys may raise your eyebrows the moment you see temp tables, this code gets the job done without resorting to complex code.

    Not really practical for production.

    1. You need to take the whole database offline while the table's contents have vanished.

    2. If it is a large table & you are using log-reader-based technology (i.e. Replication, Mirroring, Change Data Capture, Log Shipping, incremental loads into your SSAS cubes, etc.) this will fill your log & create a load on the downstream destinations.

    3. You need to drop any Declarative Referential Integrity constraints & anything else that schema binding will prevent, so all of that schema work has to happen before a TRUNCATE TABLE.

    So perhaps, instead of a truncate, consider using the MERGE command to remove those duplicate rows.
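
    A minimal sketch of what that MERGE might look like (again assuming a hypothetical surrogate key column Emp_ID on Emp_Details):

    -- Delete every row whose ROW_NUMBER within its duplicate group is greater than 1,
    -- without truncating the table. Emp_ID is a hypothetical surrogate key.
    MERGE INTO Emp_Details AS target
    USING (
        SELECT Emp_ID,
               ROW_NUMBER() OVER (PARTITION BY Emp_Name, Company, Join_Date, Resigned_Date
                                  ORDER BY Emp_ID) AS RowNumber
        FROM Emp_Details
    ) AS source
        ON target.Emp_ID = source.Emp_ID
    WHEN MATCHED AND source.RowNumber > 1 THEN
        DELETE;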

  • For readability's sake if nothing else, I would probably wrap the RowNumber in a CTE and delete from there. It might even be more efficient, especially in the second example where the PARTITION BY clause is not used and we are forced to execute the select statement twice.

    Unfortunately, I don't have a SQL Server in front of me right now, so I'm not able to test that theory. It may only hold for smaller recordsets.

    ;

    WITH cteDuplicateRows as (

    select

    Emp_Name

    , Company

    , Join_Date

    , Resigned_Date

    , RowNumber = ROW_NUMBER() over (

    partition by Emp_Name, Company, Join_Date, Resigned_Date

    order by Emp_Name, Company, Join_Date, Resigned_Date

    )

    from Emp_Details

    )

    Delete cteDuplicateRows

    where RowNumber > 1

  • A minor quibble on the ROW_NUMBER() examples: the ORDER BY clause is pointless since it repeats the PARTITION BY expression list. The ORDER BY list controls the order within each partition, not the whole result set, so none of the expressions in the PARTITION BY list should appear in the ORDER BY list since all rows in a partition share the same values for those expressions.

    Since ROW_NUMBER() requires an ORDER BY in its OVER clause you can't simply remove it, but you could adapt the examples to ORDER BY something that is guaranteed to be unique within a partition, e.g. if Emp_Details has a surrogate key column Emp_ID which is unique you could change your first example to

    select Emp_Name

    ,Company

    ,Join_Date

    ,Resigned_Date

    ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date

    order by Emp_ID) RowNumber

    from Emp_Details

  • It can be re-written like this:

    WITH CTE AS

    (

    SELECT Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() OVER (ORDER BY Emp_Name, Company, Join_Date, Resigned_Date) AS ROWNUM

    FROM #Emp_Details

    )

    DELETE FROM CTE

    WHERE ROWNUM NOT IN (SELECT MAX(ROWNUM) FROM CTE

    GROUP BY Emp_Name, Company, Join_Date, Resigned_Date)

  • Good article.

    Not sure that on larger datasets and/or busier servers one would want to DELETE/TRUNCATE records. The risk with DELETE/TRUNCATE is that you change your original content, and thus you are unable to recover easily should you discover a problem with the query (that never happens, right? 😉). Also, DELETE will generate log records, which can be quite substantial on larger datasets.

    Why not roll forward with a SELECT INTO statement? It preserves the original dataset in the process. Once the results are satisfactory, the old table can be swapped out without the users ever experiencing an outage.
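
    A minimal sketch of that roll-forward approach (the new table name and the sp_rename swap are assumptions, not the article's code; indexes, constraints and permissions would have to be re-created on the new table):

    -- Build a deduplicated copy alongside the original, leaving Emp_Details untouched.
    SELECT Emp_Name, Company, Join_Date, Resigned_Date
    INTO   Emp_Details_Clean
    FROM  (SELECT Emp_Name, Company, Join_Date, Resigned_Date,
                  ROW_NUMBER() OVER (PARTITION BY Emp_Name, Company, Join_Date, Resigned_Date
                                     ORDER BY Emp_Name) AS RowNumber
           FROM Emp_Details) AS d
    WHERE  RowNumber = 1;

    -- Once the results check out, swap the tables; the original stays around for recovery.
    EXEC sp_rename 'Emp_Details', 'Emp_Details_Old';
    EXEC sp_rename 'Emp_Details_Clean', 'Emp_Details';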

    my 2 cents

  • This appears to be one of those articles that creates a problem based on poor design and purports to correct it using some bloated functionality of SQL Server. If the table were properly designed with uniqueness constraints, the problem would not exist. Allowing duplicate data into the table in the first place is the problem, not fixing it after the fact. The more likely scenario and the one I have seen most often comes from importing data from poorly designed databases or poorly trained users.

    One problem I recall in particular involved importing about 7,000 addresses from a client whose data entry people had cleverly but clumsily forced data into the table in violation of uniqueness constraints. Apparently, the data entry people, upon receiving some message that was totally cryptic to them about "duplicate keys", simply changed a small piece of information to force the data into the table. For example, one address appeared multiple times by simply changing "Street" to "St.", "Str.", "St", "Str", "ST" and even "S t r e e t". That is the kind of duplication that is a problem, not the oversimplified example of this article.

    It seems to me we have entered a phase of SQL Server functions in search of problems to justify their cost and footprint, no matter how contrived those problems may be.

  • @trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.
