Deduping Data in SQL Server 2005

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/chawkins/dedupingdatainsqlserver2005.asp

  • When you are populating the record set, take out the GO between the next-to-last and the last INSERT statements. GO ends the batch, so this penultimate GO takes the @NOW variable out of scope and causes the last INSERT to fail.
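    A minimal sketch of the scoping issue described above. The sample values are hypothetical, and the column list (EMPID, FNAME, LNAME, REFDATE) is assumed from the names used elsewhere in this thread rather than taken from the article's script:

    ```sql
    -- Local variables live only until the end of the batch that declares them.
    DECLARE @NOW DATETIME;
    SET @NOW = GETDATE();

    -- Both inserts stay in the same batch, so @NOW is still in scope for each.
    INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME, REFDATE)
    VALUES (1, 'Jim', 'Smith', @NOW);

    INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME, REFDATE)
    VALUES (1, 'Jim', 'Smith', DATEADD(dd, 1, @NOW));
    GO
    -- Any statement after this GO that references @NOW fails with
    -- "Must declare the scalar variable "@NOW"."
    ```

    Putting a GO between the two inserts would move the second one into a new batch, which is the failure reported here.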

  • nice example

    Here is another :

    WITH cteEmployeeOrderedByMyRowNumber AS
    (SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS MyRowNumber
          , ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
          , *
     FROM EMPLOYEE
     -- WHERE 1 = 1
    )
    DELETE FROM cteEmployeeOrderedByMyRowNumber
    WHERE PartitionRank > 1 ;

     

    Just to get some practice with it.

     

    Johan


  • Hi...

     

    I usually clean up my "mess" with "something" like this

     

    DELETE FROM myTable
    WHERE myID NOT IN
          (SELECT MIN(myID) FROM myTable GROUP BY myUniqueField)

     

    But I think the best application for CTEs is recursive queries...

  • Is there a performance gain to using this function and technique?  Or is it just one of those "hey cool function --- let's use it"? 

    aries

  • great examples!

     

    If only duplicates need to be removed, ROW_NUMBER() may not be needed; RANK() can be used instead.

    WITH cteEmployeeOrderedByMyRank AS
    (SELECT RANK() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
          , *
     FROM EMPLOYEE
     -- WHERE 1 = 1
    )
    DELETE FROM cteEmployeeOrderedByMyRank
    WHERE PartitionRank > 1 ;

     

    It surely seems to be much faster than the cursor-based approach.
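    One caveat worth checking before swapping in RANK(): rows that tie on REFDATE within a partition all receive the same rank, so exact duplicates sharing the earliest REFDATE would all get rank 1 and survive the delete. The sketch below uses hypothetical sample rows (not data from the article) to show the two functions side by side:

    ```sql
    -- Hypothetical rows: two share the same REFDATE, one is a day later.
    WITH SampleRows AS
    (SELECT 1 AS EMPID, 'Jim' AS FNAME, 'Smith' AS LNAME,
            CAST('20070101' AS DATETIME) AS REFDATE
     UNION ALL
     SELECT 1, 'Jim', 'Smith', CAST('20070101' AS DATETIME)
     UNION ALL
     SELECT 1, 'Jim', 'Smith', CAST('20070102' AS DATETIME)
    )
    SELECT EMPID, FNAME, LNAME, REFDATE
         , RANK()       OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
         , ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRowNumber
    FROM SampleRows;
    -- The two tied rows both get PartitionRank = 1, while ROW_NUMBER()
    -- numbers them 1 and 2, so only ROW_NUMBER() separates them.
    ```

    If REFDATE is known to be unique within each EMPID, FNAME, LNAME group, the two functions number the rows identically and either one works for the delete.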


    bm21

  • Interesting demonstration of the ROW_NUMBER() function. Please say it ain’t so, Joe! Tell me you are not using cursors to remove duplicate rows. Even a SELECT DISTINCT into a temporary table would be a better option. As other readers have commented, there are a number of ways to remove duplicate rows. This would be my approach:

     

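    -- Delete every row that is later than the earliest RefDate for the same
    -- Empid, FName, LName. The target is named by its alias (a).
    DELETE a
    FROM Employee a
    INNER JOIN (SELECT Empid,
                       FName,
                       LName,
                       MIN(RefDate) AS MinDate
                FROM Employee
                GROUP BY Empid, FName, LName) b
            ON a.Empid = b.Empid
           AND a.FName = b.FName
           AND a.LName = b.LName
           AND a.RefDate > b.MinDate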

     

    This would still leave the issue of James versus Jim, which would need to be resolved separately. If you didn’t care about spelling variations and wanted to assume that the first entry was the correct one, then this would work:

     

    -- Same idea, but grouped on Empid only: keep the earliest row per Empid.
    DELETE a
    FROM Employee a
    INNER JOIN (SELECT Empid,
                       MIN(RefDate) AS MinDate
                FROM Employee
                GROUP BY Empid) b
            ON a.Empid = b.Empid
           AND a.RefDate > b.MinDate

     

    I would be interested in how the two techniques compare on performance, but I’d put my money on mine, which I suspect has a whole lot less overhead, even with the self-join, than having the engine generate a row position.
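    One way to put numbers on that question is to run both deletes against a copy of the data with statistics output turned on. This is only a sketch: it assumes an EMPLOYEE table with the columns used in this thread, the CTE name cteRanked is made up for illustration, and each delete is rolled back so both start from the same rows. No results are implied here; they would depend on data volume and indexing.

    ```sql
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    GO

    -- Candidate 1: ROW_NUMBER() inside a CTE.
    BEGIN TRANSACTION;

    WITH cteRanked AS
    (SELECT ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
          , *
     FROM EMPLOYEE
    )
    DELETE FROM cteRanked
    WHERE PartitionRank > 1;

    ROLLBACK TRANSACTION;
    GO

    -- Candidate 2: self-join against MIN(REFDATE) per group.
    BEGIN TRANSACTION;

    DELETE a
    FROM EMPLOYEE a
    INNER JOIN (SELECT EMPID, FNAME, LNAME, MIN(REFDATE) AS MinDate
                FROM EMPLOYEE
                GROUP BY EMPID, FNAME, LNAME) b
            ON a.EMPID = b.EMPID
           AND a.FNAME = b.FNAME
           AND a.LNAME = b.LNAME
           AND a.REFDATE > b.MinDate;

    ROLLBACK TRANSACTION;
    GO

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    GO
    ```

    The logical reads and elapsed times appear in the Messages output for each batch. Note that the two statements are not strictly equivalent: the join version keeps every row that shares the minimum REFDATE, while the ROW_NUMBER() version keeps exactly one row per group.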

     

     

  • Another option (that works with SQL 2000) and even in cases of all the columns having the same value (no column to differentiate the rows), is inserting the result set into a new table with an identity column (or adding an identity column to the original table). After that, it's just a matter of keeping the distinct rows as shown in the comments.
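    A rough sketch of the identity-column variant, assuming for illustration that EMPLOYEE consists of exactly the columns EMPID, FNAME, LNAME and REFDATE. RowID is a hypothetical name for the helper column:

    ```sql
    -- Add a helper identity column; SQL Server numbers the existing rows.
    ALTER TABLE EMPLOYEE ADD RowID INT IDENTITY(1,1) NOT NULL;
    GO

    -- Keep the lowest RowID in each group of otherwise identical rows
    -- and delete the rest.
    DELETE FROM EMPLOYEE
    WHERE RowID NOT IN
          (SELECT MIN(RowID)
           FROM EMPLOYEE
           GROUP BY EMPID, FNAME, LNAME, REFDATE);
    GO

    -- Optionally remove the helper column once the duplicates are gone.
    ALTER TABLE EMPLOYEE DROP COLUMN RowID;
    GO
    ```

    The GO after the ALTER TABLE matters: the DELETE has to be compiled in a later batch, otherwise it would reference a column that does not exist yet.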

    I can't remember where I read this solution, but it was either here on SQL Server Central or on the SQL Team forums.

    André Cardoso
