Deduping Data in SQL Server 2005

Question

drnetwork

SSCommitted

Points: 1815
January 30, 2006 at 5:35 pm

#167924

Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/chawkins/dedupingdatainsqlserver2005.asp

drnetwork SSCommitted Points: 1815 · Answer 1

When you populate the record set, take out the GO between the next-to-last and the last INSERT statements. GO ends the batch, and a local variable such as @NOW exists only within the batch that declares it, so leaving that penultimate GO in the script causes the last INSERT to fail.
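To illustrate the batch-scope point, here is a minimal sketch. It is not the article's script: the temp table #EMPLOYEE, its column types, and the sample values are assumptions made up for the example.

```sql
-- Hypothetical table; column names follow the thread, types are assumed.
CREATE TABLE #EMPLOYEE (EMPID int, FNAME varchar(50), LNAME varchar(50), REFDATE datetime);
GO

DECLARE @NOW datetime;
SET @NOW = GETDATE();

INSERT INTO #EMPLOYEE VALUES (1, 'Jim', 'Smith', @NOW);
-- No GO here, so @NOW is still in scope for the next statement.
INSERT INTO #EMPLOYEE VALUES (1, 'Jim', 'Smith', DATEADD(dd, 1, @NOW));
GO

-- New batch: @NOW was declared in the previous batch and no longer exists,
-- so this INSERT fails with a "Must declare the scalar variable" error.
INSERT INTO #EMPLOYEE VALUES (2, 'Ann', 'Jones', @NOW);
GO

DROP TABLE #EMPLOYEE;
GO
```

Either remove the GO ahead of the last INSERT or declare and set @NOW again in the new batch.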

Johan Bijnens SSC Guru Points: 135255 · Answer 2

Nice example. Here is another:

WITH cteEmployeeOrderedByMyRowNumber AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS MyRowNumber
         , ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
         , *
    FROM EMPLOYEE
    -- WHERE 1 = 1
)
DELETE FROM cteEmployeeOrderedByMyRowNumber
WHERE PartitionRank > 1;

Just to get some practice with the syntax.

Johan

Heiko Hatzfeld SSC Veteran Points: 249 · Answer 3

Hi...

I usually clean up my "mess" with something like this:

DELETE FROM myTable
WHERE myID NOT IN (SELECT MIN(myID) FROM myTable GROUP BY myUniqueField)

But I think the best application for CTEs is recursive queries...

Aries Manlig Valued Member Points: 58 · Answer 4

Is there a performance gain to using this function and technique? Or is it just one of those "hey cool function --- let's use it"?

aries

sscbm21 Old Hand Points: 323 · Answer 5

Great examples!

If only duplicates need to be removed, ROW_NUMBER() may not be needed.

WITH cteEmployeeOrderedByMyRank AS
(
    SELECT RANK() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
         , *
    FROM EMPLOYEE
    -- WHERE 1 = 1
)
DELETE FROM cteEmployeeOrderedByMyRank
WHERE PartitionRank > 1;

It surely seems to be much faster than the cursor-based approach.

bm21
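One caveat worth checking before swapping ROW_NUMBER() for RANK(): RANK() gives tied rows the same value, so duplicates that share an identical REFDATE would all be ranked 1 and would survive a `PartitionRank > 1` delete. The sketch below compares the two functions side by side; the temp table #EMP and its sample rows are invented for illustration.

```sql
-- Hypothetical sample data: two rows tie on REFDATE, a third is one day later.
CREATE TABLE #EMP (EMPID int, FNAME varchar(50), LNAME varchar(50), REFDATE datetime);

INSERT INTO #EMP VALUES (1, 'Jim', 'Smith', '20060101');
INSERT INTO #EMP VALUES (1, 'Jim', 'Smith', '20060101');  -- exact duplicate, same date
INSERT INTO #EMP VALUES (1, 'Jim', 'Smith', '20060102');

SELECT EMPID
     , REFDATE
     , RANK()       OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS RankValue
     , ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS RowNumberValue
FROM #EMP
ORDER BY RowNumberValue;
-- RankValue:      1, 1, 3   (the tied rows both keep rank 1)
-- RowNumberValue: 1, 2, 3   (every row gets a distinct number)

DROP TABLE #EMP;
```

If REFDATE is known to be unique within each EMPID, FNAME, LNAME group, the two functions number the rows identically and either one works.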

USKiwi SSChasing Mays Points: 604 · Answer 6

Interesting demonstration of the ROW_NUMBER() function. But please say it ain’t so, Joe: tell me you are not really using cursors to remove duplicate rows. Even a SELECT DISTINCT into a temporary table would be a better option. As other readers have commented, there are a number of ways to remove duplicate rows. This would be my approach:

-- Delete every row that is newer than the earliest RefDate in its group.
DELETE a
FROM Employee a
INNER JOIN (SELECT Empid,
                   FName,
                   LName,
                   MIN(RefDate) AS MinDate
            FROM Employee
            GROUP BY Empid, FName, LName) b
    ON  a.Empid = b.Empid
    AND a.FName = b.FName
    AND a.LName = b.LName
    AND a.RefDate > b.MinDate

This would still leave the issue of James versus Jim, which would need to be resolved separately. If you didn’t care about spelling variations and wanted to assume that the first entry was the correct one, then this would work:

-- Keep only the earliest row per Empid, ignoring name spelling.
DELETE a
FROM Employee a
INNER JOIN (SELECT Empid,
                   MIN(RefDate) AS MinDate
            FROM Employee
            GROUP BY Empid) b
    ON  a.Empid = b.Empid
    AND a.RefDate > b.MinDate

I would be interested in the question of performance between the two techniques, but I’d put my money on mine, which I suspect has a whole lot less overhead, even with the join back to the same table, than having the engine generate a row position.
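One rough way to compare the two techniques is to run each inside a transaction that is rolled back, with SET STATISTICS IO and SET STATISTICS TIME switched on. This is only a sketch: it assumes the thread's EMPLOYEE table (EMPID, FNAME, LNAME, REFDATE) exists, the CTE name cteNumbered is made up, and the numbers will depend on data volume, indexes, and what is already cached.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Technique 1: ROW_NUMBER() in a CTE.
BEGIN TRANSACTION;
WITH cteNumbered AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY REFDATE ASC) AS PartitionRank
         , *
    FROM EMPLOYEE
)
DELETE FROM cteNumbered
WHERE PartitionRank > 1;
ROLLBACK TRANSACTION;  -- undo the delete so the second test sees the same rows

-- Technique 2: join against MIN(REFDATE) per group.
BEGIN TRANSACTION;
DELETE a
FROM EMPLOYEE a
INNER JOIN (SELECT EMPID, FNAME, LNAME, MIN(REFDATE) AS MinDate
            FROM EMPLOYEE
            GROUP BY EMPID, FNAME, LNAME) b
    ON  a.EMPID = b.EMPID
    AND a.FNAME = b.FNAME
    AND a.LNAME = b.LNAME
    AND a.REFDATE > b.MinDate;
ROLLBACK TRANSACTION;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Compare the logical reads and the CPU and elapsed times reported in the Messages output for each DELETE. The two statements are not strictly equivalent: the join keeps every row that ties on the earliest REFDATE, whereas ROW_NUMBER() keeps exactly one row per group.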

André Cardoso SSCommitted Points: 1745 · Answer 7

Another option (that works with SQL 2000) and even in cases of all the columns having the same value (no column to differentiate the rows), is inserting the result set into a new table with an identity column (or adding an identity column to the original table). After that, it's just a matter of keeping the distinct rows as shown in the comments.
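A minimal sketch of that idea, assuming a made-up temp table #EMP whose rows are identical in every column; the work table #EMP_WORK and the RowID column are likewise hypothetical names:

```sql
-- Hypothetical source table with fully identical rows (nothing to tell them apart).
CREATE TABLE #EMP (EMPID int, FNAME varchar(50), LNAME varchar(50));
INSERT INTO #EMP VALUES (1, 'Jim', 'Smith');
INSERT INTO #EMP VALUES (1, 'Jim', 'Smith');
INSERT INTO #EMP VALUES (2, 'Ann', 'Jones');

-- Copy into a work table, adding an identity column as a tiebreaker.
SELECT IDENTITY(int, 1, 1) AS RowID, EMPID, FNAME, LNAME
INTO #EMP_WORK
FROM #EMP;

-- Keep only the lowest RowID in each group of identical rows.
DELETE FROM #EMP_WORK
WHERE RowID NOT IN (SELECT MIN(RowID)
                    FROM #EMP_WORK
                    GROUP BY EMPID, FNAME, LNAME);

SELECT EMPID, FNAME, LNAME FROM #EMP_WORK;  -- two rows remain: Jim Smith and Ann Jones

DROP TABLE #EMP_WORK;
DROP TABLE #EMP;
```

The IDENTITY() function in a SELECT ... INTO is available in SQL Server 2000, so this needs none of the 2005 ranking functions.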

I can't remember where I read this solution, but it was either here on SQL Server Central or in the SQL Team forums.

André Cardoso