SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Deduping Data in SQL Server 2005


Deduping Data in SQL Server 2005

Author
Message
drnetwork
drnetwork
Say Hey Kid
Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)

Group: General Forum Members
Points: 689 Visits: 82
Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/chawkins/dedupingdatainsqlserver2005.asp



drnetwork
drnetwork
Say Hey Kid
Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)Say Hey Kid (689 reputation)

Group: General Forum Members
Points: 689 Visits: 82
When you are populating the record set, take out the GO between the next-to-the-last and the last insert statements. Having this penultimate GO in the set of queries will remove the scope of the @NOW variable and cause the last INSERT to fail.



ALZDBA
ALZDBA
SSC-Dedicated
SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)SSC-Dedicated (31K reputation)

Group: General Forum Members
Points: 31417 Visits: 8986

nice example

Here is another :

WITH cteEmployeeOrderedByMyRowNumber AS
(SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS MyRowNumber
, Row_NUMBER() Over (Partition By EMPID,FNAME,LNAME Order By REFDATE ASC) as PartitionRank
, *
FROM EMPLOYEE
-- WHERE 1 = 1
)
DELETE FROM cteEmployeeOrderedByMyRowNumber
where PartitionRank > 1 ;

just to get it in the tips of the fingers



Johan


Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere w00t

- How to post Performance Problems
- How to post data/code to get the best help


- How to prevent a sore throat after hours of presenting ppt ?


press F1 for solution, press shift+F1 for urgent solution :-D


Need a bit of Powershell? How about this

Who am I ? Sometimes this is me Alien but most of the time this is me Hehe
Heiko Hatzfeld
Heiko Hatzfeld
SSC Journeyman
SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)SSC Journeyman (99 reputation)

Group: General Forum Members
Points: 99 Visits: 8

Hi...

I usually clean up my "mess" with "something" like this

delete from myTable where myID NOT in

(select min(MyID) from myTable group by myUniqueField[s])

But I think the best application for CTE are recursive querries...


Aries Manlig
Aries Manlig
Grasshopper
Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)

Group: General Forum Members
Points: 18 Visits: 3
Is there a performance gain to using this function and technique? Or is it just one of those "hey cool function --- let's use it"?

aries
sscbm21
sscbm21
SSC Veteran
SSC Veteran (211 reputation)SSC Veteran (211 reputation)SSC Veteran (211 reputation)SSC Veteran (211 reputation)SSC Veteran (211 reputation)SSC Veteran (211 reputation)SSC Veteran (211 reputation)SSC Veteran (211 reputation)

Group: General Forum Members
Points: 211 Visits: 9

great examples!

If only duplicates need to be removed the ROW_NUMBER() may not be needed.

WITH cteEmployeeOrderedByMyRank AS
(SELECT RANK() OVER (PARTITION BY EMPID,FNAME,LNAME ORDER BY REFDATE ASC) AS PartitionRank
, *
FROM EMPLOYEE
-- WHERE 1 = 1
)
DELETE FROM cteEmployeeOrderedByMyRank
WHERE PartitionRank > 1 ;

It surely seems to be much faster than the cursor based apporach.




bm21
USKiwi
USKiwi
SSC-Enthusiastic
SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)SSC-Enthusiastic (172 reputation)

Group: General Forum Members
Points: 172 Visits: 59

Interesting demonstration of the ROW_NUMBER() function. Please say it ain’t so, Joe! - that you are not using cursors to remove duplicate rows. Even the technique of a SELECT DISTINCT into a temporary table would be a better option. As other readers have commented, there are a number of ways to remove duplicate rows. This would be my approach:

DELETE Employee

FROM Employee a INNER JOIN (SELECT Empid,

FName,

LName,

MIN(RefDate) AS 'MinDate'

FROM Employee

GROUP BY Empid, FName, LName) b

ON a.Empid = b.Empid

AND a.FName = b.FName

AND a.LName = b.LName

AND a.RefDate > b.MinDate

This would still leave the issue of James verses Jim that would need to be resolved separately. If you didn’t care about spelling variations and wanted to assume that the first entry was the correct one then this would work:

DELETE Employee

FROM Employee a INNER JOIN (SELECT Empid,

MIN(RefDate) AS 'MinDate'

FROM Employee

GROUP BY Empid) b

ON a.Empid = b.Empid

AND a.RefDate > b.MinDate

I would be interested in the question of performance between the two techniques but I’d put my money on mine which I suspect has a whole lot less overhead even as a cross join than having the engine generate a row position.





André Cardoso
André Cardoso
UDP Broadcaster
UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)UDP Broadcaster (1.4K reputation)

Group: General Forum Members
Points: 1441 Visits: 374

Another option (that works with SQL 2000) and even in cases of all the columns having the same value (no column to differentiate the rows), is inserting the result set into a new table with an identity column (or adding an identity column to the original table). After that, it's just a matter of keeping the distinct rows as shown in the comments.

I can't remember where I read this solution, but it was either here in SQL Server Central or SQL Team foruns.

André Cardoso





Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search