SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Duplicate Detection and Management


Duplicate Detection and Management

Author
Message
Ed Pollack
Ed Pollack
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1432 Visits: 513
Comments posted to this topic are about the item Duplicate Detection and Management
peterlim05
peterlim05
SSC Journeyman
SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)SSC Journeyman (84 reputation)

Group: General Forum Members
Points: 84 Visits: 207
see also...

http://www.sqlservercentral.com/Forums/Topic1434840-23-1.aspx
TheComedian
TheComedian
Ten Centuries
Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)

Group: General Forum Members
Points: 1063 Visits: 517
I would also suggest that you back up the table before removing the duplicates. Although the solution is nice and simple, the recovery is definitely not if you make a mistake.
john 5241
john 5241
Forum Newbie
Forum Newbie (7 reputation)Forum Newbie (7 reputation)Forum Newbie (7 reputation)Forum Newbie (7 reputation)Forum Newbie (7 reputation)Forum Newbie (7 reputation)Forum Newbie (7 reputation)Forum Newbie (7 reputation)

Group: General Forum Members
Points: 7 Visits: 7
Why have you not chosen to use

SELECT * FROM <dbName> WHERE <tableID> IN

(
SELECT <tableID> FROM <tableA>

EXCEPT

SELECT <tableID> FROM <tableB>
)
GO
Ed Pollack
Ed Pollack
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1432 Visits: 513
TheComedian: As always, backing up data in any production environment before making a big change is critical! I left it out as this is a demo, but it's certainly worth noting (thank you!) for anyone that is faced with similar situations on real data.
Jonathan AC Roberts
Jonathan AC Roberts
SSCrazy
SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)

Group: General Forum Members
Points: 2510 Visits: 1999
Good article. Nice way of selecting all columns when detecting duplicates for a subset of the columns.
skeleton567
skeleton567
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1219 Visits: 443
Ed Pollack (4/29/2014)
Comments posted to this topic are about the item <A HREF="/articles/Duplicate/109019/">Duplicate Detection and Management</A>


First of all, check for the existence of a created/changed timestamp you might use to define a subset of the records. Depends on how the duplicates came to be created by the code

Maybe the easiest and simplest way is still the good old query:

SELECT A, B, C, COUNT(A)
FROM tablename
GROUP BY A, B, C
ORDER BY A, B, C
HAVING COUNT(A) > 1

That may not be exact code 'cause I'm an old retired guy and haven't done this for a few years, but the example is close and will tell you how many of each set exist. Just group by all fields or as many at a time as you need and that the GROUP BY syntax allows, using temp tables or subqueries and SELECT DISTINCT as needed. Can't remember if you can just do a SELECT DISCTINCT * to get one of each identical set or not. Also can't recall if there is a limit on the GROUP BY and/or ORDER BY, and I'm too lazy to look it up.

Play with it. It's fun, even when you get too old to remember. It's 5:00 somewhere.
JMcCann 28785
JMcCann 28785
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 56
Does this delete 'All' duplicate rows? Or just the ones that were duplicated by accident?
Does it leave the original rows alone?
Ed Pollack
Ed Pollack
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1432 Visits: 513
JMcCann 28785 (5/1/2014)
Does this delete 'All' duplicate rows? Or just the ones that were duplicated by accident?
Does it leave the original rows alone?


This TSQL will delete all duplicates except for the oldest/original one. The ordering in the CTE determines how you want to classify the original (that we want to keep) and the accidental duplicates (that we want to delete). Ie should we order by id and keep the smallest (original) id, or should we order by date and keep the oldest row. You could order by any other criteria as well, depending on your specific situation.

The same goes for the update statement, which will update specific information in the original row from the first duplicate in each set, thereby allowing you to remove the duplicates when complete.

In all of these operations, the original row is purposely preserved.
JMcCann 28785
JMcCann 28785
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 56
Thanks for the explanation!
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search