Duplicate Detection and Management

Question

Post reply

Duplicate Detection and Management

Ed Pollack

Hall of Fame

Points: 3199
More actions
April 29, 2014 at 12:11 am

#396673

Comments posted to this topic are about the item Duplicate Detection and Management

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply

peterlim05 Old Hand Points: 342 More actions · Answer 1

see also...

http://www.sqlservercentral.com/Forums/Topic1434840-23-1.aspx

TheComedian SSCrazy Points: 2932 More actions · Answer 2

I would also suggest that you back up the table before removing the duplicates. Although the solution is nice and simple, the recovery is definitely not if you make a mistake.

john 5241 Grasshopper Points: 17 More actions · Answer 3

Why have you not chosen to use

SELECT * FROM <dbName> WHERE <tableID> IN

(

SELECT <tableID> FROM <tableA>

EXCEPT

SELECT <tableID> FROM <tableB>

)

GO

Ed Pollack Hall of Fame Points: 3199 More actions · Answer 4

TheComedian: As always, backing up data in any production environment before making a big change is critical! I left it out as this is a demo, but it's certainly worth noting (thank you!) for anyone that is faced with similar situations on real data.

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 5

Good article. Nice way of selecting all columns when detecting duplicates for a subset of the columns.

skeleton567 SSCertifiable Points: 5935 More actions · Answer 6

Ed Pollack (4/29/2014)
Comments posted to this topic are about the item <A HREF="/articles/Duplicate/109019/">Duplicate Detection and Management</A>

First of all, check for the existence of a created/changed timestamp you might use to define a subset of the records. Depends on how the duplicates came to be created by the code

Maybe the easiest and simplest way is still the good old query:

SELECT A, B, C, COUNT(A)

FROM tablename

GROUP BY A, B, C

ORDER BY A, B, C

HAVING COUNT(A) > 1

That may not be exact code 'cause I'm an old retired guy and haven't done this for a few years, but the example is close and will tell you how many of each set exist. Just group by all fields or as many at a time as you need and that the GROUP BY syntax allows, using temp tables or subqueries and SELECT DISTINCT as needed. Can't remember if you can just do a SELECT DISCTINCT * to get one of each identical set or not. Also can't recall if there is a limit on the GROUP BY and/or ORDER BY, and I'm too lazy to look it up.

Play with it. It's fun, even when you get too old to remember. It's 5:00 somewhere.

Rick
If you do a half-assed job of things, folks will ask 'why did this ass only do half the job?'

JMcCann 28785 SSC Rookie Points: 44 More actions · Answer 7

Does this delete 'All' duplicate rows? Or just the ones that were duplicated by accident?

Does it leave the original rows alone?

Ed Pollack Hall of Fame Points: 3199 More actions · Answer 8

JMcCann 28785 (5/1/2014)
Does this delete 'All' duplicate rows? Or just the ones that were duplicated by accident?
Does it leave the original rows alone?

This TSQL will delete all duplicates except for the oldest/original one. The ordering in the CTE determines how you want to classify the original (that we want to keep) and the accidental duplicates (that we want to delete). Ie should we order by id and keep the smallest (original) id, or should we order by date and keep the oldest row. You could order by any other criteria as well, depending on your specific situation.

The same goes for the update statement, which will update specific information in the original row from the first duplicate in each set, thereby allowing you to remove the duplicates when complete.

In all of these operations, the original row is purposely preserved.

JMcCann 28785 SSC Rookie Points: 44 More actions · Answer 9

JMcCann 28785

SSC Rookie

Points: 44

May 1, 2014 at 6:36 pm

#1710231

Thanks for the explanation!

mrcgray19 Valued Member Points: 72 More actions · Answer 10

How can I include duplicate data only if certain criteria are met?

Employee table. If employee moves from one company to the other, the employee is terminated in the old and rehired in the new company. I have 10 employees that have 2 records.

company employee termdate

abc 111111 04/04/2014

xyz 111111 01/01/1900

abc 222222 05/01/2013

xyz 222222 04/29/2014

Goal: Want to find all users that are terminated (termdate not = '01/01/1900') but do not include the user if they have a second active record. I want every terminated user with one record as well as employee 222222.

The CTE example only finds the second record.

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 11

I think simply testing for existence should do what you need.

SELECT E.Company,

E.Employee,

E.TermDate

FROM tblEmployee E

WHERE E.TermDate <> '19000101' -- Terminated

AND NOT EXISTS(SELECT *

FROM tblEmployee E1

WHERE E1.Employee = E.Employee

AND E1.TermDate = '19000101') -- And not an active row for same employee

mrcgray19 Valued Member Points: 72 More actions · Answer 12

Of course it would be that simple!

Almost perfect.

When the employee does have 2 term records, it of course now returns both records. How would I further filter those to only show 1 record with the most recent term date?

Ed Pollack Hall of Fame Points: 3199 More actions · Answer 13

mrcgray19 (5/6/2014)
Of course it would be that simple!
Almost perfect.
When the employee does have 2 term records, it of course now returns both records. How would I further filter those to only show 1 record with the most recent term date?

mrcgray19 (5/6/2014)
How can I include duplicate data only if certain criteria are met?
Employee table. If employee moves from one company to the other, the employee is terminated in the old and rehired in the new company. I have 10 employees that have 2 records.
company employee termdate
abc 111111 04/04/2014
xyz 111111 01/01/1900
abc 222222 05/01/2013
xyz 222222 04/29/2014
Goal: Want to find all users that are terminated (termdate not = '01/01/1900') but do not include the user if they have a second active record. I want every terminated user with one record as well as employee 222222.
The CTE example only finds the second record.

The WHERE clause on the final select from the CTE will control the data that you get back. If you'd like, you can remove the WHERE row_num > 1 from it. Now the SELECT returns everything, including the row_num to identify the number of dupes involved. You can also filter further, using an IF EXISTS to further refine your data set returned. I'll use the original example from the post to illustrate this:

WITH CTE_DUPLICATE_RESOURCES AS

(

SELECT

id,

ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,

first_name ,

middle_name ,

last_name ,

email_address ,

email_promotion

FROM dbo.Resources

)

SELECT

*

FROM CTE_DUPLICATE_RESOURCES DUPE1

WHERE NOT EXISTS

(SELECT * FROM CTE_DUPLICATE_RESOURCES DUPE2

WHERE DUPE2.first_name = DUPE1.first_name AND DUPE2.last_name = DUPE1.last_name AND DUPE2.email_address = DUPE1.email_address

AND DUPE2.row_num > 1

)

To get only the most recent of a set of potential dupes, you can do something like this:

WITH CTE_DUPLICATE_RESOURCES AS

(

SELECT

id,

ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,

first_name ,

middle_name ,

last_name ,

email_address ,

email_promotion

FROM dbo.Resources

)

SELECT

*

FROM CTE_DUPLICATE_RESOURCES DUPE1

WHERE NOT EXISTS

(SELECT * FROM CTE_DUPLICATE_RESOURCES DUPE2

WHERE DUPE2.first_name = DUPE1.first_name AND DUPE2.last_name = DUPE1.last_name AND DUPE2.email_address = DUPE1.email_address

AND DUPE2.id > DUPE1.id

)

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 14

Or this:

;WITH CTE AS

(

SELECT E.Company,

E.Employee,

ROW_NUMBER() OVER (PARTITION BY E.Employee ORDER BY TermDate DESC) AS row_num,

E.TermDate

FROM tblEmployee E

WHERE E.TermDate <> '19000101' -- Terminated

AND NOT EXISTS(SELECT *

FROM tblEmployee E1

WHERE E1.Employee = E.Employee

AND E1.TermDate = '19000101') -- And not an active row for same employee

)

SELECT *

FROM CTE

WHERE row_num = 1