Duplicate Detection and Management

  • Comments posted to this topic are about the item Duplicate Detection and Management

  • I would also suggest that you back up the table before removing the duplicates. Although the solution is nice and simple, the recovery is definitely not if you make a mistake.

  • Why have you not chosen to use

    SELECT * FROM <dbName> WHERE <tableID> IN

    (

    SELECT <tableID> FROM <tableA>

    EXCEPT

    SELECT <tableID> FROM <tableB>

    )

    GO

  • TheComedian: As always, backing up data in any production environment before making a big change is critical! I left it out as this is a demo, but it's certainly worth noting (thank you!) for anyone that is faced with similar situations on real data.

  • Good article. Nice way of selecting all columns when detecting duplicates for a subset of the columns.

  • Ed Pollack (4/29/2014)


    Comments posted to this topic are about the item <A HREF="/articles/Duplicate/109019/">Duplicate Detection and Management</A>

    First of all, check for the existence of a created/changed timestamp you might use to define a subset of the records. Depends on how the duplicates came to be created by the code

    Maybe the easiest and simplest way is still the good old query:

    SELECT A, B, C, COUNT(A)

    FROM tablename

    GROUP BY A, B, C

    ORDER BY A, B, C

    HAVING COUNT(A) > 1

    That may not be exact code 'cause I'm an old retired guy and haven't done this for a few years, but the example is close and will tell you how many of each set exist. Just group by all fields or as many at a time as you need and that the GROUP BY syntax allows, using temp tables or subqueries and SELECT DISTINCT as needed. Can't remember if you can just do a SELECT DISCTINCT * to get one of each identical set or not. Also can't recall if there is a limit on the GROUP BY and/or ORDER BY, and I'm too lazy to look it up.

    Play with it. It's fun, even when you get too old to remember. It's 5:00 somewhere.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • Does this delete 'All' duplicate rows? Or just the ones that were duplicated by accident?

    Does it leave the original rows alone?

  • JMcCann 28785 (5/1/2014)


    Does this delete 'All' duplicate rows? Or just the ones that were duplicated by accident?

    Does it leave the original rows alone?

    This TSQL will delete all duplicates except for the oldest/original one. The ordering in the CTE determines how you want to classify the original (that we want to keep) and the accidental duplicates (that we want to delete). Ie should we order by id and keep the smallest (original) id, or should we order by date and keep the oldest row. You could order by any other criteria as well, depending on your specific situation.

    The same goes for the update statement, which will update specific information in the original row from the first duplicate in each set, thereby allowing you to remove the duplicates when complete.

    In all of these operations, the original row is purposely preserved.

  • Thanks for the explanation!

  • How can I include duplicate data only if certain criteria are met?

    Employee table. If employee moves from one company to the other, the employee is terminated in the old and rehired in the new company. I have 10 employees that have 2 records.

    company employee termdate

    abc 111111 04/04/2014

    xyz 111111 01/01/1900

    abc 222222 05/01/2013

    xyz 222222 04/29/2014

    Goal: Want to find all users that are terminated (termdate not = '01/01/1900') but do not include the user if they have a second active record. I want every terminated user with one record as well as employee 222222.

    The CTE example only finds the second record.

  • I think simply testing for existence should do what you need.

    SELECT E.Company,

    E.Employee,

    E.TermDate

    FROM tblEmployee E

    WHERE E.TermDate <> '19000101' -- Terminated

    AND NOT EXISTS(SELECT *

    FROM tblEmployee E1

    WHERE E1.Employee = E.Employee

    AND E1.TermDate = '19000101') -- And not an active row for same employee

  • Of course it would be that simple!

    Almost perfect.

    When the employee does have 2 term records, it of course now returns both records. How would I further filter those to only show 1 record with the most recent term date?

  • mrcgray19 (5/6/2014)


    Of course it would be that simple!

    Almost perfect.

    When the employee does have 2 term records, it of course now returns both records. How would I further filter those to only show 1 record with the most recent term date?

    mrcgray19 (5/6/2014)


    How can I include duplicate data only if certain criteria are met?

    Employee table. If employee moves from one company to the other, the employee is terminated in the old and rehired in the new company. I have 10 employees that have 2 records.

    company employee termdate

    abc 111111 04/04/2014

    xyz 111111 01/01/1900

    abc 222222 05/01/2013

    xyz 222222 04/29/2014

    Goal: Want to find all users that are terminated (termdate not = '01/01/1900') but do not include the user if they have a second active record. I want every terminated user with one record as well as employee 222222.

    The CTE example only finds the second record.

    The WHERE clause on the final select from the CTE will control the data that you get back. If you'd like, you can remove the WHERE row_num > 1 from it. Now the SELECT returns everything, including the row_num to identify the number of dupes involved. You can also filter further, using an IF EXISTS to further refine your data set returned. I'll use the original example from the post to illustrate this:

    WITH CTE_DUPLICATE_RESOURCES AS

    (

    SELECT

    id,

    ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,

    first_name ,

    middle_name ,

    last_name ,

    email_address ,

    email_promotion

    FROM dbo.Resources

    )

    SELECT

    *

    FROM CTE_DUPLICATE_RESOURCES DUPE1

    WHERE NOT EXISTS

    (SELECT * FROM CTE_DUPLICATE_RESOURCES DUPE2

    WHERE DUPE2.first_name = DUPE1.first_name AND DUPE2.last_name = DUPE1.last_name AND DUPE2.email_address = DUPE1.email_address

    AND DUPE2.row_num > 1

    )

    To get only the most recent of a set of potential dupes, you can do something like this:

    WITH CTE_DUPLICATE_RESOURCES AS

    (

    SELECT

    id,

    ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email_address ORDER BY id) AS row_num,

    first_name ,

    middle_name ,

    last_name ,

    email_address ,

    email_promotion

    FROM dbo.Resources

    )

    SELECT

    *

    FROM CTE_DUPLICATE_RESOURCES DUPE1

    WHERE NOT EXISTS

    (SELECT * FROM CTE_DUPLICATE_RESOURCES DUPE2

    WHERE DUPE2.first_name = DUPE1.first_name AND DUPE2.last_name = DUPE1.last_name AND DUPE2.email_address = DUPE1.email_address

    AND DUPE2.id > DUPE1.id

    )

  • Or this:

    ;WITH CTE AS

    (

    SELECT E.Company,

    E.Employee,

    ROW_NUMBER() OVER (PARTITION BY E.Employee ORDER BY TermDate DESC) AS row_num,

    E.TermDate

    FROM tblEmployee E

    WHERE E.TermDate <> '19000101' -- Terminated

    AND NOT EXISTS(SELECT *

    FROM tblEmployee E1

    WHERE E1.Employee = E.Employee

    AND E1.TermDate = '19000101') -- And not an active row for same employee

    )

    SELECT *

    FROM CTE

    WHERE row_num = 1

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply