Get Rid of Duplicates!

  • seth delconte

    SSCertifiable

    Points: 6388

    Comments posted to this topic are about the item Get Rid of Duplicates!

    _________________________________
    seth delconte
    http://sqlkeys.com

  • int.blue

    Grasshopper

    Points: 21

    I have the same issue on some tables that have an auto increment column which isn't an identity column, though lucky it sounds like it doesn't happen as often as it does for you! This is my technique which doesn't use a temp table, I use set rowcount set to 1 less than the number of duplicates and then call delete:

    CREATE TABLE dbo.tblDupTest

    (

    id int not null,

    )

    INSERT INTO dbo.tblDupTest VALUES(1)

    INSERT INTO dbo.tblDupTest VALUES(1)

    INSERT INTO dbo.tblDupTest VALUES(2)

    INSERT INTO dbo.tblDupTest VALUES(3)

    INSERT INTO dbo.tblDupTest VALUES(4)

    INSERT INTO dbo.tblDupTest VALUES(4)

    INSERT INTO dbo.tblDupTest VALUES(4)

    -- At this point we should have two 1s and three 4s

    SELECT *

    FROM dbo.tblDupTest

    -- This will give us the counts

    SELECT id,

    COUNT(id) AS 'Count'

    FROM dbo.tblDupTest

    GROUP BY id

    HAVING COUNT(id) > 1

    -- Then set the rowcount to one less than the duplicate and call delete

    set rowcount 1

    DELETE FROM dbo.tblDupTest WHERE id = 1

    set rowcount 2

    DELETE FROM dbo.tblDupTest WHERE id = 4

    set rowcount 0

    SELECT *

    FROM dbo.tblDupTest

  • JP de Jong-202059

    SSCommitted

    Points: 1674

    Hi I prefer this syntax:

    WITH ItemsToBeDeleted

    AS

    (

    SELECT *

    , row_number() over (partition by item_no ORDER BY id) as RowNumber

    FROM item_store

    )

    DELETE FROM ItemsToBeDeleted Where RowNumber = 2

    Much more efficient.

    Regards,

    JP

  • Trevor.weehuizen

    Valued Member

    Points: 59

    dealing with duplicates on a daily basis, i find this approach works well

    alter table withdupes add delid int identity(1,1)

    delete x

    from withdupes x

    inner join (

    select itemno, min(delid) as keepid

    from withdupes

    group by itemno

    ) y on x.itemo = y.itemno

    where x.delid <> y.keepid

    alter table withdupes drop column delid

  • Peter Pirker

    Ten Centuries

    Points: 1079

    It's very neat and all, and maybe I misunderstand, but if the tables are identical, couldn't you avoid the whole duplicates issue by inserting into one table only, and let replication take care of the rest?

    Regards

    Peter

  • Jim C-203340

    SSCommitted

    Points: 1605

    The row_number() method is by far the quickest and cleanest method. If you've never used row_number() before, do yourself a favor and learn it.

    one modification to JP's code.. "... where RowNumber > 1" will delete all duplicates not just in cases where you only have 1 dup.

    JP de Jong-202059 (11/30/2009)


    Hi I prefer this syntax:

    WITH ItemsToBeDeleted

    AS

    (

    SELECT *

    , row_number() over (partition by item_no ORDER BY id) as RowNumber

    FROM item_store

    )

    DELETE FROM ItemsToBeDeleted Where RowNumber = 2

    Much more efficient.

    Regards,

    JP


    Regards,

    Jim C

  • RyanRandall

    SSChampion

    Points: 13623

    There's an old article on it on this site...

    http://www.sqlservercentral.com/articles/SQL+Server+2005+-+TSQL/dedupingdatainsqlserver2005/2260/

    This option became available with SQL Server 2005.

    Jim C-203340 (11/30/2009)


    The row_number() method is by far the quickest and cleanest method. If you've never used row_number() before, do yourself a favor and learn it.

    one modification to JP's code.. "... where RowNumber > 1" will delete all duplicates not just in cases where you only have 1 dup.

    JP de Jong-202059 (11/30/2009)


    Hi I prefer this syntax:

    WITH ItemsToBeDeleted

    AS

    (

    SELECT *

    , row_number() over (partition by item_no ORDER BY id) as RowNumber

    FROM item_store

    )

    DELETE FROM ItemsToBeDeleted Where RowNumber = 2

    Much more efficient.

    Regards,

    JP

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Logicalman1998

    SSC Journeyman

    Points: 75

    I have had a similar problem in the past, but rather than looking to clean up after the fact, I test for duplicates beforehand and eliminate the insert at that time.

    I feel I have missed something in the original article as to why this might not have been identified as a design issue.

  • Dean Cochrane

    SSC Eights!

    Points: 962

    I think I'm with Tony Scott on this one: why not prevent the issue at insert time, rather than go through all the pain of removing duplicates after the fact?

  • tpoulsen

    SSC Rookie

    Points: 37

    yes agree with this - if you are replicating two databases then just update one - other than that use unique constraints (and if need be triggers) to make sure that you always have a natural uniqueness on each row in your table.

  • seth delconte

    SSCertifiable

    Points: 6388

    To answer the 'Why don't you just use replication/triggers to keep the tables in sync' questions:

    Our app is being phased out, and was developed by 2 teams of developers that wrote the app to access 2 different databases that were very similar, but not exactly the same. As we are developing new software to replace the old app, I have to keep it functional for now. Thus, replication and/or triggers are not a viable solution in this case. 🙂

    _________________________________
    seth delconte
    http://sqlkeys.com

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 715810

    Good article, Seth, and a nice explanation of the issue. It's not always easy to do things up front, especially when you have business reasons for not putting resources into those solutions. We've all had apps that we would like to re-architect, but could not for some reason.

  • Logicalman1998

    SSC Journeyman

    Points: 75

    Good explanation, I can understand now why the issue cannot be resolved up front.

    Thanks,

  • dbajunior

    SSCrazy

    Points: 2307

    I use this one a lot because it removes multiples (3's, 4's, etc) - not just duplicates...

    WITH dups AS

    ( SELECT *, ROW_NUMBER() OVER (partition BY USER_NAME, start_date ORDER BY USER_NAME, start_date) AS RowNum

    FROM tbl_users)

    Delete from dups where rownum > 1

  • dasapito

    Valued Member

    Points: 51

    I use this:

    DELETE FROM tblUser tu1

    WHERE tu1.intUserID > ANY (SELECT intUserID

    FROM tblUser tu2

    WHERE tu2.strUserName = tu1.strUserName

    AND tu2.strFamilyName = tu1.strFamilyName)

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply