Finding and Deleting Duplicate Data

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/ccubley/findinganddeletingduplicatedata.asp


    Chris Cubley
    www.queryplan.com

  • Neatly encapsulates several techniques in one place. I will save it in my bits of useful code.


    Graham Farrow
    DBA
    Mastercare Central Division

  • Useful topic to cover but isn't the final solution over-complex?

    Couldn't you just use an inequality of PaymentID's on a self-join as follows..?

    DELETE Payment

    WHERE PaymentID IN

    (SELECT P1.PaymentID

    FROM Payment P1

    INNER JOIN Payment P2

    ON P1.CustomerNumber = p2.CustomerNumber

    AND

    p1.PostedDatetime = p2.PostedDatetime

    AND

    p1.PaymentAmt = p2.PaymentAmt

    AND

    p1.PaymentID > p2.PaymentID)

    Mark

  • This is not as complex as using temporary tables and scrubbing processes and then moving to live tables.

    That was my first atempt.

  • Excellent article ! This is something I plan to use on one of my tables later today.

  • Good Article! Thanks.

  • I think you can use this approach too

    --table structure

    CREATE TABLE [a4] (

    [id] [int] IDENTITY (1, 1) NOT NULL ,

    [title] [varchar] (50) ,

    [name] [varchar] (50) NULL ,

    CONSTRAINT [PK_a4] PRIMARY KEY CLUSTERED

    (

    [id]

    ) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    --to have distinct id for the doubles

    --the first id's

    select distinct a.id, a.title, a.name from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id < b.id

    --the last id's

    select distinct a.id, a.title, a.name from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id > b.id

    --to delete the first id's doubles and get unique rows in the original table

    delete from a4 where id in (

    select distinct a.id from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id < b.id )

    --to delete the last id's doubles and get unique rows in the original table

    delete from a4 where id in (

    select distinct a.id from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id > b.id )

  • Just to follow up, I followed all of the sample code and everything worked great the first time. The article was well written and the logic was clear in describing why the steps were necessary. Keep up the good work !!!

  • This is a pretty good technique, but there are problems with it.  Using derived tables in a query can be very powerful, but they can also cause excessive use of tempdb (similar to using # tables).  Often, there are other ways of doing the same thing for cheaper.

    So, if you don't have to use them, don't.  The query below is functionally equivalent to the delete statement in the article, but it costs almost half.

    delete p1

    From Payment p1

    JOIN Payment p2

       on  p1.CustomerNumber = p2.CustomerNumber and

        p1.PostedDateTime = p2.PostedDateTime and

        p1.PaymentAmt   = p2.PaymentAmt

    where p1.PaymentID < p2.PaymentID

     

    Signature is NULL

  • Here's the full script:

    if object_ID('tempdb..#Payment') is not null drop table #Payment

    CREATE TABLE #Payment(

     PaymentID int identity Not nUll Primary Key,

     CustomerNumber int  NOT NULL,

     PostedDatetime datetime NOT NULL,

     PaymentAmt money  NOT NULL,

     PaymentMemo varchar(200) NOT NULL,

     CCRefNum char(10) NOT NULL

    )

     

    Insert #Payment Values (1, '01/01/2004', 1, '', '')

    Insert #Payment Values (1, '01/01/2004', 1, '', '')

    Insert #Payment Values (2, '01/02/2004', 2, '', '')

    Insert #Payment Values (2, '01/02/2004', 2, '', '')

    Insert #Payment Values (2, '01/02/2004', 2, '', '')

    Insert #Payment Values (3, '01/03/2004', 3, '', '')

    Insert #Payment Values (3, '01/03/2004', 3, '', '')

    Insert #Payment Values (3, '01/03/2004', 3, '', '')

    Insert #Payment Values (3, '01/03/2004', 3, '', '')

     

     

    delete p1

    From #Payment p1

    JOIN #Payment p2

       on  p1.CustomerNumber = p2.CustomerNumber and

        p1.PostedDateTime = p2.PostedDateTime and

        p1.PaymentAmt   = p2.PaymentAmt

    where p1.PaymentID < p2.PaymentID

     

    DELETE FROM

     p1

    FROM

     #Payment p1

    INNER JOIN

     (

      SELECT

       MAX(PaymentID) AS PaymentID,

       CustomerNumber,

       PostedDatetime,

       PaymentAmt

      FROM

       #Payment

      GROUP BY

       CustomerNumber,

       PostedDatetime,

       PaymentAmt

      HAVING

       COUNT(*) > 1

    &nbsp p2

     ON(

      p1.CustomerNumber = p2.CustomerNumber

      AND

      p1.PostedDatetime = p2.PostedDatetime

      AND

      p1.PaymentAmt = p2.PaymentAmt

      AND

      p1.PaymentID <> p2.PaymentID

    &nbsp

     

    Signature is NULL

  • Thank you to Calvin Lawsen, Posted 7/26/2004 6:55 PM. I tried and it worked first pass.

    However, I had to modify a bit to make the process delete the records with the earliest date and time stamp, leaving only the most recent record. However, upon closer scrutiny, I noticed that the second part of your code segment, i.e. beginning with "DELETE FROM

    p1......" didn't seem necessary?! The first code segment worked just fine.

    Am I missing something or was this an oversight?

  • To Calvin Lawson - I should hav added the modified code as follows:

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER PROCEDURE SP_TEST01

    AS

    BEGIN

    SET NOCOUNT ON;

    if object_ID('dbo.TBLTEST01') is not null drop table TBLTEST01

    CREATE TABLE TBLTEST01(

    ID int identity Not NUll Primary Key,

    CustNum int NOT NULL,

    PostedDate datetime NOT NULL,

    PaymentAmt money NOT NULL,

    PaymentMemo varchar(200) NOT NULL,

    CCRefNum char(10) NOT NULL

    )

    Insert TBLTEST01 Values (1, '01/10/2004', 1, '', '')

    Insert TBLTEST01 Values (1, '01/11/2004', 1, 'Keep', '')

    Insert TBLTEST01 Values (2, '01/10/2004', 2, '', '')

    Insert TBLTEST01 Values (2, '01/11/2004', 2, '', '')

    Insert TBLTEST01 Values (2, '01/12/2004', 2, 'Keep', '')

    Insert TBLTEST01 Values (3, '01/10/2004', 3, '', '')

    Insert TBLTEST01 Values (3, '01/11/2004', 3, '', '')

    Insert TBLTEST01 Values (3, '01/12/2004', 3, '', '')

    Insert TBLTEST01 Values (3, '01/13/2004', 3, 'Keep', '')

    delete p1

    From TBLTEST01 p1

    JOIN TBLTEST01 p2

    on p1.CustNum = p2.CustNum and

    p1.PostedDate < p2.PostedDate and

    p1.PaymentAmt = p2.PaymentAmt

    where p1.ID < p2.ID

    END

    Thanks again

  • Hi,

    Great Post. This will definately help me. I tried manipulating your code to work for me and i get an "Msg 207, Level 16, State 1, Line 27 Invalid column name 'iID'." when i want o delete the duplicates and retail one copy of the duplicates. The error is with p2.iid. The column is there and i am not sure why i am getting this error.

    Can you please help?

    --deleteing duplicated rows with keeping 1 copy of the duplicated row.

    DELETE FROM

    p1

    FROM

    VehicleHistory p1

    INNER JOIN

    (

    SELECT

    MAX(iID) AS ColumnID,

    ivehicleid,

    dtdatetime,

    iOdometer

    FROM

    VehicleHistory

    GROUP BY

    ivehicleid,

    dtdatetime,

    iOdometer

    HAVING

    COUNT(*) > 1) p2

    ON

    (p1.ivehicleid = p2.ivehicleid

    AND

    p1.dtdatetime = p2.dtdatetime

    AND

    p1.iOdometer = p2.iOdometer

    and

    p1.iID <> p2.iID)

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply