Finding and Deleting Duplicate Data

Question

Post reply

Finding and Deleting Duplicate Data

ccubley@queryplan.com

SSC-Addicted

Points: 411
More actions
July 13, 2003 at 12:00 am

#106109

Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/ccubley/findinganddeletingduplicatedata.asp

Chris Cubley
www.queryplan.com

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Graham Farrow SSC Enthusiast Points: 113 More actions · Answer 1

Neatly encapsulates several techniques in one place. I will save it in my bits of useful code.

Graham Farrow
DBA
Mastercare Central Division

shepton Old Hand Points: 374 More actions · Answer 2

Useful topic to cover but isn't the final solution over-complex?

Couldn't you just use an inequality of PaymentID's on a self-join as follows..?

DELETE Payment

WHERE PaymentID IN

(SELECT P1.PaymentID

FROM Payment P1

INNER JOIN Payment P2

ON P1.CustomerNumber = p2.CustomerNumber

AND

p1.PostedDatetime = p2.PostedDatetime

AND

p1.PaymentAmt = p2.PaymentAmt

AND

p1.PaymentID > p2.PaymentID)

Mark

whunt Grasshopper Points: 19 More actions · Answer 3

This is not as complex as using temporary tables and scrubbing processes and then moving to live tables.

That was my first atempt.

cppwiz SSCarpal Tunnel Points: 4036 More actions · Answer 4

Excellent article ! This is something I plan to use on one of my tables later today.

currym SSC Eights! Points: 949 More actions · Answer 5

currym

SSC Eights!

Points: 949

July 26, 2003 at 5:53 pm

#467183

Good Article! Thanks.

icata Hall of Fame Points: 3176 More actions · Answer 6

I think you can use this approach too

--table structure

CREATE TABLE [a4] (

[id] [int] IDENTITY (1, 1) NOT NULL ,

[title] [varchar] (50) ,

[name] [varchar] (50) NULL ,

CONSTRAINT [PK_a4] PRIMARY KEY CLUSTERED

(

[id]

) ON [PRIMARY]

GO

--to have distinct id for the doubles

--the first id's

select distinct a.id, a.title, a.name from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id < b.id

--the last id's

select distinct a.id, a.title, a.name from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id > b.id

--to delete the first id's doubles and get unique rows in the original table

delete from a4 where id in (

select distinct a.id from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id < b.id )

--to delete the last id's doubles and get unique rows in the original table

delete from a4 where id in (

select distinct a.id from a4 a, a4 b where a.title = b.title and a.name = b.name and a.id > b.id )

cppwiz SSCarpal Tunnel Points: 4036 More actions · Answer 7

Just to follow up, I followed all of the sample code and everything worked great the first time. The article was well written and the logic was clear in describing why the steps were necessary. Keep up the good work !!!

Calvin Lawson SSChampion Points: 11030 More actions · Answer 8

This is a pretty good technique, but there are problems with it. Using derived tables in a query can be very powerful, but they can also cause excessive use of tempdb (similar to using # tables). Often, there are other ways of doing the same thing for cheaper.

So, if you don't have to use them, don't. The query below is functionally equivalent to the delete statement in the article, but it costs almost half.

delete p1

From Payment p1

JOIN Payment p2

on p1.CustomerNumber = p2.CustomerNumber and

p1.PostedDateTime = p2.PostedDateTime and

p1.PaymentAmt = p2.PaymentAmt

where p1.PaymentID < p2.PaymentID

Signature is NULL

Calvin Lawson SSChampion Points: 11030 More actions · Answer 9

Here's the full script:

if object_ID('tempdb..#Payment') is not null drop table #Payment

CREATE TABLE #Payment(

PaymentID int identity Not nUll Primary Key,

CustomerNumber int NOT NULL,

PostedDatetime datetime NOT NULL,

PaymentAmt money NOT NULL,

PaymentMemo varchar(200) NOT NULL,

CCRefNum char(10) NOT NULL

)

Insert #Payment Values (1, '01/01/2004', 1, '', '')

Insert #Payment Values (2, '01/02/2004', 2, '', '')

Insert #Payment Values (3, '01/03/2004', 3, '', '')

delete p1

From #Payment p1

JOIN #Payment p2

on p1.CustomerNumber = p2.CustomerNumber and

p1.PostedDateTime = p2.PostedDateTime and

p1.PaymentAmt = p2.PaymentAmt

where p1.PaymentID < p2.PaymentID

DELETE FROM

p1

FROM

#Payment p1

INNER JOIN

(

SELECT

MAX(PaymentID) AS PaymentID,

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

#Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

&nbsp p2

ON(

p1.CustomerNumber = p2.CustomerNumber

AND

p1.PostedDatetime = p2.PostedDatetime

AND

p1.PaymentAmt = p2.PaymentAmt

AND

p1.PaymentID <> p2.PaymentID

&nbsp

Signature is NULL

Peter Meagher Grasshopper Points: 18 More actions · Answer 10

Thank you to Calvin Lawsen, Posted 7/26/2004 6:55 PM. I tried and it worked first pass.

However, I had to modify a bit to make the process delete the records with the earliest date and time stamp, leaving only the most recent record. However, upon closer scrutiny, I noticed that the second part of your code segment, i.e. beginning with "DELETE FROM

p1......" didn't seem necessary?! The first code segment worked just fine.

Am I missing something or was this an oversight?

Peter Meagher Grasshopper Points: 18 More actions · Answer 11

To Calvin Lawson - I should hav added the modified code as follows:

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

ALTER PROCEDURE SP_TEST01

AS

BEGIN

SET NOCOUNT ON;

if object_ID('dbo.TBLTEST01') is not null drop table TBLTEST01

CREATE TABLE TBLTEST01(

ID int identity Not NUll Primary Key,

CustNum int NOT NULL,

PostedDate datetime NOT NULL,

PaymentAmt money NOT NULL,

PaymentMemo varchar(200) NOT NULL,

CCRefNum char(10) NOT NULL

)

Insert TBLTEST01 Values (1, '01/10/2004', 1, '', '')

Insert TBLTEST01 Values (1, '01/11/2004', 1, 'Keep', '')

Insert TBLTEST01 Values (2, '01/10/2004', 2, '', '')

Insert TBLTEST01 Values (2, '01/11/2004', 2, '', '')

Insert TBLTEST01 Values (2, '01/12/2004', 2, 'Keep', '')

Insert TBLTEST01 Values (3, '01/10/2004', 3, '', '')

Insert TBLTEST01 Values (3, '01/11/2004', 3, '', '')

Insert TBLTEST01 Values (3, '01/12/2004', 3, '', '')

Insert TBLTEST01 Values (3, '01/13/2004', 3, 'Keep', '')

delete p1

From TBLTEST01 p1

JOIN TBLTEST01 p2

on p1.CustNum = p2.CustNum and

p1.PostedDate < p2.PostedDate and

p1.PaymentAmt = p2.PaymentAmt

where p1.ID < p2.ID

END

Thanks again

vishnup Valued Member Points: 69 More actions · Answer 12

Hi,

Great Post. This will definately help me. I tried manipulating your code to work for me and i get an "Msg 207, Level 16, State 1, Line 27 Invalid column name 'iID'." when i want o delete the duplicates and retail one copy of the duplicates. The error is with p2.iid. The column is there and i am not sure why i am getting this error.

Can you please help?

--deleteing duplicated rows with keeping 1 copy of the duplicated row.

DELETE FROM

p1

FROM

VehicleHistory p1

INNER JOIN

(

SELECT

MAX(iID) AS ColumnID,

ivehicleid,

dtdatetime,

iOdometer

FROM

VehicleHistory

GROUP BY

ivehicleid,

dtdatetime,

iOdometer

HAVING

COUNT(*) > 1) p2

ON

(p1.ivehicleid = p2.ivehicleid

AND

p1.dtdatetime = p2.dtdatetime

AND

p1.iOdometer = p2.iOdometer

and

p1.iID <> p2.iID)