SQLServerCentral is supported by Red Gate Software Ltd.
 
 
 
 

Finding and Deleting Duplicate Data

By Chris Cubley

I have often found that poor database design creates the need for inventive SQL solutions. In fact, the whole concept of "normalizing" databases sprang from the desire to eliminate certain "anomalies" that arise when querying the database. When faced with a sub-optimal database design, the ideal solution from a technical standpoint is to redesign the database and correct its flaws. However, redesigning the database is often not an option due to time or budget constraints, pre-existing software, or simple lack of authority to change the design. We must often play the hand we are dealt. In this article, I'll show you how to leverage advanced join techniques to get the most from design-challenged databases.

Suppose the person who designed your database fell asleep in database theory class when the professor was discussing entity integrity. As a result, the schema lacks primary keys and is rife with duplicate data. Since it's difficult to write a quality piece of software against a database with duplicate data, you decide to create primary keys on the tables. Your only problem is that you must first identify and eliminate the duplicate data.

There are several ways of finding and eliminating duplicate data; some are better suited to specific situations than others. The technique that I'll show you in this article is a relatively straightforward technique that should work in a wide variety of situations. To demonstrate the technique, I'm going to use the following table:

CREATE TABLE Payment(
	CustomerNumber	int		NOT NULL,
	PostedDatetime	datetime	NOT NULL,
	PaymentAmt	money		NOT NULL,
	PaymentMemo	varchar(200)	NOT NULL,
	CCRefNum	char(10)	NOT NULL
)

The first step to a duplicate-free table is coming up with a way of uniquely identifying each row. There's no way you can delete a particular row without being able to distinguish it from the others. The easiest way to accomplish this is to add an identity column to this table and make it the primary key.

ALTER TABLE Payment ADD
	PaymentID int NOT NULL IDENTITY(1, 1),
	CONSTRAINT PK_Payment PRIMARY KEY(PaymentID)

Now that you have the ability to specify a particular row to delete, you must decide how to define "duplicate". Which columns should uniquely identify a row? In real-world scenarios, the answer will be based upon how the table has been used and the actual data contained in it; however, for this example, a combination of CustomerNumber, PostedDatetime, and PaymentAmt will serve as the logical key. It makes sense that the same customer would not post two payments of exactly the same amount at exactly the same time.

Having decided what constitutes a duplicate, you need some way of identifying the duplicated rows. To do so, you construct a query that returns each unique CustomerNumber / PostedDatetime / PaymentAmt combination for which more than one row shares the same key values.

SELECT
	CustomerNumber,
	PostedDatetime,
	PaymentAmt
FROM
	Payment
GROUP BY
	CustomerNumber,
	PostedDatetime,
	PaymentAmt
HAVING
	COUNT(*) > 1

This query groups by the key values to return one row per unique set of values. The HAVING clause then filters out all groups where there is only one row for that set of values. This leaves a rowset containing only those logical key values where there are duplicates.
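To see this grouping at work outside SQL Server, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The sample rows are invented for illustration; the column names follow the article's Payment table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Payment(
    CustomerNumber INTEGER NOT NULL,
    PostedDatetime TEXT NOT NULL,
    PaymentAmt REAL NOT NULL,
    PaymentMemo TEXT NOT NULL,
    CCRefNum TEXT NOT NULL)""")

rows = [
    (1, "2004-01-15 10:00:00", 25.00, "Jan payment", "AAAAAAAAAA"),
    (1, "2004-01-15 10:00:00", 25.00, "Jan payment", "AAAAAAAAAA"),  # duplicate
    (2, "2004-01-16 11:30:00", 40.00, "Jan payment", "BBBBBBBBBB"),
]
conn.executemany("INSERT INTO Payment VALUES (?, ?, ?, ?, ?)", rows)

# Group by the logical key; HAVING keeps only groups with duplicates.
dupes = conn.execute("""
    SELECT CustomerNumber, PostedDatetime, PaymentAmt
    FROM Payment
    GROUP BY CustomerNumber, PostedDatetime, PaymentAmt
    HAVING COUNT(*) > 1
    ORDER BY CustomerNumber""").fetchall()
print(dupes)  # only customer 1's key combination appears
```

Customer 2's single payment is filtered out by the HAVING clause; only the duplicated key combination is returned.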

The next problem to solve is how to eliminate all but one row per set of key values. The PaymentID column added earlier provides an answer. If we could somehow get exactly one PaymentID for each set of duplicates, we could delete all rows where the CustomerNumber / PostedDatetime / PaymentAmt is returned in the above query, but the PaymentID does not match. This would delete all but one row per set of duplicates. How can we accomplish this?

When grouping data, aggregate functions can be used to return data about the group as a whole. While some of the aggregate functions such as SUM() and AVG() return values which don't exist on any particular row in the group, the extrema functions MIN() and MAX() return values which exist in an actual row in the group. Hence, the extrema functions can be used to return an arbitrary value from some set of values.
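As a quick illustration of this distinction (a hypothetical sketch, again using Python's sqlite3 with invented table and column names), AVG() can produce a value that exists in no row, while MAX() always returns a value taken from an actual row in the group:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(grp INTEGER, id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (1, 11), (2, 20)])

# AVG() invents 10.5 for group 1 (no row has id 10.5), but MAX()
# returns an id that belongs to a real row in each group.
rows = conn.execute(
    "SELECT grp, AVG(id), MAX(id) FROM t GROUP BY grp ORDER BY grp").fetchall()
for grp, avg_id, max_id in rows:
    print(grp, avg_id, max_id)
```

This is exactly the property the technique relies on: MAX(PaymentID) per group is guaranteed to identify one real row in each set of duplicates.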

Using this knowledge about MIN() and MAX(), we can modify the above query to return the CustomerNumber / PostedDatetime / PaymentAmt for each set of duplicates along with exactly one unique PaymentID.

SELECT
	MAX(PaymentID) AS PaymentID,
	CustomerNumber,
	PostedDatetime,
	PaymentAmt
FROM
	Payment
GROUP BY
	CustomerNumber,
	PostedDatetime,
	PaymentAmt
HAVING
	COUNT(*) > 1

This query returns all the data we need to perform the delete operation, but how do we tell the database engine to actually delete the records? We can't simply change the SELECT to DELETE because DELETE statements don't support grouping and aggregates. However, the DELETE statement does support joining to other tables. We can use a technique called a derived table to treat the above SELECT statement as if it were a table. We can then write a DELETE statement which joins to this derived table and performs the DELETE.

DELETE FROM
	p1
FROM
	Payment p1
INNER JOIN
	(
		SELECT
			MAX(PaymentID) AS PaymentID,
			CustomerNumber,
			PostedDatetime,
			PaymentAmt
		FROM
			Payment
		GROUP BY
			CustomerNumber,
			PostedDatetime,
			PaymentAmt
		HAVING
			COUNT(*) > 1
	) p2
	ON(
		p1.CustomerNumber = p2.CustomerNumber
		AND
		p1.PostedDatetime = p2.PostedDatetime
		AND
		p1.PaymentAmt = p2.PaymentAmt
		AND
		p1.PaymentID <> p2.PaymentID
	)

The most interesting part of this query is the join criteria. We join on CustomerNumber, PostedDatetime, and PaymentAmt to delete rows that match logically to the sets of duplicates we've found. If we stopped there, we'd delete every copy of each duplicated row, leaving none behind. Our requirements stated that we delete the duplicates but leave one copy of each duplicated row. To do this, we add the fourth criterion, which requires that the row's PaymentID not match the single PaymentID (the MAX) returned by the derived table for that set of duplicates. Thus, the row with the highest PaymentID in each set is retained while the others are deleted.
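The whole technique can be exercised end to end in a runnable sketch using Python's built-in sqlite3 module. One caveat, stated plainly: SQLite does not support SQL Server's joined DELETE syntax, so the same derived-table join is wrapped in an IN subquery; the logic and join criteria are otherwise identical to the T-SQL above, and the data is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Payment(
    PaymentID INTEGER PRIMARY KEY,
    CustomerNumber INTEGER NOT NULL,
    PostedDatetime TEXT NOT NULL,
    PaymentAmt REAL NOT NULL)""")

conn.executemany(
    "INSERT INTO Payment(CustomerNumber, PostedDatetime, PaymentAmt)"
    " VALUES (?, ?, ?)",
    [(1, "2004-01-15 10:00:00", 25.00),
     (1, "2004-01-15 10:00:00", 25.00),   # duplicate of the row above
     (1, "2004-01-15 10:00:00", 25.00),   # and another
     (2, "2004-01-16 11:30:00", 40.00)])  # PaymentIDs 1..4 are assigned

# Same derived-table join as the T-SQL version, expressed with IN
# because SQLite's DELETE cannot join to another table directly.
conn.execute("""
    DELETE FROM Payment
    WHERE PaymentID IN (
        SELECT p1.PaymentID
        FROM Payment p1
        INNER JOIN (
            SELECT MAX(PaymentID) AS PaymentID,
                   CustomerNumber, PostedDatetime, PaymentAmt
            FROM Payment
            GROUP BY CustomerNumber, PostedDatetime, PaymentAmt
            HAVING COUNT(*) > 1) p2
          ON p1.CustomerNumber = p2.CustomerNumber
         AND p1.PostedDatetime = p2.PostedDatetime
         AND p1.PaymentAmt = p2.PaymentAmt
         AND p1.PaymentID <> p2.PaymentID)""")

remaining = conn.execute(
    "SELECT PaymentID, CustomerNumber FROM Payment"
    " ORDER BY PaymentID").fetchall()
print(remaining)  # one survivor per duplicate set: PaymentIDs 3 and 4
```

Of customer 1's three identical payments, only the one with the highest PaymentID survives, and customer 2's unique payment is untouched, which is exactly the behavior the joined DELETE produces in SQL Server.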

Hopefully, you'll never need to use these techniques to delete duplicate data. However, if you do, you won't have to struggle to find a technique that precisely and efficiently performs the task. Even if you don't use this technique to delete duplicates, many of the elements of the solution - derived tables, joined deletes, grouping and aggregation - can be applied to other situations that require inventive query techniques.
