SQLServerCentral Article

Finding and Deleting Duplicate Data

,

I have often found that poor database design creates the need for inventive SQL solutions. In fact, the whole concept of "normalizing" databases sprang from eliminating certain "anomalies" when querying the database. When faced with a sub-optimal database design, the ideal solution from a technical standpoint is to redesign the database and correct its flaws. However, redesigning the database often is not an option due to time or budget constraints, pre-existing software, or simple lack of authority to change the design. We must often play the hand we are dealt. In this article, I'll show you how to leverage advanced join techniques to get the most from design-challenged databases.

Suppose the person that designed your database fell asleep in database theory class when the professor was discussing entity integrity. As a result, the schema lacks primary keys and is rife with duplicate data. Since its difficult to write a quality piece of software against a database with duplicate data, you decide to create primary keys on the tables. Your only problem is that you must first identify and eliminate the duplicate data.

There are several ways of finding and eliminating duplicate data; some are better suited to specific situations than others. The technique that I'll show you in this article is a relatively straightforward technique that should work in a wide variety of situations. To demonstrate the technique, I'm going to use the following table:

CREATE TABLE Payment(

CustomerNumberintNOT NULL,

PostedDatetimedatetimeNOT NULL,

PaymentAmtmoneyNOT NULL,

PaymentMemovarchar(200)NOT NULL,

CCRefNumchar(10)NOT NULL

)

The first step to a duplicate-free table is coming up with a way of uniquely identifying each row. There's no way you can delete a particular row without being able to distinguish it from the others. The easiest way to accomplish this is to add an identity column to this table and make it the primary key.

ALTER TABLE Payment ADD

PaymentID int NOT NULL IDENTITY(1, 1),

CONSTRAINT PK_Payment PRIMARY KEY(PaymentID)

Now that you have the ability to specify a particular row to delete, you must decide how to define "duplicate". Which columns should uniquely identify a row? In real-world scenarios, the answer will be based upon how the table has been used and the actual data contained in it; however, for this example, a combination of CustomerNumber, PostedDatetime, and PaymentAmt will serve as the logical key. It makes sense that the same customer would not post two payments of exactly the same amount at exactly the same time.

Having decided what a duplicate is, you need some way of identifying them. To identify the duplicate rows, you need to construct a query which returns each unique CustomerNumber / PostedDatetime / PaymentAmt combination where there is more than one row with the same key values.

SELECT

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

This query groups by the key values to return one row per unique set of values. The HAVING clause then filters out all groups where there is only one row for that set of values. This leaves a rowset containing only those logical key values where there are duplicates.

The next problem to solve is how to eliminate all but one row per set of key values. The PaymentID column added earlier provides an answer. If we could somehow get exactly one PaymentID for each set of duplicates, we could delete all rows where the CustomerNumber / PostedDatetime / PaymentAmt is returned in the above query, but the PaymentID does not match. This would delete all but one row per set of duplicates. How can we accomplish this?

When grouping data, aggregate functions can be used to return data about the group as a whole. While some of the aggregate functions such as SUM() and AVG() return values which don't exist on any particular row in the group, the extrema functions MIN() and MAX() return values which exist in an actual row in the group. Hence, the extrema functions can be used to return an arbitrary value from some set of values.

Using this knowledge about MIN() and MAX(), we can modify the above query to return the CustomerNumber / PostedDatetime / PaymentAmt for each set of duplicates along with exactly one unique PaymentID.

SELECT

MAX(PaymentID) AS PaymentID,

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

This query returns all the data we need to perform the delete operation, but how do we tell the database engine to actually delete the records? We can't simply change the SELECT to DELETE because DELETE statements don't support grouping and aggregates. However, the DELETE statement does support joining to other tables. We can use a technique called a derived table to treat the above SELECT statement as if it were a table. We can then write a DELETE statement which joins to this derived table and performs the DELETE.

DELETE FROM

p1

FROM

Payment p1

INNER JOIN

(

SELECT

MAX(PaymentID) AS PaymentID,

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

) p2

ON(

p1.CustomerNumber = p2.CustomerNumber

AND

p1.PostedDatetime = p2.PostedDatetime

AND

p1.PaymentAmt = p2.PaymentAmt

AND

p1.PaymentID <> p2.PaymentID

)

The most interesting part of this query is the join criteria. We join on CustomerNumber, PostedDatetime, and PaymentAmt to delete rows that match logically to the set of duplicates that we've found. If we left the criteria at this point, we'd end up deleting the duplicates, but we wouldn't have any copies of the duplicate rows remaining. Our requirements stated that we delete all the duplicates, but leave one copy of each duplicated row. To do this, we add the fourth criterion stating the PaymentID not match one of the PaymentID values in the set of duplicates. Thus, one row per set of duplicates is retained while deleting the others.

Hopefully, you'll never need to use these techniques to delete duplicate data. However, if you do, you won't have to struggle to find a technique that precisely and efficiently performs the task. Even if you don't use this technique to delete duplicates, many of the elements of the solution - derived tables, joined deletes, grouping and aggregation - can be applied to other situations that require inventive query techniques.

Rate

4.9 (10)

You rated this post out of 5. Change rating

Share

Share

Rate

4.9 (10)

You rated this post out of 5. Change rating