Deduping Data in SQL Server 2005

Introduction

Recently I read Muthusamy Anantha Kumar's article "Row_NUMBER() function in SQL Server 2005" (http://www.sqlservercentral.com/articles/articlelink.asp?articleid=2239) article. In it, he explored not only the new ROW_NUMBER() function, but, also gave a good example of SQL Server 2005's common table expression.

When I saw his sample, I immediately realized that I could use this to solve one of my more perplexing problems in SQL Server - how to reduce duplicates in a record set to a single row. I will demonstrate how to "de-dupify" a record set in SQL Server 2005 in a single query - a feat I would find myself using a cursor for in earlier versions of SQL Server. Particularly vexing were issues where you didn't just have two copies of a record but might have three or more copies or ragged numbers - two here and three there. My solution below accounts for this hardest case of the ragged numbers of duplicates.

I slightly modified Muthusamy's original example.

First build an EMPLOYEE table in TEMPDB (my scratchpad of choice). I added a reference date to each record so that later when I de-dupify, we'll show you how to keep the earliest or latest duplicate record.

USE [TEMPDB]
GO
IF  EXISTS
  (SELECT * FROM SYS.OBJECTS
  WHERE OBJECT_ID =
    OBJECT_ID(N'[DBO].[EMPLOYEE]') AND TYPE IN (N'U'))
DROP TABLE [DBO].[EMPLOYEE]
GO
CREATE TABLE EMPLOYEE (EMPID INT, FNAME VARCHAR(50),
LNAME VARCHAR(50), REFDATE DATETIME)
GO

Next, populate the EMPLOYEE table. Notice how I use the semi-colon to separate my queries while still keeping the context of the @NOW variable relevant. When you use common table expressions (as we'll later do), semi-colons are important.

DECLARE @NOW DATETIME
SELECT @NOW = GETDATE()
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021110, 'MICHAEL', 'POLAND', DATEADD(MI,1,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021110, 'MICHAEL', 'POLAND', DATEADD(MI,2,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021115, 'JIM', 'KENNEDY', DATEADD(MI,3,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2121000, 'JAMES', 'SMITH', DATEADD(MI,4,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2011111, 'ADAM', 'ACKERMAN', DATEADD(MI,5,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (3015670, 'MARTHA', 'LEDERER', DATEADD(MI,6,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (1021710, 'MARTHA', 'MANDEZ', DATEADD(MI,7,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021110, 'MICHAEL', 'POLAND', DATEADD(MI,8,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021115, 'JIM', 'KENNEDY', DATEADD(MI,9,@NOW));
INSERT INTO EMPLOYEE  (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021115, 'JAMES', 'KENNEDY', DATEADD(MI,10,@NOW));
GO

Illustration 1.0 shows our EMPLOYEE table. Your reference dates will be different but will maintain the one-minute offsets between rows.

2021110

MICHAEL

POLAND

2006-01-18 09:41:08.297

2021110

MICHAEL

POLAND

2006-01-18 09:42:08.297

2021115

JIM

KENNEDY

2006-01-18 09:43:08.297

2121000

JAMES

SMITH

2006-01-18 09:44:08.297

2011111

ADAM

ACKERMAN

2006-01-18 09:45:08.297

3015670

MARTHA

LEDERER

2006-01-18 09:46:08.297

1021710

MARTHA

MANDEZ

2006-01-18 09:47:08.297

2021110

MICHAEL

POLAND

2006-01-18 09:48:08.297

2021115

JIM

KENNEDY

2006-01-18 09:49:08.297

2021115

JAMES

KENNEDY

2006-01-18 09:50:08.297

Illustration 1.0

You see that EMPID/FNAME 2021110/Michael has three repeats and 2021115/Jim has two, our ragged duplicates.

SELECT EMPID, FNAME, LNAME, REFDATE FROM EMPLOYEE

Here is Muthusamy's query to show you how to use the ROW_NUMBER() function. I've also added our REFDATE to the ORDER BY clause so that the earliest records come first.

SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, * FROM EMPLOYEE

The results are seen in illustration 2.0:

1021710

MARTHA

MANDEZ

2006-01-18 09:47:08.297

2011111

ADAM

ACKERMAN

2006-01-18 09:45:08.297

2021110

MICHAEL

POLAND

2006-01-18 09:41:08.297

2021110

MICHAEL

POLAND

2006-01-18 09:42:08.297

2021110

MICHAEL

POLAND

2006-01-18 09:48:08.297

2021115

JIM

KENNEDY

2006-01-18 09:43:08.297

2021115

JIM

KENNEDY

2006-01-18 09:49:08.297

2021115

JAMES

KENNEDY

2006-01-18 09:50:08.297

2121000

JAMES

SMITH

2006-01-18 09:44:08.297

3015670

MARTHA

LEDERER

2006-01-18 09:46:08.297

Illustration 2.0

Now here is our common table expression,EMPLOYEE ORDERED BY ROWID, query to de-dupify the record set. I'm wrapping it in a BEGIN TRAN and ROLLBACK TRAN for testing purposes. Later when you get the record set that you want, you can change the ROLLBACK to a COMMIT. I've given you a before and after SELECT of the record set so you can see the change. In this case, we are keeping the earliest record.

BEGIN TRAN;
SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, * FROM EMPLOYEE;
WITH [EMPLOYEE ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, * FROM EMPLOYEE WHERE 1 = 1)
DELETE FROM [EMPLOYEE ORDERED BY ROWID] WHERE ROWID NOT IN
(SELECT MIN(ROWID) FROM [EMPLOYEE ORDERED BY ROWID] GROUP BY EMPID,FNAME,LNAME);
SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, * FROM EMPLOYEE;
ROLLBACK TRAN;
-- COMMIT TRAN;
GO

Notice how I terminate the each of the queries above with a semi-colon. In SQL Server 2005, certain queries absolutely must be separated from other queries by a semi-colon. Common table expressions are one such query type. If I had not used the semi-colon in the first SELECT statement, I would have received the following error:


Msg 336, Level 15, State 1, Line 3
Incorrect syntax near 'EMPLOYEE ORDERED BY ROWID'. If this is intended to be a common table expression,
you need to explicitly terminate the previous statement with a semi-colon.

The results of the SELECT before our DELETE are those of illustration 2 above. Illustration 3 shows the results of the post-DELETE SELECT query:

1021710

MARTHA

MANDEZ

2006-01-18 09:47:08.297

2011111

ADAM

ACKERMAN

2006-01-18 09:45:08.297

2021110

MICHAEL

POLAND

2006-01-18 09:41:08.297

2021115

JIM

KENNEDY

2006-01-18 09:43:08.297

2021115

JAMES

KENNEDY

2006-01-18 09:50:08.297

2121000

JAMES

SMITH

2006-01-18 09:44:08.297

3015670

MARTHA

LEDERER

2006-01-18 09:46:08.297

Illustration 3.0

Thus, we've de-dupified the EMPLOYEE table in one DELETE query above.

The DELETE query above using the common table expression is very powerful. You can manipulate the final record set in several different ways.

Notice that I added a "WHERE 1 = 1" clause to my definition of the common table expression. This is unnecessary. I added it just to demonstrate that you can limit the scope of the common table expression [EMPLOYEE ORDERED BY ROWID] by limiting the employee record set returned by the WHERE clause.

If I wanted to retain the later date for the employee record, I could do this in one of two ways. I could sort the population of the common table expression [EMPLOYEE ORDERED BY ROWID] by using REFDATE DESC. Alternatively, I could use the MAX function for ROWID in the subquery of the DELETE statement.

Finally, I specify the uniqueness of the duplication by modifying the GROUP BY clause of the sub-query. Notice that we have two records for EMPID 2021115 remaining in our final record set. We de-dupified the records for Jim Kennedy, but James Kennedy still remains. Removing FNAME from our GROUP BY clause would cause the earlier record for Jim Kennedy to remain. This may or may not be desirable. You would probably want to resolve the disparity between Jim and James with the application owners before proceeding, displaying the usefulness of the earlier discussed ROLLBACK/COMMIT option for testing what you really want your final record set to look like.

I'll leave it to you to try deleting the EMPID and LNAME columns from the GROUP BY clause and see what happens to your record set.

Conclusion

In this article, we have shown how SQL Server 2005 ROW_NUMBER() function and common table expressions allow you to de-dupify record sets in one quick query.

Deduping Data in SQL Server 2005

Introduction

Conclusion

Rate

Share

Categories

Share

Rate

Deduping Data in SQL Server 2005

Introduction

Conclusion

Rate

Share

Categories

Share

Rate

Related content

Invoke UDFs that accept tables with SQL Server 2005's APPLY operator

SQL 2005 for the SQL2K Developer Part 2

SQL Server 2005 for SQL2k Developer (Part 1)

Boost Performance and Reduce Code Use With SQL Server Aggregate Functi

Sequential Ordering on SQL 2005