RE: Eliminating Duplicate Rows using The PARTITION BY clause

SSC Eights!

Points: 842

September 21, 2010 at 11:20 pm

Nice article & Nice solution. But I would caution anyone about to blindly using this in production. It is only the 1st part of the process.

Problem 1: need to Fix Declarative Referential Integrity (DRI) for the rows deleted.

Invariably the duplicate rows you are about to remove will be referenced by the foreign keys in another table. When you delete these rows you must keep a copy of the Keys of the rows you deleted & map them to the key of the “duplicate” row you kept. That way you have some lookup table you can use to correct all the foreign key references.

Even if you fix up the foreign key references first, it is good to keep them “just in case” you overlooked fixing up a table. (which is really easy to do on a complex schema or where DRI isn’t perfect)

The optimum way to get a list of rows you deleted is to use the OUTPUT clause in the Deleted statement. (see Example B of OUTPUT Clause (Transact-SQL) article in Books Online http://msdn.microsoft.com/en-us/library/ms177564.aspx )

Problem 2: Fixing DRI Creates more duplicates.

Often the child table has multiple duplicate rows that point to duplicated parent rows. As you remove duplicate parent rows & change the foreign keys in the Child table. You may discover that what you thought was “3 groups of duplicate child rows each with 2 rows” becomes “1 group of 6 duplicate rows all pointing to the same parent row”.

Thus you get a chicken & egg situation. Generally I’ve found it best to Clean the parent. Keep a list, use it to fix the child. Then dedupe the child, keep a list etc. This means you only have to touch each table once.

But beware. Duplicate Rows & dirty data are commonly found in the same databases. Often when you tidy up the data, especially if you set bad values to NULL, you may produce even more duplicate rows. So the can sometimes become an iterative process.

The most common causes for duplicate rows.

1.Many to Many reference tables that don’t make the 2 foreign keys the primary key but use an Identity col instead.

2.Using Identity columns for your keys & having no other alternate key with a unique constraint on the table.

3.Poor error handling &/or No transactions to protect you if your batch jobs fail.

Have fun, Dave