• The script is interesting, but I agree with the other people who have posted here that a non-cursor method is best.  In fact, I believe cursors should be tucked away in an advanced section of BOL with huge warnings around them that say "USE AT YOUR OWN RISK" or something like that.  We definitely don't want newbies being shown cursor-based solutions, which appears to endorse their use.  I have done this kind of work for years with billions of records, and I have found that the following is the fastest way to remove the dupes, provided there is a substantial percentage of them:

    1. Move the unique rows into another table (either created ahead of time, or using SELECT...INTO); however, the new table should have no indexes yet.

    2. Drop the original table.

    3. Rename the new table to the same name as the old and create any needed indexes, etc.

    Moving the unique rows into a new table is substantially faster than having the engine do seek operations on the original table (which is what it does internally) to delete rows, unless the percentage of duplicates is very low (say a few percent; the exact break-even point varies).  Having no indexes on the new table also makes the inserts go much faster and avoids the index fragmentation that would otherwise have to be cleaned up later.  Dropping the original table and renaming the new one are metadata operations, which is much faster than copying the records from a temp table or table variable back into another table.  Finally, creating any needed indexes at the end (make sure you create the clustered index first) lets them start out in a non-fragmented state.
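    To make the steps concrete, here is a minimal T-SQL sketch of the approach.  The table and column names (dbo.Employee, id, name, salary) are only placeholders loosely borrowed from the article's example, and the ROW_NUMBER partitioning assumes name is the business key; adjust both to your real schema.  Note that ROW_NUMBER keeps an arbitrary survivor per key, which is exactly the problem I come back to further down.

        -- 1. Copy the de-duplicated rows into a new, index-free table.
        --    SELECT...INTO creates dbo.Employee_Clean with no indexes.
        SELECT id, name, salary
        INTO   dbo.Employee_Clean
        FROM  (SELECT id, name, salary,
                      ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS rn
               FROM   dbo.Employee) AS d
        WHERE  d.rn = 1;

        -- 2. Drop the original table (verify row counts first!).
        DROP TABLE dbo.Employee;

        -- 3. Rename the new table and rebuild the indexes, clustered index
        --    first, so they all start out unfragmented.
        EXEC sp_rename 'dbo.Employee_Clean', 'Employee';
        CREATE CLUSTERED INDEX IX_Employee_id ON dbo.Employee (id);
        CREATE NONCLUSTERED INDEX IX_Employee_name ON dbo.Employee (name);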

    A couple of thoughts on the replies about how the data got this way to begin with.  This scenario usually comes up during data cleansing when combining data from different sources, for example a data warehouse load from multiple source systems.  Operations like this are often not only necessary but can constitute a substantial amount of the data cleansing work.  Personally, I call this "homogenizing the data," or making the data homogeneous.  If you have multiple source systems from, say, several sub-companies run under a single holding company, each operating autonomously, then it is very difficult (or impossible) to get the data clean to begin with.

    However, the two most difficult aspects of removing duplicates are not discussed in the article.  The first is deciding which duplicate is the right one.  In the real world, duplicates are rarely identical across all columns; more often they are two rows with the same key whose remaining columns differ.  Using the author's example "non-dupe" table, the row with id=2, name='joe' appears twice, once with a salary of 1000 and once with 2000.  Since name is the business key, this is a dupe, but which row do we get rid of?  A wrong decision one way or the other would either short-change joe or double his salary.  This is one of the real difficulties in removing dupes.
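    What helps here is to surface those conflicts before deleting anything, so a person (or a documented business rule) can decide which version survives.  A minimal sketch, again assuming the article's hypothetical id/name/salary columns:

        -- List business keys that have more than one distinct salary,
        -- i.e. "duplicates" that are not safe to delete automatically.
        SELECT name,
               COUNT(*)               AS row_count,
               COUNT(DISTINCT salary) AS distinct_salaries,
               MIN(salary)            AS min_salary,
               MAX(salary)            AS max_salary
        FROM   dbo.Employee
        GROUP  BY name
        HAVING COUNT(DISTINCT salary) > 1;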

    The second difficulty is removing records that are duplicates but have differing business keys.  For example, suppose we are bringing together data from two sub-companies that both buy from 3COM, but they spell it differently: one company stores '3COM', while the other has multiple entries, '3 COM', '3.COM', and 'Three Com'.  How do we make sure all of these records end up linked to a single entity?  I have methods that can do most of this kind of matching, using the older DTS and the newer SSIS fuzzy-matching capabilities.  They are a good start, but be careful, as you can inadvertently miss matches or make wrong ones.  If people are interested, I could probably write up something about this.  My point is simply that constraints alone will not always work.  The old principle of GIGO (Garbage In, Garbage Out) applies, but with the right techniques you can at least clean up the trash a little and make it more presentable.
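    For the simpler spelling variants, even crude normalization gets you part of the way before any fuzzy matching.  The sketch below assumes a hypothetical dbo.Vendor table with a vendor_name column; it collapses '3COM', '3 COM', and '3.COM' onto one match key, but it will not catch 'Three Com', which is where the SSIS fuzzy components (or similar) come in.

        -- Collapse trivially different spellings onto one match key and
        -- show which raw values would be merged into a single entity.
        SELECT UPPER(REPLACE(REPLACE(REPLACE(vendor_name, ' ', ''), '.', ''), '-', '')) AS match_key,
               COUNT(DISTINCT vendor_name) AS spellings,
               MIN(vendor_name)            AS candidate_canonical_name
        FROM   dbo.Vendor
        GROUP  BY UPPER(REPLACE(REPLACE(REPLACE(vendor_name, ' ', ''), '.', ''), '-', ''))
        HAVING COUNT(DISTINCT vendor_name) > 1;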

    I hope this was helpful to people.