
Deleting Duplicate Records


Daren Bieniek
Forum Newbie

Group: General Forum Members
Points: 5 Visits: 1

The script is interesting, but I agree with the other posters that a non-cursor method is best. In fact, I believe cursors should be hidden away in an advanced section of BOL with huge warnings around them saying things like "USE AT YOUR OWN RISK". We definitely don't want newbies being shown cursor-based solutions, which appears to advocate their use. I have done this kind of work for years with billions of records, and I've found the following to be the fastest way to remove the dupes, provided a substantial percentage of the rows are dupes.

1. Move the unique rows into another table (either created ahead of time or via SELECT...INTO); the new table should have no indexes yet.

2. Drop the original table.

3. Rename the new table to the same name as the old and create any needed indexes, etc.

Moving the unique rows into a new table is substantially faster than deleting rows from the original table (which the engine does internally via seek operations), unless the percentage of duplicates is very low (say a few percent, which can vary slightly). Also, having no indexes makes the inserts go much faster and eliminates the index fragmentation that would otherwise occur and have to be cleaned up later. Next, dropping the original table and renaming the new one (a metadata-only action) is much faster than copying the records from a temp table or table variable back into the original. Finally, creating any needed indexes afterwards (make sure you build the clustered index first) means they start out in a non-fragmented state.
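To make that concrete, here is a minimal sketch of the rebuild, assuming the article's employee table and treating rows as dupes when all columns match; the object and index names are just illustrative:

SELECT DISTINCT id, name, salary
INTO employee_clean   -- a heap with no indexes yet, so the inserts run fast
FROM employee

DROP TABLE employee

EXEC sp_rename 'employee_clean', 'employee'   -- metadata-only, so very quick

-- Build the clustered index first, then any nonclustered indexes,
-- so everything starts out unfragmented.
CREATE CLUSTERED INDEX IX_employee_id ON employee (id)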

A couple of things about the replies concerning how the data got this way to begin with. Usually this scenario arises during data cleansing when combining data from different sources, for example a data warehouse load from multiple source systems. Operations like this are often not only necessary but can constitute a substantial amount of the data cleansing work. Personally, I call this "homogenizing the data", or making the data homogeneous. If you have multiple source systems from, say, multiple subsidiaries run autonomously under a single holding company, then it is very difficult (or impossible) to get the data clean to begin with.

However, the two most difficult aspects of removing duplicates are not discussed in the article. One is deciding which duplicate is the right one. In the real world, duplicates are rarely identical across all columns; more often they are two rows with the same key whose remaining columns differ. Using the author's example "non-dupe" table, id=2, name='joe' has two rows, one with a salary of 1000 and the other 2000. Since name is the business key, this is a dupe, but which one do we get rid of? A wrong decision either way would either short-change Joe or double his salary. This is one of the real difficulties in removing dupes.
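Before deleting anything, a query along these lines (a sketch against the article's employee table) at least surfaces the business keys whose rows disagree, so a person can rule on which one wins:

SELECT name,
       COUNT(*)               AS rows_with_key,
       COUNT(DISTINCT salary) AS distinct_salaries
FROM employee
GROUP BY name
HAVING COUNT(*) > 1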

The second difficulty is removing records that are duplicates but have differing business keys. For example, say we are bringing together data from two subsidiaries that both buy from 3COM, but they spell it differently: one has '3COM' and the other has multiple entries, '3 COM', '3.COM', and 'Three Com'. So how do we link all of these records to a single entity? I have methods that can do most of this matching in the older DTS, and the newer SSIS fuzzy-matching capabilities are a good start, but be careful: you can inadvertently miss matches or make wrong ones. If people are interested, I could probably write up something about this. The point is that simple constraints will not always work. The old principle of GIGO (Garbage In, Garbage Out) applies, but with certain techniques you can at least clean up the trash a little and make it a little more presentable.
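As a crude first pass (this is not the fuzzy matching itself, and the vendor table and column are hypothetical), stripping spaces and punctuation collapses '3COM', '3 COM', and '3.COM' into one key, though it still misses 'Three Com', which is exactly where fuzzy matching earns its keep:

SELECT REPLACE(REPLACE(vendor_name, ' ', ''), '.', '') AS normalized_name,
       COUNT(*) AS variants
FROM vendor
GROUP BY REPLACE(REPLACE(vendor_name, ' ', ''), '.', '')
HAVING COUNT(*) > 1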

I hope this was helpful to people.

 


Brian Y. Anderson
Forum Newbie

Group: General Forum Members
Points: 1 Visits: 1

I was totally looking for something like this. The fact that it ended up being so elegant and simple made me feel ignorant for a moment. This is a very helpful script. It's funny: after thinking about how to do something like this for so long, you end up overthinking it. Then someone provides such a simple solution and it's like, 'Wow, how could I not have seen this before?'

Thanks again,

Brian


kevin mann
SSC Veteran

Group: General Forum Members
Points: 226 Visits: 119
--OR!!

-- Delete duplicates one row at a time: SET ROWCOUNT 1 limits each DELETE
-- to a single row, and the loop repeats until no (id, name, salary) group
-- has more than one row, leaving exactly one copy of each duplicate.
DECLARE @affected tinyint

SET @affected = 1

SET ROWCOUNT 1

WHILE @affected > 0
BEGIN

   DELETE employee
   FROM   employee
   INNER JOIN
   (
      SELECT id, name, salary
      FROM employee
      GROUP BY id, name, salary
      HAVING COUNT(1) > 1
   ) dupes
      ON  dupes.id = employee.id
      AND dupes.name = employee.name
      AND dupes.salary = employee.salary

   SET @affected = @@ROWCOUNT
END

SET ROWCOUNT 0

SELECT * FROM employee
ORDER BY id

DROP TABLE employee
William Alber
Forum Newbie

Group: General Forum Members
Points: 1 Visits: 1

I'll typically use a DELETE FROM / GROUP BY combination, sometimes in conjunction with a temporary table:

-- Keep the lowest id in each (name, salary) group and delete the rest.
DELETE FROM employee
WHERE id NOT IN (
    SELECT MIN(id)
    FROM employee
    GROUP BY name, salary
)

...assuming that id is an identity column (if your table doesn't have an identity column then you're on your own!).

I'll often adapt this approach when there are, e.g., timestamp fields, leaving them out of the GROUP BY if you'd still want to consider the record a dupe regardless.

If you're after all those records with the highest timestamp etc. then a temporary table with an identity column works wonders.

To kill off all records with a dupe'd employee name, leaving the record with the highest salary...

CREATE TABLE #emps (
    myID INTEGER IDENTITY(1, 1),
    id INTEGER,
    name NCHAR(100), -- whatever
    salary MONEY
)

-- The ORDER BY guarantees identity values are assigned in salary order,
-- so the lowest myID per name is the highest-paid row. The column list
-- is required because myID is an identity column.
INSERT INTO #emps (id, name, salary)
SELECT id, name, salary
FROM employee
ORDER BY salary DESC

-- Delete everything except the first (lowest myID, i.e. highest salary)
-- entry for each name.
DELETE FROM employee
WHERE id NOT IN (
    SELECT id FROM #emps
    WHERE myID IN (
        SELECT MIN(myID)
        FROM #emps
        GROUP BY name
    )
)

...I realise that NOT IN is not a particularly expedient operation, but I've had great success with this in the past.
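(For what it's worth, the same delete can be phrased with NOT EXISTS, which the optimizer often treats more kindly than NOT IN; this is only a sketch under the same assumptions as above.)

-- Delete every employee row with no "keeper" entry in #emps; the keeper
-- for each name is its first (lowest myID, i.e. highest salary) row.
DELETE FROM employee
WHERE NOT EXISTS (
    SELECT 1
    FROM #emps k
    WHERE k.id = employee.id
      AND k.myID IN (SELECT MIN(myID) FROM #emps GROUP BY name)
)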


sheepoo
SSC-Enthusiastic

Group: General Forum Members
Points: 121 Visits: 34
My personal experience is that duplicate records in the database mean poor validation on both the client and DB side, as well as poor database design. I agree with those who would 'nip the evil in the bud' and not allow duplicates into the database in the first place.

The author mentioned that this article is for novice SQL Server users; I would rather train novice users to avoid the problem in the first place than to react to it and realize that some things were left wanting in the DB design.
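For example (a sketch, assuming name is the business key of the article's employee table; the constraint name is made up):

ALTER TABLE employee
ADD CONSTRAINT UQ_employee_name UNIQUE (name)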


Just my 2 cents.
kevin mann
SSC Veteran

Group: General Forum Members
Points: 226 Visits: 119
How about a new admin taking over for a careless admin?
Andrew Clarke-161005
SSC Rookie

Group: Administrators
Points: 31 Visits: 52
I'm surprised that anyone working with SQL Server needs lessons in removing duplicates, and puzzled by the complexity of the cursor-based solution originally published; a better solution has already been pointed out in the forum. As to the opinion that databases should never contain duplicate entries: yes, but somehow they sometimes creep in, like the Kernighan and Ritchie 'this is impossible' error message, which one should never see and often does. Commercial databases are often of frightening complexity, and one should always be on guard for duplicates. Constraints are one way of doing it, but there are times when the overhead is too great.

Editor: Simple-Talk
brian mcmillin
Forum Newbie

Group: General Forum Members
Points: 1 Visits: 1
Why not just do a SELECT DISTINCT into a new table, delete the old, and then rename the new to the old?
kevin mann
SSC Veteran

Group: General Forum Members
Points: 226 Visits: 119
That was mentioned above.
johncyriac
SSC Journeyman

Group: General Forum Members
Points: 90 Visits: 58

Theoretically speaking, SELECT DISTINCT is the best way, but which fields you want to be distinct may change in each case.

In practice these are problems for the people who deal with data porting/migration, not for DBAs.




