Deleting Duplicate Records
Posted Friday, March 10, 2006 7:40 AM
Forum Newbie


Group: General Forum Members

The script is interesting, but I agree with the other people who have posted here that a non-cursor method is best.  In fact, I believe cursors should be hidden in BOL in an advanced section surrounded by huge warnings that say things like "USE AT YOUR OWN RISK".  We definitely don't want newbies shown solutions that use cursors, since that appears to advocate their use.  I have done work like this for years with billions of records, and I have found the following to be the fastest way to remove the dupes when a substantial percentage of the rows are duplicates.

1. Move the unique rows into another table (either created ahead of time or using SELECT...INTO); either way, the new table should have no indexes yet.

2. Drop the original table.

3. Rename the new table to the same name as the old and create any needed indexes, etc.

Moving the unique rows into a new table is substantially faster than having the engine do seek operations on the original table to delete rows, unless there is a very low percentage of duplicates (say a few percent; the threshold varies).  Also, having no indexes makes the inserts go much faster and eliminates the index fragmentation that would otherwise occur and have to be cleaned up later.  Next, dropping the original table and the metadata-only action of renaming the new one is much faster than copying the records from a temp table or table variable back into another table.  Finally, creating any needed indexes at the end (make sure you build the clustered index first) means they start out in a non-fragmented state.
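Roughly, those three steps look like this. This is a minimal sketch only, assuming the article's employee table and deduping on all three columns; employee_dedup and the index name are just placeholders:

SELECT DISTINCT id, name, salary
INTO employee_dedup   -- new heap, no indexes yet, so the load is fast
FROM employee

DROP TABLE employee

EXEC sp_rename 'employee_dedup', 'employee'

-- Build the clustered index first, then any nonclustered indexes
CREATE CLUSTERED INDEX IX_employee_id ON employee (id)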

A couple of thoughts on the replies asking how the data got this way to begin with.  Usually this scenario comes up during data cleansing when combining data from different sources, for example a data warehouse load from multiple source systems.  Many times operations like this are not only necessary, but can constitute a substantial amount of the data cleansing work.  Personally, I call operations like this "homogenizing the data", or making the data homogeneous.  If you have multiple source systems from, say, multiple sub-companies run under a single holding company, and each sub-company has autonomous operations, then it is very difficult (or impossible) to get the data clean to begin with.

However, the two most difficult aspects of removing duplicates are not discussed in the article.  One is deciding which duplicate is the right one.  In the real world duplicates are rarely identical across all columns; more often they are two rows with the same key whose remaining columns differ.  Using the author's example "non-dupe" table, record id=2, name='joe' has two rows, one with a salary of 1000 and the other 2000.  Since name is the business key, this is a dupe, but which one do we get rid of?  A wrong decision one way or the other would either short-change joe or double his salary.  This is one of the real difficulties in removing dupes.
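Whatever the rule turns out to be, it has to be spelled out explicitly.  For illustration only (employee_keep is a made-up staging table, and "keep the higher salary" is just one possible rule):

-- One surviving row per business key (name), keeping the higher salary
SELECT name, MAX(salary) AS salary
INTO employee_keep
FROM employee
GROUP BY name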

The second difficulty is removing records that are duplicates but have differing business keys.  For example, we are bringing together data from two sub-companies that both buy from 3COM, but they spell it differently: one company has '3COM' and the other has multiple entries, '3 COM', '3.COM', and 'Three Com'.  So how do we make all of these records link to a single entity?  I have methods that can do most of this kind of matching in the older DTS, and the newer SSIS fuzzy matching capabilities are a good start, but be careful, as you can inadvertently miss matches or make wrong ones.  If people are interested, I could probably write up something about this.  The point is simply that constraints will not always work.  The old principle of GIGO (Garbage In, Garbage Out) still applies, but with the right techniques you can at least clean up the trash a little and make it more presentable.
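As a very rough illustration only (this is not the DTS/SSIS fuzzy matching described above, and vendor/vendor_name are made-up names), SOUNDEX/DIFFERENCE in T-SQL can catch some spelling variants of the same entity:

-- Pairs of vendor rows whose names sound alike; DIFFERENCE returns 0-4,
-- where 4 means the SOUNDEX codes match exactly
SELECT a.vendor_id, a.vendor_name,
       b.vendor_id AS possible_match, b.vendor_name AS possible_name
FROM vendor AS a
JOIN vendor AS b
  ON a.vendor_id < b.vendor_id
 AND DIFFERENCE(a.vendor_name, b.vendor_name) >= 3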

I hope this was helpful to people.

 

Post #264847
Posted Friday, March 10, 2006 7:42 AM
Forum Newbie


Group: General Forum Members

I was totally looking for something like this. The fact that it ended up being so elegant and simple made me feel ignorant for a moment. This is a very helpful script. It's funny that after you think about trying to do something like this for so long, you end up overthinking it. Then when someone provides such a simple solution, it's like 'wow, how could I not have seen this before!'

 

Thanks again,

Brian

Post #264848
Posted Friday, March 10, 2006 7:56 AM
SSC Veteran


Group: General Forum Members
-- OR: delete the extra copies one row at a time using SET ROWCOUNT
DECLARE @affected TINYINT

SET @affected = 1

-- Limit each DELETE below to a single row
SET ROWCOUNT 1

WHILE @affected > 0
BEGIN

    -- Remove one row from any group that still has more than one copy
    DELETE employee
    FROM employee
    INNER JOIN
    (
        SELECT id, name, salary
        FROM employee
        GROUP BY id, name, salary
        HAVING COUNT(1) > 1
    ) dupes
        ON dupes.id = employee.id
        AND dupes.name = employee.name
        AND dupes.salary = employee.salary

    -- Loop until no duplicated groups remain
    SET @affected = @@ROWCOUNT
END

SET ROWCOUNT 0

SELECT * FROM employee
ORDER BY id

DROP TABLE employee
Post #264861
Posted Friday, March 10, 2006 11:34 AM
Forum Newbie


Group: General Forum Members

I'll typically use a DELETE FROM / GROUP BY combination, sometimes in conjunction with a temporary table:

DELETE FROM employee
WHERE id NOT IN (
    SELECT MIN(id)
    FROM employee
    GROUP BY name, salary
)

...assuming that id is an identity column (if your table doesn't have an identity column then you're on your own!).

I'll often adapt this approach when there are, e.g., timestamp fields that shouldn't be included in the GROUP BY (because you'd still want to consider the record a dupe regardless).

If you want to keep, say, the record with the highest timestamp, then a temporary table with an identity column works wonders.

To kill off all records with dupe'd emp name, leaving the record with the highest salary...

CREATE TABLE #emps (
    myID INTEGER IDENTITY(1, 1),
    id INTEGER,
    name NCHAR(100), -- whatever
    salary MONEY
)

-- ORDER BY salary DESC so that, per name, the lowest myID is the highest-salary row
INSERT INTO #emps
SELECT id, name, salary
FROM employee
ORDER BY salary DESC

-- Keep only the first row captured per name (the highest salary); delete the rest
DELETE FROM employee
WHERE id NOT IN (
    SELECT id FROM #emps
    WHERE myID IN (
        SELECT MIN(myID)
        FROM #emps
        GROUP BY name
    )
)

...I realise that NOT IN is not a particularly expedient operation, but I've had great success with this approach in the past.
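For what it's worth, here is an EXISTS-based sketch of the first DELETE above, under the same assumption that id is unique per row; it keeps the lowest id in each name/salary group:

-- Delete any row that has a twin with the same name and salary but a lower id
DELETE e
FROM employee AS e
WHERE EXISTS (
    SELECT 1
    FROM employee AS e2
    WHERE e2.name = e.name
      AND e2.salary = e.salary
      AND e2.id < e.id
)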

Post #264957
Posted Friday, March 10, 2006 12:29 PM
SSC-Enthusiastic


Group: General Forum Members
My personal experience is that duplicate records in the database mean poor validation on both the client and DB side, as well as poor database design. I agree with those who have gone the way of 'nip the evil in the bud' and not allowed duplicates to get into the database in the first place.

The author mentioned that this article is for novice SQL Server users; I would rather teach novice users how to avoid problems in the first place than have them react to problems later and realize that some things were left wanting in the DB design.
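For example, a unique constraint on the business key stops the dupes at the door. A minimal sketch, assuming the article's employee table and that name really is the business key (the constraint name is just a placeholder):

-- Reject any second row with the same business key
ALTER TABLE employee
ADD CONSTRAINT UQ_employee_name UNIQUE (name)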


Just my 2 cents
Post #264979
Posted Friday, March 10, 2006 12:34 PM
SSC Veteran


Group: General Forum Members
How about a new admin taking over for a careless admin?
Post #264982
Posted Saturday, March 11, 2006 6:13 AM
SSC Rookie


Group: Administrators
I'm surprised that anyone working with SQL Server needs lessons in removing duplicates, and puzzled at the complexity of the cursor-based solution originally published. A better solution has already been pointed out in the forum. As to the opinion that databases should never contain duplicate entries: yes, but somehow they sometimes creep in, like the Kernighan and Ritchie exception error message, 'this is impossible', which one should never see and often does. Commercial databases are often of frightening complexity, and one should always be on guard for duplicates. Yes, constraints are one way of doing it, but there are times when the overhead is too great.

Editor: Simple-Talk
Post #265064
Posted Sunday, March 12, 2006 12:29 PM
Forum Newbie


Group: General Forum Members
Why not just do a SELECT DISTINCT into a new table, delete the old, and then rename the new to the old?
Post #265097
Posted Monday, March 13, 2006 7:58 AM
SSC Veteran


Group: General Forum Members
That was mentioned above.
Post #265188
Posted Tuesday, March 14, 2006 9:30 PM
SSC Journeyman


Group: General Forum Members

Theoretically speaking, SELECT DISTINCT is the best way, but which fields you want to be distinct may change in each case.

In practice these are the problems of people who deal with data porting/migration, not of DBAs.
Post #265776