SQLServerCentral is supported by Red Gate Software Ltd.
 
 
 
 
        



Get Rid of Duplicates!
Author
Message
Posted Monday, November 30, 2009 8:00 AM


SSCommitted


Group: General Forum Members
Last Login: Friday, February 21, 2014 7:54 AM
Points: 1,619, Visits: 1,233
To answer the 'Why don't you just use replication/triggers to keep the tables in sync' questions:
Our app is being phased out, and was developed by 2 teams of developers that wrote the app to access 2 different databases that were very similar, but not exactly the same. As we are developing new software to replace the old app, I have to keep it functional for now. Thus, replication and/or triggers are not a viable solution in this case. :)


_________________________________
seth delconte
http://sqlkeys.com
Post #826222
Posted Monday, November 30, 2009 8:15 AM


SSC-Dedicated


Group: Administrators
Last Login: Today @ 4:39 PM
Points: 33,155, Visits: 15,291
Good article, Seth, and a nice explanation of the issue. It's not always easy to do things up front, especially when you have business reasons for not putting resources into those solutions. We've all had apps that we would like to re-architect, but could not for some reason.






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #826232
Posted Monday, November 30, 2009 8:21 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, September 28, 2010 10:12 AM
Points: 5, Visits: 17


Good explanation; I can now understand why the issue cannot be resolved up front.

Thanks,
Post #826236
Posted Monday, November 30, 2009 8:31 AM


SSC-Enthusiastic


Group: General Forum Members
Last Login: Yesterday @ 8:31 PM
Points: 123, Visits: 1,443
I use this one a lot because it removes multiples (triplicates, quadruplicates, etc.), not just pairs of duplicates...

WITH dups AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY USER_NAME, start_date
                              ORDER BY USER_NAME, start_date) AS RowNum
    FROM tbl_users
)
DELETE FROM dups WHERE RowNum > 1;
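A self-contained sketch of how that CTE delete behaves, using a temp table with assumed sample data (the table and column names match the post above, but the rows are invented for illustration):

```sql
CREATE TABLE #tbl_users (USER_NAME varchar(50), start_date date);
INSERT INTO #tbl_users VALUES
    ('alice', '2009-01-01'),
    ('alice', '2009-01-01'),  -- duplicate
    ('bob',   '2009-02-01'),
    ('bob',   '2009-02-01'),  -- duplicate
    ('bob',   '2009-02-01');  -- triplicate

WITH dups AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY USER_NAME, start_date
                              ORDER BY USER_NAME, start_date) AS RowNum
    FROM #tbl_users
)
-- Deleting through the CTE deletes from the underlying table;
-- every row after the first in each partition is removed, so
-- three rows go and one copy of each key remains.
DELETE FROM dups WHERE RowNum > 1;
```

Because the deletable-CTE form targets the base table directly, no temp-table round trip is needed, and it handles any number of copies per key, not just pairs.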
Post #826246
Posted Monday, November 30, 2009 10:46 AM
Valued Member


Group: General Forum Members
Last Login: Friday, March 9, 2012 7:42 AM
Points: 51, Visits: 28
I use this:

-- T-SQL needs the two-part form to delete through an alias;
-- this keeps only the row with the lowest intUserID per name pair.
DELETE tu1
FROM tblUser tu1
WHERE tu1.intUserID > ANY (SELECT tu2.intUserID
                           FROM tblUser tu2
                           WHERE tu2.strUserName = tu1.strUserName
                             AND tu2.strFamilyName = tu1.strFamilyName)
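A small demo of the `> ANY` approach against a temp table; the sample rows are assumptions added for illustration:

```sql
CREATE TABLE #tblUser (intUserID int, strUserName varchar(50), strFamilyName varchar(50));
INSERT INTO #tblUser VALUES
    (1, 'John', 'Smith'),
    (2, 'John', 'Smith'),   -- duplicate of row 1
    (3, 'Jane', 'Doe');

-- A row is deleted whenever some other row with the same name pair
-- has a smaller intUserID, so each pair keeps its lowest ID.
DELETE tu1
FROM #tblUser tu1
WHERE tu1.intUserID > ANY (SELECT tu2.intUserID
                           FROM #tblUser tu2
                           WHERE tu2.strUserName = tu1.strUserName
                             AND tu2.strFamilyName = tu1.strFamilyName);
-- Rows 1 and 3 survive.
```

Note that this relies on a distinct ID column; it cannot help when the duplicate rows are identical in every column, which is the scenario the article addresses.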
Post #826382
Posted Monday, November 30, 2009 10:47 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, May 13, 2014 6:00 PM
Points: 8, Visits: 59
If you found two duplicated item_no's why did four rows get deleted? Wouldn't you want to delete just one of the duplicates so that one unique row would remain?

I must be missing something. Thanks for your explanation in advance.

Richard
Post #826384
Posted Monday, November 30, 2009 5:41 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 6:46 PM
Points: 36,944, Visits: 31,446
I have to admit that when I read the following, I thought that Seth had simply lost his mind...
My quick resolution in this situation is to:

1. Remove the unique index temporarily;
2. Run the application, allowing it to insert duplicate item(s);
3. then find the duplicate(s) and remove them.

Of course, these steps are preceded by performing a good backup of the database and possibly putting the database in single user mode to prevent unexpected query results during my work. As simple as the task of removing a record with a duplicate value sounds, it can get confusing, and I need to proceed with care. To be safe, I follow this rule of thumb: first I perform a SELECT of the record(s) that will be removed, then I convert it to a DELETE statement after I'm sure it will affect only record(s) that I want it to.
... because it just wasn't clear that it was a legacy app that shouldn't be changed because of the impending rewrite. I thought that was an awful lot of work to do a simple conditional insert.

Now that Seth has clarified the problem a bit, I can mostly agree with the pain he goes through, including that of duplicate elimination. On that subject, and for all of those who made the very good suggestion of using ROW_NUMBER() to isolate duplicates, keep in mind that this is a legacy app on a legacy DB, and it might be pre-2005, where ROW_NUMBER() simply doesn't exist. Still, the title of the article is "Get Rid of Duplicates" and not "Get Rid of Duplicates for a Special Case", and I can certainly understand why people may have jumped to the wrong conclusion on this article, especially when the wrap-up line in the Conclusion is "Now you can confidently remove duplicate records from your tables!" and there was no mention of version or ROW_NUMBER().

That notwithstanding, for what the article was actually about, it was a good, well written article. Thanks, Seth.

As a side bar... I don't know what the app would do with the "Duplicate key was ignored." message that would show up if you tried to insert any dupes (some apps interpret such messages as an error... same goes for returned row counts), but have you tried changing the unique index to a unique index with the "IGNORE_DUP_KEY = ON" setting? If the app forgives the warning message(s) about dupes being ignored (one for each INSERT statement that has dupes, no matter how many dupes exist in that INSERT), it could save you a wad of the trouble that you're currently going through.
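A sketch of that suggestion; the index and table names (IX_tbl_items_item_no, tbl_items, item_no) are assumptions, and DROP_EXISTING presumes an index of that name already exists:

```sql
-- Recreate the unique index so an INSERT containing duplicates
-- raises only the "Duplicate key was ignored." warning; the
-- duplicate rows are silently discarded and the rest are inserted.
CREATE UNIQUE INDEX IX_tbl_items_item_no
    ON tbl_items (item_no)
    WITH (IGNORE_DUP_KEY = ON, DROP_EXISTING = ON);
```

Whether this works in practice depends, as Jeff notes, on the app tolerating the warning message and the reduced row count.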


--Jeff Moden
RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #826432
Posted Tuesday, December 1, 2009 2:42 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, April 1, 2014 5:49 AM
Points: 9, Visits: 141
I prefer to use GROUP BY when I want to see the number of duplicates for each group of attributes used to check uniqueness; ROW_NUMBER is a nicer solution, and I use it especially when I want to keep the latest or earliest entered version of the same record.

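The GROUP BY approach mentioned above can be sketched as follows; the table and column names are assumptions carried over from the earlier posts:

```sql
-- List each duplicated key and how many copies of it exist.
SELECT USER_NAME, start_date, COUNT(*) AS copies
FROM tbl_users
GROUP BY USER_NAME, start_date
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

This is read-only, so it is a safe first step before any delete: it shows the scale of the problem without touching the data.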
A check for duplicates might be required when merging data from two different sources or when breaking an unnormalized table into multiple tables, for example a headers/lines set of tables. Another situation in which I had to check for duplicates is when importing data from non-relational sources (e.g. text files, Excel sheets) in which the chances of having duplicates are quite high.

As already stressed, it's preferable to reduce the possibility of entering duplicates up front; unfortunately, that's not always possible.

It's not always required to add unique indexes/constraints, though that was a good tip.
Post #826532
Posted Tuesday, December 1, 2009 9:16 AM
Valued Member


Group: General Forum Members
Last Login: Thursday, October 31, 2013 8:04 PM
Points: 68, Visits: 278
Seth,

this is an age-old dilemma and was put to rest with many variations, like the one JP de Jong-202059 pointed out. In fact, this site has articles with scripts to perform the same task. Stop wasting our time with your eureka moments ....
Post #826783
Posted Tuesday, December 1, 2009 9:45 AM


SSCommitted


Group: General Forum Members
Last Login: Friday, February 21, 2014 7:54 AM
Points: 1,619, Visits: 1,233
If you found two duplicated item_no's why did four rows get deleted? Wouldn't you want to delete just one of the duplicates so that one unique row would remain?

I must be missing something. Thanks for your explanation in advance.

Richard


Richard,

In this scenario, I have 2 duplicated records, where every field is identical. I copied one instance of each record into a temp table, then deleted ALL the records from the original table that had item_nos that were duplicated (each of the 2 records had 1 duplicate, so the total was 4 records). I chose to group by item_no, but could just as easily have used id. Then I copied everything from the temp table back to the original table (2 non-duplicate records). This method just seemed to make sense to me; I'm sure there are other, possibly more efficient ways to do this.
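The temp-table approach described above can be sketched like this; the table name (tbl_items) and a second column (item_desc) are assumptions standing in for the article's schema:

```sql
-- 1. Copy one instance of each duplicated record into a temp table.
--    DISTINCT works here because the duplicate rows are identical
--    in every column.
SELECT DISTINCT item_no, item_desc
INTO #dedup
FROM tbl_items
WHERE item_no IN (SELECT item_no
                  FROM tbl_items
                  GROUP BY item_no
                  HAVING COUNT(*) > 1);

-- 2. Delete ALL rows whose item_no is duplicated
--    (both copies of both keys: 4 rows in the article's example).
DELETE FROM tbl_items
WHERE item_no IN (SELECT item_no FROM #dedup);

-- 3. Copy the single surviving copies back (2 rows).
INSERT INTO tbl_items (item_no, item_desc)
SELECT item_no, item_desc FROM #dedup;

DROP TABLE #dedup;
```

Wrapping the three steps in a transaction would keep the table consistent if anything fails between the delete and the re-insert.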


_________________________________
seth delconte
http://sqlkeys.com
Post #826821

