Get Rid of Duplicates!


seth delconte
To answer the 'Why don't you just use replication/triggers to keep the tables in sync' questions:
Our app is being phased out, and was developed by 2 teams of developers that wrote the app to access 2 different databases that were very similar, but not exactly the same. As we are developing new software to replace the old app, I have to keep it functional for now. Thus, replication and/or triggers are not a viable solution in this case.

_________________________________
seth delconte
http://sqlkeys.com
Steve Jones
Good article, Seth, and a nice explanation of the issue. It's not always easy to do things up front, especially when you have business reasons for not putting resources into those solutions. We've all had apps that we would like to re-architect, but could not for some reason.

Follow me on Twitter: @way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com
Logicalman1998
Good explanation. I can understand now why the issue cannot be resolved up front.

Thanks,
dbajunior
I use this one a lot because it removes multiples (3's, 4's, etc) - not just duplicates...

WITH dups AS
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY USER_NAME, start_date ORDER BY USER_NAME, start_date) AS RowNum
    FROM tbl_users
)
DELETE FROM dups WHERE RowNum > 1;
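
A self-contained sketch of the same pattern, for anyone who wants to try it on throwaway data first. The temp table #tbl_users and its rows are invented for illustration; only the two column names come from the post. It assumes SQL Server 2005 or later, since ROW_NUMBER() and CTEs do not exist before that.

```sql
-- Hypothetical sample table: column names follow the post, the data is made up.
CREATE TABLE #tbl_users ([user_name] varchar(50), start_date datetime);

INSERT INTO #tbl_users ([user_name], start_date) VALUES ('alice', '20090101');
INSERT INTO #tbl_users ([user_name], start_date) VALUES ('alice', '20090101');
INSERT INTO #tbl_users ([user_name], start_date) VALUES ('alice', '20090101'); -- a triple
INSERT INTO #tbl_users ([user_name], start_date) VALUES ('bob',   '20090215');
INSERT INTO #tbl_users ([user_name], start_date) VALUES ('bob',   '20090215'); -- a pair
INSERT INTO #tbl_users ([user_name], start_date) VALUES ('carol', '20090301'); -- unique

-- Number the rows inside each (user_name, start_date) group,
-- then delete every row except the first one in its group.
WITH dups AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY [user_name], start_date
                              ORDER BY [user_name], start_date) AS RowNum
    FROM #tbl_users
)
DELETE FROM dups WHERE RowNum > 1;

-- Each (user_name, start_date) pair should now appear exactly once.
SELECT [user_name], start_date, COUNT(*) AS copies
FROM #tbl_users
GROUP BY [user_name], start_date;

DROP TABLE #tbl_users;
```

Because the rows in each group are identical on the ordering columns, which copy survives is arbitrary. That is fine here, but the ORDER BY would need a tie-breaking column if a specific copy had to be kept.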
dasapito
I use this:

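-- SQL Server requires the alias to be named before FROM when the target table is aliased
DELETE tu1
FROM tblUser AS tu1
WHERE tu1.intUserID > ANY (SELECT tu2.intUserID
                           FROM tblUser AS tu2
                           WHERE tu2.strUserName = tu1.strUserName
                             AND tu2.strFamilyName = tu1.strFamilyName)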
rstelma
If you found two duplicated item_no's why did four rows get deleted? Wouldn't you want to delete just one of the duplicates so that one unique row would remain?

I must be missing something. Thanks for your explanation in advance.

Richard
Jeff Moden
I have to admit that when I read the following, I thought that Seth had simply lost his mind...
My quick resolution in this situation is to:

1. Remove the unique index temporarily;
2. Run the application, allowing it to insert duplicate item(s);
3. then find the duplicate(s) and remove them.

Of course, these steps are preceded by performing a good backup of the database and possibly putting the database in single user mode to prevent unexpected query results during my work. As simple as the task of removing a record with a duplicate value sounds, it can get confusing, and I need to proceed with care. To be safe, I follow this rule of thumb: first I perform a SELECT of the record(s) that will be removed, then I convert it to a DELETE statement after I'm sure it will affect only record(s) that I want it to.
... because it just wasn't clear that it was a legacy app that shouldn't be changed because of the impending rewrite. I thought that was an awful lot of work to do a simple conditional insert.
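
The quoted rule of thumb (SELECT first, then convert to DELETE) can be sketched as follows. This is only an illustration, not the article's script: the temp table #items, its columns, and its rows are hypothetical, and it assumes an id column that differs between the duplicate rows. It avoids ROW_NUMBER(), so it should also run on pre-2005 versions.

```sql
-- Hypothetical sample data: item_no 100 was inserted twice under different ids.
CREATE TABLE #items (id int NOT NULL, item_no int NOT NULL);

INSERT INTO #items (id, item_no) VALUES (1, 100);
INSERT INTO #items (id, item_no) VALUES (2, 100);
INSERT INTO #items (id, item_no) VALUES (3, 200);

-- Step 1: preview the rows that would be removed
-- (every row whose id is above the lowest id for its item_no).
SELECT i.id, i.item_no
FROM #items AS i
WHERE i.id > (SELECT MIN(i2.id) FROM #items AS i2 WHERE i2.item_no = i.item_no);

-- Step 2: once the preview looks right, keep the FROM and WHERE clauses
-- unchanged and swap only the SELECT list for DELETE.
DELETE i
FROM #items AS i
WHERE i.id > (SELECT MIN(i2.id) FROM #items AS i2 WHERE i2.item_no = i.item_no);

-- What is left: one row per item_no.
SELECT id, item_no FROM #items ORDER BY id;

DROP TABLE #items;
```

Keeping the predicate textually identical between the two steps is the point: the DELETE can only touch the rows the SELECT already showed.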

Now that Seth has clarified the problem a bit, I can mostly agree with the pain he goes through, including that of duplicate elimination. On that subject, and for all of those who made the very good suggestion of using ROW_NUMBER() to isolate duplicates, keep in mind that this is a legacy app on a legacy DB, and it might be pre-2k5, where ROW_NUMBER() simply doesn't exist. Still, the title of the article is "Get Rid of Duplicates" and not "Get Rid of Duplicates for a Special Case", and I can certainly understand why people may have jumped to the wrong conclusion on this article, especially when the wrap-up line in the Conclusion is "Now you can confidently remove duplicate records from your tables!" and there was no mention of version or ROW_NUMBER(). ;-)

That notwithstanding, for what the article was actually about, it was a good, well written article. Thanks, Seth.

As a side bar... I don't know what the app would do with the "Duplicate key was ignored." message that would show up if you tried to insert any dupes (some apps interpret such messages as an error... same goes for returned row counts), but have you tried changing the unique index to a unique index with the "IGNORE_DUP_KEY = ON" setting? If the app forgives the warning message(s) about dupes being ignored (1 for each INSERT statement that has dupes, no matter how many dupes exist in that INSERT), it could save you a wad of the trouble that you're currently going through.
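
A minimal sketch of what that setting looks like, assuming a hypothetical temp table #items with a unique index on item_no (the table, index name, and rows are placeholders, not taken from the article). The WITH (IGNORE_DUP_KEY = ON) form shown here is the SQL Server 2005 and later syntax.

```sql
-- Hypothetical table standing in for the one the legacy app inserts into.
CREATE TABLE #items (item_no int NOT NULL, descr varchar(50));

-- Unique index that discards duplicate keys instead of failing the INSERT.
CREATE UNIQUE INDEX ux_items_item_no
    ON #items (item_no)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO #items (item_no, descr) VALUES (1, 'widget');

-- Same key again: the row is discarded and the statement finishes with the
-- warning "Duplicate key was ignored." rather than a unique-key violation.
INSERT INTO #items (item_no, descr) VALUES (1, 'widget again');

-- Only the first row for item_no = 1 is stored.
SELECT item_no, descr FROM #items;

DROP TABLE #items;
```

Whether this helps depends entirely on how the application reacts to the warning and to a row count lower than it expected, which is the open question raised above.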

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell, doesn't mean you should.

sql-troubles
I prefer to use GROUP BY when I want to see the number of duplicates for each group of attributes used to check uniqueness. ROW_NUMBER is a nicer solution, though, and I use it especially when I want to use the latest or earliest entered version of the same record.
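
Both uses can be sketched on a small example. The temp table #items, its columns (including entered_on as the "when was it entered" column), and its rows are hypothetical; the second query needs SQL Server 2005 or later for ROW_NUMBER().

```sql
-- Hypothetical sample data: item_no 100 was entered twice, on different dates.
CREATE TABLE #items (item_no int NOT NULL, descr varchar(50), entered_on datetime);

INSERT INTO #items (item_no, descr, entered_on) VALUES (100, 'bolt v1', '20090101');
INSERT INTO #items (item_no, descr, entered_on) VALUES (100, 'bolt v2', '20090301');
INSERT INTO #items (item_no, descr, entered_on) VALUES (200, 'nut',     '20090201');

-- GROUP BY: how many copies exist for each value that should be unique?
SELECT item_no, COUNT(*) AS copies
FROM #items
GROUP BY item_no
HAVING COUNT(*) > 1;

-- ROW_NUMBER: pick the latest entered version of each record.
-- Change DESC to ASC to pick the earliest instead.
SELECT ranked.item_no, ranked.descr, ranked.entered_on
FROM (
    SELECT item_no, descr, entered_on,
           ROW_NUMBER() OVER (PARTITION BY item_no ORDER BY entered_on DESC) AS rn
    FROM #items
) AS ranked
WHERE ranked.rn = 1;

DROP TABLE #items;
```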

The check for duplicates might be required when merging data from two different sources or when breaking a non-normalized table into multiple tables (for example, a headers/lines set of tables). Another situation in which I had to check for duplicates is when importing data from non-relational sources (e.g. text files, Excel sheets, etc.), in which the chances of having duplicates are quite high.

As already stressed, it's preferable to reduce the possibility of entering duplicates up front; unfortunately, that's not always possible.

It's not always required to add unique indexes/constraints, though that was a good tip.
G33kKahuna
Seth,

this is an age-old dilemma and was put to rest with many variations, as JP de Jong-202059 pointed out. In fact, this site has articles with scripts to perform the same task. Stop wasting our time with your eureka moments ....
seth delconte
Richard,

In this scenario, I have 2 duplicated records, where every field is identical. I copied one instance of each record into a temp table, then deleted ALL the records from the original table that had item_nos that were duplicated (each of the 2 records had 1 duplicate record, so the total was 4 records). I chose to group by item_no, but could just as easily have used id. Then I copied everything from the temp table back to the original table (2 non-duplicate records). This method just seemed to make sense to me; I'm sure there are other, possibly more efficient, ways to do this.
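
The three steps described above can be sketched roughly like this. It is a reconstruction under assumptions, not the article's exact script: #inventory stands in for the original table, #keep for the temp table, and the columns other than id and item_no, along with all the rows, are invented.

```sql
-- Hypothetical stand-in for the original table: two records, each stored twice
-- with every field identical, plus one record that has no duplicate.
CREATE TABLE #inventory (id int, item_no int, descr varchar(50));

INSERT INTO #inventory (id, item_no, descr) VALUES (1, 100, 'bolt');
INSERT INTO #inventory (id, item_no, descr) VALUES (1, 100, 'bolt');
INSERT INTO #inventory (id, item_no, descr) VALUES (2, 200, 'nut');
INSERT INTO #inventory (id, item_no, descr) VALUES (2, 200, 'nut');
INSERT INTO #inventory (id, item_no, descr) VALUES (3, 300, 'washer');

-- 1. Copy one instance of each duplicated record into a temp table.
--    DISTINCT collapses the copies because every field is identical.
SELECT DISTINCT i.id, i.item_no, i.descr
INTO #keep
FROM #inventory AS i
WHERE i.item_no IN (SELECT item_no
                    FROM #inventory
                    GROUP BY item_no
                    HAVING COUNT(*) > 1);

-- 2. Delete ALL rows for the duplicated item_nos from the original table
--    (4 rows in this sample: 2 records x 2 copies each).
DELETE FROM #inventory
WHERE item_no IN (SELECT item_no FROM #keep);

-- 3. Copy the saved single instances back (2 rows in this sample).
INSERT INTO #inventory (id, item_no, descr)
SELECT id, item_no, descr
FROM #keep;

-- The table now holds one row per item_no.
SELECT id, item_no, descr FROM #inventory ORDER BY id;

DROP TABLE #keep;
DROP TABLE #inventory;
```

This route works even when the rows have no distinguishing column at all, which is the case where a "keep the lowest id" delete cannot tell the copies apart.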
