Removing Duplicates

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 1

Jozef Moravcik (10/8/2013)
Hi Guys,
I see there is a wrong DB structure (table structure). Usually each master table should have at least a primary key. Using a proper primary key, you avoid the problem...

Jozef,

I agree with your point that a proper database table structure will reduce the need for this code. However, that does not apply to the real world. Every day we find table structures that cannot be changed in third-party products' databases that need to be cleaned of duplicate records.

There are MANY databases out there that have tables that are built this way for performance and other reasons. If you do not use one of the many methods to remove the duplicates and instead update the table schema, the product could stop working, the support staff could stop working, and you as a DBA might even stop working.

In short, knowing how to remove duplicate rows is something you do as a professional DBA. Updating the SCHEMA for databases used by live applications is something done by that application's development and support staff. 😎

PHYData DBA SSCertifiable Points: 7541 More actions · Answer 2

Stefan Krzywicki (10/8/2013)
It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data or maybe this should be calculated"

Stefan, your example table is perfect for your article. So perfect that if you Google "Removing Duplicate RowNumber()", a similar table structure appears in most of the other articles on how to do this. 😎

lptech Hall of Fame Points: 3208 More actions · Answer 3

My current shop gets a lot of data from the mainframe, using a variety of methods to load the data into SQL Server. As such, we have many tables without primary keys. Since there are occasionally situations where duplicates slip past the existing processing, this article is extremely useful.

kit-1143032 SSC Enthusiast Points: 115 More actions · Answer 4

I am wondering if anyone has used this in a CTE? I can't delete the duplicates from my actual tables but I would like to eliminate them within my stored procedure.

Thanks,

Kate

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 5

Very good article, Stefan.

Found an odd thing -- Your article displays the date of "2013/10/08" next to your name, but when selecting the "printable" version the date shows "2013/09/23"?

-- Pete

Sioban Krzywicki One Orange Chip Points: 27770 More actions · Answer 6

kit-1143032 (10/8/2013)
I am wondering if anyone has used this in a CTE? I can't delete the duplicates from my actual tables but I would like to eliminate them within my stored procedure.
Thanks,
Kate

You should be able to use this query in your stored procedure as it is. If you use a temp table to stage the data, you can remove the duplicates from the temp table.

Give it a try in a CTE, it should work just fine.
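
As a minimal sketch of both options (not code from the article), the following assumes a table called TableName with Last_Name, First_Name and Age columns; the temp table name #Staged is hypothetical. The first statement only filters duplicates out of the result set and leaves the real table untouched; the second stages the rows and deletes duplicates from the staged copy only.

```sql
-- Option 1: filter duplicates in a CTE without deleting anything.
-- TableName and its columns are assumptions for illustration.
WITH Ranked AS
(
    SELECT Last_Name, First_Name, Age,
           ROW_NUMBER() OVER (PARTITION BY Last_Name, First_Name, Age
                              ORDER BY Last_Name) AS RowNum
    FROM TableName
)
SELECT Last_Name, First_Name, Age
FROM Ranked
WHERE RowNum = 1;          -- keep one row per duplicate group

-- Option 2: stage the data, then remove duplicates from the staged copy.
SELECT Last_Name, First_Name, Age
INTO #Staged               -- hypothetical temp table name
FROM TableName;

WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Last_Name, First_Name, Age
                              ORDER BY Last_Name) AS RowNum
    FROM #Staged
)
DELETE FROM Ranked
WHERE RowNum > 1;          -- deletes from #Staged only; TableName is unchanged
```

Option 1 is probably the simpler fit when the procedure only needs to read de-duplicated rows once; the temp table tends to pay off when the cleaned set is reused several times in the same procedure.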

--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams

Greg Edwards-268690 SSC-Insane Points: 20680 More actions · Answer 7

Stefan Krzywicki (10/8/2013)
Greg Edwards-268690 (10/8/2013)
I can see the need to understand how to do this, although I tend to go back to the basics.
If you find yourself needing to do this, ask whether it could have been designed into the table in the first place to prevent this.
After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.
I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.
If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.
And the changing job title also drives me towards separating out to different tables and having effectivity dates.
But that is way beyond your intended scope.
Just trying to spark a thought or two, not to make a big deal about any of this.
It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data or maybe this should be calculated" then I reminded myself that was outside the scope of this article and I just needed something to use as an example. :-)
Same thing with table design, there are situations where you have no way to prevent this ahead of time, whether it is because the duplicates are in the data coming in or because the table is already in production and the powers that be won't approve a structural change. I wanted to keep the focus on this one task as dealing with every possibility would make the article far longer.

I thought you would have those mixed thoughts and just kept it simple. 🙂

Part of my comment was driven by how often we can see this poor design, something we shouldn't have to run into, but in reality do many times. So we have to deal with it.

Kind of hit a nerve also based on my thoughts on SELECT DISTINCT, which seemed to be common in user queries where I used to work.

Especially when I knew there was no need for this, they were just missing some file joins, and inviting performance and data issues unknowingly.

But good article for the limited scope.
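
To illustrate the SELECT DISTINCT point, here is a rough sketch with entirely hypothetical tables (dbo.OrderLines and dbo.Prices, both assumed to carry ProductId and RegionId). It shows how an incomplete join multiplies rows and how DISTINCT only hides the symptom.

```sql
-- Hypothetical schema: Prices is keyed on (ProductId, RegionId).
-- Joining on ProductId alone returns one row per region for every order line.
-- DISTINCT collapses exact repeats, but a line can still come back with
-- several different prices, and the server does extra work to de-duplicate.
SELECT DISTINCT ol.OrderId, ol.ProductId, p.UnitPrice
FROM dbo.OrderLines AS ol
JOIN dbo.Prices AS p
    ON p.ProductId = ol.ProductId;      -- RegionId predicate is missing

-- Completing the join removes the fan-out at its source,
-- so DISTINCT is no longer needed.
SELECT ol.OrderId, ol.ProductId, p.UnitPrice
FROM dbo.OrderLines AS ol
JOIN dbo.Prices AS p
    ON  p.ProductId = ol.ProductId
    AND p.RegionId  = ol.RegionId;
```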

LinksUp SSCertifiable Points: 6527 More actions · Answer 8

kit-1143032 (10/8/2013)
I am wondering if anyone has used this in a CTE?

Absolutely! Using Row_Number with a CTE is a great combination.

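-- Number the rows within each group of identical (Last_Name, First_Name, Age) values.
;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Last_Name, First_Name, Age ORDER BY Last_Name) AS RowNum
    FROM TableName
)
-- Keep row 1 of each group and delete the rest from the underlying table.
DELETE FROM cte WHERE RowNum > 1;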

This is a VERY simplified example, but you should see how it could be applicable to a number of scenarios you may come up against.
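
One possible refinement, sketched under the assumption that the table has a column identifying which copy to keep (LoadDate here is hypothetical and does not appear in the example above): ordering by that column makes the surviving row predictable, and running a SELECT first shows which rows would go before anything is deleted.

```sql
-- LoadDate is a hypothetical column; substitute whatever decides which row to keep.
WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Last_Name, First_Name, Age
                              ORDER BY LoadDate DESC) AS RowNum   -- newest row gets RowNum = 1
    FROM TableName
)
SELECT *
FROM cte
WHERE RowNum > 1;   -- preview: these are the rows a DELETE would remove

-- Once the preview looks right, swap the final SELECT for:
--   DELETE FROM cte WHERE RowNum > 1;
```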

__________________________________________________________________________________________________________
How to Post to get the most: http://www.sqlservercentral.com/articles/Best+Practices/61537/

deshpandenikhils SSC Rookie Points: 36 More actions · Answer 9

Simple and concise.......Thanks.

I usually use CTEs to delete duplicate rows.

kit-1143032 SSC Enthusiast Points: 115 More actions · Answer 10

October 8, 2013 at 1:56 pm

Thank you. That will be very helpful.

gregoryheath14 Grasshopper Points: 17 More actions · Answer 11

I actually had a practical situation where I needed to do this today and it worked well, so thanks!

Mr. Kapsicum SSCertifiable Points: 6128 More actions · Answer 12

I have been using a CTE along with the ROW_NUMBER function to delete duplicate records.

Nice article. 🙂

louigopal Mr or Mrs. 500 Points: 575 More actions · Answer 13

October 9, 2013 at 2:51 am

Thanks Stefan for the nice article.

okbangas SSChampion Points: 11773 More actions · Answer 14

I agree with the others suggesting CTE, as I think the code is way more readable.

Ole Kristian Velstadbråten Bangås - Virinco - Facebook - Twitter

Concatenating Row Values in Transact-SQL

curious_sqldba SSC-Dedicated Points: 36502 More actions · Answer 15

Very nice article and good explanation. Could you please add your thoughts on performance, and on whether this is the recommended way to do it?