Removing Duplicates

  • PHYData DBA

    SSCertifiable

    Points: 7541

    Jozef Moravcik (10/8/2013)


    Hi Guys,

    I see there is a wrong DB structure (table structure). Ussualy each master table should have at least a primary key. Using proper primary key you avoid the problem...

    Jozef,

    I agree with your point that proper Database table structure will reduce the need for this code. However that does not apply to the real world. Everyday we find table structures that can not be changed in third party products databases that need to be cleaned of duplicate records.

    There are MANY databases out there that have tables that are built this way for performance and other reasons. If you do not use one of the many methods to remove the duplicates and instead update the table schema, the product could stop working, the support staff could stop working, and you as a DBA might even stop working.

    In short, knowing how to remove rows is something you do as a professional DBA. Updating the SCHEMA for databases used by live applications is something done by that applications development and support staff. 😎

  • PHYData DBA

    SSCertifiable

    Points: 7541

    Stefan Krzywicki (10/8/2013)

    It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data or maybe this should be calculated"

    Stefan, your example table is perfect for your article. So perfect that if you Google "Removing Duplicate RowNumber()" similar table structure is in most of the other articles on how to do this. 😎

  • lptech

    Hall of Fame

    Points: 3188

    My current shop gets a lot of data from the mainframe, using a variety of methods to load the data into SQL Server. As such, we have many table without primary keys. Since there are occasionally situations where duplicates slip past the existing processing, with article is extremely useful.

  • kit-1143032

    SSC Enthusiast

    Points: 115

    I am wondering if anyone has used this in a CTE? I can't delete the duplicates from my actual tables but I would like to eliminate them within my stored procedure.

    Thanks,

    Kate

  • peterzeke

    SSCrazy Eights

    Points: 8998

    Very good article, Stefan.

    Found an odd thing -- Your article displays the date of "2013/10/08" next to your name, but when selecting the "printable" version the date shows "2013/09/23"?

    -- Pete

  • Sioban Krzywicki

    One Orange Chip

    Points: 27770

    kit-1143032 (10/8/2013)


    I am wondering if anyone has used this in a CTE? I can't delete the duplicates from my actual tables but I would like to eliminate them within my stored procedure.

    Thanks,

    Kate

    You should be able to use this query in your stored procedure as it is. If you use a temp table to stage the data, you can remove the duplicates from the temp table.

    Give it a try in a CTE, it should work just fine.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Greg Edwards-268690

    SSC-Insane

    Points: 20564

    Stefan Krzywicki (10/8/2013)


    Greg Edwards-268690 (10/8/2013)


    I can see the need to understand how to do this, although I tend to go back to the basics.

    If you find yourself needing to do this, question if it be designed into the the table in the first place to prevent this?

    After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.

    I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.

    If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.

    And the changing job title also drives me towards separating out to different tables and having effectivity dates.

    But that is way beyond your intended scope.

    Just trying to spark a thought or two, not to make a big deal about any of this.

    It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data or maybe this should be calculated" then I reminded myself that was outside the scope of this article and I just needed something to use as an example. : -)

    Same thing with table design, there are situations where you have no way to prevent this ahead of time, whether it is because the duplicates are in the data coming in or because the table is already in production and the powers that be won't approve a structural change. I wanted to keep the focus on this one task as dealing with every possibility would make the article far longer.

    I thought you would have those mixed thoughts and just kept it simple. 🙂

    Part of my comment was driven by how often we can see this poor design, something we shouldn't have to run into, but in reality do many times. So we have to deal with it.

    Kind of hit a nerve also based on my toughts of Select DISTINCT, which seemed to be common for user queries where I used to work.

    Especially when I knew there was no need for this, they were just missing some file joins, and inviting performance and data issues unknowingly.

    But good article for the limited scope.

  • LinksUp

    SSCertifiable

    Points: 6370

    kit-1143032 (10/8/2013)


    I am wondering if anyone has used this in a CTE?

    Absolutely! Using Row_Number with a CTE is a great combination.

    ;with cte as

    (

    Select *,

    Row_Number() over (Partition By Last_Name, First_Name, Age Order by Last_name) RowNum

    From TableName

    )

    Delete From cte Where RowNum > 1

    This is a VERY simplified example, but you should see how it could be applicable to a number of scenarios you may come up against.

    __________________________________________________________________________________________________________
    How to Post to get the most: http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • deshpandenikhils

    SSC Rookie

    Points: 36

    Simple and concise.......Thanks.

    I usually use CTEs to delete duplicate rows.

  • kit-1143032

    SSC Enthusiast

    Points: 115

    Thank you. That will be very helpful.

  • gregoryheath14

    Grasshopper

    Points: 17

    I actually had a practical situation where i needed to do this today and..it worked well, so thanks!

  • Mr. Kapsicum

    SSCertifiable

    Points: 6128

    have been using CTE along with Row_number function to delete duplicate records.

    nice article. 🙂

  • louigopal

    Mr or Mrs. 500

    Points: 575

    Thanks Stefan for the nice article.

  • okbangas

    SSChampion

    Points: 11773

    I agree with the others suggesting CTE, as I think the code is way more readable.



    Ole Kristian Velstadbråten Bangås - Virinco - Facebook - Twitter

    Concatenating Row Values in Transact-SQL[/url]

  • curious_sqldba

    SSC-Dedicated

    Points: 36295

    Very nice article and good explanation. Can you please add your thoughts related to performance and is this the recommended way to do it?

Viewing 15 posts - 16 through 30 (of 53 total)

You must be logged in to reply to this topic. Login to reply