Removing Duplicates

  • Sioban Krzywicki (1/23/2015)


    Sure! That's a way you can do it. Want to post some code to share with people? I find that knowing multiple ways of doing things is nice, and it can help if one way runs into trouble.

    For example, sometimes CTEs can run long on large datasets.

    There is an example on this thread!

    http://www.sqlservercentral.com/Forums/FindPost1502706.aspx

    __________________________________________________________________________________________________________
    How to Post to get the most: http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • How do you delete a duplicate from a Heap?

    If the table is a true heap with no PK and the rows match in every column, you would have to delete all the rows and then insert the values back only once.

  • PHYData DBA (1/23/2015)


    How do you delete a duplicate from a Heap?

    If the table is a true heap with no PK and the rows match in every column, you would have to delete all the rows and then insert the values back only once.

    This method works just fine on a heap. You may notice I don't bother declaring any index on the table in the example, because this method doesn't rely on a key; you PARTITION BY all the columns that might have duplicates. The example actually shows this very thing.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
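The linked thread has the actual T-SQL; as a rough illustration of the PARTITION BY idea, here is a small Python/SQLite sketch (window functions need SQLite 3.25+). The table and column names are invented for the sketch, not taken from the thread's example, and in SQL Server you would typically delete through a CTE over ROW_NUMBER(), whereas this version routes through SQLite's rowid.

```python
# Sketch of the ROW_NUMBER()/PARTITION BY de-dupe on a keyless table.
# Uses SQLite purely for illustration; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE heap_demo (col1 TEXT, col2 INT)")  # no PK: a "heap"
cur.executemany("INSERT INTO heap_demo VALUES (?, ?)",
                [("a", 1), ("a", 1), ("a", 1), ("b", 2), ("b", 2), ("c", 3)])

# Number each row within its group of identical values; every row
# after the first in a group is a duplicate and gets deleted.
cur.execute("""
    DELETE FROM heap_demo
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2
                                      ORDER BY col1) AS rn
            FROM heap_demo
        )
        WHERE rn > 1
    )
""")
conn.commit()

rows = cur.execute("SELECT col1, col2 FROM heap_demo ORDER BY col1").fetchall()
print(rows)  # one copy of each: [('a', 1), ('b', 2), ('c', 3)]
```

Because the partition covers every column, no key is needed: rows that are identical across all partition columns land in the same group and only one survives.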

  • Nice clear example Sioban. Thanks. I frequently come across clients who have unintended duplicates in their tables. Data cleansing is a major part of the ETL process in some organisations.

    Rgds, Dave.

    --------------------------------------

    You must come with me, or you'll be late.

    Late for what?

    Late as in the late Arthur Dent. -- Slartibartfast

    --------------------------------------

  • My way of working:

    [font="Courier New"]
    BEGIN TRAN

    SELECT DISTINCT *
    INTO #tempTable
    FROM myTable

    TRUNCATE TABLE myTable

    INSERT INTO myTable
    SELECT *
    FROM #tempTable

    COMMIT TRAN

    DROP TABLE #tempTable
    [/font]

  • hans.reidt (1/28/2015)


    My way of working:

    [font="Courier New"]
    BEGIN TRAN

    SELECT DISTINCT *
    INTO #tempTable
    FROM myTable

    TRUNCATE TABLE myTable

    INSERT INTO myTable
    SELECT *
    FROM #tempTable

    COMMIT TRAN

    DROP TABLE #tempTable
    [/font]

    That certainly is a way you can do it, though it might take longer or cause memory problems on a large table, depending on how big your server is. And it gets a little trickier if, say, you only want to check 4 of 20 columns for duplicates.

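To make the "4 of 20 columns" point concrete: SELECT DISTINCT * stops working once only some columns define a duplicate, but partitioning by just those key columns still does. A minimal Python/SQLite sketch, with invented table and column names (the PARTITION BY form in SQL Server would use a deletable CTE rather than SQLite's rowid):

```python
# Why DISTINCT fails for partial-column duplicates: partition by only
# the key columns and keep one row per group. Names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (k1 TEXT, k2 TEXT, other TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [("a", "x", "first"), ("a", "x", "second"),  # dupes on (k1, k2)
                 ("b", "y", "only")])

# SELECT DISTINCT * would keep both 'a'/'x' rows because 'other' differs;
# partitioning by the key columns alone removes the extra row.
cur.execute("""
    DELETE FROM t
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (PARTITION BY k1, k2
                                      ORDER BY rowid) AS rn
            FROM t
        )
        WHERE rn > 1
    )
""")
rows = cur.execute("SELECT k1, k2, other FROM t ORDER BY k1").fetchall()
print(rows)  # [('a', 'x', 'first'), ('b', 'y', 'only')]
```

The ORDER BY inside the window decides which of the duplicates survives; here it is the first row inserted.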

  • I do this on tables with more than 1,000,000 records.

    Something to keep in mind of course, is the number / complexity of indexes.

    A possibility is to disable all indexes, remove the duplicates and rebuild the indexes.

    Of course, when you have duplicates in your data, the problem is usually the import.

    That I try to solve with a LEFT OUTER JOIN on the unique columns.

    This way I get idempotence, which is nice.
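The idempotent-import idea described above can be sketched as follows: LEFT OUTER JOIN the incoming rows to the target on the unique columns and insert only the rows with no match, so re-running the import adds nothing. This is a Python/SQLite illustration with invented table names, not the poster's actual code:

```python
# Idempotent import: insert only staging rows that don't already exist
# in the target. Table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (k TEXT, val INT)")
cur.execute("CREATE TABLE staging (k TEXT, val INT)")
cur.execute("INSERT INTO target VALUES ('a', 1)")
cur.executemany("INSERT INTO staging VALUES (?, ?)", [("a", 1), ("b", 2)])

insert_missing = """
    INSERT INTO target (k, val)
    SELECT s.k, s.val
    FROM staging AS s
    LEFT OUTER JOIN target AS t ON t.k = s.k
    WHERE t.k IS NULL
"""
cur.execute(insert_missing)   # adds only ('b', 2)
cur.execute(insert_missing)   # running it again adds nothing: idempotent
rows = cur.execute("SELECT k, val FROM target ORDER BY k").fetchall()
print(rows)  # [('a', 1), ('b', 2)]
```

The WHERE t.k IS NULL filter is what turns the outer join into an anti-join: matched rows are excluded, so duplicates never enter the target in the first place.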

  • hans.reidt (1/28/2015)


    I do this on tables with more than 1,000,000 records.

    Something to keep in mind of course, is the number / complexity of indexes.

    A possibility is to disable all indexes, remove the duplicates and rebuild the indexes.

    Of course, when you have duplicates in your data, the problem is usually the import.

    That I try to solve with a LEFT OUTER JOIN on the unique columns.

    This way I get idempotence, which is nice.

    Sure, I'm not saying it can't work. I'm saying that depending on your server's power and resources, it might not be a great option, especially when you get to really big tables: 10 million, 100 million rows or more. Or if you're doing this on a data import. And I don't think I'd want that code in any automated deduping procedure.

