Dedup Large Table

  • I have a very wide table (about 46 columns) with approximately 328 million rows. We are trying to dedup the table but run into out-of-memory errors.
    I have tried a couple of techniques and really want to avoid using a cursor. What would be the best way to dedup these tables? (This one is the largest in size and presents the largest problem.)

    I would like to do this somewhat dynamically. We want to use some parameters to loop through the table and dedup it that way. There are approximately 928 unique DataSet values (unique identifiers for batching) that sit in the middle of the table as a FK, and we can use these to loop through the datasets. A rough sketch of what I have in mind is at the end of this post.
    Does anyone know of any examples I could look at, or can you offer any suggestions?

    CTE?, SSIS?  Thoughts?
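
    To make the idea concrete, something like the following is roughly what I'm picturing. This is only an untested sketch with placeholder names (dbo.BigTable, DataSetID, and col1/col2/col3 standing in for whichever columns define a duplicate):

    -- Dedup one DataSet batch at a time to limit memory and log pressure.
    DECLARE @Sets TABLE (DataSetID int PRIMARY KEY);

    INSERT INTO @Sets (DataSetID)
    SELECT DISTINCT DataSetID
    FROM dbo.BigTable;

    DECLARE @CurrentSet int;

    WHILE EXISTS (SELECT 1 FROM @Sets)
    BEGIN
        SELECT TOP (1) @CurrentSet = DataSetID FROM @Sets ORDER BY DataSetID;

        -- Remove duplicates within this one DataSet only.
        WITH dupes AS
        (
            SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                      ORDER BY (SELECT NULL)) AS rn
            FROM dbo.BigTable
            WHERE DataSetID = @CurrentSet
        )
        DELETE FROM dupes
        WHERE rn > 1;

        DELETE FROM @Sets WHERE DataSetID = @CurrentSet;
    END;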

  • SELECT DISTINCT *
    INTO #TempTable
    FROM OriginalTable

    TRUNCATE TABLE OriginalTable

    INSERT INTO OriginalTable
    SELECT * FROM #TempTable

    Not knowing the structure of the table, you may need to break this into pieces, or use a third staging table.
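
    For example, a permanent staging table version might look roughly like this (dbo.Staging_OriginalTable is just a placeholder name, and this assumes you have room for a second full copy of the data):

    SELECT DISTINCT *
    INTO dbo.Staging_OriginalTable   -- permanent staging table instead of #TempTable
    FROM dbo.OriginalTable;

    TRUNCATE TABLE dbo.OriginalTable;

    INSERT INTO dbo.OriginalTable WITH (TABLOCK)   -- TABLOCK can allow minimal logging under simple/bulk-logged recovery
    SELECT *
    FROM dbo.Staging_OriginalTable;

    DROP TABLE dbo.Staging_OriginalTable;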

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Wednesday, June 28, 2017 8:34 AM

    SELECT DISTINCT *
    INTO #TempTable
    FROM OriginalTable

    TRUNCATE TABLE OriginalTable

    INSERT INTO OriginalTable
    SELECT * FROM #TempTable

    Not knowing the structure of the table, you may need to break this into pieces, or use a third staging table.

    Thanks, I'll give that a try.

  • Stephen crocker - Wednesday, June 28, 2017 8:38 AM

    Michael L John - Wednesday, June 28, 2017 8:34 AM

    SELECT DISTINCT *
    INTO #TempTable
    FROM OriginalTable

    TRUNCATE TABLE OriginalTable

    INSERT INTO OriginalTable
    SELECT * FROM #TempTable

    Not knowing the structure of the table, you may need to break this into pieces, or use a third staging table.

    Thanks, I'll give that a try.

    Just be sure you have enough disk space for the tempdb system database, as this method will use as much disk space in tempdb as the existing table occupies, and at 328 million rows, that could be an awful lot.  A row size in the original table of just 1,000 bytes would require 328 GIGABYTES !!!
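
    To get a rough feel for how much space that copy would need, you can check the table's current footprint first (the table name here is just a placeholder):

    -- Reports reserved, data, and index space for the table.
    EXEC sp_spaceused N'dbo.OriginalTable';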

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Wouldn't it be better to calculate a checksum or binary checksum over the list of columns to test instead?
    The space used would be substantially smaller, right? Then you could use a temp table with those values, and a CTE with a ROW_NUMBER to delete the dupes (a rough sketch follows the query below).

    SELECT CHECKSUM(col1,col2,col46)
    FROM MyTable
    GROUP BY  CHECKSUM(col1,col2,col46)
    HAVING COUNT(*) > 1
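
    As a rough sketch of that idea (MyTable and the column list are placeholders, and as noted later in the thread, a checksum match only short-lists candidates, so the delete still partitions on the real columns):

    -- 1) Collect the checksum values that occur more than once.
    SELECT CHECKSUM(col1, col2, col46) AS cs
    INTO #DupeChecksums
    FROM MyTable
    GROUP BY CHECKSUM(col1, col2, col46)
    HAVING COUNT(*) > 1;

    -- 2) Delete all but one row in each duplicate group, but only among
    --    rows whose checksum appears in the candidate list.
    WITH candidates AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col46
                                  ORDER BY (SELECT NULL)) AS rn
        FROM MyTable
        WHERE CHECKSUM(col1, col2, col46) IN (SELECT cs FROM #DupeChecksums)
    )
    DELETE FROM candidates
    WHERE rn > 1;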

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell - Wednesday, June 28, 2017 1:35 PM

    Wouldn't it be better to calculate a checksum or binary checksum over the list of columns to test instead?
    The space used would be substantially smaller, right? Then you could use a temp table with those values, and a CTE with a ROW_NUMBER to delete the dupes.

    SELECT CHECKSUM(col1,col2,col46)
    FROM MyTable
    GROUP BY  CHECKSUM(col1,col2,col46)
    HAVING COUNT(*) > 1

    Lowell makes a good point.  I like the ROW_NUMBER approach myself.

    Another thing to keep in mind is the size of the log file.  The INSERTs in Mike's approach and the DELETEs in Lowell's approach are all fully logged operations.  If it were me, I'd figure out how many rows you're going to keep versus delete and let that influence the decision.
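
    For instance, a quick way to estimate how many rows a dedupe would remove (col1, col2, col3 stand in for whichever columns define a duplicate):

    -- Every row beyond the first in each duplicate group would be deleted.
    SELECT SUM(cnt - 1) AS RowsToDelete
    FROM (
        SELECT COUNT(*) AS cnt
        FROM dbo.YourLargeTable
        GROUP BY col1, col2, col3
        HAVING COUNT(*) > 1
    ) AS d;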

  • Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate.  It should only be used for determining what COULD be duplicates and what are definitely not.  For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates. 

    A short but pointed example follows... are they duplicates or not?
    SELECT  CS1 = CHECKSUM('A352KD')
            ,CS2 = CHECKSUM('A352NT')
    ;

    Results... CHECKSUM thinks they're duplicates...
    CS1         CS2
    ----------- -----------
    141500177   141500177
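
    As a rough illustration of that secondary check (table, key, and column names are placeholders), the checksum only narrows the candidates; the column-by-column comparison is what actually confirms a duplicate:

    SELECT a.KeyID AS KeyID_A, b.KeyID AS KeyID_B
    FROM dbo.MyTable AS a
    JOIN dbo.MyTable AS b
      ON  CHECKSUM(a.col1, a.col2) = CHECKSUM(b.col1, b.col2)  -- cheap pre-filter
      AND a.col1 = b.col1                                      -- brute-force confirmation
      AND a.col2 = b.col2
      AND a.KeyID < b.KeyID;                                   -- avoid self-matches and mirrored pairs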

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Friday, June 30, 2017 6:48 PM

    Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate.  It should only be used for determining what COULD be duplicates and what are definitely not.  For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates. 

    A short but pointed example follows... are they duplicates or not?
    SELECT  CS1 = CHECKSUM('A352KD')
            ,CS2 = CHECKSUM('A352NT')
    ;

    Results... CHECKSUM thinks they're duplicates...
    CS1         CS2
    ----------- -----------
    141500177   141500177

    hmmm...what version of SQL you running Jeff?
    here's my results

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day


  • WITH Rows AS
    (
         SELECT *, ROW_NUMBER() OVER (PARTITION BY BINARY_CHECKSUM(*) ORDER BY BINARY_CHECKSUM(*)) AS [Count]
         FROM dbo.YourTable   -- your table name here; "Table" itself is a reserved word
    )
    DELETE FROM Rows WHERE [Count] > 1;

    As was already mentioned, CHECKSUM and BINARY_CHECKSUM can have collisions and thus do not give 100% uniqueness. You can alternatively use HASHBYTES(), which will be more accurate and has been advertised as collision-free, but you still risk the issue of trust and there is an 8,000-byte input limitation (see the sketch at the end of this post). I've used both in the past and they've worked fine on both smaller and bigger tables. However, the trust issue is the big one that turns away most DBAs, and I don't think it's 100% unique either.

    Alternatively, you can do something similar but partition by your actual fields, because that essentially groups by every field and counts it. Then you just DELETE in the same fashion with the CTE. This will be more accurate than hashing.
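
    As a rough sketch of the HASHBYTES() variant mentioned above (table and column names are placeholders; the delimiter in CONCAT keeps adjacent values from running together, and on versions before SQL Server 2016 the concatenated input must stay under 8,000 bytes):

    WITH Hashed AS
    (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY HASHBYTES('SHA2_256', CONCAT(col1, '|', col2, '|', col3))
                   ORDER BY (SELECT NULL)) AS rn
        FROM dbo.YourTable
    )
    -- NOTE: this trusts the hash; a collision would delete a row that isn't really a duplicate.
    DELETE FROM Hashed
    WHERE rn > 1;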

  • J Livingston SQL - Saturday, July 1, 2017 1:51 AM

    Jeff Moden - Friday, June 30, 2017 6:48 PM

    Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate.  It should only be used for determining what COULD be duplicates and what are definitely not.  For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates. 

    A short but pointed example follows... are they duplicates or not?
    SELECT  CS1 = CHECKSUM('A352KD')
            ,CS2 = CHECKSUM('A352NT')
    ;

    Results... CHECKSUM thinks they're duplicates...
    CS1         CS2
    ----------- -----------
    141500177   141500177

    hmmm...what version of SQL you running Jeff?
    here's my results

    2008 Dev and 2012 Enterprise.  Glad to see they finally did something to move away from a simple Byte Level XOR in 2016 but I hope they didn't make it perform worse.  Checksum has become a staple for me in improving the performance of constrained random data generation.
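
    As a generic illustration of that technique (not tied to the OP's table), constrained random integers are usually generated along these lines:

    -- 10 pseudo-random integers in the range 1 to 100.
    -- CHECKSUM(NEWID()) produces a different value on every row, unlike RAND().
    SELECT TOP (10)
           ABS(CHECKSUM(NEWID())) % 100 + 1 AS RandomInt
    FROM sys.all_columns;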

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Saturday, July 1, 2017 5:55 AM

    J Livingston SQL - Saturday, July 1, 2017 1:51 AM

    Jeff Moden - Friday, June 30, 2017 6:48 PM

    Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate.  It should only be used for determining what COULD be duplicates and what are definitely not.  For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates. 

    A short but pointed example follows... are they duplicates or not?
    SELECT  CS1 = CHECKSUM('A352KD')
            ,CS2 = CHECKSUM('A352NT')
    ;

    Results... CHECKSUM thinks they're duplicates...
    CS1         CS2
    ----------- -----------
    141500177   141500177

    hmmm...what version of SQL you running Jeff?
    here's my results

    2008 Dev and 2012 Enterprise.  Glad to see they finally did something to move away from a simple Byte Level XOR in 2016 but I hope they didn't make it perform worse.  Checksum has become a staple for me in improving the performance of constrained random data generation.

    would be happy to compare performance of random data generation on 2016 if you care to provide a script.

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • xsevensinzx - Saturday, July 1, 2017 3:46 AM

    However, the trust issue is the big one that turns away most DBAs, and I don't think it's 100% unique either.

    Here's proof that virtually any form of HASHBYTES can have a collision (explore the links in the article at the following URL and do other Yabingooglehoo searches on the subject):
    https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

    While new algorithms, such as SHA2-512, certainly and seriously reduce the risk of a collision, there are still no guarantees that there won't be a collision, and it could happen on your very next check for dupes.  It all depends on what level of risk you want to take and what the level of damage will be if you do happen to "hit the lottery".  If you cannot afford the risk or cannot afford for any damage to occur, then you must do a secondary "brute force" check between the compared items that are returned as being duplicated, even if you're using SHA2-512.

    Shifting gears a bit, the same holds true for the Type 4 GUIDs that MS has been using for quite some time (since the mid '90s, IIRC).  Although the number of possible values is quite large, they are NOT guaranteed to be "Globally Unique" despite their name.  They're still "just" random numbers, and collisions between different machines are possible, especially since they came out with that bloody damned NEWSEQUENTIALID.  That's one of the reasons why I store GUIDs that are used as a KEY in a column with a UNIQUE constraint (usually a PK, of course, just to add the NOT NULL part).

    Further on the subject of GUIDs, I did the calculation a long time ago to give people the idea of magnitude of how many values are available.  If 1 GUID represented 1 mile, there are enough unique values to traverse 14 Quadrillion Milky Way Galaxies.  Even with that, there is still no guarantee of uniqueness. Heh... to coin a phrase, "Even galaxies collide".
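
    As a trivial illustration of that practice (the table and column names are made up), the uniqueness is enforced by the constraint rather than assumed:

    CREATE TABLE dbo.SomeReference
    (
        SomeGuid  uniqueidentifier NOT NULL
            CONSTRAINT PK_SomeReference PRIMARY KEY,  -- PK = UNIQUE + NOT NULL; a duplicate GUID fails loudly
        SomeValue varchar(50) NOT NULL
    );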

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Getting back to the original problem, there are 46 columns in the table, but how many of them are required to ensure uniqueness?  You only need to partition by the columns you want to use for the dedupe.  If you have a choice of which set of columns to use, I'd try to make the set as narrow as possible.  Consider any nonclustered indexes (NCIs) you have on the table.  If you already had a unique NCI on the table, that would be ideal.

    I know you already know the approach, but I'm suggesting that you focus on getting the query that identifies the duplicates as small as possible.

    To see what would be deleted:

    WITH cte AS (
      SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY ID)
        FROM dbo.YourLargeTable
    )
    SELECT *
      FROM cte
      WHERE RN > 1;

    To do the actual delete, simply change the SELECT * to a DELETE.
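
    In other words, something like this (same placeholder column names):

    WITH cte AS (
      SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY ID)
        FROM dbo.YourLargeTable
    )
    DELETE FROM cte
      WHERE RN > 1;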

    I hope this helps.

  • Jeff Moden - Saturday, July 1, 2017 7:09 AM

    xsevensinzx - Saturday, July 1, 2017 3:46 AM

    However, the trust issue is the big one that turns away most DBAs, and I don't think it's 100% unique either.

    Here's proof that virtually any form of HASHBYTES can have a collision (explore the links in the article at the following URL and do other Yabingooglehoo searches on the subject):
    https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

    While new algorithms, such as SHA2-512, certainly and seriously reduce the risk of a collision, there are still no guarantees that there won't be a collision, and it could happen on your very next check for dupes.  It all depends on what level of risk you want to take and what the level of damage will be if you do happen to "hit the lottery".  If you cannot afford the risk or cannot afford for any damage to occur, then you must do a secondary "brute force" check between the compared items that are returned as being duplicated, even if you're using SHA2-512.

    Shifting gears a bit, the same holds true for the Type 4 GUIDs that MS has been using for quite some time (since the mid '90s, IIRC).  Although the number of possible values is quite large, they are NOT guaranteed to be "Globally Unique" despite their name.  They're still "just" random numbers, and collisions between different machines are possible, especially since they came out with that bloody damned NEWSEQUENTIALID.  That's one of the reasons why I store GUIDs that are used as a KEY in a column with a UNIQUE constraint (usually a PK, of course, just to add the NOT NULL part).

    Further on the subject of GUIDs, I did the calculation a long time ago to give people the idea of magnitude of how many values are available.  If 1 GUID represented 1 mile, there are enough unique values to traverse 14 Quadrillion Milky Way Galaxies.  Even with that, there is still no guarantee of uniqueness. Heh... to coin a phrase, "Even galaxies collide".

    Yep, pretty much. I'm not saying it was unique, but others have sold it as collision-free. This is why it's best to just use ROW_NUMBER() and partition over every column versus hashing it.

    The reason I mentioned the "trust" piece was that a lot of people don't understand how the hashing works, and even if new algorithms come out, I'm sure there will still be trust issues among the community.

  • xsevensinzx - Saturday, July 1, 2017 1:14 PM

    Yep, pretty much. I'm not saying it was unique, but others have sold it as collision-free. This is why it's best to just use ROW_NUMBER() and partition over every column versus hashing it.

    The reason I mentioned the "trust" piece was that a lot of people don't understand how the hashing works, and even if new algorithms come out, I'm sure there will still be trust issues among the community.

    Ah... just to clarify.  I wasn't saying anything contrary to what you said, and I hope you didn't take it that way.  I absolutely agree with you and just used your good post to springboard into a rant about how many people continue to shoot themselves in the head by making assumptions. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
