I need to deduplicate an 800 million row table (100 million rows have duplicates)

• What would be the best way, using T-SQL, to DEDUPLICATE this 800 million-row table? It has almost 100 million FULL ROW duplicates (no PK yet), i.e. rows with a count > 1 after grouping by all fields.

    ANY SUGGESTIONS?

    thanks!

    Likes to play Chess

• Whatever way you do it, it's not going to be super fast.

    You may want to try

SELECT DISTINCT * INTO NewTable FROM ExistingTable;

DROP TABLE ExistingTable;

EXEC sp_rename 'NewTable', 'ExistingTable';

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help

  • Ouch. An 800 million row heap?

Given that amount of data, and that performance is going to be painful, here is one idea.

    Do a select distinct into a new table. Then drop the original table. Finally rename the new table to the name of the original table. Make sure you do this in a test environment first.

    Let's assume your base table is named MyTable and has 3 columns (Col1, Col2, Col3).

select distinct Col1, Col2, Col3
into MyNewTable
from MyTable;

drop table MyTable;

EXEC sp_rename 'dbo.MyNewTable', 'MyTable';

    --EDIT--

    Seems Eugene was faster at the keyboard than I was.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • With 800M rows, I would use a bulk export/import for this. I see 2 options:

Use bcp to export a query with SELECT DISTINCT (or GROUP BY all columns), truncate the table, and bulk insert the data.

OR

Use bcp to export the whole table, truncate the table, create a PK, and bulk insert all the data, allowing the load to continue after up to 800M errors.

It might be slower than the other options (or not, depending on how the commit behaves), but you won't lose your permissions or other metadata associated with the table.
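A rough sketch of the first option (the server name, database, file path, and column list here are placeholders, not from the thread; test on a copy first and keep an eye on the transaction log):

-- 1) From a command prompt, export only the distinct rows to a native-format file:
--    bcp "SELECT DISTINCT Col1, Col2, Col3 FROM MyDB.dbo.MyTable" queryout "D:\dedupe\MyTable.dat" -n -S MyServer -T

-- 2) In T-SQL, empty the table (keeps the table, its permissions and other metadata):
TRUNCATE TABLE dbo.MyTable;

-- 3) Back at the command prompt, reload in batches so the log can keep up:
--    bcp MyDB.dbo.MyTable in "D:\dedupe\MyTable.dat" -n -S MyServer -T -b 1000000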

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (2/6/2014)


    With 800M rows, I would use a bulk export/import for this. I see 2 options:

Use bcp to export a query with SELECT DISTINCT (or GROUP BY all columns), truncate the table, and bulk insert the data.

OR

Use bcp to export the whole table, truncate the table, create a PK, and bulk insert all the data, allowing the load to continue after up to 800M errors.

It might be slower than the other options (or not, depending on how the commit behaves), but you won't lose your permissions or other metadata associated with the table.

    I guess if the table is a heap, it is unlikely to have a lot of metadata.

Scripting out and restoring identical permissions (and other metadata if needed) should not be a big problem.

It will take quite some time to get 800 million rows cleaned using any method...
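If it helps, something along these lines can generate the plain GRANT statements for the table before the swap (a quick sketch only; dbo.MyTable is a placeholder, and it ignores DENYs, column-level grants, and WITH GRANT OPTION):

SELECT 'GRANT ' + dp.permission_name
     + ' ON ' + QUOTENAME(OBJECT_SCHEMA_NAME(dp.major_id)) + '.' + QUOTENAME(OBJECT_NAME(dp.major_id))
     + ' TO ' + QUOTENAME(pr.name) + ';'
FROM sys.database_permissions AS dp
JOIN sys.database_principals AS pr
  ON pr.principal_id = dp.grantee_principal_id
WHERE dp.class = 1                               -- object-level permissions
  AND dp.major_id = OBJECT_ID('dbo.MyTable')
  AND dp.state = 'G';                            -- plain GRANTs only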

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help

  • Sean Lange (2/6/2014)


    Ouch. An 800 million row heap?

Given that amount of data, and that performance is going to be painful, here is one idea.

    Do a select distinct into a new table. Then drop the original table. Finally rename the new table to the name of the original table. Make sure you do this in a test environment first.

    Let's assume your base table is named MyTable and has 3 columns (Col1, Col2, Col3).

select distinct Col1, Col2, Col3
into MyNewTable
from MyTable;

drop table MyTable;

EXEC sp_rename 'dbo.MyNewTable', 'MyTable';

    --EDIT--

    Seems Eugene was faster at the keyboard than I was.

    I have had to do this on a number of occasions for clients. Copying the table as described here is often the fastest way of doing it.

Another option is to just select the dupes into a new table, dedupe from the new table, and so on (a rough sketch of that follows below).

But before going too far, I wonder if there is any way whatsoever to distinguish these dupes from each other. Having any single field that is different can be a huge time saver when deduping this data.
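One way that side-table approach could look, reusing the placeholder columns Col1-Col3 from above (a sketch only; the join will not match rows where any of these columns is NULL, and the DELETE will be heavily logged):

-- keep one copy of each duplicated row off to the side
SELECT Col1, Col2, Col3
INTO dbo.DupeRows
FROM dbo.MyTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1;

-- remove every copy of those rows from the base table
DELETE t
FROM dbo.MyTable AS t
JOIN dbo.DupeRows AS d
  ON d.Col1 = t.Col1 AND d.Col2 = t.Col2 AND d.Col3 = t.Col3;

-- put a single copy of each back, then clean up
INSERT INTO dbo.MyTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 FROM dbo.DupeRows;

DROP TABLE dbo.DupeRows;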

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

• PROBLEMS with your advice so far:

1 - If I use any of these methods, I will have to face RECREATING INDEXES on the new table. There are several non-clustered indexes, and each of them will probably easily take 4 or 5 hours to build.

2 - I will also have to deal with disk space (copying the table into another table and only dropping the original later requires twice as much space as the original table already occupies).

Are there any other ways you would suggest that do not involve having to drop the original table?

    Thanks.

    Likes to play Chess

  • Well, if you could answer the last part of my previous post, then there may be other possibilities that would be acceptable and not take excessive resources.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

• << I wonder if there is any way whatsoever to distinguish these dupes from each other. >> Well, it is NOT KNOWN.

It will take a while until we complete the study of how that can be addressed; we are waiting for a lot of input, meetings with the business users, etc.

It will be a while before that question can be answered: weeks, if not months. That is why I want to deduplicate the table and create another non-clustered index -- the least I can do before any design/re-design is possible.

    Likes to play Chess

  • VoldemarG (2/6/2014)


<< I wonder if there is any way whatsoever to distinguish these dupes from each other. >> Well, it is NOT KNOWN.

It will take a while until we complete the study of how that can be addressed; we are waiting for a lot of input, meetings with the business users, etc.

It will be a while before that question can be answered: weeks, if not months. That is why I want to deduplicate the table and create another non-clustered index -- the least I can do before any design/re-design is possible.

    If the assessment is not complete and it is unknown whether there is a way to distinguish the dupes from one another, I would not even venture anywhere close to deduping the data.

If there were a single attribute in the table that could help distinguish one dupe record from its counterpart, then that attribute could easily be used to help create a faster process.

    The faster process could involve indexes.

It could involve creating a new table to dump just the dupes into, then deleting from the master table based on the attribute that has been identified.

It could also be used to delete the records in place in the table more quickly.

    But for now, complete the data analysis and find a way to help distinguish these dupes.
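Purely as an illustration of that point (the RowId column here is hypothetical, since no such distinguishing attribute has been identified yet, and Col1-Col3 remain the placeholder columns), an in-place delete could then look like:

-- delete every copy except the one with the lowest RowId in each duplicated group
DELETE t
FROM dbo.MyTable AS t
JOIN (SELECT Col1, Col2, Col3, MIN(RowId) AS KeepId
      FROM dbo.MyTable
      GROUP BY Col1, Col2, Col3
      HAVING COUNT(*) > 1) AS k
  ON k.Col1 = t.Col1 AND k.Col2 = t.Col2 AND k.Col3 = t.Col3
WHERE t.RowId > k.KeepId;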

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • I will go ahead and give you one possible solution that I have been holding back on because I feel it would be a huge resource hog with the number of records we are looking at here.

    I would rather get more information and have you use a far more efficient means - once the data is proven to be sound and understood.

    That said, you have this option for deduping (which may or may not severely hamper resources due to the number of dupes).

    http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/
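The general shape of that CTE technique (this is not the article's exact code; Col1-Col3 are still placeholder columns, and with roughly 100 million extra rows the delete will be long and heavily logged) is:

WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER
           (PARTITION BY Col1, Col2, Col3
            ORDER BY (SELECT NULL)) AS rn
    FROM dbo.MyTable
)
DELETE FROM Dupes
WHERE rn > 1;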

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
