Deleting data in a large table

  • I have a large table with 17M+ rows for our package record audits.  I've decided to clear out anything older than 3 months, which would clear all but 1.2M records.  The table doesn't currently have any indexes.  What's the recommended way to clear the records from the table?  Should I consider adding an index to make clearing it faster?

  • If you're going to remove 25% or more of the data from a HEAP or a table with a Clustered Index present, the best thing to do all the way around is to create an empty copy of the table with only the Clustered Index (or nothing, if you want it to be a HEAP) and do a minimally logged insert of only what you want to keep into the new table.  If that works, then rename the tables so the original table gets _Old appended to its object name and the new table takes the original name.

    Then, add the Non-Clustered indexes and any keys back to the new table.

    If you have scratch and tmp schemas, you can even change which schema the Primary Key constraint is in so that you don't have to violate any naming conventions there (constraints must be uniquely named within a given schema, just like other objects; non-clustered indexes are an exception because they're subordinate to the table or heap they belong to).

    A lot of people have different names for this type of thing... I just refer to it as a "Swap'n'Drop".
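
    For reference, here's a rough sketch of the "Swap'n'Drop" in T-SQL.  The table and column names (dbo.PackageAudit, AuditDate, AuditID) are only placeholders for illustration; adjust them to match the real table:

    -- 1) Create an empty copy of the table with only the Clustered Index.
    CREATE TABLE dbo.PackageAudit_New
    (
        AuditID   INT           NOT NULL,
        AuditDate DATETIME      NOT NULL,
        AuditData VARCHAR(8000)     NULL
    );

    CREATE UNIQUE CLUSTERED INDEX CI_PackageAudit_New
        ON dbo.PackageAudit_New (AuditDate, AuditID);

    -- 2) Minimally logged insert of only the rows you want to keep
    --    (needs TABLOCK and the BULK_LOGGED or SIMPLE Recovery Model).
    INSERT INTO dbo.PackageAudit_New WITH (TABLOCK)
            (AuditID, AuditDate, AuditData)
    SELECT  AuditID, AuditDate, AuditData
      FROM  dbo.PackageAudit
     WHERE  AuditDate >= DATEADD(MONTH, -3, GETDATE());

    -- 3) Swap the names, then add the Non-Clustered indexes and keys back
    --    and drop the _Old table once the row counts have been verified.
    EXEC sp_rename 'dbo.PackageAudit',     'PackageAudit_Old';
    EXEC sp_rename 'dbo.PackageAudit_New', 'PackageAudit';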

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I also read about using 'SWITCH' to map the data to the backup table, then insert the records back into the original table.   Once the data dump is done, I'll still have over 1M records in the original table.  Would you suggest an index to help with performance of future daily table cleanup?   We rarely ever query the data.

  • Jackie Lowery wrote:

    I also read about using 'SWITCH' to map the data to the backup table, then insert the records back into the original table. Once the data dump is done, I'll still have over 1M records in the original table. Would you suggest an index to help with performance of future daily table cleanup? We rarely ever query the data.

    SWITCH only works on either a "whole table" or a partition of a Partitioned Table, and you'd have to make an empty copy of the table to do the SWITCH as well.  I'm thinking that's not an option here (which is why I didn't originally recommend it): adding the Clustered Index needed to partition the table would require the space to hold an additional copy of the table until the Clustered Index is done, and you'd then be left with a huge amount of empty space where the original table (heap) used to be.

    So take a log file backup just before you switch to the BULK_LOGGED Recovery Model and, when you're done with the "Swap'n'Drop", switch back to the FULL Recovery Model and take another log file backup.  That limits the window where point-in-time restores aren't possible because a minimally logged bulk operation has occurred.
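
    A minimal sketch of that Recovery Model dance (the database name and backup path are made up for illustration):

    -- Log backup just before the minimally logged work.
    BACKUP LOG PackageDB TO DISK = N'X:\Backups\PackageDB_BeforeSwap.trn';

    -- Switch to BULK_LOGGED for the minimally logged insert.
    ALTER DATABASE PackageDB SET RECOVERY BULK_LOGGED;

    -- ... do the "Swap'n'Drop" insert here ...

    -- Switch back to FULL and close the no-point-in-time-restore window.
    ALTER DATABASE PackageDB SET RECOVERY FULL;
    BACKUP LOG PackageDB TO DISK = N'X:\Backups\PackageDB_AfterSwap.trn';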

     

    --Jeff Moden



  • As for the old table, you were going to delete all but a million rows.  When you're sure the copy worked correctly, just drop the old table.

    Also, decide the maximum amount of time you want to keep data in this table.  Add an index on the "Date_Created" or other temporal column, using the same "keys" you used when deciding what the cutoff for the million rows was.  That should be a Clustered Index for this history table, and try to make it unique, ideally as the Primary Key.  It'll help with the rare lookups, as well.

    Then do nightly "rolloffs" from the logical and, hopefully, physical beginning of the table to delete any rows that have exceeded the desired "keep it until" age.
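
    As an illustration only (the table and column names are assumed, not confirmed), that Clustered Primary Key could look like this, with the temporal column leading and an ID column as the tiebreaker to make it unique:

    -- Both columns must be NOT NULL to take part in a Primary Key.
    ALTER TABLE dbo.PackageAudit
      ADD CONSTRAINT PK_PackageAudit
          PRIMARY KEY CLUSTERED (AuditDate, AuditID);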

     

    --Jeff Moden



  • Thanks a lot.  I can make a plan from that.

  • Why not create a NONCLUSTERED index on the datetime column and "walk through the table in small batches"?

    Small chunks of data prevent LOCK ESCALATION, and you can run it smoothly in the background, or you can create a job that cleans up your data outside of business hours:

    DECLARE @row_count INT = 1;
    DECLARE @batch_size INT = 3000;

    -- Delete in small batches until no qualifying rows remain; batches of
    -- a few thousand rows stay below the lock escalation threshold.
    WHILE @row_count > 0
    BEGIN
        DELETE TOP (@batch_size)
          FROM dbo.mytable
         WHERE datetime_column < DATEADD(MONTH, -3, GETDATE());

        SET @row_count = @@ROWCOUNT;
    END
    GO

    Microsoft Certified Master: SQL Server 2008
    MVP - Data Platform (2013 - ...)
    my blog: http://www.sqlmaster.de (german only!)

  • You’ve already got two good patterns from Jeff and Uwe, so I’d approach it as a mix of both, depending on how much downtime and extra space you can afford right now.

    Since you’re removing approximately 93% of the table and this is essentially an audit/history table that’s rarely queried, Jeff’s Swap’n’Drop is usually the cleanest one-time fix: create a new table with the same structure but with a clustered index on your datetime / created date column (ideally a unique PK), insert only the last 3 months into it (using TABLOCK and BULK_LOGGED/SIMPLE to keep logging down), then rename the tables and drop the old heap once you’ve verified the row counts. That avoids doing a massive logged delete on 16M+ rows and gives you a properly clustered “append-only” history table that will perform better for any rare lookups you do.

    For ongoing clean-up, it’s absolutely worth having an index just to support deletes, even if users rarely query this table. I’d still keep the clustered index on the datetime column, then use Uwe’s pattern as a nightly/weekly job so you never get into this situation again. Something like:

    DECLARE @row_count int = 1;
    DECLARE @batch_size int = 3000;

    WHILE @row_count > 0
    BEGIN
        DELETE TOP (@batch_size)
          FROM dbo.PackageAudit
         WHERE AuditDate < DATEADD(MONTH, -3, SYSDATETIME());

        SET @row_count = @@ROWCOUNT;
    END;

    With the datetime column indexed, those batch deletes will use an efficient seek, avoid lock escalation, and can happily run in the background or during off-hours. If, in the future, the table keeps growing and you need even faster roll-offs, you can look at partitioning by date and using SWITCH to drop whole partitions at once – but for 17M rows and a 3-month retention window, a one-time Swap’n’Drop plus indexed, batched deletes is usually more than enough.
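
    Purely as an illustration of that partition-SWITCH idea (it assumes the table has already been partitioned by month on the date column, and that a staging table with an identical structure exists on the same filegroup; all names are made up):

    -- Switch the oldest partition's rows out of the main table...
    ALTER TABLE dbo.PackageAudit
        SWITCH PARTITION 1 TO dbo.PackageAudit_Staging;

    -- ...then get rid of them almost instantly.
    TRUNCATE TABLE dbo.PackageAudit_Staging;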


  • Uwe Ricken wrote:

    Why not create a NONCLUSTERED index on the datetime column and "walk through the table in small batches"?

    Small chunks of data prevent LOCK ESCALATION, and you can run it smoothly in the background, or you can create a job that cleans up your data outside of business hours:

    DECLARE @row_count INT = 1;
    DECLARE @batch_size INT = 3000;

    WHILE @row_count > 0
    BEGIN
        DELETE TOP (@batch_size)
          FROM dbo.mytable
         WHERE datetime_column < DATEADD(MONTH, -3, GETDATE());

        SET @row_count = @@ROWCOUNT;
    END
    GO

    Thanks for the batch code.  I'll use that after the table is cleaned up.  Going to add an index on the datetime column as well.

  • I lean towards Uwe's solution, especially when I need the table to stay online and accept insert/update/delete while I'm deleting data.

  • Steve Jones - SSC Editor wrote:

    I lean towards Uwe's solution, especially when I need the table to stay online and accept insert/update/delete while I'm deleting data.

    I was thinking about this too.   One question though.   Would you recommend creating the index on the datetime column first, then running the batch?

  • Yes, indexing makes a huge difference, and the performance savings can far outweigh the index creation time/CPU.
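
    For illustration, that index could be as simple as a nonclustered index on the datetime column (names assumed from earlier in the thread, not confirmed):

    -- Lets the batched DELETE seek to the rows past the cutoff date
    -- instead of scanning the whole heap on every pass.
    CREATE NONCLUSTERED INDEX IX_PackageAudit_AuditDate
        ON dbo.PackageAudit (AuditDate);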
