Efficient way to delete historical data

  • Could it all be as simple as this: a single delete operation that leverages an output clause to insert the deleted rows into a history table? If this were scheduled to execute every hour (instead of daily or every five days), it would roll over a few hundred thousand rows at a time into the history table, and it's all contained in a single implicit transaction.

    create table PrintJobs
    (
        JobID int not null identity(1,1) primary key
        , PrintDate datetime default getdate()
    );

    create index ix_PrintDate on PrintJobs ( PrintDate );

    create table PrintJobsHistory
    (
        JobID int not null primary key
        , PrintDate datetime
    );

    delete from PrintJobs
        output deleted.JobID, deleted.PrintDate
        into PrintJobsHistory ( JobID, PrintDate )
    where datediff( day, PrintDate, getdate() ) > 5;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Or when rolling off to history hourly:

    where datediff( hour, PrintDate, getdate() ) > 120;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Just to give an update on this and to ask for some further pointers:

    Background - I'm trying to delete historical data from a number of tables which contain huge numbers of rows, in the quickest time possible and without massive log growth. Initially the deletion was performed by a C# program which deleted the data from 12 or so tables in one transaction; this grew the log file to 500 GB and blocked any other transactions whilst it was running.

    Work done so far:

    1. Replicate the C# deletion process in a procedure, but batching the deletes on each table to 10,000 rows at a time. Result: deleting from just the first table took well over 24 hours, so I stopped the process. The transaction log stayed below 1 GB.

    2. Replicate the C# deletion process in a procedure, but batching the deletes on each table to 1,000,000 rows at a time. Result: the process completed in 9 hours (getting better!). The transaction log grew to 4 GB.

    3. Tune the process. Looking at the C# program, it first got the @oldestDocumentId value and used this to delete from the other tables where the DocumentRecordId was less than this value. It did the delete using pretty much the same query on each table, as below (Table_n could be any one of the 10 or so tables):

    DELETE TOP ( 100000 ) Table_n
    FROM Table_n
        INNER JOIN DocumentResultDataItem
            ON Table_n.ResultItemRecordId = DocumentResultDataItem.ResultItemRecordId
    WHERE DocumentResultDataItem.DocumentRecordId < @oldestDocumentId;

    The above code was repeated pretty much as-is for the first 10 tables, each containing hundreds of millions of rows. I found that this query took on average around 40 seconds to run on the smallest table and up to 2 minutes on the largest. As I was now batching this process, every time it deleted a million rows it repeated the query all over again, which is why doing batches of 10,000 at a time took so long!

    I replaced this statement by doing the following: get the min and max id of the table I was deleting from using the join above, and then do the delete where the table's id was between those start and end ids. This meant I was only running the resource-intensive query once, and the deletion could use the clustered index of the table it was deleting from.

    SELECT @StartId = MIN(Table_n.ResultItemRecordId),
           @EndId = MAX(Table_n.ResultItemRecordId)
    FROM Table_n
        INNER JOIN DocumentResultDataItem
            ON Table_n.ResultItemRecordId = DocumentResultDataItem.ResultItemRecordId
    WHERE DocumentResultDataItem.DocumentRecordId < @oldestDocumentId;

    WHILE 1 = 1
    BEGIN
        DELETE TOP ( 100000 ) Table_n
        WHERE Table_n.ResultItemRecordId BETWEEN @StartId AND @EndId;

        IF @@ROWCOUNT = 0
        BEGIN
            BREAK;
        END;
    END;

    After the initial hit (40-120 seconds) of getting the min and max ids, the delete statement now ran in under a second for most tables!

    The end result was that the whole deletion procedure completed in just 3 hours! Much better than the initial 9 or 24. :-) The transaction log only grew to 4 GB too. Bonus!

    The biggest problem I have is when I come to delete from the DocumentResultDataItem table itself. Because this table is referenced by foreign keys from lots of the other tables, each delete batch of 1 million rows takes around 30 seconds to execute. There are well over 560 million rows in the production table which need to be deleted, so it's taking 5 hours plus to remove them all when running this against live!

    Looking at the execution plan for the query, it goes off and checks the foreign key tables, and the actual and estimated row counts are way out. I've attached the execution plan showing the delete query being executed. I have updated stats before executing this delete statement, but it doesn't fix anything. The attached plan has the start and end ids explicitly set, whereas the procedure I am using to run the deletion sets the variables at run time; however, the plan when running the procedure looks very similar.

    Any help on how to improve the final stage of this process would be great!

  • Since you have the start and end values, instead of just saying TOP (10000), how about deleting 10,000 keys' worth of rows at a time: pass the start value, then the value 10,000 past it. This way your range is always no more than 10,000 values. By saying TOP, and without an ORDER BY, the optimizer is still having to scan the whole data set over and over, and on all the child tables. Now, that means some of the batches will be less than 10,000 rows due to gaps, but that's OK because you're going to be telling it much more specific values to delete.
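
    For illustration only, a minimal sketch of that kind of key-range stepping might look like the following. The end key, step size, and schema name here are assumptions for the example, not values from the actual process:

    DECLARE @StartId bigint = 225392327;     -- example start key from this thread
    DECLARE @EndId   bigint = 999999999;     -- hypothetical end key
    DECLARE @Step    int    = 10000;         -- width of each key range

    DECLARE @CurrentId bigint = @StartId;
    WHILE @CurrentId <= @EndId
    BEGIN
        -- delete only the rows whose keys fall in this narrow range;
        -- no TOP and no ORDER BY, so the optimizer can seek the clustered index
        DELETE dbo.DocumentResultDataItem
        WHERE ResultItemRecordId >= @CurrentId
          AND ResultItemRecordId <  @CurrentId + @Step;

        SET @CurrentId = @CurrentId + @Step;  -- move to the next slice of keys
    END;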

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the help Grant.

    Do you mean as follows: (Obviously changing the values dynamically for each pass in the real code).

    DELETE DocumentResultDataItem
    WHERE DocumentResultDataItem.ResultItemRecordId
          BETWEEN 225392327 AND ( 225392327 + 1000000 );

    This removes the clustered index seek from the DocumentResultDataItem table and the sort, and just does an index delete as the first item. However, it still then performs a load of index seeks or scans on the other tables which reference this one. The actual and estimated rows are still way out on the child tables. There should be zero rows since the data has already been deleted earlier in the procedure.

    The delete statement takes over a minute to complete now rather than 30 - 40 seconds.

  • Maddave (4/7/2015)


    Thanks for the help Grant.

    Do you mean as follows: (Obviously changing the values dynamically for each pass in the real code).

    DELETE DocumentResultDataItem
    WHERE DocumentResultDataItem.ResultItemRecordId
          BETWEEN 225392327 AND ( 225392327 + 1000000 );

    This removes the clustered index seek from the DocumentResultDataItem table and the sort, and just does an index delete as the first item. However, it still then performs a load of index seeks or scans on the other tables which reference this one. The actual and estimated rows are still way out on the child tables. There should be zero rows since the data has already been deleted earlier in the procedure.

    The delete statement takes over a minute to complete now rather than 30 - 40 seconds.

    OK. The disparity in the stats can be explained, partially. You're deleting data. When it started, the values were higher. The plan gets compiled, then as the values drop, the plan doesn't get recompiled and the stats don't get updated, because stats only update automatically after you've modified 20% of the rows (plus 500 rows). However, after an update of the statistics, you shouldn't see radical disparity any more unless something else is affecting the estimates, such as a table variable or a multi-statement table-valued user-defined function. From what I can see, we're not dealing with that, right? Indexes on the foreign keys of those other tables might eliminate some of the scans... maybe.
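
    To check whether that automatic threshold has even been hit, something like the following shows when each statistic on the table was last updated and how many modifications it has seen since; sys.dm_db_stats_properties needs SQL Server 2008 R2 SP2 or 2012 SP1 onwards, and the table name is just the one from this thread:

    SELECT s.name,
           sp.last_updated,
           sp.rows,
           sp.rows_sampled,
           sp.modification_counter
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.object_id = OBJECT_ID('dbo.DocumentResultDataItem');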

    If you know, and I mean KNOW, that you don't have related data left in any of the child tables, you could disable the FK checks for the delete and then re-enable them using the WITH CHECK option to ensure that everything is in place. That will eliminate all of that processing from the delete.
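
    As a rough sketch of that approach (the referencing table and constraint names here are placeholders, not the actual schema):

    -- disable the FK check on one referencing table (hypothetical constraint name)
    ALTER TABLE dbo.Table_n NOCHECK CONSTRAINT FK_Table_n_DocumentResultDataItem;

    -- ... run the batched delete against DocumentResultDataItem here ...

    -- re-enable it; WITH CHECK validates the existing rows so the
    -- constraint is marked as trusted again
    ALTER TABLE dbo.Table_n WITH CHECK CHECK CONSTRAINT FK_Table_n_DocumentResultDataItem;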

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks Grant.

    I'm still not sure why the stats are being skewed. I've forced an update of the stats, but the estimated row count is still massively higher than the actual. I agree the problem is that the data has already been removed from the child tables, but I would have thought it would correct itself with an update stats. There are no inline functions or other issues with the statement, so I'm not sure what's wrong.
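
    For reference, the kind of forced update I mean is along these lines (FULLSCAN is the heaviest option; a sampled update would be the lighter alternative):

    -- rebuild all statistics on the table from a full scan rather than a sample
    UPDATE STATISTICS dbo.DocumentResultDataItem WITH FULLSCAN;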

    I've also tried (in test) dropping the FK constraints. After doing this, the execution plan for the deletion on the large table showed a simple clustered index delete; however, the duration was still around the same as when the keys were in place, so no real improvement. I think the problem is that this table has so much data (500 million plus rows) that it's simply an IO issue now.

  • I hate delete operations. I always feel deletes, especially bulk deletes, indicate something is wrong with the database model or ETL design. What I mean by that is: why insert something (or hundreds of millions of things) only to turn around and routinely delete them back out? I'd rather have the application write audit records to a delimited file and report off of that, instead of loading them into a database in the first place.

    Consider doing something like the following, which can be scheduled to run every hour, so you're gradually and steadily moving rows from the reporting table to a historical archive table.

    delete from PrintJobs
        output deleted.JobID, deleted.PrintDate, ...
        into PrintJobsHistory ( JobID, PrintDate, ... )
    where datediff( hour, PrintDate, getdate() ) > 120;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Thanks. I agree there are definitely some improvements that could be made with this design.

    Unfortunately, I cannot implement the delete by hour, well not yet anyway. The machines run for 12 hour shifts and insert performance must be kept to a maximum to not slow the machines down.

    Once I get over this initial problem of clearing out the last few days' worth of data, I think a better solution is similar to what you propose, but at the database level (roughly as sketched below). Let the shift run for the 12 hours, then once complete, rename the database to "archive_shiftName" and create a new database with an empty schema. This would then allow the next shift to run whilst reports are run on the other database. The archive database can then simply be dropped once the data has been reported on, or backed up and restored to another instance if historic storage is required.

    It would need a change to the application software and to the reports etc., so until then I will just have to let my new process run until the work is done!
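
    A minimal sketch of that shift roll-over, assuming a database called PrintShop and a shift label of Shift42 (both hypothetical names, and the rename needs exclusive access to the database):

    -- at the end of the shift, archive the current database by renaming it
    ALTER DATABASE PrintShop MODIFY NAME = archive_Shift42;

    -- then create a fresh database for the next shift and deploy the empty schema
    CREATE DATABASE PrintShop;
    GO
    -- ... run the schema creation scripts here ...

    -- once reporting on the archive is finished, drop it (or back it up first)
    -- DROP DATABASE archive_Shift42;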

  • Maddave (4/7/2015)


    Thanks. I agree there are definitely some improvements that could be made with this design.

    Unfortunately, I cannot implement the delete by hour, well not yet anyway. The machines run for 12 hour shifts and insert performance must be kept to a maximum to not slow the machines down.

    ...

    By now we've all heard of RBAR (Row By Agonizing Row) when it comes to looping and selects. However, there is also "Insert By Agonizing Insert"; that is, inserting once for each iteration of a loop in a process. This is another reason to batch up inserts in memory or a text file rather than inserting directly into a table.

    In SQL Server 2014, there is a new feature called Delayed Durability, introduced to mitigate just this type of scenario. Basically, what it does is allow asynchronous, "fire and forget" transaction log writes.
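
    For reference, enabling it looks roughly like this; the database name is a placeholder and the table is the PrintJobs example from earlier in the thread:

    -- allow delayed durability at the database level (SQL Server 2014+)
    ALTER DATABASE PrintShop SET DELAYED_DURABILITY = ALLOWED;

    -- then opt in per transaction: the commit returns before the log flush
    BEGIN TRANSACTION;
        INSERT INTO dbo.PrintJobs ( PrintDate ) VALUES ( GETDATE() );
    COMMIT TRANSACTION WITH ( DELAYED_DURABILITY = ON );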

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Hate to backtrack if it's been asked, but have you looked at the wait statistics on this? I'd be curious as to what it is waiting on. Maybe it is just a hardware issue.
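
    If it hasn't been checked yet, something along these lines gives the server-level picture (it just reads the DMV; filtering out benign idle waits is omitted for brevity):

    -- top waits accumulated since the last restart or stats clear
    SELECT TOP (10)
           wait_type,
           wait_time_ms,
           waiting_tasks_count,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;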

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • By now we've all heard of RBAR (Row By Agonizing Row) when it comes to looping and selects. However, there is also "Insert By Agonizing Insert"; that is, inserting once for each iteration of a loop in a process. This is another reason to batch up inserts in memory or a text file rather than inserting directly into a table.

    In SQL Server 2014, there is a new feature called Delayed Durability, introduced to mitigate just this type of scenario. Basically, what it does is allow asynchronous, "fire and forget" transaction log writes.

    To be fair, I don't know the exact mechanism the app uses to insert into the table, so it could well do it in batches, but I understand what you're saying. It's definitely something to check.

  • Grant Fritchey (4/7/2015)


    Hate to backtrack if it's been asked, but have you looked at the wait statistics on this? I'd be curious as to what it is waiting on. Maybe it is just a hardware issue.

    I'm not at work at the minute so cannot check, but the waits are always around page latching if I remember.

    Remember, it is running SQL Server Workgroup, so there are limitations on the memory and CPU available to the server! Makes things even more interesting.

    Ideally this system would be running Enterprise so the server can be fully loaded and the tables partitioned, but as this is a whole package the company want to sell to customers, licensing costs are an issue.

  • Thanks to what I learned from Itzik Ben Gan's Advanced TSQL course, I've recently reduced our audit trail by a hundred million records by running the following script as a SQL Agent job in the dead hours daily (after backups complete and before users start), to leave only last financial year's data. It took out about 2 million records per night.

    /* delete in chunks */
    while (datepart(hour, getdate()) >= 22) or (datepart(hour, getdate()) < 6)
    begin
        delete top (10000) from [dbo].[AUDTRAIL]
        where [MADEDATE] < convert(datetime, '01/04/2014', 103);

        if @@ROWCOUNT < 5000 break;
    end;
    GO

    I expect I could have raised the TOP value, but this worked OK.
