How to batch delete from a large table?

  • Hi,

    I have a table with about 466 million rows. It has an int column called WeeksToRetain and an EventDate column containing the date the row was inserted. I am trying to delete all the rows that have passed their retention period; for example, if the EventDate is 5/07/15 and WeeksToRetain is 1, the row should be removed by 5/14/15. I am not sure what days SQL considers the beginning and end of the week. The core issue, though, is the sheer number of deletions I must do and the log growth, so I am trying to do the delete in batches. More specifically, I want to load a temporary table with a million rows, then use it to load a sub temporary table with 100,000 rows, join that smaller table to the table I want to delete from, and loop through ten times to get through the million. I am having trouble writing this because it's pretty much my first time doing so and I'm getting confused in the logic. Another way of doing this altogether would also be welcome, but here is what I wrote so far and I'm stumped. The Logging.EventLog table, which is the table I'm trying to purge, has a clustered index on EventDate (ASC). I would like to run this in a scheduled job with enough time between executions for log backups to run. Any help is much appreciated!

    DECLARE @i int
    DECLARE @RowCount int
    DECLARE @NextBatchDate datetime

    CREATE TABLE #BatchProcess
    (
        EventDate datetime,
        ApplicationID int,
        EventTypeID int
    )

    CREATE TABLE #SubBatchProcess
    (
        EventDate datetime,
        ApplicationID int,
        EventTypeID int
    )

    -- Find any records in Logging.EventLog that are past their weeks to retain and insert them into #BatchProcess
    INSERT INTO #BatchProcess
    SELECT TOP (1000000)
        EventDate,
        ApplicationID,
        EventTypeID
        --DATEDIFF(ww, EventDate, GETDATE()) AS WeeksOld
    FROM Logging.EventLog a
    WHERE (WeeksToRetain > 0
           AND DATEADD(ww, WeeksToRetain, EventDate) < GETDATE())
       OR WeeksToRetain IS NULL
    ORDER BY EventDate DESC

    SET @RowCount = @@ROWCOUNT
    SET @i = 0

    -- Enter the loop only if rows were inserted by the previous step.
    WHILE (@RowCount > 0 AND @i < 11)
    BEGIN
        -- Select a smaller set into the #SubBatchProcess table.
        INSERT INTO #SubBatchProcess
        SELECT TOP (100000) EventDate, ApplicationID, EventTypeID
        FROM #BatchProcess b

        IF @@ROWCOUNT > 0 -- If rows were inserted, join the set to Logging.EventLog and delete that set.
        BEGIN
            DELETE le
            FROM Logging.EventLog le
            INNER JOIN #SubBatchProcess b
                ON  le.EventDate     = b.EventDate
                AND le.ApplicationID = b.ApplicationID
                AND le.EventTypeID   = b.EventTypeID

            -- Walk the table
            SET @NextBatchDate = (SELECT TOP (1) EventDate
                                  FROM #SubBatchProcess
                                  ORDER BY EventDate ASC);

            DELETE FROM #SubBatchProcess

            -- Insert the next 100,000 rows from the #BatchProcess table
            INSERT INTO #SubBatchProcess
            SELECT TOP (100000) EventDate, ApplicationID, EventTypeID
            FROM #BatchProcess
            WHERE EventDate < @NextBatchDate
            ORDER BY EventDate DESC

            SET @RowCount = @@ROWCOUNT
            SET @i += 1
        END
    END

    DROP TABLE #SubBatchProcess
    DROP TABLE #BatchProcess

  • First, how many rows are you deleting versus how many are you keeping?

    Second, have you read this article: http://www.sqlservercentral.com/articles/T-SQL/67898/

  • I haven't, I will take a look.

    I am querying for a count right now, but it will take a long time. I estimate half, so 250 Million...

  • lmacdonald (6/16/2015)


    I haven't, I will take a look.

    I am querying for a count right now, but it will take a long time. I estimate half, so 250 Million...

    In addition to the counts: can there be downtime during the purge? I assume you have a PK on the table to identify each row uniquely. And how many indexes are defined on the table?

  • There can be minimal downtime. This table is used to log application events, so it doesn't really affect end users, but there are SQL jobs that load this table that run about every hour. It would be fine if they got a few hours behind. 43 minutes in and I still have not gotten a count back. There is one clustered index on the table, on EventDate, and it's not unique. There is no PK, which is why I am joining on several columns, and also why it takes forever to query the table; there is probably a massive scan going on. I may consider building an index that will aid my delete, but with the table being this large the build will surely lock the table.

  • Is there a set of columns taken together that are unique?

  • Since this is a log table does that mean the event date is always increasing?

  • I think you're adding enormous amounts of overhead to this process. How about something like the code below? I'm not 100% sure on the logic for WeeksToRetain being NULL, so please verify that as well.

    For efficiency, you want to process the table in cluster key order. But, if finding 50,000 rows to DELETE takes too long, naturally you can adjust that number down.

    IF OBJECT_ID('tempdb.dbo.#EventDates') IS NOT NULL
        DROP TABLE #EventDates

    CREATE TABLE #EventDates (
        EventDate datetime NULL
    )

    DECLARE @RowCount int
    DECLARE @Total_Rows_Deleted int
    DECLARE @Max_Rows_To_Delete int
    DECLARE @EventDate datetime

    SET @RowCount = 0
    SET @Total_Rows_Deleted = 0
    SET @Max_Rows_To_Delete = 100000
    SET @EventDate = '19000101'

    WHILE @RowCount < @Max_Rows_To_Delete
    BEGIN
        DELETE TOP (50000) /* change # rows as needed */
        FROM el
        OUTPUT DELETED.EventDate INTO #EventDates ( EventDate )
        FROM Logging.EventLog el
        WHERE
            el.EventDate >= @EventDate AND
            ( el.WeeksToRetain IS NULL OR
              (el.WeeksToRetain > 0 AND DATEADD(DAY, el.WeeksToRetain * 7, el.EventDate) < GETDATE()) )

        SET @RowCount = @@ROWCOUNT

        IF @RowCount = 0
            BREAK;

        SET @Total_Rows_Deleted = @Total_Rows_Deleted + @RowCount

        -- Walk forward through the clustered index from the last batch deleted.
        SELECT @EventDate = MAX(EventDate)
        FROM #EventDates

        TRUNCATE TABLE #EventDates
    END --WHILE

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • Yes, the EventDate keeps increasing; new entries are added with the most recent dates. As I said, there is no unique index, only the one I already mentioned. However, we have a maintenance night coming up in two days and I think I will try to add a good index by then. We have the database in a staging environment, and that version of the table has different indexes; there I was able to select/delete data much faster. Thank you for that query! I am done for the day but I'll take a look tomorrow.

  • Don't delete rows in batches of 50,000. Anything >= 5,000 may exceed the default lock escalation threshold and take a table lock. And you don't need two temporary tables when one will do. Some advice:

    - Create an index on the temp table and use a batch separator (GO) between populating it and using it. The reason for this is so that the query optimiser can see there are 250 million rows in there instead of assuming there is only 1 row.

    - Then execute your DELETE TOP (2000) batches. I've seen others use a variable like TOP (@Batch_Size), which also makes the optimizer assume 1 row, so don't do that (not that you have).

    Assuming you are working on a clustered index key (and, if you're doing it by date, were able to extract everything required into the temp table without causing any blocking), I've seen similar code run at similarly large scale on very busy databases without any application impact (e.g. an extra few seconds to run queries, but no locking, blocking, or query timeouts). If you're doing anything out of the ordinary, though, like CDC or replication, or you run out of log space in FULL recovery, it could become more troublesome.

    Luckily we were able to test first in lower environments and find these kinks before doing the real thing.
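
    Putting those points together, here is a rough sketch of the pattern rather than a drop-in script: the WHERE clause is copied from the original post, while the #ToDelete work table, the index name, and the TOP size are only illustrative and can be tuned.

    -- One-off: extract the keys of the rows to purge into a work table.
    SELECT EventDate, ApplicationID, EventTypeID
    INTO #ToDelete
    FROM Logging.EventLog
    WHERE (WeeksToRetain > 0
           AND DATEADD(ww, WeeksToRetain, EventDate) < GETDATE())
       OR WeeksToRetain IS NULL;

    -- Index the work table, then end the batch so the optimizer
    -- sees real row counts when the DELETE below is compiled.
    CREATE CLUSTERED INDEX IX_ToDelete
        ON #ToDelete (EventDate, ApplicationID, EventTypeID);
    GO

    -- A small, literal TOP keeps each DELETE under the lock escalation threshold.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (2000) le
        FROM Logging.EventLog le
        INNER JOIN #ToDelete d
            ON  d.EventDate     = le.EventDate
            AND d.ApplicationID = le.ApplicationID
            AND d.EventTypeID   = le.EventTypeID;

        IF @@ROWCOUNT = 0 BREAK;  -- nothing left to purge

        -- Room here for a WAITFOR DELAY or a check that log backups are keeping up.
    END
    GO

    DROP TABLE #ToDelete;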

  • Here's another approach. If you are deleting 50% of the records, as mentioned in the earlier post, it could be quicker to do something like this:

    - stop the jobs writing to your table

    - create a new table, identical structure

    - copy the records you want to keep into the new table

    - drop your original table and rename the new one

    The advantage is that the new table will not be fragmented and the log does not take the hit of a large delete operation.
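
    A rough T-SQL sketch of those steps: the Logging.EventLog_New name is only illustrative, the column list is trimmed to the columns mentioned in this thread, and the real table's indexes would need to be recreated on the new table.

    -- 1. Stop the jobs that write to Logging.EventLog.

    -- 2. Create a new table with an identical structure (abbreviated here).
    CREATE TABLE Logging.EventLog_New
    (
        EventDate     datetime,
        ApplicationID int,
        EventTypeID   int,
        WeeksToRetain int
    );

    -- 3. Copy the rows to keep (the negation of the purge criteria in the original post).
    INSERT INTO Logging.EventLog_New (EventDate, ApplicationID, EventTypeID, WeeksToRetain)
    SELECT EventDate, ApplicationID, EventTypeID, WeeksToRetain
    FROM Logging.EventLog
    WHERE WeeksToRetain IS NOT NULL
      AND (WeeksToRetain <= 0
           OR DATEADD(ww, WeeksToRetain, EventDate) >= GETDATE());

    -- 4. Swap the tables, rebuild indexes on the new table, then restart the jobs.
    DROP TABLE Logging.EventLog;
    EXEC sp_rename 'Logging.EventLog_New', 'EventLog';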

  • Thanks everyone. All good tips. I was incorrect about the indexes. I'm not sure how, but I must have been looking at the wrong table or server, because there is actually a nonclustered index as well, and the clustered index contains 3 columns. I'm not sure why it's so hard to pull data from this database; even smaller select statements on the table take forever, while on the staging environment they are very fast. Anyway, you can consider this thread closed; although it's not resolved, I have enough to keep me going for a while.

  • codykonior (6/16/2015)


    And you don't need two temporary tables when one will do.

    You don't need any temporary tables at all. The temporary tables just add unnecessary overhead.

    DECLARE @Row_Batch INT = 50000; -- change # rows as needed

    WHILE @@ROWCOUNT > 0 -- you can also add criteria here to prevent it from running at peak times.
    BEGIN
        DELETE TOP (@Row_Batch)
        FROM el
        FROM Logging.EventLog el
        WHERE
            ISNULL(el.WeeksToRetain, 0) > 0
            AND DATEADD(WEEK, el.WeeksToRetain, el.EventDate) < SYSDATETIME()
    END --WHILE

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Andrei Hetel (6/16/2015)


    Here's another approach. If you are deleting 50% of the records, as mentioned in the earlier post, it could be quicker to do something like this:

    - stop the jobs writing to your table

    - create a new table, identical structure

    - copy the records you want to keep into the new table

    - drop your original table and rename the new one

    The advantage is that the new table will not be fragmented and the log does not take the hit of a large delete operation.

    You still have the log entries for the insert into the temp table.

  • Lynn Pettis (6/17/2015)


    Andrei Hetel (6/16/2015)


    Here's another approach. If you are deleting 50% of the records, as mentioned in the earlier post, it could be quicker to do something like this:

    - stop the jobs writing to your table

    - create a new table, identical structure

    - copy the records you want to keep into the new table

    - drop your original table and rename the new one

    The advantage is that the new table will not be fragmented and the log does not take the hit of a large delete operation.

    You still have the log entries for the insert into the temp table.

    Sorry, I forgot to mention the TABLOCK hint on the target table. That will reduce logging to a minimum.
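
    For illustration, the hint goes on the target of the INSERT ... SELECT (same illustrative Logging.EventLog_New table as in the sketch above). Note that minimal logging also requires the SIMPLE or BULK_LOGGED recovery model, and is simplest when the target is a heap with the indexes built afterwards.

    INSERT INTO Logging.EventLog_New WITH (TABLOCK)
           (EventDate, ApplicationID, EventTypeID, WeeksToRetain)
    SELECT EventDate, ApplicationID, EventTypeID, WeeksToRetain
    FROM Logging.EventLog
    WHERE WeeksToRetain IS NOT NULL           -- same "rows to keep" predicate as before
      AND (WeeksToRetain <= 0
           OR DATEADD(ww, WeeksToRetain, EventDate) >= GETDATE());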
