Deleting Large Number of Records

  • Comments posted to this topic are about the item Deleting Large Number of Records

  • Nice one Lynn 🙂

    I've found that for deleting a sub-section of data from a much larger table (for example, deleting history from a transactions table that is over x years old) it can help to create a filtered index for that under SS2008... You can create the index online, and it significantly reduces the seek time if you're deleting a sub-set.

    Of course, if the column is already indexed - no probs! 😀
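
    For illustration, something along these lines (a rough sketch only; the table, column, and cut-off date are made up, not from the article):

    -- Illustration only: table, column, and cut-off date are hypothetical.
    -- Filtered indexes require SQL Server 2008+; ONLINE = ON requires Enterprise Edition.
    CREATE NONCLUSTERED INDEX IX_Transactions_PurgeDate
        ON dbo.Transactions (TranDate)
        WHERE TranDate < '20060101'
        WITH (ONLINE = ON);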

    Atlantis Interactive - SQL Server Tools
    My blog
    Why I wrote a sql query analyzer clone

  • Hi Lynn,

    Good article concept, this is a problem that I've had to deal with many times, and I suspect most DBAs do at one point or another.

    A couple concerns though:

    1) Would those transaction log backup files really be manageable? If you chose to make the batch size smaller (say 1,000 instead of 10,000) to minimize duration of locks (esp. if the deletion operations are expensive, for example containing substantial BLOB columns), then you rapidly end up with 2,000 transaction log files for your 2,000,000 records... Imagine having to recover those! :ermm: I guess it's a minor point, but I suspect in many (most?) cases the transaction log backup logic should be disconnected from the deletion loop itself.

    2) Another point is the deletion mechanism - in SQL Server 2000 I have found the SET ROWCOUNT method to be quite painful on large tables. I've not looked deeply into it, but in the environments I've encountered it's MUCH more efficient to identify an indexed numeric key (not necessarily unique, as long as it's smoothly distributed), ideally with a clustered index, and loop through that instead. I imagine that there's a constant cost to evaluating the WHERE clause, even when the ROWCOUNT is only going to act on a subset of the matching records.

    Am I alone in this? Has anyone else had to come up with different ways due to performance issues with the SET ROWCOUNT method?
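
    For reference, the kind of keyed loop I mean might look something like this (a rough sketch only; the table, clustered key column, and cut-off are made up):

    -- Rough sketch only: dbo.BigTable, its clustered key Id, RecordDate, and the cut-off are hypothetical.
    DECLARE @minId int, @maxId int, @batch int, @cutoff datetime
    SET @batch = 10000
    SET @cutoff = '20050101'
    SELECT @minId = MIN(Id), @maxId = MAX(Id) FROM dbo.BigTable WHERE RecordDate < @cutoff

    WHILE @minId <= @maxId
    BEGIN
        -- Range seek on the clustered key rather than SET ROWCOUNT against the full WHERE clause
        DELETE FROM dbo.BigTable
        WHERE Id BETWEEN @minId AND @minId + @batch - 1
          AND RecordDate < @cutoff
        SET @minId = @minId + @batch
    END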

    Tao

    ---

    Edited: fixed mistake: PK -> Clustered index

    http://poorsql.com for T-SQL formatting: free as in speech, free as in beer, free to run in SSMS or on your version control server - free however you want it.

  • Matt Whitfield (9/15/2009)


    Nice one Lynn 🙂

    I've found that for deleting a sub-section of data from a much larger table (for example, deleting history from a transactions table that is over x years old) it can help to create a filtered index for that under SS2008... You can create the index online, and it significantly reduces the seek time if you're deleting a sub-set.

    Of course, if the column is already indexed - no probs! 😀

    Thank you for the info on SQL Server 2008. Unfortunately, I haven't had the opportunity to work with 2008 yet, so I'll have to look into that when I do and see what changes it might mean for this method of deleting records.

  • Tao Klerks (9/15/2009)


    Hi Lynn,

    Good article concept, this is a problem that I've had to deal with many times, and I suspect most DBAs do at one point or another.

    A couple concerns though:

    Okay, let's take a look.

    1) Would those transaction log backup files really be manageable? If you chose to make the batch size smaller (say 1,000 instead of 10,000) to minimize duration of locks (esp. if the deletion operations are expensive, for example containing substantial BLOB columns), then you rapidly end up with 2,000 transaction log files for your 2,000,000 records... Imagine having to recover those! :ermm: I guess it's a minor point, but I suspect in many (most?) cases the transaction log backup logic should be disconnected from the deletion loop itself.

    There will always be a pain point somewhere. Part of the issue I was trying to deal with here is controlling the size of the transaction log. This is one of the areas we see frequently on the forums: the OP is concerned with the transaction log growing excessively during the deletion process and taking much of the available disk space, regardless of the recovery model in use.

    When the database is using the SIMPLE recovery model, all that is needed is to batch the deletes. That way, as the transaction log is checkpointed, the space is reused and the log file stays under control.

    Unfortunately, if the database is using either the BULK_LOGGED or FULL recovery model, you actually need to run periodic transaction log backups during the delete process. This is what my code allows one to do. You have to decide how big each delete batch should be, how large the transaction log should be allowed to grow (if necessary), and how many transaction log files you are going to create. It is also fairly easy to modify this code to incorporate periodic differential backups; if you want a differential backup after every 100 transaction log backups, it wouldn't be hard to do that with this code.
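
    Purely as an illustration of the pattern (this is not the actual script from the article; the table, column, database name, backup path, and batch counts are made up):

    -- Rough sketch only: dbo.BigTable, RecordDate, MyDb, and the backup path are hypothetical.
    DECLARE @batch int, @rows int, @logsEvery int, @n int, @file nvarchar(260)
    SET @batch = 10000
    SET @logsEvery = 10            -- take a transaction log backup every 10 delete batches
    SET @rows = 1
    SET @n = 0

    WHILE @rows > 0
    BEGIN
        DELETE TOP (@batch) FROM dbo.BigTable WHERE RecordDate < '20050101'
        SET @rows = @@ROWCOUNT
        SET @n = @n + 1

        IF @rows > 0 AND @n % @logsEvery = 0   -- only needed under FULL or BULK_LOGGED recovery
        BEGIN
            SET @file = N'X:\LogBackups\MyDb_' + CONVERT(nvarchar(20), @n) + N'.trn'
            BACKUP LOG MyDb TO DISK = @file
        END
    END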

    2) Another point is the deletion mechanism - in SQL Server 2000 I have found the SET ROWCOUNT method to be quite painful on large tables. I've not looked deeply into it, but in the environments I've encountered it's MUCH more efficient to identify an indexed numeric key (not necessarily unique, as long as it's smoothly distributed), ideally with a clustered index, and loop through that instead. I imagine that there's a constant cost to evaluating the WHERE clause, even when the ROWCOUNT is only going to act on a subset of the matching records.

    I can't really address this particular issue regarding SET ROWCOUNT in SQL Server 2000, as I haven't really had to do this myself. I'm just trying to provide people with a method of deleting a large number of records while controlling the size of the transaction log and keeping the table relatively open for normal processing, by hopefully keeping SQL Server 2000 from taking and holding a table lock. As I have always said on these forums, when you use anything suggested on any forum, be sure to test, test, and test some more.

    This is just one option I am offering; it may not always be the best one. It may require some tweaks and modifications depending on the environment it is used in. If there is a suitable index, as you suggest, it would be easy to modify this code to force it to use that index.

  • I'm not sure I agree that SET ROWCOUNT is the issue in SQL 2000. I've used a similar method years ago in 2000 to delete large numbers of records and it seemed to work fine for me. Granted, large is relative and what is large today might be substantially larger than five years ago, but I'd be interested to know specifically if it's changed the execution plan or where the issue was.

  • Thanks, Lynn! Here's a possible minor tweak that I think makes one less trip through the loop (unless the total number of rows we're deleting is an exact multiple of the batchsize):

    Instead of:

    while @batchsize > 0

    If we say:

    declare @orgBatchSize bigint

    set @orgBatchSize = @batchsize

    while @batchsize = @orgBatchSize

    This exits the loop as soon as the number of deleted rows is less than the batchsize (which will only happen when there's nothing left to delete). This could be a time-saver if the Where clause on the Delete is slow.
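
    To put the tweak in context, something along these lines (a rough sketch only, not the article's actual script; the table and WHERE clause are made up):

    -- Minimal sketch of the tweak; dbo.BigTable and the predicate are hypothetical.
    DECLARE @batchsize bigint, @orgBatchSize bigint
    SET @batchsize = 10000
    SET @orgBatchSize = @batchsize

    WHILE @batchsize = @orgBatchSize          -- instead of WHILE @batchsize > 0
    BEGIN
        DELETE TOP (@orgBatchSize) FROM dbo.BigTable WHERE RecordDate < '20050101'
        SET @batchsize = @@ROWCOUNT           -- a short final batch ends the loop without one extra, empty pass
    END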

  • Lynn ... great article!

    Any thoughts on how (or even if) this approach should be modified for bulk deletions on a SQL 2005 mirror configuration with log-shipping in place?

    JC

  • Steve

    Sorry - would have quoted but the forum was insistent on quoting another post when I clicked quote 😀

    I would put money on the fact that when you used SET ROWCOUNT one of the two following possibilities was true:

    1) There was no where clause on the statement

    2) The where clause was easily satisfied by an index

    ...

    Atlantis Interactive - SQL Server Tools
    My blog
    Why I wrote a sql query analyzer clone

  • Just... keep it simple!

    Suppose you need to delete a large number of rows and the scenario doesn't require keeping the database log.

    In this case I want to delete all the rows between 2009/01/01 and 2009/10/23, about 100000000 rows.

    delete top (10000) from myTable where colDate between '20090101' and '20091023'

    backup log myDatabase with no_log

    go 10000

    What do you think?

    P.S. The recovery model is SIMPLE.

  • I use this (it's home cooked):

    create proc temp_slowDelete
    (
        @table varchar(255),
        @where varchar(1024)
    )
    as
    /* Slow, transaction-friendly delete: removes at most 250,000 rows per pass */
    declare @ParmDefinition nvarchar(1024)
    declare @deleteStatment nvarchar(2048)
    declare @statment nvarchar(2048)
    declare @total int
    declare @rowcount int

    select @deleteStatment = 'delete from ' + @table + ' where ' + @where

    set rowcount 250000   -- cap each delete pass at 250,000 rows
    set nocount on

    RAISERROR ('Counting rows to delete...', 0, 1) WITH NOWAIT
    SET @ParmDefinition = N'@CountOUT int OUTPUT'
    select @statment = 'select @CountOUT = count(*) from ' + @table + ' where ' + @where
    exec sp_executesql @statment, @ParmDefinition, @CountOUT = @total OUTPUT

    if (@total = 0)
    begin
        print 'Nothing to do :)'
        return
    end

    RAISERROR ('%d to delete', 0, 1, @total) WITH NOWAIT

    exec sp_executesql @deleteStatment
    set @rowcount = @@ROWCOUNT

    while (@rowcount > 0)
    begin
        select @total = @total - @rowcount
        RAISERROR ('Deleted %d, %d left', 0, 1, @rowcount, @total) WITH NOWAIT
        exec sp_executesql @deleteStatment
        set @rowcount = @@ROWCOUNT
    end

  • Ric Sierra (9/15/2009)


    Just... keep it simple!

    Suppose you need to delete a large number of rows and the scenario doesn't require keeping the database log.

    In this case I want to delete all the rows between 2009/01/01 and 2009/10/23, about 100000000 rows.

    delete top (10000) from myTable where colDate between '20090101' and '20091023'

    backup log myDatabase with no_log

    go 10000

    What do you think?

    P.S. The recovery model is SIMPLE.

    If your database is using the SIMPLE recovery model you can't do a BACKUP LOG, nor is one needed. The only other issue I'd have with your simple process is that it runs more often than it needs to: using GO 10000, your delete statement is going to run 10,000 times regardless of how many times it really needs to run.

    If your database is using the BULK_LOGGED or FULL recovery model, then you do need the transaction log backups in case a recovery is needed. The code I have provided can be incorporated into a stored procedure where periodic deletes of large numbers of records need to be accomplished.

  • Keep in mind that even with SIMPLE mode, the log can grow. If you don't break the work into batches, the whole delete is logged as a single transaction, which can be a lot of data. It's only once a transaction has committed that its log records can be marked as reusable during a checkpoint.

  • Thanks, Lynn. Most timely, since I'm in the middle of deleting something like 15 million rows from a production database that is designated high availability in our organization. Very helpful!

  • Lynn Pettis (9/15/2009)


    Ric Sierra (9/15/2009)


    Just... keep it simple!

    Suppose you need to delete a large number of rows and the scenario doesn't require keeping the database log.

    In this case I want to delete all the rows between 2009/01/01 and 2009/10/23, about 100000000 rows.

    delete top (10000) from myTable where colDate between '20090101' and '20091023'

    backup log myDatabase with no_log

    go 10000

    What do you think?

    P.S. The recovery model is SIMPLE.

    If your database is using the SIMPLE recovery model you can't do a BACKUP LOG, nor is one needed. The only other issue I'd have with your simple process is that it runs more often than it needs to: using GO 10000, your delete statement is going to run 10,000 times regardless of how many times it really needs to run.

    If your database is using the BULK_LOGGED or FULL recovery model, then you do need the transaction log backups in case a recovery is needed. The code I have provided can be incorporated into a stored procedure where periodic deletes of large numbers of records need to be accomplished.

    In this scenario I need to delete 100,000,000 rows; each batch deletes 10,000 rows, so to be sure all the rows are deleted I need to run the batch 10,000 times.

    10,000 x 10,000 = 100,000,000
