Large table cleanup with minimal locks

  • Although it's perhaps outside the scope of the article, I've found that when performing mass deletions of obsolete data, in addition to contention, a significant issue is managing the size of the transaction log generated by the deletions.

    Partition switching can help, but only if the table is already partitioned (splitting an existing table into partitions is just as expensive, transaction-log-wise, as deleting rows), and only if foreign key constraints don't preclude switching of partitions.

    Forgive me if this is an obvious point.

    Anyone have any hints for reducing the transaction log growth created by deleting rows?
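
    For anyone exploring the partition-switch route mentioned above, here is a minimal sketch (the table, partition number, and staging-table names are assumptions, not from the original post):

    -- dbo.AuditLog is assumed to be partitioned (e.g. by month); dbo.AuditLog_Stage must have
    -- an identical schema and sit on the same filegroup as the partition being switched out.
    ALTER TABLE dbo.AuditLog SWITCH PARTITION 1 TO dbo.AuditLog_Stage;

    -- The switch is a metadata-only operation; the obsolete rows can then be dropped cheaply.
    TRUNCATE TABLE dbo.AuditLog_Stage;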

  • graecl (3/6/2012)


    ... Anyone have any hints for reducing the transaction log growth created by deleting rows?

    When I've had to do bulk deletions on a 120GB table, I started with a full DB backup, then upped the transaction log backup job frequency to promote VLF reusability as best I could. There was some log growth, but running the t-log backup every 15 minutes kept it well within available space.

    As it is, I left the t-log backup running every 15 minutes anyway. No harm done, and my t-log files typically stay within 2x to 3x their initial size; we start the log files at 512MB and any secondaries at the same size. Yes, that's a lot to initialize to zeros, but we've had no performance issues to date and we very, very rarely add secondaries.

    (Minor plug: using level-4 file compression in Red Gate's SQL backup saves us 10s of GB per backup involving mass data reduction - couldn't live without it on SQL 2005.)
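
    A minimal sketch of the batch-plus-log-backup rhythm described above (the table, retention window, batch size, database name, and backup path are all assumptions):

    DECLARE @rows INT;
    SET @rows = 1;

    WHILE @rows > 0
    BEGIN
        -- Delete in modest batches so each transaction stays small
        DELETE TOP (50000)
        FROM dbo.BigAuditTable
        WHERE CreatedDate < DATEADD(DAY, -45, GETDATE());

        SET @rows = @@ROWCOUNT;

        -- A log backup between batches lets the engine reuse VLFs instead of growing the file
        BACKUP LOG MyDatabase TO DISK = N'X:\LogBackups\MyDatabase_log.trn';
    END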

  • I use CTEs to accomplish these types of delete and/or archive tasks.

    For example:

    WITH AuditRecordsToArchiveTop10000 AS
    (
        SELECT TOP 10000 *
        FROM SOURCETABLE
        WHERE CREATED < GETDATE() - 45
    )
    DELETE AuditRecordsToArchiveTop10000
    OUTPUT DELETED.*
    INTO ARCHIVEDTABLE;
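
    (For completeness, not part of the original post: the same pattern is easy to run in batches until nothing is left to archive. A sketch, assuming the same table and column names as above.)

    WHILE 1 = 1
    BEGIN
        ;WITH AuditRecordsToArchiveTop10000 AS
        (
            SELECT TOP 10000 *
            FROM SOURCETABLE
            WHERE CREATED < GETDATE() - 45
        )
        DELETE AuditRecordsToArchiveTop10000
        OUTPUT DELETED.* INTO ARCHIVEDTABLE;

        -- Stop once a batch deletes nothing
        IF @@ROWCOUNT = 0 BREAK;
    END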

  • You can also check out this article; it may help with some ideas:

    Deleting Large Number of Records


  • jay.dunk (3/6/2012)


    Using nolock is quite risky and is getting used more and more these days to avoid locks, but it DOES come with issues which I only found out about a few months ago. Take a look at this post, which puts it a lot better than I do.

    OKBangas stated it earlier in this thread. If the NOLOCKs are on Temp Tables, then they just don't matter because no one else will be writing to them. Personally, I'll use WITH(TABLOCKX) on Temp Tables to set a single lock because no one will be using them but me.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
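
    For illustration only (not from the post above), the TABLOCKX idea on a temp table might look like this; the table and column names are assumptions:

    CREATE TABLE #Work (Id INT NOT NULL PRIMARY KEY CLUSTERED);

    -- One exclusive table lock instead of many row/page locks; nobody else can see this temp table anyway
    INSERT INTO #Work WITH (TABLOCKX) (Id)
    SELECT Id
    FROM dbo.SourceTable
    WHERE Created < DATEADD(DAY, -45, GETDATE());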

  • SAinCA (3/6/2012)


    (Not my architecture, so please avoid commenting on the presence of RBAR :-D)

    Heh... you posted it so if there were comments to be made about RBAR, you'd just have to deal with it. 😉

    Fortunately, what you posted isn't RBAR. The While Loop is a control loop to delete sets of rows, not individual rows.

    --Jeff Moden

  • TheSQLGuru (3/6/2012)


    Note that the locks taken for the DELETE in this case could be PAGE locks, in which case there could potentially be hundreds of thousands of rows available for the DELETE before the engine decides it needs to escalate to a TABLE lock. Also note that if the TABLE lock isn't available due to other concurrent activity, the engine will keep going and retry the lock escalation after each additional 1,250 locks are acquired (with some constraints, provisos, limitations, etc.).

    You've brought up a very good point, Kevin, that's missing from the article and all of the suggestions so far. There's no guarantee of the order in which the deletes will occur unless there's an ORDER BY involved. Ordering by the same columns as the clustered index would keep most of the deletes on the "earliest" pages and mostly together, so that you could probably get away with a PAGLOCK hint. The fewer locks you have to take, the faster things can run. It would also isolate the deletes to areas that weren't being used as much, assuming the proper clustered index guideline of "unique and always increasing" were followed.

    --Jeff Moden
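
    For illustration (not from the post above), a batched delete that follows Jeff's advice here, taking the oldest rows in clustered-key order with a PAGLOCK hint, might look like this; the table and column names are assumptions:

    WITH OldestBatch AS
    (
        SELECT TOP (10000) *
        FROM dbo.AuditLog WITH (PAGLOCK)
        WHERE Created < DATEADD(DAY, -45, GETDATE())
        ORDER BY AuditLogID      -- same column(s) as the clustered index
    )
    DELETE OldestBatch;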

  • Eric1972 (3/6/2012)


    I use CTEs to accomplish these types of delete and/or archive tasks.

    For example:

    WITH AuditRecordsToArchiveTop10000 AS
    (
        SELECT TOP 10000 *
        FROM SOURCETABLE
        WHERE CREATED < GETDATE() - 45
    )
    DELETE AuditRecordsToArchiveTop10000
    OUTPUT DELETED.*
    INTO ARCHIVEDTABLE;

    That's perfect for single-table deletes, especially if you add an ORDER BY either on the "ever increasing" clustered index or on the CREATED column itself, to keep all of the deletes as close together and as "early" as possible.

    You might still need a Temp Table and an explicit transaction if you were deleting from 3 related tables like in the article.

    --Jeff Moden
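
    For illustration (not from the post above), the temp-table-plus-explicit-transaction shape for related tables might look like this; all table and column names are assumptions:

    -- Capture the keys once, in clustered-key order
    SELECT TOP (10000) OrderID
    INTO #Doomed
    FROM dbo.Orders
    WHERE Created < DATEADD(DAY, -45, GETDATE())
    ORDER BY OrderID;

    BEGIN TRANSACTION;

        -- Children first, then the parent, so FK constraints are never violated
        DELETE d FROM dbo.OrderItems d JOIN #Doomed k ON k.OrderID = d.OrderID;
        DELETE d FROM dbo.OrderNotes d JOIN #Doomed k ON k.OrderID = d.OrderID;
        DELETE d FROM dbo.Orders     d JOIN #Doomed k ON k.OrderID = d.OrderID;

    COMMIT TRANSACTION;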

  • cfradenburg (3/6/2012)


    jberg-604007 (3/6/2012)


    can't index a table variable

    But you can declare a primary key which does have an effect on how the query is processed.

    Just to add to that, you can also put indexes on table variables through the use of unique constraints during table variable definition. It's a bloody rare thing, though.

    --Jeff Moden
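
    For illustration (not from the post above), both options look like this; the names are assumptions:

    DECLARE @Work TABLE
    (
        Id       INT              NOT NULL PRIMARY KEY CLUSTERED,  -- index via the primary key
        BatchKey UNIQUEIDENTIFIER NOT NULL UNIQUE                   -- second index via the unique constraint
    );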

  • ChrisTaylor (3/6/2012)


    jay.dunk (3/6/2012)


    I do something similar on a 24/7 db, 365 days a year with all files on a single raid 5 partition!?! I use the same approach but with a table variable and I process over 1.5 million deletes per day on a rolling basis.

    Be careful with using a table variable for a large number of rows, because SQL Server optimizes a table variable as if it holds one row, i.e. it assumes a single row will be returned, possibly producing a sub-optimal execution plan.

    I agree with the comments regarding using the inner join as opposed to WHERE...IN (SELECT...)

    Oh, be careful now. First, a table variable can be optimized for more than one row using a statement-level recompile. It can be VERY effective. Of course, that's one of the advantages of using a Temp Table... if it needs to, it'll do the recompile on its own.

    On the WHERE...IN thing... it's a myth to think that WHERE IN has a performance problem. It may have a problem with NULLs, but it's not generally a performance problem. In fact, the opposite is frequently true: a WHERE...IN will often outperform an inner join.

    It's been proven many times on these fine forums but I guess one more time won't hurt. Here's my standard test table along with a "pony" table for the testing.

    -----------------------------------------------------------------------------------------------------------------------
    --DROP TABLE dbo.JBMTest, #MyInTable
    GO
    --===== Create and populate a 1,000,000 row test table.
         -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
         -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
         -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
         -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
         -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
         -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' for all rows.
         -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
         -- Jeff Moden
     SELECT TOP 1000000
            RowNum       = IDENTITY(INT,1,1),
            SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
            SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                         + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
            SomeCSV      = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
            SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
            SomeDate     = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
            SomeHex12    = RIGHT(NEWID(),12)
       INTO dbo.JBMTest
       FROM Master.dbo.SysColumns t1
      CROSS JOIN Master.dbo.SysColumns t2

    --===== A table is not properly formed unless a unique clustered index has been assigned
     ALTER TABLE dbo.JBMTest
       ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Create a table with some "WHERE IN" values
     SELECT TOP 1000
            IDENTITY(INT,1,1) AS RowNum,
            ABS(CHECKSUM(NEWID()))%50000+1 AS SomeInt
       INTO #MyInTable
       FROM Master.dbo.SysColumns t1

    --===== Add an index to the larger table
     CREATE INDEX IX_JBMTest_SomeInt
         ON dbo.JBMTest (SomeInt)

    Here's the test code. Feel free to add indexes as you see fit... the results come out pretty much the same.

    --===== THE TEST CODE
    PRINT '========== INNER JOIN =================================================='
    SET STATISTICS TIME ON
     SELECT DISTINCT t.*
       FROM dbo.JBMTest t
      INNER JOIN #MyInTable i
         ON t.SomeInt = i.SomeInt
    SET STATISTICS TIME OFF
    GO
    PRINT '========== WHERE IN =================================================='
    SET STATISTICS TIME ON
     SELECT t.*
       FROM dbo.JBMTest t
      WHERE t.SomeInt IN (SELECT SomeInt FROM #MyInTable)
    SET STATISTICS TIME OFF
    GO
    PRINT '========== WHERE EXISTS =================================================='
    SET STATISTICS TIME ON
     SELECT t.*
       FROM dbo.JBMTest t
      WHERE EXISTS (SELECT 1 FROM #MyInTable i WHERE t.SomeInt = i.SomeInt)
    SET STATISTICS TIME OFF
    GO

    Here are the results from my ol' machine with a single CPU.

    ========== INNER JOIN ==================================================
    (19607 row(s) affected)
     SQL Server Execution Times:
       CPU time = 343 ms, elapsed time = 1834 ms.

    ========== WHERE IN ==================================================
    (19607 row(s) affected)
     SQL Server Execution Times:
       CPU time = 125 ms, elapsed time = 1641 ms.

    ========== WHERE EXISTS ==================================================
    (19607 row(s) affected)
     SQL Server Execution Times:
       CPU time = 156 ms, elapsed time = 1475 ms.

    ========== INNER JOIN ==================================================
    (19607 row(s) affected)
     SQL Server Execution Times:
       CPU time = 297 ms, elapsed time = 1576 ms.

    The INNER JOIN lost whether it ran first or last. Like I said, the supposed performance problems with WHERE IN are mostly myths.

    --Jeff Moden

  • jay.dunk (3/6/2012)


    hitrickk (3/6/2012)


    jay.dunk (3/6/2012)


    @kalman --- that won't get round the contention. If you use the method shown in the article you can throttle the deletes to avoid locking for too long a time, whilst only performing one select per large batch to get the IDs you want.

    I do something similar on a 24/7 db, 365 days a year with all files on a single raid 5 partition!?! I use the same approach but with a table variable and I process over 1.5 million deletes per day on a rolling basis.

    The CTE will keep repeating the select for the delete and will therefore likely cause locking for the duration of the select/delete.

    Hi, I too have a 24/7 DB, but with a good filer machine (RAID 1+0). I deleted more than 4,000K rows with such a CTE yesterday on a production machine (a statistics table). In the CTE I use the WITH(NOLOCK) hint after the table name in the select, so I have no lock waits (dirty reads). My DB is in the FULL recovery model. Try it; I will try to write an example tomorrow.

    Regards Kalman

    Using nolock is quite risky and is getting used more and more these days to avoid locks, but it DOES come with issues which I only found out about a few months ago. Take a look at this post, which puts it a lot better than I do:

    http://blogs.msdn.com/b/sqlcat/archive/2007/02/01/previously-committed-rows-might-be-missed-if-nolock-hint-is-used.aspx

    Thank you for the link, great article.

    I guess the answer depends on the type of data you have. If you have, for example, a live bank or stock-exchange DB, I guess using dirty reads (NOLOCK) would be a bad idea. But if we are talking about a statistics DB that counts user activity (mostly inserts, no updates, and the data isn't vital in every row like in a bank record), you can use NOLOCK.

    As I wrote, the answer depends on your system and the information that the table holds. I use NOLOCK only when I'm sure it will be safe, and the difference in performance is amazing.
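
    For illustration (a variant of what Kalman describes rather than his actual code; the names are assumptions): read the candidate keys with NOLOCK, then take normal locks only for the delete itself:

    SELECT StatID
    INTO #ToDelete
    FROM dbo.UserActivityStats WITH (NOLOCK)   -- dirty read is acceptable here by assumption
    WHERE Created < DATEADD(DAY, -45, GETDATE());

    DELETE s
    FROM dbo.UserActivityStats s
    JOIN #ToDelete d ON d.StatID = s.StatID;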

  • Isn't setting ROWCOUNT also a valid option to limit the number of rows affected by a DELETE statement? Or am I missing something here?

  • vliet (3/7/2012)


    Isn't setting ROWCOUNT also a valid option to limit the number of rows affected by a DELETE statement? Or am I missing something here?

    See the first "Important" note in this article:

    SET ROWCOUNT (Transact-SQL)

    Since that won't be supported in a future version, I wouldn't recommend doing it now, but it was a valid option in the past.
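
    The documented replacement is TOP on the DELETE itself. A minimal sketch (the table and column names are assumptions):

    DELETE TOP (10000)
    FROM dbo.SourceTable
    WHERE Created < DATEADD(DAY, -45, GETDATE());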

  • Thanks, Jeff. I meant the transmit process, which takes one row, sends it to the central server, then goes and gets another... Wot no batching??? Yep, primitive RBAR, but it does "guarantee" every record is transmitted. One day they'll let me change it to a multi-row handling SP on the server end, with individual record failure rollback, you know, that "new technology" thing. Sarcasm flag waving high and wildly :w00t:

  • If your tables have IDs in order, you may get better performance by selecting IDs in order rather than using TOP. If your tables have date columns you can probably just process a month at a time or so without the ID table. Deletes may perform better when you use a range of IDs because that way the data is likely to all be in one page at a time, or a small set of pages; TOP can give you a random spread of data. By 'perform better' I mean cause less locking on your system. If you are only locking one month of old data at a time you're not likely to have any problems: the one other person trying to grab that data at the same time will see it's locked, your process will finish with it, and they'll get access. If you're locking data spread across a whole bunch of pages, you increase the chance that other processes will try to grab one of those pages at the same time as you do, causing waits to build up.
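
    A minimal sketch of the range-based batching described above (the table and column names, the 45-day cutoff, and the slice size are assumptions):

    DECLARE @From INT, @To INT, @MaxId INT;

    SELECT @From = MIN(Id), @MaxId = MAX(Id)
    FROM dbo.AuditLog
    WHERE Created < DATEADD(DAY, -45, GETDATE());

    WHILE @From <= @MaxId
    BEGIN
        SET @To = @From + 9999;

        -- Each batch touches one contiguous slice of the clustered index
        DELETE FROM dbo.AuditLog
        WHERE Id BETWEEN @From AND @To
          AND Created < DATEADD(DAY, -45, GETDATE());

        SET @From = @To + 1;
    END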
