Deleting millions of rows from some tables

  • Hi everybody

    I read some posts about removing lots of rows from a table.

    I am facing a similar problem. I have a star schema.

    I have a table with 5 million rows, but in order to keep good performance, we decided to clean our DB by removing rows older than a date parameter. Deleting a row in the main table, however, requires deleting a row in each of 40 linked tables.

    We tried to delete about 1.5 million rows in the main table, i.e. 40 * 1.5 = 60 million rows in total. :w00t:

    We tried it on our emergency server, which has the same configuration as the live DB. It takes too much time.

    It deletes about 15,000 rows from the main table per hour.

    At first, we wrote a proc that uses a cursor to select the IDs to delete from the main table.

    For each ID, the proc calls another proc with the ID as a parameter. In this second proc, we delete the rows in the 40 child tables.

    Our DBA said that SQL Server rebuilds its index on every table, and that this is the reason the statement takes so long. I asked him whether it was possible to disable this 'automatic' rebuild, and the only answer I got was that the delete is made on a clustered primary key, so it is not possible. :blink:

    I tried to find other solutions:

    1 - I created a table with all the IDs to delete and looped over it, instead of using the original statement (a join of the main and child tables).

    2 - I noticed that all the constraints are defined with ON DELETE CASCADE, so deleting from the main table automatically deletes from the child tables; there is no need to execute a delete statement against each child table. But it did not speed up the mass delete.

    3 - I tried disabling all the constraints to avoid checking the foreign keys (a sketch of this follows below). It was much faster, but not fast enough.
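
    A minimal sketch of the disable/re-enable pattern from attempt 3, using a hypothetical table name (dbo.ChildTable01 stands in for each of the 40 child tables):

        -- Disable all FK/check constraints on the child table before the mass delete
        -- (dbo.ChildTable01 is a hypothetical name):
        ALTER TABLE dbo.ChildTable01 NOCHECK CONSTRAINT ALL

        -- ... run the deletes here ...

        -- Re-enable AND re-validate the existing rows (note the double CHECK):
        ALTER TABLE dbo.ChildTable01 WITH CHECK CHECK CONSTRAINT ALL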

    Now I am looking for a way to speed up this batch.

    Does anyone have a 'real DBA' solution or a way to investigate?

    We are prepared to delete in smaller batches (100,000 rows), but even that amounts to about 6 hours of run time (and other batches run during the night).

    Thanks for your answers !

    SQL Server does not automatically rebuild indexes after large deletes; it just leaves the space marked as deleted (the data is actually still in the database until it is overwritten). Check the database to make sure the auto shrink option is not on (if it is, it might cause data movement, and the indexes might then reorganize). Also, how long does it take to delete one record?
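
    For instance, a quick way to time a single cascading delete (the table and key names here are hypothetical placeholders):

        SET STATISTICS TIME ON

        -- One row from the main table; with ON DELETE CASCADE this also hits
        -- all 40 child tables, so the elapsed time is revealing.
        DELETE FROM dbo.MainTable WHERE MainID = 12345   -- hypothetical key

        SET STATISTICS TIME OFF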

  • Have you tried deleting from each of the individual child tables first? How fast did that run?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • If there are indexes other than the primary key on any of the tables, you might try disabling those indexes before the delete and then rebuilding them afterwards. Books Online has an example of how to do this.
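
    For instance, a minimal sketch assuming SQL Server 2005 or later (the index and table names are hypothetical):

        -- Disable a nonclustered index before the mass delete (SQL Server 2005+):
        ALTER INDEX IX_ChildTable01_SomeColumn ON dbo.ChildTable01 DISABLE

        -- ... run the deletes ...

        -- Rebuild it afterwards (a disabled nonclustered index must be rebuilt to be used again):
        ALTER INDEX IX_ChildTable01_SomeColumn ON dbo.ChildTable01 REBUILD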

    Even deleting a very large number of records shouldn't take as long as you are describing, unless there are other locks taking place on the table(s) at the same time. (Or if the hardware is simply underpowered for a database that size.)

    Try running a smaller delete, maybe 100 rows, and run a trace to see what locks you might be running into. Books Online has data on how to do this.

    I would expect it to be locking issues if I ran into something like this. If that's the case, possibly deleting in very small batches will help, but you'll have to resolve any locks first.
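
    As a starting point, a hedged sketch for inspecting current locks, assuming SQL Server 2005 or later (on SQL Server 2000, sp_lock gives similar information):

        -- List locks held or requested in the current database:
        SELECT resource_type,
               request_mode,
               request_status,
               request_session_id
          FROM sys.dm_tran_locks
         WHERE resource_database_id = DB_ID()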

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Antares686 (3/27/2008)


    SQL Server does not automatically rebuild indexes after large deletes; it just leaves the space marked as deleted (the data is actually still in the database until it is overwritten). Check the database to make sure the auto shrink option is not on (if it is, it might cause data movement, and the indexes might then reorganize). Also, how long does it take to delete one record?

    Hi,

    That is what I thought regarding automatic index rebuilds.

    I checked, and this option (AutoShrink) is disabled.

    I changed my SQL statement to measure the duration of deleting one group of records (one ID from the main table plus one row in each of the 40 child tables).

  • Here's a couple of facts for you...

    Creating a temp table that holds 750,000 rows of primary keys of rows to be deleted takes 2.1 seconds using SELECT/INTO.

    Adding a primary key to that temp table takes 1.4 seconds.

    A set-based delete on the source table joined to that temp table holding all the primary keys runs at a rate of 35,000 rows per second. This number includes the time it takes to update the primary key and, of course, will get smaller in the presence of additional indexes, triggers, and other things.

    Cascade delete slows it down because it basically turns the set-based delete into RBAR.

    Cursors cripple it. Most delete triggers will make it run twice as slow no matter which method is used.

    Deletes longer than a second (25,000-35,000 rows) should be broken into sets of 25,000 rows to keep the transaction log and blocking from going absolutely nuts... I normally do this with a "delete crawler". To keep the process from being horribly blocked by other processes, use TABLOCKX to do the deletes... 25,000 rows will take something less than a second and, with a delay in between deletes, most apps can handle that with no problem.

    Here's an example of the delete crawler that I use in such situations... the example includes a test table you can play with. As usual, the details are in the comments in the code...

    --===== If the test table exists, drop it
    IF OBJECT_ID('dbo.JBMTestDetail','U') IS NOT NULL
            DROP TABLE dbo.JBMTestDetail
    GO
    --===== Create and populate a 1,000,000 row test table.
         -- Column "ConnID" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
         -- Column "Key1" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(20)
         -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
         -- Column "Time_Stamp" has a range of >=01/01/2005 and <01/01/2015 non-unique date/times
         -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
         -- for all rows.
         -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
     SELECT TOP 1000000
            ConnID = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(30)),''), --(10 rows per connection)
            Key1   = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(20)),''), --just to test index with
            SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                         + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
            SomeCSV   = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
            SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
            Time_Stamp = ISNULL(CAST(RAND(CHECKSUM(NEWID()))*3652.0+38351.0 AS DATETIME),0),
            SomeHex12 = RIGHT(NEWID(),12)
       INTO dbo.JBMTestDetail
       FROM Master.dbo.SysColumns t1,
            Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
    --===== Create indexes similar to Troy's
     CREATE CLUSTERED INDEX IXC_JBMTestDetail_Time_Stamp_ConnID ON dbo.JBMTestDetail (Time_Stamp,ConnID)
     CREATE NONCLUSTERED INDEX IX_JBMTestDetail_ConnID_Key1 ON dbo.JBMTestDetail (ConnID,Key1)
    GO
    --===== Set up to measure performance...
        SET STATISTICS TIME ON
    --===== Define the cutoff date with a time of "midnight" or, if you will,
         -- define the cutoff date with no time so we only delete whole days.
    DECLARE @CutoffDate DATETIME
     SELECT @CutoffDate = DATEADD(dd,DATEDIFF(dd,0,GETDATE()),-94)
    --===== Limit all further queries, including deletes, to 25,000 rows
         -- (about 1 second worth of deletes, like I said before)
        SET ROWCOUNT 25000
    --===== See if any rows qualify for deletion. If even just one exists,
         -- then there's work to do and @@ROWCOUNT will be > 0.
         -- DO NOT PUT ANY CODE BETWEEN THIS SELECT AND THE WHILE LOOP OR
         -- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP
     SELECT TOP 1 1 FROM dbo.JBMTestDetail WHERE Time_Stamp < @CutoffDate
    --===== If the rowcount from the above is greater than 0,
         -- then delete 25,000 rows at a time until there's nothing
         -- left to delete
      WHILE @@ROWCOUNT > 0
      BEGIN
            --===== Just a "marker" to separate the loop iterations in the output
             PRINT REPLICATE('=',78)
            --===== This delay gives other processes breathing room
            WAITFOR DELAY '00:00:10'
            --===== Do the delete. Will be limited by the SET ROWCOUNT above.
                 -- DO NOT PUT ANY CODE BETWEEN THIS DELETE AND THE "END" OR
                 -- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP.
            DELETE dbo.JBMTestDetail WITH (TABLOCKX)
             WHERE Time_Stamp < @CutoffDate
      END
    --===== Restore the ability to process more than 25,000 rows
        SET ROWCOUNT 0
     SELECT 1000000 - COUNT(*) FROM jbmtestdetail

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi,

    thanks for your reply.

    I am currently trying a solution: copying the rows we want to keep into working tables, truncating the original tables, then re-inserting the rows, in order to avoid breaking "links" with stored procs and other DB objects.
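
    A minimal sketch of that copy/truncate/reinsert pattern for one table, with hypothetical names (identity columns would also need SET IDENTITY_INSERT handling, and any FKs referencing the table must be dropped or disabled before the TRUNCATE):

        DECLARE @CutoffDate DATETIME
        SELECT @CutoffDate = DATEADD(dd,DATEDIFF(dd,0,GETDATE()),-94)

        -- 1. Copy the rows to keep into a working table (SELECT/INTO is fast):
        SELECT *
          INTO dbo.Work_ChildTable01
          FROM dbo.ChildTable01
         WHERE Time_Stamp >= @CutoffDate

        -- 2. Empty the original table (minimally logged):
        TRUNCATE TABLE dbo.ChildTable01

        -- 3. Put the kept rows back:
        INSERT INTO dbo.ChildTable01
        SELECT * FROM dbo.Work_ChildTable01

        -- 4. Drop the working table:
        DROP TABLE dbo.Work_ChildTable01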

    My first attempts show that it is faster than deleting the rows directly.

    I have to deliver a first script by the end of today (thanks to my manager), and I will make more attempts tomorrow.

    Afterwards, I will post the results of my investigation.

    See you later.

  • Most people crab about the duplicate disk space while that happens. If you don't mind that, then yes, the method you spoke of will be faster.

    The problem is, the data is not available to the users between the truncate and the completion of the insert. But, again, if that small window of unavailability is acceptable, then yes, the method you spoke of will be faster.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Also, if there are foreign keys that reference the table you are truncating, you'll have to remove those first, then rebuild them. Not a big problem, but needs to be noted.
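
    A hedged sketch of that sequence, with hypothetical constraint and column names:

        -- Drop the FK that references the table to be truncated:
        ALTER TABLE dbo.ChildTable01 DROP CONSTRAINT FK_ChildTable01_MainTable

        TRUNCATE TABLE dbo.MainTable

        -- ... reload dbo.MainTable here ...

        -- Recreate the FK; WITH CHECK validates the rows already in the table:
        ALTER TABLE dbo.ChildTable01 WITH CHECK
          ADD CONSTRAINT FK_ChildTable01_MainTable
              FOREIGN KEY (MainID) REFERENCES dbo.MainTable (MainID)
              ON DELETE CASCADE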

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi, everybody.

    I am writing up the results of my problem and the different attempts to resolve it.

    First, I thank you for your answers.

    So, we decided to proceed as I described in my previous post.

    I had to write a SQL script, as automated as possible.

    So here is what I implemented:

    - I created a table with the names of the tables I had to purge, and another one to record the durations of copying data, dropping constraints, truncating, re-inserting from the working tables, adding constraints back, and so on.

    - I created 1 function and 2 stored procedures:

    * one that loops over the table holding the names of the tables to purge and calls the second (a sketch of this driver pattern follows below);

    * one that performs the processing (copy, drop constraints, truncate, reinsert, add constraints).
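
    A hedged sketch of the driver pattern just described, with hypothetical object names (dbo.TablesToPurge and dbo.usp_PurgeOneTable are illustrative, not the actual objects):

        DECLARE @TableName SYSNAME

        DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
         SELECT TableName FROM dbo.TablesToPurge   -- hypothetical driver table

        OPEN cur
        FETCH NEXT FROM cur INTO @TableName

        WHILE @@FETCH_STATUS = 0
        BEGIN
                -- Hypothetical worker proc: copy / drop constraints / truncate /
                -- reinsert / re-add constraints for one table
                EXEC dbo.usp_PurgeOneTable @TableName = @TableName
                FETCH NEXT FROM cur INTO @TableName
        END

        CLOSE cur
        DEALLOCATE cur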

    Our DBA prepared the emergency DB server with a restoration of the live prod DB (except that he did not use backup/restore operations but Embarcadero to create the DB objects before massive inserts by bcp, and he forgot some tables (including the main table of our star model) and lost some identity constraints). So my script did not run to the end.

    But, to give a first set of measurements:

    I tried my script on a local computer with 70,000 rows to delete out of 300,000, across the 40 tables. It ran in less than 10 minutes.

    Our live DB is 75 GB. Each table of the model has more than 5 million rows.

    On some tables it takes about 2-3 minutes to run.

    So we expect, once we can test under live-DB conditions, a maximum duration of 2 hours.

    The operation that could take long is re-creating the constraints on all the tables I had to drop them from before truncating (as GSquared said).

    If somebody wants me to post some of the statements I wrote, I can do so.

    I know that deleting rows is fully logged, while TRUNCATE TABLE is minimally logged.
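
    One quick way to watch that difference is to compare log usage before and after each approach:

        -- Reports log size and percent used for every database:
        DBCC SQLPERF(LOGSPACE)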

    But I find it amazing that copying, truncating, and re-inserting two thirds of a table is faster than deleting one third of it.

    I want to thank you for advice and information.

    Have a nice day !


    It is amazing that two copies and a truncate can be so fast. Guess that's why lots of folks use it... so long as there aren't foreign keys, delete triggers, etc, etc.

    Thanks for sharing what you're trying to do...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hello,

    I will ask a friend who knows Oracle quite well whether he has already faced this problem and whether an Oracle DB gives the same results.

    I will post the SQL statements later today (I will comment them).

    THX

  • Regarding FK:

    - Many people forget to create supporting indexes for their foreign keys (that is, in the child tables!), resulting in full table scans (time and again!).

    - Keep in mind to create the FK index fully matching the FK definition (column sequence).

    - For large clean-up operations, plan application downtime and restrict all access during the operation (physically or by using e.g. a table lock). That's the only way you can guarantee that the data stays consistent during the operation.

    - If you drop/disable FKs, keep in mind to re-create/re-enable them using the WITH CHECK option! Don't use WITH NOCHECK, because you will get dirty data in the long run.
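
    For instance (the index, column, and constraint names are hypothetical):

        -- Supporting index on the FK column in the child table,
        -- matching the FK definition's column order:
        CREATE NONCLUSTERED INDEX IX_ChildTable01_MainID ON dbo.ChildTable01 (MainID)

        -- Re-enable an FK WITH CHECK so existing rows are validated against the parent:
        ALTER TABLE dbo.ChildTable01 WITH CHECK CHECK CONSTRAINT FK_ChildTable01_MainTable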

    Maybe these scripts can get you on track ...

    - Detect invalid FK-data after using "WITH NOCHECK" : http://www.sqlservercentral.com/scripts/Miscellaneous/31808/

    - Generate Create FK-indexes:

    http://www.sqlservercentral.com/scripts/Indexing/61391/

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • ALZDBA (4/3/2008)


    Regarding FK:

    - Many people forget to create supporting indexes for their foreign keys (that is, in the child tables!), resulting in full table scans (time and again!).

    - Keep in mind to create the FK index fully matching the FK definition (column sequence).

    - For large clean-up operations, plan application downtime and restrict all access during the operation (physically or by using e.g. a table lock). That's the only way you can guarantee that the data stays consistent during the operation.

    - If you drop/disable FKs, keep in mind to re-create/re-enable them using the WITH CHECK option! Don't use WITH NOCHECK, because you will get dirty data in the long run.

    Maybe these scripts can get you on track ...

    - Detect invalid FK-data after using "WITH NOCHECK" : http://www.sqlservercentral.com/scripts/Miscellaneous/31808/

    - Generate Create FK-indexes:

    http://www.sqlservercentral.com/scripts/Indexing/61391/

    Hi ALZDBA,

    I don't have this problem, thanks to the star schema: the child tables reference the main one by an ID that is the PK field, so there is a clustered index on it in each child table, and an index as well.

    And when I re-create my foreign keys, I use the WITH CHECK option (the default).

    Thanks for the links.
