What is the most efficient way to delete records?

  • Hello,

    I have a database that is 230GB in size. A single table in this database has 200 million rows and takes up 180GB of the database's size. I'm almost out of space on both the data and log disks.

    I want to clear up some space by deleting old records from the big table. The table has a date column and an index on this column.

    I want to delete all the records dated before 1 Jan 2010; that will leave about 30 million records in this table and remove the rest.

    I can't delete them all in one go because of space constraints on the transaction log disk, so I'm removing them in batches.

    This is the script that I wrote for this purpose:

    declare @C int
    set @C = 0

    while exists (select top 1 * from bigtable where [date] < '1 jan 2010')
    begin
        set @C = @C + 1

        begin tran
            insert into _andrew (Data) values (@C)
        commit tran

        begin tran
            delete top (1000000) from bigtable where [date] < '1 jan 2010'
        commit tran
    end

    The log table looks like this:

    CREATE TABLE [dbo].[_andrew](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [Data] [int] NOT NULL,
        [Time] [datetime] NOT NULL CONSTRAINT [DF__andrew_Time] DEFAULT (getdate()),
        CONSTRAINT [PK__andrew] PRIMARY KEY CLUSTERED ([Id] ASC)
            WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
                  ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    The script has taken 18 hours to delete 24 million records so far. This is somewhat unexpected to me, as I had hoped that 18 hours would be plenty to remove all the records that I wanted removed. Obviously I was wrong.

    Could you please recommend any way of speeding up the process?

    Cheers,

    Andrew

  • The 1st problem you have is in:

    WHILE EXISTS (SELECT TOP 1 ...)

    2nd Problem:

    DELETE TOP (1000000)..

    To find the TOP (1000000), SQL must check all the records.

    Rather:

    SET ROWCOUNT 1000000

    DECLARE @deleted int
    SET @deleted = 1000000   -- use the same value as in the SET ROWCOUNT

    WHILE @deleted = 1000000
    BEGIN
        BEGIN TRAN
            DELETE FROM bigtable WHERE [date] < '1 jan 2010'
            SET @deleted = @@ROWCOUNT   -- NOTE: before the COMMIT TRAN
        COMMIT TRAN
    END

    This way SQL will delete a maximum of 1,000,000 records and stop, without first checking that they are the top.

    I would also not delete that many records at once; 10,000 to 50,000 is the sort of range I would go for.

    If you really want to track the number of times it loops, do it all in the same transaction bracket.

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)

    2nd Problem:

    DELETE TOP (1000000)..

    To find the TOP (1000000), SQL must check all the records.

    Thank you for your feedback. Could you please expand on the above? Why do you say that to find the TOP (1000000) SQL must check all the records?

    Thanks,

    Andrew

  • Just another possible plan of attack.

    Create another table to hold the 30 million rows you want to keep.

    Copy the 30 million rows to the new table in batches based on dates.

    Drop the old table (it's faster to deallocate than to delete).

    Rename the new table to the old name.

    If you have a window of time when no updates are happening to this table, this might be a lot quicker than deleting rows.
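
    For illustration, a minimal sketch of this copy/drop/rename approach. The table and column names are placeholders, and it assumes batching with TOP and a keep-criterion of [Date] >= '1 Jan 2010':

    -- Sketch only: placeholder names; the column list must match the real table.
    CREATE TABLE dbo.BigTable_keep (
        Id     int      NOT NULL PRIMARY KEY,
        [Date] datetime NOT NULL
        -- ...remaining columns copied from dbo.BigTable
    )

    -- Copy the rows to keep in batches so each transaction stays small.
    DECLARE @copied int
    SET @copied = 1
    WHILE @copied > 0
    BEGIN
        INSERT INTO dbo.BigTable_keep (Id, [Date])
        SELECT TOP (50000) b.Id, b.[Date]
        FROM dbo.BigTable b
        WHERE b.[Date] >= '20100101'
          AND NOT EXISTS (SELECT 1 FROM dbo.BigTable_keep k WHERE k.Id = b.Id)
        SET @copied = @@ROWCOUNT
    END

    DROP TABLE dbo.BigTable   -- deallocation, so it is fast and minimally logged
    EXEC sp_rename 'dbo.BigTable_keep', 'BigTable'
    -- Recreate any indexes, constraints and permissions on the renamed table.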


    And then again, I might be wrong ...
    David Webb

  • A further comment: try to use a clustered index/primary key in the WHERE clause. You don't say if the table is clustered on [DATE].

    If [DATE] isn't clustered but there's an IDENTITY column or similar used for clustering and this matches the date sequence, then use a single lookup to find the MAX(IdentityColumn) where [DATE] < '1 Jan 2010'

    Then use the result of this in the DELETE... WHERE clause.

    You may also want to add a WAITFOR DELAY '00:00:02' in the loop, as my suggestion is fairly efficient and will hammer your server, possibly to the detriment of other processes. A sketch of the combined loop is below.
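
    For illustration, a minimal sketch of that combined loop, assuming the clustering identity column is named Id and using the SET ROWCOUNT batching suggested above:

    -- Sketch only: assumes an identity column named Id that roughly tracks [date].
    DECLARE @maxid int
    SELECT @maxid = MAX(Id) FROM bigtable WHERE [date] < '1 jan 2010'

    SET ROWCOUNT 50000
    DECLARE @deleted int
    SET @deleted = 50000
    WHILE @deleted = 50000
    BEGIN
        BEGIN TRAN
            DELETE FROM bigtable WHERE Id <= @maxid
            SET @deleted = @@ROWCOUNT
        COMMIT TRAN
        WAITFOR DELAY '00:00:02'   -- breathing room for other processes
    END
    SET ROWCOUNT 0   -- reset so later statements are not limited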

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • David Webb-200187 (5/6/2010)


    Just another possible plan of attack.

    create another table to hold the 30 million rows you want to keep

    copy the 30 million rows to the new table in batches based on dates.

    drop the old table ( faster to deallocate than to delete)

    rename the new table to the old name.

    If you have a window of time when no updates are happening to this table, this might be a lot quicker than deleting rows.

    This is what I'm currently considering doing. The problem is that I don't have free space, so right at the moment I'm running a shrink on the database to reclaim the space from the 24 million rows that were already deleted. I have no idea how many hours this will take. After that I hope I'll be able to copy / drop / rename in a reasonable time frame. Thank you for the suggestion.

  • You asked: Why are you saying that to find the TOP (1000000) SQL must check all the records?

    The only way to find the TOP record is to look at them all, then sort them in the required order, then take the top one. It helps if you have a clustered index, but SQL is still doing a lot of extra reads.

    Try running the queries below with the query plan display turned on and after setting

    SET STATISTICS IO ON

    SELECT TOP (1) * FROM Big_Table WHERE [DATE] < '1 Jan 2010'

    vs

    SET ROWCOUNT 1

    SELECT * FROM Big_Table where [DATE] < '1 Jan 2010'

    They may return different rows, but in your case, since you are controlling the delete with [Date] < '1 Jan 2010', you don't really care if the TOP 1000000 are deleted in any particular order; you just want them deleted.

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)


    A further comment: try to use a clustered index/primary key in the WHERE clause. You don't say if the table is clustered on [DATE].

    It is not. It is clustered on the identity column, but the date column order roughly corresponds to the identity order. In other words, the records were added chronologically most of the time (but not always).

    If [DATE] isn't clustered but there's an IDENTITY column or similar used for clustering and this matches the date sequence, then use a single lookup to find the MAX(IdentityColumn) where [DATE] < '1 Jan 2010'

    Then use the result of this in the DELETE... WHERE clause.

    The date column matches your description, so the Max Id can be found easily. However, my problem is that SQL automatically wraps the delete in a transaction and thus copies the records it's about to delete to the transaction log, which doesn't have that much space. This is exactly why I tried to delete in batches: this way data is copied to the transaction log, deleted from the database, and the transaction log portion of it is released on a checkpoint, as the recovery model is simple. So for each batch the same amount of transaction log space is reused.

    EDIT: Sorry, I misread you here; you are suggesting substituting the WHERE clause in my script to use the clustered index identity column in the condition instead of the date column. Gotcha.
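
    A minimal sketch of that log-reuse pattern (the explicit CHECKPOINT and the DBCC SQLPERF check are assumptions, relying on simple recovery truncating the log at checkpoints):

    BEGIN TRAN
        DELETE TOP (50000) FROM bigtable WHERE [date] < '1 jan 2010'
    COMMIT TRAN
    CHECKPOINT                -- mark the committed batch's log records reusable

    DBCC SQLPERF(LOGSPACE)    -- optional: watch log file usage between batches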

  • If you are also shrinking the database, my recommendations are:

    Do the shrink on the file (DBCC SHRINKFILE) rather than shrinking the whole database.

    Shrink it in small amounts multiple times, rather than trying to go down in one big block.

    Tests I've done in 16 years as a DBA show that for large databases it's more efficient to shrink in small amounts. A sketch is below.
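
    For illustration, a sketch of that stepped shrink; the logical file name is a placeholder, and DBCC SHRINKFILE takes a target size in MB:

    -- Sketch: step the data file down a few GB at a time (placeholder file name).
    DBCC SHRINKFILE (N'Andrew_Test_Data', 175000)   -- target size in MB
    DBCC SHRINKFILE (N'Andrew_Test_Data', 170000)
    DBCC SHRINKFILE (N'Andrew_Test_Data', 165000)
    -- ...repeat in small steps rather than one large shrink to the final size.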

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • You said: In other words, the records were added chronologically most of the time (but not always).

    Check if there are any records that would be incorrectly deleted if you used my recommendation; then you can do something like add:

    AND ID NOT IN (1,5,6,....)

    Or you could put these problem IDs into a table and use a join to eliminate them.

    I would put in the effort to use the clustered index; it will make a big difference. A sketch of the check is below.
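
    For illustration, a minimal sketch of that check, assuming the clustering identity column is named Id:

    -- Sketch: find rows an "Id <= @maxid" delete would wrongly remove.
    DECLARE @maxid int
    SELECT @maxid = MAX(Id) FROM bigtable WHERE [date] < '1 jan 2010'

    SELECT Id, [date]
    FROM bigtable
    WHERE Id <= @maxid
      AND [date] >= '1 jan 2010'   -- out-of-sequence rows to exclude from the delete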

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)

    The only way to find the TOP record is to look at them all, then sort them in the required order, then take the top one off the top.

    Yes, but this applies to ordered queries. My query is not ordered (there is no ORDER BY clause). Where does this sort you are referring to come from?

  • I'm not sure what you mean by this:

    .... However my problem is that SQL will automatically wraps delete in a transaction, and thus moves of the records its about to delete to transaction log

    To use the MAX(ID), look it up once and store it in a variable:

    DECLARE @MAXID int

    select @MAXID = MAX(ID) from Big_Table where [Date] < '1 Jan 2010'

    But your approach is correct. Keep the database in SIMPLE mode, keep the batch sizes reasonable, and do the deletes in a transaction bracket within the loop. You probably get a checkpoint every 59 seconds or so, so the WAITFOR DELAY will ensure the log is kept as small as possible.

    If you have any spare capacity on another disk, you could create a second log file for the duration of the delete process and then delete it later, if log space is a problem. A sketch follows.
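
    For illustration, a sketch of the temporary second log file (database name, file name and path are placeholders; the file must be empty before it can be removed):

    -- Sketch: add a temporary second log file on a disk with spare capacity.
    ALTER DATABASE Andrew_Test
    ADD LOG FILE (NAME = N'Andrew_Test_log2',
                  FILENAME = N'E:\SQLLogs\Andrew_Test_log2.ldf',
                  SIZE = 10GB)

    -- After the purge, empty the file and then remove it.
    DBCC SHRINKFILE (N'Andrew_Test_log2', EMPTYFILE)
    ALTER DATABASE Andrew_Test REMOVE FILE Andrew_Test_log2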

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)

    Try running the queries below with the query plan display turned on and after setting

    SET STATISTICS IO ON

    SELECT TOP (1) * FROM Big_Table WHERE [DATE] < '1 Jan 2010'

    vs

    SET ROWCOUNT 1

    SELECT * FROM Big_Table where [DATE] < '1 Jan 2010'

    This is what I'm seeing when I try this:

    Table 'BigTable'. Scan count 1, logical reads 4, physical reads 3, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    |--Top(TOP EXPRESSION:((1)))

    |--Clustered Index Scan(OBJECT:([Andrew_Test].[dbo].[BigTable].[BigTable_PK]), WHERE:([Andrew_Test].[dbo].[BigTable].[Date]<'2010-01-01 00:00:00.000'))

    vs

    Table 'BigTable'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    |--Clustered Index Scan(OBJECT:([Andrew_Test].[dbo].[BigTable].[BigTable_PK]), WHERE:([Andrew_Test].[dbo].[BigTable].[Date]<'2010-01-01 00:00:00.000'))

  • The shrink for some reason didn't work. Although I can see that BigTable has 9GB of unused space, the shrink didn't shrink the database file at all...

    So I have rewritten my query to use SET ROWCOUNT and the ID:

    declare @C int
    set @C = 0

    SET ROWCOUNT 1000000
    DECLARE @deleted int
    SET @deleted = 1000000

    WHILE @deleted = 1000000
    begin
        set @C = @C + 1

        begin tran
            insert into _andrew (Data) values (@C)
        commit tran

        begin tran
            delete from BigTable where id < 171805215
            SET @deleted = @@ROWCOUNT
        commit tran
    end

    (If a few dates that are not in order are lost, I don't really care.)

    I'll let it sit there for a while and report back if it has improved the deletion speed.

    Thank you for all the advice!

  • How big is the table you are testing against?

    Also, are the table layouts the same?

    Also, doing a DBCC DROPCLEANBUFFERS before each run avoids the use of the cache; a typical pattern is sketched below.
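
    For illustration, a minimal sketch of that cold-cache test pattern (run it on a test box rather than production; the CHECKPOINT beforehand is an assumption, added so dirty pages are flushed before the clean buffers are dropped):

    CHECKPOINT                -- flush dirty pages to disk first
    DBCC DROPCLEANBUFFERS     -- empty the buffer pool so reads come from disk

    SET STATISTICS IO ON
    -- ...run the query under test and compare logical/physical reads...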

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.
