What is the most efficient way to delete records?

  • Hello,

    I have a database that is 230GB in size. A single table in this database has 200 million rows and takes up 180GB of the database's size. I'm almost out of space on both the data and log disks.

    I want to clear up some space by deleting old records from the big table. The table has a date column and an index on this column.

    I want to delete all the records dated before 1 Jan 2010; that will leave about 30 million records in this table and remove the rest.

    I can't delete them all in one go because of space constraints on the transaction log disk, so I'm removing them in batches.

    This is the script that I wrote for this purpose:

    declare @C int
    set @C = 0

    while exists (select top 1 * from bigtable where [date] < '1 jan 2010')
    begin
        set @C = @C + 1

        begin tran
            insert into _andrew (Data) values (@C)
        commit tran

        begin tran
            delete top (1000000) from bigtable where [date] < '1 jan 2010'
        commit tran
    end

    The log table looks like this:

    CREATE TABLE [dbo].[_andrew](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [Data] [int] NOT NULL,
        [Time] [datetime] NOT NULL CONSTRAINT [DF__andrew_Time] DEFAULT (getdate()),
        CONSTRAINT [PK__andrew] PRIMARY KEY CLUSTERED ([Id] ASC)
            WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
                  ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    The script has taken 18 hours to delete 24 million records so far. This is somewhat unexpected to me, as I had hoped that 18 hours would be plenty to remove all the records that I wanted removed. Obviously I was wrong.

    Could you please recommend any way of speeding up the process?

    Cheers,

    Andrew

  • The 1st problem you have is in:

    WHILE EXISTS (SELECT TOP 1 ...)

    2nd Problem:

    DELETE TOP (1000000)..

    To find the TOP (1000000), SQL must check all the records.

    Rather:

    SET ROWCOUNT 1000000

    DECLARE @deleted int
    SET @deleted = 1000000   -- use the same value as in the SET ROWCOUNT

    WHILE @deleted = 1000000
    BEGIN
        BEGIN TRAN
            DELETE FROM bigtable WHERE [date] < '1 jan 2010'
            SET @deleted = @@ROWCOUNT   -- NOTE: before the COMMIT TRAN
        COMMIT TRAN
    END

    This way SQL will delete a maximum of 1,000,000 records and stop, without first checking that they are the top.

    I would also not delete that many records at once; 10,000 to 50,000 is the sort of range I would go for.

    If you really want to track the number of times it loops, do it all in the same transaction bracket.

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)

    2nd Problem:

    DELETE TOP (1000000)..

    To find the TOP (1000000), SQL must check all the records.

    Thank you for your feedback. Could you please expand on the above? Why do you say that to find the TOP (1000000) SQL must check all the records?

    Thanks,

    Andrew

  • Just another possible plan of attack.

    Create another table to hold the 30 million rows you want to keep.

    Copy the 30 million rows to the new table in batches based on dates.

    Drop the old table (it's faster to deallocate than to delete).

    Rename the new table to the old name.

    If you have a window of time when no updates are happening to this table, this might be a lot quicker than deleting rows.
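
    For illustration, a minimal sketch of this copy/drop/rename approach. The table and column names are placeholders, and it assumes batching with TOP and a keep-criterion of [Date] >= '1 Jan 2010':

    -- Sketch only: placeholder names; the column list must match the real table.
    CREATE TABLE dbo.BigTable_keep (
        Id     int      NOT NULL PRIMARY KEY,
        [Date] datetime NOT NULL
        -- ...remaining columns copied from dbo.BigTable
    )

    -- Copy the rows to keep in batches so each transaction stays small.
    DECLARE @copied int
    SET @copied = 1
    WHILE @copied > 0
    BEGIN
        INSERT INTO dbo.BigTable_keep (Id, [Date])
        SELECT TOP (50000) b.Id, b.[Date]
        FROM dbo.BigTable b
        WHERE b.[Date] >= '20100101'
          AND NOT EXISTS (SELECT 1 FROM dbo.BigTable_keep k WHERE k.Id = b.Id)
        SET @copied = @@ROWCOUNT
    END

    DROP TABLE dbo.BigTable   -- deallocation, so it is fast and minimally logged
    EXEC sp_rename 'dbo.BigTable_keep', 'BigTable'
    -- Recreate any indexes, constraints and permissions on the renamed table.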


    And then again, I might be wrong ...
    David Webb

  • A further comment: try to use a clustered index/primary key in the WHERE clause. You don't say if the table is clustered on [DATE].

    If [DATE] isn't clustered but there's an IDENTITY column or similar used for clustering and this matches the date sequence, then use a single lookup to find the MAX(IdentityColumn) where [DATE] < '1 Jan 2010'

    Then use the result of this in the DELETE... WHERE clause.

    You may also want to add a WAITFOR DELAY '00:00:02' in the loop, as my suggestion is fairly efficient and will hammer your server, possibly to the detriment of other processes. A sketch of the combined loop is below.
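
    For illustration, a minimal sketch of that combined loop, assuming the clustering identity column is named Id and using the SET ROWCOUNT batching suggested above:

    -- Sketch only: assumes an identity column named Id that roughly tracks [date].
    DECLARE @maxid int
    SELECT @maxid = MAX(Id) FROM bigtable WHERE [date] < '1 jan 2010'

    SET ROWCOUNT 50000
    DECLARE @deleted int
    SET @deleted = 50000
    WHILE @deleted = 50000
    BEGIN
        BEGIN TRAN
            DELETE FROM bigtable WHERE Id <= @maxid
            SET @deleted = @@ROWCOUNT
        COMMIT TRAN
        WAITFOR DELAY '00:00:02'   -- breathing room for other processes
    END
    SET ROWCOUNT 0   -- reset so later statements are not limited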

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • David Webb-200187 (5/6/2010)


    Just another possible plan of attack.

    create another table to hold the 30 million rows you want to keep

    copy the 30 million rows to the new table in batches based on dates.

    drop the old table ( faster to deallocate than to delete)

    rename the new table to the old name.

    If you have a window of time when no updates are happening to this table, this might be a lot quicker than deleting rows.

    This is what I'm currently considering doing. The problem is that I don't have free space, so right at the moment I'm running a shrink on the database to reclaim the space from the 24 million rows that were already deleted. I have no idea how many hours this will take. After that I hope I'll be able to copy / drop / rename in a reasonable time frame. Thank you for the suggestion.

  • You asked: Why are you saying that to find the TOP (1000000) SQL must check all the records?

    The only way to find the TOP record is to look at them all, then sort them in the required order, then take the top one. It helps if you have a clustered index, but SQL is still doing a lot of extra reads.

    Try running the queries below with the query plan display turned on and after setting

    SET STATISTICS IO ON

    SELECT TOP (1) * FROM Big_Table WHERE [DATE] < '1 Jan 2010'

    vs

    SET ROWCOUNT 1

    SELECT * FROM Big_Table where [DATE] < '1 Jan 2010'

    They may return different rows, but in your case, since you are controlling the delete with [Date] < '1 Jan 2010', you don't really care if the TOP 1000000 are deleted in any particular order; you just want them deleted.

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)


    A further comment: try to use a clustered index/primary key in the WHERE clause. You don't say if the table is clustered on [DATE].

    It is not. It is clustered on the identity column, but the date column order roughly corresponds to the identity order. In other words, the records were added chronologically most of the time (but not always).

    If [DATE] isn't clustered but there's an IDENTITY column or similar used for clustering and this matches the date sequence, then use a single lookup to find the MAX(IdentityColumn) where [DATE] < '1 Jan 2010'

    Then use the result of this in the DELETE... WHERE clause.

    The date column matches your description, so the Max Id can be found easily. However, my problem is that SQL automatically wraps the delete in a transaction and thus copies the records it's about to delete to the transaction log, which doesn't have that much space. This is exactly why I tried to delete in batches: this way data is copied to the transaction log, deleted from the database, and the transaction log portion of it is released on a checkpoint, as the recovery model is simple. So for each batch the same amount of transaction log space is reused.

    EDIT: Sorry, I misread you here; you are suggesting substituting the WHERE clause in my script to use the clustered index identity column in the condition instead of the date column. Gotcha.
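
    A minimal sketch of that log-reuse pattern (the explicit CHECKPOINT and the DBCC SQLPERF check are assumptions, relying on simple recovery truncating the log at checkpoints):

    BEGIN TRAN
        DELETE TOP (50000) FROM bigtable WHERE [date] < '1 jan 2010'
    COMMIT TRAN
    CHECKPOINT                -- mark the committed batch's log records reusable

    DBCC SQLPERF(LOGSPACE)    -- optional: watch log file usage between batches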

  • If you are also shrinking the database, my recommendations are:

    Do the shrink on the file (DBCC SHRINKFILE) rather than shrinking the whole database.

    Shrink it in small amounts multiple times, rather than trying to go down in one big block.

    Tests I've done in 16 years as a DBA show that for large databases it's more efficient to shrink in small amounts. A sketch is below.
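
    For illustration, a sketch of that stepped shrink; the logical file name is a placeholder, and DBCC SHRINKFILE takes a target size in MB:

    -- Sketch: step the data file down a few GB at a time (placeholder file name).
    DBCC SHRINKFILE (N'Andrew_Test_Data', 175000)   -- target size in MB
    DBCC SHRINKFILE (N'Andrew_Test_Data', 170000)
    DBCC SHRINKFILE (N'Andrew_Test_Data', 165000)
    -- ...repeat in small steps rather than one large shrink to the final size.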

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • You said: In other words, the records were added chronologically most of the time (but not always).

    Check if there are any records that would be incorrectly deleted if you used my recommendation; then you can do something like add:

    AND ID NOT IN (1,5,6,....)

    Or you could put these problem IDs into a table and use a join to eliminate them.

    I would put in the effort to use the clustered index; it will make a big difference. A sketch of the check is below.
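
    For illustration, a minimal sketch of that check, assuming the clustering identity column is named Id:

    -- Sketch: find rows an "Id <= @maxid" delete would wrongly remove.
    DECLARE @maxid int
    SELECT @maxid = MAX(Id) FROM bigtable WHERE [date] < '1 jan 2010'

    SELECT Id, [date]
    FROM bigtable
    WHERE Id <= @maxid
      AND [date] >= '1 jan 2010'   -- out-of-sequence rows to exclude from the delete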

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)

    The only way to find the TOP record is to look at them all, then sort them in the required order, then take the top one off the top.

    Yes, but this applies to ordered queries. My query is not ordered (there is no ORDER BY clause). Where does this sort you are referring to come from?

  • I'm not sure what you mean by this:

    .... However my problem is that SQL will automatically wraps delete in a transaction, and thus moves of the records its about to delete to transaction log

    To use the MAX(ID), look it up once and store it in a variable:

    DECLARE @MAXID int

    select @MAXID = MAX(ID) from Big_Table where [Date] < '1 Jan 2010'

    But your approach is correct. Keep the database in SIMPLE mode, keep the batch sizes reasonable, and do the deletes in a transaction bracket within the loop. You probably get a checkpoint every 59 seconds or so, so the WAITFOR DELAY will ensure the log is kept as small as possible.

    If you have any spare capacity on another disk, you could create a second log file for the duration of the delete process and then delete it later, if log space is a problem. A sketch follows.
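
    For illustration, a sketch of the temporary second log file (database name, file name and path are placeholders; the file must be empty before it can be removed):

    -- Sketch: add a temporary second log file on a disk with spare capacity.
    ALTER DATABASE Andrew_Test
    ADD LOG FILE (NAME = N'Andrew_Test_log2',
                  FILENAME = N'E:\SQLLogs\Andrew_Test_log2.ldf',
                  SIZE = 10GB)

    -- After the purge, empty the file and then remove it.
    DBCC SHRINKFILE (N'Andrew_Test_log2', EMPTYFILE)
    ALTER DATABASE Andrew_Test REMOVE FILE Andrew_Test_log2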

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Leo.Miller (5/6/2010)

    Try running the queries below with the query plan display turned on and after setting

    SET STATISTICS IO ON

    SELECT TOP (1) * FROM Big_Table WHERE [DATE] < '1 Jan 2010'

    vs

    SET ROWCOUNT 1

    SELECT * FROM Big_Table where [DATE] < '1 Jan 2010'

    This is what I'm seeing when I try this:

    Table 'BigTable'. Scan count 1, logical reads 4, physical reads 3, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    |--Top(TOP EXPRESSION:((1)))

    |--Clustered Index Scan(OBJECT:([Andrew_Test].[dbo].[BigTable].[BigTable_PK]), WHERE:([Andrew_Test].[dbo].[BigTable].[Date]<'2010-01-01 00:00:00.000'))

    vs

    Table 'BigTable'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    |--Clustered Index Scan(OBJECT:([Andrew_Test].[dbo].[BigTable].[BigTable_PK]), WHERE:([Andrew_Test].[dbo].[BigTable].[Date]<'2010-01-01 00:00:00.000'))

  • The shrink for some reason didn't work. Although I can see that BigTable has 9GB of unused space, the shrink didn't shrink the database file at all...

    So I have rewritten my query to use SET ROWCOUNT and the ID:

    declare @C int
    set @C = 0

    SET ROWCOUNT 1000000
    DECLARE @deleted int
    SET @deleted = 1000000

    WHILE @deleted = 1000000
    begin
        set @C = @C + 1

        begin tran
            insert into _andrew (Data) values (@C)
        commit tran

        begin tran
            delete from BigTable where id < 171805215
            SET @deleted = @@ROWCOUNT
        commit tran
    end

    (If a few dates that are not in order are lost, I don't really care.)

    I'll let it sit there for a while and report back if it has improved the deletion speed.

    Thank you for all the advice!

  • How big is the table you are testing against?

    Also, are the table layouts the same?

    Also, doing a DBCC DROPCLEANBUFFERS before each run avoids the use of the cache; a typical pattern is sketched below.
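
    For illustration, a minimal sketch of that cold-cache test pattern (run it on a test box rather than production; the CHECKPOINT beforehand is an assumption, added so dirty pages are flushed before the clean buffers are dropped):

    CHECKPOINT                -- flush dirty pages to disk first
    DBCC DROPCLEANBUFFERS     -- empty the buffer pool so reads come from disk

    SET STATISTICS IO ON
    -- ...run the query under test and compare logical/physical reads...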

    Leo

    Yes!

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.
