Deleting a lot of data


I recently had the task of deleting a bit over a billion rows from a table. Now I could have done just this:

DELETE FROM tablename WHERE createdate >= '1/1/2017'

But I have a few problems here. The table has no index on createdate, potentially causing problems with tempdb (the sort on createdate). Although in this case tempdb is pretty large because of some large batch work done at various times. I’m also going to be deleting > billion rows of ~6 billion which is probably going to fill up the log of the database (which fortunately isn’t in use yet) and end up rolling back my delete anyway. Even if I don’t fill up the log, I’m still going to bloat it pretty badly (autogrowth). And last, and anything but least, this is on a production server. Even if this database was on its own drive (meaning growth of the log can’t cause a problem with any other databases) that tempdb thing (let alone other resource usage) is going to be an issue.

So what do I do?

If I was deleting most of the table there would be several options. For example, dumping just the rows I’m interested in keeping, truncating the table, and putting what I’m keeping back. But since I’m keeping such a large number of rows it would be even worse than just trying to delete the rows I want to get rid of. So since I can’t delete it all, how about I delete a bit of it at a time?

DELETE TOP (2500000) /*2.5 mil*/ FROM tablename WHERE createdate >= '1/1/2017'

I’m going to have to run that a few times though right?

1,000,000,000 / 2,500,000 = 400

The first time I did this I tried using a WHILE loop and @@rowcount.

DELETE TOP (2500000) /*2.5 mil*/ FROM #tablename WHERE createdate >= '1/1/2017'
	DELETE TOP (2500000) /*2.5 mil*/ FROM #tablename WHERE createdate >= '1/1/2017'

Unfortunately, that ran into a slight problem. Since it’s a single batch it counts as a single transaction (as best I can tell) and so I ran into the problems above. In this case, the transaction log ran out of space and all of the deletes had to roll back.

So instead I did this:

DELETE TOP (2500000) /*2.5 mil*/ FROM #tablename WHERE createdate >= '1/1/2017'
GO 401

Well, really I started out doing 50 batches at a time to make sure things went well. Then after doing that I did a quick row count and divided by the 2.5 mil to see how many times I still needed to run the delete. Last but not least the 2.5 million was an educated guess as to the size of the delete that would run fastest. Too large a value and each delete takes too long, too small and you have to run too many deletes. I was also very lucky here that the database is in SIMPLE recovery. Otherwise, I would have had to do a number of batches, log backup, number of batches, log backup etc.

Regardless, it takes a while to delete a billion rows so have fun!