Deleting data in a large table

  • I have a large table with 17M+ rows for our package record audits.  I've decided to clear out anything older than 3 months, which would clear all but 1.2M records.  The table doesn't currently have any indexes.  What's the recommended way to clear the records from the table?  Should I consider adding an index to make clearing it faster?

  • If you're going to remove 25% or more of the data from a HEAP or a table with a Clustered Index present, the best thing to do all the way around is to create an empty copy of the table with only the Clustered Index (or nothing, if you want it to be a HEAP) and do a minimally logged insert of only what you want to keep into the new table.  If that works, then rename the tables so the original table gets _Old appended to its object name and the new table takes the original name.

    Then, add the Non-Clustered indexes and any keys back to the new table.

    If you have scratch and tmp schemas, you can even change which schema the Primary Key constraint is in so that you don't have to violate any naming conventions there (constraints must be uniquely named within a given schema, just like other objects; non-clustered indexes are an exception because they're subordinate to the table or heap they belong to).

    A lot of people have different names for this type of thing... I just refer to it as a "Swap'n'Drop".
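
    For reference, here's a rough sketch of the "Swap'n'Drop" in T-SQL.  The table and column names (dbo.PackageAudit, AuditDate, AuditID) are only placeholders for illustration; adjust them to match the real table:

    -- 1) Create an empty copy of the table with only the Clustered Index.
    CREATE TABLE dbo.PackageAudit_New
    (
        AuditID   INT           NOT NULL,
        AuditDate DATETIME      NOT NULL,
        AuditData VARCHAR(8000)     NULL
    );

    CREATE UNIQUE CLUSTERED INDEX CI_PackageAudit_New
        ON dbo.PackageAudit_New (AuditDate, AuditID);

    -- 2) Minimally logged insert of only the rows you want to keep
    --    (needs TABLOCK and the BULK_LOGGED or SIMPLE Recovery Model).
    INSERT INTO dbo.PackageAudit_New WITH (TABLOCK)
            (AuditID, AuditDate, AuditData)
    SELECT  AuditID, AuditDate, AuditData
      FROM  dbo.PackageAudit
     WHERE  AuditDate >= DATEADD(MONTH, -3, GETDATE());

    -- 3) Swap the names, then add the Non-Clustered indexes and keys back
    --    and drop the _Old table once the row counts have been verified.
    EXEC sp_rename 'dbo.PackageAudit',     'PackageAudit_Old';
    EXEC sp_rename 'dbo.PackageAudit_New', 'PackageAudit';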

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I also read about using 'SWITCH' to map the data to the backup table, then insert the records back into the original table.   Once the data dump is done, I'll still have over 1M records in the original table.  Would you suggest an index to help with performance of future daily table cleanup?   We rarely ever query the data.

  • Jackie Lowery wrote:

    I also read about using 'SWITCH' to map the data to the backup table, then insert the records back into the original table. Once the data dump is done, I'll still have over 1M records in the original table. Would you suggest an index to help with performance of future daily table cleanup? We rarely ever query the data.

    SWITCH only works on either a "whole table" or a partition of a Partitioned Table, and you'd have to make an empty copy of the table to do the SWITCH as well.  I'm thinking that's not an option here (which is why I didn't originally recommend it): adding the Clustered Index needed to partition the table would require the space to hold an additional copy of the table until the Clustered Index is done, and you'd then be left with a huge amount of empty space where the original table (heap) used to be.

    So take a log file backup just before you switch to the BULK_LOGGED Recovery Model and, when you're done with the "Swap'n'Drop", switch back to the FULL Recovery Model and take another log file backup.  That limits the window where point-in-time restores aren't possible because a minimally logged bulk operation has occurred.
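
    A minimal sketch of that Recovery Model dance (the database name and backup path are made up for illustration):

    -- Log backup just before the minimally logged work.
    BACKUP LOG PackageDB TO DISK = N'X:\Backups\PackageDB_BeforeSwap.trn';

    -- Switch to BULK_LOGGED for the minimally logged insert.
    ALTER DATABASE PackageDB SET RECOVERY BULK_LOGGED;

    -- ... do the "Swap'n'Drop" insert here ...

    -- Switch back to FULL and close the no-point-in-time-restore window.
    ALTER DATABASE PackageDB SET RECOVERY FULL;
    BACKUP LOG PackageDB TO DISK = N'X:\Backups\PackageDB_AfterSwap.trn';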

     

    --Jeff Moden



  • As for the old table, you were going to delete all but a million rows.  When you're sure the copy worked correctly, just drop the old table.

    Also, decide the maximum amount of time you want to keep data in this table.  Add an index on the "Date_Created" or other temporal column, using the same "keys" you used when deciding what the cutoff for the million rows was.  That should be a Clustered Index for this history table, and try to make it unique, ideally as the Primary Key.  It'll help with the rare lookups, as well.

    Then do nightly "rolloffs" from the logical and, hopefully, physical beginning of the table to delete any rows that have exceeded the desired "keep it until" age.
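
    As an illustration only (the table and column names are assumed, not confirmed), that Clustered Primary Key could look like this, with the temporal column leading and an ID column as the tiebreaker to make it unique:

    -- Both columns must be NOT NULL to take part in a Primary Key.
    ALTER TABLE dbo.PackageAudit
      ADD CONSTRAINT PK_PackageAudit
          PRIMARY KEY CLUSTERED (AuditDate, AuditID);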

     

    --Jeff Moden



  • Thanks a lot.  I can make a plan from that.

  • Why not create a NONCLUSTERED index on the datetime column and "walk through the table in small batches"?

    Small chunks of data prevent LOCK ESCALATION, and you can run it smoothly in the background, or you can create a job that cleans up your data outside of business hours:

    DECLARE @row_count INT = 1;
    DECLARE @batch_size INT = 3000;

    -- Delete in small batches until no qualifying rows remain; batches of
    -- a few thousand rows stay below the lock escalation threshold.
    WHILE @row_count > 0
    BEGIN
        DELETE TOP (@batch_size)
          FROM dbo.mytable
         WHERE datetime_column < DATEADD(MONTH, -3, GETDATE());

        SET @row_count = @@ROWCOUNT;
    END
    GO

    Microsoft Certified Master: SQL Server 2008
    MVP - Data Platform (2013 - ...)
    my blog: http://www.sqlmaster.de (german only!)

  • You’ve already got two good patterns from Jeff and Uwe, so I’d approach it as a mix of both, depending on how much downtime and extra space you can afford right now.

    Since you’re removing approximately 93% of the table and this is essentially an audit/history table that’s rarely queried, Jeff’s Swap’n’Drop is usually the cleanest one-time fix: create a new table with the same structure but with a clustered index on your datetime / created date column (ideally a unique PK), insert only the last 3 months into it (using TABLOCK and BULK_LOGGED/SIMPLE to keep logging down), then rename the tables and drop the old heap once you’ve verified the row counts. That avoids doing a massive logged delete on 16M+ rows and gives you a properly clustered “append-only” history table that will perform better for any rare lookups you do.

    For ongoing clean-up, it’s absolutely worth having an index just to support deletes, even if users rarely query this table. I’d still keep the clustered index on the datetime column, then use Uwe’s pattern as a nightly/weekly job so you never get into this situation again. Something like:

    DECLARE @row_count int = 1;
    DECLARE @batch_size int = 3000;

    WHILE @row_count > 0
    BEGIN
        DELETE TOP (@batch_size)
          FROM dbo.PackageAudit
         WHERE AuditDate < DATEADD(MONTH, -3, SYSDATETIME());

        SET @row_count = @@ROWCOUNT;
    END;

    With the datetime column indexed, those batch deletes will use an efficient seek, avoid lock escalation, and can happily run in the background or during off-hours. If, in the future, the table keeps growing and you need even faster roll-offs, you can look at partitioning by date and using SWITCH to drop whole partitions at once – but for 17M rows and a 3-month retention window, a one-time Swap’n’Drop plus indexed, batched deletes is usually more than enough.
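
    Purely as an illustration of that partition-SWITCH idea (it assumes the table has already been partitioned by month on the date column, and that a staging table with an identical structure exists on the same filegroup; all names are made up):

    -- Switch the oldest partition's rows out of the main table...
    ALTER TABLE dbo.PackageAudit
        SWITCH PARTITION 1 TO dbo.PackageAudit_Staging;

    -- ...then get rid of them almost instantly.
    TRUNCATE TABLE dbo.PackageAudit_Staging;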


  • Uwe Ricken wrote:

    Why not create a NONCLUSTERED index on the datetime column and "walk through the table in small batches"?

    Small chunks of data prevent LOCK ESCALATION, and you can run it smoothly in the background, or you can create a job that cleans up your data outside of business hours:

    DECLARE @row_count INT = 1;
    DECLARE @batch_size INT = 3000;

    WHILE @row_count > 0
    BEGIN
        DELETE TOP (@batch_size)
          FROM dbo.mytable
         WHERE datetime_column < DATEADD(MONTH, -3, GETDATE());

        SET @row_count = @@ROWCOUNT;
    END
    GO

    Thanks for the batch code.  I'll use that after the table is cleaned up.  Going to add an index on the datetime column as well.

  • I lean towards Uwe's solution, especially when I need the table to stay online and accept insert/update/delete while I'm deleting data.

  • Steve Jones - SSC Editor wrote:

    I lean towards Uwe's solution, especially when I need the table to stay online and accept insert/update/delete while I'm deleting data.

    I was thinking about this too.   One question though.   Would you recommend creating the index on the datetime column first, then running the batch?

  • Yes, indexing makes a huge difference, and the performance savings can far outweigh the index creation time/CPU.
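
    For illustration, that index could be as simple as a nonclustered index on the datetime column (names assumed from earlier in the thread, not confirmed):

    -- Lets the batched DELETE seek to the rows past the cutoff date
    -- instead of scanning the whole heap on every pass.
    CREATE NONCLUSTERED INDEX IX_PackageAudit_AuditDate
        ON dbo.PackageAudit (AuditDate);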
