SQLServerCentral Article

Deleting large number of rows from a table in a VLDB


Introduction

Most enterprise-class systems, with the help of their interfacing systems, tend to accumulate huge amounts of data over time. Unchecked data growth and distribution increase risk as well as storage and management costs. Management of years of production data within a single system also impacts service levels and disaster recovery initiatives. [1]

This leads to the unquestionable need to purge old data after archiving it for future use and/or compliance purposes. Purging data from any system can be done in multiple ways, but not all of them are optimal. Recently, I was asked to help a team at work that was struggling to maintain a pre-defined purging window for an inherited enterprise system undergoing enhancements. The entire exercise was very interesting, and I thought it would be wonderful to share my research with the community and also get feedback on the various methods that you have found successful over the years. Today, I will attempt to demonstrate the approach I found most suitable for the assignment I had been given.

Delete Scenarios

Deletion of data is a very serious operation; once deleted, chances are the data is gone forever. Names, phone numbers, addresses, order information, stock status, and more will be lost as soon as the transaction is committed. We can classify all delete operations into the following two categories:

  1. Instantaneous delete - immediate; generally used when deleting a single record (produced, perhaps, by a user error or some similar scenario)
  2. Batch delete - asynchronous; generally used when deleting a subset of data

Each of these can further be classified into:

  1. Soft delete - no permanent data removal, easily recoverable from user errors
  2. Hard delete - permanent data loss; not recoverable without DBA intervention (and only if the recovery model supports it)

Over the years, a lot has been written and debated about the merits and demerits of soft vs. hard deletes, and hence, I will not be addressing either of them in this article.

My assignment required me to come up with solutions for deleting data from both a clustered table and a heap. Another important constraint was that the database size must not increase as a result of the purge.

Building the Scenario

The basic scenario is that we have an external system that pumps in a huge amount of data - about 85000 records/day, or approximately one record per second. Over time, we have gathered more than 75,000,000 rows, of which about 10,000,000 rows need to be purged.

The records to be removed may or may not be in sequence. This is determined by a number of factors, one of which is whether or not a particular set of records has been archived, which in turn depends upon how frequently the records are used.

For this particular article, I will divide the exercise into two parts:

  1. I will use a random set to demonstrate deletes from a clustered table, and
  2. A sequential set to demonstrate deletes from a heap

The Environment

As mentioned earlier, we had inherited this database. One of the first things we noticed was that the database had default values for its initial file size and growth. Running frequent, large-batch updates/deletes/inserts impacts overall database performance because:

  1. For all files, the free space initialization and allocation time increases with each new growth request
  2. For log files, a large number of VLF (Virtual Log Files) are created (http://www.sqlskills.com/BLOGS/KIMBERLY/post/Transaction-Log-VLFs-too-many-or-too-few.aspx)
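
If you are curious how fragmented the inherited log already is, one quick way to count the VLFs is the undocumented (but widely used) DBCC LOGINFO command; each row it returns represents one VLF. A minimal sketch - run it in the context of the database you are inspecting ([DeleteRows] here is simply the test database created later in this article):

--Count the VLFs in the current database's log
USE [DeleteRows]
GO
DBCC LOGINFO
GO
--The number of rows returned equals the number of VLFs in the log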

To better handle this, and to ensure that the SQL Server spends less time in requesting more disk space from the OS, we changed the initial file size and the growth increments.
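
For an existing (inherited) database, the same change can be made without recreating anything by using ALTER DATABASE ... MODIFY FILE. The sketch below is illustrative only - the logical file names match the test database created later in this article, and the sizes are assumptions, not recommendations:

--Bump the initial size and growth increment of existing files
--(SIZE can only be increased this way, not decreased)
ALTER DATABASE [DeleteRows]
MODIFY FILE ( NAME = N'DeleteRows', SIZE = 1GB, FILEGROWTH = 256MB );
ALTER DATABASE [DeleteRows]
MODIFY FILE ( NAME = N'DeleteRows_log', SIZE = 800MB, FILEGROWTH = 512MB );
GO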

Because the server that I will be running today's tests on is not a fancy one (in fact it is one of my development boxes, and does not even classify as a production-grade machine), the initial file size and growth values will be set to the following:

  1. Data File size - around 1GB initial, 1MB growth increment
  2. Log File size - around 800MB initial, 25% growth increment

Test Environment Configuration (for reference):
  Server OS: Windows Server 2008 (Standard)
  SQL Server version: SQL 2008, SQL 11 ("Denali")
  Server Type: Virtual Machine
  RAM: 1.5 GB
  Processor: Intel Core 2 Duo @ 2.80GHz
  Hard-drive type: Virtual ATA Drives

Deleting Random Data from a Table

First of all, let's generate the test data required for our little experiment. Please note that throughout this article, the scripts mentioned have been tested on SQL Server 2008 and above (R2 and "Denali").

The steps that we will follow are:

  1. Create a test database
  2. Create a test table - note that because we expect a large number of records, we have the key value set as a BIGINT
  3. Create the test data - for my test, I generated about 75,000,000 rows
--Create a test database
USE [master]
GO
CREATE DATABASE [DeleteRows]
 ON PRIMARY ( NAME = N'DeleteRows',
   FILENAME = N'E:\Databases\DeleteRows\DeleteRowsData.mdf',
   SIZE = 1065984KB,     --Use an initial size higher than the default
   MAXSIZE = UNLIMITED,
   FILEGROWTH = 1024KB ) --Data file growth increment
 LOG ON ( NAME = N'DeleteRows_log',
   FILENAME = N'E:\Databases\DeleteRows\DeleteRowsLog.ldf',
   SIZE = 894336KB,      --Use an initial size higher than the default
   MAXSIZE = 2048GB,
   FILEGROWTH = 25% )    --Log file growth increment
GO
--Create a Test Table
USE [DeleteRows]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[DeleteRowsTbl](
[UnixDate] [BIGINT] IDENTITY(315532800,1) NOT NULL,
 [SomeName] [VARCHAR](100) NULL,
PRIMARY KEY CLUSTERED 
(
 [UnixDate] ASC
)WITH (PAD_INDEX = OFF, 
 STATISTICS_NORECOMPUTE = OFF, 
 IGNORE_DUP_KEY = OFF, 
 ALLOW_ROW_LOCKS = ON, 
 ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
--Generate some test data
USE [DeleteRows]
GO
INSERT INTO [dbo].[DeleteRowsTbl]
SELECT msc.name + CONVERT(VARCHAR(10),ROUND(RAND()*1000,0))
FROM msdb.sys.objects mso (NOLOCK)
CROSS JOIN msdb.sys.columns msc (NOLOCK)
GO 40        -- GO is used as a batch iterator

-- Check the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]
--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
Name AS 'Logical File Name',
Physical_Name AS 'Physical File Name', 
 (size*8)/1024 AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO

At the end of this exercise, I checked the size of my database. Please note the size statistics below:

Test Data Parameters
  # of records generated: 77,495,040
  Size of Data file: 2694 MB
  Size of Log file: 1705 MB

The Purge Strategy - Using Lookup tables

In order to delete random values, most novice database programmers would start dynamically generating a huge WHERE clause, which puts unnecessary strain on the server. Working on clustered tables allows us to leverage the indexes defined on them, which means we can use a lookup table to drive the deletes.
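
For contrast, here is a sketch of the kind of dynamically generated statement we want to avoid (the key values are purely illustrative); with millions of keys, building, parsing and executing such a statement becomes a burden of its own:

--NOT recommended: a dynamically generated IN list
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'DELETE FROM dbo.DeleteRowsTbl '
         + N'WHERE UnixDate IN (315532801, 315532805, 315532899 /* ...millions more... */);';
EXEC sp_executesql @sql;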

The proposed concept is quite simple. The assumption is that the records to be purged are collected or "marked" over time and then the purge happens maybe once a week or so. The primary key for the records to be purged can be inserted into an indexed lookup table, which can then be used during the delete.

Let us create our lookup table, and populate it with some data, simulating the collection of records being marked for purge.

USE [DeleteRows]
GO
SET NOCOUNT ON
CREATE TABLE MyDeleteLookup (UnixDateToDelete BIGINT)
GO
BEGIN TRANSACTION
GO
CREATE NONCLUSTERED INDEX IX_MyDeleteLookup ON dbo.MyDeleteLookup
 (
UnixDateToDelete
 ) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
ALTER TABLE dbo.MyDeleteLookup SET (LOCK_ESCALATION = TABLE)
GO
COMMIT
--Populate the Look up table
INSERT INTO MyDeleteLookup
SELECT UnixDate 
FROM DeleteRowsTbl 
WHERE UnixDate LIKE '%10%'
--Check how many records we will be purging
SELECT COUNT(*) FROM dbo.MyDeleteLookup

For my test, the lookup table was populated with a total of 4,591,769 rows (roughly 6% of the available data). Now, let's use this lookup table to purge the respective rows from the parent table.

Running the Purge

Purging data from the parent table is now a simple matter of joining it with the lookup table and executing the delete.

--Run the Purge
USE [DeleteRows]
GO
SELECT GETDATE() AS 'Start DateTime'

DELETE Mytbl
FROM [dbo].[DeleteRowsTbl] Mytbl
INNER JOIN MyDeleteLookup Mylkp ON Mytbl.UnixDate = Mylkp.UnixDateToDelete
SELECT GETDATE() AS 'End DateTime'
--Check the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]
--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
Name AS 'Logical File Name',
Physical_Name AS 'Physical File Name', 
 (size*8)/1024 AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO  

Let's take a quick look at the delete statistics:

Statistics on deleting selected data from a table
  Data file size (Before): 3257 MB
  Log file size (Before): 9254 MB
  Time taken for the delete: 10 min, 1 sec
  Data file size (After): 3257 MB
  Log file size (After): 9254 MB

On the virtual hardware running my test, I believe the 10 minutes and 1 second taken for the purge is quite reasonable and efficient. If we observe the execution plan, we will notice that this is only possible because of the lookup table.
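
If you would like to see this for yourself, in Management Studio you can simply turn on "Include Actual Execution Plan" before running the purge; alternatively, a sketch that wraps the same DELETE statement used above is:

--Capture the actual execution plan for the purge
SET STATISTICS XML ON;
GO
DELETE Mytbl
FROM [dbo].[DeleteRowsTbl] Mytbl
INNER JOIN MyDeleteLookup Mylkp ON Mytbl.UnixDate = Mylkp.UnixDateToDelete;
GO
SET STATISTICS XML OFF;
GO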

Also, one thing that we notice immediately is that the "before" and "after" size of the database remained the same. For the log file, this is because I had specified a high initial size and high growth, and hence more contiguous free space was available to log the delete operation. As for the data file, a delete does not release the physical space back to the OS, and hence the physical file size remains the same as before.
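
You can confirm this with sp_spaceused - after the purge, the "unallocated space" figure grows while the physical file size stays put. A quick check:

--Database-level summary: reserved vs. unallocated space
USE [DeleteRows]
GO
EXEC sp_spaceused @updateusage = N'TRUE';
GO
--Table-level detail for the purged table
EXEC sp_spaceused N'dbo.DeleteRowsTbl';
GO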

Deleting Data from a Heap (non-clustered table)

Sometimes, especially in the case of data coming over from non-SQL systems (or non-RDBMS systems in general), the information being exchanged is in the form of a heap of data. As per MSDN, "A heap table, by definition, is a table that doesn't have any clustered indexes. Different pages of the heap-based table occupy different non-contiguous areas on a disk, and they are not linked together in any way." In our case, we had one such legacy interface sending data in as a heap. Because of multiple other dependencies, it was not possible to change the design such that we could define a clustered index on the heap.

Deleting data from a heap is much more difficult because we (and even SQL Server) do not have a clustered index to take advantage of. The lookup table method demonstrated above can still be used to delete random data, and while it will definitely help, it won't be as performant as it is with the clustered table.

One of the first recommendations we made to the customer was to work with the vendor to allow us to add a primary key column to the heap. However, that will take some time, so let's not worry about it right now.

When working with a heap, what becomes most important is to free up the table at regular intervals so as not to increase the bottlenecks to an unacceptable level. The most common way to do so is to limit the number of records being processed at any given point in time.

SET ROWCOUNT - the most common solution

In order to delete a range in manageable batches, we would normally have used the SET ROWCOUNT n option.
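
Purely for reference, here is a sketch of that classic pattern (the range values are the same illustrative ones used in the TOP-based script later in this article):

--Classic (now deprecated) batched delete using SET ROWCOUNT
SET ROWCOUNT 1000000        --Cap every subsequent statement at 1,000,000 rows
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.DeleteRowsTbl
    WHERE UnixDate BETWEEN 320000000 AND 323874752

    IF @@ROWCOUNT = 0 BREAK
END
SET ROWCOUNT 0              --Always reset, or the cap stays in effect for the session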

This solution has one major drawback - it is a deprecated feature in SQL Server 2008 and will not affect DELETE, INSERT, and UPDATE statements in the next release of SQL Server (SQL11/"Denali"). Hence, any new solutions should NOT be using this feature.

(Reference: http://msdn.microsoft.com/en-us/library/ms188774.aspx)

TOP - the alternative to SET ROWCOUNT

In order to limit the number of rows being processed, Microsoft has another offering - the TOP clause (optionally with PERCENT). You may read more about it here:

TOP (Transact-SQL)

Limiting Result Sets by TOP and PERCENT

Essentially, the prime difference between TOP and ROWCOUNT is that the query optimizer can use the value of the expression in the TOP clause when generating an execution plan for the query. Let's see how we can delete large amounts of data from a heap using the TOP clause.

Building the test data

I will provide the entire script below so that you can start from a new database (the number of rows we will be purging is similar, as can be seen from the statistics after the script).

--Create the test database
USE [master]
GO
CREATE DATABASE [DeleteRows]
 ON PRIMARY ( NAME = N'DeleteRows',
   FILENAME = N'E:\Databases\DeleteRows\DeleteRowsData.mdf',
   SIZE = 1065984KB,     --Use an initial size higher than the default
   MAXSIZE = UNLIMITED,
   FILEGROWTH = 1024KB ) --Data file growth increment
 LOG ON ( NAME = N'DeleteRows_log',
   FILENAME = N'E:\Databases\DeleteRows\DeleteRowsLog.ldf',
   SIZE = 894336KB,      --Use an initial size higher than the default
   MAXSIZE = 2048GB,
   FILEGROWTH = 25% )    --Log file growth increment
GO
--Create a test non-clustered table, i.e. heap
USE [DeleteRows]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
--Note that this is a heap - No Clustered Index!
CREATE TABLE [dbo].[DeleteRowsTbl](
 [UnixDate] [BIGINT] IDENTITY(315532800,1) NOT NULL,
 [SomeName] [VARCHAR](100) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
--Generate some test data
USE [DeleteRows]
GO
INSERT INTO [dbo].[DeleteRowsTbl]
SELECT msc.name + CONVERT(VARCHAR(10),ROUND(RAND()*1000,0))
FROM msdb.sys.objects mso (NOLOCK)
CROSS JOIN msdb.sys.columns msc (NOLOCK)
GO 40

--Check the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]
--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
Name AS 'Logical File Name',
Physical_Name AS 'Physical File Name', 
 (size*8)/1024 AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO

Running the purge

Now, let us run the purge using the TOP clause. Note that, as mentioned at the beginning of the article, we will be purging a range of records for this example.

--Run the purge
USE [DeleteRows]
GO
PRINT N'Starting Delete at: ' + CONVERT(NVARCHAR(25),GETDATE())
GO
DECLARE @MinRangeValue BIGINT
DECLARE @MaxRangeValue BIGINT
DECLARE @tIterationLimit INT
DECLARE @tRowCount INT
DECLARE @tBatchCounter INT
DECLARE @stmnt NVARCHAR(200)
DECLARE @params NVARCHAR(200)
SET @stmnt = N'DELETE TOP (@tIterationLimit) FROM DeleteRowsTbl WHERE UnixDate BETWEEN @tMinValue AND @tMaxValue'
SET @params = N'@tIterationLimit INT, @tMinValue BIGINT, @tMaxValue BIGINT'
 
SET @MinRangeValue = 320000000 --Some start value (valid)
SET @MaxRangeValue = 323874752 --Some end value (valid)
SET @tIterationLimit = 1000000 --Remove only 1000000 at a time
SET @tRowCount = 9999999 --Seed value so that the loop is entered at least once
SET @tBatchCounter = 0
WHILE (@tRowCount > 0)
BEGIN
BEGIN TRANSACTION DeleteRecords
     EXEC sp_executesql @stmnt, @params, @tIterationLimit = @tIterationLimit, @tMinValue = @MinRangeValue, @tMaxValue = @MaxRangeValue;
     
--Standard Error handling (TRY...CATCH, etc.) must go here
     SET @tRowCount = @@ROWCOUNT
     SET @tBatchCounter += 1
COMMIT TRANSACTION DeleteRecords     
  --You may introduce a little wait here so that the heap is free for other operations. In this example, we won't because nothing's waiting for this heap
  --WAITFOR DELAY '00:05'
END
GO
PRINT N'Ending Delete at: ' + CONVERT(NVARCHAR(25),GETDATE())
GO 
--Determine the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]
--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
Name AS 'Logical File Name',
Physical_Name AS 'Physical File Name', 
 (size*8)/1024 AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO
--Most important of all, clean up the environment
USE master
GO
ALTER DATABASE DeleteRows SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE DeleteRows;
GO          

Size & time statistics for deleting data from a heap
  Data File Size (Before): 2678 MB
  Log File Size (Before): 1364 MB
  # of rows deleted: 3,874,752
  Time taken for the purge: 6 minutes
  Data File Size (After): 2678 MB
  Log File Size (After): 1364 MB

As an alternative to the TOP clause, you may also want to experiment with dynamically modifying the min and max values of the range as a sliding window (i.e. the first iteration of the delete covers records n to n+m, the next round covers (n+m) to (n+m+p), and so on).
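
A minimal sketch of such a sliding window is given below; the window boundaries and window size are illustrative assumptions, not values from the original purge:

--Hypothetical sliding-window delete over the key range
DECLARE @WindowStart BIGINT = 320000000     --n: lower bound of the purge range
DECLARE @WindowSize  BIGINT = 1000000       --m: width of each window (by key range)
DECLARE @RangeEnd    BIGINT = 323874752     --overall upper bound of the purge range

WHILE @WindowStart < @RangeEnd
BEGIN
    DELETE FROM dbo.DeleteRowsTbl
    WHERE UnixDate >= @WindowStart
      AND UnixDate < @WindowStart + @WindowSize

    SET @WindowStart += @WindowSize         --Slide the window forward
END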

Do non-clustered indexes on a heap help?

As I was researching this and discussing it with some of my friends, one of them thought aloud: we cannot have a clustered index on our heap, but can we add a non-clustered index that would help speed up the delete?

What better way to answer the question than to run a simple test? I started from scratch, and just after Step #2 (creation of the table), I applied the following change and then ran the rest of the delete script to find out.

--Create a non-clustered index
--(ADDITIONAL TEST ONLY)
IF EXISTS (SELECT name FROM sys.indexes
 WHERE name = N'Idx_NC_DeleteRowsTbl' 
AND object_id = OBJECT_ID(N'dbo.DeleteRowsTbl'))
DROP INDEX Idx_NC_DeleteRowsTbl
 ON dbo.DeleteRowsTbl;
GO
CREATE NONCLUSTERED INDEX Idx_NC_DeleteRowsTbl
ON [dbo].[DeleteRowsTbl] (UnixDate)

After the non-clustered index was set up and the test data generated, we ran the purge. The time taken for the purge was now only 4 minutes!

Conclusion

For a clustered table, the approach of using a lookup table is both tried and tested - our code has now been running in production for the last couple of months, and the purging window has dropped by more than half of what it was previously.

As for the heap, despite the obvious storage and maintenance costs of the index, the addition of a non-clustered index definitely helps when deleting records. A combination of a non-clustered index and the TOP clause makes deletions from heaps a lot faster, and that is what we went ahead with.

References

[1] Christy Pettey, Ben Tudor, "Data Growth as the Largest Data Center Infrastructure Challenge", Gartner Survey , http://www.gartner.com/it/page.jsp?id=1460213

[2] Robert Sheldon, "The DELETE statement in SQL Server", Simple Talk Publishing, http://www.simple-talk.com/sql/learn-sql-server/the-delete-statement-in-sql-server/

Before we say good-bye

Before we say good-bye, I would welcome your feedback on this article. More importantly, do you ever need to delete large quantities of data from a VLDB? Do leave your inputs in the discussion associated with this article.

If you liked this one, I would also request you to visit http://www.sqlservercentral.com/articles/Connection+Strings/71669/ to read my article on an introduction to Connection Strings and their importance. In the article, I also provide a simple Quick Reference Sheet that you can use to write your connection strings. I hope you find it very useful.

Until we meet next time,

Be courteous. Drive responsibly.

http://beyondrelational.com/blogs/nakul/default.aspx
