
SQL #50–Deleting or updating data from a table of 20+ GB


I don’t have a definition of what is considered a large SQL table. I am pretty sure, however, that a table with 20+ GB of data plus another 20+ GB for a few indexes is not a small table anymore.

Although the article The Data Loading Performance Guide was written for SQL Server 2008, it contains a lot of good information to help data warehouse developers design efficient ETL processes.

The scenarios in the Guide cover both partitioned and nonpartitioned tables, and loading data from text files as well as from inside the SQL Server Database Engine.

In a data warehouse environment, it’s not uncommon to design a nightly (or weekly) ETL process that refreshes data going back a fairly long period of time. During this process, a large amount of data first needs to be deleted from the target fact table, and the table then needs to be reloaded with data from the source.

Our tables are nonpartitioned. Here are four common deletion/update scenarios I can think of for such tables. I’ve used the methods below in both automated ETL processes and in ad-hoc queries.

1. Deleting All Rows from a Nonpartitioned Table – Use TRUNCATE

The fastest way to remove all data from a table is to simply execute a truncate statement.

TRUNCATE TABLE factMainTable;

The advantages of TRUNCATE TABLE over DELETE with no WHERE clause are well explained in SQL Server Books Online.

TRUNCATE TABLE (Transact-SQL)

  • Less transaction log space is used. TRUNCATE TABLE removes all rows from a table without logging the individual row deletions. The DELETE statement, on the other hand, removes rows one at a time and the deletion is fully logged for each deleted row.
  • Fewer locks are typically used. When the DELETE statement is executed using a row lock, each row in the table is locked for deletion. TRUNCATE TABLE always locks the table and page but not each row.
  • Without exception, zero pages are left in the table. After a DELETE statement is executed, the table can still contain empty pages.

2. Deleting a Large Amount of Rows in a Nonpartitioned Table – Switch with a Temp Table

In this scenario, I’ll assume that the rows that need to be deleted outnumber the rows you want to keep in the table. The idea is to use a temporary table to hold the rows we want to keep, and then switch the temporary table back to the main table. Because the rows we save to the temporary table are far fewer than the rows we want to delete, logging is kept to a minimum compared with executing the DELETE command directly on the main table. In addition, the bulk-insert optimizations can greatly reduce the overall execution time. (Keep in mind that to get minimally logged inserts, you generally need the TABLOCK hint on the target table and a database that is not in the FULL recovery model.)

Here are the general steps:

1) Create a copy of the main table with the data you want to keep.

You need to use one of the bulk insert methods. Two T-SQL options are shown below; a third option, Integration Services, is mentioned after them.

The INSERT … SELECT method:

INSERT INTO factMainTable_Temp WITH (TABLOCK)
SELECT * FROM factMainTable
WHERE OrderDate >= '20130701'
  AND OrderDate < '20130901'
  AND <other Keep Criteria>;
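
Note that INSERT … SELECT assumes the factMainTable_Temp table already exists. One quick way to create an empty copy of the main table’s structure (a minimal sketch; it copies the column definitions only, not the indexes, keys, or constraints) is:

SELECT * INTO factMainTable_Temp
FROM factMainTable
WHERE 1 = 0; -- no rows are copied; only the table structure is created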

The SELECT … INTO method:

SELECT * INTO factMainTable_Temp
FROM factMainTable
WHERE OrderDate >= '20130701'
  AND OrderDate < '20130901'
  AND <other Keep Criteria>;

As pointed out in the Guide, another way to perform the bulk insert is to use Integration Services, which can run many concurrent streams into the factMainTable_Temp table. If you need to automate the DELETE in SSIS packages, then Integration Services is the way to go.

2) Truncate the old data to remove all rows from the main table

BEGIN TRAN; -- if you want to keep the operation transactional

TRUNCATE TABLE factMainTable;

3) Bulk insert the data from the temp table back to the main table

INSERT INTO factMainTable WITH (TABLOCK)
SELECT * FROM factMainTable_Temp;

4) Drop the temporary table

DROP TABLE factMainTable_Temp;

COMMIT TRAN; -- if you used a transaction, commit it now

Optionally, in step 2) you can DROP the main table:

DROP TABLE factMainTable;

Don’t forget to script out all the constraints, indexes, and keys on the main table before it is dropped.

Then in step 3), you can rename the temp table as the main table:

EXECUTE sp_rename 'TheSchemaName.factMainTable_Temp', 'factMainTable';

Finally, re-create all the constraints, indexes, and keys on the main table.
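
For example, re-creating a clustered primary key and a nonclustered index might look like the sketch below. The key, index, and column names here are hypothetical; use the definitions you scripted out from the original table.

-- Hypothetical names; replace with the definitions scripted from the original table
ALTER TABLE factMainTable
  ADD CONSTRAINT PK_factMainTable PRIMARY KEY CLUSTERED (FactKey);

CREATE NONCLUSTERED INDEX IX_factMainTable_OrderDate
  ON factMainTable (OrderDate);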

3. Updating a Nonpartitioned Table – Switch with a Temp Table

When updating a fairly large table, you might also want to try the steps in scenario #2 – Switch with a Temp Table.

Again, the idea is to avoid using the UPDATE command directly on the main table, and to take advantage of the bulk load optimizations in T-SQL with the INSERT … SELECT method or the SELECT … INTO method.

The only difference in this UPDATE situation is that the first step, where the temporary table is created, needs to take care of the new values for the columns you want to update.

Typically, the new values come from a third table. In this case, a JOIN will be needed. (If some of the rows you want to keep have no match in the third table, a LEFT JOIN, with ISNULL or COALESCE around the updated columns, may be needed so those rows are not lost.)

INSERT INTO factMainTable_Temp WITH (TABLOCK)
SELECT main.cols, <new values>, main.more_cols
FROM factMainTable main JOIN TheThirdTable third_tbl
  ON main.join_col = third_tbl.join_col
WHERE main.OrderDate >= '20130701'
  AND main.OrderDate < '20130901'
  AND <other Keep Criteria>;

4. Deleting 10%+ (but less than 50%) Rows from a Nonpartitioned Table – DELETE in Chunks

In this scenario, I’ll assume that the data you want to delete is less than the data you want to keep in the table; this is the opposite of scenario #2. When deleting more than 10% of the rows from a table with 20+ GB of data, I’ve noticed that the deletion starts to take much longer to complete.

Since the rows I want to keep far outnumber the rows I need to delete, I doubt the “Switch with a Temp Table” method would be worth it.

So I tried the DELETE in Chunks approach in T-SQL. The basic idea is to keep each transaction small: the deletes are still fully logged, but deleting in batches avoids one huge, long-running transaction, keeps transaction log growth under control, and makes locking and any rollback much more manageable.

Check out this section “Using TOP to limit the number of rows deleted” in the link below.

TOP (Transact-SQL)

When a TOP (n) clause is used with DELETE, the delete operation is performed on an undefined selection of n number of rows. That is, the DELETE statement chooses any (n) number of rows that meet the criteria defined in the WHERE clause.

In my example, the loop deletes 200,000 rows at a time from factMainTable, for rows whose order dates fall between @START_DATE and @END_DATE. I do not need an ORDER BY clause here: it would only slow down the DELETE, and it does not matter which qualifying rows get deleted first. What breaks the WHILE loop is the @@ROWCOUNT system function. When no rows remain in the date range, @@ROWCOUNT returns zero and the WHILE loop ends.

-- @START_DATE and @END_DATE are assumed to be declared and set elsewhere
DECLARE @i int = 1;

WHILE @i > 0
BEGIN
    DELETE TOP (200000) factMainTable
    WHERE OrderDate >= @START_DATE
      AND OrderDate <= @END_DATE;

    SET @i = @@ROWCOUNT;
END;
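
One more thing worth mentioning (my own practice, not from the Guide): each DELETE TOP batch commits on its own, so the log space it used can be released between batches. Under the SIMPLE recovery model, issuing a CHECKPOINT after each batch lets that log space be reused; under the FULL recovery model, your regular log backups serve the same purpose. A minimal sketch of the loop with the checkpoint added:

DECLARE @i int = 1;

WHILE @i > 0
BEGIN
    DELETE TOP (200000) factMainTable
    WHERE OrderDate >= @START_DATE
      AND OrderDate <= @END_DATE;

    SET @i = @@ROWCOUNT;

    CHECKPOINT; -- under SIMPLE recovery, allows the log space from the committed batch to be reused
END;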

Conclusion – Bulk Load Optimizations

The Guide has a good summary of the bulk load methods in SQL Server.

To provide fast data insert operations, SQL Server ships with several standard bulk load methods.

  • Integration Services Data Destinations – The method for performing bulk load from the Integration Services ETL tool.
  • BCP – The command line utility for performing bulk load from text files.
  • BULK INSERT – The method for performing bulk load in Transact-SQL. The term “BULK INSERT” refers to the specific Transact-SQL based bulk load method.
  • INSERT … SELECT – The method for performing bulk load in process with SQL Server from local queries or any OLE DB source. This method is only available as a minimally logged operation in SQL Server 2008.
  • SELECT INTO – The method for creating a new table containing the results of a query; it utilizes bulk load optimizations.
  • Similar bulk load techniques are supplied by programming interfaces to SQL Server, including the SQLBulkCopy class in ADO.NET, IRowsetFastload in OLE DB, and the SQL Server Native Client ODBC library.

BTW, I have not been able to find the Guide for SQL Server 2012.
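
To round this out, here is a quick illustration of the BULK INSERT method from the list above. This is a minimal sketch; the file path, field terminator, and row terminator are hypothetical and need to match your actual data file.

BULK INSERT factMainTable
FROM 'C:\Data\factMainTable.txt'  -- hypothetical file path
WITH (
    TABLOCK,                      -- table-level lock; needed for the bulk load optimizations
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);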
