Managing Large Data Sets in SQL Server 2005 and 2008

By Zach Mided

Overview

One of SQL Server's strengths is its ability to operate against many rows with just one statement, but taken to excess that strength can become a problem. For example, inserting 20 million rows in a single statement will place numerous page locks, and possibly a table lock, on the target table, which can prevent others from effectively using any application that accesses that table. Alternatively, one could use a cursor to modify one row at a time, but cursor performance is poor and is not a good solution for processing millions of rows.

This article describes techniques that can be used to make operations against large data sets more manageable. You will be able to stop and start these operations at will and throttle them up or down depending on how heavily you want to use system resources at any given time.

'Tis the Season...

In some situations you can schedule operations against large data sets to occur during off-peak hours, but what if your system has to be very responsive around the clock? Or, what if the operation takes so long that it can't be completed entirely before peak hours begin?

Imagine the following scenario. Your nightly loads of large volumes of data usually complete before peak hours start, but because today is the start of the holiday season there is suddenly twice as much data, and the process is still running when peak hours begin. OK, just wait 30-60 minutes and let it complete...

Twiddle thumbs...

Respond to emails from upset users: "Don't worry. It will be resolved soon."...

Nope. An hour later and it still hasn't completed. How much longer could it possibly take? Management makes the call - kill the job and we'll load it tomorrow. OK, you kill it. Hmmm... performance is still bad. What's going on?

That huge transaction you just cancelled is now rolling itself back! All you can do is wait. The system could very well grind to a halt for the entire day. How can this be avoided?
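If you do find yourself waiting on a rollback like this, SQL Server can at least estimate how far along it is. A minimal example, assuming the session you killed was spid 52 (a placeholder; substitute the actual spid):

KILL 52 WITH STATUSONLY
--When the session is already rolling back, this reports the estimated percent
--complete and the estimated time remaining; it does not kill anything further.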

Breaking it Down

The best way to manage operations on large data sets like this is to break them down into smaller pieces and perform the operation piece by piece. To demonstrate how this can be done, let's imagine that a daily ETL feed of millions of rows is placed into a staging table named "new_purchase" and that our operation needs to copy all of these rows into the "purchase" table. Then let's assume that inserting these rows will take a long time because the "purchase" table already has millions of rows in it, has numerous indexes on it, and is actively used by the production application.

So that we have some data to test our solution with, I will first create a "new_purchase" table and populate it with a list of fake purchases:

CREATE TABLE new_purchase (
purchase_date DATETIME,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2) )

-- populate a fake table with 100,000 new purchases
SET NOCOUNT ON
DECLARE @i AS INT
SET @i = 0
WHILE @i < 100000
BEGIN
INSERT new_purchase
SELECT Cast('2009-01-01' AS DATETIME) + Rand() AS purchase_date,
'SKU'+Cast(Cast(Rand() * 10000000 AS BIGINT) AS VARCHAR(7)) AS item,
Cast(Rand() * 10 + 1 AS INT) AS quantity,
Cast(Rand() * 100 + .01 AS DECIMAL(10,2)) AS price
SET @i = @i + 1
END

Second, I will create a "purchase" table, which is the table that holds the cumulative list of all purchases in the production system:

--Very large table that contains all purchases and is used by the production system
CREATE TABLE purchase (
id UNIQUEIDENTIFIER,
created_date DATETIME,
purchase_date DATETIME,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2),
total DECIMAL(14,2)
)

And finally, I will create a stored procedure to copy 1,000 to 20,000 rows at a time from the "new_purchase" table into the "purchase" table. I will also create a log table to track and measure each of the copy operations.

--Create a table for tracking 
CREATE TABLE load_new_purchase_log (
start_time DATETIME,
end_time DATETIME,
row_count INT
)
GO

--Create a procedure to load new purchases in groups of 1,000-20,000 rows
CREATE PROCEDURE load_new_purchase
AS
BEGIN

DECLARE @start_time AS DATETIME
DECLARE @maximum_row AS BIGINT
DECLARE @starting_row AS BIGINT
DECLARE @rows_to_process_per_iteration AS INT
DECLARE @rows_added AS INT
DECLARE @message AS VARCHAR(100)

--Define how many rows to process in each iteration
--depending on whether the routine is running during peak hours or not
SET @rows_to_process_per_iteration =
CASE WHEN DatePart(Hour, GetDate()) BETWEEN 7 AND 17 THEN 1000 --Peak hours
ELSE 20000 END --Off-Peak hours, load a larger number of records at a time

--Determine how many rows need to be processed in total
SET @maximum_row = (SELECT Count(*) FROM new_purchase)

--If this process had already run, it may have already added some rows.
--Determine where to start.
SET @starting_row = IsNull((SELECT Sum(row_count) + 1
FROM load_new_purchase_log),1)

--Continue looping until all new_purchase records have been processed
SET NOCOUNT OFF
WHILE @starting_row <= @maximum_row
BEGIN
SET @message = 'Processing next '+
Cast(@rows_to_process_per_iteration AS VARCHAR(100))+
' records starting at record '+
Cast(@starting_row AS VARCHAR(100))+' (of '+
Cast(@maximum_row AS VARCHAR(100))+')'
PRINT @message

SET @start_time = GetDate()
--Insert the next set of rows and log it to the log table in the same transaction
BEGIN TRANSACTION
--Copy rows into purchase table from the new_purchase table
INSERT purchase
SELECT NewId() as id, GetDate() as created_date,
purchase_date, item, quantity, price, quantity * price as total
FROM
(
SELECT *,
Row_Number() OVER (ORDER BY purchase_date, item, quantity) AS rownumber
FROM new_purchase
) as new
WHERE rownumber BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1

--Log the records that have been added
SET @rows_added = @@RowCount
INSERT load_new_purchase_log (start_time,end_time,row_count)
VALUES (@start_time, GetDate(), @rows_added)
COMMIT TRANSACTION
SET @starting_row = @starting_row + @rows_added

--Define how many rows to process in each iteration
--depending on whether the routine is running during peak hours or not
IF DatePart(Hour, GetDate()) BETWEEN 7 and 17
BEGIN
--Peak Hours
--Load a small number of rows for each iteration
SET @rows_to_process_per_iteration = 1000
--Delay 10 seconds to lighten the load even further
WAITFOR DELAY '00:00:10'
END
ELSE
BEGIN
--Off-Peak Hours
--Load a large number of rows for each iteration
SET @rows_to_process_per_iteration = 20000
END
END --end while statement
END --end stored proc

The main trick in the code is the Row_Number() function, which is available in SQL Server 2005 and later. It enables you to segment the large operation into smaller operations, each of which touches only a subset of all of the rows. Note that SQL Server does not support Row_Number() calculations within the WHERE clause, so the following would produce an error:

SELECT NewId() as id, GetDate() as created_date,
purchase_date, item, quantity, price, quantity * price as total
FROM new_purchase
WHERE Row_Number() OVER (ORDER BY purchase_date, item, quantity)
BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1

That is why the stored procedure has to compute Row_Number() in a sub-select. Even though the sub-select covers the entire table, SQL Server performs surprisingly well at filtering out just the requested records, especially if the source table has an index whose ordering matches the ORDER BY in the OVER clause.
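For readability, the same batching can also be written with a common table expression instead of a derived table, and the point about indexes can be tested by adding one that matches the OVER clause. The sketch below is my own variation on the article's query, not part of the original procedure, and the index name is made up; it assumes @starting_row and @rows_to_process_per_iteration are declared as in the stored procedure.

--Optional supporting index (hypothetical name) matching the ORDER BY in the OVER clause
CREATE INDEX ix_new_purchase_order ON new_purchase (purchase_date, item, quantity)

--Equivalent batch insert written with a CTE instead of a derived table
;WITH numbered AS (
SELECT *,
Row_Number() OVER (ORDER BY purchase_date, item, quantity) AS rownumber
FROM new_purchase
)
INSERT purchase
SELECT NewId(), GetDate(), purchase_date, item, quantity, price, quantity * price
FROM numbered
WHERE rownumber BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1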

The number of rows copied in each iteration is written to a log table. This enables the procedure to restart where it left off if it is terminated before it can process all of the rows. The main INSERT statement and the INSERT into the log table are wrapped in the same transaction to ensure that they are committed to the database together. A secondary benefit of writing to a log table is that it is helpful for reporting on the progress of the operation and for benchmarking performance.
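For example, a quick query against the log table (my own addition, not from the original article) reports how many rows have been copied so far and the overall insert throughput:

--Progress and throughput derived from the log table
SELECT Count(*) AS batches_completed,
Sum(row_count) AS rows_copied,
Sum(DateDiff(ms, start_time, end_time)) / 1000.0 AS seconds_spent_inserting,
Sum(row_count) * 1000.0 / NullIf(Sum(DateDiff(ms, start_time, end_time)), 0) AS rows_per_second
FROM load_new_purchase_log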

You will also notice a couple of places in this code where it tests whether the current time falls within peak hours (hours 7 through 17, i.e. roughly 7 AM to 6 PM). If it does, the routine loads only 1,000 rows per iteration and then pauses for 10 seconds. If the current time is outside of peak hours, the routine loads up to 20,000 rows per iteration because there are more system resources available for processing.
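If a fixed 1,000/20,000 split turns out to be too coarse, the batch size could also be adjusted on the fly using the timings already captured in the log table. The fragment below is only a sketch of that idea (my assumption, not part of the original procedure) and would sit inside the WHILE loop:

--If the most recent batch took longer than 30 seconds, halve the batch size
--(but never drop below 1,000 rows) before starting the next iteration
IF EXISTS (SELECT 1 FROM load_new_purchase_log
WHERE end_time = (SELECT Max(end_time) FROM load_new_purchase_log)
AND DateDiff(second, start_time, end_time) > 30)
SET @rows_to_process_per_iteration =
CASE WHEN @rows_to_process_per_iteration / 2 < 1000 THEN 1000
ELSE @rows_to_process_per_iteration / 2 END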

Now, to test the stored procedure I type:

EXEC load_new_purchase 

And the following output is generated:

Processing next 1000 records starting at record 1 (of 100000)
Processing next 1000 records starting at record 1001 (of 100000)
Processing next 1000 records starting at record 2001 (of 100000)

I then disconnect that query to test the procedure's ability to stop and restart cleanly, rerun the stored procedure, and let it complete. Running the following queries shows that all 100,000 rows were copied correctly into the "purchase" table.

SELECT Count(*) as record_count_in_purchase_table FROM purchase
SELECT TOP 10 * FROM purchase
SELECT TOP 10 * FROM load_new_purchase_log
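As an extra sanity check (my addition, not from the original article), the three counts below should agree once the load has finished:

--Staged rows, copied rows, and logged rows should all match after a complete load
SELECT (SELECT Count(*) FROM new_purchase) AS staged_rows,
(SELECT Count(*) FROM purchase) AS copied_rows,
(SELECT Sum(row_count) FROM load_new_purchase_log) AS logged_rows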

Wrap-Up

In situations where system performance is important and one needs to operate on a large number of rows, breaking the operation down into smaller pieces has many benefits, including:

  1. Fewer database locks and less contention.
  2. Better performance for any other applications that share the affected tables.
  3. Robustness, so that the operation can easily and efficiently be stopped and restarted at will.
  4. Visibility into the progress of the operation and automatic tracking of metrics that can then be used for benchmarking.

About the Author

Zach Mided is a seasoned technology executive specializing in translation between technology and business. He is currently employed by Alliance Global Services and is consulting for a rebate fulfillment company, helping them develop software that processes millions of dollars in rebates each week.
