If you started out as a developer you were probably taught how important loops are. They are one of the first handful of things you’re taught about along with conditionals, variables, functions, stored procedures etc. On the other hand, you’ve probably noticed that database professionals tend not to like them. Now, personally, I have no problems with loops when you need them, but only when you need them.
So why do data people tend to avoid (or even actively dislike) loops? (Can you say cursor anyone?). Scaleability! Loops just don’t scale well. A loop that is fast at 100 loops is going to take twice as long at 200 loops, five times as long at 500 loops and one hundred times as long at 10,000 loops. That’s a problem in the database world when at 10,000 rows a table is still considered small and depending on your experience a mid-sized table might be 1,000,000 rows or more. As in all things I like examples, so here’s a simple one.
I’m creating a table with an identity column (the primary key/clustered index) and a date column. I’m going to record times spent updating each row one at a time and just updating the entire table. Then I’m going to add 10 rows and run again, 10 rows and run again, etc until I have 7500 rows. (Not a lot but it takes a while to run.) Quick note to everyone who reads this and thinks “But …”. I’m aware this is a really simple example. If you have buts that you think will significantly change the outcome feel free to run a test yourself and if by some odd chance (unlikely but hey, could happen) feel free to put the results in the comments, or even better blog them and link the blog in the comments :).
-- Create work table & log table CREATE TABLE LoopSpeed ([RowCount] INT, LoopTimeInMS INT, BatchTimeInMS INT); CREATE TABLE LoopSpeedTest (Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY, ADate DATETIME); GO DECLARE @RowCount INT = 0; DECLARE @LoopCount INT = 0; DECLARE @StartTime DATETIME, @EndTime DATETIME DECLARE @LoopTimeInMS INT, @BatchTimeInMS INT -- 750 loops at 10 row increments. WHILE @RowCount < 7501 BEGIN -- Add 10 rows INSERT INTO LoopSpeedTest (ADate) VALUES (GetDate()), (GetDate()), (GetDate()), (GetDate()), (GetDate()), (GetDate()), (GetDate()), (GetDate()), (GetDate()), (GetDate()); SELECT @RowCount = COUNT(1) FROM LoopSpeedTest; SET @LoopCount = 0; SET @StartTime = GetDate(); -- Loop that updates each row in the table one at a time. WHILE @LoopCount < @RowCount BEGIN SET @LoopCount = @LoopCount + 1; -- Update one row. UPDATE LoopSpeedTest SET ADate = GetDate() WHERE Id = @LoopCount; END -- Get duration of loop; SET @LoopTimeInMS = DATEDIFF(millisecond,@StartTime,GetDate()); SET @StartTime = GetDate(); -- Update all of the rows in the table in a batch. UPDATE LoopSpeedTest SET ADate = GetDate(); -- Get duration of batch; SET @BatchTimeInMS = DATEDIFF(millisecond,@StartTime,GetDate()); INSERT INTO LoopSpeed VALUES (@RowCount, @LoopTimeInMS, @BatchTimeInMS); END
First let me point out that I’m running this on a very old, cheap laptop and I’m measuring milliseconds. It doesn’t take much to get wild variations in the numbers. That’s why the loop time has such wild variations at times. That said, I’m sure you notice the slow but steady increase in time taken to run the loop. Now imagine if I wanted to run an update over a million rows? We’d no longer be measuring in milliseconds, we’d be measuring in minutes or even hours. Lastly, I want you to notice that pretty flat line along the bottom. That is the very very slow increase in time that the batch update takes. At 10 rows it was approximately 0ms and at 7500 it was 20. Yes, the duration does increase with the number of rows. If nothing else, I/O does take time after all. But SQL is built to handle large numbers (these are really pretty minuscule numbers for SQL to be honest) of rows at a time.
Let me give you an analogy. Let’s say we are baking a cake and I ask you to go to the store to get the ingredients.
- 1/2 cup unsalted butter, softened
- 1 1/2 cups sugar
- 3 large eggs
- 2 1/4 cups all-purpose flour
- 1 teaspoon salt
- 3 1/2 teaspoons baking powder
- 1 1/4 cups whole milk
- 1 teaspoon vanilla
- 2 1/2 cups buttercream frosting
I start by asking you to get me the butter. You run to the store, get the butter and come back. Then I ask for some sugar. Then eggs, etc. Assuming you don’t just get frustrated and start yelling at some point it’s going to take a while right?
But if I had you that entire list, you run to the store, get each item and come back. Much faster right? And if I double the length of the list it’s going to take longer to get but not nearly as long as if I ask you to go to the store for each of the additional items one at a time. (Fortunately computers are very patient and don’t actually get angry about being asked to do one thing at a time.)
There is a good reason that database professionals tend to avoid loops. They are slow, inefficient, and scale poorly when compared to batch processing. That said, you still need to know how to use them as occasionally you have no choice. A simple example is changing the read/write status of every database on the instance. Those situations are pretty rare though. Batches really are the way to go if at all possible.