What is the fastest way to update a newly added column in to a very large table?
Recently I had to add a new column to a table which has more than 35 million very wide (1000+ bytes per row) rows and populate with data. Using the recommended method by various people the best speed I could get was 20 hours.
I will first describe the failed attempt and then I will explain the method I came up with to increase the speed of this operation significantly.
CREATE TABLE [dbo].[LargeTable]( [ID] [int] IDENTITY(1,1) NOT NULL, [InvID] [decimal](11, 0) NOT NULL, [JNumber] [tinyint] NOT NULL, [JType] [nchar](1) NOT NULL, [NameKey] [nvarchar](341) NOT NULL, [SystemID] [decimal](12, 0) NOT NULL, [BDate] [datetime] NOT NULL, [EDate] [datetime] NOT NULL, [DateTimeModified] [datetime] NOT NULL, [RowTimestamp] [timestamp] NULL, [Designation] [nvarchar](200) NOT NULL, [Capacity] [nvarchar](20) NOT NULL, [Agreement] [nvarchar](50) NOT NULL, [DateQ] [nvarchar](10) NOT NULL, CONSTRAINT [LargeTable__PK] PRIMARY KEY NONCLUSTERED ( [ID] ASC )ON [PRIMARY] ) ON [PRIMARY] CREATE CLUSTERED INDEX [LargeTable__IE1] ON [dbo].[LargeTable] ( [NameKey] ASC )ON [PRIMARY]
Note that the clustered index is on a very wide column.
The recommended method to add a new column and update it is as follows. There was nothing wrong with the method to add a new column to such a large table.
Step 1: Add new column
alter table MyTable add NewZip char(10) NULL
Step 2: Set Batch Size for update
--- (2) set piecemeal batch size set nocount on SET ROWCOUNT 1000
Next, the piecemeal update is repeatedly done until there are no NULL values in NewZip. This example sets all rows using the char(5) ZipCode column. Alternately, you can selectively set the value of an added column using a lookup or join.
Step 3: De-coupled Piecemeal Update
declare @rowcount int, @batchcount int, @rowsupdated int select @rowcount = 1,@batchcount = 0, @rowsupdated = 0 while @rowcount > 0 begin begin tran update MyTable set NewZip = ZipCode where NewZip is NULL set @rowcount = @@rowcount select @rowsupdated = @rowsupdated + @rowcount if @rowcount > 0 select @batchcount = @batchcount + 1 commit end
When I followed the above method I realized that the performance is degrading exponentially with the iteration count.
See below graph for the actual statistics during my test load. Duration is in seconds and the batch size was 10,000.
As you can clearly see when it comes to Iteration 2000+ the time to update 10,000 records has gone up to 80 seconds and with this behavior it took more than 20 hours for me to update my 35 million records table.
The reason for this is the cost to find the next 10,000 to update and it becomes costlier with each iteration because of lesser and lesser availability of records where the NewColumn is null and they are scattered everywhere in the clustered index.
You would think why not we then add an index on this new column so that its easier to locate the null records, but it doesnt help because it still has to do a costly bookmark lookup to find the actual data page and it even has to update this new index with each update.
OK So whats the solution to this issue?
We know that we cannot do this update in one shot because thatll fill up the log file and if something goes wrong then we need to do it from the beginning again. Thats why we still need to do this in batch by batch basis.
Then I realized that actually SQL server has to load the data pages (Clustered index) to update this new column anyway and if I can find a way to go through the same order as the clustered index key column(s) then it would be much faster because then SQL server optimizer can easily predict the next set of pages (read-ahead) because they are physically ordered in the clustered index key order. And for that reason it will not have to jump everywhere in the clustered index to find the next set of records to be updated since we are doing it in the same order as they are stored physically.
If I had a clustered index on an IDENTITY column my life would have been much easier because then I can iterate thorough it from start to finish advancing the loop by the batch number.
But since my clustered index is an Alpha Name column which contains values from the ASCII table (http://www.neurophys.wisc.edu/comp/docs/ascii.html)
I had to use a different approach. I had to find a way to iterate though the Clustered index from A to Z but in batches. My batch count will not be consistent with this approach but if you smart enough you can break the ASCII table in to several ranges to break the clustered index key values in to manageable batches. Dont think you can do that easily with some simple ranges like A-B, B-C; This is a huge table and A to B range it self may contain millions of records. Thats why I had to do something extra to break in to smaller batches.
To achieve this I did the following trick.
1) Create a table to hold the different Range keys from the ASCII table.
create table _CLUSTKEY (ID int Identity(1,1) primary Key, Key1 nvarchar(2), Key2 nvarchar(2))
2) Generate the ASCII key ranges and insert them in to the above created table. (If its difficult to understand the algorithm, first run this and then look at the results. Then you will be able to understand the logic easily)
declare @var as char(1), @ascii as int, @count as int set @ascii = 65 set @count = 0 while(@ascii<90) begin insert into _CLUSTKEY select char(@ascii) + char(ascii('A')+@count) , char(@ascii) + char(ascii('A')+@count+1) if (char(ascii('A')+@count+1) = 'Z') begin insert into _CLUSTKEY select char(@ascii) + char(ascii('A')+@count+1) , char(@ascii+1) + char(ascii('A')) set @ascii = @ascii + 1 set @count = 0 end else set @count = @count + 1 end
3) Insert beginning and end ranges to the above table manually
insert into _CLUSTKEY select '', 'AA' insert into _CLUSTKEY select 'ZA', CHAR(ASCII('Z')+1) Your end criteria may be different to this depending on the data you have on your clustered key column.
Once populated the above table will look like this:
3) Add the new column to my very large table
Alter table LargeTable Add AlphaName_CS int null
4) Start the update in batches (Batch boundaries are defined by the table I have created and populated above)
-- Run this Script to create a table to record the progress so that you can see how far you are from the completion create table __progress(RecCount int, LoopCount int, DateT datetime) -- Actual Update Script declare @Key1 as nvarchar(2), @Key2 as nvarchar(2),@count as int, @rowcount as int, @MaxLoopCount as int select @MaxLoopCount = max(ID) from _CLUSTKEY set @count = 1 while(@count<=@MaxLoopCount) begin -- Get the start Key and End key value for the current batch in to variables select @Key1 = Key1, @Key2 = Key2 from _CLUSTKEY where ID = @count -- Do the update for the above Key range update dbo.LargeTable set AlphaName_CS = CHECKSUM(NameKey) where NameKey >= @Key1 and NameKey < @Key2 select @rowcount = @@rowcount -- insert a record to this progress table to check the progress insert into __progress select @rowcount, @count,getdate() set @count = @count + 1 end