December 19, 2017 at 10:58 pm
Hi Folks,
I have a requirement to update a field on every record in a series of large tables (> 30M rows). Unfortunately, these tables don't necessarily have a reliable ID column I can use, or if they do, the IDs have big gaps in them.
Initially, my code was:
DECLARE @BatchCountMin INT = 0,
        @BatchCountMax INT = (SELECT MIN(ID) FROM Big_table),
        @BatchSize INT = 10000,
        @MaxID INT = (SELECT MAX(ID) FROM Big_table);

WHILE @BatchCountMax <= @MaxID
BEGIN
    SELECT @BatchCountMin = @BatchCountMax,
           @BatchCountMax = @BatchCountMax + @BatchSize;

    BEGIN TRAN;
        UPDATE Big_table
        SET Big_table.TextToUpdate = <SomeStringFunction On TextToUpdate>
        WHERE Big_table.ID >= @BatchCountMin AND Big_table.ID < @BatchCountMax;
    COMMIT;
END
As I have big gaps in the IDs, the query spins its wheels for long periods as it works through batches of 10,000 IDs that may contain few or no actual rows.
There are also tables where I don't have an ID, or a single field that I can use as an ID (Don't shoot me! I inherited this mess).
Considering I have to update each row anyway, are there any methods I could use to batch the updates without having to rely on some form of identifier, short of dumping the data into a temp table of some sort?
Cheers
December 20, 2017 at 6:58 am
Quick thought: you could use either the ROW_NUMBER or the DENSE_RANK function for this; just wrap the initial query in a CTE.
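A rough sketch of that idea, assuming a table dbo.Big_table with an ID column and a TextToUpdate column (both names are placeholders, as is UPPER() standing in for the real string function):

```sql
-- Number the rows densely with ROW_NUMBER(), then update a contiguous
-- window of row numbers each pass; gaps in ID no longer matter.
DECLARE @BatchSize INT = 10000,
        @BatchStart BIGINT = 1,
        @RowsAffected INT = 1;

WHILE @RowsAffected > 0
BEGIN
    WITH numbered AS (
        SELECT TextToUpdate,
               ROW_NUMBER() OVER (ORDER BY ID) AS rn
        FROM dbo.Big_table
    )
    UPDATE numbered
    SET TextToUpdate = UPPER(TextToUpdate) -- placeholder transformation
    WHERE rn >= @BatchStart
      AND rn <  @BatchStart + @BatchSize;

    SET @RowsAffected = @@ROWCOUNT;
    SET @BatchStart = @BatchStart + @BatchSize;
END;
```

One caveat: the CTE is re-evaluated on every pass, so each batch re-numbers the table; on 30M+ rows you'd want an index supporting the ORDER BY.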
😎
December 20, 2017 at 7:35 am
If you want to update 10,000 rows at a time, you can use an UPDATE TOP (10000) ... ORDER BY ID statement in your loop. You'll probably need an OUTPUT clause so that you know when to stop!
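One wrinkle: UPDATE TOP (n) doesn't accept an ORDER BY directly, so the usual pattern is to put the TOP ... ORDER BY in a CTE and update through it. A sketch, assuming (hypothetically) that already-updated rows can be filtered out in the WHERE clause so each pass picks up the next batch (table, column names, and the LIKE test are placeholders):

```sql
DECLARE @RowsAffected INT = 1;

WHILE @RowsAffected > 0
BEGIN
    WITH next_batch AS (
        SELECT TOP (10000) TextToUpdate
        FROM dbo.Big_table
        WHERE TextToUpdate NOT LIKE 'X%' -- hypothetical "not yet updated" test
        ORDER BY ID
    )
    UPDATE next_batch
    SET TextToUpdate = 'X' + TextToUpdate; -- placeholder transformation

    SET @RowsAffected = @@ROWCOUNT;
END;
```

Checking @@ROWCOUNT after the UPDATE serves the same "know when to stop" purpose as the OUTPUT clause suggested above.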
John
December 20, 2017 at 11:42 am
Look at ORDER BY with OFFSET FETCH
December 20, 2017 at 7:48 pm
Thanks for your feedback. I'll have a play and see what works.
Cheers
December 22, 2017 at 8:27 am
Go by the values in the clustered index. For example, if the table is clustered on a datetime column, use that as the value for the UPDATE.
Here's a skeleton script that lets you specify a starting and maximum value, the batch size, and the total number of rows to update.
SET ANSI_NULLS ON;
SET ANSI_PADDING ON;
SET ANSI_WARNINGS ON;
SET ARITHABORT ON;
SET CONCAT_NULL_YIELDS_NULL ON;
SET NOCOUNT OFF;
SET NUMERIC_ROUNDABORT OFF;
SET QUOTED_IDENTIFIER ON;
SET XACT_ABORT ON;
DECLARE @batch_size_of_update int;
DECLARE @datetime datetime;
DECLARE @max_datetime datetime;
DECLARE @min_datetime datetime;
DECLARE @rows_updated int;
DECLARE @total_rows_updated int;
DECLARE @total_rows_to_update int;
SET @batch_size_of_update = 20000;
SET @total_rows_to_update = 200000;
SET @min_datetime = '19000101';
SET @max_datetime = '20171231';
SET @rows_updated = 0;
SET @total_rows_updated = 0;
PRINT 'Start time = ' + CONVERT(varchar(30), GETDATE(), 0) + '.';
WHILE 1 = 1
BEGIN
    SELECT TOP (@batch_size_of_update) @datetime = datetime
    FROM dbo.table_name
    WHERE datetime >= @min_datetime AND
          datetime <= @max_datetime
    ORDER BY datetime;

    UPDATE TOP (@batch_size_of_update) tn
    SET whatever = 'whatever'
    FROM dbo.table_name tn
    WHERE datetime >= @min_datetime AND
          datetime <= @datetime;
    SET @rows_updated = @@ROWCOUNT;

    IF @rows_updated = 0
        BREAK;

    SET @total_rows_updated = @total_rows_updated + @rows_updated;
    PRINT 'Total Rows updated = ' + CAST(@total_rows_updated AS varchar(10)) + '.';
    PRINT 'Last datetime updated = ' + CONVERT(varchar(30), @datetime, 120) + '.';

    IF @total_rows_updated >= @total_rows_to_update
        BREAK;

    SET @min_datetime = @datetime;

    IF @total_rows_updated % 200000 = 0
    BEGIN
        CHECKPOINT;
    END; --IF

    IF @batch_size_of_update >= 10000
    BEGIN
        IF @batch_size_of_update <= 25000
            WAITFOR DELAY '00:00:00.125';
        ELSE IF @batch_size_of_update <= 50000
            WAITFOR DELAY '00:00:00.250';
        ELSE
            WAITFOR DELAY '00:00:00.500';
    END; /*IF*/
END; --WHILE
CHECKPOINT;
PRINT '';
PRINT 'End time = ' + CONVERT(varchar(30), GETDATE(), 0) + '.';
SQL DBA, SQL Server MVP (07, 08, 09)
A socialist is someone who will give you the shirt off *someone else's* back.
December 23, 2017 at 6:27 am
Would this not benefit from using a tally table? It would save the loops, and missing IDs could simply be omitted.
...
December 23, 2017 at 3:44 pm
HappyGeek - Saturday, December 23, 2017 6:27 AM: Would this not benefit from using a tally table? It would save the loops and missing IDs could be simply omitted?
Not a Tally Table, per se, but a Tally-like structure using ROW_NUMBER() OVER, and that's already been suggested.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 24, 2017 at 6:23 am
This seems like a good use case for the OFFSET clause... Something along these lines.
Note: because you're sorting on the clustered key, there should be no sort operation in the execution plan.

IF OBJECT_ID('tempdb..#test_data', 'U') IS NOT NULL
BEGIN DROP TABLE #test_data; END;
CREATE TABLE #test_data (
oddly_spaced_id BIGINT NOT NULL,
update_val INT NULL
);
WITH
cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
cte_Tally (n) AS (
SELECT TOP (1000000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM
cte_n3 a CROSS JOIN cte_n3 b
)
INSERT #test_data (oddly_spaced_id)
SELECT DISTINCT
ABS(CHECKSUM(NEWID()) % t.n) * t.n
FROM
    cte_Tally t;

ALTER TABLE #test_data ADD PRIMARY KEY (oddly_spaced_id);
--==============================================================
DECLARE
@total_rows INT = (SELECT COUNT(1) FROM #test_data td),
@position INT = 0;
WHILE @position < @total_rows
BEGIN
WITH
cte_get_batch AS (
SELECT td.update_val
FROM
#test_data td
ORDER BY
td.oddly_spaced_id
OFFSET @position ROWS FETCH NEXT 10000 ROWS ONLY
)
UPDATE gb SET
gb.update_val = @position
FROM
cte_get_batch gb;
SET @position = @position + 10000;
END;
SELECT TOP 20000 * FROM #test_data td ORDER BY td.oddly_spaced_id;