Table updates on large table with no ID or large gaps between IDs

  • Hi Folks,
    I have a requirement to update a field on every record in a series of large tables (> 30M rows).  Unfortunately, these tables don't necessarily have a reliable ID field that I can use, or if they do, the IDs have big gaps in them.
    Initially, my code was:

    DECLARE @BatchCountMin INT = 0,
        @BatchCountMax INT = (SELECT MIN(ID) FROM Big_table),
        @BatchSize INT = 10000,
        @MaxID INT = (SELECT MAX(ID) FROM Big_table);
    WHILE @BatchCountMax <= @MaxID
    BEGIN
        SELECT @BatchCountMin = @BatchCountMax,
            @BatchCountMax = @BatchCountMax + @BatchSize
        BEGIN TRAN
        UPDATE Big_table
        SET Big_table.TextToUpdate = <SomeStringFunction On TextToUpdate>
        WHERE Big_table.ID >= @BatchCountMin AND Big_table.ID < @BatchCountMax
        COMMIT
    END

    Because there are big gaps in the IDs, the query spins its wheels for long periods as it works through batches of 10,000 IDs that contain few or no actual rows.

    There are also tables where I don't have an ID, or a single field that I can use as an ID (Don't shoot me!  I inherited this mess).

    Considering I have to update each row anyway, are there any methods I could use to batch the updates without having to rely on some form of identifier, short of dumping the data into a temp table of some sort?

    Cheers

  • Quick thought: you could use either the ROW_NUMBER or the DENSE_RANK function for this; just wrap the initial query in a CTE.
    😎
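    As a rough illustration of the ROW_NUMBER() idea (only a sketch; it assumes the table is dbo.Big_table with columns ID and TextToUpdate, and UPPER() stands in for whatever string change is actually needed, so an already-converted row can be recognised and skipped on the next pass):

    DECLARE @BatchSize INT = 10000,
            @RowsAffected INT = 1;

    WHILE @RowsAffected > 0
    BEGIN
        WITH numbered AS (
            SELECT TextToUpdate,
                   ROW_NUMBER() OVER (ORDER BY ID) AS rn
            FROM dbo.Big_table
            WHERE TextToUpdate <> UPPER(TextToUpdate)  -- placeholder test for "not yet updated"
        )
        UPDATE numbered
        SET TextToUpdate = UPPER(TextToUpdate)         -- placeholder for the real string function
        WHERE rn <= @BatchSize;

        SET @RowsAffected = @@ROWCOUNT;
    END;

    Because ROW_NUMBER() is recalculated on every pass, the WHERE clause has to exclude rows that are already converted; otherwise the loop would keep reprocessing the same first batch.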

  • If you want to update 10,000 rows at a time, you can use UPDATE TOP (10000) in your loop, driving the batch order with ORDER BY ID (through a CTE or derived table, since UPDATE TOP on its own doesn't honor ORDER BY).  You'll probably need an OUTPUT clause so that you know when to stop!
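    A rough sketch of that pattern (only an illustration; it assumes an int ID column and a TextToUpdate column on dbo.Big_table, with UPPER() standing in for the real string function):

    DECLARE @BatchSize INT = 10000,
            @LastID    INT = -2147483648;       -- start below the smallest possible ID

    DECLARE @Touched TABLE (ID INT NOT NULL);

    WHILE 1 = 1
    BEGIN
        DELETE FROM @Touched;

        WITH batch AS (
            SELECT TOP (@BatchSize) ID, TextToUpdate
            FROM dbo.Big_table
            WHERE ID > @LastID
            ORDER BY ID
        )
        UPDATE batch
        SET TextToUpdate = UPPER(TextToUpdate)  -- placeholder for the real string function
        OUTPUT inserted.ID INTO @Touched (ID);

        IF @@ROWCOUNT = 0
            BREAK;                              -- no rows left: stop

        SELECT @LastID = MAX(ID) FROM @Touched; -- resume after the last ID processed
    END;

    The OUTPUT clause records the IDs actually touched, so @LastID jumps straight past any gaps, and the loop ends as soon as a batch comes back empty.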

    John

  • Look at ORDER BY with OFFSET FETCH

  • Thanks for your feedback.  I'll have a play and see what works.

    Cheers

  • Go by the values in the clustered index.  For example, if the table is clustered on a datetime column, use that column's values to drive the UPDATE batches.

    Here's a skeleton script that lets you specify a starting value and a maximum value, the batch size for each update, and the total number of rows to update.


    SET ANSI_NULLS ON;
    SET ANSI_PADDING ON;
    SET ANSI_WARNINGS ON;
    SET ARITHABORT ON;
    SET CONCAT_NULL_YIELDS_NULL ON;
    SET NOCOUNT OFF;
    SET NUMERIC_ROUNDABORT OFF;
    SET QUOTED_IDENTIFIER ON;
    SET XACT_ABORT ON;

    DECLARE @batch_size_of_update int;
    DECLARE @datetime datetime;
    DECLARE @max_datetime datetime;
    DECLARE @min_datetime datetime;
    DECLARE @rows_updated int;
    DECLARE @total_rows_updated int;
    DECLARE @total_rows_to_update int;

    SET @batch_size_of_update = 20000;
    SET @total_rows_to_update = 200000;
    SET @min_datetime = '19000101';
    SET @max_datetime = '20171231';
    SET @rows_updated = 0;
    SET @total_rows_updated = 0;

    PRINT 'Start time = ' + CONVERT(varchar(30), GETDATE(), 0) + '.';

    WHILE 1 = 1
    BEGIN
      -- Find the datetime value at the end of the next batch: the highest value
      -- among the next @batch_size_of_update rows, taken in clustered-index order.
      SELECT TOP (@batch_size_of_update) @datetime = datetime
      FROM dbo.table_name
      WHERE datetime >= @min_datetime AND
        datetime <= @max_datetime
      ORDER BY datetime;

      -- Update everything from the current lower bound up to that batch-end value.
      UPDATE TOP (@batch_size_of_update) tn
      SET whatever = 'whatever'
      FROM dbo.table_name tn
      WHERE datetime >= @min_datetime AND
        datetime <= @datetime;
      SET @rows_updated = @@ROWCOUNT;

      IF @rows_updated = 0
        BREAK; -- nothing left to update

      SET @total_rows_updated = @total_rows_updated + @rows_updated;
      PRINT 'Total Rows updated = ' + CAST(@total_rows_updated AS varchar(10)) + '.';
      PRINT 'Last datetime updated = ' + CONVERT(varchar(30), @datetime, 120) + '.';

      IF @total_rows_updated >= @total_rows_to_update
        BREAK; -- reached the overall row limit for this run

      -- The next batch starts at the last datetime processed; using >= keeps rows
      -- that share that boundary value from being skipped (they get re-touched).
      SET @min_datetime = @datetime;

      -- Checkpoint periodically so the log can be reused between batches.
      IF @total_rows_updated % 200000 = 0
      BEGIN
        CHECKPOINT;
      END; --IF

      -- Throttle larger batches to give other activity room to run.
      IF @batch_size_of_update >= 10000
      BEGIN
        IF @batch_size_of_update <= 25000
          WAITFOR DELAY '00:00:00.125';
        ELSE
        IF @batch_size_of_update <= 50000
          WAITFOR DELAY '00:00:00.250';
        ELSE
          WAITFOR DELAY '00:00:00.500';
      END /*IF*/
    END --WHILE

    CHECKPOINT;

    PRINT '';
    PRINT 'End time = ' + CONVERT(varchar(30), GETDATE(), 0) + '.';

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • Would this not benefit from using a tally table? It would save the loops, and missing IDs could simply be omitted.

    ...

  • HappyGeek - Saturday, December 23, 2017 6:27 AM

    Would this not benefit from using a tally table? It would save the loops and missing IDs could be simply omitted?

    Not a Tally Table per se, but a Tally-like structure using ROW_NUMBER() OVER, and that's already been suggested.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This seems like a good use case for the OFFSET clause... Something along these lines.
    Note: because you're sorting on the clustered key, there should be no sort operation in the execution plan.
    IF OBJECT_ID('tempdb..#test_data', 'U') IS NOT NULL
    BEGIN DROP TABLE #test_data; END;

    CREATE TABLE #test_data (
        oddly_spaced_id BIGINT NOT NULL,
        update_val INT NULL
        );

    -- Build up to 1,000,000 irregularly spaced ID values using a cascading tally CTE.
    WITH
        cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
        cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
        cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
        cte_Tally (n) AS (
            SELECT TOP (1000000)
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM
                cte_n3 a CROSS JOIN cte_n3 b
            )
    INSERT #test_data (oddly_spaced_id)
    SELECT DISTINCT
        ABS(CHECKSUM(NEWID()) % t.n) * t.n
    FROM
        cte_Tally t;

    ALTER TABLE #test_data ADD PRIMARY KEY (oddly_spaced_id);

    --==============================================================

    DECLARE
        @total_rows INT = (SELECT COUNT(1) FROM #test_data td),
        @position INT = 0;

    -- Walk the table in clustered-key order, updating 10,000 rows per batch.
    WHILE @position < @total_rows
    BEGIN
        WITH
            cte_get_batch AS (
                SELECT td.update_val
                FROM
                    #test_data td
                ORDER BY
                    td.oddly_spaced_id
                -- Skip the rows already processed and take the next batch.
                OFFSET @position ROWS FETCH NEXT 10000 ROWS ONLY
                )
        UPDATE gb SET
            gb.update_val = @position
        FROM
            cte_get_batch gb;

        SET @position = @position + 10000;

    END;

    SELECT TOP 20000 * FROM #test_data td ORDER BY td.oddly_spaced_id;
