Table updates on large table with no ID or large gaps between IDs

  • Hi Folks,
    I have a requirement to update a field on every record in a series of large tables (> 30M rows).  Unfortunately, these tables don't necessarily have a reliable ID field that I can use, or if they do, the IDs have big gaps in them.
    Initially, my code was:

    DECLARE @BatchCountMin INT = 0,
        @BatchCountMax INT = (SELECT MIN(ID) FROM Big_table),
        @BatchSize INT = 10000,
        @MaxID INT = (SELECT MAX(ID) FROM Big_table);
    WHILE @BatchCountMax <= @MaxID
    BEGIN
        SELECT @BatchCountMin = @BatchCountMax,
            @BatchCountMax = @BatchCountMax + @BatchSize
        BEGIN TRAN
        UPDATE Big_table
        SET Big_table.TextToUpdate = <SomeStringFunction On TextToUpdate>
        WHERE Big_table.ID >= @BatchCountMin AND Big_table.ID < @BatchCountMax
        COMMIT
    END

    Because there are big gaps in the IDs, the query spins its wheels for long periods as it works through batches of 10,000 IDs that contain few or no actual rows.

    There are also tables where I don't have an ID, or a single field that I can use as an ID (Don't shoot me!  I inherited this mess).

    Considering I have to update each row anyway, are there any methods I could use to batch the updates without having to rely on some form of identifier, short of dumping the data into a temp table of some sort?

    Cheers

  • Quick thought: you could use either the ROW_NUMBER or the DENSE_RANK function for this; just wrap the initial query in a CTE.
    😎
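    As a rough illustration of the ROW_NUMBER() idea (only a sketch; it assumes the table is dbo.Big_table with columns ID and TextToUpdate, and UPPER() stands in for whatever string change is actually needed, so an already-converted row can be recognised and skipped on the next pass):

    DECLARE @BatchSize INT = 10000,
            @RowsAffected INT = 1;

    WHILE @RowsAffected > 0
    BEGIN
        WITH numbered AS (
            SELECT TextToUpdate,
                   ROW_NUMBER() OVER (ORDER BY ID) AS rn
            FROM dbo.Big_table
            WHERE TextToUpdate <> UPPER(TextToUpdate)  -- placeholder test for "not yet updated"
        )
        UPDATE numbered
        SET TextToUpdate = UPPER(TextToUpdate)         -- placeholder for the real string function
        WHERE rn <= @BatchSize;

        SET @RowsAffected = @@ROWCOUNT;
    END;

    Because ROW_NUMBER() is recalculated on every pass, the WHERE clause has to exclude rows that are already converted; otherwise the loop would keep reprocessing the same first batch.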

  • If you want to update 10,000 rows at a time, you can use UPDATE TOP (10000) in your loop, driving the batch order with ORDER BY ID (through a CTE or derived table, since UPDATE TOP on its own doesn't honor ORDER BY).  You'll probably need an OUTPUT clause so that you know when to stop!
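    A rough sketch of that pattern (only an illustration; it assumes an int ID column and a TextToUpdate column on dbo.Big_table, with UPPER() standing in for the real string function):

    DECLARE @BatchSize INT = 10000,
            @LastID    INT = -2147483648;       -- start below the smallest possible ID

    DECLARE @Touched TABLE (ID INT NOT NULL);

    WHILE 1 = 1
    BEGIN
        DELETE FROM @Touched;

        WITH batch AS (
            SELECT TOP (@BatchSize) ID, TextToUpdate
            FROM dbo.Big_table
            WHERE ID > @LastID
            ORDER BY ID
        )
        UPDATE batch
        SET TextToUpdate = UPPER(TextToUpdate)  -- placeholder for the real string function
        OUTPUT inserted.ID INTO @Touched (ID);

        IF @@ROWCOUNT = 0
            BREAK;                              -- no rows left: stop

        SELECT @LastID = MAX(ID) FROM @Touched; -- resume after the last ID processed
    END;

    The OUTPUT clause records the IDs actually touched, so @LastID jumps straight past any gaps, and the loop ends as soon as a batch comes back empty.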

    John

  • Look at ORDER BY with OFFSET FETCH

  • Thanks for your feedback.  I'll have a play and see what works.

    Cheers

  • Go by the values in the clustered index.  For example, if the table is clustered on a datetime column, use that column's values to drive the UPDATE batches.

    Here's a skeleton script that lets you specify a starting value and a maximum value, the batch size for each update, and the total number of rows to update.


    SET ANSI_NULLS ON;
    SET ANSI_PADDING ON;
    SET ANSI_WARNINGS ON;
    SET ARITHABORT ON;
    SET CONCAT_NULL_YIELDS_NULL ON;
    SET NOCOUNT OFF;
    SET NUMERIC_ROUNDABORT OFF;
    SET QUOTED_IDENTIFIER ON;
    SET XACT_ABORT ON;

    DECLARE @batch_size_of_update int;
    DECLARE @datetime datetime;
    DECLARE @max_datetime datetime;
    DECLARE @min_datetime datetime;
    DECLARE @rows_updated int;
    DECLARE @total_rows_updated int;
    DECLARE @total_rows_to_update int;

    SET @batch_size_of_update = 20000;
    SET @total_rows_to_update = 200000;
    SET @min_datetime = '19000101';
    SET @max_datetime = '20171231';
    SET @rows_updated = 0;
    SET @total_rows_updated = 0;

    PRINT 'Start time = ' + CONVERT(varchar(30), GETDATE(), 0) + '.';

    WHILE 1 = 1
    BEGIN
      -- Find the datetime value at the end of the next batch: the highest value
      -- among the next @batch_size_of_update rows, taken in clustered-index order.
      SELECT TOP (@batch_size_of_update) @datetime = datetime
      FROM dbo.table_name
      WHERE datetime >= @min_datetime AND
        datetime <= @max_datetime
      ORDER BY datetime;

      -- Update everything from the current lower bound up to that batch-end value.
      UPDATE TOP (@batch_size_of_update) tn
      SET whatever = 'whatever'
      FROM dbo.table_name tn
      WHERE datetime >= @min_datetime AND
        datetime <= @datetime;
      SET @rows_updated = @@ROWCOUNT;

      IF @rows_updated = 0
        BREAK; -- nothing left to update

      SET @total_rows_updated = @total_rows_updated + @rows_updated;
      PRINT 'Total Rows updated = ' + CAST(@total_rows_updated AS varchar(10)) + '.';
      PRINT 'Last datetime updated = ' + CONVERT(varchar(30), @datetime, 120) + '.';

      IF @total_rows_updated >= @total_rows_to_update
        BREAK; -- reached the overall row limit for this run

      -- The next batch starts at the last datetime processed; using >= keeps rows
      -- that share that boundary value from being skipped (they get re-touched).
      SET @min_datetime = @datetime;

      -- Checkpoint periodically so the log can be reused between batches.
      IF @total_rows_updated % 200000 = 0
      BEGIN
        CHECKPOINT;
      END; --IF

      -- Throttle larger batches to give other activity room to run.
      IF @batch_size_of_update >= 10000
      BEGIN
        IF @batch_size_of_update <= 25000
          WAITFOR DELAY '00:00:00.125';
        ELSE
        IF @batch_size_of_update <= 50000
          WAITFOR DELAY '00:00:00.250';
        ELSE
          WAITFOR DELAY '00:00:00.500';
      END /*IF*/
    END --WHILE

    CHECKPOINT;

    PRINT '';
    PRINT 'End time = ' + CONVERT(varchar(30), GETDATE(), 0) + '.';

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • Would this not benefit from using a tally table? It would save the loops, and missing IDs could simply be omitted.

    ...

  • HappyGeek - Saturday, December 23, 2017 6:27 AM

    Would this not benefit from using a tally table? It would save the loops and missing IDs could be simply omitted?

    Not a Tally Table per se, but a Tally-like structure using ROW_NUMBER() OVER, and that's already been suggested.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This seems like a good use case for the OFFSET clause... Something along these lines.
    Note: because you're sorting on the clustered key, there should be no sort operation in the execution plan.
    IF OBJECT_ID('tempdb..#test_data', 'U') IS NOT NULL
    BEGIN DROP TABLE #test_data; END;

    CREATE TABLE #test_data (
        oddly_spaced_id BIGINT NOT NULL,
        update_val INT NULL
        );

    -- Build up to 1,000,000 irregularly spaced ID values using a cascading tally CTE.
    WITH
        cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
        cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
        cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
        cte_Tally (n) AS (
            SELECT TOP (1000000)
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM
                cte_n3 a CROSS JOIN cte_n3 b
            )
    INSERT #test_data (oddly_spaced_id)
    SELECT DISTINCT
        ABS(CHECKSUM(NEWID()) % t.n) * t.n
    FROM
        cte_Tally t;

    ALTER TABLE #test_data ADD PRIMARY KEY (oddly_spaced_id);

    --==============================================================

    DECLARE
        @total_rows INT = (SELECT COUNT(1) FROM #test_data td),
        @position INT = 0;

    -- Walk the table in clustered-key order, updating 10,000 rows per batch.
    WHILE @position < @total_rows
    BEGIN
        WITH
            cte_get_batch AS (
                SELECT td.update_val
                FROM
                    #test_data td
                ORDER BY
                    td.oddly_spaced_id
                -- Skip the rows already processed and take the next batch.
                OFFSET @position ROWS FETCH NEXT 10000 ROWS ONLY
                )
        UPDATE gb SET
            gb.update_val = @position
        FROM
            cte_get_batch gb;

        SET @position = @position + 10000;

    END;

    SELECT TOP 20000 * FROM #test_data td ORDER BY td.oddly_spaced_id;
