Transaction log and updates on large tables

  • I’ve inherited a query that, on first glance, should be re-written. The main part of it performs an UPDATE on a table (200m+ rows) inside a WHILE loop, using nested SELECTs in a NOT EXISTS. I realise this is quite vague, so I can provide more detail if needed.

    The query has a tendency to fill my transaction log. The db is in the simple recovery model, so stupid question time: other than re-writing the script to update in batches, is there anything I could do to minimise the logging to the transaction log, or is there a way of estimating the correct size of the log for this particular query?

    thanks!
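
    For reference, a minimal sketch of the batched approach the post mentions, with hypothetical table, column, value and batch-size placeholders (the real script would need its own WHERE logic). In the simple recovery model, the log space used by each committed batch can be reused after a checkpoint, so the log stays small instead of growing to hold the whole update:

    -- hypothetical batched rewrite: table, column, value and batch size are placeholders
    declare @rows int = 1;

    while @rows > 0
    begin
        update top (50000) t
        set    t.SomeCol = 'NewValue'
        from   dbo.BigTable t
        where  t.SomeCol <> 'NewValue';   -- only rows not yet updated

        set @rows = @@rowcount;           -- stop once nothing is left to change

        checkpoint;                       -- simple recovery: allows the log space to be reused
    end;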

  • ajans_guernsey wrote:

    (1) Other than re-writing the script to update in batches, is there anything I could do to minimise the logging to the transaction log, or (2) is there a way of estimating the correct size of the log for this particular query?

    thanks!

    (1) (A) Consider data compression, incl. page compression, if not already in use.  (B) Make sure you minimize the number of page splits caused by UPDATEs as best you can.

    (2) Not any easy way that I know of.  Row compression will usually reduce the data logged (since log data is also then row compressed), but that depends on the specific column types being updated.
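
    For illustration only, a minimal sketch of the page-compression suggestion in (1)(A), using a hypothetical table name; sp_estimate_data_compression_savings gives a rough idea of the benefit before committing to a rebuild:

    -- hypothetical table name; estimate the benefit first, then rebuild with page compression
    exec sys.sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'BigTable',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    alter table dbo.BigTable rebuild with (data_compression = page);
    alter index all on dbo.BigTable rebuild with (data_compression = page);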

    Edit: Yes, more details from you would be needed to be any more specific.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • (3) Make sure you update only values which have actually changed.
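
    For illustration, a hedged sketch of (3) against hypothetical tables: filter out the rows where nothing would actually change before the UPDATE touches them.

    -- hypothetical source/target names; skip rows whose values already match
    update t
    set    t.col2 = s.col2,
           t.col3 = s.col3
    from   dbo.Target t
    join   dbo.Source s on s.col1 = t.col1
    where  t.col2 <> s.col2
       or  t.col3 <> s.col3;   -- add IS NULL handling if the columns are nullable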

    _____________
    Code for TallyGenerator

  • (3) SQL does that itself, so that's only a very minor consideration.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    (3) SQL does that itself, so that's only a very minor consideration.

    I might be misinterpreting what you're saying above but, at face value, I have to ask...

    How so? Since when does T-SQL decide that a value in a row containing, for example, the letter "A" is supposed to be updated to an "A"? What do you think there is in T-SQL that says "Oh... that has the same value as what I'm trying to update so I'm not going to update that particular value"?

    Or do you mean something else?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • n/a.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Jeff Moden wrote:

    ScottPletcher wrote:

    (3) SQL does that itself, so that's only a very minor consideration.

    I might be misinterpreting what you're saying above but, at face value, I have to ask...

    How so? Since when does T-SQL decide that a value in a row containing, for example, the letter "A" is supposed to be updated to an "A"? What do you think there is in T-SQL that says "Oh... that has the same value as what I'm trying to update so I'm not going to update that particular value"?

    Or do you mean something else?

    No, I meant exactly that SQL generally does not log updates of a column to the same value.  SQL does not log a modification to the data page, does not modify the page, and thus does not mark page(s) as dirty.

    There are some exceptions:

    if snapshot is enabled, then full logging and page updates are always done: yet another reason I urge you to activate snapshot only if you really, really need it;

    updating a LOB column > 1 page, unless the update is exactly "lob_column = lob_column";

    updating a clustering key column(s), even to the same value.

    I believe SQL still does update the totals in sys.dm_db_index_operational_stats to reflect the "UPDATE", even though no data pages were actually modified / rewritten.

    As to timing, since at least SQL 2005 (perhaps even before, although I think additional bypasses were added for 2005).

    Why would SQL waste time and resources to "update" a column from 'A' to 'A' when a simple comparison could prevent that?
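
    As a rough way to observe this behaviour (a sketch only, against a hypothetical dbo.Demo table with an int column named val, in a user database and ideally on a quiet instance): compare the log activity of an open transaction for a same-value update versus a real change.

    -- hypothetical dbo.Demo(id int primary key, val int); run each block separately
    begin tran;
        update dbo.Demo set val = val;        -- non-updating update
        select dt.database_transaction_log_record_count,
               dt.database_transaction_log_bytes_used
        from   sys.dm_tran_database_transactions dt
        join   sys.dm_tran_session_transactions st on st.transaction_id = dt.transaction_id
        where  st.session_id = @@SPID
          and  dt.database_id = DB_ID();
    rollback;

    begin tran;
        update dbo.Demo set val = val + 1;    -- real change
        select dt.database_transaction_log_record_count,
               dt.database_transaction_log_bytes_used
        from   sys.dm_tran_database_transactions dt
        join   sys.dm_tran_session_transactions st on st.transaction_id = dt.transaction_id
        where  st.session_id = @@SPID
          and  dt.database_id = DB_ID();
    rollback;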

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Scott, could you please cite some authority to that effect? I couldn't find it with some quick googling, so I ran an experiment. The code below shows the exact same CPU times and statistics whether or not the target value is being changed. Not calling you a liar, just trying to understand what is actually happening. I always thought it didn't check the columns before an update because it would cost CPU cycles to do all the compares and usually *something* is going to change.

    if object_id(N'tempdb..#UpdateTest') is not null drop table #UpdateTest
    if object_id(N'tempdb..#Source') is not null drop table #Source

    -- create and populate table to be updated
    create table #UpdateTest(col1 int, col2 int, col3 int)
    insert into #UpdateTest
    select top 20000 N, N+15, N*2
    from dbo.tally -- or vtally

    create unique clustered index pk#UpdateTest on #UpdateTest(col1)

    -- create and populate a source table with identical data
    select *
    into #Source
    from #updateTest

    create unique clustered index pk#Source on #Source(col1)

    -- update using identical values
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = s.col3
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;


    -- update where col3 changes
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = 100
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline wrote:

    The code below shows the exact same CPU times and statistics whether or not the target value is being changed. Not calling you a liar, just trying to understand what is actually happening. I always thought it didn't check the columns before an update because it would cost CPU cycles to do all the compares and usually *something* is going to change.

    if object_id(N'tempdb..#UpdateTest') is not null drop table #UpdateTest
    if object_id(N'tempdb..#Source') is not null drop table #Source

    -- create and populate table to be updated
    create table #UpdateTest(col1 int, col2 int, col3 int)
    insert into #UpdateTest
    select top 20000 N, N+15, N*2
    from dbo.tally -- or vtally

    create unique clustered index pk#UpdateTest on #UpdateTest(col1)

    -- create and populate a source table with identical data
    select *
    into #Source
    from #updateTest

    create unique clustered index pk#Source on #Source(col1)

    -- update using identical values
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = s.col3
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;


    -- update where col3 changes
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = 100
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;

    That matches one of the exceptions I gave earlier:

    Updating any clustering key column(s) does cause I/O.

    Remove the clustering key column from your UPDATEs and try again.
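
    In other words (a paraphrase of the suggestion, reusing the temp tables from the test above), the same-value update with col1 dropped from the SET list:

    -- same-value update again, but the clustering key (col1) is no longer assigned
    update u
    set col2 = s.col2, col3 = s.col3
    from #Source s
    join #UpdateTest u on u.col1 = s.col1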

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • I'll have to check my internal notes for the outside references.  My notes just have the relevant details for me, not where I got them from.

    But I'm rather busy now, so it will be a bit before I can provide any link(s).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Same result. Identical timings and statistics. A potential update to a unique clustered index column does cause additional I/O, but that's because the query plan has to include a SPLIT/SORT/COLLAPSE sequence to deal with the possibility that duplicate keys could be generated. The sort is half the work of the index-update query plan.
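
    For anyone following along, one way to see those operators (a sketch reusing the temp tables from the test above; SHOWPLAN_TEXT has to be set in its own batch, so the GO separators matter):

    set showplan_text on;
    go
    -- the original key-updating statement; look for the Split / Sort / Collapse operators
    update u
    set col1 = s.col1, col2 = s.col2, col3 = s.col3
    from #Source s
    join #UpdateTest u on u.col1 = s.col1;
    go
    set showplan_text off;
    go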

    Don't worry about researching the question.  I can do some more digging over the weekend.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The data still has to be read -- how else would SQL know that the proposed new value was the same as the value in the table itself?

    It's the writes/dirty pages that get avoided.
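
    A rough way to watch that (a sketch only, against the hypothetical dbo.Demo table from earlier rather than the temp tables, since a user database is easier to inspect): flush existing dirty pages with a CHECKPOINT, run the non-updating UPDATE, and then count the pages for the target that are marked modified in the buffer pool.

    checkpoint;   -- write out existing dirty pages first

    update dbo.Demo set val = val;   -- non-updating update

    select count(*) as dirty_pages
    from   sys.dm_os_buffer_descriptors bd
    join   sys.allocation_units au on au.allocation_unit_id = bd.allocation_unit_id
    join   sys.partitions p on p.hobt_id = au.container_id and au.type in (1, 3)
    where  bd.database_id = DB_ID()
      and  bd.is_modified = 1
      and  p.object_id = OBJECT_ID('dbo.Demo');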

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • OK, found one of the links:

    https://www.sql.kiwi/2010/08/the-impact-of-non-updating-updates.html

    I assume Paul White is an authoritative enough source for you 😉.

    I mentioned another source in my ref notes but can't find it.  My internal (tech) notes are in pretty good shape, since I need to use them constantly.  My ref notes, i.e. the original sources/ideas for the internal notes, are disorganized (that's the polite way of putting it).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    Jeff Moden wrote:

    ScottPletcher wrote:

    (3) SQL does that itself, so that's only a very minor consideration.

    I might be misinterpreting what you're saying above but, at face value, I have to ask...

    How so? Since when does T-SQL decide that a value in a row containing, for example, the letter "A" is supposed to be updated to an "A"? What do you think there is in T-SQL that says "Oh... that has the same value as what I'm trying to update so I'm not going to update that particular value"?

    Or do you mean something else?

    No, I meant exactly that SQL generally does not log updates of a column to the same value.  SQL does not log a modification to the data page, does not modify the page, and thus does not mark page(s) as dirty.

    There are some exceptions: if snapshot is enabled, then full logging and page updates are always done (yet another reason I urge you to activate snapshot only if you really, really need it); updating a LOB column > 1 page, unless the update is exactly "lob_column = lob_column"; updating a clustering key column(s), even to the same value.

    I believe SQL still does update the totals in sys.dm_db_index_operational_stats to reflect the "UPDATE", even though no data pages were actually modified / rewritten.

    As to timing, since at least SQL 2005 (perhaps even before, although I think additional bypasses were added for 2005).

    Why would SQL waste time and resources to "update" a column from 'A' to 'A' when a simple comparison could prevent that?

    Thanks, Scott.  I'll check it out.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Paul is certainly a worthwhile authority.   I will study his article later.   I did notice this quote, which was emphasized:

    A clustered table will always produce full logging and dirty buffer pages if (any part of) the clustering key is updated.

    But I also notice that his article didn't cover the use case I created above.    He only tested setting a column to a constant, or setting a column to itself.    It probably holds true in the case of setting a column to the value of a column from a source table, but I will use his methodology to check.

    Thanks.   🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
