Transaction log and updates on large tables

  • I’ve inherited a query that, on first glance, should be re-written. The main part of it performs an UPDATE on a table (200m+ rows) inside a WHILE loop, using nested SELECTs in a NOT EXISTS. I realise this is quite vague, so I can provide more detail if needed.

    The query has a tendency to fill my transaction log. The db is in the simple recovery model, so stupid question time: other than re-writing the script to update in batches, is there anything I could do to minimise the logging to the transaction log, or is there a way of estimating the correct size of the log for this particular query?

    thanks!
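
    For reference, a minimal sketch of the batched approach the post mentions, with hypothetical table, column, value and batch-size placeholders (the real script would need its own WHERE logic). In the simple recovery model, the log space used by each committed batch can be reused after a checkpoint, so the log stays small instead of growing to hold the whole update:

    -- hypothetical batched rewrite: table, column, value and batch size are placeholders
    declare @rows int = 1;

    while @rows > 0
    begin
        update top (50000) t
        set    t.SomeCol = 'NewValue'
        from   dbo.BigTable t
        where  t.SomeCol <> 'NewValue';   -- only rows not yet updated

        set @rows = @@rowcount;           -- stop once nothing is left to change

        checkpoint;                       -- simple recovery: allows the log space to be reused
    end;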

  • ajans_guernsey wrote:

    (1) Other than re-writing the script to update in batches, is there anything I could do to minimise the logging to the transaction log, or (2) is there a way of estimating the correct size of the log for this particular query?

    thanks!

    (1) (A) Consider data compression, incl. page compression, if not already in use.  (B) Make sure you minimize the number of page splits caused by UPDATEs as best you can.

    (2) Not any easy way that I know of.  Row compression will usually reduce the data logged (since log data is also then row compressed), but that depends on the specific column types being updated.
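
    For illustration only, a minimal sketch of the page-compression suggestion in (1)(A), using a hypothetical table name; sp_estimate_data_compression_savings gives a rough idea of the benefit before committing to a rebuild:

    -- hypothetical table name; estimate the benefit first, then rebuild with page compression
    exec sys.sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'BigTable',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    alter table dbo.BigTable rebuild with (data_compression = page);
    alter index all on dbo.BigTable rebuild with (data_compression = page);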

    Edit: Yes, more details from you would be needed to be any more specific.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • (3) Make sure you update only values which have actually changed.
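
    For illustration, a hedged sketch of (3) against hypothetical tables: filter out the rows where nothing would actually change before the UPDATE touches them.

    -- hypothetical source/target names; skip rows whose values already match
    update t
    set    t.col2 = s.col2,
           t.col3 = s.col3
    from   dbo.Target t
    join   dbo.Source s on s.col1 = t.col1
    where  t.col2 <> s.col2
       or  t.col3 <> s.col3;   -- add IS NULL handling if the columns are nullable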

    _____________
    Code for TallyGenerator

  • (3) SQL does that itself, so that's only a very minor consideration.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    (3) SQL does that itself, so that's only a very minor consideration.

    I might be misinterpreting what you're saying above but, at face value, I have to ask...

    How so? Since when does T-SQL decide that a value in a row containing, for example, the letter "A" is supposed to be updated to an "A"? What do you think there is in T-SQL that says "Oh... that has the same value as what I'm trying to update so I'm not going to update that particular value"?

    Or do you mean something else?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • n/a.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Jeff Moden wrote:

    ScottPletcher wrote:

    (3) SQL does that itself, so that's only a very minor consideration.

    I might be misinterpreting what you're saying above but, at face value, I have to ask...

    How so? Since when does T-SQL decide that a value in a row containing, for example, the letter "A" is supposed to be updated to an "A"? What do you think there is in T-SQL that says "Oh... that has the same value as what I'm trying to update so I'm not going to update that particular value"?

    Or do you mean something else?

    No, I meant exactly that SQL generally does not log updates of a column to the same value.  SQL does not log a modification to the data page, does not modify the page, and thus does not mark page(s) as dirty.

    There are some exceptions:

    if snapshot is enabled, then full logging and page updates are always done: yet another reason I urge you to activate snapshot only if you really, really need it;

    updating a LOB column > 1 page, unless the update is exactly "lob_column = lob_column";

    updating a clustering key column(s), even to the same value.

    I believe SQL still does update the totals in sys.dm_db_index_operational_stats to reflect the "UPDATE", even though no data pages were actually modified / rewritten.

    As to timing, since at least SQL 2005 (perhaps even before, although I think additional bypasses were added for 2005).

    Why would SQL waste time and resources to "update" a column from 'A' to 'A' when a simple comparison could prevent that?
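
    As a rough way to observe this behaviour (a sketch only, against a hypothetical dbo.Demo table with an int column named val, in a user database and ideally on a quiet instance): compare the log activity of an open transaction for a same-value update versus a real change.

    -- hypothetical dbo.Demo(id int primary key, val int); run each block separately
    begin tran;
        update dbo.Demo set val = val;        -- non-updating update
        select dt.database_transaction_log_record_count,
               dt.database_transaction_log_bytes_used
        from   sys.dm_tran_database_transactions dt
        join   sys.dm_tran_session_transactions st on st.transaction_id = dt.transaction_id
        where  st.session_id = @@SPID
          and  dt.database_id = DB_ID();
    rollback;

    begin tran;
        update dbo.Demo set val = val + 1;    -- real change
        select dt.database_transaction_log_record_count,
               dt.database_transaction_log_bytes_used
        from   sys.dm_tran_database_transactions dt
        join   sys.dm_tran_session_transactions st on st.transaction_id = dt.transaction_id
        where  st.session_id = @@SPID
          and  dt.database_id = DB_ID();
    rollback;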

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Scott, could you please cite some authority to that effect? I couldn't find it with some quick googling, so I ran an experiment. The code below shows the exact same CPU times and statistics whether or not the target value is being changed. Not calling you a liar, just trying to understand what is actually happening. I always thought it didn't check the columns before an update because it would cost CPU cycles to do all the compares and usually *something* is going to change.

    if object_id(N'tempdb..#UpdateTest') is not null drop table #UpdateTest
    if object_id(N'tempdb..#Source') is not null drop table #Source

    -- create and populate table to be updated
    create table #UpdateTest(col1 int, col2 int, col3 int)
    insert into #UpdateTest
    select top 20000 N, N+15, N*2
    from dbo.tally -- or vtally

    create unique clustered index pk#UpdateTest on #UpdateTest(col1)

    -- create and populate a source table with identical data
    select *
    into #Source
    from #updateTest

    create unique clustered index pk#Source on #Source(col1)

    -- update using identical values
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = s.col3
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;


    -- update where col3 changes
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = 100
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline wrote:

    The code below shows the exact same CPU times and statistics whether or not the target value is being changed. Not calling you a liar, just trying to understand what is actually happening. I always thought it didn't check the columns before an update because it would cost CPU cycles to do all the compares and usually *something* is going to change.

    if object_id(N'tempdb..#UpdateTest') is not null drop table #UpdateTest
    if object_id(N'tempdb..#Source') is not null drop table #Source

    -- create and populate table to be updated
    create table #UpdateTest(col1 int, col2 int, col3 int)
    insert into #UpdateTest
    select top 20000 N, N+15, N*2
    from dbo.tally -- or vtally

    create unique clustered index pk#UpdateTest on #UpdateTest(col1)

    -- create and populate a source table with identical data
    select *
    into #Source
    from #updateTest

    create unique clustered index pk#Source on #Source(col1)

    -- update using identical values
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = s.col3
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;


    -- update where col3 changes
    set statistics time, io on;

    update u
    set col1 = s.col1, col2 = s.col2, col3 = 100
    from #source s
    join #UpdateTest u on u.col1 = s.col1

    set statistics time, io off;

    That matches one of the exceptions I gave earlier:

    Updating any clustering key column(s) does cause I/O.

    Remove the clustering key column from your UPDATEs and try again.
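
    In other words (a paraphrase of the suggestion, reusing the temp tables from the test above), the same-value update with col1 dropped from the SET list:

    -- same-value update again, but the clustering key (col1) is no longer assigned
    update u
    set col2 = s.col2, col3 = s.col3
    from #Source s
    join #UpdateTest u on u.col1 = s.col1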

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • I'll have to check my internal notes for the outside references.  My notes just have the relevant details for me, not where I got them from.

    But I'm rather busy now, so it will be a bit before I can provide any link(s).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Same result. Identical timings and statistics. A potential update to a unique clustered index column does cause additional I/O, but that's because the query plan has to include a SPLIT/SORT/COLLAPSE sequence to deal with the possibility that duplicate keys could be generated. The sort is half the work of the index-update query plan.
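
    For anyone following along, one way to see those operators (a sketch reusing the temp tables from the test above; SHOWPLAN_TEXT has to be set in its own batch, so the GO separators matter):

    set showplan_text on;
    go
    -- the original key-updating statement; look for the Split / Sort / Collapse operators
    update u
    set col1 = s.col1, col2 = s.col2, col3 = s.col3
    from #Source s
    join #UpdateTest u on u.col1 = s.col1;
    go
    set showplan_text off;
    go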

    Don't worry about researching the question.  I can do some more digging over the weekend.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The data still has to be read -- how else would SQL know that the proposed new value was the same as the value in the table itself?

    It's the writes/dirty pages that get avoided.
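
    A rough way to watch that (a sketch only, against the hypothetical dbo.Demo table from earlier rather than the temp tables, since a user database is easier to inspect): flush existing dirty pages with a CHECKPOINT, run the non-updating UPDATE, and then count the pages for the target that are marked modified in the buffer pool.

    checkpoint;   -- write out existing dirty pages first

    update dbo.Demo set val = val;   -- non-updating update

    select count(*) as dirty_pages
    from   sys.dm_os_buffer_descriptors bd
    join   sys.allocation_units au on au.allocation_unit_id = bd.allocation_unit_id
    join   sys.partitions p on p.hobt_id = au.container_id and au.type in (1, 3)
    where  bd.database_id = DB_ID()
      and  bd.is_modified = 1
      and  p.object_id = OBJECT_ID('dbo.Demo');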

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • OK, found one of the links:

    https://www.sql.kiwi/2010/08/the-impact-of-non-updating-updates.html

    I assume Paul White is an authoritative enough source for you 😉.

    I mentioned another source in my ref notes but can't find it.  My internal (tech) notes are in pretty good shape, since I need to use them constantly.  My ref notes, i.e. the original sources/ideas for the internal notes, are disorganized (that's the polite way of putting it).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    Jeff Moden wrote:

    ScottPletcher wrote:

    (3) SQL does that itself, so that's only a very minor consideration.

    I might be misinterpreting what you're saying above but, at face value, I have to ask...

    How so? Since when does T-SQL decide that a value in a row containing, for example, the letter "A" is supposed to be updated to an "A"? What do you think there is in T-SQL that says "Oh... that has the same value as what I'm trying to update so I'm not going to update that particular value"?

    Or do you mean something else?

    No, I meant exactly that SQL generally does not log updates of a column to the same value.  SQL does not log a modification to the data page, does not modify the page, and thus does not mark page(s) as dirty.

    There are some exceptions: if snapshot is enabled, then full logging and page updates are always done (yet another reason I urge you to activate snapshot only if you really, really need it); updating a LOB column > 1 page, unless the update is exactly "lob_column = lob_column"; updating a clustering key column(s), even to the same value.

    I believe SQL still does update the totals in sys.dm_db_index_operational_stats to reflect the "UPDATE", even though no data pages were actually modified / rewritten.

    As to timing, since at least SQL 2005 (perhaps even before, although I think additional bypasses were added for 2005).

    Why would SQL waste time and resources to "update" a column from 'A' to 'A' when a simple comparison could prevent that?

    Thanks, Scott.  I'll check it out.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Paul is certainly a worthwhile authority.   I will study his article later.   I did notice this quote, which was emphasized:

    A clustered table will always produce full logging and dirty buffer pages if (any part of) the clustering key is updated.

    But I also notice that his article didn't cover the use case I created above.    He only tested setting a column to a constant, or setting a column to itself.    It probably holds true in the case of setting a column to the value of a column from a source table, but I will use his methodology to check.

    Thanks.   🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
