
Breaking up a data load to prevent log file growth

Posted Monday, December 3, 2012 3:28 PM by capnhector


I'm coming here to get some options for breaking up a data load. The situation is that I have to expand a set of numbers to cover every number from 000 to 999 (1 becomes 1000-1999). There are 258 thousand input numbers, and if I just do a cross apply to the numbers table it causes the log to balloon (a problem, because I don't want to have a "one time shrink" every month). This is being loaded into a VARCHAR(16) column, which is why the data types are CHAR().

The two options I came up with are a cursor with a cross apply to a numbers table (slow, of course) and an OUTPUT from an UPDATE (setting an isProcessed flag) with a cross apply to a numbers table holding 000-999 (faster than the cursor, but there is still a lot of data to crunch).

The sample data only shows what I'm trying to do with the input and has been pared down to only the relevant columns.

IF (OBJECT_ID('tempdb..#Block') IS NOT NULL)
    DROP TABLE #Block;

IF (OBJECT_ID('tempdb..#SampleNumbers') IS NOT NULL)
    DROP TABLE #SampleNumbers;

-- #Block holds the 1,000 three-digit suffixes '000' through '999'.
CREATE TABLE #Block (N CHAR(3) PRIMARY KEY WITH (FILLFACTOR = 100));

-- #SampleNumbers holds the input numbers plus the isProcessed flag used by the batching loop further down.
CREATE TABLE #SampleNumbers (Block CHAR(7) PRIMARY KEY, isProcessed BIT NOT NULL DEFAULT (0));

INSERT INTO #Block (N)
SELECT '000' AS N
UNION ALL
SELECT TOP 999 RIGHT('000' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR), 3)
FROM sys.all_columns;

INSERT INTO #SampleNumbers (Block)
SELECT TopLevel
FROM (VALUES ('1234567'), ('2345678'), ('3456789'), ('4567890')) x (TopLevel);

-- Single-pass expansion: every input number gets all 1,000 suffixes appended.
--INSERT INTO MyTable
SELECT Block + N
FROM #SampleNumbers
CROSS JOIN #Block;

Currently I am using the following, but I'm wondering if there is a better way:

IF (OBJECT_ID('tempdb..#tempSample') IS NOT NULL)
    DROP TABLE #tempSample;

-- Holding table for the current batch of 500 input numbers.
CREATE TABLE #tempSample (Block CHAR(7));
CREATE CLUSTERED INDEX uix_tmpSampleIndex ON #tempSample (Block);

DECLARE @Count INT = 1;

WHILE @Count > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Claim the next 500 unprocessed input numbers and capture them via OUTPUT.
        UPDATE TOP (500) a
        SET isProcessed = 1
        OUTPUT inserted.Block INTO #tempSample
        FROM #SampleNumbers a
        WHERE isProcessed = 0;

        SET @Count = @@ROWCOUNT;

        -- Expand the batch: 500 input numbers x 1,000 suffixes = 500,000 rows.
        --INSERT INTO MyTable (MyCol)
        SELECT Block + N
        FROM #tempSample
        CROSS APPLY #Block;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
    END CATCH;

    TRUNCATE TABLE #tempSample;

    -- Throttle the loop to limit the impact on the production server.
    WAITFOR DELAY '00:00:01';
END;

DROP TABLE #tempSample;

This is a load into our staging database on our production server, and it needs to limit its impact (the reason for the WAITFOR and the transactions). Is there any better way to do this? (I need to work on the error handling so I don't get an infinite loop where, on an error, it keeps processing the same batch and hitting the exact same error. It's better than nothing, though; if it has been running way too long, I know there was an error in the expansion.)
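
One rough way to avoid that infinite loop, sketched against the CATCH block in the loop above (the error message text and the choice to abandon the whole load on the first failure are my own assumptions, not tested code):

BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- Report the failure and stop the loop instead of re-claiming
    -- the same batch and hitting the same error forever.
    DECLARE @ErrMsg NVARCHAR(2048) = ERROR_MESSAGE();
    RAISERROR('Expansion batch failed: %s', 10, 1, @ErrMsg) WITH NOWAIT;
    BREAK;
END CATCH;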



For faster help in answering any problems, please read How to post data/code on a forum to get the best help - Jeff Moden for the best way to ask your question.

For performance issues, see how we like them posted here: How to Post Performance Problems - Gail Shaw

Need to Split some strings? Jeff Moden's DelimitedSplit8K
Jeff Moden's Cross tab and Pivots Part 1
Jeff Moden's Cross tab and Pivots Part 2

Jeremy Oursler
Post #1392165
Posted Tuesday, December 4, 2012 2:48 AM by Vedran Kesegic


The log of which database has grown? What recovery model does it use?
Did you measure which statement cumulatively takes the most time?

Try to avoid the cross apply, and use a @table variable instead of a #temp table with your clever OUTPUT clause. Use a filtered index containing only the unprocessed rows. You could also temporarily switch to the bulk-logged recovery model to minimize logging (read more about that first; it is not as simple as it sounds).
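
For illustration, a rough sketch of the filtered-index idea (this assumes the isProcessed flag sits on #SampleNumbers, as in the sample code, and SQL Server 2008 or later; the index name is just a placeholder):

-- Covers only the rows the batching UPDATE still has to find; it shrinks
-- as rows are flagged, so locating the next TOP (500) stays cheap.
CREATE NONCLUSTERED INDEX ix_SampleNumbers_Unprocessed
    ON #SampleNumbers (Block)
    WHERE isProcessed = 0;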


_____________________________________________________
Microsoft Certified Master: SQL Server 2008
XDetails Addin - for SQL Developers
blog.sqlxdetails.com - Transaction log myths
Post #1392306
Posted Tuesday, December 4, 2012 5:11 AM by capnhector


I should have mentioned that the database is in simple recovery mode, so the transaction log growth is simply from the amount of data in a single transaction.


Post #1392387
Posted Tuesday, December 4, 2012 5:19 AM by Vedran Kesegic


#tempSample is never cleared, so you throw the same rows into the cross apply over and over. Is that intentional?
You also don't need the uix_tmpSampleIndex index.


Post #1392390
Posted Tuesday, December 4, 2012 6:29 AM by capnhector


Vedran Kesegic (12/4/2012)
#tempSample is never cleared, so you throw the same rows into the cross apply over and over. Is that intentional?
You also don't need the uix_tmpSampleIndex index.


#tempSample is truncated right before the WAITFOR. As far as the index on #tempSample goes, I found it to speed things up a little in my testing. To answer the question I missed when I answered from my phone: the log that balloons is the log on our staging database, where this is taking place.

The question is not so much about speed; each iteration of the loop takes 3 seconds to produce 500,000 expanded records, and that includes the WAITFOR. Speed is not the primary issue, as this is only the final step of a data import (I have 250,000 records to expand to 250,000,000, so it takes some time by itself, along with the other data). The question I had was about the method and whether there is anything better to minimize the load on the system. The thing that gets me is the loop, but I think the nature of this beast will require it.

I do thank you for your improvements and will look at implementing them through some testing. I do like the filtered index on isProcessed, which will speed up finding the top 500. I'm also thinking of putting MAXDOP 1 on the queries to force them not to go parallel and take advantage of the same thing that makes the quirky update work.
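
For what it's worth, a sketch of what that might look like on the batch-claiming UPDATE (same objects as the sample code; whether a serial plan actually helps here is something I would have to prove in testing):

UPDATE TOP (500) a
SET isProcessed = 1
OUTPUT inserted.Block INTO #tempSample
FROM #SampleNumbers a
WHERE isProcessed = 0
OPTION (MAXDOP 1);  -- force a serial plan for this statement only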



Post #1392425
Posted Tuesday, December 4, 2012 7:34 AM by Vedran Kesegic


I missed your truncate because the indentation misled me :)
Try a CHECKPOINT command every several batches, because the log cannot be truncated until a checkpoint occurs (it also happens automatically, but here we force it).

Also, you can optimize the transaction log by setting a large enough initial size (e.g. 4000 MB) and growth increment (e.g. 512 MB). Prior to that, shrink the log file to the smallest size possible. That will optimize the number of VLFs in the transaction log.
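
Roughly, something like this (database and logical file names below are placeholders; adjust them to your own):

-- Inside the batching loop, after the COMMIT: force log truncation every
-- few batches (this assumes a hypothetical @BatchNo counter in the loop).
IF @BatchNo % 10 = 0
    CHECKPOINT;

-- One-time log sizing, run separately from the load:
USE StagingDB;                                   -- placeholder name
DBCC SHRINKFILE (StagingDB_log, 1);              -- shrink as small as it will go
ALTER DATABASE StagingDB
    MODIFY FILE (NAME = StagingDB_log, SIZE = 4000MB, FILEGROWTH = 512MB);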


Post #1392484
Posted Tuesday, December 4, 2012 10:19 AM by capnhector


Vedran Kesegic (12/4/2012)
I missed your truncate because the indentation misled me :)
Try a CHECKPOINT command every several batches, because the log cannot be truncated until a checkpoint occurs (it also happens automatically, but here we force it).

Also, you can optimize the transaction log by setting a large enough initial size (e.g. 4000 MB) and growth increment (e.g. 512 MB). Prior to that, shrink the log file to the smallest size possible. That will optimize the number of VLFs in the transaction log.


When we created the database, I grew the log to 50 GB in chunks to get optimal VLF sizes for the type of transactions that take place in the staging database. During the single-pass cross apply it ballooned to 100 GB (in 1 GB increments; I may need to look at the autogrowth settings later). In testing the loop, I did not have any issues with log file expansion without an explicit checkpoint, as SQL Server was checkpointing the log as needed.

I guess from your responses that there is probably not a better way (one avoiding the loop) to do this. It just makes me cringe that I have a loop in production code, but when it's the only choice left, you go with what works.



Post #1392606
Posted Tuesday, December 4, 2012 2:55 PM by Vedran Kesegic


Are you sure you have no open transactions left behind (DBCC OPENTRAN)? What is log_reuse_wait_desc in sys.databases? (I'm not sure about the exact names; I'm writing from memory.)
These are really huge log sizes.
The loop should not bother you, because you are processing data in batches, not row by row. My MVP colleague Dean would ask, "How do you eat an elephant? One meal at a time, until you are finished."
Another thing that comes to mind (besides limiting log growth with checkpoints or the bulk-logged recovery model) is to create an additional log file big enough to fit this operation and drop it afterwards.
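
Something like this, roughly (the database name, logical file name and path are placeholders):

-- Any long-running open transaction holding the log?
DBCC OPENTRAN;

-- Why the log cannot currently be reused:
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'StagingDB';                        -- placeholder name

-- Temporary second log file to absorb the load, removed afterwards
-- (it can only be removed once it holds no active log).
ALTER DATABASE StagingDB
    ADD LOG FILE (NAME = StagingDB_log2,
                  FILENAME = 'D:\SQLLogs\StagingDB_log2.ldf',   -- placeholder path
                  SIZE = 20480MB, FILEGROWTH = 1024MB);
-- ...run the load...
ALTER DATABASE StagingDB REMOVE FILE StagingDB_log2;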


Post #1392712
Posted Tuesday, December 4, 2012 3:29 PM by capnhector


Vedran Kesegic (12/4/2012)
Are you sure you have no open transactions left behind (DBCC OPENTRAN)? What is log_reuse_wait_desc in sys.databases? (I'm not sure about the exact names; I'm writing from memory.)
These are really huge log sizes.
The loop should not bother you, because you are processing data in batches, not row by row. My MVP colleague Dean would ask, "How do you eat an elephant? One meal at a time, until you are finished."
Another thing that comes to mind (besides limiting log growth with checkpoints or the bulk-logged recovery model) is to create an additional log file big enough to fit this operation and drop it afterwards.


The db is in simple recovery; once the load is done we will back it up and it will then sit until our next load. The log_reuse_wait_desc is always either NOTHING or, while we are running the loads, CHECKPOINT. At the time the log ballooned, this was the only open transaction.

Thanks for the validation on the loop. I'll try out the filtered index in QA, and if there is no improvement I'll just leave it as it is, since the log is not growing over the course of the process. I really appreciate the help.



Post #1392733
Posted Tuesday, December 4, 2012 5:15 PM by Jeff Moden


capnhector (12/3/2012)
I'm coming here to get some options for breaking up a data load. The situation is that I have to expand a set of numbers to cover every number from 000 to 999 (1 becomes 1000-1999). There are 258 thousand input numbers, and if I just do a cross apply to the numbers table it causes the log to balloon (a problem, because I don't want to have a "one time shrink" every month). This is being loaded into a VARCHAR(16) column, which is why the data types are CHAR().


I have to ask: why are you creating 1,000 rows of storage for every row of input? What is the business reason for that? I'm asking so I can see if there's an alternative to loading so much data that might not get used.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1392743