
Breaking up a data load to prevent log file growth

Posted Monday, December 3, 2012 3:28 PM by capnhector


I'm coming here to get some options for breaking up a data load. The situation is that I have to expand a set of numbers to cover every number from 000 to 999 (1 becomes 1000-1999). There are 258 thousand input numbers, and if I just do a cross apply to the numbers table it causes the log to balloon (a problem, because I don't want to have a "one time shrink" every month). This is being loaded into a VARCHAR(16) column, which is why the data types are CHAR().

The two options I came up with are a cursor with a cross apply to a numbers table (slow, of course) and an OUTPUT from an UPDATE (setting an isProcessed flag) with a cross apply to a numbers table holding 000-999 (faster than the cursor, but there is still a lot of data to crunch).

The sample data only shows what I'm trying to do with the input and has been pared down to only the relevant columns.

IF (OBJECT_ID('tempdb..#Block') IS NOT NULL)
    DROP TABLE #Block;

IF (OBJECT_ID('tempdb..#SampleNumbers') IS NOT NULL)
    DROP TABLE #SampleNumbers;

-- #Block holds the 1,000 three-digit suffixes '000' through '999'.
CREATE TABLE #Block (N CHAR(3) PRIMARY KEY WITH (FILLFACTOR = 100));

-- #SampleNumbers holds the input numbers plus the isProcessed flag used by the batching loop further down.
CREATE TABLE #SampleNumbers (Block CHAR(7) PRIMARY KEY, isProcessed BIT NOT NULL DEFAULT (0));

INSERT INTO #Block (N)
SELECT '000' AS N
UNION ALL
SELECT TOP 999 RIGHT('000' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR), 3)
FROM sys.all_columns;

INSERT INTO #SampleNumbers (Block)
SELECT TopLevel
FROM (VALUES ('1234567'), ('2345678'), ('3456789'), ('4567890')) x (TopLevel);

-- Single-pass expansion: every input number gets all 1,000 suffixes appended.
--INSERT INTO MyTable
SELECT Block + N
FROM #SampleNumbers
CROSS JOIN #Block;

Currently I am using the following, but I'm wondering if there is a better way:

IF (OBJECT_ID('tempdb..#tempSample') IS NOT NULL)
    DROP TABLE #tempSample;

-- Holding table for the current batch of 500 input numbers.
CREATE TABLE #tempSample (Block CHAR(7));
CREATE CLUSTERED INDEX uix_tmpSampleIndex ON #tempSample (Block);

DECLARE @Count INT = 1;

WHILE @Count > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Claim the next 500 unprocessed input numbers and capture them via OUTPUT.
        UPDATE TOP (500) a
        SET isProcessed = 1
        OUTPUT inserted.Block INTO #tempSample
        FROM #SampleNumbers a
        WHERE isProcessed = 0;

        SET @Count = @@ROWCOUNT;

        -- Expand the batch: 500 input numbers x 1,000 suffixes = 500,000 rows.
        --INSERT INTO MyTable (MyCol)
        SELECT Block + N
        FROM #tempSample
        CROSS APPLY #Block;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
    END CATCH;

    TRUNCATE TABLE #tempSample;

    -- Throttle the loop to limit the impact on the production server.
    WAITFOR DELAY '00:00:01';
END;

DROP TABLE #tempSample;

This is a load into our staging database on our production server, and it needs to limit its impact (the reason for the WAITFOR and the transactions). Is there any better way to do this? (I need to work on the error handling so I don't get an infinite loop where, on an error, it keeps processing the same batch and hitting the exact same error. It's better than nothing, though; if it has been running way too long, I know there was an error in the expansion.)
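
One rough way to avoid that infinite loop, sketched against the CATCH block in the loop above (the error message text and the choice to abandon the whole load on the first failure are my own assumptions, not tested code):

BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- Report the failure and stop the loop instead of re-claiming
    -- the same batch and hitting the same error forever.
    DECLARE @ErrMsg NVARCHAR(2048) = ERROR_MESSAGE();
    RAISERROR('Expansion batch failed: %s', 10, 1, @ErrMsg) WITH NOWAIT;
    BREAK;
END CATCH;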



For faster help in answering any problems, please read How to post data/code on a forum to get the best help - Jeff Moden for the best way to ask your question.

For performance issues, see how we like them posted here: How to Post Performance Problems - Gail Shaw

Need to Split some strings? Jeff Moden's DelimitedSplit8K
Jeff Moden's Cross tab and Pivots Part 1
Jeff Moden's Cross tab and Pivots Part 2

Jeremy Oursler
Post #1392165
Posted Tuesday, December 4, 2012 2:48 AM by Vedran Kesegic


The log of which database has grown? What recovery model does it use?
Did you measure which statement cumulatively takes the most time?

Try to avoid the cross apply, and use a @table variable instead of a #temp table with your clever OUTPUT clause. Use a filtered index containing only the unprocessed rows. You could also temporarily switch to the bulk-logged recovery model to minimize logging (read more about that first; it is not as simple as it sounds).
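
For illustration, a rough sketch of the filtered-index idea (this assumes the isProcessed flag sits on #SampleNumbers, as in the sample code, and SQL Server 2008 or later; the index name is just a placeholder):

-- Covers only the rows the batching UPDATE still has to find; it shrinks
-- as rows are flagged, so locating the next TOP (500) stays cheap.
CREATE NONCLUSTERED INDEX ix_SampleNumbers_Unprocessed
    ON #SampleNumbers (Block)
    WHERE isProcessed = 0;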


_____________________________________________________
Microsoft Certified Master: SQL Server 2008
XDetails Addin - for SQL Developers
blog.sqlxdetails.com - Transaction log myths
Post #1392306
Posted Tuesday, December 4, 2012 5:11 AM by capnhector


I should have mentioned that the database is in simple recovery mode, so the transaction log growth is simply from the amount of data in a single transaction.


Post #1392387
Posted Tuesday, December 4, 2012 5:19 AM by Vedran Kesegic


#tempSample is never cleared, so you throw the same rows into the cross apply over and over. Is that intentional?
You also don't need the uix_tmpSampleIndex index.


Post #1392390
Posted Tuesday, December 4, 2012 6:29 AM by capnhector


Vedran Kesegic (12/4/2012)
#tempSample is never cleared, so you throw the same rows into the cross apply over and over. Is that intentional?
You also don't need the uix_tmpSampleIndex index.


#tempSample is truncated right before the WAITFOR. As far as the index on #tempSample goes, I found it to speed things up a little in my testing. To answer the question I missed when I answered from my phone: the log that balloons is the log on our staging database, where this is taking place.

The question is not so much about speed; each iteration of the loop takes 3 seconds to produce 500,000 expanded records, and that includes the WAITFOR. Speed is not the primary issue, as this is only the final step of a data import (I have 250,000 records to expand to 250,000,000, so it takes some time by itself, along with the other data). The question I had was about the method and whether there is anything better to minimize the load on the system. The thing that gets me is the loop, but I think the nature of this beast will require it.

I do thank you for your improvements and will look at implementing them through some testing. I do like the filtered index on isProcessed, which will speed up finding the top 500. I'm also thinking of putting MAXDOP 1 on the queries to force them not to go parallel and take advantage of the same thing that makes the quirky update work.
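
For what it's worth, a sketch of what that might look like on the batch-claiming UPDATE (same objects as the sample code; whether a serial plan actually helps here is something I would have to prove in testing):

UPDATE TOP (500) a
SET isProcessed = 1
OUTPUT inserted.Block INTO #tempSample
FROM #SampleNumbers a
WHERE isProcessed = 0
OPTION (MAXDOP 1);  -- force a serial plan for this statement only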



Post #1392425
Posted Tuesday, December 4, 2012 7:34 AM by Vedran Kesegic


I missed your truncate because the indentation misled me :)
Try a CHECKPOINT command every several batches, because the log cannot be truncated until a checkpoint occurs (it also happens automatically, but here we force it).

Also, you can optimize the transaction log by setting a large enough initial size (e.g. 4000 MB) and growth increment (e.g. 512 MB). Prior to that, shrink the log file to the smallest size possible. That will optimize the number of VLFs in the transaction log.
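
Roughly, something like this (database and logical file names below are placeholders; adjust them to your own):

-- Inside the batching loop, after the COMMIT: force log truncation every
-- few batches (this assumes a hypothetical @BatchNo counter in the loop).
IF @BatchNo % 10 = 0
    CHECKPOINT;

-- One-time log sizing, run separately from the load:
USE StagingDB;                                   -- placeholder name
DBCC SHRINKFILE (StagingDB_log, 1);              -- shrink as small as it will go
ALTER DATABASE StagingDB
    MODIFY FILE (NAME = StagingDB_log, SIZE = 4000MB, FILEGROWTH = 512MB);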


Post #1392484
Posted Tuesday, December 4, 2012 10:19 AM by capnhector


Vedran Kesegic (12/4/2012)
I missed your truncate because the indentation misled me :)
Try a CHECKPOINT command every several batches, because the log cannot be truncated until a checkpoint occurs (it also happens automatically, but here we force it).

Also, you can optimize the transaction log by setting a large enough initial size (e.g. 4000 MB) and growth increment (e.g. 512 MB). Prior to that, shrink the log file to the smallest size possible. That will optimize the number of VLFs in the transaction log.


When we created the database, I grew the log to 50 GB in chunks to get optimal VLF sizes for the type of transactions that take place in the staging database. During the single-pass cross apply it ballooned to 100 GB (in 1 GB increments; I may need to look at the autogrowth settings later). In testing the loop, I did not have any issues with log file expansion without an explicit checkpoint, as SQL Server was checkpointing the log as needed.

I guess from your responses that there is probably not a better way (one avoiding the loop) to do this. It just makes me cringe that I have a loop in production code, but when it's the only choice left, you go with what works.



Post #1392606
Posted Tuesday, December 4, 2012 2:55 PM by Vedran Kesegic


Are you sure you have no open transactions left behind (DBCC OPENTRAN)? What is log_reuse_wait_desc in sys.databases? (I'm not sure about the exact names; I'm writing from memory.)
These are really huge log sizes.
The loop should not bother you, because you are processing data in batches, not row by row. My MVP colleague Dean would ask, "How do you eat an elephant? One meal at a time, until you are finished."
Another thing that comes to mind (besides limiting log growth with checkpoints or the bulk-logged recovery model) is to create an additional log file big enough to fit this operation and drop it afterwards.
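
Something like this, roughly (the database name, logical file name and path are placeholders):

-- Any long-running open transaction holding the log?
DBCC OPENTRAN;

-- Why the log cannot currently be reused:
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'StagingDB';                        -- placeholder name

-- Temporary second log file to absorb the load, removed afterwards
-- (it can only be removed once it holds no active log).
ALTER DATABASE StagingDB
    ADD LOG FILE (NAME = StagingDB_log2,
                  FILENAME = 'D:\SQLLogs\StagingDB_log2.ldf',   -- placeholder path
                  SIZE = 20480MB, FILEGROWTH = 1024MB);
-- ...run the load...
ALTER DATABASE StagingDB REMOVE FILE StagingDB_log2;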


Post #1392712
Posted Tuesday, December 4, 2012 3:29 PM by capnhector


Vedran Kesegic (12/4/2012)
Are you sure you have no open transactions left behind (DBCC OPENTRAN)? What is log_reuse_wait_desc in sys.databases? (I'm not sure about the exact names; I'm writing from memory.)
These are really huge log sizes.
The loop should not bother you, because you are processing data in batches, not row by row. My MVP colleague Dean would ask, "How do you eat an elephant? One meal at a time, until you are finished."
Another thing that comes to mind (besides limiting log growth with checkpoints or the bulk-logged recovery model) is to create an additional log file big enough to fit this operation and drop it afterwards.


The db is in simple recovery; once the load is done we will back it up and it will then sit until our next load. The log_reuse_wait_desc is always either NOTHING or, while we are running the loads, CHECKPOINT. At the time the log ballooned, this was the only open transaction.

Thanks for the validation on the loop. I'll try out the filtered index in QA, and if there is no improvement I'll just leave it as it is, since the log is not growing over the course of the process. I really appreciate the help.



Post #1392733
Posted Tuesday, December 4, 2012 5:15 PM by Jeff Moden


capnhector (12/3/2012)
I'm coming here to get some options for breaking up a data load. The situation is that I have to expand a set of numbers to cover every number from 000 to 999 (1 becomes 1000-1999). There are 258 thousand input numbers, and if I just do a cross apply to the numbers table it causes the log to balloon (a problem, because I don't want to have a "one time shrink" every month). This is being loaded into a VARCHAR(16) column, which is why the data types are CHAR().


I have to ask: why are you creating 1,000 rows of storage for every row of input? What is the business reason for that? I'm asking so I can see if there's an alternative to loading so much data that might not get used.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1392743