Performance improvements of some Data Flow tasks

  • I have a task that is transferring (via a Data Flow) 30,000,000 records from one database to another, with no transformations. I am wondering if I'd be better off using BCP/Bulk Insert, but I am looking to improve the performance of the Data Flow before going down that route.

    Details -

    Each row will only be around 106 bytes.

    Roughly 30,000,000 records, so around 3 GB of data, x 2 (two different tables), plus an 800 MB table of 8,000,000 records.

    The server has 8 GB of RAM, but 5 GB is already reserved for the SQL buffer pool via Max Server Memory, with Lock Pages in Memory switched on.

    Since this was switched on, running the Data Flow has resulted in the message "The Buffer Manager detected that the system was low on Virtual Memory, but was unable to swap out any buffers. 2 were considered and 2 were locked. etc etc" over and over again. This message may have existed before Lock Pages in Memory was turned on, but I have no way of knowing. Admittedly these messages only go on for around 15 minutes, and then the commit phase kicks in, which takes 30 minutes (according to the Diagnostic and OnProgress log entries it seems to run for about 8 hours, from say 3 pm until 11 pm, and then there is the half hour of low-virtual-memory messages). If anyone can help me understand this a little better, that would be a great start.

    What exactly happens when this low memory message occurs? Does it page the data to the specified disk area and use that instead? I'm wondering how it completes if it cannot get enough memory.

    Furthermore, it was on the DefaultBufferMaxRows default of 10,000 and the DefaultBufferSize default of 10,485,760 bytes. On that basis I'm assuming it would only have been passing around 98,000 records per buffer and would therefore require around 306 buffers (see the rough arithmetic sketched at the end of this post). Does it just keep creating buffers until the space runs out? Does it need to fill the buffers completely with all of the records before it writes the data to the destination?

    I am looking to increase performance, and my first step is to set DefaultBufferSize to 100 MB, which should allow around 900,000 records into each buffer. But whether or not this helps will depend, I guess, on the answers above about how many buffers it will try to create!

    So any help would be greatly appreciated :hehe:
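
    For reference, this is the rough arithmetic I am basing those numbers on, as a quick Python sketch. The 106-byte row size is only my estimate (SSIS calculates its own row-size estimate from the column metadata), so treat these as ballpark figures.

        # Rough buffer arithmetic for the figures above (estimates only).
        ROW_BYTES = 106                   # my estimated row size
        ROWS = 30_000_000                 # rows per table
        DEFAULT_BUFFER_SIZE = 10_485_760  # DefaultBufferSize default (10 MB)
        DEFAULT_BUFFER_MAX_ROWS = 10_000  # DefaultBufferMaxRows default

        total_bytes = ROW_BYTES * ROWS
        print(f"Data volume per table: {total_bytes / 1024**3:.2f} GB")       # ~2.96 GB

        # If DefaultBufferSize were the only limit on a buffer:
        rows_by_size = DEFAULT_BUFFER_SIZE // ROW_BYTES
        print(f"Rows per 10 MB buffer by size alone: {rows_by_size}")         # ~98,900
        print(f"Buffers needed at that rate: {ROWS / rows_by_size:.0f}")      # ~300

        # But DefaultBufferMaxRows also caps the rows per buffer:
        rows_per_buffer = min(rows_by_size, DEFAULT_BUFFER_MAX_ROWS)
        print(f"Rows per buffer with the 10,000-row cap: {rows_per_buffer}")  # 10,000
        print(f"Buffers needed in that case: {ROWS // rows_per_buffer}")      # 3,000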

  • In answer to part of my own question: when memory is not available it is supposed to spill buffers to the specified directory (the BufferTempStoragePath property). I have changed this from the system TEMP location to a faster drive. However, I don't see any messages in our logging to suggest this actually happens.

  • Furthermore, the 8 hours seems to be wrong. The Data Flow itself was around an hour and a half (I still want to speed it up, though, and get around the errors).

    The other six and a half hours were the stored proc that populates this data, which is therefore using the 5 GB SQL buffer pool. That's a whole different performance-tuning issue! :crazy:

  • I believe those are warnings, not errors, but they are indicative of memory pressure. You can turn up the buffer sizes; I believe there is a max of around 100 MB. And I agree, 8 hours did seem outrageous. I've transferred 3M rows with complex transforms in less than 10 minutes, so 30M with NO transforms should be easy.

    CEWII

  • I am stuck trying to get my head around this. I want to understand it rather than just try various settings until one combination is quicker.

    So far I have tried transferring my 30 million untransformed rows as follows:

    With all default settings = 42 minutes

    With the buffer size modified to 100 MB and DefaultBufferMaxRows to 900,000 (although watching the row count suggests the server then lowers this to around 400,000 - see footnote at bottom of my post) = 42 minutes

    With the buffer size and rows modified as above, and the final commit size (maximum insert commit size) set to 1,000,000 = 52 minutes! (This seems to start loading the data as soon as it reaches the destination adaptor rather than waiting until all the data is in place.)

    With the buffer size and rows modified as above, and the rows per batch set to 500,000 = 38 minutes.

    HOW does this Rows per batch setting change things / work within the SSIS engine?

    This is specified on the MS site -

    "The data flow engine begins the task of sizing its buffers by calculating the estimated size of a single row of data. Then it multiplies the estimated size of a row by the value of DefaultBufferMaxRows to obtain a preliminary working value for the buffer size.

    If the result is more than the value of DefaultBufferSize, the engine reduces the number of rows.

    If the result is less than the internally-calculated minimum buffer size, the engine increases the number of rows.

    If the result falls between the minimum buffer size and the value of DefaultBufferSize, the engine sizes the buffer as close as possible to the estimated row size times the value of DefaultBufferMaxRows."

    If this is the case, WHY BOTHER setting DefaultBufferMaxRows and DefaultBufferSize at all if the SSIS engine overrides them? I guess they set a maximum that SSIS will not go over, but it will lower the row count to the actual maximum you can fit in the buffer size you have set (see the sketch below).
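
    Here is my reading of that quoted paragraph as a rough Python sketch. This is only my interpretation of the documented rules, not SSIS code: the internal minimum buffer size and the 260-byte row estimate in the last example are assumptions I have made for illustration.

        def estimate_rows_per_buffer(estimated_row_bytes,
                                     default_buffer_max_rows=10_000,
                                     default_buffer_size=10_485_760,
                                     min_buffer_size=65_536):  # assumed internal minimum
            """My reading of the documented buffer-sizing rules (not SSIS source code)."""
            # Start from DefaultBufferMaxRows and the engine's estimated row size.
            working_size = estimated_row_bytes * default_buffer_max_rows

            if working_size > default_buffer_size:
                # Too big: reduce the row count so the buffer fits DefaultBufferSize.
                return default_buffer_size // estimated_row_bytes
            if working_size < min_buffer_size:
                # Too small: increase the row count up to the minimum buffer size.
                return min_buffer_size // estimated_row_bytes
            # Otherwise the buffer is sized at roughly row size x DefaultBufferMaxRows.
            return default_buffer_max_rows

        # With the defaults and a ~106-byte row, DefaultBufferMaxRows (10,000) wins:
        print(estimate_rows_per_buffer(106))                              # 10000

        # With DefaultBufferSize at 100 MB and DefaultBufferMaxRows at 900,000,
        # a 106-byte row would still let all 900,000 rows fit:
        print(estimate_rows_per_buffer(106, 900_000, 100 * 1024 * 1024))  # 900000

        # But if the engine's own row-size estimate were nearer 260 bytes (it works
        # from column metadata, not the actual data), 100 MB trims the count to
        # roughly 400,000 rows, which would explain the counts I am seeing:
        print(estimate_rows_per_buffer(260, 900_000, 100 * 1024 * 1024))  # 403298

    If that reading is right, the two properties are really just ceilings, and the engine's own row-size estimate decides how close it actually gets to them.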

  • Why am I struggling to get my head around this! ARGH

    EDIT: Almost turning this topic into my own blog, answering my own questions! Ha.

    So Rows per batch is just like the rows-per-batch setting when doing a bulk insert. In effect I should just be putting a figure of around 30,000,000 in there to help the engine understand what resources it needs (I think that's right).

    I don't understand why it took longer when splitting the final commit into 1,000,000-row chunks, but I will test some smaller numbers.

    All very interesting ho hum.

  • My tests are really showing me that changing these settings does nothing to improve a straightforward data flow of 30 million records (3 GB of data) on my server. Setting the commit batch size only slows it down. The other settings don't alter the time much. The only slight improvement I've had is setting Rows per batch to 30,000,000 (as a query hint), but I also see improvements if I set it to a false record count of, say, 500,000 or some other value.
