How can I speed up my data flow task?



mz1derful
SSC Veteran
Group: General Forum Members
Points: 262 Visits: 236
I have a data flow task with two OLE DB sources running SQL Server queries that together return 392,319,750 records, and I am loading them into a table in a database on a different SQL Server. The destination is also OLE DB, and the package runs on yet another SQL Server. Table locks are off, and both the batch size and the maximum insert commit size are set to 100,000. The load is literally taking days. I have also tried a bulk load from a SQL statement, and that takes too long as well. I have read everything I can find about improving performance, but nothing seems to help. There are four nonclustered indexes on the receiving table; if I remove them, performance should improve, correct? And can I remove them without interrupting the data flow that is currently running, and will that improve the current process's performance?

All suggestions are welcome. Thank you.

Karen
mz1derful
SSC Veteran
Group: General Forum Members
Points: 262 Visits: 236
Well, I answered one of my questions! I dropped the indexes, and the data flow is still running, but much, much faster! I am still open to more suggestions, though.
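For reference, a minimal T-SQL sketch of that pattern (the target table name here is hypothetical): disable just the nonclustered indexes so their definitions are kept and nothing has to be re-scripted, run the load, then rebuild.

-- Generate DISABLE statements for the nonclustered indexes only.
SELECT 'ALTER INDEX ' + QUOTENAME(i.name)
     + ' ON ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ' DISABLE;'
FROM sys.indexes AS i
JOIN sys.tables  AS t ON t.object_id = i.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE t.name = N'TargetTable'        -- hypothetical target table
  AND i.type_desc = N'NONCLUSTERED';

-- ... run the data flow ...

-- Rebuilding re-enables the indexes in one pass.
ALTER INDEX ALL ON dbo.TargetTable REBUILD;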

Karen
Lempster
SSCertifiable
Group: General Forum Members
Points: 5192 Visits: 1657
You could try adjusting the values for DefaultBufferMaxRows, DefaultBufferSize and EngineThreads, but it's something of a black art, I've found. Also, minimise the number of 'blocking' tasks in your package (sorts, for example) that have to wait for the entire data set to flow through them before any rows can be passed to downstream components.
If you have a way of dividing the total data set into equal chunks, you could run multiple Data Flow Tasks in parallel. Oh, and look at increasing the network packet size too.

Have a look at this TechNet article

Regards
Lempster
Daniel Bowlin
SSCrazy Eights
Group: General Forum Members
Points: 8222 Visits: 2629
My first suggestion was going to be to drop the indexes on the target, so congrats on figuring that one out. Since you have two different sources, you might find it faster to run two separate Data Flow Tasks, one for each source. This may cause some locking contention on the target table; if you can partition the target table along differences between the two sources, you should be able to work around the contention issues. If not, I have also used a separate staging target (no indexes) for each source and then used an Execute SQL Task to assemble the two staging tables into the final target table (see the sketch below). Even with a single source, you might find a way to split it up by date or something similar and try this same approach.
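A rough sketch of what that Execute SQL Task assembly step could look like, assuming hypothetical staging and target table names; TABLOCK allows minimally logged inserts when the recovery model permits it.

-- Assemble two heap staging tables into the final target.
INSERT INTO dbo.TargetTable WITH (TABLOCK)
       (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 FROM dbo.Staging_SourceA
UNION ALL
SELECT Col1, Col2, Col3 FROM dbo.Staging_SourceB;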

A little experimentation is in order here because there are a number of variables, but if you continue to experiment with indexes, partitions and/or staging tables, I expect you can decrease your load times substantially. When I have experimented along these lines, I have been able to drop load times from over 30 hours to under 5 hours.
Evil Kraig F
SSC-Insane
Group: General Forum Members
Points: 21493 Visits: 7660
Start on the target, and find the spid for the SSIS connection that is running the load. Try to determine its most common wait type. If it's PAGEIOLATCH or another variant of disk waits, you're fine there; if it's something else, you need to investigate that. Also, the target database is either in SIMPLE recovery or has its log file pre-grown to the size you need, right? You don't want to be auto-growing the log in the middle of a load that large.
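Checking (and, if it fits your backup strategy, changing) the recovery model is a quick query; the database name here is hypothetical, and switching to SIMPLE breaks the log backup chain, so think that through first.

-- Check the target database's recovery model.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'TargetDB';

-- Switch to SIMPLE for the duration of the load, if appropriate.
ALTER DATABASE TargetDB SET RECOVERY SIMPLE;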

My usual technique for that is simply to poll sys.sysprocesses once per second for 2 or 3 minutes, dumping the results to a table, and then take a look at what came back.
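Something along these lines, sketched with a temp table; the spid filter is an assumption, so adjust it to target the session actually running the load.

-- Sample sys.sysprocesses once per second for ~3 minutes.
IF OBJECT_ID('tempdb..#waits') IS NOT NULL DROP TABLE #waits;

SELECT TOP (0) GETDATE() AS sample_time, spid, lastwaittype, waittime, cmd
INTO #waits
FROM sys.sysprocesses;

DECLARE @i INT = 0;
WHILE @i < 180
BEGIN
    INSERT INTO #waits (sample_time, spid, lastwaittype, waittime, cmd)
    SELECT GETDATE(), spid, lastwaittype, waittime, cmd
    FROM sys.sysprocesses
    WHERE spid > 50;               -- user sessions only (assumption)
    WAITFOR DELAY '00:00:01';
    SET @i += 1;
END;

-- Most common wait types observed during the sample window.
SELECT lastwaittype, COUNT(*) AS samples
FROM #waits
GROUP BY lastwaittype
ORDER BY samples DESC;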

Why are you loading two sources at once, and are you merging the streams prior to insertion? If you're running them in parallel, the two streams are contending with each other for the target.

Otherwise I'd really have to dig into details: what the SAN target is, the width of the stream, whether any implicit conversions are happening during delivery, and so on. For something that large I always do dev testing with a smaller result set from the source, say 2 million rows (filter it); at least that gives me a generic baseline expectation.


- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
JustMarie
SSCrazy
Group: General Forum Members
Points: 2211 Visits: 1362
When I'm doing large data transfers, I try to find a way to break them up into sets of no more than 30 million rows each. I'll have them all in one data flow, but running 10 or 15 at the same time speeds things up nicely.
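One way to carve out those key ranges, assuming an integer key and hypothetical table/column names; the planning query scans the table once, so run it ahead of time and feed each resulting range into its own source query's WHERE clause.

-- Compute contiguous ~30M-row slices of the source.
DECLARE @rows_per_slice BIGINT = 30000000;

SELECT slice,
       MIN(Id) AS range_start,
       MAX(Id) AS range_end
FROM (
    SELECT Id,
           (ROW_NUMBER() OVER (ORDER BY Id) - 1) / @rows_per_slice AS slice
    FROM dbo.BigSourceTable
) AS s
GROUP BY slice
ORDER BY slice;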

Dropping the rows into a heap and applying the constraints/indexes later also speeds things up, as you found out. :-D
onejbsmith
Grasshopper
Group: General Forum Members
Points: 10 Visits: 27
Try the SQL Bulk Copy Component from www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

Ciao,
jbs
CozyRoc
SSCertifiable
Group: General Forum Members
Points: 6940 Visits: 2235
onejbsmith (2/23/2014)
Try the SQL Bulk Copy Component from www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

Ciao,
jbs


You can use the free Balanced Data Distributor component to accomplish bulk-load parallelism.

Karen,

I would suggest an iterative approach: start by finding your reading speed. Set up a dummy destination, such as a Union All component, and run the package to establish your baseline read performance. If the reading speed is good, then you can start to concentrate on your destination. Also, from your description it sounds like you have three servers involved: the source server, the processing server where the package executes, and the destination server. The data has to make two hops before it lands in your destination. If you can make your package execute on the destination server, I think you will see some performance improvement.

---
SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

