How can I speed up my data flow task?
Posted Tuesday, February 18, 2014 6:08 AM
SSC Rookie

I have a data flow task with two OLE DB data sources, fed by SQL Server queries that together return 392,319,750 records, and I am loading them into a table in another database on a different SQL Server. The destination is also OLE DB, and the package runs on yet another SQL Server. Table locks are off, and both the rows-per-batch and maximum insert commit size are set to 100,000. The load is literally taking days. I have also tried a bulk load from a SQL statement and that takes too long as well. I have read everything I can find about improving performance, but nothing seems to help. There are four non-clustered indexes on the receiving table; if I remove them, that should improve performance, correct? Can I remove them without interrupting the data flow that is currently running, and will it improve the current process's performance?

All suggestions are welcome. Thank you.

Karen
Post #1542520
Posted Tuesday, February 18, 2014 6:44 AM
SSC Rookie

Well, I answered one of my questions! I dropped the indexes and the data flow is still running, but much, much faster! I am still open to more suggestions though.
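
For anyone finding this later, the drop-and-recreate pattern looks roughly like this (the table and index names below are placeholders, not the real ones):

-- Run against the destination database before (or while) the load runs.
DROP INDEX IX_TargetTable_Col1 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col2 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col3 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col4 ON dbo.TargetTable;

-- Recreate them once the data flow has finished (key columns are illustrative).
CREATE NONCLUSTERED INDEX IX_TargetTable_Col1 ON dbo.TargetTable (Col1);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col2 ON dbo.TargetTable (Col2);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col3 ON dbo.TargetTable (Col3);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col4 ON dbo.TargetTable (Col4);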

Karen
Post #1542539
Posted Tuesday, February 18, 2014 8:27 AM
SSCrazy

You could try adjusting the values for DefaultBufferMaxRows, DefaultBufferSize and EngineThreads, but it's something of a black art, I've found. Also, minimise the number of 'blocking' transformations in your package (sorts, for example) that have to wait for the entire data set to flow through them before any rows can be passed to downstream components.
If you have a way of dividing the total data set into roughly equal chunks, you could run multiple Data Flow Tasks in parallel; see the sketch below. Oh, and look at increasing the network packet size on the connection managers too.
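
As an illustration of the chunking idea, each parallel Data Flow Task's source query might look something like this (the table, key column and boundary values are invented for the example):

-- Source query for the first parallel Data Flow Task
SELECT *
FROM dbo.SourceTable
WHERE Id >= 1 AND Id < 100000000;

-- Source query for the second parallel Data Flow Task
SELECT *
FROM dbo.SourceTable
WHERE Id >= 100000000 AND Id < 200000000;

-- ...and so on, until the full key range is covered.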

Have a look at this TechNet article

Regards
Lempster
Post #1542605
Posted Tuesday, February 18, 2014 9:31 AM
SSCrazy

My first suggestion was going to be to drop the indexes on the target, so congrats on figuring that one out. Since you have two different sources, you might find it faster to have two separate data flow tasks, one for each source. That may cause some locking contention on the target table; if you can partition the target table along differences between the two sources, you should be able to work around the contention. If not, I have also used a separate staging table (no indexes) for each source and then used an Execute SQL Task to assemble the staging tables into the final target table, as sketched below. Even with a single source, you might find a way to split it by date or some other key so you can try the same approach.
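
The Execute SQL Task that assembles the staging tables can be as simple as something like this (all names are placeholders; TABLOCK allows minimal logging when the target is a heap and the database is in the simple or bulk-logged recovery model):

-- Combine the two unindexed staging tables into the final target table.
INSERT INTO dbo.FinalTarget WITH (TABLOCK)
SELECT * FROM dbo.Staging_SourceA;

INSERT INTO dbo.FinalTarget WITH (TABLOCK)
SELECT * FROM dbo.Staging_SourceB;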

A little experimentation is in order here because there are a number of variables, but if you continue to experiment with indexes, partitions and/or staging tables, I expect you can decrease your load times substantially. When I have experimented along these lines, I have been able to drop load times from over 30 hours to under 5 hours.
Post #1542651
Posted Tuesday, February 18, 2014 12:19 PM


SSCertifiable

Start on the target and find the spid for the SSIS connection doing the inserts. Try to determine its most common wait type. If it's PAGEIOLATCH or another variant of disk waits, you're fine there; if it's something else, you need to investigate that. Also, the target database is either in the simple recovery model or has the log file pre-grown to the size you need, right? You don't want autogrowth firing in the middle of something that large.
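
Something along these lines will confirm both; the database and log file names are placeholders:

-- Check the destination database's recovery model.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'LoadTarget';

-- Pre-grow the log once up front rather than letting autogrowth fire mid-load
-- (logical file name and target size are examples only).
ALTER DATABASE LoadTarget
MODIFY FILE (NAME = LoadTarget_log, SIZE = 100GB);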

My usual technique for that is simply to poll sys.sysprocesses once a second and dump the results to a table for two or three minutes, then take a look at the results.
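
A minimal sketch of that polling loop; the scratch table, spid filter and duration are only examples:

-- Capture the SSIS session's activity once per second for ~3 minutes.
CREATE TABLE dbo.WaitSnapshot
(
    capture_time datetime NOT NULL DEFAULT GETDATE(),
    spid         smallint,
    waittype     binary(2),
    lastwaittype nvarchar(32),
    waittime     bigint,
    cmd          nchar(16)
);

DECLARE @i int = 0;
WHILE @i < 180
BEGIN
    INSERT INTO dbo.WaitSnapshot (spid, waittype, lastwaittype, waittime, cmd)
    SELECT spid, waittype, lastwaittype, waittime, cmd
    FROM sys.sysprocesses
    WHERE spid = 65;   -- replace with the SSIS connection's spid

    SET @i += 1;
    WAITFOR DELAY '00:00:01';
END;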

Why are you loading two sources at once, and are you merging the streams prior to insertion? If you're running them in parallel against the same table, they're contending with each other.

Otherwise I'd really have to dig into the details: what the SAN target is, the width of the data stream, whether any implicit conversions are happening during delivery, and so on. For something that large I always do dev testing with a smaller result set from the source, say 2 million rows (filter it); at least that gives me a baseline expectation.
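
For example, the baseline run's source query can just be capped (the key column here is made up):

-- Grab roughly 2 million rows from the source for a baseline timing run.
SELECT TOP (2000000) *
FROM dbo.SourceTable;
-- Or, if the table has a dense integer key:
-- SELECT * FROM dbo.SourceTable WHERE Id BETWEEN 1 AND 2000000;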



- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
Post #1542705
Posted Friday, February 21, 2014 11:33 AM
SSC Veteran

When I'm doing large data transfers, I try to find a way to break them up into sets of no more than 30 million rows each. I'll have them all in one data flow, but running 10 or 15 of those sets at the same time speeds things up nicely; one way to carve up the source is sketched below.
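
One way to split the source into roughly equal sets, assuming a fairly dense integer key (names are invented; key-range predicates usually scan less than this modulo filter, but it shows the idea):

-- One of, say, 13 parallel source queries; each stream uses a different remainder.
SELECT *
FROM dbo.SourceTable
WHERE Id % 13 = 0;   -- the next stream filters on = 1, then = 2, and so on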

Dropping the rows into a heap and applying the constraints/indexes afterwards also speeds things up, as you found out.
Post #1544148
Posted Sunday, February 23, 2014 9:43 AM
Forum Newbie

Try the SQL Bulk Copy Component from www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

Ciao,
jbs
Post #1544287
Posted Sunday, March 9, 2014 8:17 PM


Ten Centuries

onejbsmith (2/23/2014)
Try the SQL Bulk Copy Component from www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

Ciao,
jbs


You can use the free Balanced Data Distributor component to accomplish bulk-load parallelism.

Karen,

I would suggest an iterative approach, starting by finding your read speed. Set up a throwaway destination, such as a Union All component, and run the package to establish your baseline read performance. If the read speed is good, then you can concentrate on the destination. Also, from your description it sounds like you have three servers involved: the source server, the processing server where the package executes, and the destination server. The data has to make two hops before it lands in the destination. If you can make the package execute on the destination server, I think you will see some performance improvement.


---
SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

Post #1549092