How can I speed up my data flow task?

  • I have a data flow task with two OLE DB data sources (SQL Server queries) totaling 392,319,750 records, and I am loading them into a table in another database on a different SQL Server, also through an OLE DB destination. The package runs on yet another SQL Server. Table lock is off, and both the batch size and the maximum insert commit size are set to 100,000. This is literally taking days to load. I have tried a bulk load from a SQL statement and that also takes too long. I have read everything I can about improving performance but nothing seems to help. There are four non-clustered indexes on the receiving table; if I remove them, it should improve performance, correct? Can I remove them without interrupting the data flow that is currently running, and will it improve the current process's performance?

    All suggestions are welcome. Thank you.

    Karen

  • Well, I answered one of my questions! I dropped the indexes and the data flow is still running, but much, much faster! I am still open to more suggestions though.
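
    Roughly what that looked like, for anyone finding this later; the index and table names below are placeholders rather than my real ones, and the CREATE statements get run again once the load finishes:

    ```sql
    -- Placeholders: dbo.TargetTable and the IX_* names are illustrative only.
    -- Drop the non-clustered indexes before (or during) the big load...
    DROP INDEX IX_TargetTable_Col1 ON dbo.TargetTable;
    DROP INDEX IX_TargetTable_Col2 ON dbo.TargetTable;
    DROP INDEX IX_TargetTable_Col3 ON dbo.TargetTable;
    DROP INDEX IX_TargetTable_Col4 ON dbo.TargetTable;

    -- ...and rebuild them once the data flow has finished.
    CREATE NONCLUSTERED INDEX IX_TargetTable_Col1 ON dbo.TargetTable (Col1);
    CREATE NONCLUSTERED INDEX IX_TargetTable_Col2 ON dbo.TargetTable (Col2);
    CREATE NONCLUSTERED INDEX IX_TargetTable_Col3 ON dbo.TargetTable (Col3);
    CREATE NONCLUSTERED INDEX IX_TargetTable_Col4 ON dbo.TargetTable (Col4);
    ```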

    Karen

  • You could try adjusting the values for DefaultBufferMaxRows, DefaultBufferSize and EngineThreads, but it's something of a black art, I've found. Also, minimise the number of 'blocking' transformations in your package (sorts, for example) that have to wait for the entire data set to flow through them before any rows can be passed to downstream components.

    If you have a way of dividing the total data set into equal chunks you could run multiple Data Flow Tasks in parallel. Oh, and look at increasing the network packet size too.
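
    For the 'equal chunks' idea, one way to work out the boundaries is to let the source server compute them once up front. This is only a sketch; dbo.SourceTable, SomeKey and the chunk count of 8 are placeholders, and this query itself has to scan the key, so run it once and reuse the boundaries:

    ```sql
    -- Sketch: compute key ranges for 8 roughly equal chunks.
    -- Each parallel Data Flow Task's OLE DB source then gets
    --   WHERE SomeKey >= @range_start AND SomeKey <= @range_end
    ;WITH numbered AS
    (
        SELECT SomeKey,
               NTILE(8) OVER (ORDER BY SomeKey) AS chunk_no
        FROM dbo.SourceTable
    )
    SELECT chunk_no,
           MIN(SomeKey) AS range_start,
           MAX(SomeKey) AS range_end
    FROM numbered
    GROUP BY chunk_no
    ORDER BY chunk_no;
    ```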

    Have a look at this TechNet article

    Regards

    Lempster

  • My first suggestion was going to be to drop the indexes on the target, so congrats on figuring that one out. Since you have two different sources, you might find it faster to have two separate Data Flow Tasks, one for each source. This may cause some locking contention on the target table; if you can partition the target table along differences present in the two sources, you should be able to work out the contention issues. If not, I have also used a separate staging target (no indexes) for each source and then used an Execute SQL Task to assemble the two staging tables into the final target table. Even without two different sources, you might find a way to split a single source by date or something so you can try this same approach.
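
    A sketch of what that final Execute SQL Task can run, assuming two unindexed staging heaps; all table and column names are made up, and the TABLOCK hint only buys minimal logging if the target is a heap and the database isn't in full recovery:

    ```sql
    -- Assemble the two staging heaps into the real target in one set-based pass.
    INSERT INTO dbo.FinalTarget WITH (TABLOCK) (Col1, Col2, Col3)
    SELECT Col1, Col2, Col3 FROM dbo.StageA
    UNION ALL
    SELECT Col1, Col2, Col3 FROM dbo.StageB;

    -- Clear the staging heaps for the next run.
    TRUNCATE TABLE dbo.StageA;
    TRUNCATE TABLE dbo.StageB;
    ```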

    A little bit of experimentation is in order here because there are a number of variables. But if you continue to experiment with indexes, partitions and/or staging tables I expect you can decrease your load times substantially. When I have experimented along these lines I have been able to drop load times from over 30 hours to under 5 hours.

  • Start on the target server and find the SPID for the SSIS connection doing the inserts, then try to determine its most common wait type. If it's PAGEIOLATCH or another variant of disk waits, you're fine there; if it's something else, you need to inspect that. Also, the target database is either in simple recovery or has the log file pre-grown to the size you need, right? You don't want to be auto-growing in the middle of something that large.

    My usual technique for that is simply to poll sys.sysprocesses once a second and dump the results to a table for 2 or 3 minutes, then take a look at what comes back.
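
    Something along these lines; the SPID, the table name and the three-minute window are placeholders to adjust:

    ```sql
    -- Poll sys.sysprocesses once a second for ~3 minutes and keep the samples.
    CREATE TABLE #spid_trace
    (
        sample_time  datetime   NOT NULL,
        spid         smallint   NOT NULL,
        lastwaittype nchar(32)  NULL,
        waittime     bigint     NULL,
        cpu          int        NULL,
        physical_io  bigint     NULL
    );

    DECLARE @i int = 0;
    WHILE @i < 180
    BEGIN
        INSERT INTO #spid_trace
        SELECT GETDATE(), spid, lastwaittype, waittime, cpu, physical_io
        FROM sys.sysprocesses
        WHERE spid = 123;          -- placeholder: the SPID of the SSIS destination connection

        WAITFOR DELAY '00:00:01';
        SET @i += 1;
    END

    -- Most common wait type over the window.
    SELECT lastwaittype, COUNT(*) AS samples
    FROM #spid_trace
    GROUP BY lastwaittype
    ORDER BY samples DESC;
    ```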

    Why are you loading two sources at once, and are you merging the streams prior to insertion? If you're running them in parallel against the same target, they're contending with each other.

    Otherwise I'd really have to dig into details: what the SAN target is, the width of the stream, whether any implicit conversions are happening during delivery, and so on. For something that large I always do dev testing with a smaller result set from the source, say 2 million rows (filter it). At least that gives me a generic baseline expectation.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • When I'm doing large data transfers I try to find a way to break them up into sets of no more than 30 million rows each. I'll have them all in one data flow, but running 10 or 15 of those sets at the same time speeds things up nicely.
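
    For what it's worth, the split can be as simple as giving each parallel source its own slice of the key space; the names below are placeholders and 15 streams is just an example:

    ```sql
    -- Stream n of 15: each OLE DB source in the data flow gets its own value of n.
    SELECT Col1, Col2, Col3
    FROM dbo.SourceTable
    WHERE Id % 15 = 0;   -- the next source uses = 1, then = 2, and so on

    -- If the source table is clustered on Id, contiguous ranges usually scan
    -- faster than a modulo, e.g. WHERE Id >= 1 AND Id < 30000000 for the first set.
    ```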

    Dropping them into a heap and applying the constraints/indexes later also speeds it up as you found out. 😀

  • Try the SQL Bulk Copy Component from http://www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

    Ciao,

    jbs

  • onejbsmith (2/23/2014)


    Try the SQL Bulk Copy Component from http://www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

    Ciao,

    jbs

    You can use the free Balanced Data Distributor component to accomplish bulk-load parallelism.

    Karen,

    I would suggest an iterative approach: first, find your reading speed. Set up a dummy destination such as a Union All component and run the package to establish your baseline read performance. If the reading speed is good, then you can start to concentrate on your destination. Also, from your description it sounds like you have three servers involved: the source server, the processing server where the package executes, and the destination server. The data has to make two hops before it lands in your destination. If you can make the package execute on the destination server, I think you will see some performance improvement.

    ---
    SSIS Tasks Components Scripts Services | http://www.cozyroc.com/
