How can I speed up my data flow task?
Posted Tuesday, February 18, 2014 6:08 AM
SSC Rookie

I have a data flow task with two OLE DB data sources, fed by SQL Server queries that together return 392,319,750 records, and I am loading them into a table in another database on a different SQL Server. The destination is also OLE DB, and the package runs on yet another SQL Server. Table locks are off, and both the rows-per-batch and maximum insert commit size are set to 100,000. The load is literally taking days. I have also tried a bulk load from a SQL statement and that takes too long as well. I have read everything I can find about improving performance, but nothing seems to help. There are four non-clustered indexes on the receiving table; if I remove them, that should improve performance, correct? Can I remove them without interrupting the data flow that is currently running, and will it improve the current process's performance?

All suggestions are welcome. Thank you.

Karen
Post #1542520
Posted Tuesday, February 18, 2014 6:44 AM
SSC Rookie

Well, I answered one of my questions! I dropped the indexes and the data flow is still running, but much, much faster! I am still open to more suggestions though.
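
For anyone finding this later, the drop-and-recreate pattern looks roughly like this (the table and index names below are placeholders, not the real ones):

-- Run against the destination database before (or while) the load runs.
DROP INDEX IX_TargetTable_Col1 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col2 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col3 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col4 ON dbo.TargetTable;

-- Recreate them once the data flow has finished (key columns are illustrative).
CREATE NONCLUSTERED INDEX IX_TargetTable_Col1 ON dbo.TargetTable (Col1);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col2 ON dbo.TargetTable (Col2);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col3 ON dbo.TargetTable (Col3);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col4 ON dbo.TargetTable (Col4);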

Karen
Post #1542539
Posted Tuesday, February 18, 2014 8:27 AM
SSCrazy

You could try adjusting the values for DefaultBufferMaxRows, DefaultBufferSize and EngineThreads, but it's something of a black art, I've found. Also, minimise the number of 'blocking' transformations in your package (sorts, for example) that have to wait for the entire data set to flow through them before any rows can be passed to downstream components.
If you have a way of dividing the total data set into roughly equal chunks, you could run multiple Data Flow Tasks in parallel; see the sketch below. Oh, and look at increasing the network packet size on the connection managers too.
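
As an illustration of the chunking idea, each parallel Data Flow Task's source query might look something like this (the table, key column and boundary values are invented for the example):

-- Source query for the first parallel Data Flow Task
SELECT *
FROM dbo.SourceTable
WHERE Id >= 1 AND Id < 100000000;

-- Source query for the second parallel Data Flow Task
SELECT *
FROM dbo.SourceTable
WHERE Id >= 100000000 AND Id < 200000000;

-- ...and so on, until the full key range is covered.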

Have a look at this TechNet article

Regards
Lempster
Post #1542605
Posted Tuesday, February 18, 2014 9:31 AM
SSCrazy

My first suggestion was going to be to drop the indexes on the target, so congrats on figuring that one out. Since you have two different sources, you might find it faster to have two separate data flow tasks, one for each source. That may cause some locking contention on the target table; if you can partition the target table along differences between the two sources, you should be able to work around the contention. If not, I have also used a separate staging table (no indexes) for each source and then used an Execute SQL Task to assemble the staging tables into the final target table, as sketched below. Even with a single source, you might find a way to split it by date or some other key so you can try the same approach.
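
The Execute SQL Task that assembles the staging tables can be as simple as something like this (all names are placeholders; TABLOCK allows minimal logging when the target is a heap and the database is in the simple or bulk-logged recovery model):

-- Combine the two unindexed staging tables into the final target table.
INSERT INTO dbo.FinalTarget WITH (TABLOCK)
SELECT * FROM dbo.Staging_SourceA;

INSERT INTO dbo.FinalTarget WITH (TABLOCK)
SELECT * FROM dbo.Staging_SourceB;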

A little experimentation is in order here because there are a number of variables, but if you continue to experiment with indexes, partitions and/or staging tables, I expect you can decrease your load times substantially. When I have experimented along these lines, I have been able to drop load times from over 30 hours to under 5 hours.
Post #1542651
Posted Tuesday, February 18, 2014 12:19 PM


SSCertifiable

Start on the target and find the spid for the SSIS connection doing the inserts. Try to determine its most common wait type. If it's PAGEIOLATCH or another variant of disk waits, you're fine there; if it's something else, you need to investigate that. Also, the target database is either in the simple recovery model or has the log file pre-grown to the size you need, right? You don't want autogrowth firing in the middle of something that large.
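
Something along these lines will confirm both; the database and log file names are placeholders:

-- Check the destination database's recovery model.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'LoadTarget';

-- Pre-grow the log once up front rather than letting autogrowth fire mid-load
-- (logical file name and target size are examples only).
ALTER DATABASE LoadTarget
MODIFY FILE (NAME = LoadTarget_log, SIZE = 100GB);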

My usual technique for that is simply to poll sys.sysprocesses once a second and dump the results to a table for two or three minutes, then take a look at the results.
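
A minimal sketch of that polling loop; the scratch table, spid filter and duration are only examples:

-- Capture the SSIS session's activity once per second for ~3 minutes.
CREATE TABLE dbo.WaitSnapshot
(
    capture_time datetime NOT NULL DEFAULT GETDATE(),
    spid         smallint,
    waittype     binary(2),
    lastwaittype nvarchar(32),
    waittime     bigint,
    cmd          nchar(16)
);

DECLARE @i int = 0;
WHILE @i < 180
BEGIN
    INSERT INTO dbo.WaitSnapshot (spid, waittype, lastwaittype, waittime, cmd)
    SELECT spid, waittype, lastwaittype, waittime, cmd
    FROM sys.sysprocesses
    WHERE spid = 65;   -- replace with the SSIS connection's spid

    SET @i += 1;
    WAITFOR DELAY '00:00:01';
END;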

Why are you loading two sources at once, and are you merging the streams prior to insertion? If you're running them in parallel against the same table, they're contending with each other.

Otherwise I'd really have to dig into the details: what the SAN target is, the width of the data stream, whether any implicit conversions are happening during delivery, and so on. For something that large I always do dev testing with a smaller result set from the source, say 2 million rows (filter it); at least that gives me a baseline expectation.
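
For example, the baseline run's source query can just be capped (the key column here is made up):

-- Grab roughly 2 million rows from the source for a baseline timing run.
SELECT TOP (2000000) *
FROM dbo.SourceTable;
-- Or, if the table has a dense integer key:
-- SELECT * FROM dbo.SourceTable WHERE Id BETWEEN 1 AND 2000000;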



- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
Post #1542705
Posted Friday, February 21, 2014 11:33 AM
SSC Veteran

When I'm doing large data transfers, I try to find a way to break them up into sets of no more than 30 million rows each. I'll have them all in one data flow, but running 10 or 15 of those sets at the same time speeds things up nicely; one way to carve up the source is sketched below.
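
One way to split the source into roughly equal sets, assuming a fairly dense integer key (names are invented; key-range predicates usually scan less than this modulo filter, but it shows the idea):

-- One of, say, 13 parallel source queries; each stream uses a different remainder.
SELECT *
FROM dbo.SourceTable
WHERE Id % 13 = 0;   -- the next stream filters on = 1, then = 2, and so on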

Dropping the rows into a heap and applying the constraints/indexes afterwards also speeds things up, as you found out.
Post #1544148
Posted Sunday, February 23, 2014 9:43 AM
Forum Newbie

Try the SQL Bulk Copy Component from www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

Ciao,
jbs
Post #1544287
Posted Sunday, March 9, 2014 8:17 PM


Ten Centuries

onejbsmith (2/23/2014)
Try the SQL Bulk Copy Component from www.Aichabiis.com; it was designed specifically to make it easy to speed up large data loads. It will allow you to effortlessly split up your data and load it on multiple threads. It's especially useful on multi-core servers.

Ciao,
jbs


You can use the free Balanced Data Distributor component to accomplish bulk-load parallelism.

Karen,

I would suggest an iterative approach, starting by finding your read speed. Set up a throwaway destination, such as a Union All component, and run the package to establish your baseline read performance. If the read speed is good, then you can concentrate on the destination. Also, from your description it sounds like you have three servers involved: the source server, the processing server where the package executes, and the destination server. The data has to make two hops before it lands in the destination. If you can make the package execute on the destination server, I think you will see some performance improvement.


---
SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

Post #1549092