Parallelizing Slow Parts of the Data Flow - Part 1 - Preparation

It's quite common to have parts of your Data Flow that are slow, and there are techniques you can use to improve the performance of those parts.  One of them is to parallelize the slow operation - but that only works for operations that are parallelizable.  Sort isn't one of them - sorry to say.  You also really have to know how the data flow engine works, to make sure you don't try fixing a problem that looks "serial" but is, in fact, already parallel - you'll only make things worse.
Identifying Parallelization Opportunities
So what scenarios does parallelization work for? 
Positive Signs
Here are some characteristics that identify a good candidate for parallelizing a portion of your data flow - there's a small sketch after the list showing the first one in action.

  • The operation applies independently to single rows, or to an easily identifiable subset of rows.
  • There's a known-slow resource involved that tolerates concurrent use (SQL Server, disk, a web service, ...)
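
To make that first point concrete, here's a minimal sketch in Python.  SSIS expresses this visually rather than in code, but the principle is the same: a slow, row-independent operation can be overlapped across several workers.  The names, timings, and enrich_row operation are purely illustrative, not from any real package.

    # Conceptual sketch only - enrich_row stands in for a hypothetical slow,
    # row-independent operation against a resource that tolerates concurrent
    # callers (a web service, SQL Server, ...).  Because rows don't depend on
    # each other, the slow calls can overlap.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def enrich_row(row):
        time.sleep(0.1)  # simulate one slow external call per row
        return {**row, "name_length": len(row["name"])}

    rows = [{"name": f"customer{i}"} for i in range(40)]

    start = time.perf_counter()
    serial = [enrich_row(r) for r in rows]  # ~4.0s: one call at a time
    print(f"serial:   {time.perf_counter() - start:.1f}s")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        parallel = list(pool.map(enrich_row, rows))  # ~0.5s: 8 calls in flight
    print(f"parallel: {time.perf_counter() - start:.1f}s")

    assert serial == parallel  # same results, same order

Note the assert at the end - the parallel version is only a win if it produces the same output, which is exactly why the testing routine below matters.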

Counter Indications
If your data flow has any of the following characteristics, it's not likely that parallelizing (alone) will help speed anything up.

  • The server's CPU is at or near capacity during the slow section of the data flow.
  • Any external resources (SQL Server, file system, network) the slow section of the data flow uses are already at or near capacity.
  • Some part of the slow section requires all rows at once - an aggregate or sort, for example.

Test, Backup, Change, Test, Backup, Change, ...
After applying a parallelization technique, you'll need to figure out if it's actually better, right?  So the first thing you have to do is measure the performance of your current design (obviously!).  Make sure you measure a few things - time to completion, plus CPU, RAM, network, and disk utilization.  Run the test at least three times in a row to account for caching effects and get a good average.
We measure all those attributes so that we can see what we're trading for speed.  The idea is to trade CPU utilization for speed - but we could end up paying in RAM and I/O as well, perhaps to a degree you're not comfortable with.  That's why we keep an eye on those metrics.
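
Here's one way you might script the timing part of the baseline - a minimal sketch, assuming your package runs via dtexec (the SSIS command-line runner), with a hypothetical package path you'd substitute for your own.  It only captures wall-clock time; watch CPU, RAM, disk, and network in Performance Monitor (or your favorite tool) while it runs.

    # A minimal timing harness - the package path is hypothetical.
    import statistics
    import subprocess
    import time

    RUNS = 3  # at least three, per the advice above
    times = []
    for i in range(RUNS):
        start = time.perf_counter()
        subprocess.run(["dtexec", "/FILE", r"C:\etl\MyPackage.dtsx"], check=True)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"run {i + 1}: {elapsed:.1f}s")

    print(f"average: {statistics.mean(times):.1f}s  (min: {min(times):.1f}s)")
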
Before you make any changes... make sure you back up your original!  Your quest to improve performance using one technique might work somewhat, and you may want to try another technique.  If you have a backup of your original package, it'll (probably) be easier to implement the second technique from there, while ensuring your actual logic remains consistent.  Do also back up your attempts - you might try a few techniques but decide the first one worked "best".
Make your change to parallelize the package, then double-check that the result is the same as your original's.  It doesn't do any good to make your data go faster if you ruin it in the process.
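
One way to do that double-check, assuming both versions of the package land their output in tables you can query (dbo.Output_Original and dbo.Output_Parallel are hypothetical names): compare row counts and an order-independent checksum.  Order-independence matters, because parallel paths rarely preserve row order.

    # A minimal equivalence check - table names are hypothetical.
    # CHECKSUM_AGG is order-independent, so it tolerates the row reordering
    # that parallel paths introduce.  A matching checksum is strong evidence
    # of identical data, not proof - checksum collisions are possible.
    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=localhost;Database=ETL;Trusted_Connection=yes;"
    )

    def summarize(table):
        row = conn.execute(
            f"SELECT COUNT(*), CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM {table}"
        ).fetchone()
        return row[0], row[1]

    original = summarize("dbo.Output_Original")
    parallel = summarize("dbo.Output_Parallel")
    print(f"original: {original}  parallel: {parallel}")
    assert original == parallel, "row count or checksum differs - investigate!"
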
Be Prepared For Disappointment
Even if your slow section seems to fit the profile I outlined, there's no guarantee applying a parallelization technique will help.  You may not be able to squeeze any more time out of the process due to some limiting factor you weren't aware of when you started.  Or the trade of speed for some other resource just isn't something you're ready to make.
What follows will be posts on using the Balanced Data Distributor, then a "poor man's" BDD built from existing components for those of you who can't install additional software (even when it's published by Microsoft).
