A question just came up in the forums, and it highlights how difficult it can be to optimize Integration Services. It's difficult because the SSIS concept of the Data Flow - an in-memory pipeline of buffers - isn't the paradigm that a great many of its users are familiar with.
Specifically, the question assumes that the Lookups occur serially, and somehow there must be a better arrangement to have them occur in parallel.

The Assumption
It's a good question to ask - just about any question is, because they all come with answers, sometimes more than one. On its face, it seems reasonable to think that manually arranging the data flow in a parallel manner will allow Integration Services to process the lookups at the same time. Especially because (as the package developer) you know that the lookups aren't going to touch the same columns - and Integration Services can't be expected to know that, can it? Once we're done with the parallel lookups, we'll just stitch the results back together - easy-peasy, right?

Implementing Manual Parallelism

So what does that look like? Something like this to start...
And then you have to merge the results...
Oh - I caught that. I see you questioning our premise! Merge Joins aren't very efficient, are they? Will the gains from manually parallelizing the lookups be washed out by the performance penalty associated with merging the datasets back together? Does it depend on how many lookups are being done? Or how big the rowsets are?

How Do You Spell "Ass u me"?
Perhaps a little harsh - there's no need to tarnish an inquiring mind! But the moral behind that statement remains: we're assuming that Integration Services needs our help to parallelize the workflow. Fortunately, it doesn't! We also assumed that "stitching the results back together" will actually be fast and easy. Far from it, as we'll see - but that ends up being pretty irrelevant in the end.

Sequential Parallelism - An Assembly Line
That makes no sense, does it? Maybe not - but it's the best way I could describe what Integration Services is actually doing. I've found that comparing the data flow pipeline to an automotive assembly line
can be very enlightening. One of the great achievements of the industrial age was the development of the assembly line - the paradigm that allowed the product to be worked on by different people at different stations, rather than a single worker performing multiple tasks on a single product. The efficiencies gained from being able to train workers more thoroughly at specific tasks, and being able to "tune the workflow" by placing more resources on the "slower" steps allows modern manufacturers to smoothly push a consistent flow of product out of their plants.
A strength of the assembly line process is that almost any one step can be parallelized fairly easily. If step #10 in a process takes twice as long as the other steps, simply set up two stations to do the work, and split the "cars" between those two paths. Another strength is that the plant doesn't have a lot of inventory "tied up" in the manufacturing process - if there are one hundred steps in manufacturing the car, then there are only one hundred cars under construction inside the plant (disregarding parallel optimizations). A related strength is that it doesn't take long (or waste much product) to finish the first car - it only takes as long as the sum of the steps. (In the alternative workflow - a batch approach - it could take much longer to get the first product out the door, and would mean lots of in-process chassis in the workflow.)
Those same strengths are achieved in the Integration Services data flow. For example, a Derived Column transformation (or any of the non-blocking synchronous transforms) can easily be parallelized - multiple threads can process different buffers at the same time, coordinating amongst themselves. Similarly, non-blocking transformations mean that the buffers can get passed from transform to transform efficiently, such that there aren't that many buffers (memory) occupied at any one time. Finally, since we're working with buffers, it doesn't take all that long to get the first results out of the pipeline and into the hands of the storage engine to "put away" in the finished car lot.

Enough About Cars - What's Going On With The Data?
Quite simply put - the design surface of Integration Services is lying to you through its teeth. It may very well look like your data flow is sequential in nature, and that "Lookup A" has to go through all the data before "Lookup B" touches anything - but that's not the case. You should immediately see through this fiction when you run your packages, and see something like I have pictured to the right. You'll see that all of those lookups are being executed (because they're yellow). If you look closely at the row counts shown between the lookups, you'll see that Lookup A has (approximately) ten thousand more rows input to it than its output, and that this pattern continues through the next lookups. What this is telling you is that as soon as Lookup A has finished its lookup operation on one of the buffers (about 10,000 rows), it passes the buffer on to the next lookup. It's not what I'd call "pure" parallelism - but it's definitely most of the way there.
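That buffer handoff can be sketched as a toy pipeline. This is an illustrative model only - not SSIS internals - and every name in it (the reference tables, the two-row buffer size, the column names) is invented for the example:

```python
import queue
import threading

BUFFER_ROWS = 2  # toy size; real SSIS buffers hold roughly 10,000 rows by default

def lookup_stage(reference, key, out_col, inbox, outbox):
    """Finish one buffer, pass it downstream, then start on the next -
    so several stages can work on different buffers at the same time."""
    while True:
        buf = inbox.get()
        if buf is None:          # end-of-rowset marker
            outbox.put(None)
            return
        for row in buf:
            row[out_col] = reference.get(row[key])
        outbox.put(buf)          # the next lookup starts on this buffer
                                 # while we fetch our next one

# Hypothetical reference tables for "Lookup A" and "Lookup B"
colors = {1: "red", 2: "blue"}
sizes = {1: "small", 2: "large"}

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=lookup_stage, args=(colors, "id", "color", q0, q1)),
    threading.Thread(target=lookup_stage, args=(sizes, "id", "size", q1, q2)),
]
for t in stages:
    t.start()

rows = [{"id": i % 2 + 1} for i in range(6)]
for i in range(0, len(rows), BUFFER_ROWS):
    q0.put(rows[i:i + BUFFER_ROWS])  # the source adapter feeds fixed-size buffers
q0.put(None)

results = []
while (buf := q2.get()) is not None:
    results.extend(buf)
for t in stages:
    t.join()

print(results[0])  # {'id': 1, 'color': 'red', 'size': 'small'}
```

The key line is the `outbox.put(buf)` inside the loop: a stage releases each buffer as soon as it has processed it, which is exactly why the row counts between lookups differ by about one buffer's worth of rows.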
Compare to the Manual Method
Let's take a snapshot of what's going on in the "manual" method. A quick look at the execution snapshot shows that all five lookups appear to be operating on the same buffer. That's fantastic! It definitely is a step towards better parallelism.
Unfortunately, it comes at some cost.
The cost doesn't happen to be "buffer copies" by the Multicast component. The Multicast doesn't actually copy any data - it's another Data Flow designer fiction. It only makes you think there are multiple copies, when in fact it's using the same memory for five operations at the same time.
The cost comes in the Merge Joins. They aren't "inefficient" by any means - it's just that they're not as efficient as not using them at all. In fact, the performance gain we may see from manually parallelizing the lookups disappears - and then some - due to their use.
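To make the extra work concrete, here is a toy comparison in Python (invented data and reference tables, not SSIS internals). The pipelined version enriches each row once as it flows by; the branch-and-merge version computes each column on its own branch and then needs an additional merge pass per branch to stitch the columns back onto the rows:

```python
# Hypothetical reference tables; both functions produce identical output.
LOOKUPS = {
    "color": {1: "red", 2: "blue"},
    "size": {1: "small", 2: "large"},
}

def pipelined(rows):
    """The 'standard' arrangement: every row visits each lookup in turn."""
    out = [dict(r) for r in rows]
    for row in out:
        for col, table in LOOKUPS.items():
            row[col] = table.get(row["id"])
    return out

def branch_and_merge(rows):
    """The 'manual' arrangement: each branch computes one column for all
    rows, then extra merge passes stitch the columns back onto the rows."""
    branches = [(col, [table.get(r["id"]) for r in rows])
                for col, table in LOOKUPS.items()]  # one branch per lookup
    merged = [dict(r) for r in rows]
    for col, values in branches:                    # one merge pass per branch
        for row, value in zip(merged, values):
            row[col] = value
    return merged

rows = [{"id": 1}, {"id": 2}, {"id": 1}]
assert pipelined(rows) == branch_and_merge(rows)  # same answer, more passes
```

Both paths give the same answer; the second just pays for it with extra full passes over the data - the in-miniature version of what the Merge Joins cost the manual package.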
My tests using this package
on my dual-core machine with plenty of headroom showed pretty conclusive results. The apparently unoptimized "standard" arrangement of doing lookups was more than SEVEN TIMES faster than the manually parallelized arrangement. That's correct. I pushed ten million rows through five lookups in just over 33s using the standard arrangement, compared to the manually parallelized arrangement's 241s. Based on the CPU usage profiles of the runs, I would expect a significantly wider gap in runtimes in "real-life" scenarios where your server has four or more cores. All CPU cores were maxed out in the "standard" arrangement, but hovered between 55% and 60% for the majority of the manual test - of which 5-10% was consumed by my other desktop apps. Those CPU results clearly demonstrate that attempting to manually parallelize a series of lookups actually results in serialization. Brutal serialization, in fact.
There is Still Room for Improvement
So what if you're still dissatisfied with the performance of your Lookups, and you were hoping "parallelizing" them would help you? I suggest you try tweaking your Lookups individually - adjust the caching behaviour, and analyze how you've indexed the underlying lookup tables. I also suggest you see the many articles from reputable sources about improving Lookup performance:
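As a rough illustration of why caching behaviour matters, here is a hedged Python sketch - the "database" dictionary and the key list are invented - showing how a cache turns repeated keys into a single round-trip, the same way a cached Lookup avoids re-querying the reference table for every row:

```python
from functools import lru_cache

REFERENCE = {1: "red", 2: "blue", 3: "green"}
round_trips = 0  # counts simulated queries against the reference table

@lru_cache(maxsize=1024)  # akin to a partial-cache Lookup: remember recent keys
def lookup_color(key):
    global round_trips
    round_trips += 1      # a no-cache Lookup would pay this cost on every row
    return REFERENCE.get(key)

keys = [1, 2, 1, 1, 2, 3, 1]
colors = [lookup_color(k) for k in keys]

print(round_trips)  # 3 - only the three distinct keys hit the "database"
```

Seven rows, three queries: the fewer distinct keys relative to total rows, the more a cached Lookup (or, at the extreme, a fully pre-loaded one) saves you.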
One thing you'll note about all of those articles is that none of them suggest attempting to manually parallelize the data flow as I demonstrated above. Once you understand the internals of the Data Flow (which isn't easy), it's one of those ideas that just dies in your brain, because you know it no longer makes any sense at all.