First - I'll get the "what's a pipeline got to do with Integration Services" question out of the way. The "pipeline" I'm referring to here is the word commonly used to describe the flow of data in the Data Flow Task. That term doesn't seem to show up in any of the MSDN documentation, but the Microsoft bloggers just can't seem to talk enough about it
. I suppose it's an internal term that's leaked out into general usage.
The SQL Server Integration Services' Data Flow Task reads data from one or more sources, flows that data through this "pipeline" where some parts of it will get transformed (changed), and then sends the data to one or more destinations. The choice of the word "pipeline" will tend to get people thinking of a liquid in a pipe - and that's an apt metaphor for the phenomenon of backpressure
. The same kind of action takes place in SSIS.
When talking about a liquid, backpressure refers to the resistance that the fluid encounters flowing through a pipe. When the fluid encounters a constricted section of the pipe, it can't flow as fast through that section. But since (most) fluids are incompressible and most pipes are rigid, the reduction in flow rate at this constricted point means that the flow is "backed up" all the way to the start of the pipe. If a kink is stopping the flow of water near the end of the hose, the tap isn't able to inject any more water into the hose - there's simply no space to put it. The rate at which you can inject water into the hose is directly related to two things: how constricted the smallest part of the hose is, and how much the hose can "expand" prior to the blockage. (If the hose can inflate like a balloon, you can still pump water in.)Data Flow Backpressure
SSIS's Data Flow Task has very similar constraints. The source - or "tap" - wants to inject data into the pipeline as fast as it can. But that data has to flow through the components - the "pipe" - in order to reach the destination - the "end" of the hose. Somewhat like a garden hose, the data flow pipeline can
stretch to allow the source to inject more data into the pipeline than is ultimately getting through to the end. But there's a limit, of course, just like with a hose. If it's stretched too much, it will burst.
In SSIS's case, since it's a "fictitious" hose, the nice engineers that built it were able to set limits on how much of a balloon you're allowed to turn your hose into. It's the RAM limit of the process and/or machine you're running your package on. In most cases, taking advantage of SSIS' "stretchy hose" is a very bad thing - you'll easily consume all the available memory on your machine, and start spilling into virtual memory... and that's slow disk!Backpressure In Action
Here's an SSIS package that demonstrates backpressure quite well.
You can build your own quite easily from what I'll describe here, in case you don't have SSIS 2008. The first task in it is a Data Flow Task that looks like this:The first component hooks up to your local instance's master database, fetching a cross-joined resultset from sysobjects. (It's the volume of rows that matter here - not the content.) The second component is one of my Data Flow wiretap scripts that will dump timing information to the logging stream that we'll be able to watch in the Output window. The last component is a script destination. Normally, I'd just use a Union All or Derived Column to cap the flow, but those result in a lot of "unused column" warnings that would clutter up the logging stream.As you can see below, running this flow results in something like this pretty quickly. In my case, under 1.5 seconds, just about as quick as SQL Server can deliver the rows to SSIS. (So fast I didn't bother to capture an "in-flight" picture.)The next Data Flow Task in the package shows how an "impeded" data flow will behave. In this flow, there's one more script component in there.The component inserted at the third position is intended to mimic a transformation that takes effort to perform. This is the equivalent of an uncached Lookup, OLE DB Command, Merge Join, etc... What does this look like at runtime? Well, no problem taking a screen capture in mid-execution on this one...How does this image show the effects of backpressure? It was taken over 20 seconds after the data flow started. Let me explain in more detail.Our first "free-flowing" data flow loaded the data into SSIS' pipeline in under 1.5 seconds. The second "impeded" data flow hasn't even finished loading the 100K rows after 20 seconds... why not? Because the "sleep" transform is slowing things down. So why isn't the source simply reading as fast as it can, and piling up the data at the slow transform's doorstep? Because if it did that, it would consume memory. The developers of SSIS thought that consuming memory in this situation would be completely unnecessary if they could just tell the source to "slow down" to a manageable level... so they did.Any "execution tree" (there's another bit of SSIS internals jargon) will experience this backpressure - forcing every component to operate no faster than the slowest. One of the key parts of managing the performance of your packages is understanding pipeline backpressure. It can mislead you about what's performing slowly in your flow, causing you to spend time and effort where it will have no effect. Tweaking the DefaultBufferSize and DefaultBufferMaxRows properties can change some of the effects of backpressure - but doing so only delays the inevitable with sufficiently large data sets, and typically offers little meaningful benefit. The real key is understanding your data flow's behaviour through decomposition.I have one more example data flow for you in the package, where I add a sort component to the middle of the flow. How will this affect the flow of data? The sleep component will definitely cause the end-to-end flow to be slow, but will the source stream quickly like the "free-flowing" example, or will it be throttled like the "impeded" flow? I'll leave that up to you to explain...