If you've ever needed to debug a fairly complex Data Flow, you've probably wanted to take a peek at what data is moving through the flow at various points. The data viewer in Integration Services is how you do that - but it has a major drawback. You may be processing thousands or millions of rows - and you'll have to page through each buffer with a plain data viewer. If you're trying to debug such a flow, one of the best ways to do so is by narrowing the scope of what you're looking at. But doing so with a data viewer means you need to restructure or filter your data flow, which means you definitely need to remember to remove that filtering before it goes to the next step of development.
Here's a little tip that should help you weed through data that you don't want to see, as well as move your "viewport" from one place to another in the data flow as easily as possible. Almost as easily as just removing and adding a new data viewer.The Parts
You'll have to add a Multicast, Conditional Split, Sort, and Derived Column to your data flow. They each serve a purpose - and depending on your flow, you may or may not need each one, or you may be able to exchange it for another component.The Multicast
This is the key to the "mobility" of using this technique. Wherever you want to see what's passing by - the place where you would otherwise place a data viewer - simply disconnect the flow, and insert the Multicast in between:
The Conditional Split
Next, since you don't want to see a million rows, or have to press that "play" button a thousand times, hoping to see relevant data, place a Conditional Split after the Multicast:
Inside the Conditional Split, place whatever condition will filter out only the records you're interested in seeing. In my case, I only want to see rows that have a customer key of 451:
This is an optional part of the technique, but can be quite useful when the rows you're interested in are scattered throughout the rows passing by in the data flow. If you've researched the optimizations that SSIS uses to process data, you should know that the Multicast doesn't actually copy memory and the Conditional Split doesn't actually split rows either. The Multicast is simply a visual aid for you to design packages with - the pipeline engine looks at it as a license to multi-thread the data flow. The Conditional Split isn't really moving rows down two paths, it's simply marking each row with an "output name", and only allows relevant rows to be operated on by transforms attached to the various outputs.
If you simply attached a data viewer to the output of the Conditional Split, you may have to page through tens or hundreds of buffers (pages) by pressing the "play" button... while only seeing one row on each page. Using the Sort component forces the pipeline engine to copy and reformat the buffer. You can think of it as "buffer defragmentation."
The Derived Column
Virtually any transform will do here - but the Derived Column works well in that you can simply drop it on the surface without configuration and it will work. It's purpose here is simply to provide an endpoint, so that we can place a data viewer between it and the previous transform:
Using the Technique
The first time you set up this sub-flow, it could look something like this:
Each part of the sub-flow I've described here would have correct metadata, including the data viewer. If you run the flow, you'll see your data viewer pop up with only the filtered results you were looking for. After you're confident that the flow is dealing with the data correctly up to that point, you may want to move it to examine what the data looks like after the next component in the main flow. To do that, you need to:
Detach the incoming flow to the "Data Viewer Multicast" as well as the outgoing flow from it, and reattach the main flow back together.
Detach the main flow where you want to insert the "intelligent" data viewer sub-flow, and attach the "Data Viewer Multicast" in between the two components.
Resolve any metadata issues (red circles with the white X) in the sub-flow.
Remove and re-add the data viewer on the flow entering the "Data Viewer Endpoint" if your column metadata's changed. (You won't get an error or warning about this!)
You would end up with something like this:
Wrapping It Up
No, this isn't the easiest or quickest way to monitor what's going through your pipeline - but it does provide more flexibility and usability than the plain data viewer for complex flows that need debugging. Especially ones like the one I'm working on now, where I'm moving this down the flow... then up... then back down again...