SQL Clone
SQLServerCentral is supported by Redgate
Log in  ::  Register  ::  Not logged in

Performing a Cross Join (Cartesian Product) in SSIS

There are a few times where you might want to perform a cross join/cartesian join/cartesian product operation in SSIS.  One of those scenarios could be performing some kind of a double-sided "range join" where you can't line up a specific key in your data flow to a specific key in your lookup table.  Quite often you're planning to  have a Conditional Split right after your cross join operation to heuristically eliminate the rows that don't qualify for your "range" matching.
Warning - performing a cross join is an inherently expensive endeavour!  It's called a cartesian product for a reason - you're multiplying rows.  Every row from one source will be replicated for each and every row from the other source.  If you have 100 rows from each source, your cartesian product will result in 100x100 = 10,000 rows.  This can quickly get to be a rediculously large number.  A source of 200,000 rows joined to a 3,000 row table - quite a reasonably small "range lookup" scenario - will result in 600,000,000 rows.  Don't say you haven't been warned about performance - in terms of processing power AND memory.
The obvious component you'll try to get this functionality from is the Merge Join - but it only supports Full Outer joins, Left Outer joins, and Inner joins.  There are no other components that offer join capability in SQL Server Integration Services, so how can you perform this kind of join?
Cross Join Pattern #1
By using several components together, you can replicate a cross join result.  The first method I'll describe will work with any two data flows, from any data source, at any point in the data flow.  It won't perform spectacularly - but then I already warned you about the dangers of performing cross joins, didn't I?  It's actually quite simple.
I've chosen to (quite uselessly) cross join SQL Server's sysobjects and sysindexes tables.
Step 1 - Add an Artificial Key Column
Drop Derived Column components onto your design surface - one per flow.  In each of those Derived Columns, add a brand-new column called "join_key" and give it a value of 1.  (You can call your column whatever you like, and give it whatever value you like - as long as you give the column in each flow the same value.)
Step 2 - Sort on the Artificial Key
Now drop two Sort components on your surface, attaching them to the outputs of the Derived Columns, and configure them to sort on the artificial key.  In my case, I set them to sort on "join_key" (of course).
Step 3 - Perform a Full Outer Join
Place a Merge Join component on the design surface, attach both flows to it - the inputs you attach them to don't matter at all.  Configure the component to perform a Full Outer Join.  Select the columns you want from each input - and be careful to label them appropriately as they exit the component.  You do NOT need to select the artificial key column - I've left my "join_key" unselected.
The Result
As you can see, my queries of sysobjects and sysindexes have been cross joined to each other.  Individually, they have very few rows each, but the cross join makes that row count explode.

Cross Join Pattern #2
This pattern performs a little better than #1 - but it has more restrictions on it.  This technique can't be used anywhere in the data flow - it has to be used at the top.  It can also only be used on sources where we can ask for artificially generated columns - so it's not going to work on text files.
Step 1 - Add an Artificial Key Column
We still have to do this - but we do it a different way this time.  First, we have to add the join_key to the source query.  We don't have to issue an ORDER BY operation in the query (not that SSIS would pay attention to it), because the join_key is a constant.
Step 2 - Sort on the Artificial Key
Like we said before, we don't actually need to sort the data - we just have to let SSIS know that it is sorted, otherwise it won't let us use the Merge Join component to merge the flows.  In order to do that, we have to right-click on the OLE DB Source, and select Advanced Editor.  Change to the Input and Output Properties tab, select the "OLE DB Source Output" node, and find the IsSorted property.  Set that property to true.
Next, expand the "OLE DB Source Output" node, expand the "Output Columns" node, select the "join_key" node (or whatever column name you used), and find the SortKeyPosition property.  Set that property to 1.  (Click on the images to get larger, readable copies.)
Step 3 - Perform a Full Outer Join
Place a Merge Join component on the design just like in pattern #1, and you've got your cross join set up.


No comments.

Leave a Comment

Please register or log in to leave a comment.