Single vs Multiple Data Flows

Jason Whitney

Mr or Mrs. 500

Points: 508
November 5, 2012 at 2:56 pm

#151113

I have used SSIS a bit in the past (2005), but I am embarking on a new BI project and want to know the best practice for using a single data flow vs multiple data flows. My scenario: I have 16 databases that are 'almost' identical. The rules needed to clean and transform the data will be 90% the same, but each database will have a few outliers that require special steps just for that db's data.
Should I:
1. Create a single data flow with 16 data sources, each with a couple of db-specific steps, before hitting a Union All and running the shared 90% of the validation once? This keeps the logic implemented in only one place, even if handling all the different use cases makes it more complicated.
2. Create 16 data flows, one for each database, and duplicate all the logic in each data flow to handle the db-specific issues? Each data flow is smaller, but the duplicated logic is spread across the package.

Koen Verbeeck SSC Guru Points: 259085 · Answer 1

Normally I would go for option 1, as it minimizes code duplication.

That said, 16 sources is a lot, which can cause issues if there's memory pressure:

http://www.mattmasson.com/index.php/2012/01/too-many-sources-in-a-data-flow/

And maintaining a UNION ALL with 16 inputs is also a nightmare 🙂

But so is maintaining 16 different dataflows.

I would try option 1 first, and test whether that many sources cause issues on your system.
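To make the shape of option 1 concrete, here is a rough sketch in Python rather than SSIS: each source gets only its own db-specific steps, the streams are unioned, and the shared logic runs once over the combined rows. Every name in it (`DB_SPECIFIC_STEPS`, `trim_customer_code`, the `customer_code` column, `db03`) is made up for illustration and does not come from the actual project.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Step = Callable[[Row], Row]


def trim_customer_code(row: Row) -> Row:
    """Hypothetical step that only one database needs."""
    return {**row, "customer_code": str(row["customer_code"]).strip()}


# Hypothetical registry: database name -> its db-specific steps.
# Databases without outliers simply have no entry.
DB_SPECIFIC_STEPS: Dict[str, List[Step]] = {
    "db03": [trim_customer_code],
}


def read_source(db_name: str, rows: Iterable[Row]) -> Iterator[Row]:
    """Apply the db-specific steps and tag each row with its origin."""
    for row in rows:
        for step in DB_SPECIFIC_STEPS.get(db_name, []):
            row = step(row)
        yield {**row, "source_db": db_name}


def shared_validation(row: Row) -> Row:
    """The common logic (the '90%'), written in one place."""
    return {**row, "customer_code": str(row["customer_code"]).upper()}


def run_single_flow(sources: Dict[str, Iterable[Row]]) -> List[Row]:
    """Union all sources, then run the shared logic once over the result."""
    unioned = chain.from_iterable(
        read_source(name, rows) for name, rows in sources.items()
    )
    return [shared_validation(row) for row in unioned]
```

Adding a 17th database here means adding one entry to the registry, which is the maintenance benefit option 1 is after; the cost is that all sources feed one pipeline, so the memory concern above still applies.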

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Jason Whitney Mr or Mrs. 500 Points: 508 · Answer 2

Thanks for the advice. I will try option 1 and let you know how it goes.

P Jones SSChampion Points: 12352 · Answer 3

How about a For loop (or rather a Foreach loop over the data sources) that processes one data source at a time and has some logic inside the loop for the differences? I often use scripts to set the values of variables, which can then be tested as part of a constraint condition and used in expressions on connections.
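As a rough illustration of the loop idea, sketched in Python rather than SSIS: one iteration per data source, a connection string built from per-iteration values, and a flag that decides whether the special step runs before the shared logic. The settings keys, the `order_date` and `status` columns, and the `extract` callable are all assumptions made up for this example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def connection_string(server: str, database: str) -> str:
    """Stand-in for an expression on a connection, evaluated per iteration."""
    return f"Data Source={server};Initial Catalog={database};"


def swap_day_month(row: Row) -> Row:
    """Hypothetical special step: dd/mm/yyyy -> mm/dd/yyyy."""
    day, month, year = str(row["order_date"]).split("/")
    return {**row, "order_date": f"{month}/{day}/{year}"}


def shared_clean(row: Row) -> Row:
    """Logic common to every source, defined once."""
    return {**row, "status": str(row["status"]).lower()}


def process_source(source: Dict[str, object],
                   extract: Callable[[str], List[Row]]) -> List[Row]:
    """Body of the loop: runs once for a single data source."""
    rows = extract(connection_string(str(source["server"]),
                                     str(source["database"])))
    # Stand-in for a constraint condition that tests a variable.
    if source["swap_day_month"]:
        rows = [swap_day_month(r) for r in rows]
    return [shared_clean(r) for r in rows]


def run_loop(sources: List[Dict[str, object]],
             extract: Callable[[str], List[Row]]) -> List[Row]:
    """Process the sources one at a time and collect the results."""
    results: List[Row] = []
    for source in sources:
        results.extend(process_source(source, extract))
    return results
```

Because only one source is open per iteration, this avoids both the 16-input union and 16 copies of the shared logic, at the price of processing the sources sequentially.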