How to handle "big data" or "large data" with SSIS?

  • I am running dynamic queries that often return HUGE (300MB - 1GB) result sets initially. Later the result sets should not be this big (though I'm not sure about that) because I will be using delta loading. Each result set is loaded into a C# DataTable. A script loops over the rows and generates a query (stored in an SSIS variable) to load them into the appropriate destination columns (which are determined by other scripts).

    For small result sets, my package runs properly. But for big ones, it simply fails with an out-of-memory error.

    How do I resolve this problem? Can you suggest some strategies? I guess I could fetch smaller parts of the data at a time and then load them into the target, but I'm not sure how to go about it. Is there a recipe for this?

    A brief intro to how the process works:

    1. Execute SQL: get the big result set.
    2. Script: read each row and generate a SQL string like "INSERT INTO TableABC VALUES" + {all columns of 1 row here}, then concatenate that SQL onto a string variable named destinationInsert (roughly as sketched at the end of this post).
    3. Execute SQL: execute the SQL held in destinationInsert.

    ETL process complete. Does that help?
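
    A rough example of what the destinationInsert string ends up holding (the values are made up; TableABC is the table from the steps above):

    INSERT INTO TableABC VALUES (1, 'abc', 100);
    INSERT INTO TableABC VALUES (2, 'def', 200);
    -- ...and so on, one INSERT per source row, all concatenated into a single string variable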

  • 300MB of data should load in no time flat, but you seem to be going at it the hard way.

    If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into a table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data directly.

    Even if you end up having to clean up data or manipulate other things - your best bet is to get the data into the DB itself as quickly as you can, clean it up THEN move it around. Trying to clean it up row by row in transit is going to lead to pain.
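
    For example, the T-SQL side of that idea can be as simple as a BULK INSERT into a staging table (a sketch only - the file path, terminators and table name below are placeholders):

    BULK INSERT dbo.StagingGeneric                -- placeholder staging table
    FROM 'C:\Import\bigextract.txt'               -- placeholder file path
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,                      -- skip a header row, if there is one
        TABLOCK,                                  -- table lock so the load can be minimally logged
        BATCHSIZE       = 100000                  -- commit in chunks instead of one giant transaction
    );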

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (10/29/2013)


    300MB of data should load in no time flat, but you seem to be going at it the hard way.

    If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into a table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data directly.

    Even if you end up having to clean up data or manipulate other things - your best bet is to get the data into the DB itself as quickly as you can, clean it up THEN move it around. Trying to clean it up row by row in transit is going to lead to pain.

    +1, though doing some clean-up in non-blocking components such as Derived Column transformations should not slow things down too much. Just keep the data pipeline fairly simple and avoid looping and blocking components. By using .NET DataTables, you are throwing away a lot of the power of SSIS and may as well write the entire process as a C# app (that's not a recommendation :-))

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Matt Miller (#4) (10/29/2013)


    If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into a table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data directly.

    It seems the problem the OP is having is that the sources are dynamic, which won't work in a regular data flow.

    Matt Miller (#4) (10/29/2013)


    Trying to clean it up row by row in transit is going to lead to pain.

    Synchronous components in SSIS have almost no performance impact.

    @blasto_max: can you describe your sources a bit more? How come they are "dynamic"? Are they predictable?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (10/30/2013)


    Matt Miller (#4) (10/29/2013)


    If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into a table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data directly.

    It seems the problem the OP is having is that the sources are dynamic, which won't work in a regular data flow.

    Matt Miller (#4) (10/29/2013)


    Trying to clean it up row by row in transit is going to lead to pain.

    Synchronous components in SSIS have almost no performance impact.

    @blasto_max: can you describe your sources a bit more? How come they are "dynamic"? Are they predictable?

    Thanks for those questions. I have dynamic sources. In brief: the source looks like SomeID plus random column names... while the destination has sane column names...

    In the source, Column1, for example, is just a value holder. For SomeID = 1, Column1 could go to Dest1, and for SomeID = 7, Column1 could go to Dest19. That's just how it is. So I create mappings of what goes where, then generate SQL code to fetch the data and load it. For a given ID, Column1 is always mapped to a fixed destination.

    After getting the mappings for ID = 1, I fetch HUGE amounts of data from the source tables for each Sub ID associated with that ID. It's a mess that I did not create.

    Is the situation clearer now?
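
    Roughly, the mapping I end up building looks like this (purely illustrative - the real table and column names are different):

    -- Hypothetical shape of the SomeID-driven mapping (all names invented)
    CREATE TABLE dbo.ColumnMapping
    (
        SomeID       INT           NOT NULL,
        SourceColumn NVARCHAR(128) NOT NULL,  -- generic value holder in the source, e.g. 'Column1'
        DestColumn   NVARCHAR(128) NOT NULL   -- the "sane" destination column it maps to
    );

    INSERT INTO dbo.ColumnMapping (SomeID, SourceColumn, DestColumn)
    VALUES (1, 'Column1', 'Dest1'),
           (7, 'Column1', 'Dest19');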

  • Two ways to do this:

    1. The SSIS route - use the Flat File connector to make SSIS aware of the generic content, then use a Conditional Split on column1 to define multiple mappings to multiple destinations based on your column1 values.

    2. The SSIS + SQL route - still use the flat file to dump the data into staging (which matches the generic input columns you have), then run a series of INSERT statements which filter based on the value of column1, e.g.

    INSERT INTO MyDest19 (col1, col2, col3, col4, etc.)
    SELECT col1, col2, col3, col4, etc.
    FROM mystagingtable
    WHERE column1 = 'Dest19'

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (10/30/2013)


    Two ways to do this:

    1. The SSIS route - use the Flat File connector to make SSIS aware of the generic content, then use a Conditional Split on column1 to define multiple mappings to multiple destinations based on your column1 values.

    2. The SSIS + SQL route - still use the flat file to dump the data into staging (which matches the generic input columns you have), then run a series of INSERT statements which filter based on the value of column1, e.g.

    INSERT INTO MyDest19 (col1, col2, col3, col4, etc.)
    SELECT col1, col2, col3, col4, etc.
    FROM mystagingtable
    WHERE column1 = 'Dest19'

    Can't I just somehow execute my big .SQL files instead?

  • blasto_max (10/30/2013)


    Matt Miller (#4) (10/30/2013)


    Two ways to do this:

    1. The SSIS route - use the Flat File connector to make SSIS aware of the generic content, then use a Conditional Split on column1 to define multiple mappings to multiple destinations based on your column1 values.

    2. The SSIS + SQL route - still use the flat file to dump the data into staging (which matches the generic input columns you have), then run a series of INSERT statements which filter based on the value of column1, e.g.

    INSERT INTO MyDest19 (col1, col2, col3, col4, etc.)
    SELECT col1, col2, col3, col4, etc.
    FROM mystagingtable
    WHERE column1 = 'Dest19'

    Can't I just somehow execute my big .SQL files instead?

    I suppose it's possible with enough RAM and an idle server, but it's very inefficient and will always be subject to running into resource issues like you've already seen whenever you have a big day or something changes. You could possibly look at using a flat file OUTPUT and devising some custom streaming solution to build up the SQL statement in a file rather than in memory, but at that point you really aren't using SSIS at all.

    Just for giggles, I dummied up a version of option 1 above. It took about 6 minutes to wire it up and 6 more minutes to shake out a few data type bugs. My first run of a 5.2M-row (600MB) file took 53 seconds to load into 2 tables based on a filter, and that's before I've done any tweaks to turn on bulk inserts. Single-row inserts would be running for literally hours on that kind of operation.

    SQL Server is designed to perform best on set operations (i.e. bunches of rows at a time). If you can design your process that way, you will be better off in the long run.
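
    To make that concrete, the whole load can boil down to a handful of statements shaped like this (a sketch only, reusing the placeholder names from earlier in the thread):

    -- One set-based INSERT ... SELECT per destination instead of one INSERT per source row.
    -- Filter on whatever identifies the mapping (column1 in my earlier example, SomeID in yours).
    INSERT INTO MyDest1 (col1, col2, col3)
    SELECT s.col1, s.col2, s.col3
    FROM mystagingtable AS s
    WHERE s.column1 = 'Dest1';

    INSERT INTO MyDest19 (col1, col2, col3)
    SELECT s.col1, s.col2, s.col3
    FROM mystagingtable AS s
    WHERE s.column1 = 'Dest19';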

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
