How to handle "big data" or "large data" with SSIS?
Posted Tuesday, October 29, 2013 1:41 PM
SSC-Enthusiastic

I am running dynamic queries that often return huge (300 MB - 1 GB) result sets initially. Later they should not be this big (though I'm not sure about that) because I will be using delta loading. These result sets are loaded into a C# DataTable. A script loops over the rows and generates a query (stored in an SSIS variable) to load them into the appropriate destination columns (determined by other scripts).

For small result sets, my package runs properly. But for big ones, it simply fails with an out-of-memory error.
How do I resolve this problem? Can you suggest some strategies? I guess I could fetch smaller parts of the data at a time and then load them into the target, but I'm not sure how to go about it. Is there a recipe for this?

A brief intro to how the process works:

1. Execute SQL: get the big result set.
2. Script: read each row and generate a SQL string like "INSERT INTO TableABC VALUES" + {all columns of 1 row here}, then concatenate that SQL to a string destinationInsert.
3. Execute SQL: execute the SQL inside the string destinationInsert.

ETL process complete. Does that help?
Post #1509553
Posted Tuesday, October 29, 2013 10:57 PM


SSCertifiable

300 MB of data should load in no time flat, but you seem to be going at it the hard way.

If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into the table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data in directly.

Even if you end up having to clean up data or manipulate other things, your best bet is to get the data into the DB itself as quickly as you can, clean it up, THEN move it around. Trying to clean it up row by row in transit is going to lead to pain.
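
As a rough sketch of that staging-first approach (table, file, and column names here are all made up for illustration):

BULK INSERT StagingTable
FROM 'C:\data\input.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- ...then clean and move the data with set-based SQL, not row by row.
INSERT INTO DestTable (ColA, ColB)
SELECT LTRIM(RTRIM(ColA)), ColB
FROM StagingTable
WHERE ColA IS NOT NULL;

The point is that the per-row work happens inside the engine as one set operation, not as one generated INSERT statement per row.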


----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user. All right - what was my emergency again?
Post #1509650
Posted Wednesday, October 30, 2013 1:13 AM


SSCertifiable

Matt Miller (#4) (10/29/2013)
300 MB of data should load in no time flat, but you seem to be going at it the hard way.

If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into the table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data in directly.

Even if you end up having to clean up data or manipulate other things, your best bet is to get the data into the DB itself as quickly as you can, clean it up, THEN move it around. Trying to clean it up row by row in transit is going to lead to pain.


+1, though doing some clean-up in non-blocking components such as Derived Column transformations should not slow things down too much. Just keep the data pipeline fairly simple and avoid looping and blocking components. By using .NET DataTables, you are throwing away a lot of the power of SSIS and may as well write the entire process as a C# app (that's not a recommendation).



Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.
Post #1509667
Posted Wednesday, October 30, 2013 2:16 AM


SSChampion

Matt Miller (#4) (10/29/2013)

If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into the table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data in directly.


It seems the problem the OP is having is that the sources are dynamic, which won't work in a regular data flow.

Matt Miller (#4) (10/29/2013)

Trying to clean it up row by row in transit is going to lead to pain.


Synchronous components in SSIS have almost no performance impact.

@blasto_max: can you describe your sources a bit more? How come they are "dynamic"? Are they predictable?




How to post forum questions.
Need an answer? No, you need a question.
What’s the deal with Excel & SSIS?

Member of LinkedIn. My blog at LessThanDot.

MCSA SQL Server 2012 - MCSE Business Intelligence
Post #1509677
Posted Wednesday, October 30, 2013 10:29 AM
SSC-Enthusiastic

Koen Verbeeck (10/30/2013)
Matt Miller (#4) (10/29/2013)

If the input file is simply to be inserted into a table, why not use a flat file data source and bulk load it directly into the table? Going through and manually building out a huge series of single-row inserts is not going to scale, especially when you have facilities to dump the data in directly.


It seems the problem the OP is having is that the sources are dynamic, which won't work in a regular data flow.

Matt Miller (#4) (10/29/2013)

Trying to clean it up row by row in transit is going to lead to pain.


Synchronous components in SSIS have almost no performance impact.

@blasto_max: can you describe your sources a bit more? How come they are "dynamic"? Are they predictable?


Thanks for those questions. I have dynamic sources. In brief: the source looks like (SomeID, random column names...), while the destination has sane column names.

In the source, Column1, for example, is just a value holder. For SomeID = 1, Column1 could go to Dest1, and for SomeID = 7, Column1 could go to Dest19. That is how it is. So I create mappings of what goes where, then generate SQL code to fetch the data and load it. For SomeID = 1, Column1 is always mapped to a fixed destination.
After getting the mappings for SomeID = 1, I fetch HUGE data from the source tables for each Sub ID associated with an ID. It's a mess that I did not create.
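
For what it's worth, that SomeID-driven mapping could be captured in a table and used to generate set-based INSERT ... SELECT statements, instead of building one INSERT per row in a script. A rough sketch (all names hypothetical):

-- One row per (SomeID, source column) pair, saying where each value lands.
CREATE TABLE ColumnMapping (
    SomeID    int     NOT NULL,
    SourceCol sysname NOT NULL,
    DestTable sysname NOT NULL,
    DestCol   sysname NOT NULL
);

-- e.g. (1, 'Column1', 'Dest1', 'SomeCol') and (7, 'Column1', 'Dest19', 'OtherCol').
-- A script or dynamic SQL can then build one INSERT ... SELECT per
-- destination table from these rows, rather than one INSERT per source row.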

Is the situation clearer now?
Post #1509862
Posted Wednesday, October 30, 2013 11:25 AM


SSCertifiable

Two ways to do this:
1. The SSIS route: use the flat file connector to make SSIS aware of the generic content, then use a Conditional Split based on Column1 to define multiple mappings to multiple destinations based on your Column1 values.
2. The SSIS + SQL route: still use the flat file to dump the data into a staging table (which matches the generic input columns you have), then run a series of INSERT statements which filter based on the value of Column1, e.g.

INSERT INTO MyDest19 (col1, col2, col3, col4, etc....)
SELECT col1, col2, col3, col4, etc....
FROM mystagingtable
WHERE column1 = 'Dest19'



Post #1509889
Posted Wednesday, October 30, 2013 1:33 PM
SSC-Enthusiastic

Matt Miller (#4) (10/30/2013)
Two ways to do this:
1. The SSIS route: use the flat file connector to make SSIS aware of the generic content, then use a Conditional Split based on Column1 to define multiple mappings to multiple destinations based on your Column1 values.
2. The SSIS + SQL route: still use the flat file to dump the data into a staging table (which matches the generic input columns you have), then run a series of INSERT statements which filter based on the value of Column1, e.g.

INSERT INTO MyDest19 (col1, col2, col3, col4, etc....)
SELECT col1, col2, col3, col4, etc....
FROM mystagingtable
WHERE column1 = 'Dest19'



Can't I just somehow execute my big .SQL files instead?
Post #1509925
Posted Wednesday, October 30, 2013 6:01 PM


SSCertifiable

blasto_max (10/30/2013)
Matt Miller (#4) (10/30/2013)
Two ways to do this:
1. The SSIS route: use the flat file connector to make SSIS aware of the generic content, then use a Conditional Split based on Column1 to define multiple mappings to multiple destinations based on your Column1 values.
2. The SSIS + SQL route: still use the flat file to dump the data into a staging table (which matches the generic input columns you have), then run a series of INSERT statements which filter based on the value of Column1, e.g.

INSERT INTO MyDest19 (col1, col2, col3, col4, etc....)
SELECT col1, col2, col3, col4, etc....
FROM mystagingtable
WHERE column1 = 'Dest19'



Can't I just somehow execute my big .SQL files instead?


I suppose it's possible with enough RAM and an idle server, but it's very inefficient and will always be subject to resource issues like the ones you've already seen whenever you have a big day or something changes. You could possibly look at using a flat file OUTPUT and devising some custom streaming solution to build up the SQL statement in a file rather than in memory, but at that point you really aren't using SSIS at all.

Just for giggles, I dummied up a version of option 1 above. It took about 6 minutes to wire it up and 6 more minutes to shake out a few data-type bugs. My first run with a 5.2M-row (600 MB) file took 53 seconds to load into 2 tables based on a filter, and that's before I've done any tweaks to turn on bulk inserts. Single-row inserts would be running for literally hours on that kind of operation.

SQL Server is designed to perform best on set operations (i.e. bunches of rows at a time). If you can design your process that way, you will be better off in the long run.
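
If it helps, once the data is in staging, even the final move can be made bulk-friendly: under the right conditions (SQL Server 2008+, a heap or empty target, simple or bulk-logged recovery), a TABLOCK hint lets the engine minimally log an INSERT ... SELECT. Reusing the hypothetical names from above:

INSERT INTO MyDest19 WITH (TABLOCK) (col1, col2)
SELECT col1, col2
FROM mystagingtable
WHERE column1 = 'Dest19';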


Post #1509985