SQLServerCentral is supported by Red Gate Software Ltd.

Anatomy of an Incremental Load
Posted Sunday, August 16, 2009 11:41 AM
Valued Member


Group: General Forum Members
Last Login: Monday, July 28, 2014 5:43 PM
Points: 71, Visits: 480
Understood, Scott. And I think your contribution will likely help me in my work in the future.

I personally feel that technical examples offered in venues such as this should have their volume performance limitations investigated and noted in the article, since this issue can be a "deal breaker" for an ETL solution, and an unpleasant disappointment for users who make incorrect assumptions.

As I mentioned previously, though, I also expect such performance limitations would NOT be an issue for many ETL problems, although, as you have experienced, they will be for some.

Best,
Ken




Post #771641
Posted Sunday, August 16, 2009 1:00 PM


Old Hand


Group: General Forum Members
Last Login: Monday, July 21, 2014 12:44 PM
Points: 389, Visits: 1,041
Hi mescotty22,

This article was originally published in early 2008. The purpose of the article was to introduce and demonstrate one way of loading data incrementally using SSIS. It's not the only way, and this design pattern is not the ultimate solution for all incremental loading strategies. For me, this (or something very similar to this) is the usual starting point.

If this doesn't perform for me as desired, I tinker with the design. How and where I tinker... depends.
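For readers following along, the Lookup-plus-Conditional-Split logic the article builds can be sketched outside SSIS. A minimal illustration in Python with SQLite (table and column names are made up, and SSIS performs these steps in the data flow pipeline rather than in a loop):

```python
import sqlite3

# Hypothetical source and destination tables; in the article these live in
# SQL Server and the comparison happens inside the SSIS data flow.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dest (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src  VALUES (1, 'Anna'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO dest VALUES (1, 'Anna'), (2, 'Bobby');  -- row 2 changed, row 3 new
""")

# Cache the destination keys and values (what the SSIS Lookup does).
dest = {row_id: name for row_id, name in con.execute("SELECT id, name FROM dest")}

new_rows, changed_rows = [], []
for row_id, name in con.execute("SELECT id, name FROM src"):
    if row_id not in dest:          # Lookup: no match -> new row
        new_rows.append((row_id, name))
    elif dest[row_id] != name:      # Conditional Split: values differ -> changed row
        changed_rows.append((name, row_id))

# New rows are inserted; changed rows are updated (the article stages these
# for a set-based UPDATE rather than updating row by row).
con.executemany("INSERT INTO dest VALUES (?, ?)", new_rows)
con.executemany("UPDATE dest SET name = ? WHERE id = ?", changed_rows)
```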

There were some really cool suggestions about approaches to making this solution scale. Some can be found earlier in this forum near this post (http://www.sqlservercentral.com/Forums/FindPost491603.aspx). Some are things I've tried before and some were things I'd never tried.

Everyone is working in different environments. 100 million rows is a lot - sometimes. Data shape comes into play with scale, as does hardware. I currently work on servers with 24 to 96 CPUs and 256 to 1,024 GB RAM. Some 100M row loads on these systems are, frankly, no big deal.

We're all working in different environments with different requirements. My intent with this article was to help folks get started understanding incremental loads with SSIS.

:{> Andy


Andy Leonard
CSO, Linchpin People
Follow me on Twitter: @AndyLeonard
Post #771647
Posted Monday, August 17, 2009 4:48 AM
Forum Newbie


Group: General Forum Members
Last Login: Friday, July 25, 2014 4:50 AM
Points: 5, Visits: 304
Hi Andy,

Good article.

I didn't read all the posts, so maybe someone has already mentioned it, but did you know about the TableDifference component (http://www.sqlbi.eu/Default.aspx?tabid=74)? I've used it a couple of times and it's easier than the approach you showed us.
Post #771793
Posted Tuesday, August 18, 2009 6:40 AM
Grasshopper


Group: General Forum Members
Last Login: Friday, May 3, 2013 12:27 PM
Points: 21, Visits: 42
A very interesting article, which I have bookmarked for future reference. I, too, find SSIS difficult to learn coming from a VB/.NET/T-SQL background. Rather than fighting with SSIS to handle my very specific external EDI file import, error handling, reprocessing, and reporting requirements, I ended up building an extensible VB.NET task engine with process-specific task handlers instantiated by reflection, with the data processing written in T-SQL stored procedures.

One issue I have is that incremental loads are NOT always faster. I had a step in an ETL process where an incremental load of the last 30 days of data, with aggregation performed while selecting from the data source, took 3 minutes (~4,300 rows). FYI, the process pulls data from a remote server to the data warehouse.

I was unhappy with that result, so I tried a TNL (Truncate-N-Load) method for all data from 8/1/06 to the present (~128,000 rows). It took 10 seconds. The TNL method's time blew the incremental method's away. It wasn't what I expected, but I guess we live and learn.
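For comparison, the Truncate-N-Load method amounts to just two set-based statements. A minimal sketch in Python with SQLite (names are illustrative; SQL Server would use TRUNCATE TABLE rather than DELETE):

```python
import sqlite3

# Hypothetical source and a stale destination copy.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src  (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE dest (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO src  VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO dest VALUES (1, 10.0), (2, 99.0);  -- stale copy
""")

# Truncate-N-Load: no row-by-row comparison at all.
con.execute("DELETE FROM dest")  # SQLite's equivalent of TRUNCATE TABLE
con.execute("INSERT INTO dest SELECT id, amount FROM src")  # one bulk reload
```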
Post #772628
Posted Thursday, August 20, 2009 3:46 PM
Forum Newbie


Group: General Forum Members
Last Login: Monday, June 9, 2014 11:54 AM
Points: 4, Visits: 86
If you have millions of rows in the source table, the lookup and conditional split comparisons may take longer than simply batch-loading all the data into the destination table.
Post #774640
Posted Monday, August 24, 2009 4:36 AM
SSC Rookie


Group: General Forum Members
Last Login: Thursday, July 17, 2014 3:01 PM
Points: 39, Visits: 890
Hello Andy, thanks for a great overview using SSIS. I especially like the detailed explanation with screenshots. We currently use a stored procedure to handle our comparisons and upserts, once the data is loaded using SSIS. One of these steps is a stored proc to ensure the lookup tables are populated with any new values before updating the main table. Any particular thoughts on handling this or would we perform "mini" processes similar to this step?
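The lookup-table step described here can be sketched as a single set-based insert that runs before the main-table load. A minimal illustration in Python with SQLite, with hypothetical table names:

```python
import sqlite3

# A lookup (dimension) table and staged incoming data; 'blue' is a value
# the lookup table has not seen yet.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE color_lookup (color TEXT PRIMARY KEY);
    CREATE TABLE staging (id INTEGER, color TEXT);
    INSERT INTO color_lookup VALUES ('red');
    INSERT INTO staging VALUES (1, 'red'), (2, 'blue'), (3, 'blue');
""")

# Add any new lookup values first, so the subsequent main-table load
# never encounters a missing key.
con.execute("""
    INSERT INTO color_lookup (color)
    SELECT DISTINCT s.color
    FROM staging AS s
    WHERE s.color NOT IN (SELECT color FROM color_lookup)
""")
```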


Post #775873
Posted Monday, June 7, 2010 9:45 PM
Forum Newbie


Group: General Forum Members
Last Login: Wednesday, February 5, 2014 10:11 AM
Points: 7, Visits: 33
Just out of curiosity: I didn't see any discussion of simply doing this in T-SQL.
Why use SSIS at all?
Post #933772
Posted Monday, June 7, 2010 11:12 PM
Forum Newbie


Group: General Forum Members
Last Login: Sunday, August 21, 2011 7:43 PM
Points: 2, Visits: 17
I prefer to use the Merge Join transformation combined with a Conditional Split. For more complex situations, use the Konesans Checksum transformation to compare the staging table against the fact/dimension table.
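The checksum idea can be illustrated outside SSIS: compute one hash per row from its column values, then compare staging and dimension rows by hash instead of column by column. A minimal Python sketch (the hash function here is a stand-in for whatever the checksum transformation actually computes):

```python
import hashlib

def row_checksum(row):
    # Join column values with a separator unlikely to appear in the data,
    # so ('ab', 'c') and ('a', 'bc') do not produce the same hash.
    joined = "\x1f".join(str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical staging and dimension rows keyed by business key.
staging   = {1: ("Anna", "NY"), 2: ("Bob", "CA")}
dimension = {1: ("Anna", "NY"), 2: ("Bob", "TX")}

# One hash comparison per row replaces a column-by-column comparison.
changed = [key for key in staging
           if key in dimension
           and row_checksum(staging[key]) != row_checksum(dimension[key])]
```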

Regards,

Robert Cooke
Post #933796
Posted Tuesday, June 8, 2010 6:50 AM


Old Hand


Group: General Forum Members
Last Login: Monday, July 21, 2014 12:44 PM
Points: 389, Visits: 1,041
There are use cases for SSIS that are not served well by using T-SQL. The example used in the article is not a good use case for SSIS over T-SQL - at least not all of the time.

SSIS is suited for moving data between instances of SQL Server on different physical servers or between SQL Server and another database platform. SSIS can also perform better than T-SQL on certain large data transfers between SQL Server databases.

As is often the case, the correct answer is "it depends". I encourage developers to test and see which will perform better: T-SQL or SSIS. That's what I do.
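For anyone wanting to try the T-SQL-only route, the whole incremental load can collapse into a single set-based upsert when the source and destination are reachable from one query. A minimal sketch in Python with SQLite (SQL Server's MERGE, or an UPDATE/INSERT pair, plays the role that INSERT OR REPLACE plays here; names are made up):

```python
import sqlite3

# Same hypothetical source/destination shape as before.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dest (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src  VALUES (1, 'Anna'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO dest VALUES (1, 'Anna'), (2, 'Bobby');  -- row 2 stale, row 3 missing
""")

# One set-based statement handles both the "new rows" and "changed rows"
# paths that the SSIS design splits into separate outputs.
con.execute("INSERT OR REPLACE INTO dest SELECT id, name FROM src")
```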

:{> Andy


Andy Leonard
CSO, Linchpin People
Follow me on Twitter: @AndyLeonard
Post #933960
Posted Tuesday, June 8, 2010 11:46 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, October 1, 2013 2:54 PM
Points: 5, Visits: 34
I respectfully disagree that incremental loads are always faster than the truncate method. The problem with an incremental load is the amount of resources SSIS uses to manage the joins and lookups: SSIS must load the dataset into memory before it can perform a comparison. On top of the resources required to pull the data from SQL Server, the resources the SSIS machine needs to manage the comparisons may be even greater. Part of this is the work needed to make the comparisons accurate: trimming string fields, re-sorting, and converting data types.

I've yet to see a comment questioning the accuracy of the comparisons, but inaccuracies do occur if these steps are skipped, partly because the collation of the SQL Server can differ from the collation of the Windows Server machine that runs the SSIS package.

From my experience, we have processed, logged, validated, and loaded about 60 million records in 2.5 hours using the truncate method (dropping and re-creating specific indexes), as opposed to 6 to 8 hours for the incremental method.
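The comparison pitfalls mentioned above (untrimmed strings, differing case rules between collations) can be avoided by normalizing both sides before comparing. A minimal illustration in Python, with made-up values standing in for a source row and its warehouse copy:

```python
def normalize(value):
    # Trim padding and apply one agreed-upon case rule before comparing,
    # so trailing spaces or case differences between servers do not flag
    # an unchanged row as changed.
    return value.strip().casefold()

source_name, warehouse_name = "Smith  ", "SMITH"

raw_equal = source_name == warehouse_name                               # False: a bogus "change"
normalized_equal = normalize(source_name) == normalize(warehouse_name)  # True: no real change
```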
Post #934177