Anatomy of an Incremental Load

  • Hi,

    A very nice article. I have a few things to add to this post:

    1. Why should we bring all the data from the source during an incremental load? Why can't we bring only the data that was created or modified yesterday?

    2. We have tested, and are using, an approach where a date parameter is appended to the source SQL so that it pulls only the rows that have been inserted or modified (see the sketch below).

    3. Then we can use SCD or a lookup depending upon the size of the data delta.
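
    For example, the source query with the date parameter looks something like this (the table and column names are just placeholders, not anyone's actual schema; the ? is the OLE DB parameter mapped to an SSIS variable holding the last load date):

      -- Hypothetical parameterized source query for the OLE DB Source
      SELECT CustomerID, FirstName, LastName, Email, ModifiedDate
      FROM dbo.Customer
      WHERE ModifiedDate >= ?;   -- only rows created or changed since the last load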

  • This is a great solution by virtue of its simplicity. However, it is not workable for environments with large transactional loads on the source and destination tables.

    I work at a company that provides real estate marketing solutions. Our core operational database must be synced to a customized transactional layer that is both accessed by an enterprise GUI backend and replicated to redundant (web-facing drone) copies.

    This must be done as fast as possible. I have experimented with the approach you mentioned in the article, but I found it extremely slow.

    I ended up creating an audit table for incremental, full, and custom syncs. The logic is to go back to the previous sync date and time and stage the source data locally on the destination server. Indexes are added, and a single source query renders "new" and "changed" records, which are then conditionally redirected in the data flow to an OLE DB Destination and an OLE DB Command, respectively. This approach has virtually eliminated latency between the ODS source and the transactional destination (and implicitly the drone destinations).
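
    Roughly the shape of it in T-SQL (not our production code; the object names here are made up for illustration):

      -- 1. Find where the last sync left off, then stage only rows changed since then.
      DECLARE @LastSync datetime;
      SELECT @LastSync = MAX(SyncEndTime)
      FROM dbo.SyncAudit
      WHERE SyncType = 'Incremental';

      INSERT INTO stage.Listing (ListingID, Price, Status, ModifiedDate)
      SELECT ListingID, Price, Status, ModifiedDate
      FROM ods.Listing
      WHERE ModifiedDate > @LastSync;

      -- 2. Index the staged rows so the new/changed split stays cheap.
      CREATE INDEX IX_stage_Listing ON stage.Listing (ListingID);

      -- 3. One query classifies each staged row; the data flow then routes 'New' rows
      --    to an OLE DB Destination and 'Changed' rows to an OLE DB Command.
      SELECT s.ListingID, s.Price, s.Status,
             CASE WHEN d.ListingID IS NULL THEN 'New' ELSE 'Changed' END AS RowAction
      FROM stage.Listing AS s
      LEFT JOIN dbo.Listing AS d ON d.ListingID = s.ListingID;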

    Scott Warren

    ETL Developer

  • Thanks Scott for adding some valuable content re. volume throughput.

    One thing I learned after working in a shop that integrated medical claims records (one of my accounts ran 50 million records per batch) is that you are not likely to encounter an ETL "pattern" that will scale well - even using SSIS - unless performance for a large data set has been considered in every step of the processing.

    Having said that - and in the author's defense - just as often the pain point and the majority of costs in ETL processing are in the logic development and maintenance, not in the data volume.

    Ken

  • My take on this article was that it was meant to demonstrate some SSIS mechanics for identifying new and changed rows, and, as such, I found it pretty useful, notwithstanding comments about performance.

    Jay Taylor

  • Ken (and also Jay),

    I understand that my offering is not applicable in all cases and that some shops are not concerned about performance. I am just illustrating alternatives for shops where the performance of a memory-hogging Lookup Transformation and an all-inclusive source result set will not suffice.

    I agree that Andy's solution is easier to maintain for the uninitiated, but I am something of a specialist. I did not want beginners reading the article to think that this is the only or best solution for an incremental sync of environments.

    Scott Warren

    ETL Developer

  • Understood Scott. And I think your contribution will likely help me in my work in the future.

    And I personally feel that all technical examples offered in venues such as this should have their volume and performance limitations investigated and noted in the article, as this issue can be a "deal breaker" for an ETL solution and an unpleasant disappointment if users make incorrect assumptions.

    As I mentioned previously, however, I also expect that such performance limitations would NOT be an issue for many ETL problems, although, as you have experienced, they will be for some.

    Best,

    Ken

  • Hi mescotty22,

    This article was originally published in early 2008. The purpose of the article was to introduce and demonstrate one way of loading data incrementally using SSIS. It's not the only way, and this design pattern is not the ultimate solution for all incremental loading strategies. For me, this (or something very similar to this) is the usual starting point.

    If this doesn't perform for me as desired, I tinker with the design. How and where I tinker... depends.

    There were some really cool suggestions about approaches to making this solution scale. Some can be found earlier in this forum near this post (http://www.sqlservercentral.com/Forums/FindPost491603.aspx). Some are things I've tried before and some are things I'd never tried.

    Everyone is working in different environments. 100 million rows is a lot - sometimes. Data shape comes into play with scale, as does hardware. I currently work on servers with 24 to 96 CPUs and 256 to 1,024 GB RAM. Some 100M row loads on these systems are, frankly, no big deal.

    We're all working in different environments with different requirements. My intent with this article was to help folks get started understanding incremental loads with SSIS.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Hi Andy,

    Good article.

    I didn't read all the posts, so maybe someone has already written about it, but do you know the TableDifference component (http://www.sqlbi.eu/Default.aspx?tabid=74)? I have used it a couple of times and it is easier than the approach you showed us.

  • A very interesting article, which I have bookmarked for future reference. I, too, find SSIS difficult to learn coming from a VB/.NET/T-SQL background. I ended up building an extensible VB.NET task engine, with process-specific task handlers instantiated by reflection and the data processing written in T-SQL stored procedures, rather than trying to fight with SSIS to handle my very specific external EDI file import, error handling, reprocessing, and reporting requirements.

    One issue I have is that incremental loads are NOT always faster. I had one case with one step in an ETL process where an incremental load of the last 30 days of data, with aggregation occurring when selecting from the data source, took 3 minutes (~4,300 rows). FYI, the process pulls data from a remote server to the data warehouse.

    I was unhappy with that result, so I tried a TNL (Truncate-N-Load) method for all data from 8/1/06 to present (~128,000 rows). It took 10 seconds. The TNL method time blew me away versus the incremental method's time. It wasn't what I expected, but I guess we live and learn.
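
    For anyone curious, the TNL step boils down to something like the following (the object and linked-server names are illustrative only, not our actual ones):

      -- Truncate-N-Load: wipe the target, then reload everything from 8/1/06 forward in one set-based pass.
      TRUNCATE TABLE dw.DailySalesSummary;

      INSERT INTO dw.DailySalesSummary (SaleDate, ProductID, TotalQty, TotalAmount)
      SELECT SaleDate, ProductID, SUM(Qty), SUM(Amount)
      FROM RemoteSrv.SalesDB.dbo.SaleDetail   -- linked-server pull from the remote source
      WHERE SaleDate >= '20060801'
      GROUP BY SaleDate, ProductID;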

    J Pratt

  • If you have millions of rows in the source table, it may take longer to do the Lookup and Conditional Split comparison than to simply batch load all of the data into the destination table.

  • Hello Andy, thanks for a great overview using SSIS. I especially like the detailed explanation with screenshots. We currently use a stored procedure to handle our comparisons and upserts, once the data is loaded using SSIS. One of these steps is a stored proc to ensure the lookup tables are populated with any new values before updating the main table. Any particular thoughts on handling this or would we perform "mini" processes similar to this step?
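
    For reference, our "populate the lookups first" step is essentially this shape (heavily simplified, with made-up object names):

      -- Add any status codes present in the staged data but missing from the lookup,
      -- before the main-table upsert runs.
      CREATE PROCEDURE dbo.usp_PopulateStatusLookup
      AS
      BEGIN
          INSERT INTO dbo.StatusLookup (StatusCode, StatusDescription)
          SELECT DISTINCT s.StatusCode,
                 s.StatusCode          -- description defaults to the code until someone curates it
          FROM stage.MainTable AS s
          WHERE NOT EXISTS (SELECT 1
                            FROM dbo.StatusLookup AS l
                            WHERE l.StatusCode = s.StatusCode);
      END;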

  • Um, just out of curiosity - I didn't see any discussion of just doing this in SQL.

    Why use SSIS at all?
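
    For instance, if the source and destination tables live on the same instance, something like this handles the whole new-versus-changed logic in one statement (hypothetical names; MERGE requires SQL Server 2008 or later):

      -- Upsert handled entirely in T-SQL, no data flow required.
      MERGE dbo.Contact AS dest
      USING stage.Contact AS src
          ON dest.ContactID = src.ContactID
      WHEN MATCHED AND (dest.Email <> src.Email OR dest.Phone <> src.Phone)
          THEN UPDATE SET Email = src.Email, Phone = src.Phone
      WHEN NOT MATCHED BY TARGET
          THEN INSERT (ContactID, Email, Phone)
               VALUES (src.ContactID, src.Email, src.Phone);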

  • I prefer to use the Merge Join transformation combined with a Conditional Split. For more complex situations, use the Konesans checksum transformation to compare the staging table against the fact/dimension table.
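
    In plain T-SQL the same checksum idea looks roughly like this (not the Konesans component itself, just an illustration with made-up names; NULL columns would need ISNULL wrapping in real code):

      -- Rows missing from the dimension, or whose column hash differs, feed the insert/update paths.
      SELECT s.ProductKey
      FROM stage.Product AS s
      LEFT JOIN dim.Product AS d ON d.ProductKey = s.ProductKey
      WHERE d.ProductKey IS NULL
         OR HASHBYTES('MD5', s.Name + '|' + s.Category + '|' + CAST(s.ListPrice AS varchar(20)))
         <> HASHBYTES('MD5', d.Name + '|' + d.Category + '|' + CAST(d.ListPrice AS varchar(20)));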

    Regards,

    Robert Cooke

  • There are use cases for SSIS that are not served well by using T-SQL. The example used in the article is not a good use case for SSIS over T-SQL - at least not all of the time.

    SSIS is suited for moving data between instances of SQL Server on different physical servers or between SQL Server and another database platform. SSIS can also perform better than T-SQL on certain large data transfers between SQL Server databases.

    As is often the case, the correct answer is "it depends". I encourage developers to test and see which will perform better: T-SQL or SSIS. That's what I do.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • I respectfully disagree that incremental loads are faster than the truncate method. The problem with an incremental load is the amount of resources it uses on the SSIS side to manage the joins and lookups. SSIS must load the dataset into memory before it can perform a comparison, so in addition to the resources required to pull the data in SQL, the resources needed by the SSIS machine to manage the comparisons may be even greater. Part of this is the need to ensure string comparisons are done with trimmed fields, plus the re-sorting and data conversions required to make the comparisons accurate.

    I have yet to see a comment anywhere that questions the accuracy of the comparisons, but inaccuracies do occur if these factors are not taken into consideration. This can also happen when the collation of the SQL Server differs from the collation of the Windows server that runs the SSIS package.

    From my experience, we have processed, logged, validated, and loaded about 60 million records in 2.5 hours using the truncate method (dropping and re-creating specific indexes), as opposed to the incremental method, which takes about 6-8 hours.
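
    The truncate path, stripped down to its bones, runs along these lines (the object names are illustrative, not our real ones):

      -- Drop the costly nonclustered index, bulk load with a table lock, then rebuild the index.
      DROP INDEX IX_Claim_MemberID ON dw.Claim;
      TRUNCATE TABLE dw.Claim;

      INSERT INTO dw.Claim WITH (TABLOCK) (ClaimID, MemberID, ClaimDate, Amount)
      SELECT ClaimID, MemberID, ClaimDate, Amount
      FROM stage.Claim;

      CREATE NONCLUSTERED INDEX IX_Claim_MemberID ON dw.Claim (MemberID);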
