Anatomy of an Incremental Load

  • Hi Andy,

    Thanks for a very informative article!

    Some things I found while walking through this.

    The Changed Rows detection condition expression failed for me. The error was that the expression returned a null. I think this is because some of the columns being tested contained NULLs. If this is true, what would be the best way to handle it? I thought of something like (ISNULL(NameStyle, '') != ISNULL(Dest_NameStyle, '')) - see the sketch after this post. I thought that this would also eliminate the need for the ISNULL check at the end.

    When I tested, I also noticed that the second time I ran it, new rows were inserted even though I didn't delete any from my destination. I think this is because the destination is being keyed with an identity column. The rows I inserted didn't have the same ContactID, so the next time the package ran it found the same rows "missing." What would be the best way of dealing with this? The only way I could think of is to handle the inserts the same way you do the updates, so you could set IDENTITY_INSERT ON in the SQL statement.

    -Kevin
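    A minimal T-SQL sketch of the NULL-safe comparison Kevin describes, assuming the AdventureWorks Person.Contact columns used in the article; the staging table name and the substituted defaults are illustrative, not from the article:

        -- Changed-row detection that cannot evaluate to NULL: substitute a
        -- default on BOTH sides of each comparison before comparing.
        SELECT s.ContactID
        FROM stage.Contact AS s
        INNER JOIN dbo.Contact AS d
            ON d.ContactID = s.ContactID          -- match on the business key
        WHERE ISNULL(s.Title, '')      <> ISNULL(d.Title, '')
           OR ISNULL(s.MiddleName, '') <> ISNULL(d.MiddleName, '')
           OR ISNULL(s.Suffix, '')     <> ISNULL(d.Suffix, '');

    The same idea applies inside the Conditional Split expression: substitute a default on both sides (or test for NULL first) so the expression always returns a Boolean instead of NULL.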

  • cmcklw:

    I'm doing my compare based on the Business Key of the table, as opposed to something I'm generating internally (like the Surrogate Keys I'm generating with an IDENTITY column). Is this possible in your situation?

    HTH,

    Rick


    Rick Todd

  • Rick Todd (2/11/2008)


    cmcklw:

    I'm doing my compare based on the Business Key of the table, as opposed to something I'm generating internally (like the Surrogate Keys I'm generating with an IDENTITY column). Is this possible in your situation?

    HTH,

    Rick

    Yes, I think that would work, but I wanted to know the best way to deal with a table that does not have a Business Key and whose only key is an identity column.

    Thanks,

    -Kevin

  • cmcklw (2/11/2008)


    Rick Todd (2/11/2008)


    cmcklw:

    I'm doing my compare based on the Business Key of the table, as opposed to something I'm generating internally (like the Surrogate Keys I'm generating with an IDENTITY column). Is this possible in your situation?

    HTH,

    Rick

    Yes, I think that would work, but I wanted to know the best way to deal with a table that does not have a Business Key and whose only key is an identity column.

    Thanks,

    -Kevin

    I guess we'll wait for Andy or someone else to weigh in, but I can't see how you can use any of this logic if you don't have a Business Key to make sure you're matching up the right rows.


    Rick Todd

  • cmcklw -

    There has to be some element that matches what you're importing to what's already there. Meaning: there would need to be some way for you to say, on a row-by-row basis, that this row from this table matches up with THIS row in the other table. Without some basis for cross-referencing them, how could you figure that out?

    Whether it's an identity field, a name field, etc., or several fields really isn't important, but that "natural key" (interestingly enough, there's a bit of a brawl going on in another thread about "natural keys") is what you need to code into the condition. Meaning: is there or is there NOT a match? If there's a mismatch anywhere in that natural key, then that entails a NEW row versus updating an existing row with new data. (A rough sketch follows this post.)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
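    A rough T-SQL sketch of the point above - detecting NEW rows by matching on a natural/business key rather than the identity surrogate. The table and column names are illustrative only:

        -- New rows are those whose natural key has no match in the destination.
        -- EmailAddress stands in for whatever column(s) make up the natural key.
        SELECT s.*
        FROM stage.Contact AS s
        LEFT OUTER JOIN dbo.Contact AS d
            ON d.EmailAddress = s.EmailAddress
        WHERE d.EmailAddress IS NULL;   -- no match on the natural key => NEW row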

  • SSCrazy,

    Understood.

    I guess my first post was confusing. I was merely trying to point out that the example in the article didn't work when I tried it. All green doesn't always mean all good.

    -Kevin

  • Good article,

    Although I'd say this is perhaps a starting point for:

    a) people who want to achieve what the SCD wizard does without the weight of the SCD implementation, and

    b) developing your own high-performance, more functional SCD process.

    For example, you could increase performance by hashing the row contents (or maybe even batches of rows :)) at source and destination, then comparing the hashes to check for changes (a sketch of this follows the post).

    Of course, I would expect Microsoft to build on the SCD implementation quite a bit in future releases, so users should note that there is likely an easier migration path in using the out-of-the-box SCD implementation instead, if it works for you.

    Chod.
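    A sketch of the hashing idea, assuming SQL Server's HASHBYTES and a handful of illustrative columns; real code would need to consider data types, delimiters, NULLs, and the 8,000-byte input limit of HASHBYTES:

        -- Compare one hash of the tracked columns per side instead of comparing
        -- column by column. ISNULL plus a delimiter guards against NULLs and
        -- accidental concatenation collisions.
        SELECT s.ContactID
        FROM stage.Contact AS s
        INNER JOIN dbo.Contact AS d
            ON d.ContactID = s.ContactID
        WHERE HASHBYTES('SHA1',
                ISNULL(s.Title, N'') + N'|' + ISNULL(s.FirstName, N'') + N'|' +
                ISNULL(s.MiddleName, N'') + N'|' + ISNULL(s.LastName, N''))
           <> HASHBYTES('SHA1',
                ISNULL(d.Title, N'') + N'|' + ISNULL(d.FirstName, N'') + N'|' +
                ISNULL(d.MiddleName, N'') + N'|' + ISNULL(d.LastName, N''));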

  • In my experience the SCD is not very efficient once you get above a thousand rows or so.

    There's one thing I mention in this article that you can use to improve SCD-generated data flow performance: stage the updates and do a set-based update after the data flow. I find SSIS (like SQL Server) can be impacted by everything from the network layer to the shape and size of the data to the data-access-layer design of applications interfacing with the data. When I started doing SSIS consulting, I took the "rule book" approach: do this, this, and that, and SSIS will sing. I've seen too many "rules" fail miserably at other sites, so I've abandoned that approach in favor of test, measurement, and observation.
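    The staged-update pattern mentioned above looks roughly like this in T-SQL; the data flow writes changed rows to a staging table, and an Execute SQL Task runs the update afterwards (table and column names are illustrative, not taken from the article):

        -- One set-based UPDATE after the data flow, instead of an OLE DB Command
        -- issuing a separate UPDATE for every row inside the pipeline.
        UPDATE d
        SET    d.Title        = u.Title,
               d.MiddleName   = u.MiddleName,
               d.EmailAddress = u.EmailAddress
        FROM   dbo.Contact AS d
        INNER JOIN stage.Contact_Updates AS u
                ON u.ContactID = d.ContactID;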

    If you are loading from a system where you cannot join between the source and destination tables, would you use a lookup and then the conditional split, or stage all the data in SQL Server and then load into the destination using the method you show here?

    A lookup will allow you to join rows in the pipeline against any RDBMS source you can connect to. In the case of non-RDBMS data, you can sometimes gain performance by staging some (the keys) or all of the data in an RDBMS table and then performing a lookup. You can accomplish a similar lookup function with another source and a merge join. The "outer join" lookup is a neat trick though - very clean in most cases.

    I am still looking for this "ultimate" technology that will allow me to efficiently join data from two different servers without doing mostly full table scans in the remote database.

    SSIS isn't able to do everything for everyone in every situation. It is, however, very powerful at moving data around an enterprise. There is (still) no substitute for tuning with indices. I've been known to use Execute SQL Tasks to add indices before a data flow task and remove them afterwards, and also to update statistics on a table or a given index. Both techniques, though, were well beyond the scope of this article.
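    A minimal sketch (the index name and column are illustrative) of what that Execute SQL Task pair might run around the data flow:

        -- Before the data flow: build an index to support the joins/lookups.
        CREATE INDEX IX_Contact_EmailAddress
            ON dbo.Contact (EmailAddress);

        -- ... the data flow task runs here ...

        -- After the data flow: drop the index and refresh statistics.
        DROP INDEX IX_Contact_EmailAddress ON dbo.Contact;
        UPDATE STATISTICS dbo.Contact;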

    I just add a "timestamp" column to each source table. Of course, this isn't really a timestamp; it's a varbinary that holds the database row version.

    Hi Linda. I've seen this used before for larger tables. It's a good idea so long as you can guarantee the integrity of the timestamp. I've seen this get folks in trouble with multiple updates on historically-tracked data between ETL loads: you can lose all but the last state before the load.
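    A sketch of the rowversion ("timestamp") high-water-mark approach Linda describes, with illustrative table and column names; note the caveat above about intermediate changes between loads being lost:

        -- Persist the highest row version loaded so far, then pull only rows
        -- whose rowversion column is beyond that mark.
        DECLARE @LastLoadedVersion binary(8);

        SELECT @LastLoadedVersion = LastRowVersion
        FROM   etl.LoadHighWaterMark
        WHERE  TableName = N'dbo.Contact';

        SELECT c.*
        FROM   dbo.Contact AS c
        WHERE  c.RowVer > @LastLoadedVersion;   -- changed since the last load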

    The Changed Rows detection condition expression failed for me. The error was that the expression returned a null.

    Yep Kevin, if a part of the expression evaluates to null, the entire expression evaluates to null. Since the conditional split expressions must evaluate to a Boolean value, this will throw an exception. I handled the exceptions I encountered in my (brand spanking new) instance of the AdventureWorks.Person.Contact table. As I mentioned in the article, "NULL-handling in SSIS Expression Language could darn-near rate its own article!"

    Thanks for all the cool feedback and great ideas! I'll write more!

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Now that I've thought about this, and tinkered with it a bit, I definitely like it. I think what I like best about this solution as opposed to some of the other ones I've tinkered with (mainly utilizing UpdateDate columns in the source system) is that it's highly resilient. Read as, hard to screw up.

    It's quite easy to make a mistake in your logic in the UpdateDate-based solution and not know it. Of course, the obvious benefit of the UpdateDate-based solution is that you're restricting the rows right in the query against the source system, which cuts down on both sides of the equation: the window of time needed on the source system, and the processing on the ETL system, since you start with fewer rows - which is always a plus.

    The one scenario I have that I haven't found an elegant solution for is when you have data that gets conformed during the ETL process and you want to use an approach like this. You can't simply compare the Source rows with the Destination rows, because the rows get changed during the ETL process. An obvious example is Customer/Address information that's passed through an address-cleansing tool. What I'm currently doing in that scenario is to still compare Source and Destination and at least exclude the exact matches from further processing right from the start, then check again after I'm done conforming, so that I don't waste any more CPU cycles on rows that are destined for the scrap heap anyway. Has anyone come up with a more elegant solution than that?

    Cheers,

    Rick


    Rick Todd

  • Hi Rick,

    SQL Server 2008 includes some help for incremental loads in SSIS. I've blogged about Change Data Capture and SSIS in CTP5 here - it accomplishes exactly what you mentioned: it sorts out the changed rows, new rows, and deletes at the source. I am updating the post for CTP6 and it will be available on Solid Quality's website soon. (A small CDC sketch follows this post.)

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics
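    For reference, a minimal sketch of enabling Change Data Capture in SQL Server 2008 and reading the captured changes; the schema, table, and LSN handling are simplified and illustrative:

        -- Enable CDC on the database and on the source table.
        EXEC sys.sp_cdc_enable_db;

        EXEC sys.sp_cdc_enable_table
             @source_schema = N'dbo',
             @source_name   = N'Contact',
             @role_name     = NULL;

        -- Read all changes captured between two log sequence numbers.
        DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Contact');
        DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

        SELECT *
        FROM cdc.fn_cdc_get_all_changes_dbo_Contact(@from_lsn, @to_lsn, N'all');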

  • Note: The last section of the Condition expression manages NULLs. This is but one method of addressing NULLs when comparing column values in SSIS Expression Language. NULL-handling in SSIS Expression Language could darn-near rate its own article!

    Can you share what other methods you use for addressing NULLs when comparing values in SSIS Expression Language and/or T-SQL?

  • Hi Andy,

    This incremental load is easy to understand because you made it very approachable. Thanks for that!

    However, the result for CHANGED ROWS says 18460 rows, which is exactly what I got following your example. But aren't only 3994 rows actually changed, not 18460? So this incremental load is not accurate in terms of reporting the number of rows that changed, and therefore it is not optimized. I think the ISNULL handling, while necessary, also skews the count of actual row changes. As written, it updates 18460 rows out of the 19972 total rows in Andyworks.dbo.contact - more rows than necessary! Is there an updated version where you may have solved this problem?

    However, the result for CHANGED ROWS says 18460 rows, which is exactly what I got following your example. But aren't only 3994 rows actually changed, not 18460? So this incremental load is not accurate in terms of reporting the number of rows that changed, and therefore it is not optimized. I think the ISNULL handling, while necessary, also skews the count of actual row changes. As written, it updates 18460 rows out of the 19972 total rows in Andyworks.dbo.contact - more rows than necessary! Is there an updated version where you may have solved this problem?

    Same here. That's why I was asking the following:

    Can you share what other methods you use for addressing NULLs when comparing values in SSIS Expression Language and/or T-SQL?

  • Folks, you're right. Thanks for pointing this out and apologies for not doing a better job proofing my code before publication. I will get an updated version out soon and post a link here.

    Andy Leonard

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

    I am afraid this is not a practical approach for environments where the target database is large. A left outer join against 50 million+ rows for every import? That simply won't fly.

    Admittedly, there is no "trivial" solution to this problem; however, I'd like to see some discussion of a solution that would be practical for a target database of more than a few million rows.

    For example, use of a "history" table of previously loaded keys, "intelligent" use of a timestamp, or some other time-related "delta" calculation, etc.
