Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Great article, thanks! I can see where it might be able to speed things up, but only if you are constrained on your writes. In my current ETL process I am constrained reading the data, as it comes over a WAN link, so all this would end up doing is complicating things and adding more CPU overhead. Now if I could store the hash in the source DB and only pull changed records over the WAN, that could make a huge difference.

    You actually end up doing more reads, so if the percentage of changes is very high you could end up performing worse. (Or if you are CPU bound, the hash calculation could slow you down.)

    The other thing I would like to see it deal with is deletes: if records are deleted from the source, they would stay in your destination...

  • Great article, Brett. I have a half completed one in my pipe similar to this - but directed solely at how to use a hash to speed up my SCD component - because I think it should!

    For those that are interested in doing this, but aren't comfortable with coding, you can use the Konesans Checksum, or you can use the Multiple Hash component on CodePlex that offers several different hashing methods - see the SSIS Community Tasks and Components site for those extensions.

    Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
    Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
    Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.

  • UMG Developer (6/8/2010)


    Great article, thanks! I can see where it might be able to speed things up, but only if you are constrained on your writes. In my current ETL process I am constrained reading the data, as it comes over a WAN link, so all this would end up doing is complicating things and adding more CPU overhead. Now if I could store the hash in the source DB and only pull changed records over the WAN, that could make a huge difference.

    You actually end up doing more reads, so if the percentage of changes is very high you could end up performing worse. (Or if you are CPU bound, the hash calculation could slow you down.)

    The other thing I would like to see it deal with is deletes: if records are deleted from the source, they would stay in your destination...

    If you were able to have your source system calculate the hashes for each record in advance you could get by very easily by not having to do the hash calculation in the SSIS package and just query IDs and hashes all day long to do comparisons with. This method was devised with a source system in mind that didn't do that. I've also never encountered an issue with being CPU bound on hash calculations even with millions of records.

    Deletes would be handled separately in another data flow and are out of scope for this particular one, as they would require a Lookup going in the reverse direction (destination to source) to find the IDs that no longer exist and then piping those to a staging table or OLE DB Command where you could run the deletes.
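
    Just to make the shape of that concrete, the comparison and the delete pass could look something like the queries below. This is only a rough sketch: stg.SourceKeys, dbo.DimCustomer, and the column names are made-up examples, and it assumes you have already staged the source IDs (and, if the source can supply them, its precomputed hashes) in the destination database.

    -- Rows whose source-supplied hash differs from the stored hash need an UPDATE
    SELECT s.CustomerID
    FROM stg.SourceKeys AS s
    INNER JOIN dbo.DimCustomer AS d
        ON d.CustomerID = s.CustomerID
    WHERE s.SourceHash <> d.RecordHash;

    -- Separate delete pass: remove destination rows whose key no longer exists in the source
    DELETE d
    FROM dbo.DimCustomer AS d
    WHERE NOT EXISTS (SELECT 1 FROM stg.SourceKeys AS s WHERE s.CustomerID = d.CustomerID);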

  • Brett Flippin (6/8/2010)

    If you were able to have your source system calculate the hashes for each record in advance you could get by very easily by not having to do the hash calculation in the SSIS package and just query IDs and hashes all day long to do comparisons with. This method was devised with a source system in mind that didn't do that. I've also never encountered an issue with being CPU bound on hash calculations even with millions of records.

    That could be useful if I can get the hash calculation the same on both sides, though it would probably be a lot of work. Again, if the percentage of changes is really high it could end up pulling more data over the WAN than a full truncate/re-load. I think the way to do it, if you can, would be to store the last synced hash value in the source system, so that you can pull only the changed and new records over the WAN link. I wouldn't even need the hash value on the destination side. (I think this has some real possibilities - see the rough sketch at the end of this post.) Thanks for the question that made me think more about this!

    Brett Flippin (6/8/2010)

    I've also never encountered an issue with being CPU bound on hash calculations even with millions of records.

    The hash calculations might not cause a CPU problem on their own, but the server may well be doing something else at the same time that uses most, if not all, of the available CPU.
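
    To sketch out the store-the-hash-in-the-source idea (purely hypothetical table and column names, and it assumes the source happens to be SQL Server so that HASHBYTES is available there):

    -- LastSyncedHash holds the MD5 hash of each row as of the last successful sync.
    -- Only new rows or rows whose current hash differs get pulled over the WAN.
    SELECT c.CustomerID, c.FirstName, c.LastName, c.Email
    FROM dbo.Customer AS c
    WHERE c.LastSyncedHash IS NULL
       OR c.LastSyncedHash <> HASHBYTES('MD5',
            ISNULL(c.FirstName, N'') + N'|' + ISNULL(c.LastName, N'') + N'|' + ISNULL(c.Email, N''));

    -- Once the destination confirms the load, stamp the rows with the new hash.
    UPDATE c
    SET LastSyncedHash = HASHBYTES('MD5',
            ISNULL(c.FirstName, N'') + N'|' + ISNULL(c.LastName, N'') + N'|' + ISNULL(c.Email, N''))
    FROM dbo.Customer AS c;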

  • Todd McDermid (6/8/2010)


    Great article, Brett. I have a half completed one in my pipe similar to this - but directed solely at how to use a hash to speed up my SCD component - because I think it should!

    For those that are interested in doing this, but aren't comfortable with coding, you can use the Konesans Checksum, or you can use the Multiple Hash component on CodePlex that offers several different hashing methods - see the SSIS Community Tasks and Components site for those extensions.

    Thanks Todd. Multiple Hash is indeed a great tool as well.

  • I'm afraid I'm not familiar with the tools used in the article, but at first glance it appears the following two general "hash to find differences" cautions are appropriate:

    First, when concatenating fields into a string and hashing that string, you must either convert each field to a fixed-length string at or beyond its maximum possible length, or use a delimiter that you guarantee (and enforce) never appears in your data. Otherwise the following rows produce the same concatenated string, and thus the same hash (see the sketch at the end of this post):

    Field1   Field2
    'ab'     'cd'
    'a'      'bcd'
    'abc'    'd'

    Addendum: make sure your tools represent NULL differently from any other value.

    Second, when the hashes are different, you can assume the data is different. When the hashes are the same, you have to either:

    A) Ignore collisions when they happen (the "oh, it won't happen to me!" approach), and accept that every once in a while you're going to fail to notice a data change.

    B) Test field by field (or the entire concatenated string, if you built it correctly) to see whether or not the data is actually different.

    Regarding the statistical "it's a very low chance, see my math" arguments that crop up around collisions: a non-zero chance must be properly accounted for, either by stating that failing to notice some data changes is acceptable (and hopefully instituting a periodic check to see how many there have been, perhaps an annual or semiannual reconciliation), or by continuing the detection process until, assuming your code works properly, the chance is zero (we'll ignore platform bugs for now, since those can corrupt your data regardless).
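
    A quick illustration of the first point, using made-up values (the delimiter and the NULL marker are arbitrary choices; the key is that neither can ever occur in the real data):

    DECLARE @Field1 varchar(10), @Field2 varchar(10);
    SELECT @Field1 = 'a', @Field2 = 'bcd';

    -- Unsafe: 'a' + 'bcd' = 'abcd' = 'ab' + 'cd', so different rows hash identically by construction
    SELECT HASHBYTES('MD5', @Field1 + @Field2) AS UnsafeHash;

    -- Safer: delimit with a character you enforce never appears in the data,
    -- and replace NULL with a marker distinct from any legitimate value (including the empty string)
    SELECT HASHBYTES('MD5',
           ISNULL(@Field1, CHAR(0)) + '|' + ISNULL(@Field2, CHAR(0))) AS SaferHash;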

  • You have some good points Nadrek, thanks for bringing them up!

  • Very nice article.

    We could also use the MERGE statement in SQL Server 2008 to achieve this. Any thoughts on that?

  • Neeraj Gupta-485504 (6/8/2010)


    Very nice article.

    We could also use the MERGE statement in SQL Server 2008 to achieve this. Any thoughts on that?

    I think you are correct: if you have direct access to both source and destination as tables, you could use the MERGE statement along with the BINARY_CHECKSUM function to handle the entire process of INSERTing, UPDATEing, and DELETEing.
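
    Roughly along these lines (made-up table and column names, and keeping in mind that BINARY_CHECKSUM has its own collision and data-type caveats):

    MERGE dbo.DimCustomer AS d
    USING dbo.SourceCustomer AS s
        ON d.CustomerID = s.CustomerID
    WHEN MATCHED AND BINARY_CHECKSUM(d.FirstName, d.LastName, d.Email)
                  <> BINARY_CHECKSUM(s.FirstName, s.LastName, s.Email) THEN
        UPDATE SET d.FirstName = s.FirstName,
                   d.LastName  = s.LastName,
                   d.Email     = s.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, FirstName, LastName, Email)
        VALUES (s.CustomerID, s.FirstName, s.LastName, s.Email)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;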

  • Could you please elaborate on the use of the BINARY_CHECKSUM function?

    Thank you

  • An untested idea for minimizing hash collisions could be having two hash columns: one hashing the columns in ascending order and one in descending order. If you have a hash collision in ascending order, the chance of also having one in descending order for the same rows should be nearly impossible, unless the values are the same in all columns.
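
    As a completely untested sketch of what I mean (hypothetical columns, using HASHBYTES just for illustration):

    -- Hash the same columns concatenated in ascending and then in descending order;
    -- a change that collides under one ordering is vanishingly unlikely to collide under both.
    SELECT CustomerID,
           HASHBYTES('MD5', ISNULL(FirstName, '') + '|' + ISNULL(LastName, '') + '|' + ISNULL(Email, '')) AS HashAsc,
           HASHBYTES('MD5', ISNULL(Email, '') + '|' + ISNULL(LastName, '') + '|' + ISNULL(FirstName, '')) AS HashDesc
    FROM dbo.SourceCustomer;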

  • Great article, and I also use the Multiple Hash component that Todd mentioned (http://ssismhash.codeplex.com/), but I like the script component approach. This would be great for a simple Type 1 SCD, but it would obviously need to be expanded upon for a Type 2, and Todd has a great component for that as well (http://kimballscd.codeplex.com/). If you were doing a Type 1 you could typically leverage some sort of last update field, or possibly change data capture, and then simply update the changed records and bypass the use of a hash entirely.
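
    For example, a simple Type 1 pull that trusts a last-modified column could filter on a stored watermark like this (LastUpdatedDate and etl.LoadWatermark are hypothetical names, not anything from the article):

    -- Pull only rows touched since the last successful load
    DECLARE @LastLoad datetime;
    SELECT @LastLoad = LastSuccessfulLoad FROM etl.LoadWatermark WHERE TableName = 'Customer';

    SELECT CustomerID, FirstName, LastName, Email
    FROM dbo.SourceCustomer
    WHERE LastUpdatedDate > @LastLoad;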

    ----------------------------------------------------------------------------------------
    Dan English - http://denglishbi.wordpress.com

  • Interesting read with relation to this topic:

    Let's Hash a BLOB by Michael Coles

    Perhaps more applicable to hashing techniques outside of ETL loading, or to those paranoid about hash collisions, but still interesting stuff. Mr. Coles also includes some .NET code for using stronger hashes than the MD5 and SHA1 covered thus far.

  • Neeraj Gupta-485504 (6/8/2010)


    Could you please elaborate on the use of the BINARY_CHECKSUM function?

    Here is a link to BOL for it: BINARY_CHECKSUM

    Essentially you could do something like:

    UPDATE A
    SET LastName = B.LastName
    FROM Table1 A
    INNER JOIN Table2 B
        ON A.PK = B.PK
    WHERE (SELECT BINARY_CHECKSUM(*) FROM Table1 WHERE PK = A.PK)
       <> (SELECT BINARY_CHECKSUM(*) FROM Table2 WHERE PK = A.PK);

    That probably isn't the best way to do it; you could get rid of the sub-queries by specifying column names, like:

    UPDATE A
    SET LastName = B.LastName
    FROM Table1 A
    INNER JOIN Table2 B
        ON A.PK = B.PK
    WHERE BINARY_CHECKSUM(A.LastName) <> BINARY_CHECKSUM(B.LastName);

    You could use the same idea with a MERGE statement to handle the whole load process.
