Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Brett Flippin

    Right there with Babe

    Points: 717

Comments posted to this topic are about the item Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • rcooke

    Valued Member

    Points: 54

I like the hash approach - to save time you could just use the Konesans Checksum algorithm add-in to calculate the CRC value. Combined with the conditional split it works a treat.

  • popnadrian

    SSC Journeyman

    Points: 84

I implemented something like this and it dramatically increased SSIS performance. Fortunately, I already have a hash column in the input, computed at the source.

  • hennie7863

    SSCommitted

    Points: 1653

Great! I haven't read this yet, but it looks great. I will read it and post some comments (when I have them).

  • Koen Verbeeck

    SSC Guru

    Points: 258965

Nice article. Although I would have used a temp/staging table to store my updates instead of using the OLE DB Command. That way you can use a set-based update instead of an update per row.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP
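For readers who want to try the staging-table approach described above, a minimal T-SQL sketch might look like the following. The table and column names (dbo.DimContact, dbo.StgContact, RowHash) are hypothetical, and it assumes the data flow has already landed the changed rows in the staging table:

```sql
-- Sketch only: the data flow writes candidate rows to a staging table,
-- then one set-based UPDATE replaces the per-row OLE DB Command.
-- dbo.DimContact, dbo.StgContact, and RowHash are hypothetical names.
UPDATE d
SET    d.FirstName = s.FirstName,
       d.LastName  = s.LastName,
       d.RowHash   = s.RowHash
FROM   dbo.DimContact AS d
       INNER JOIN dbo.StgContact AS s
               ON s.ContactID = d.ContactID
WHERE  d.RowHash <> s.RowHash;   -- only touch rows whose hash changed

TRUNCATE TABLE dbo.StgContact;   -- clear staging for the next run
```

Because the join and the hash comparison happen in one statement, SQL Server can update all changed rows in a single set-based operation instead of issuing one UPDATE per row from the pipeline.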

  • Adam Aspin

    SSCarpal Tunnel

    Points: 4845

    Really interesting and well-written article, thank you.

    As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course, you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which approach gave better performance?

  • blakmk

    Ten Centuries

    Points: 1160

    First class article, do you have any benchmarks about how it improved performance?

  • Brett Flippin

    Right there with Babe

    Points: 717

    rcooke (6/7/2010)


I like the hash approach - to save time you could just use the Konesans Checksum algorithm add-in to calculate the CRC value. Combined with the conditional split it works a treat.

Konesans Checksum is very prone to collisions, so it can introduce errors into your ETL process. With MD5 hashes the chance of a collision is much, much lower.

  • Brett Flippin

    Right there with Babe

    Points: 717

    da-zero (6/8/2010)


Nice article. Although I would have used a temp/staging table to store my updates instead of using the OLE DB Command. That way you can use a set-based update instead of an update per row.

    This is the way I normally do it as well. I wanted my example to focus more on the hashing this time around.

I heartily endorse doing anything you can to avoid using the OLE DB Command or SCD Task for updates.

  • Brett Flippin

    Right there with Babe

    Points: 717

    Adam Aspin (6/8/2010)


    Really interesting and well-written article, thank you.

    As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course, you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which approach gave better performance?

There shouldn't be much difference in doing it this way; however, I try to avoid string manipulation in SQL whenever possible, as it can be a major drag on performance. That was the chief reason for deciding to do it in .NET.

  • Brett Flippin

    Right there with Babe

    Points: 717

    blakmk (6/8/2010)


    First class article, do you have any benchmarks about how it improved performance?

That was the one thing I knew I was missing going forward with this article. Unfortunately, I don't have any concrete before-and-after times anymore. I will do what I can to get some other results and post them to the thread, though I'm not promising anything for sure.

  • Wes W.

    SSC Journeyman

    Points: 84

Better yet, I never even use MD5, because the chance of a collision is much higher than with SHA-1. I'd highly recommend not using MD5 for large databases and using only SHA-1, at least until SQL Server supports a stronger hash.

It's fairly easy to calculate the risk of a collision, and I recommend anyone attempting this approach make their own calculation, so at least they understand the risks of introducing inaccurate data.

Other than that, the hashing technique works great.
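The collision-risk calculation mentioned above can be approximated with the birthday bound, p ≈ n² / 2^(b+1), where n is the number of hashed rows and b is the hash width in bits. As a rough sketch in T-SQL (the one-billion row count is an arbitrary example figure, not from the article):

```sql
-- Birthday-bound approximation of collision probability: p ~ n^2 / 2^(b+1).
-- @n = 1 billion rows is an arbitrary example.
DECLARE @n FLOAT = 1e9;
SELECT @n * @n / POWER(CAST(2 AS FLOAT), 129) AS ApproxMD5CollisionProb,   -- 128-bit MD5
       @n * @n / POWER(CAST(2 AS FLOAT), 161) AS ApproxSHA1CollisionProb;  -- 160-bit SHA-1
```

For a billion rows this comes out to roughly 1.5 × 10⁻²¹ for MD5, which puts the earlier MD5-versus-SHA-1 debate in perspective: either hash makes an accidental collision vanishingly unlikely at typical warehouse scales, and SHA-1's wider output only pushes the probability lower.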

  • belldb

    SSC Veteran

    Points: 255

I see what you're trying to point out, but keep in mind that the basis for checking individual records is 1:1, because the lookup is joined on the primary key; therefore, you're not checking against the entire dataset, which is what could create the behavior you describe.

I am interested in seeing which specific changes to a record could result in generating an identical MD5 checksum.

  • Wes W.

    SSC Journeyman

    Points: 84

    belldb (6/8/2010)


I see what you're trying to point out, but keep in mind that the basis for checking individual records is 1:1, because the lookup is joined on the primary key; therefore, you're not checking against the entire dataset, which is what could create the behavior you describe.

I am interested in seeing which specific changes to a record could result in generating an identical MD5 checksum.

Cool. I don't have the sample database loaded, so I didn't know for certain that you're joining on the primary key column, since ContactID doesn't have a primary key defined in the code sample. In light of this new information, the 1:1 comparison should make MD5 just fine.

About the only other thing I have to add is that most hashes output a fixed-length Latin value (CHAR(40) for SHA-1), so I'd probably use CHAR instead of VARCHAR, NCHAR, or NVARCHAR for storing the hash value - this post is about maximizing speed, after all.

Even with 1:1 comparisons, the wider your input to the hash becomes, the greater the probability of a hash collision. Although I don't have a reference for the speed difference in SQL Server between MD5 and SHA-1 hashing, SHA-1 is plenty fast, especially since it's safer. We should all try to be as close to ACID as possible, so I'll simply never use MD5 anymore.

  • Brett Flippin

    Right there with Babe

    Points: 717

Great discussion so far. I wanted to chime back in and say that I agree with the recommendations for using SHA-1, but by using the ID plus the hash to identify records, MD5 works very well and has little to no chance of producing a collision.

Either way works fine, and my code should be very easy to modify to produce a SHA-1 hash as well.

Here's a revised CreateHash function that should work for SHA-1 hashes (I haven't tested it; I'm just going off memory).

Imports System.Security.Cryptography
Imports System.Text

Public Shared Function CreateHash(ByVal data As String) As String
    ' Encode the input string as UTF-16 bytes.
    Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
    ' Compute the SHA-1 hash of the encoded bytes.
    Dim sha As New SHA1CryptoServiceProvider()
    Dim hashedData As Byte() = sha.ComputeHash(dataToHash)
    ' Return the hash as a Base64 string.
    Return Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
End Function
