Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Brett Flippin

    Right there with Babe

    Points: 717

Comments posted to this topic are about the item Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • rcooke

    Valued Member

    Points: 54

I like the hash approach - to save time you could just use the Konesans Checksum algorithm add-in to calculate the CRC value. Combined with the conditional split it works a treat.

  • popnadrian

    SSC Journeyman

    Points: 84

I implemented something like this and it dramatically increased SSIS performance. Fortunately, I already have a hash column in the input, computed at the source.

  • hennie7863

    SSCommitted

    Points: 1653

Great! I haven't read this yet, but it looks great. I will read it and post some comments (when I have them).

  • Koen Verbeeck

    SSC Guru

    Points: 258965

Nice article. Although I would have used a temp/staging table to store my updates instead of using the OLE DB Command. That way you can use a set-based update instead of an update per row.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP
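For readers who want to try the staging-table approach described above, a minimal T-SQL sketch might look like the following. The table and column names (dbo.DimContact, dbo.StgContact, RowHash) are hypothetical, and it assumes the data flow has already landed the changed rows in the staging table:

```sql
-- Sketch only: the data flow writes candidate rows to a staging table,
-- then one set-based UPDATE replaces the per-row OLE DB Command.
-- dbo.DimContact, dbo.StgContact, and RowHash are hypothetical names.
UPDATE d
SET    d.FirstName = s.FirstName,
       d.LastName  = s.LastName,
       d.RowHash   = s.RowHash
FROM   dbo.DimContact AS d
       INNER JOIN dbo.StgContact AS s
               ON s.ContactID = d.ContactID
WHERE  d.RowHash <> s.RowHash;   -- only touch rows whose hash changed

TRUNCATE TABLE dbo.StgContact;   -- clear staging for the next run
```

Because the join and the hash comparison happen in one statement, SQL Server can update all changed rows in a single set-based operation instead of issuing one UPDATE per row from the pipeline.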

  • Adam Aspin

    SSCarpal Tunnel

    Points: 4845

    Really interesting and well-written article, thank you.

    As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course, you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which approach gave better performance?

  • blakmk

    Ten Centuries

    Points: 1160

    First class article, do you have any benchmarks about how it improved performance?

  • Brett Flippin

    Right there with Babe

    Points: 717

    rcooke (6/7/2010)


I like the hash approach - to save time you could just use the Konesans Checksum algorithm add-in to calculate the CRC value. Combined with the conditional split it works a treat.

Konesans Checksum is very prone to collisions, so it can introduce errors into your ETL process. With MD5 hashes the chance of a collision is much, much lower.

  • Brett Flippin

    Right there with Babe

    Points: 717

    da-zero (6/8/2010)


Nice article. Although I would have used a temp/staging table to store my updates instead of using the OLE DB Command. That way you can use a set-based update instead of an update per row.

    This is the way I normally do it as well. I wanted my example to focus more on the hashing this time around.

I heartily endorse doing anything you can to avoid using the OLE DB Command or SCD Task for updates.

  • Brett Flippin

    Right there with Babe

    Points: 717

    Adam Aspin (6/8/2010)


    Really interesting and well-written article, thank you.

    As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course, you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which approach gave better performance?

There shouldn't be much difference in doing it this way; however, I try to avoid string manipulation in SQL whenever possible, as it can be a major drag on performance. That was the chief reason for deciding to do it in .NET.

  • Brett Flippin

    Right there with Babe

    Points: 717

    blakmk (6/8/2010)


    First class article, do you have any benchmarks about how it improved performance?

That was the one thing I knew I was missing going forward with this article. Unfortunately, I don't have any concrete before-and-after times anymore. I will do what I can to get some other results and post them to the thread, though I'm not promising anything for sure.

  • Wes W.

    SSC Journeyman

    Points: 84

Better yet, I never even use MD5, because the chance of a collision is much higher than with SHA-1. I'd highly recommend not using MD5 for large databases and using only SHA-1, at least until SQL Server supports a stronger hash.

It's fairly easy to calculate the risk of a collision, and I recommend anyone attempting this approach make their own calculation, so at least they understand the risks of introducing inaccurate data.

Other than that, the hashing technique works great.
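The collision-risk calculation mentioned above can be approximated with the birthday bound, p ≈ n² / 2^(b+1), where n is the number of hashed rows and b is the hash width in bits. As a rough sketch in T-SQL (the one-billion row count is an arbitrary example figure, not from the article):

```sql
-- Birthday-bound approximation of collision probability: p ~ n^2 / 2^(b+1).
-- @n = 1 billion rows is an arbitrary example.
DECLARE @n FLOAT = 1e9;
SELECT @n * @n / POWER(CAST(2 AS FLOAT), 129) AS ApproxMD5CollisionProb,   -- 128-bit MD5
       @n * @n / POWER(CAST(2 AS FLOAT), 161) AS ApproxSHA1CollisionProb;  -- 160-bit SHA-1
```

For a billion rows this comes out to roughly 1.5 × 10⁻²¹ for MD5, which puts the earlier MD5-versus-SHA-1 debate in perspective: either hash makes an accidental collision vanishingly unlikely at typical warehouse scales, and SHA-1's wider output only pushes the probability lower.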

  • belldb

    SSC Veteran

    Points: 255

I see what you're trying to point out, but keep in mind that the basis for checking individual records is 1:1, because the lookup is joined on the primary key; therefore, you're not checking against the entire dataset, which is what could create the behavior you describe.

I am interested in seeing which specific changes to a record could result in generating an identical MD5 checksum.

  • Wes W.

    SSC Journeyman

    Points: 84

    belldb (6/8/2010)


I see what you're trying to point out, but keep in mind that the basis for checking individual records is 1:1, because the lookup is joined on the primary key; therefore, you're not checking against the entire dataset, which is what could create the behavior you describe.

I am interested in seeing which specific changes to a record could result in generating an identical MD5 checksum.

Cool. I don't have the sample database loaded, so I didn't know for certain that you're joining on the primary key column, since ContactID doesn't have a primary key defined in the code sample. In light of this new information, the 1:1 comparison should make MD5 just fine.

About the only other thing I have to add is that most hashes output a fixed-length Latin value (CHAR(40) for SHA-1), so I'd probably use CHAR instead of VARCHAR, NCHAR, or NVARCHAR for storing the hash value - this post is about maximizing speed, after all.

Even with 1:1 comparisons, the wider your input to the hash becomes, the greater the probability of a hash collision. Although I don't have a reference for the speed difference in SQL Server between MD5 and SHA-1 hashing, SHA-1 is plenty fast, especially since it's safer. We should all try to be as close to ACID as possible, so I'll simply never use MD5 anymore.

  • Brett Flippin

    Right there with Babe

    Points: 717

Great discussion so far. I wanted to chime back in and say that I agree with the recommendations for using SHA-1, but by using the ID plus the hash to identify records, MD5 works very well and has little to no chance of producing a collision.

Either way works fine, and my code should be very easy to modify to produce a SHA-1 hash as well.

Here's a revised CreateHash function that should work for SHA-1 hashes (I haven't tested it; I'm just going off memory).

Imports System.Security.Cryptography
Imports System.Text

Public Shared Function CreateHash(ByVal data As String) As String
    ' Encode the input string as UTF-16 bytes.
    Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
    ' Compute the SHA-1 hash of the encoded bytes.
    Dim sha As New SHA1CryptoServiceProvider()
    Dim hashedData As Byte() = sha.ComputeHash(dataToHash)
    ' Return the hash as a Base64 string.
    Return Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
End Function
