SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes


Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes

Author
Message
Brett Flippin
Brett Flippin
Old Hand
Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)

Group: General Forum Members
Points: 325 Visits: 164
Comments posted to this topic are about the item Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes
rcooke
rcooke
Grasshopper
Grasshopper (24 reputation)Grasshopper (24 reputation)Grasshopper (24 reputation)Grasshopper (24 reputation)Grasshopper (24 reputation)Grasshopper (24 reputation)Grasshopper (24 reputation)Grasshopper (24 reputation)

Group: General Forum Members
Points: 24 Visits: 17
i like the hash approach - to save time you could just use the konesans checksum algorithm add in to calculate the crc value. Combined with the conditional split it works a treat.
popnadrian
popnadrian
SSC Rookie
SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)

Group: General Forum Members
Points: 40 Visits: 18
I implement something like this and it increased dramatically the SSIS performance. Fortunately i have in the input a hash column already computed, from sources
hennie7863
hennie7863
Right there with Babe
Right there with Babe (791 reputation)Right there with Babe (791 reputation)Right there with Babe (791 reputation)Right there with Babe (791 reputation)Right there with Babe (791 reputation)Right there with Babe (791 reputation)Right there with Babe (791 reputation)Right there with Babe (791 reputation)

Group: General Forum Members
Points: 791 Visits: 3421
Great! I haven't read this but it looks great. Will read this and place some comments (when i have them).
Koen Verbeeck
Koen Verbeeck
SSC Guru
SSC Guru (125K reputation)SSC Guru (125K reputation)SSC Guru (125K reputation)SSC Guru (125K reputation)SSC Guru (125K reputation)SSC Guru (125K reputation)SSC Guru (125K reputation)SSC Guru (125K reputation)

Group: General Forum Members
Points: 125081 Visits: 13344
Nice article. Although I would have used a temp/staging table to store my updates, instead of using the OLE DB Command. That way you can use a set based update instead of an update per row.


How to post forum questions.
Need an answer? No, you need a question.
What’s the deal with Excel & SSIS?
My blog at SQLKover.

MCSE Business Intelligence - Microsoft Data Platform MVP
Adam Aspin
Adam Aspin
SSCrazy
SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)SSCrazy (2.2K reputation)

Group: General Forum Members
Points: 2241 Visits: 1032
Really interesting and well-written article, thank you.
As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which appraoch gave better performance?
blakmk
blakmk
Mr or Mrs. 500
Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)Mr or Mrs. 500 (588 reputation)

Group: General Forum Members
Points: 588 Visits: 337
First class article, do you have any benchmarks about how it improved performance?

Sql Server Blog
Sql Server Consultancy

Brett Flippin
Brett Flippin
Old Hand
Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)

Group: General Forum Members
Points: 325 Visits: 164
rcooke (6/7/2010)
i like the hash approach - to save time you could just use the konesans checksum algorithm add in to calculate the crc value. Combined with the conditional split it works a treat.


Konesans Checksum is very prone to collisions, so it can introduce errors into your ETL process. Using MD5 hashes the chance for hash collisions is much much lower.
Brett Flippin
Brett Flippin
Old Hand
Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)

Group: General Forum Members
Points: 325 Visits: 164
da-zero (6/8/2010)
Nice article. Although I would have used a temp/staging table to store my updates, instead of using the OLE DB Command. That way you can use a set based update instead of an update per row.


This is the way I normally do it as well. I wanted my example to focus more on the hashing this time around.

I heartily endorse doing anything you can do avoid using the OLEDB Command or SCD Task for updates.
Brett Flippin
Brett Flippin
Old Hand
Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)Old Hand (325 reputation)

Group: General Forum Members
Points: 325 Visits: 164
Adam Aspin (6/8/2010)
Really interesting and well-written article, thank you.
As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which appraoch gave better performance?


There shouldn't be much difference in doing it this way, however I try to avoid doing string manipulation in SQL whenever possible as that can be a major drag on performance. That was the chief reason in deciding to do it in .Net.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum







































































































































































SQLServerCentral


Search