Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes


Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes

Author
Message
Brett Flippin
Brett Flippin
SSC Journeyman
SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)

Group: General Forum Members
Points: 79 Visits: 164
Comments posted to this topic are about the item Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes
rcooke
rcooke
Forum Newbie
Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)Forum Newbie (4 reputation)

Group: General Forum Members
Points: 4 Visits: 17
i like the hash approach - to save time you could just use the konesans checksum algorithm add in to calculate the crc value. Combined with the conditional split it works a treat.
popnadrian
popnadrian
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 18
I implement something like this and it increased dramatically the SSIS performance. Fortunately i have in the input a hash column already computed, from sources
hennie7863
hennie7863
Valued Member
Valued Member (57 reputation)Valued Member (57 reputation)Valued Member (57 reputation)Valued Member (57 reputation)Valued Member (57 reputation)Valued Member (57 reputation)Valued Member (57 reputation)Valued Member (57 reputation)

Group: General Forum Members
Points: 57 Visits: 3421
Great! I haven't read this but it looks great. Will read this and place some comments (when i have them).
Koen Verbeeck
Koen Verbeeck
SSCoach
SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)

Group: General Forum Members
Points: 16455 Visits: 13207
Nice article. Although I would have used a temp/staging table to store my updates, instead of using the OLE DB Command. That way you can use a set based update instead of an update per row.



How to post forum questions.
Need an answer? No, you need a question.
What’s the deal with Excel & SSIS?

Member of LinkedIn. My blog at SQLKover.

MCSA SQL Server 2012 - MCSE Business Intelligence
Adam Aspin
Adam Aspin
SSC-Enthusiastic
SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)

Group: General Forum Members
Points: 183 Visits: 1028
Really interesting and well-written article, thank you.
As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which appraoch gave better performance?
blakmk
blakmk
SSC-Enthusiastic
SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)SSC-Enthusiastic (142 reputation)

Group: General Forum Members
Points: 142 Visits: 337
First class article, do you have any benchmarks about how it improved performance?

Sql Server Blog
Sql Server Consultancy

Brett Flippin
Brett Flippin
SSC Journeyman
SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)

Group: General Forum Members
Points: 79 Visits: 164
rcooke (6/7/2010)
i like the hash approach - to save time you could just use the konesans checksum algorithm add in to calculate the crc value. Combined with the conditional split it works a treat.


Konesans Checksum is very prone to collisions, so it can introduce errors into your ETL process. Using MD5 hashes the chance for hash collisions is much much lower.
Brett Flippin
Brett Flippin
SSC Journeyman
SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)

Group: General Forum Members
Points: 79 Visits: 164
da-zero (6/8/2010)
Nice article. Although I would have used a temp/staging table to store my updates, instead of using the OLE DB Command. That way you can use a set based update instead of an update per row.


This is the way I normally do it as well. I wanted my example to focus more on the hashing this time around.

I heartily endorse doing anything you can do avoid using the OLEDB Command or SCD Task for updates.
Brett Flippin
Brett Flippin
SSC Journeyman
SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)

Group: General Forum Members
Points: 79 Visits: 164
Adam Aspin (6/8/2010)
Really interesting and well-written article, thank you.
As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which appraoch gave better performance?


There shouldn't be much difference in doing it this way, however I try to avoid doing string manipulation in SQL whenever possible as that can be a major drag on performance. That was the chief reason in deciding to do it in .Net.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search