Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12345»»»

Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes Expand / Collapse
Author
Message
Posted Monday, June 7, 2010 10:02 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, May 10, 2013 4:31 PM
Points: 49, Visits: 135
Comments posted to this topic are about the item Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes
Post #933776
Posted Monday, June 7, 2010 11:05 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Sunday, August 21, 2011 7:43 PM
Points: 2, Visits: 17
i like the hash approach - to save time you could just use the konesans checksum algorithm add in to calculate the crc value. Combined with the conditional split it works a treat.
Post #933792
Posted Tuesday, June 8, 2010 12:00 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Tuesday, May 27, 2014 9:25 AM
Points: 10, Visits: 18
I implement something like this and it increased dramatically the SSIS performance. Fortunately i have in the input a hash column already computed, from sources
Post #933803
Posted Tuesday, June 8, 2010 2:37 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, October 21, 2014 1:17 AM
Points: 46, Visits: 3,353
Great! I haven't read this but it looks great. Will read this and place some comments (when i have them).
Post #933848
Posted Tuesday, June 8, 2010 3:09 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 7:40 AM
Points: 13,294, Visits: 11,085
Nice article. Although I would have used a temp/staging table to store my updates, instead of using the OLE DB Command. That way you can use a set based update instead of an update per row.



How to post forum questions.
Need an answer? No, you need a question.
What’s the deal with Excel & SSIS?

Member of LinkedIn. My blog at LessThanDot.

MCSA SQL Server 2012 - MCSE Business Intelligence
Post #933856
Posted Tuesday, June 8, 2010 3:12 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Friday, October 10, 2014 12:58 AM
Points: 81, Visits: 904
Really interesting and well-written article, thank you.
As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which appraoch gave better performance?
Post #933857
Posted Tuesday, June 8, 2010 4:23 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, July 21, 2014 5:54 AM
Points: 136, Visits: 337
First class article, do you have any benchmarks about how it improved performance?


Sql Server Blog
Sql Server Consultancy
Post #933880
Posted Tuesday, June 8, 2010 6:05 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, May 10, 2013 4:31 PM
Points: 49, Visits: 135
rcooke (6/7/2010)
i like the hash approach - to save time you could just use the konesans checksum algorithm add in to calculate the crc value. Combined with the conditional split it works a treat.


Konesans Checksum is very prone to collisions, so it can introduce errors into your ETL process. Using MD5 hashes the chance for hash collisions is much much lower.
Post #933928
Posted Tuesday, June 8, 2010 6:07 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, May 10, 2013 4:31 PM
Points: 49, Visits: 135
da-zero (6/8/2010)
Nice article. Although I would have used a temp/staging table to store my updates, instead of using the OLE DB Command. That way you can use a set based update instead of an update per row.


This is the way I normally do it as well. I wanted my example to focus more on the hashing this time around.

I heartily endorse doing anything you can do avoid using the OLEDB Command or SCD Task for updates.
Post #933932
Posted Tuesday, June 8, 2010 6:09 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, May 10, 2013 4:31 PM
Points: 49, Visits: 135
Adam Aspin (6/8/2010)
Really interesting and well-written article, thank you.
As a matter of interest, is it any faster (or in any other way better or worse) to create the hash value as part of the T-SQL Select statement, using code based on something like this:

SELECT HashBytes('MD5', FirstName + LastName) FROM [AdventureWorks].[Person].[Contact]

Of course you will have to use all the fields that you will be hashing on, and cast all values to (n)varchar, but I wondered if you had tried this, and if so, which appraoch gave better performance?


There shouldn't be much difference in doing it this way, however I try to avoid doing string manipulation in SQL whenever possible as that can be a major drag on performance. That was the chief reason in deciding to do it in .Net.
Post #933934
« Prev Topic | Next Topic »

Add to briefcase 12345»»»

Permissions Expand / Collapse