Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes
Posted Tuesday, June 08, 2010 6:12 AM
SSC Rookie

blakmk (6/8/2010)
First class article, do you have any benchmarks about how it improved performance?


That was the one thing I knew I was missing going into this article. Unfortunately, I don't have any concrete before-and-after times anymore. I'll do what I can to get some other results and post them to the thread, though I can't promise anything.
Post #933936
Posted Tuesday, June 08, 2010 7:12 AM


Grasshopper

Better yet, I never use MD5 at all, because the chance of a collision is much higher than with SHA-1. I'd highly recommend not using MD5 for large databases and using only SHA-1, at least until SQL Server supports a stronger hash.

It's fairly easy to calculate the risk of a collision, and I recommend that anyone attempting this approach make their own calculation, so they at least understand the risk of introducing inaccurate data.
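
For a rough sense of scale, here's a minimal sketch of the standard birthday-bound approximation, p ≈ n^2 / 2^(b+1), where n is the number of rows hashed and b is the hash width in bits (the function name and the example row count are mine, just for illustration):

    ' Birthday-bound approximation: probability of at least one collision
    ' among rowCount hashed values of width hashBits bits.
    Public Shared Function CollisionProbability(ByVal rowCount As Double, ByVal hashBits As Integer) As Double
        Return rowCount * rowCount / Math.Pow(2.0, hashBits + 1)
    End Function

    ' Example: one billion rows.
    ' CollisionProbability(1000000000.0, 128) ' MD5:   ~1.5E-21
    ' CollisionProbability(1000000000.0, 160) ' SHA-1: ~3.4E-31

Even at a billion rows the odds are tiny for either algorithm, but SHA-1's extra 32 bits buy you roughly ten more orders of magnitude of headroom.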

Other than that, the hashing technique works great.
Post #933975
Posted Tuesday, June 08, 2010 9:10 AM
Forum Newbie

I see what you're trying to point out, but keep in mind that individual records are checked 1:1, because the lookup is joined on the primary key. You're not checking each hash against the entire dataset, which is what could create the behavior you describe.

I am interested in seeing which specific changes to a record could result in generating an identical MD5 checksum.
Post #934064
Posted Tuesday, June 08, 2010 9:48 AM


Grasshopper

belldb (6/8/2010)
I see what you're trying to point out, but keep in mind that individual records are checked 1:1, because the lookup is joined on the primary key. You're not checking each hash against the entire dataset, which is what could create the behavior you describe.

I am interested in seeing which specific changes to a record could result in generating an identical MD5 checksum.


Cool. I don't have the sample database loaded, so I didn't know for certain that you're joining on the primary key column, since ContactID doesn't have a primary key defined in the code sample. In light of this new information, the 1:1 comparison should make MD5 just fine.

About the only other thing I have to add is that most hashes output a fixed-length string of plain latin characters (CHAR(40) for a hex-encoded SHA-1 hash), so I'd use CHAR rather than VARCHAR, NCHAR, or NVARCHAR for storing the hash value - this post is about maximizing speed, after all.
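
For example, here's a minimal sketch of hex-encoding the hash bytes so the result always fits a fixed-width CHAR(40) column (the article's CreateHash returns Base64 instead, which is 28 characters for SHA-1's 20 bytes):

    ' Hex-encode a hash so it stores as a fixed-width CHAR value.
    Public Shared Function ToHexString(ByVal hash As Byte()) As String
        Dim sb As New System.Text.StringBuilder(hash.Length * 2)
        For Each b As Byte In hash
            sb.Append(b.ToString("x2")) ' two hex digits per byte
        Next
        Return sb.ToString() ' 40 characters for a 20-byte SHA-1 hash
    End Function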

Even with 1:1 comparisons, the wider your input to the hash becomes, the greater the probability of a hash collision. Although I don't have a reference for the speed difference between MD5 and SHA-1 hashing in SQL Server, SHA-1 is plenty fast, especially since it's safer. We should all try to stay as close to ACID as possible, so I'll simply never use MD5 anymore.
Post #934081
Posted Tuesday, June 08, 2010 10:21 AM
SSC Rookie

Great discussion so far. I wanted to chime back in and say that I agree with the recommendations for using SHA-1, but by using the ID plus the hash to identify records, MD5 works very well and has little to no chance of producing a collision.

Either way works fine, and my code should be easy to modify to produce a SHA-1 hash as well.

Here's a revised CreateHash function that should work for SHA-1 hashes (I haven't tested it; I'm just going from memory).

    ' Requires Imports System.Text and Imports System.Security.Cryptography.
    Public Shared Function CreateHash(ByVal data As String) As String
        ' Convert the input string to bytes (UTF-16), then hash with SHA-1.
        Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
        Dim sha As New SHA1CryptoServiceProvider()
        Dim hashedData As Byte() = sha.ComputeHash(dataToHash)
        ' Return the 20-byte hash as a Base64 string (28 characters).
        Return Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
    End Function
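
For what it's worth, you'd call it the same way as the MD5 version - concatenate the columns you want tracked into a single delimited string and pass that in (the column names below are just placeholders, not from the article):

    ' Hypothetical usage: build one delimited string from the tracked columns.
    Dim rowHash As String = CreateHash(FirstName & "|" & LastName & "|" & EmailAddress)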

Post #934103
Posted Tuesday, June 08, 2010 10:36 AM
SSCrazy

Great article, thanks! I can see where it might speed things up, but only if you are constrained on writes. In my current ETL process I am constrained reading the data, as it comes over a WAN link, so all this would end up doing is complicating things and adding CPU overhead. Now, if I could store the hash in the source DB and pull only changed records over the WAN, that could make a huge difference.

You actually end up doing more reads, so if the percentage of changes is very high you could actually perform worse. (Or, if you are CPU bound, the hash calculation could end up slowing you down.)

The other thing I would like to see it deal with is deletes: if records are deleted from the source, they would stay in your destination...
Post #934112
Posted Tuesday, June 08, 2010 10:37 AM


SSC Rookie

Great article, Brett. I have a half-completed article in my pipeline similar to this one - but directed solely at using a hash to speed up my SCD component - because I think it should!

For those who are interested in doing this but aren't comfortable with coding, you can use the Konesans Checksum component, or the Multiple Hash component on CodePlex, which offers several different hashing methods - see the SSIS Community Tasks and Components site for those extensions.


Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.
Post #934114
Posted Tuesday, June 08, 2010 10:53 AM
SSC Rookie

UMG Developer (6/8/2010)
Great article, thanks! I can see where it might speed things up, but only if you are constrained on writes. In my current ETL process I am constrained reading the data, as it comes over a WAN link, so all this would end up doing is complicating things and adding CPU overhead. Now, if I could store the hash in the source DB and pull only changed records over the WAN, that could make a huge difference.

You actually end up doing more reads, so if the percentage of changes is very high you could actually perform worse. (Or, if you are CPU bound, the hash calculation could end up slowing you down.)

The other thing I would like to see it deal with is deletes: if records are deleted from the source, they would stay in your destination...


If you were able to have your source system calculate the hashes for each record in advance, you could skip the hash calculation in the SSIS package entirely and just query IDs and hashes to do the comparisons. This method was devised with a source system in mind that didn't do that. I've also never encountered an issue with being CPU bound on hash calculations, even with millions of records.
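
Conceptually, once both sides expose ID/hash pairs, this is all the comparison logic you'd need (a minimal sketch; the dictionary names are made up, and inside the package this is effectively what the Lookup component does for you):

    ' Requires Imports System.Collections.Generic.
    ' sourceHashes and destHashes each map record ID -> hash string.
    For Each kvp As KeyValuePair(Of Integer, String) In sourceHashes
        Dim destHash As String = Nothing
        If Not destHashes.TryGetValue(kvp.Key, destHash) Then
            ' ID not in the destination: a new record, so insert it.
        ElseIf destHash <> kvp.Value Then
            ' Hashes differ: the record changed, so update it.
        End If
    Next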

Deletes would be handled separately, in another data flow, and are out of scope for this particular one, as they would require a Lookup going in the reverse direction (destination to source) to find the IDs that no longer exist, then piping those to a staging table or an OLE DB Command where you could run the deletes.
Post #934128
Posted Tuesday, June 08, 2010 11:04 AM
SSCrazy

Brett Flippin (6/8/2010)
If you were able to have your source system calculate the hashes for each record in advance, you could skip the hash calculation in the SSIS package entirely and just query IDs and hashes to do the comparisons. This method was devised with a source system in mind that didn't do that. I've also never encountered an issue with being CPU bound on hash calculations, even with millions of records.


That could be useful if I can get the hash calculation to match on both sides, though it would probably be a lot of work. Again, if the percentage of changes is really high, it could end up pulling more data over the WAN than a full truncate/reload. I think the way to do it, if you can, would be to store the last-synced hash value in the source system, so that you can pull only the changed and new records over the WAN link. I wouldn't even need the hash value on the destination side. (I think this has some real possibilities.) Thanks for the question that made me think more about this!

Brett Flippin (6/8/2010)
I've also never encountered an issue with being CPU bound on hash calculations, even with millions of records.


The hash calculations might not cause a CPU problem on their own, but the server may well be doing something else at the same time that uses most, if not all, of the available CPU.
Post #934143
Posted Tuesday, June 08, 2010 1:29 PM
SSC Rookie

Todd McDermid (6/8/2010)
Great article, Brett. I have a half-completed article in my pipeline similar to this one - but directed solely at using a hash to speed up my SCD component - because I think it should!

For those who are interested in doing this but aren't comfortable with coding, you can use the Konesans Checksum component, or the Multiple Hash component on CodePlex, which offers several different hashing methods - see the SSIS Community Tasks and Components site for those extensions.


Thanks, Todd. Multiple Hash is indeed a great tool as well.
Post #934222