Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Great article, thanks! I can see where it might be able to speed things up, but only if you are constrained on your writes. In my current ETL process I am constrained reading the data, as it comes over a WAN link, so all this would end up doing is complicating things and adding more CPU overhead. Now if I could store the hash in the source DB and only pull changed records over the WAN, that could make a huge difference.

    You actually end up doing more reads, so if the percentage of changes is very high you could end up performing worse. (Or if you are CPU bound, the hash calculation could slow you down.)

    The other thing I would like to see it deal with is deletes: if records are deleted from the source, they would stay in your destination...

  • Great article, Brett. I have a half completed one in my pipe similar to this - but directed solely at how to use a hash to speed up my SCD component - because I think it should!

    For those that are interested in doing this, but aren't comfortable with coding, you can use the Konesans Checksum, or you can use the Multiple Hash component on CodePlex that offers several different hashing methods - see the SSIS Community Tasks and Components site for those extensions.

    Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
    Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
    Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.

  • UMG Developer (6/8/2010)


    Great article, thanks! I can see where it might be able to speed things up, but only if you are constrained on your writes. In my current ETL process I am constrained reading the data, as it comes over a WAN link, so all this would end up doing is complicating things and adding more CPU overhead. Now if I could store the hash in the source DB and only pull changed records over the WAN, that could make a huge difference.

    You actually end up doing more reads, so if the percentage of changes is very high you could end up performing worse. (Or if you are CPU bound, the hash calculation could slow you down.)

    The other thing I would like to see it deal with is deletes: if records are deleted from the source, they would stay in your destination...

    If you were able to have your source system calculate the hashes for each record in advance you could get by very easily by not having to do the hash calculation in the SSIS package and just query IDs and hashes all day long to do comparisons with. This method was devised with a source system in mind that didn't do that. I've also never encountered an issue with being CPU bound on hash calculations even with millions of records.

    Deletes would be handled separately in another data flow and are out of scope for this particular one, as they would require a Lookup going in the reverse direction (destination to source) to find the IDs that no longer exist and then piping those to a staging table or OLE DB Command where you could run the deletes.
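
    Just to make the shape of that concrete, the comparison and the delete pass could look something like the queries below. This is only a rough sketch: stg.SourceKeys, dbo.DimCustomer, and the column names are made-up examples, and it assumes you have already staged the source IDs (and, if the source can supply them, its precomputed hashes) in the destination database.

    -- Rows whose source-supplied hash differs from the stored hash need an UPDATE
    SELECT s.CustomerID
    FROM stg.SourceKeys AS s
    INNER JOIN dbo.DimCustomer AS d
        ON d.CustomerID = s.CustomerID
    WHERE s.SourceHash <> d.RecordHash;

    -- Separate delete pass: remove destination rows whose key no longer exists in the source
    DELETE d
    FROM dbo.DimCustomer AS d
    WHERE NOT EXISTS (SELECT 1 FROM stg.SourceKeys AS s WHERE s.CustomerID = d.CustomerID);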

  • Brett Flippin (6/8/2010)

    If you were able to have your source system calculate the hashes for each record in advance you could get by very easily by not having to do the hash calculation in the SSIS package and just query IDs and hashes all day long to do comparisons with. This method was devised with a source system in mind that didn't do that. I've also never encountered an issue with being CPU bound on hash calculations even with millions of records.

    That could be useful if I can get the hash calculation the same on both sides, though it would probably be a lot of work. Again, if the percentage of changes is really high it could end up pulling more data over the WAN than a full truncate/re-load. I think the way to do it, if you can, would be to store the last synced hash value in the source system, so that you can pull only the changed and new records over the WAN link. I wouldn't even need the hash value on the destination side. (I think this has some real possibilities - see the rough sketch at the end of this post.) Thanks for the question that made me think more about this!

    Brett Flippin (6/8/2010)

    I've also never encountered an issue with being CPU bound on hash calculations even with millions of records.

    The hash calculations might not cause a CPU problem on their own, but the server may well be doing something else at the same time that uses most, if not all, of the available CPU.
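
    To sketch out the store-the-hash-in-the-source idea (purely hypothetical table and column names, and it assumes the source happens to be SQL Server so that HASHBYTES is available there):

    -- LastSyncedHash holds the MD5 hash of each row as of the last successful sync.
    -- Only new rows or rows whose current hash differs get pulled over the WAN.
    SELECT c.CustomerID, c.FirstName, c.LastName, c.Email
    FROM dbo.Customer AS c
    WHERE c.LastSyncedHash IS NULL
       OR c.LastSyncedHash <> HASHBYTES('MD5',
            ISNULL(c.FirstName, N'') + N'|' + ISNULL(c.LastName, N'') + N'|' + ISNULL(c.Email, N''));

    -- Once the destination confirms the load, stamp the rows with the new hash.
    UPDATE c
    SET LastSyncedHash = HASHBYTES('MD5',
            ISNULL(c.FirstName, N'') + N'|' + ISNULL(c.LastName, N'') + N'|' + ISNULL(c.Email, N''))
    FROM dbo.Customer AS c;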

  • Todd McDermid (6/8/2010)


    Great article, Brett. I have a half completed one in my pipe similar to this - but directed solely at how to use a hash to speed up my SCD component - because I think it should!

    For those that are interested in doing this, but aren't comfortable with coding, you can use the Konesans Checksum, or you can use the Multiple Hash component on CodePlex that offers several different hashing methods - see the SSIS Community Tasks and Components site for those extensions.

    Thanks Todd. Multiple Hash is indeed a great tool as well.

  • I'm afraid I'm not familiar with the tools used in the article, but at first glance it appears the following two general "hash to find differences" cautions are appropriate:

    First, when concatenating fields into a string and hashing that string, you must either convert each field to a fixed-length string at or beyond its maximum possible length, or use a delimiter that you guarantee (and enforce) never appears in your data. Otherwise the following rows produce the same concatenated string, and thus the same hash (see the sketch at the end of this post):

    Field1   Field2
    'ab'     'cd'
    'a'      'bcd'
    'abc'    'd'

    Addendum: make sure your tools represent NULL differently from any other value.

    Second, when the hashes are different, you can assume the data is different. When the hashes are the same, you have to either:

    A) Ignore collisions when they happen (the "oh, it won't happen to me!" approach), and accept that every once in a while you're going to fail to notice a data change.

    B) Test field by field (or the entire concatenated string, if you built it correctly) to see whether or not the data is actually different.

    Regarding the statistical "it's a very low chance, see my math" arguments that crop up around collisions: a non-zero chance must be properly accounted for, either by stating that failing to notice some data changes is acceptable (and hopefully instituting a periodic check to see how many there have been, perhaps an annual or semiannual reconciliation), or by continuing the detection process until, assuming your code works properly, the chance is zero (we'll ignore platform bugs for now, since those can corrupt your data regardless).
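
    A quick illustration of the first point, using made-up values (the delimiter and the NULL marker are arbitrary choices; the key is that neither can ever occur in the real data):

    DECLARE @Field1 varchar(10), @Field2 varchar(10);
    SELECT @Field1 = 'a', @Field2 = 'bcd';

    -- Unsafe: 'a' + 'bcd' = 'abcd' = 'ab' + 'cd', so different rows hash identically by construction
    SELECT HASHBYTES('MD5', @Field1 + @Field2) AS UnsafeHash;

    -- Safer: delimit with a character you enforce never appears in the data,
    -- and replace NULL with a marker distinct from any legitimate value (including the empty string)
    SELECT HASHBYTES('MD5',
           ISNULL(@Field1, CHAR(0)) + '|' + ISNULL(@Field2, CHAR(0))) AS SaferHash;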

  • You have some good points Nadrek, thanks for bringing them up!

  • Very nice article.

    We could also use the MERGE statement in SQL Server 2008 to achieve this. Any thoughts on that?

  • Neeraj Gupta-485504 (6/8/2010)


    Very nice article.

    We could also use the MERGE statement in SQL Server 2008 to achieve this. Any thoughts on that?

    I think you are correct: if you have direct access to both source and destination as tables, you could use the MERGE statement along with the BINARY_CHECKSUM function to handle the entire process of INSERTing, UPDATEing, and DELETEing.
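
    Roughly along these lines (made-up table and column names, and keeping in mind that BINARY_CHECKSUM has its own collision and data-type caveats):

    MERGE dbo.DimCustomer AS d
    USING dbo.SourceCustomer AS s
        ON d.CustomerID = s.CustomerID
    WHEN MATCHED AND BINARY_CHECKSUM(d.FirstName, d.LastName, d.Email)
                  <> BINARY_CHECKSUM(s.FirstName, s.LastName, s.Email) THEN
        UPDATE SET d.FirstName = s.FirstName,
                   d.LastName  = s.LastName,
                   d.Email     = s.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, FirstName, LastName, Email)
        VALUES (s.CustomerID, s.FirstName, s.LastName, s.Email)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;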

  • Could you please elaborate on the use of the BINARY_CHECKSUM function?

    Thank you

  • An untested idea for minimizing hash collisions could be having two hash columns: one hashing the columns in ascending order and one in descending order. If you have a hash collision in ascending order, the chance of also having one in descending order for the same rows should be nearly impossible, unless the values are the same in all columns.
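
    As a completely untested sketch of what I mean (hypothetical columns, using HASHBYTES just for illustration):

    -- Hash the same columns concatenated in ascending and then in descending order;
    -- a change that collides under one ordering is vanishingly unlikely to collide under both.
    SELECT CustomerID,
           HASHBYTES('MD5', ISNULL(FirstName, '') + '|' + ISNULL(LastName, '') + '|' + ISNULL(Email, '')) AS HashAsc,
           HASHBYTES('MD5', ISNULL(Email, '') + '|' + ISNULL(LastName, '') + '|' + ISNULL(FirstName, '')) AS HashDesc
    FROM dbo.SourceCustomer;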

  • Great article, and I also use the Multiple Hash component that Todd mentioned (http://ssismhash.codeplex.com/), but I like the script component approach. This would be great for a simple Type 1 SCD, but it would obviously need to be expanded upon for a Type 2, and Todd has a great component for that as well (http://kimballscd.codeplex.com/). If you were doing a Type 1 you could typically leverage some sort of last update field, or possibly change data capture, and then simply update the changed records and bypass the use of a hash entirely.
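
    For example, a simple Type 1 pull that trusts a last-modified column could filter on a stored watermark like this (LastUpdatedDate and etl.LoadWatermark are hypothetical names, not anything from the article):

    -- Pull only rows touched since the last successful load
    DECLARE @LastLoad datetime;
    SELECT @LastLoad = LastSuccessfulLoad FROM etl.LoadWatermark WHERE TableName = 'Customer';

    SELECT CustomerID, FirstName, LastName, Email
    FROM dbo.SourceCustomer
    WHERE LastUpdatedDate > @LastLoad;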

    ----------------------------------------------------------------------------------------
    Dan English - http://denglishbi.wordpress.com

  • Interesting read with relation to this topic:

    Let's Hash a BLOB by Michael Coles

    Perhaps more applicable to hashing techniques outside of ETL loading, or to those paranoid about hash collisions, but still interesting stuff. Mr. Coles also includes some .NET code for using stronger hashes than the MD5 and SHA1 covered thus far.

  • Neeraj Gupta-485504 (6/8/2010)


    Could you please elaborate on the use of the BINARY_CHECKSUM function?

    Here is a link to BOL for it: BINARY_CHECKSUM

    Essentially you could do something like:

    UPDATE A
    SET LastName = B.LastName
    FROM Table1 A
    INNER JOIN Table2 B
        ON A.PK = B.PK
    WHERE (SELECT BINARY_CHECKSUM(*) FROM Table1 WHERE PK = A.PK)
       <> (SELECT BINARY_CHECKSUM(*) FROM Table2 WHERE PK = A.PK);

    That probably isn't the best way to do it; you could get rid of the sub-queries by specifying column names, like:

    UPDATE A
    SET LastName = B.LastName
    FROM Table1 A
    INNER JOIN Table2 B
        ON A.PK = B.PK
    WHERE BINARY_CHECKSUM(A.LastName) <> BINARY_CHECKSUM(B.LastName);

    You could use the same idea with a MERGE statement to handle the whole load process.
