HASHBYTES can help quickly load a Data Warehouse
Posted Wednesday, April 14, 2010 7:17 AM
Forum Newbie

Group: General Forum Members
Last Login: Tuesday, May 11, 2010 7:16 AM
Points: 6, Visits: 26
Wrapping HASHBYTES with CHECKSUM is not bad as long as you include all columns in your join as indicated, and create the index to include all columns required for the join. If you only use the result of the CHECKSUM in your join, then yes, it is bad!
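For illustration, here's a minimal sketch of that pattern with hypothetical table and column names (NULL handling omitted); the CHECKSUM gives a small, indexable value, but the join still compares the real columns:

-- Persisted computed column over the hash keeps the indexed value narrow (4 bytes).
ALTER TABLE dbo.CustomerTarget
    ADD RowCheck AS CHECKSUM(HASHBYTES('MD5', FirstName + '|' + LastName + '|' + City)) PERSISTED;

CREATE INDEX IX_CustomerTarget_RowCheck
    ON dbo.CustomerTarget (RowCheck)
    INCLUDE (FirstName, LastName, City);   -- cover the columns the join actually compares

SELECT s.CustomerKey
FROM   dbo.CustomerStage  AS s
JOIN   dbo.CustomerTarget AS t
  ON   t.RowCheck  = CHECKSUM(HASHBYTES('MD5', s.FirstName + '|' + s.LastName + '|' + s.City))
  AND  t.FirstName = s.FirstName           -- never trust the checksum alone
  AND  t.LastName  = s.LastName
  AND  t.City      = s.City;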
Post #903107
Posted Wednesday, April 14, 2010 7:18 AM


Ten Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.
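As a rough sanity check on that figure, assuming MD5's 128-bit output behaves like a uniform random function, the birthday approximation gives

E[\text{colliding pairs}] \approx \binom{n}{2} \cdot 2^{-128} \approx \frac{n^2}{2^{129}}, \qquad n = 2^{64} \Rightarrow E \approx \frac{2^{128}}{2^{129}} = 0.5,

so even at 2^64 rows you would expect only about half a collision on average.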
Post #903108
Posted Wednesday, April 14, 2010 7:29 AM
SSCrazy

Group: General Forum Members
Last Login: Tuesday, November 4, 2014 12:48 PM
Points: 2,818, Visits: 2,566
Fascinating article. Well thought out and presented. Definitely an approach I will keep in mind, particularly when source data may be out of sync at load time, as presented here.

Thanks.
Post #903117
Posted Wednesday, April 14, 2010 7:33 AM
Forum Newbie

Group: General Forum Members
Last Login: Wednesday, September 7, 2011 7:17 AM
Points: 2, Visits: 57
Mike C (4/14/2010)
I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.


Ha! I apologize. Not sure why, but I was thinking CRC32 this entire time. MD5 and SHA1 both work great. CRC32 will result in collisions.
Post #903122
Posted Wednesday, April 14, 2010 7:49 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
hennie7863 (4/14/2010)
I have not read the article in detail, so correct me if I'm wrong, but HASHBYTES accepts only one column (of one character type), and in this article the HASHBYTES calculation is kept simple by using one column.

Currently I'm using the far-from-perfect CHECKSUM calculation (plus additional field comparisons) for comparing records. It accepts multiple columns and types, and when indexed it will be faster than the HASHBYTES functionality.

This article drew my attention because CHECKSUM is not perfect: multiple records can lead to the same checksum result. I'm interested in a solution using HASHBYTES with multiple columns and multiple types. Still looking for a good solution.

Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert various types of columns into a common type and concatenate them together in a function that would also compute the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you, or you need to come up with a way to shorten the values somehow. There's no reason why an extremely wide column couldn't be put into two or three checksums in the XREF table.
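As a rough sketch of such a function, assuming some made-up column names and types and a '|' delimiter (NULLs have to be guarded because string + NULL yields NULL):

CREATE FUNCTION dbo.fn_RowHash
(
    @CustomerID INT,
    @FirstName  NVARCHAR(50),
    @BirthDate  DATETIME,
    @Balance    DECIMAL(18,2)
)
RETURNS BINARY(16)   -- MD5 output is always 128 bits
AS
BEGIN
    RETURN HASHBYTES('MD5',
        CONVERT(NVARCHAR(12), @CustomerID)                  + N'|' +
        ISNULL(@FirstName, N'')                             + N'|' +
        ISNULL(CONVERT(NVARCHAR(30), @BirthDate, 126), N'') + N'|' +   -- ISO 8601 keeps the text deterministic
        CONVERT(NVARCHAR(30), @Balance));
END;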

Obviously this isn't the perfect method for everyone but I think some people will find it useful.
Post #903138
Posted Wednesday, April 14, 2010 7:56 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
Mike C (4/13/2010)
Nice article! I implemented something very similar for loading data warehouses and datamarts a while back. One thing to keep in mind is that you can actually use a BINARY(16) column to hold MD5 hashes or BINARY(20) for SHA-1, since they're guaranteed to return a binary hash with an exact length of 128 bits or 160 bits, respectively. Also, HASHBYTES is restricted in what it can hash -- there's an 8,000-byte restriction on the input, so you can't hash LOB data or columns wider than 8,000 bytes with it effectively. You can work around that limitation by chopping up your input or by generating the hashes in your ETL tool of choice (SSIS script components allow direct access to the System.Security.Cryptography namespace, for example).


Thanks for the positive feedback!

The tricky thing about keeping the output as binary is that it's hard on the eyes when pasted into article text for illustration. No, it doesn't have to be converted in actual use. Also, not all MD5 functions return binary, depending on the ETL tool. I used HASHBYTES to stay on topic with SQL Server.
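For anyone storing the hash rather than displaying it, here's a minimal sketch of the fixed-width columns Mike describes (the table and column names are hypothetical):

CREATE TABLE dbo.CustomerXref
(
    CustomerKey INT        NOT NULL,
    RowHashMD5  BINARY(16) NOT NULL,   -- MD5 always produces 128 bits
    RowHashSHA1 BINARY(20) NULL        -- SHA-1 always produces 160 bits
);

INSERT INTO dbo.CustomerXref (CustomerKey, RowHashMD5, RowHashSHA1)
SELECT CustomerKey,
       HASHBYTES('MD5',  FirstName + '|' + LastName),
       HASHBYTES('SHA1', FirstName + '|' + LastName)
FROM   dbo.DimCustomer;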
Post #903148
Posted Wednesday, April 14, 2010 7:59 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
tarquin (4/14/2010)
I need to use multiple columns in the hashindex and so use a checksum over hashbytes such as:
checksum(hashbytes('MD5', column1), hashbytes('MD5',column2)) - this works pretty well but is still not 100% unique across all records (in fact even a hashbytes on a single column is not 100% unique across all values).


I'd suggest getting rid of the checksum and using hashbytes with the columns concatenated together into a single call to the function. This should result in a (reasonably) unique entry. Different data types just need to be converted to a common type.
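A hedged sketch of that suggestion, reusing column1 and column2 from the quote above (the table name is made up; the '|' delimiter keeps adjacent values from running together):

SELECT HASHBYTES('MD5',
           CONVERT(NVARCHAR(100), column1) + N'|' +
           CONVERT(NVARCHAR(100), column2)) AS RowHash
FROM   dbo.SourceTable;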
Post #903151
Posted Wednesday, April 14, 2010 8:04 AM
Old Hand

Group: General Forum Members
Last Login: Thursday, May 8, 2014 10:08 PM
Points: 358, Visits: 397
rmd0 (4/14/2010)
One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns. It would usually occur when the column that changed was a bit field.


This is not the case. Please see the web page linked at the end of the article concerning the likelihood of MD5 collisions.
Post #903156
Posted Wednesday, April 14, 2010 8:04 AM
SSC Rookie

Group: General Forum Members
Last Login: Friday, November 7, 2014 3:59 AM
Points: 46, Visits: 3,359

magarity (4/14/2010)
Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert various types of columns into a common type and concatenate them together in a function that would also compute the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you, or you need to come up with a way to shorten the values somehow. There's no reason why an extremely wide column couldn't be put into two or three checksums in the XREF table.

Obviously this isn't the perfect method for everyone but I think some people will find it useful.


Hi magarity,

Thanks for your reply. The concatenation option had already crossed my mind, and yes, I agree that it is not the best way. I'm afraid that for comparing based on hash values we have to choose the best solution from the limited options available.

I did a little googling and came up with the following options:
* CHECKSUM function
* HASHBYTES function
* SSIS Multiple Hash (3rd party)
* Checksum Transformation from Konesans (3rd party)

The problem I have with 3rd-party (open) components is that when you hit a bug, you're stuck. Hmmm...

Keep writing articles!

Greetz,
Hennie
Post #903157
Posted Wednesday, April 14, 2010 8:33 AM
SSC Eights!

Group: General Forum Members
Last Login: Wednesday, November 19, 2014 12:06 PM
Points: 887, Visits: 2,453
A few comments with the underlying assumption that the desired result is a (potentially) more efficient method that is guaranteed to return the same results as the "long way".

For guaranteed accuracy, use whatever hash you like as a first pass. Whenever the hashes are different, the data is different. Whenever the hashes are the same, run a field-by-field (or equivalent) "long way" comparison to see if the data is actually the same after all.
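A minimal sketch of that two-pass approach with hypothetical tables and columns (NULL handling omitted); the hash quickly settles the rows that differ, and the column comparison removes any doubt about the rest:

-- Pass 1: different hashes always mean different data.
SELECT s.CustomerKey
FROM   dbo.Stage  AS s
JOIN   dbo.Target AS t ON t.CustomerKey = s.CustomerKey
WHERE  t.RowHash <> s.RowHash

UNION ALL

-- Pass 2: equal hashes get the full "long way" comparison to rule out a collision.
SELECT s.CustomerKey
FROM   dbo.Stage  AS s
JOIN   dbo.Target AS t ON t.CustomerKey = s.CustomerKey
WHERE  t.RowHash = s.RowHash
  AND (t.FirstName <> s.FirstName OR t.LastName <> s.LastName OR t.City <> s.City);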

Note that field-concatenation comparisons do not work reliably when variable-length types are mixed in without delimiters (trivial example: "a " + "b" and "a" + " b" are different field pairs that concatenate to the same value; likewise "1" + "1/22/1988" and "" + "11/22/1988").

If you deliberately decide that some non-zero percentage of erroneous matches is acceptable, then test your hash against a "full length" compare to see what your particular data produces. In reality, there are only actual results: probabilities let you predict likelihoods, but most business data I've seen doesn't follow "random input" or normal-distribution rules.

That said, if you rely on probability, plan for what you'll do after a collision happens.
Post #903196