HASHBYTES can help quickly load a Data Warehouse

tarquin
Wrapping hashbytes with checksum is not bad as long as you include all columns in your join as indicated and create the index including all columns required for the join. If you only use the result of the checksum in your join - yes it is bad!
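
A rough T-SQL sketch of that pattern; the table and column names (dbo.DimCustomer, dbo.StgCustomer, CustomerCode, etc.) are invented for illustration. The checksum narrows the candidate rows via the index, while the real columns are still compared in the join so a checksum collision can never produce a wrong match.

-- Illustrative only: table and column names are made up for this example.
ALTER TABLE dbo.DimCustomer
    ADD RowCheckSum AS CHECKSUM(CustomerCode, CustomerName, City) PERSISTED;

CREATE INDEX IX_DimCustomer_RowCheckSum
    ON dbo.DimCustomer (RowCheckSum)
    INCLUDE (CustomerCode, CustomerName, City);

SELECT s.CustomerCode, s.CustomerName, s.City
FROM dbo.StgCustomer AS s
JOIN dbo.DimCustomer AS d
    ON  d.RowCheckSum  = CHECKSUM(s.CustomerCode, s.CustomerName, s.City)
    AND d.CustomerCode = s.CustomerCode   -- repeating the real columns means a
    AND d.CustomerName = s.CustomerName   -- checksum collision can never cause
    AND d.City         = s.City;          -- a false match
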
Mike C
I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.
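
For reference, the birthday-bound arithmetic behind that estimate, assuming MD5 behaves like an ideal 128-bit hash:

P(\text{collision}) \approx 1 - e^{-k(k-1)/2^{129}} \approx \frac{k^2}{2^{129}}

For k = 10^9 rows that works out to roughly 10^{-21}, and the probability only reaches about 40% once k approaches 2^64.
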
Daniel Bowlin
Fascinating article. Well thought out and presented. Definitely an approach I will keep in mind, particularly when source data may be out of sync at load time, as presented here.

Thanks.
rmd0
Mike C (4/14/2010)
I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.


Ha! I apologize. Not sure why, but I was thinking CRC32 this entire time. MD5 and SHA1 both work great. CRC32 will result in collisions.
magarity kerns
hennie7863 (4/14/2010)
I have not read the article in detail, so correct me if I'm wrong, but HASHBYTES accepts only one column (of one character type), and in this article the hash calculation is kept simple by using one column.

Currently I'm using the far-from-perfect checksum calculation (with additional field comparisons) for comparing records. It accepts multiple columns and types. When indexed, this will be faster than the HASHBYTES approach.

This article drew my attention because the checksum is not perfect: multiple records can lead to the same checksum result. I'm interested in a solution that uses HASHBYTES with multiple columns and multiple types. Still looking for a good one.

Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert the various column types to a common type and concatenate them together in a function that also computes the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you, or you need to come up with a way to shorten the values somehow. There's no reason why an extremely wide column couldn't be split across two or three checksums in the XREF table.

Obviously this isn't the perfect method for everyone but I think some people will find it useful.
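
A minimal sketch of that convert-and-concatenate idea, assuming an invented staging table (dbo.StgCustomer) with an INT key, a DATETIME column and an NVARCHAR column; each value is cast to NVARCHAR and separated by a delimiter so adjacent values cannot run together.

-- Illustrative only: dbo.StgCustomer and its columns are made up for this example.
-- Keep the total input under HASHBYTES' limit (8,000 bytes / 4,000 NVARCHAR characters).
SELECT HASHBYTES('MD5',
           CONVERT(NVARCHAR(20), CustomerId)          + N'|'
         + CONVERT(NVARCHAR(30), LastOrderDate, 126)  + N'|'   -- style 126 = ISO 8601
         + ISNULL(CustomerName, N'')) AS RowHash               -- NULL hashes like an empty string here
FROM dbo.StgCustomer;
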
magarity kerns
Mike C (4/13/2010)
Nice article! I implemented something very similar for loading data warehouses and datamarts a while back. One thing to keep in mind is that you can actually use a BINARY(16) column to hold MD5 hashes or BINARY(20) for SHA-1 since they're guaranteed to return a binary hash with exact length of 128-bits or 160-bits respectively. Also HASHBYTES is restricted in what it can hash -- there's an 8,000 byte restriction on the input, so you can't hash LOB data or columns wider than 8,000 bytes with it effectively. You can work around that limitation by chopping up your input or by generating the hashes in your ETL tool of choice (SSIS script components allow direct access to the System.Security.Cryptography namespace, for example).


Thanks for the positive feedback!

The tricky thing about keeping the output as binary is that it's hard on the eyes when pasted into an article for illustration. No, it doesn't have to be converted in actual use. But not all MD5 functions return binary; it depends on the ETL tool. I used HASHBYTES to stay on topic with SQL Server.
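
To make the storage point concrete, a small sketch with an invented cross-reference table: the raw MD5 lives in a BINARY(16) column (BINARY(20) would hold SHA-1) and is only rendered as hex text when a human needs to read it.

-- Illustrative only: dbo.CustomerXref is an assumed XREF-style table.
CREATE TABLE dbo.CustomerXref
(
    CustomerId INT        NOT NULL PRIMARY KEY,
    RowHash    BINARY(16) NOT NULL      -- raw MD5; use BINARY(20) for SHA-1
);

-- For display only: style 1 renders the bytes as a 0x-prefixed hex string (SQL Server 2008+).
SELECT CustomerId,
       CONVERT(VARCHAR(34), RowHash, 1) AS RowHashHex
FROM dbo.CustomerXref;
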
magarity kerns
tarquin (4/14/2010)
I need to use multiple columns in the hashindex and so use a checksum over hashbytes such as:
checksum(hashbytes('MD5', column1), hashbytes('MD5',column2)) - this works pretty well but is still not 100% unique across all records (in fact even a hashbytes on a single column is not 100% unique across all values).


I'd suggest getting rid of the checksum and using hashbytes with the columns concatenated together into a single call to the function. This should result in a (reasonably) unique entry. Different data types just need to be converted to a common type.
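
A hedged before/after sketch of that suggestion (dbo.SourceTable is a placeholder; column1 and column2 come from the quoted post): wrapping the hashes in CHECKSUM collapses 128 bits to a 32-bit int, while a single HASHBYTES call over the converted, delimited values keeps the full hash.

-- Before: two MD5 hashes collapsed to one 32-bit int, so collisions behave like CHECKSUM's.
SELECT CHECKSUM(HASHBYTES('MD5', column1), HASHBYTES('MD5', column2)) AS NarrowHash
FROM dbo.SourceTable;

-- After: one 128-bit hash over both columns, cast to a common type and delimited
-- (lengths are illustrative; keep the total input under HASHBYTES' 8,000-byte limit).
SELECT HASHBYTES('MD5',
           CONVERT(NVARCHAR(100), column1) + N'|'
         + CONVERT(NVARCHAR(100), column2)) AS WideHash
FROM dbo.SourceTable;
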
magarity kerns
rmd0 (4/14/2010)
One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns. It would usually occur when the column that changed was a bit field.


This is not the case. Please see the web page linked at the end of the article concerning the likelihood of MD5 collisions.
hennie7863

Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert the various column types to a common type and concatenate them together in a function that also computes the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you, or you need to come up with a way to shorten the values somehow. There's no reason why an extremely wide column couldn't be split across two or three checksums in the XREF table.

Obviously this isn't the perfect method for everyone but I think some people will find it useful.


Hi magarity,

Thanks for your reply. The concatenation option had already crossed my mind, and yes, I agree that it is not the best way. I'm afraid that for comparing records based on hash values we have to choose the best of the not-so-great options available.

I did a little googling and came up with the following options:
* Checksum function
* Hashbytes function
* SSISMultipleHash (3rd party)
* Checksum Transformation from Konesans (3rd party)

The problem I have with 3rd party (open) components is that when you hit a bug, you're stuck. Hmmm...

Keep writing articles!

Greetz,
Hennie
Nadrek
A few comments with the underlying assumption that the desired result is a (potentially) more efficient method that is guaranteed to return the same results as the "long way".

For guaranteed accuracy, use whatever hash you like as a first pass. Whenever the hashes are different, the data is different. Whenever the hashes are the same, run a field by field (or equivalent) "long way" comparison to see if the data is actually the same after all.

Note that plain field concatenation does not work reliably when variable-length types are mixed in: the distinct row values ("a ", "b") and ("a", " b") concatenate to the same string, as do ("1", "1/22/1988") and ("", "11/22/1988"), so different rows can produce identical hashes.

If you deliberately decide that some non-zero percentage of erroneous matches is acceptable, then test your hash against a full-length compare to see what your particular data actually produces. In the end there are only actual results: probabilities only predict probabilities, and most business data I've seen doesn't follow "random input" or normal-distribution assumptions.

That said, if you rely on probability, plan for what you'll do after a collision happens.
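
A sketch of the two-pass comparison described above, using invented table and column names (dbo.Staging, dbo.Target, BusinessKey, Col1, Col2) and assuming a precomputed RowHash on both sides; NULL handling is omitted for brevity.

-- Rows that need updating: a hash difference is proof of change, and a hash match
-- is re-checked column by column so a collision can never hide a real change.
SELECT s.BusinessKey
FROM dbo.Staging AS s
JOIN dbo.Target  AS t
    ON t.BusinessKey = s.BusinessKey
WHERE s.RowHash <> t.RowHash                        -- definitely changed
   OR (s.RowHash = t.RowHash                        -- hash says "same"...
       AND (s.Col1 <> t.Col1 OR s.Col2 <> t.Col2)); -- ...so verify the long way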