HASHBYTES: Is CHECKSUM really required?

  • Comments posted to this topic are about the item HASHBYTES: Is CHECKSUM really required?

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Thanks for the article, Nakul.

    I have a question about HASHBYTES (and hashing in general) to which I have so far found it impossible to get a definitive answer. Perhaps you know: you list HASHBYTES as providing change detection, but is that really true?

    In a lot of places you see claims that "any change" will affect the hash value, but that is not true: it is easy to see that for any (finite) hash algorithm there are always two files (or data blobs) of the same size, differing by only two bits, that will give the same hash value. But it is not so obvious (at least to me) whether there can be two files/data blobs that differ by only one bit and still have the same hash value. Of course I realise that, from a practical point of view, it is extremely unlikely that any two given files/data blobs will have the same hash value, but I am interested in the theoretical question: will a single-bit change always change the hash value?

    Thanks,

    Alex

  • Hi guys,

    Thanks a lot for your publication; just a few words of advice.

    I use this approach, and I have run into several problems with the checksum process.

    If you have a NULL value anywhere in the key, the checksum fails or returns a "strange" value.

    So take care that every field has a value (ideally, declare the columns NOT NULL). To reduce checksum failures, I also add a "|" delimiter between fields and wrap each one in an ISNULL function, just in case; a sketch of the pattern follows.
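
    A minimal sketch of that pattern, borrowing the dbo.PostalCode columns that appear later in this thread (the table and column names are assumptions):

    -- Delimit with '|' so that ('ab','c') and ('a','bc') do not concatenate
    -- to the same string, and wrap each column in ISNULL so a NULL cannot
    -- propagate through the concatenation.
    SELECT HASHBYTES('SHA1',
               ISNULL(PostalArea,  '') + '|' +
               ISNULL(PostalCity,  '') + '|' +
               ISNULL(PostalState, '')) AS RowHash
    FROM dbo.PostalCode;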

    Regards

  • Is CHECKSUM guaranteed to be unique? Can different combinations of data give the same CHECKSUM value?

  • A checksum is 4 bytes, with very little chance of uniqueness, and is very easily tampered with.

    HASHBYTES produces a 16-byte (MD5) or 20-byte (SHA1) hash value calculated over a maximum of 8,000 input bytes, with a pretty good guarantee of uniqueness; in the case of SHA1 it is also very difficult to tamper with.

    So HASHBYTES (SHA1) is the right choice in my opinion; a quick size comparison follows.
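
    A quick way to see the sizes side by side (a minimal sketch; the input literal is arbitrary):

    -- DATALENGTH reports the size in bytes of each result.
    SELECT
        DATALENGTH(HASHBYTES('MD5',  'some input')) AS MD5Bytes,      -- 16
        DATALENGTH(HASHBYTES('SHA1', 'some input')) AS SHA1Bytes,     -- 20
        DATALENGTH(CHECKSUM('some input'))          AS ChecksumBytes; -- 4 (an INT)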

  • Good article, Nakul,

    But it would have benefited immensely from a paragraph on fault tolerance.

    Microsoft's MSDN page refers to it explicitly:

    http://msdn.microsoft.com/en-gb/library/ms189788.aspx

    As the number of table rows increases, the chance of a "Birthday Paradox" collision increases; and since the whole point of the strategy is to reduce index size and therefore search times, I would imagine you would be targeting a table with a large number of rows. So it is perhaps a counterproductive strategy.

    Admittedly, tables with a relatively low number of rows but very wide search criteria may benefit from the CHECKSUM strategy.
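
    For scale, a back-of-the-envelope figure (assuming, optimistically, that CHECKSUM were uniformly distributed over its 2^32 possible values): the birthday bound puts roughly a 50% chance of at least one collision at about 1.18 x sqrt(2^32), i.e. around 77,000 rows, so tables much larger than that are practically guaranteed to contain duplicate checksum values.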

  • Thank you all for your time and valuable feedback.

    Yes, CHECKSUM fails to detect symmetric changes and has therefore been listed only as a "basic" change detection mechanism; a quick demonstration of one easy-to-hit collision follows. The "ideal" one would be HASHBYTES, but it comes at an additional cost: storage.
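
    A minimal sketch of one easy-to-hit collision (this assumes a case-insensitive default collation, which most SQL Server installations use):

    -- For character data, CHECKSUM follows the collation's comparison rules,
    -- so under a case-insensitive collation strings differing only in case
    -- produce the same checksum.
    SELECT CHECKSUM('Dog') AS UpperCase,
           CHECKSUM('dog') AS LowerCase; -- the two values are identical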

    As mentioned in the article, I would request the kind reader to please go through my in-depth study of the various change and tamper detection mechanisms available, where I attempt to evaluate the pros and cons of each method (Part 01 and Part 02).

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Hi Nakul, we are currently using CHECKSUM. I am not sure whether it is used by any default features of the database, but our ETLs are always failing, with CHECKSUM causing I/O errors.

  • I found the issue with NULL values. If we concatenate n columns in table1 and compare them with the same n concatenated columns in table2, and any column in table1 is NULL, then the HASHBYTES result for the whole concatenation will be NULL, while the BINARY_CHECKSUM result will be some default value. So we first have to use the COALESCE function (or some other technique) to handle NULL columns; only then can we take advantage of these comparison functions. The behaviour is easy to reproduce, as shown below.
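
    A minimal sketch of that behaviour (the variables and their values are arbitrary):

    DECLARE @a VARCHAR(10) = 'abc',
            @b VARCHAR(10) = NULL;

    SELECT
        -- String concatenation with a NULL yields NULL, so the hash is NULL too.
        HASHBYTES('SHA1', @a + @b) AS HashOfConcat,
        -- BINARY_CHECKSUM still returns an integer despite the NULL input.
        BINARY_CHECKSUM(@a, @b) AS BinChecksum,
        -- Guarding with COALESCE (plus a delimiter) gives a usable, non-NULL hash.
        HASHBYTES('SHA1', COALESCE(@a, '') + '|' + COALESCE(@b, '')) AS GuardedHash;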

  • HASHBYTES takes a single string (or binary) input, so any potentially NULL fields need to be wrapped in ISNULL(x, '') - and that assumes you want a NULL to be treated exactly the same as an empty string.

    CHECKSUM and BINARY_CHECKSUM, by contrast, accept multiple columns rather than a single string value. The sketch below contrasts the two call styles.
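
    A minimal sketch (the variables are hypothetical):

    DECLARE @x VARCHAR(10) = 'foo',
            @y VARCHAR(10) = NULL;

    SELECT
        -- Multi-column input; NULL arguments are handled internally.
        CHECKSUM(@x, @y) AS MultiColumnChecksum,
        -- Single string input; NULLs must be wrapped or the whole result is NULL.
        HASHBYTES('SHA1', ISNULL(@x, '') + ISNULL(@y, '')) AS SingleStringHash;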

  • Your query using the CHECKSUM index is incorrect:

    SELECT *
    FROM dbo.PostalCode
    WHERE PostalCheckSum = CHECKSUM(@PostalArea, @PostalCity, @PostalState);

    As previously mentioned, the value returned from CHECKSUM will not be unique for the inputs. The same input will always produce the same output, but different input is not guaranteed to produce different output.

    Therefore, your query still needs to test the other columns:

    SELECT *
    FROM dbo.PostalCode
    WHERE PostalCheckSum = CHECKSUM(@PostalArea, @PostalCity, @PostalState)
      AND PostalArea = @PostalArea
      AND PostalCity = @PostalCity
      AND PostalState = @PostalState;

  • suharis (3/27/2013)


    Hi Nakul, we are currently using CHECKSUM. I am not sure whether it is used by any default features of the database, but our ETLs are always failing, with CHECKSUM causing I/O errors.

    Are you trying to insert a CHECKSUM value? It should be a computed column; the value can be persisted for searching by creating an index over the CHECKSUM, along the lines of the sketch below.
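
    A minimal sketch, reusing the dbo.PostalCode example from this thread (the index name is an assumption):

    -- The checksum is a computed column; creating an index over it materialises
    -- the value, so nothing needs to be inserted explicitly.
    ALTER TABLE dbo.PostalCode
        ADD PostalCheckSum AS CHECKSUM(PostalArea, PostalCity, PostalState);

    CREATE NONCLUSTERED INDEX IX_PostalCode_PostalCheckSum
        ON dbo.PostalCode (PostalCheckSum);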

    This is a great forum question - can you post it with supporting details on the SSC forums?

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Sorry, this is a very bad example. CHECKSUM, as you say, is not unique, so it creates very many duplicates. As a solution it is rather pointless if the user gets incorrect results back from the query. Yes, it saves index space, but the end query result is wrong!

    Maybe I missed the point of the example?
