HASHBYTES: Is CHECKSUM really required?

  • Comments posted to this topic are about the item HASHBYTES: Is CHECKSUM really required?

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Thanks for the article, Nakul.

    I have a question about HASHBYTES (and hashing in general) to which I have so far found it impossible to get a definitive answer. Perhaps you know: you list HASHBYTES as providing change detection, but is that really true?

    In a lot of places you see claims that "any change" will affect the hash value, but that is not true: it is easy to see that for any (finite) hash algorithm there are always two files (or data blobs) of the same size, differing by only two bits, that will give the same hash value. But it is not so obvious (at least to me) whether there can be two files/data blobs that differ by only one bit and still have the same hash value. Of course I realise that, from a practical point of view, it is extremely unlikely that any two given files/data blobs will have the same hash value, but I am interested in the theoretical question: will a single-bit change always change the hash value?

    Thanks,

    Alex

  • Hi guys,

    Thanks a lot for your publication; just a few words of advice.

    I use this approach, and I have run into several problems with the checksum process.

    If you have a NULL value anywhere in the key, the checksum fails or returns a "strange" value.

    So take care that every field has a value (ideally, declare the columns NOT NULL). To reduce checksum failures, I also add a "|" delimiter between fields and wrap each one in an ISNULL function, just in case; a sketch of the pattern follows.
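
    A minimal sketch of that pattern, borrowing the dbo.PostalCode columns that appear later in this thread (the table and column names are assumptions):

    -- Delimit with '|' so that ('ab','c') and ('a','bc') do not concatenate
    -- to the same string, and wrap each column in ISNULL so a NULL cannot
    -- propagate through the concatenation.
    SELECT HASHBYTES('SHA1',
               ISNULL(PostalArea,  '') + '|' +
               ISNULL(PostalCity,  '') + '|' +
               ISNULL(PostalState, '')) AS RowHash
    FROM dbo.PostalCode;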

    Regards

  • Is CHECKSUM guaranteed to be unique? Can different combinations of data give the same CHECKSUM value?

  • A checksum is 4 bytes, with very little chance of uniqueness, and is very easily tampered with.

    HASHBYTES produces a 16-byte (MD5) or 20-byte (SHA1) hash value calculated over a maximum of 8,000 input bytes, with a pretty good guarantee of uniqueness; in the case of SHA1 it is also very difficult to tamper with.

    So HASHBYTES (SHA1) is the right choice in my opinion; a quick size comparison follows.
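
    A quick way to see the sizes side by side (a minimal sketch; the input literal is arbitrary):

    -- DATALENGTH reports the size in bytes of each result.
    SELECT
        DATALENGTH(HASHBYTES('MD5',  'some input')) AS MD5Bytes,      -- 16
        DATALENGTH(HASHBYTES('SHA1', 'some input')) AS SHA1Bytes,     -- 20
        DATALENGTH(CHECKSUM('some input'))          AS ChecksumBytes; -- 4 (an INT)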

  • Good article, Nakul,

    But it would have benefited immensely from a paragraph on fault tolerance.

    Microsoft's MSDN page refers to it explicitly:

    http://msdn.microsoft.com/en-gb/library/ms189788.aspx

    As the number of table rows increases, the chance of a "Birthday Paradox" collision increases; and since the whole point of the strategy is to reduce index size and therefore search times, I would imagine you would be targeting a table with a large number of rows. So it is perhaps a counterproductive strategy.

    Admittedly, tables with a relatively low number of rows but very wide search criteria may benefit from the CHECKSUM strategy.
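
    For scale, a back-of-the-envelope figure (assuming, optimistically, that CHECKSUM were uniformly distributed over its 2^32 possible values): the birthday bound puts roughly a 50% chance of at least one collision at about 1.18 x sqrt(2^32), i.e. around 77,000 rows, so tables much larger than that are practically guaranteed to contain duplicate checksum values.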

  • Thank you all for your time and valuable feedback.

    Yes, CHECKSUM fails to detect symmetric changes and has therefore been listed only as a "basic" change detection mechanism; a quick demonstration of one easy-to-hit collision follows. The "ideal" one would be HASHBYTES, but it comes at an additional cost: storage.
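
    A minimal sketch of one easy-to-hit collision (this assumes a case-insensitive default collation, which most SQL Server installations use):

    -- For character data, CHECKSUM follows the collation's comparison rules,
    -- so under a case-insensitive collation strings differing only in case
    -- produce the same checksum.
    SELECT CHECKSUM('Dog') AS UpperCase,
           CHECKSUM('dog') AS LowerCase; -- the two values are identical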

    As mentioned in the article, I would request the kind reader to please go through my in-depth study of the various change and tamper detection mechanisms available, where I attempt to evaluate the pros and cons of each method (Part 01 and Part 02).

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Hi Nakul, we are currently using CHECKSUM. I am not sure whether it is used by any default features of the database, but our ETLs are always failing, with CHECKSUM causing I/O errors.

  • I found the issue with NULL values. If we concatenate n columns in table1 and compare them with the same n concatenated columns in table2, and any column in table1 is NULL, then the HASHBYTES result for the whole concatenation will be NULL, while the BINARY_CHECKSUM result will be some default value. So we first have to use the COALESCE function (or some other technique) to handle NULL columns; only then can we take advantage of these comparison functions. The behaviour is easy to reproduce, as shown below.
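
    A minimal sketch of that behaviour (the variables and their values are arbitrary):

    DECLARE @a VARCHAR(10) = 'abc',
            @b VARCHAR(10) = NULL;

    SELECT
        -- String concatenation with a NULL yields NULL, so the hash is NULL too.
        HASHBYTES('SHA1', @a + @b) AS HashOfConcat,
        -- BINARY_CHECKSUM still returns an integer despite the NULL input.
        BINARY_CHECKSUM(@a, @b) AS BinChecksum,
        -- Guarding with COALESCE (plus a delimiter) gives a usable, non-NULL hash.
        HASHBYTES('SHA1', COALESCE(@a, '') + '|' + COALESCE(@b, '')) AS GuardedHash;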

  • HASHBYTES takes a single string (or binary) input, so any potentially NULL fields need to be wrapped in ISNULL(x, '') - and that assumes you want a NULL to be treated exactly the same as an empty string.

    CHECKSUM and BINARY_CHECKSUM, by contrast, accept multiple columns rather than a single string value. The sketch below contrasts the two call styles.
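
    A minimal sketch (the variables are hypothetical):

    DECLARE @x VARCHAR(10) = 'foo',
            @y VARCHAR(10) = NULL;

    SELECT
        -- Multi-column input; NULL arguments are handled internally.
        CHECKSUM(@x, @y) AS MultiColumnChecksum,
        -- Single string input; NULLs must be wrapped or the whole result is NULL.
        HASHBYTES('SHA1', ISNULL(@x, '') + ISNULL(@y, '')) AS SingleStringHash;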

  • Your query using the CHECKSUM index is incorrect:

    SELECT *
    FROM dbo.PostalCode
    WHERE PostalCheckSum = CHECKSUM(@PostalArea, @PostalCity, @PostalState);

    As previously mentioned, the value returned from CHECKSUM will not be unique for the inputs. The same input will always produce the same output, but different input is not guaranteed to produce different output.

    Therefore, your query still needs to test the other columns:

    SELECT *
    FROM dbo.PostalCode
    WHERE PostalCheckSum = CHECKSUM(@PostalArea, @PostalCity, @PostalState)
      AND PostalArea = @PostalArea
      AND PostalCity = @PostalCity
      AND PostalState = @PostalState;

  • suharis (3/27/2013)


    Hi Nakul, we are currently using CHECKSUM. I am not sure whether it is used by any default features of the database, but our ETLs are always failing, with CHECKSUM causing I/O errors.

    Are you trying to insert a CHECKSUM value? It should be a computed column; the value can be persisted for searching by creating an index over the CHECKSUM, along the lines of the sketch below.
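
    A minimal sketch, reusing the dbo.PostalCode example from this thread (the index name is an assumption):

    -- The checksum is a computed column; creating an index over it materialises
    -- the value, so nothing needs to be inserted explicitly.
    ALTER TABLE dbo.PostalCode
        ADD PostalCheckSum AS CHECKSUM(PostalArea, PostalCity, PostalState);

    CREATE NONCLUSTERED INDEX IX_PostalCode_PostalCheckSum
        ON dbo.PostalCode (PostalCheckSum);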

    This is a great forum question - can you post it with supporting details on the SSC forums?

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Sorry, this is a very bad example. CHECKSUM, as you say, is not unique, so it creates very many duplicates. As a solution it is rather pointless if the user gets incorrect results back from the query. Yes, it saves index space, but the end query result is wrong!

    Maybe I missed the point of the example?
