• Eirikur Eiriksson (9/16/2015)


    A more common scenario would be something like this, given the following facts:

    a) Inserts and updates are relatively inexpensive

    b) Data comparison is relatively expensive

    c) Good-enough hash algorithms are relatively inexpensive

    Method for high-collision hash algorithms (e.g. CHECKSUM) - a rough T-SQL sketch follows the steps

    1. Load / insert data into a database

    2. Create checksum/hash

    3. Receive Delta data

    4. Create checksum/hash for the Delta

    5. Load / insert / update rows where the Delta hash value differs

    6. Compare the actual data for Delta rows with an identical hash value (a matching hash does not guarantee unchanged data here)
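
    Something along these lines illustrates steps 5 and 6 of the CHECKSUM variant; the table names (dbo.Target, stg.Delta), the key and the columns are illustrative assumptions rather than a prescribed schema, only the update side is shown, and NULL handling is left out for brevity:

        -- Step 5: rows whose checksum differs have certainly changed
        UPDATE t
        SET    t.Col1    = d.Col1,
               t.Col2    = d.Col2,
               t.RowHash = CHECKSUM(d.Col1, d.Col2)
        FROM   dbo.Target AS t
        JOIN   stg.Delta  AS d ON d.BusinessKey = t.BusinessKey
        WHERE  t.RowHash <> CHECKSUM(d.Col1, d.Col2);

        -- Step 6: rows with an identical checksum may still be collisions,
        -- so the actual column values are compared as well
        UPDATE t
        SET    t.Col1 = d.Col1,
               t.Col2 = d.Col2
        FROM   dbo.Target AS t
        JOIN   stg.Delta  AS d ON d.BusinessKey = t.BusinessKey
        WHERE  t.RowHash = CHECKSUM(d.Col1, d.Col2)
        AND   (t.Col1 <> d.Col1 OR t.Col2 <> d.Col2);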

    Method for low-collision hash algorithms (e.g. MD5, SHA1, Tiger, etc.) - again sketched in T-SQL after the steps

    1. Load / insert data into a database

    2. Create checksum/hash

    3. Receive Delta data

    4. Create checksum/hash for the Delta

    5. Load / insert / update rows where the Delta hash value differs

    6. Ignore Delta rows with an identical hash value
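
    A minimal sketch of the same update using HASHBYTES, with the same assumed table and column names (here RowHash would be a varbinary column rather than CHECKSUM's int); a matching hash is simply trusted, so those rows are skipped:

        -- Steps 5 and 6: only rows whose hash differs are touched; identical
        -- hashes are ignored because random collisions are vanishingly rare.
        -- The '|' delimiter avoids 'ab' + 'c' colliding with 'a' + 'bc'.
        UPDATE t
        SET    t.Col1    = d.Col1,
               t.Col2    = d.Col2,
               t.RowHash = HASHBYTES('SHA1', CONCAT(d.Col1, '|', d.Col2))
        FROM   dbo.Target AS t
        JOIN   stg.Delta  AS d ON d.BusinessKey = t.BusinessKey
        WHERE  t.RowHash <> HASHBYTES('SHA1', CONCAT(d.Col1, '|', d.Col2));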

    The theoretical collision rate for algorithms such as MD5 and SHA1 is somewhere in the range of 1 in 2^60, i.e. about 1 in 1.15x10^18. Compared to other risks, such as using NOLOCK or READ UNCOMMITTED, it is truly low enough to be ignored.

    😎

    A few years back I was working on a system that relied heavily on hashes for change detection and found SHA1 to be one of the faster algorithms supported by the HASHBYTES function in SQL Server (2012).
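
    A crude way to repeat that kind of comparison on your own data (dbo.SomeLargeTable and its Payload column are assumptions) is simply to time each algorithm over the same rows:

        SET STATISTICS TIME ON;
        -- Force the hash to be computed for every row, then compare the
        -- elapsed times reported for each statement
        SELECT COUNT(HASHBYTES('MD5',  Payload)) FROM dbo.SomeLargeTable;
        SELECT COUNT(HASHBYTES('SHA1', Payload)) FROM dbo.SomeLargeTable;
        SET STATISTICS TIME OFF;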

    Good write-up by Thomas Kejser: Exploring Hash Functions In SQL Server

    Things change quite quickly in the hashing world; NIST added SHA2 with 256, 384, and 512 bit output to the Secure Hash Standard in 2002 - quite a long time ago - and although this was aimed at security rather than at reducing random collisions, the increased hash length leads to vastly lower chances of random collision than SHA1 has. A 224 bit version was added to the standard a couple of years later. In the 512 bit variant of SHA2 the chance of two random strings having the same hash is 10^-154, as opposed to SHA1's 10^-48 (your 10^18 comes from SHA1's cryptographic strength, reduced from 80 bits to 61 bits by known breaking techniques, which is irrelevant in the current - noncryptographic - context). So maybe it would be better to use SHA2 with 512 bits if the probability of random collisions has to be very low.
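
    For anyone wanting the arithmetic behind those figures: the chance of two independent random inputs sharing an n-bit hash is 2^-n, so to the nearest order of magnitude

        2^-160 ≈ 10^-48     (SHA1, 160 bit output)
        2^-512 ≈ 10^-154    (SHA2, 512 bit output)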

    NIST added SHA3 to the standard last month, but the idea there was to have a very different style of hash, so that cryptographic attacks effective against SHA2 might not be effective against SHA3 and vice versa. It doesn't add any extra hash length, so it doesn't affect the random collision rate.

    I wonder when (or if) SQL or T-SQL will add support for the full SHA2 family; HASHBYTES in SQL Server 2012 covers only the 256 and 512 bit variants. Given the failure to support the modern floating-point standard and get rid of the representation problem, I don't expect anything to happen any time soon.

    Tom