Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

SQL Musings

Steve Jones is the editor of SQLServerCentral.com and visits a wide variety of data related topics in his daily editorial. Steve has spent years working as a DBA and general purpose Windows administrator, primarily working with SQL Server since it was ported from Sybase in 1990. You can follow Steve on Twitter at www.twitter.com/way0utwest

SQL Server Encryption - Hashing Collisions

A hash is a computation that transforms one set of data into another (hopefully smaller) set of data. So a hash on your 2,000 character blog post should generate a smaller, 10-20 byte value. In doing that, obviously there are many more possible 2,000 character sets of data then there are 10 or 20 byte sets of data. Most of those don't make sense since they would be random scrambling of character values, but they still exist.

When two or more large set sets of data were to generate the same 20byte hash, we have a collision. And collisions are bad.

Why?

The short answer is that you then can't be sure that the source data the 2,000 character strings are equivalent. Their hashes are, but they aren't. They could be, and that's one thing that you are usually using a has to check. For example, one common use of hashing is to detect if data has changed. You might calculate a hash from your application, and one from the database and compare them. If they're the same, you assume nothing has changed.

Why do you do this? It is often much less resource intensive than comparing all fields, or sending a large amount of data to (or from) the server. You could more easily send a 20 byte hash than a 2,000 byte chunk of data.

The HASHBYTES function in SQL Server can be used to generate hashes with a few algorithms, and SHA-1 is the best one to use. It has a low chance of collisions and is considered fairly secure by the cryptographic community. Not great, and it's been superseded by SHA-2, but it should work for most data detection changes.

However there are a few other functions have been supported for some time. There are CHECKSUM, BINARY_CHECKSUM, and CHECKSUM_AGG that exist. You shouldn't use these because of a high chance of collisions. I'll repeat that.

Don't use these!

If you are detecting data changes, consider this code:

select 'LE' 'String', checksum('le') 'CheckSumHash'

It returns

hash1

Now, consider that you run this:

select 'AAAAAAAAAAAAAAAALE' 'String', checksum('AAAAAAAAAAAAAAAALE') 'CheckSumHash'

you get this:

hash2

Not exactly what you were expecting!

The issue is that there are changes of collisions with these functions, perhaps too high to take the chance of two values being incorrectly compared. I would be very careful about using these functions, and if you do, be sure that you note this potential issue to all support people. If the application appears to be missing changes because of a checksum, be sure you do not have any type of collision taking place.

Comments

Posted by matt stockham on 3 June 2009

I don't think you should state that checksum etc shouldn't be used; hashsum may be a better solution but it's still a hash and still runs the risk of collision.

Posted by matt stockham on 3 June 2009

Of course I meant hashBYTES may be a better solution .....

Posted by Chris Chilvers on 9 June 2009

Something else that can help is to send the length of the data with the hash, less chance (but not zero chance) two pieces of data having the same hash and the same length.

Posted by Michael Coles on 9 June 2009

Matt - I ran some really simple (unscientific) tests of HashBytes SHA-1 versus CHECKSUM against a "real" data set (the list of the most common names in the United States from the U.S. Census Bureau).  HashBytes SHA-1 generated zero collisions.  CHECKSUM generated collisions as frequently as 1 in 10 (depending on the combinations of data).  The odds of an SHA-1 collision are on the order of 1 in 2^160 (roughly about 1 in 10,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000).  As Steve demonstrated the CHECKSUM function "repeats" every 16 characters.

Chris, try this:

SELECT CHECKSUM('LE'), LEN('LE')

SELECT CHECKSUM('MU'), LEN('MU')

As you can see the length of the string isn't a big help with CHECKSUM. For more examples try this one:

WITH X

AS

(

SELECT 'A' AS C

UNION ALL

SELECT 'B'

UNION ALL

SELECT 'C'

UNION ALL

SELECT 'D'

UNION ALL

SELECT 'E'

UNION ALL

SELECT  'F'

UNION ALL

SELECT  'G'

UNION ALL

SELECT  'H'

UNION ALL

SELECT  'I'

UNION ALL

SELECT  'J'

UNION ALL

SELECT  'K'

UNION ALL

SELECT  'L'

UNION ALL

SELECT  'M'

UNION ALL

SELECT  'N'

UNION ALL

SELECT  'O'

UNION ALL

SELECT  'P'

UNION ALL

SELECT  'Q'

UNION ALL

SELECT  'R'

UNION ALL

SELECT  'S'

UNION ALL

SELECT  'T'

UNION ALL

SELECT  'U'

UNION ALL

SELECT  'V'

UNION ALL

SELECT  'W'

UNION ALL

SELECT  'X'

UNION ALL

SELECT  'Y'

UNION ALL

SELECT  'Z'

)

SELECT X1.C + X2.C, CHECKSUM(X1.C + X2.C)

FROM X X1

CROSS JOIN X X2

ORDER BY CHECKSUM(X1.C + X2.C)

Try substituting HashBytes for CHECKSUM in the above and you won't see a single collision.

Posted by Steve Jones on 9 June 2009

It's Michael's book, so he posted his code, and this is a good reason to buy it. You'll learn some interesting things.

Posted by mark donskoy on 9 June 2009

I think one simple possible solution to the collisions would be calculating 2 HASHBYTES - one by SHA-1 and another by SHA-2 algorithms.

Then one can either compare both pairs of HASHBYTES or get just one string that is some of those 2 functions.

I think the chance that 2 different strings would get the same 2 HASHBYTES are close to 0.

Posted by Michael Coles on 9 June 2009

Correction to previous post - there should only be 48 zeroes behind that 1, not 49.  Still really low odds of a collision.

Mark - 1:10^48 is pretty close to zero, but some people do go with the "belt-and-suspenders" method and apply multiple hashes.  The HashBytes function doesn't natively support SHA-2, but you could use MD5 as the second hash or get SHA-2 functionality through SQL CLR.

Posted by Renato Buda on 16 June 2009

One thing not mentioned in the article is the support for binary_checksum(*).

This could be used to readily detect changes on a row without coding every column. (It could if it was reliable anyway).

Does anyone know how to get this kind of programming shortcut with HashBytes?

Posted by Jacob Luebbers on 2 July 2009

@Renato: The naive approach would be to simply cast all columns to a binary datatype, concatenate then together and pass that to HashBytes(). It should work fine, but performance may not be very good. Test and see :-)

Posted by vanram on 29 July 2009

1) RSA Labs says that SHA-1 is more 'collision-free' than MD5

2) RSA Labs says Researchers (1994) brute force attach estimate a collision using MD5 on avg after 24 days running on a $10M system.

What i have not seen explained by RSA labs or here is what scenario will result in the collision:

a) 2 radically different strings

b) 2 very similar strings

The reason i want to know is that if the answer is (a) and not (b) then i can use Hashbytes as the technique for recognising change of none key columns of a row for an ETL load. BUT if (b) is the problem, and by definition the values wil mostly be similar will the weight the result to creating frequent collisions?

Leave a Comment

Please register or log in to leave a comment.