SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

SQL Musings

Add to Technorati Favorites Add to Google
Author Bio
Steve Jones Editor at SQLServerCentral.com You can follow Steve on Twitter as way0utwest (www.twitter.com/way0utwest)
 

SQL Server Encryption - Hashing Collisions

By Steve Jones in SQL Musings | 06-01-2009 2:39 AM | Categories: Filed under: , ,
Rating: (not yet rated) |  Discuss | 3,618 Reads | 152 Reads in Last 30 Days |10 comment(s)

A hash is a computation that transforms one set of data into another (hopefully smaller) set of data. So a hash on your 2,000 character blog post should generate a smaller, 10-20 byte value. In doing that, obviously there are many more possible 2,000 character sets of data then there are 10 or 20 byte sets of data. Most of those don't make sense since they would be random scrambling of character values, but they still exist.

When two or more large set sets of data were to generate the same 20byte hash, we have a collision. And collisions are bad.

Why?

The short answer is that you then can't be sure that the source data the 2,000 character strings are equivalent. Their hashes are, but they aren't. They could be, and that's one thing that you are usually using a has to check. For example, one common use of hashing is to detect if data has changed. You might calculate a hash from your application, and one from the database and compare them. If they're the same, you assume nothing has changed.

Why do you do this? It is often much less resource intensive than comparing all fields, or sending a large amount of data to (or from) the server. You could more easily send a 20 byte hash than a 2,000 byte chunk of data.

The HASHBYTES function in SQL Server can be used to generate hashes with a few algorithms, and SHA-1 is the best one to use. It has a low chance of collisions and is considered fairly secure by the cryptographic community. Not great, and it's been superseded by SHA-2, but it should work for most data detection changes.

However there are a few other functions have been supported for some time. There are CHECKSUM, BINARY_CHECKSUM, and CHECKSUM_AGG that exist. You shouldn't use these because of a high chance of collisions. I'll repeat that.

Don't use these!

If you are detecting data changes, consider this code:

select 'LE' 'String', checksum('le') 'CheckSumHash'

It returns

hash1

Now, consider that you run this:

select 'AAAAAAAAAAAAAAAALE' 'String', checksum('AAAAAAAAAAAAAAAALE') 'CheckSumHash'

you get this:

hash2

Not exactly what you were expecting!

The issue is that there are changes of collisions with these functions, perhaps too high to take the chance of two values being incorrectly compared. I would be very careful about using these functions, and if you do, be sure that you note this potential issue to all support people. If the application appears to be missing changes because of a checksum, be sure you do not have any type of collision taking place.

Comments
 

matt stockham said:

I don't think you should state that checksum etc shouldn't be used; hashsum may be a better solution but it's still a hash and still runs the risk of collision.

June 3, 2009 9:57 PM
 

matt stockham said:

Of course I meant hashBYTES may be a better solution .....

June 3, 2009 9:58 PM
 

Chris Chilvers said:

Something else that can help is to send the length of the data with the hash, less chance (but not zero chance) two pieces of data having the same hash and the same length.

June 9, 2009 2:52 AM
 

Michael Coles said:

Matt - I ran some really simple (unscientific) tests of HashBytes SHA-1 versus CHECKSUM against a "real" data set (the list of the most common names in the United States from the U.S. Census Bureau).  HashBytes SHA-1 generated zero collisions.  CHECKSUM generated collisions as frequently as 1 in 10 (depending on the combinations of data).  The odds of an SHA-1 collision are on the order of 1 in 2^160 (roughly about 1 in 10,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000).  As Steve demonstrated the CHECKSUM function "repeats" every 16 characters.

Chris, try this:

SELECT CHECKSUM('LE'), LEN('LE')

SELECT CHECKSUM('MU'), LEN('MU')

As you can see the length of the string isn't a big help with CHECKSUM. For more examples try this one:

WITH X

AS

(

SELECT 'A' AS C

UNION ALL

SELECT 'B'

UNION ALL

SELECT 'C'

UNION ALL

SELECT 'D'

UNION ALL

SELECT 'E'

UNION ALL

SELECT  'F'

UNION ALL

SELECT  'G'

UNION ALL

SELECT  'H'

UNION ALL

SELECT  'I'

UNION ALL

SELECT  'J'

UNION ALL

SELECT  'K'

UNION ALL

SELECT  'L'

UNION ALL

SELECT  'M'

UNION ALL

SELECT  'N'

UNION ALL

SELECT  'O'

UNION ALL

SELECT  'P'

UNION ALL

SELECT  'Q'

UNION ALL

SELECT  'R'

UNION ALL

SELECT  'S'

UNION ALL

SELECT  'T'

UNION ALL

SELECT  'U'

UNION ALL

SELECT  'V'

UNION ALL

SELECT  'W'

UNION ALL

SELECT  'X'

UNION ALL

SELECT  'Y'

UNION ALL

SELECT  'Z'

)

SELECT X1.C + X2.C, CHECKSUM(X1.C + X2.C)

FROM X X1

CROSS JOIN X X2

ORDER BY CHECKSUM(X1.C + X2.C)

Try substituting HashBytes for CHECKSUM in the above and you won't see a single collision.

June 9, 2009 9:31 AM
 

Steve Jones said:

It's Michael's book, so he posted his code, and this is a good reason to buy it. You'll learn some interesting things.

June 9, 2009 9:37 AM
 

mark_donskoy said:

I think one simple possible solution to the collisions would be calculating 2 HASHBYTES - one by SHA-1 and another by SHA-2 algorithms.

Then one can either compare both pairs of HASHBYTES or get just one string that is some of those 2 functions.

I think the chance that 2 different strings would get the same 2 HASHBYTES are close to 0.

June 9, 2009 9:42 AM
 

Michael Coles said:

Correction to previous post - there should only be 48 zeroes behind that 1, not 49.  Still really low odds of a collision.

Mark - 1:10^48 is pretty close to zero, but some people do go with the "belt-and-suspenders" method and apply multiple hashes.  The HashBytes function doesn't natively support SHA-2, but you could use MD5 as the second hash or get SHA-2 functionality through SQL CLR.

June 9, 2009 11:03 AM
 

Renato Buda said:

One thing not mentioned in the article is the support for binary_checksum(*).

This could be used to readily detect changes on a row without coding every column. (It could if it was reliable anyway).

Does anyone know how to get this kind of programming shortcut with HashBytes?

June 16, 2009 6:25 AM
 

Jacob Luebbers said:

@Renato: The naive approach would be to simply cast all columns to a binary datatype, concatenate then together and pass that to HashBytes(). It should work fine, but performance may not be very good. Test and see :-)

July 2, 2009 10:05 PM
 

vanram said:

1) RSA Labs says that SHA-1 is more 'collision-free' than MD5

2) RSA Labs says Researchers (1994) brute force attach estimate a collision using MD5 on avg after 24 days running on a $10M system.

What i have not seen explained by RSA labs or here is what scenario will result in the collision:

a) 2 radically different strings

b) 2 very similar strings

The reason i want to know is that if the answer is (a) and not (b) then i can use Hashbytes as the technique for recognising change of none key columns of a row for an ETL load. BUT if (b) is the problem, and by definition the values wil mostly be similar will the weight the result to creating frequent collisions?

July 29, 2009 12:50 PM
Leave a Comment
Only members of SQLServerCentral may leave comments. Register now for your free account or Sign-In if you are already a member.