Blog Post

SQL Server Encryption - Hashing Collisions

,

A hash is a computation that transforms one set of data into another (hopefully smaller) set of data. So a hash on your 2,000 character blog post should generate a smaller, 10-20 byte value. In doing that, obviously there are many more possible 2,000 character sets of data then there are 10 or 20 byte sets of data. Most of those don't make sense since they would be random scrambling of character values, but they still exist.

When two or more large set sets of data were to generate the same 20byte hash, we have a collision. And collisions are bad.

Why?

The short answer is that you then can't be sure that the source data the 2,000 character strings are equivalent. Their hashes are, but they aren't. They could be, and that's one thing that you are usually using a has to check. For example, one common use of hashing is to detect if data has changed. You might calculate a hash from your application, and one from the database and compare them. If they're the same, you assume nothing has changed.

Why do you do this? It is often much less resource intensive than comparing all fields, or sending a large amount of data to (or from) the server. You could more easily send a 20 byte hash than a 2,000 byte chunk of data.

The HASHBYTES function in SQL Server can be used to generate hashes with a few algorithms, and SHA-1 is the best one to use. It has a low chance of collisions and is considered fairly secure by the cryptographic community. Not great, and it's been superseded by SHA-2, but it should work for most data detection changes.

However there are a few other functions have been supported for some time. There are CHECKSUM, BINARY_CHECKSUM, and CHECKSUM_AGG that exist. You shouldn't use these because of a high chance of collisions. I'll repeat that.

Don't use these!

If you are detecting data changes, consider this code:

select 'LE' 'String', checksum('le') 'CheckSumHash'

It returns

hash1

Now, consider that you run this:

select 'AAAAAAAAAAAAAAAALE' 'String', checksum('AAAAAAAAAAAAAAAALE') 'CheckSumHash'

you get this:

hash2

Not exactly what you were expecting!

The issue is that there are changes of collisions with these functions, perhaps too high to take the chance of two values being incorrectly compared. I would be very careful about using these functions, and if you do, be sure that you note this potential issue to all support people. If the application appears to be missing changes because of a checksum, be sure you do not have any type of collision taking place.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating