One of the problems with hashing is that you can have collisions from values that are not very similar. This means that if you are using hashing as a way to identify similar values, you need to make further checks with the original data after the hash matches are gathered.
This post will show a few examples of the collisions that can occur if you use the CHECKSUM() or BINARY_CHECKSUM() functions.
If we examine this code:
-- Checksum declare @i varchar(200) , @j varchar(200); select @i = 'LE'; select @j = 'AAAAAAAAAAAAAAAALE'; select Plaintext = @i , checksum = CHECKSUM(@i) UNION ALL SELECT Plaintext = @j , checksum = CHECKSUM(@j); GO
This returns a result like this:
Note that these two values are the same as far as the checksum hash goes.
If we switch to BINARY_CHECKSUM(), we can get similar results.
-- binary_checksum is no better declare @i varchar(200) , @j varchar(200) , @k varchar(200); select @i = 'LE' select @j = 'Ou' select @k = 'MU' select Plaintext = @i , BINARY_CHECKSUM(@i) UNION ALL SELECT Plaintext = @j , BINARY_CHECKSUM(@j) UNION ALL SELECT Plaintext = @k , BINARY_CHECKSUM(@k) GO
While these two functions can be useful, you do have to be careful with the results. A matching hash from these functions does not mean that the source data is the same.
Filed under: Blog Tagged: encryption, security, syndicated, T-SQL