Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

The Voice of the DBA

Steve Jones is the editor of SQLServerCentral.com and visits a wide variety of data related topics in his daily editorial. Steve has spent years working as a DBA and general purpose Windows administrator, primarily working with SQL Server since it was ported from Sybase in 1990. You can follow Steve on Twitter at twitter.com/way0utwest

HASHBYTES – A T-SQL Function

Someone was asking if the HASHBYTES function was a good one to use in T-SQL as far as performance goes.. I wish I had a good reference for the function, but the best one I had on SQLServerCentral was this piece on using it to load a data warehouse. I also wrote an editorial on it not working with strings beyond 8k, which seems to be a bug, or a lack of resources devoted to ensuring string functions work with varchar(max).

The HASHBYTES function returns a hash of an input string. A hash is essential a calculation based on the values of the input, and two inputs that are the same, ought to produce the same hash. One catch with this function is that you provide the algorithm used, which can be one of these:

  • MD2
  • MD4
  • MD5
  • SHA
  • SHA1

Each of these produces different output, returning a varbinary(max) value. As an example, suppose I hash “Steve Jones”

SELECT HASHBYTES('MD2', 'Steve Jones') 'MD2'
UNION
SELECT HASHBYTES('MD4', 'Steve Jones') 'MD4'
UNION
SELECT HASHBYTES('MD5', 'Steve Jones') 'MD5'
UNION
SELECT HASHBYTES('SHA', 'Steve Jones') 'SHA'
UNION
SELECT HASHBYTES('SHA', 'Steve Jones') 'SHA1'

The results look like this:

MD2

———————————————-

0x27851A666BFCB4A35F971DD742CDA15F

0x2E978DE4841B1F3651A8DF4B2D2CF5F5C624A76B

0x75931813C7EAAEAB3CD1D8D621935903

0x979AC597C05CA6DE3A88C31A456D1125

As you can see, there’s a different hash for the same value using different algorithms. However if I were to compare the same string to itself, I can easily tell if something has changed. If the hashes aren’t the same, there’s a difference. I’m not sure this is a great use, but the more obvious use is that I can hash a password and then have the user enter their own version, hash it, and compare the results. In this way, the system never needs to know the value.

Just make sure you use the same algorithm Winking smile


Filed under: Blog Tagged: security, syndicated, T-SQL

Comments

Posted by DEK46656 on 6 July 2011

Steve,

I’ve been using HASHBYTES for a while now in my data warehouse work, but not for fingerprinting the data (I use CHECKSUM for that when needed).  I use HASHBYTES to create a surrogate key: technically, it’s a “synthetic key” if I understand Joe Celko correctly.

This all went back to reading different article and postings that Joe had relating to using IDENTITY to generate a key.  In one of the threads somewhere it finally came across (to me) why you shouldn’t use IDENTITY for a primary key;  IDENTITY is non-deterministic, and the key should be deterministic.  For me it was a paradigm shift once I figured that out.  Now I have a series of functions and approaches to generate keys based on chance of collisions, requisite fields involved, and data space.

Posted by Steve Jones on 6 July 2011

That's very interesting. I'm not sure I think that the key needs to be deterministic, but there is an argument there. However, if the data warehouse is "corrected", do you recompute the hash? Or do you leave it alone? In the latter case, it's not deterministic from the data anymore. In the former, this could have large performance implications.

Posted by DEK on 6 July 2011

This stuff is used in the dimension population, and is most often tied to a unique name.  I have also combined a number of attributes together into a VARCHAR(8000) string and fed that into HASHBYTES to produce a key.  In what I do, the name (amongst other things) doesn’t change.  If it changes for any reason, a new entry is made.  I also use it for T2SCD work, often tying the DATETIME of the entry into the key being generated.

Note that HASHBYTES is one of the tools I use for this: I’ve come up with a number of approaches to generating a deterministic key.  Sometimes I’ll have a series of attributes that represent something (CPU SLOT, CPU Number) that may be TINYINT is size: I convert them to BINARY, concatenate them together, and CONVERT that into INT.

Posted by paul.kohlmiller on 6 July 2011

I tried to create my own hash function that I wanted to work in a certain way. I wanted strings that are similar to have similar hash values. So the string "acetaminomphen 125 mg tablet" would have a closer hash value to "acetaminophen liquid" than it would to "aspirin 150 mg". But I abandoned that as a waste of time (both mine and the computer's). Now I'm thinking I have another reason for HASHBYTES - improving the performance of a SP that looks to see if a text value (that can be anywhere from 1 to 8000+ bytes) is already in a table or not. If I do it only for values longer than 512 bytes, it might work. The inability to index fields longer than 900 bytes is the motivation.

Posted by DEK on 6 July 2011

Paul,

I think you want SOUNDEX.  Try the following to see if its what you want.

DECLARE @tRange TABLE (

TestStr VARCHAR(8000)

)

INSERT @tRange (TestStr)

VALUES ('acetaminomphen 125 mg tablet'),

('acetaminophen liquid'),

('aspirin 150 mg')

SELECT TestStr, SndLike = SOUNDEX(TestStr)

FROM @tRange

Posted by jbreffni on 6 July 2011

HASHBYTES can be useful in a distributed database scenario.

e.g. if you want to distribute a database across 200 nodes by address

1.  get the hashbyte of the address - this is essentially a number

2.  get the modulus 200 of this number

3.  distribute the data to each node using the modulus to determine the node number

Leave a Comment

Please register or log in to leave a comment.