hashbytes

Question

Post reply

hashbytes

mahnj18

Old Hand

Points: 399
More actions
August 2, 2018 at 12:20 pm

#367011

Hi ,
For the same 'hello' string why I am getting 3 different answers. Where am I wrong? Does anybody know the reason?
How to ensure that I get same answer even when using different methods?
1)
create table #temp(name nvarchar(6));
insert into #temp values('hello');
select hashbytes('MD5',(select X.* from (values(null))foo(bar) for xml auto)) from #temp as X;
output-> 0x5C421374C552E349F699FD19C9E2B469
2)
select hashbytes('MD5',name) from #temp;
output-> 0xFD186DD49A16B1BF2BD2F44E495E14C9
3)
select HASHBYTES('MD5','hello');
Output-> 0x5D41402ABC4B2A76B9719D911017C592
-MJ

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Bert-701015 SSCommitted Points: 1615 More actions · Answer 1

hashbytes is contingent on data type:

select HASHBYTES('MD5','hello'), HASHBYTES('MD5',N'hello')

gives different results.

andycadley SSCertifiable Points: 5296 More actions · Answer 2

In #1 you are hashing the NVarChar string "<foo name="hello"/>"
In #2 you are hashing the NVarChar string "hello"
In #3 you are hashing the VarChar string "hello"

These are all different things, so will always produce different results when hashing. You need to make sure you are always dealing with the right thing before you hash it.
select hashbytes('MD5',N'<foo name="hello"/>') -- will match 1 select HASHBYTES('MD5',N'hello'); -- will match 2 select hashbytes('MD5',Cast(name as VarChar(6))) from #temp; -- will match 3

mahnj18 Old Hand Points: 399 More actions · Answer 3

Thanks Andy, Bert for your answers.

Your answers give an indication that there will be a lot of challenges for implementing a scenario like below one...

"A table(in real scenario many tables..) data with multiple columns and rows is pulled over network into a different environment(say target).
Using sql server I take hashbytes of these tables(end result needs to be 1 hashcode per table... and not for each individual columns) and also at other end(target), I use same hash function(one that has been tested to give same code for same strings.. e.g. at sql server if say MD5(or any other hash function) for 'hello' gives some code 12345
and a different programming language(python etc..) which uses MD5(or any other hash function) for 'hello' also gives me same code 12345). And finally compare these codes per table to verify data correctness."

Will there be any challenges in this approach... like say...
a) the way sql server treats nvarchar, varchar,decimal,bigint,int and the way a programming language(python/R) treats its data types...?
b) handling of NULLS and decimal data
c) a method used to get 1 hashcode per table at sql server(e.g. xml auto..), but same method may not be available at target end.

What would be your suggestions?
Thanks for your help...

-MJ

Jeff Moden SSC Guru Points: 1004703 More actions · Answer 4

First things first... stop using MD5. Seriously... stop using it NOW and pretend it never existed. Its old, it's slow, it's fairly easy for hackers to break especially if they're using a rack of GPUs, it's been deprecated throughout the industry for a long time, and it's no longer supported.

Second, hashing algorithms work at the byte level so no matter which RDBMS or application development code you're using, if the stored result varies at the byte level (and it almost always will), then you can expect any and all hashing algorithms to produce a different result.

3rd, stop trying to build your own decoder ring. People do some really stupid things like using the incredible SHA2-512 algorithm to hash things like SSNs but, because they didn't heavily salt the data, even a 1 CPU SQL Server can decode a million of them in seconds because there are only 10 Billion possible SSNs. If you have data to protect, then protect it properly and that includes (especially) data at rest.

Here's a link that tells you how to do it right except don't use things like MD5 and any of the SHA1 series. Google has more..
https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/encrypt-a-column-of-data?view=sql-server-2017
https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/choose-an-encryption-algorithm?view=sql-server-2017

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

andycadley SSCertifiable Points: 5296 More actions · Answer 5

I think he's planning on using hashing as a way of check-summing data, not as some sort of budget encryption, which is an entirely apt use. You're right about MD5 though, it's weak and more prone to collisions than newer hashing methods (which weakens the guarantee of data integrity).

The only way to go about this is to cast everything to a standard format (probably nvarchar(max)), combine them and hash them, something like:

select hashbytes('sha2_512', string_agg( cast(name as nvarchar(max)) + cast(system_type_id as nvarchar(max)) ,'') ) from sys.columns

Bear in mind you'll need to consider how things like dates and numeric values get converted to ensure that your hash at both ends is being calculated across the same data format.

mahnj18 Old Hand Points: 399 More actions · Answer 6

Thanks Andy,
You are right, this code needs to only ensure that data at both end is exactly same. And not work on encryption or adding any security feature...
If the data format at other end changes then how to handle that scenario?

-MJ

andycadley SSCertifiable Points: 5296 More actions · Answer 7

If you transform the data at the other end, a hash based solution is unlikely to help much. About the only way to make that work is have a bunch of code that performs the transform in reverse (thus theoretically turning your data back into what you had originally) and then compare the hashes there.

Whether that ends up more difficult than just writing specific validation logic that compares against the source data really depends upon how complex your transformations are.

Jeff Moden SSC Guru Points: 1004703 More actions · Answer 8

IF that's the case and understanding that

mahnj18 - Monday, August 6, 2018 3:56 PM
Thanks Andy,
You are right, this code needs to only ensure that data at both end is exactly same. And not work on encryption or adding any security feature...
If the data format at other end changes then how to handle that scenario?
-MJ

If that's the case, then what are you sending to? Another SQL Server? Of does it absolutely need to be some text based file?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)