hashbytes

  • Hi ,
    For the same 'hello' string why I am getting 3 different answers. Where am I wrong?  Does anybody know the reason?
    How to ensure that I get same answer even when using different methods?
    1)
    create table #temp(name nvarchar(6));
    insert into #temp values('hello');
    select hashbytes('MD5',(select X.* from (values(null))foo(bar) for xml auto)) from #temp as X; 
    output->  0x5C421374C552E349F699FD19C9E2B469

    2)
    select hashbytes('MD5',name) from #temp; 
    output->  0xFD186DD49A16B1BF2BD2F44E495E14C9

    3)
    select HASHBYTES('MD5','hello');
    Output-> 0x5D41402ABC4B2A76B9719D911017C592

    -MJ

  • hashbytes is contingent on data type:

    select HASHBYTES('MD5','hello'), HASHBYTES('MD5',N'hello')

    gives different results.

  • In #1 you are hashing the NVarChar string "<foo name="hello"/>"
    In #2 you are hashing the NVarChar string "hello"
    In #3 you are hashing the VarChar string "hello"

    These are all different things, so will always produce different results when hashing. You need to make sure you are always dealing with the right thing before you hash it.

    select hashbytes('MD5',N'<foo name="hello"/>') -- will match 1
    select HASHBYTES('MD5',N'hello');  -- will match 2
    select hashbytes('MD5',Cast(name as VarChar(6))) from #temp;  -- will match 3

  • Thanks Andy, Bert for your answers.

    Your answers give an indication that there will be a lot of challenges for implementing a scenario like below one...

    "A table(in real scenario many tables..) data with multiple columns and rows is pulled over network into a different environment(say target).
    Using sql server I take hashbytes of these tables(end result needs to be 1 hashcode per table... and not for each individual columns) and also at other end(target), I use same hash function(one that has been tested to give same code for same strings..  e.g.  at sql server if say MD5(or any other hash function) for 'hello' gives some code 12345 
    and a different programming language(python etc..) which uses MD5(or any other hash function) for 'hello' also gives me same code 12345). And finally compare these codes per table to verify data correctness."

    Will there be any challenges in this approach...  like say...
    a) the way sql server treats nvarchar, varchar,decimal,bigint,int  and the way a programming language(python/R) treats its data types...?
    b) handling of NULLS and decimal data
    c) a method used to get 1 hashcode per table at sql server(e.g. xml auto..),  but same method may not be available at target end.

    What would be your suggestions?
    Thanks for your help...

    -MJ

  • First things first... stop using MD5.  Seriously... stop using it NOW and pretend it never existed.  Its old, it's slow, it's fairly easy for hackers to break especially if they're using a rack of GPUs, it's been deprecated throughout the industry for a long time, and it's no longer supported.

    Second, hashing algorithms work at the byte level so no matter which RDBMS or application development code you're using, if the stored result varies at the byte level (and it almost always will), then you can expect any and all hashing algorithms to produce a different result.

    3rd, stop trying to build your own decoder ring.  People do some really stupid things like using the incredible SHA2-512 algorithm to hash things like SSNs but, because they didn't heavily salt the data, even a 1 CPU SQL Server can decode a million of them in seconds because there are only 10 Billion possible SSNs.  If you have data to protect, then protect it properly and that includes (especially) data at rest.

    Here's a link that tells you how to do it right except don't use things like MD5 and any of the SHA1 series.  Google has more..
    https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/encrypt-a-column-of-data?view=sql-server-2017
    https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/choose-an-encryption-algorithm?view=sql-server-2017

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I think he's planning on using hashing as a way of check-summing data, not as some sort of budget encryption, which is an entirely apt use. You're right about MD5 though, it's weak and more prone to collisions than newer hashing methods (which weakens the guarantee of data integrity).

    The only way to go about this is to cast everything to a standard format (probably nvarchar(max)), combine them and hash them, something like:


    select
       hashbytes('sha2_512',
             string_agg(
                cast(name as nvarchar(max))
                + cast(system_type_id as nvarchar(max))
             ,'')
        )
    from sys.columns

    Bear in mind you'll need to consider how things like dates and numeric values get converted to ensure that your hash at both ends is being calculated across the same data format.

  • Thanks Andy,
    You are right,  this code needs to only ensure that data at both end is exactly same. And not work on encryption or adding any security feature...
    If the data format at other end changes then how to handle that scenario?

    -MJ

  • If you transform the data at the other end, a hash based solution is unlikely to help much. About the only way to make that work is have a bunch of code that performs the transform in reverse (thus theoretically turning your data back into what you had originally) and then compare the hashes there. 

    Whether that ends up more difficult than just writing specific validation logic that compares against the source data really depends upon how complex your transformations are.

  • IF that's the case and understanding that

    mahnj18 - Monday, August 6, 2018 3:56 PM

    Thanks Andy,
    You are right,  this code needs to only ensure that data at both end is exactly same. And not work on encryption or adding any security feature...
    If the data format at other end changes then how to handle that scenario?

    -MJ

    If that's the case, then what are you sending to?  Another SQL Server?  Of does it absolutely need to be some text based file?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply