HASHBYTES can help quickly load a Data Warehouse

  • L' Eomot Inversé (10/9/2011)


    Langston Montgomery (10/8/2011)


    Has anyone come across cases where SHA1 has produced output longer than 20 bytes? Because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes even though all the column values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!

    But, upon looking closer, the HASHBYTE function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF4900

    But, the value in our varbinary(20) column in the table was storing this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49

    I don't think "longer than 20 bytes" is relevant, as the number ending in 00 is 20 bytes (the one without the 00 is only 19).

    You're right, L'Eomot. I wrote that wrong. What I was trying to say was that HASHBYTES was returning a 20-byte value, but when SQL wrote it to the table it somehow got truncated to 19.

    Notice the two zeros at the end of the function's output. For some reason, when SQL writes the output from the function to the table, it truncates those last two "characters" at the end!

    We have no idea why it does that, but when we increased the varbinary column in the table from 20 to 40, SQL stopped truncating the output, and all was well.

    Anyway, if anyone has any insight into why SQL seems to be truncating the output from the function as it writes to the varbinary(20) column, please let me know; I'm curious!

    I tried a few things to see if I could reproduce this behaviour, using hashbytes to generate values for local variables, for columns in a table variable, for columns in temp tables, and for columns in permanent tables, but got no truncation at all when using varbinary(20). Maybe I'm on a different version - I'm using SQL 2008 R2.

    We're using SQL 2008 R2 and SSIS 2008. We are also using the OLE DB Destination component with fast load, nolock, and "don't check constraints" to load the type 2 changes to the table. Wanna read something funny?... I reduced the varbinary column from 40 back to 20 expecting SQL to truncate it again... and it doesn't! It's almost like something had to be reset. I thought it was ANSI_PADDING or something, but SSIS wouldn't turn that off... I don't think.

    Anyway, thanks for your response, L'Eomot. I don't like when things work without knowing what fixed it, but we're good for now!

  • !!False Alarm!!

    I feel like such an idiot. I'm almost embarrassed to admit this, but out of courtesy, I thought I would give an update in case anyone else makes the same oversight I did.

    You guessed it. The ANSI_PADDING was turned off!! Doh!! This was a newly created database. Rookie mistake. Luckily, worst case, our data warehouse will have a few thousand phantom type 2 updates, which won't cause any noise, aside from some space being used up for nothing. (A small repro of the ANSI_PADDING behaviour follows below.)

    Thanks to all for taking the time to read this. I'm gonna go dig a hole and stick my head in it for a while!
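
    For anyone who hits the same thing, here is a minimal, hypothetical T-SQL repro of the trailing-zero trimming (the table names are made up; run it in a scratch database). The ANSI_PADDING setting in effect when the column is created controls the behaviour: with it OFF, trailing 0x00 bytes are trimmed from varbinary values on insert.

        -- Column created while ANSI_PADDING is OFF: trailing zero bytes get trimmed.
        SET ANSI_PADDING OFF;
        CREATE TABLE dbo.HashDemo_PadOff (RowHash varbinary(20));

        -- Column created while ANSI_PADDING is ON (the default and recommended setting).
        SET ANSI_PADDING ON;
        CREATE TABLE dbo.HashDemo_PadOn (RowHash varbinary(20));

        -- The 20-byte SHA1 value from the post above, which happens to end in 0x00.
        INSERT dbo.HashDemo_PadOff VALUES (0xEE0816119692A1C4DCC2045EA1E37CD47EFF4900);
        INSERT dbo.HashDemo_PadOn  VALUES (0xEE0816119692A1C4DCC2045EA1E37CD47EFF4900);

        SELECT 'PadOff' AS CreatedWith, DATALENGTH(RowHash) AS StoredBytes FROM dbo.HashDemo_PadOff  -- 19
        UNION ALL
        SELECT 'PadOn',                 DATALENGTH(RowHash)                FROM dbo.HashDemo_PadOn;  -- 20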

  • Hi,

    I've found the SHA-256 algorithm meets my needs so far in ETL loads.

    I use a C# script task within SSIS to compare hash values in the Dimension tables, which obviously by their nature tend to be smaller than Fact tables.

    Using the cryptography namespace in C# gets around some of the limitations of the T-SQL Hashbytes function specifically with data types and NULL handling.

    The trick for Fact tables is to implement Change Tracking at source (less overhead than CDC), which can then be used for net data extraction based on the primary keys returned by the CHANGETABLE function. The incremental loads are therefore tiny (provided the gap between data pulls is reasonably small) compared to the initial load.

    Annoyingly, in the source system I work with we have updates, inserts and deletes on production transactional tables, so all of these have to be handled by the ETL process. Nevertheless, with the appropriate SSIS tasks we can achieve good performance, and the main thing is that the update process on the Fact table is still a batch process rather than a row-based process.

    The T-SQL MERGE statement is really powerful for comparing rows as an alternative to hashing algorithms. So far I'm using HASHBYTES, and provided there is no loss of synchronisation via the Change Tracking mechanism, this can also be used for row comparisons on Fact as well as Dimension tables.
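
    To make the Change Tracking piece concrete, here is a hedged sketch of a net-change pull using CHANGETABLE. It assumes change tracking is already enabled on the database and table; dbo.SourceFact, FactKey and @LastSyncVersion are illustrative names, not anything from the post above.

        -- Version recorded at the end of the previous extract.
        DECLARE @LastSyncVersion bigint = 12345;

        -- All rows changed since that version, with the operation that changed them.
        SELECT  ct.FactKey,
                ct.SYS_CHANGE_OPERATION,       -- 'I', 'U' or 'D', used to drive the warehouse merge
                s.*                            -- current column values (NULL for deleted rows)
        FROM    CHANGETABLE(CHANGES dbo.SourceFact, @LastSyncVersion) AS ct
        LEFT JOIN dbo.SourceFact AS s
                ON s.FactKey = ct.FactKey;

        -- Store this as the baseline for the next incremental pull.
        SELECT CHANGE_TRACKING_CURRENT_VERSION() AS NextSyncVersion;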

  • Hi there,

    I have one question. I'm using SQL Server 2016 and I have a table with 150 columns of multiple data types.
    I implemented HASHBYTES (SHA1) for a MERGE statement (insert/update).
    I notice that sometimes (I have around 3 million rows) it does not detect changes
    (for example, it did not detect 7 rows that needed to be updated out of 3 million).
    Is there any limitation on how many columns you can have in the HASHBYTES function?

    Thanks,
    Oded Dror

  • There is an 8KB limit on the input to the HASHBYTES function. Make sure your concatenated columns don't exceed that (it's a byte limit, not necessarily a character limit).
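
    One hedged way to check whether that limit is in play (the table and column names below are purely illustrative) is to measure the longest concatenated input actually being fed to HASHBYTES:

        -- Longest hash input, in bytes, across the table (SQL Server 2012+ for CONCAT).
        -- The nvarchar(max) cast stops CONCAT itself from capping the result at 8,000 bytes.
        SELECT MAX(DATALENGTH(CONCAT(CONVERT(nvarchar(max), Col1), N'|', Col2, N'|', Col3))) AS MaxInputBytes
        FROM   dbo.MyWideTable;
        -- Anything over 8,000 bytes exceeds the pre-2016 HASHBYTES input limit.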

  • I'll add that if HASHBYTES is different between any two rows, then the rows are guaranteed to be different but, if they are the same,
    there's not necessarily a guarantee that the rows are the same.  You need to verify that rows with the same hashing are, in fact, the
    same.  I agree that the chances are very slim that they're not the same but you've already found out that "slim" doesn't mean "never". πŸ˜‰

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
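
    One hedged way to act on Jeff's verification step above (the schema, table and column names are invented for illustration): treat a matching hash as a cheap pre-filter, then confirm with a NULL-safe column comparison, for example NOT EXISTS over INTERSECT.

        -- Rows whose hashes claim "unchanged" but whose columns actually differ.
        SELECT  s.BusinessKey
        FROM    staging.Customer AS s
        JOIN    dw.DimCustomer   AS d
                ON  d.BusinessKey = s.BusinessKey
                AND d.RowHash     = s.RowHash          -- hashes match...
        WHERE   NOT EXISTS (SELECT s.FirstName, s.LastName, s.Email
                            INTERSECT
                            SELECT d.FirstName, d.LastName, d.Email);  -- ...yet a column differs (NULL-safe)

    Any rows this returns are either genuine hash collisions or a concatenation problem, and should be handled by a full column comparison instead.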

  • Francis Rodrigues-459442 - Wednesday, December 6, 2017 7:17 PM

    There is an 8KB limit on the input to the HASHBYTES function. Make sure your concatenated columns don't exceed that (it's a byte limit, not necessarily a character limit).

    Francis,

    Thank you

    Oded Dror

  • Jeff Moden - Wednesday, December 6, 2017 8:52 PM

    I'll add that if HASHBYTES is different between any two rows, then the rows are guaranteed to be different but, if they are the same,
    there's not necessarily a guarantee that the rows are the same.  You need to verify that rows with the same hashing are, in fact, the
    same.  I agree that the chances are very slim that they're not the same but you've already found out that "slim" doesn't mean "never". πŸ˜‰

    Jeff,

    Thank you
    Oded Dror

  • odeddror - Wednesday, December 6, 2017 9:22 PM

    Francis Rodrigues-459442 - Wednesday, December 6, 2017 7:17 PM

    There is an 8KB limit on the input to the HASHBYTES function. Make sure your concatenated columns don't exceed that (it's a byte limit, not necessarily a character limit).

    Francis,

    Thank you

    Oded Dror

    Since you are using SQL Server 2016 there is no longer an input limit of 8000 bytes to the HASHBYTES function.

    https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql
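
    A quick, illustrative check of that on SQL Server 2016 or later (the variable name is made up):

        -- ~200,000 bytes of input; accepted by HASHBYTES from SQL Server 2016 onwards.
        DECLARE @big nvarchar(max) = REPLICATE(CONVERT(nvarchar(max), N'x'), 100000);
        SELECT DATALENGTH(@big)            AS InputBytes,
               HASHBYTES('SHA2_256', @big) AS BigInputHash;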

  • Lynn Pettis - Wednesday, December 6, 2017 10:35 PM

    Since you are using SQL Server 2016 there is no longer an input limit of 8000 bytes to the HASHBYTES function.

    https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql

    Francis,

    Thanks again,

    Oded Dror

  • odeddror - Thursday, December 7, 2017 5:52 AM

    Francis,

    Thanks again,

    Oded Dror

    Francis?

  • odeddror - Wednesday, December 6, 2017 6:21 PM

    Hi there,

    I have one question. I'm using SQL Server 2016 and I have a table with 150 columns of multiple data types.
    I implemented HASHBYTES (SHA1) for a MERGE statement (insert/update).
    I notice that sometimes (I have around 3 million rows) it does not detect changes
    (for example, it did not detect 7 rows that needed to be updated out of 3 million).
    Is there any limitation on how many columns you can have in the HASHBYTES function?

    Thanks,
    Oded Dror

    First: Make sure you're comparing the entire 20 byte output of SHA1.
    Second: Try changing from SHA1 to SHA2_512, and make certain you're comparing the entire 64 byte output.  This should significantly reduce collisions.
      Also, any rows that are missed (changes not detected) should be different rows than for SHA1, statistically.

    If SHA2_512 has the same 7 missing rows out of 3 million on SQL 2016, the problem is not hash collision, the problem is that you're not correctly concatenating the data together.
    If you're just concatenating
    1       23
    and
    12      3
    both end up as
    123

    while if you're concatenating with, say, a comma delimiter, then the much less likely
    Bob,    Jones
    and
    Bob     ,Jones
    both end up as
    Bob,,Jones
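
    A hedged T-SQL sketch of that delimiting advice (the column names and the '{null}' sentinel are invented for illustration): convert each column explicitly, substitute a sentinel for NULL so that NULL and empty string hash differently, and separate every column with a delimiter.

        -- Every column delimited, every NULL made explicit, before hashing.
        SELECT HASHBYTES('SHA2_512',
                   CONCAT(ISNULL(CONVERT(nvarchar(4000), CustomerName),      N'{null}'), N'|',
                          ISNULL(CONVERT(nvarchar(4000), CreditLimit),       N'{null}'), N'|',
                          ISNULL(CONVERT(nvarchar(30),   ModifiedDate, 126), N'{null}'))) AS RowHash
        FROM   dbo.SourceTable;

    The delimiter still matters: pick something that cannot appear in the data, or the '1' + '23' versus '12' + '3' ambiguity above comes straight back.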

  • Nadrek - Thursday, December 7, 2017 9:29 AM

    First: Make sure you're comparing the entire 20 byte output of SHA1.
    Second: Try changing from SHA1 to SHA2_512, and make certain you're comparing the entire 64 byte output.  This should significantly reduce collisions.
      Also, any rows that are missed (changes not detected) should be different rows than for SHA1, statistically.

    If SHA2_512 has the same 7 missing rows out of 3 million on SQL 2016, the problem is not hash collision, the problem is that you're not correctly concatenating the data together.
    If you're just concatenating
    1       23
    and
    12      3
    both end up as
    123

    while if you're concatenating with, say, a comma delimiter, then the much less likely
    Bob,    Jones
    and
    Bob     ,Jones
    both end up as
    Bob,,Jones

    Lynn,

    Thank you

    Oded Dror

  • Jeff Moden - Wednesday, December 6, 2017 8:52 PM

    I'll add that if HASHBYTES is different between any two rows, then the rows are guaranteed to be different but, if they are the same,
    there's not necessarily a guarantee that the rows are the same.  You need to verify that rows with the same hashing are, in fact, the
    same.  I agree that the chances are very slim that they're not the same but you've already found out that "slim" doesn't mean "never". πŸ˜‰

    Depending on how many different rows are possible and how many different rows you actually have, "slim" may be quite close to "never" or a long way from it.  Roughly speaking, the chance that two different rows picked at random have the same SHA1 hash is about 1 in 10**48 (a 1 followed by 48 zeros), but if you've got 3 million different rows picked at random, the chance that some pair of them shares a hash is about 4.5 trillion times that (one for each distinct pair of rows), so roughly 1 in 10**35, which is a lot further from zero.

    Tom
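
    For reference, a back-of-envelope version of Tom's figures, assuming SHA1's 160-bit output and treating the hashes as uniformly random:

        \[
        P(\text{collision}) \approx \frac{\binom{n}{2}}{2^{160}}
          = \frac{3\times10^{6}\,(3\times10^{6}-1)/2}{2^{160}}
          \approx \frac{4.5\times10^{12}}{1.46\times10^{48}}
          \approx 3\times10^{-36}
        \]

    i.e. about 1 in 3 x 10**35 for a 3-million-row table, and far smaller still if you move to SHA2_512.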

  • TomThomson - Friday, December 8, 2017 4:03 PM

    Depending on how many different rows are possible and how many different rows you actually have, "slim" may be quite close to "never" or a long way from it.  Roughly speaking, the chance that two different rows picked at random have the same SHA1 hash is about 1 in 10**48 (a 1 followed by 48 zeros), but if you've got 3 million different rows picked at random, the chance that some pair of them shares a hash is about 4.5 trillion times that (one for each distinct pair of rows), so roughly 1 in 10**35, which is a lot further from zero.

    Jeff,

    Thank you
    Oded Dror
