HASHBYTES can help quickly load a Data Warehouse

  • Hi ...

    The reason for converting to upper case is that often users make changes to the data by simply adding the same details in all capital letters (for various reasons). We do not want to pick this up as change in the data, if all that changes is the case.

    Cheers

  • magarity kerns (4/13/2010)


    Comments posted to this topic are about the item <A HREF="/articles/Data+Warehouse/69679/">HASHBYTES can help quickly load a Data Warehouse</A>

    Hi great Article thanks!

    Do you have any time comaprision between the 'method mentioned in this article' Vs 'An all column comparison' method?

    Like what should be a minimum number of rows or what is minimum number of columns after which time consumed in calculating Hash will outdo the time aved in comparison.

  • Like what should be a minimum number of rows or what is minimum number of columns after which time consumed in calculating Hash will outdo the time aved in comparison.

    Calculating the hash is extremely lightweight as data goes through an ETL tool. Lightweight enough that it's a foregone conclusion and you can see from other comments that a lot of people are already using one hash function or another. I think you've missed the key architectural point - the 3 column xref tables dramatically reduce disk IO which is much more of a bottleneck than the cpu doing a hash function. The thing is to compare the width of the target table to the width of the xref table.

    Many data warehouse tables can be quite wide. Reading a 50 column wide target table just to get the hash value is many times slower than reading the 3 column wide xref table. With SQL Server this can be mitigated with an 'included column' on an index for the target table but some database engines don't have this. Earlier versions of SS don't have it. Also consider development time; it is faster to develop a standardized routine rather than if one mixes and matches techniques. Finally, data warehouses have a strong tendency to change so when a target table that's only 10 columns wide suddenly has a requirement for an additional 25 columns, you would have to rewrite its ETL if you picked the all column compare method because it was only 10 to start.

  • I think you've missed the key architectural point - the 3 column xref tables dramatically reduce disk IO which is much more of a bottleneck than the cpu doing a hash function. The thing is to compare the width of the target table to the width of the xref table.

    Many data warehouse tables can be quite wide. Reading a 50 column wide target table just to get the hash value is many times slower than reading the 3 column wide xref table.

    I now understand your point in using xref table, thanks for the clarification.

  • Hi Mike,

    I am trying to do what you said concatenate all values to varbinary and then hash

    when you say concatenate you mean add the binary value of each column as follows:

    DECLARE @ColList Varchar(max), @SQLStatment VARCHAR(MAX)

    SET @ColList = ''

    select @ColList = @ColList + 'convert(varbinary(max), isnull(' + Name + ',0) )+ ' from syscolumns where id = object_id('aggregate')

    SELECT @SQLStatment = 'SELECT ' + Substring(@ColList,1,len(@ColList)-1) + ' From aggregate'

    select @SQLStatment

    exec(@SQLStatment)

  • Hey guys,

    Check out this website, will probably solve your issues with Hash Bytes within SSIS:

    http://ssismhash.codeplex.com/

    God Bless.

  • How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?

    Is it simply a matter of setting it to be equal to the largest HashByte function input value?

    DECLARE @vHashResults AS VARBINARY(40)

    DECLARE @vHashInput01 AS CHAR(20)

    DECLARE @vHashInput02 AS CHAR(20)

    SET @vHashInput01 = 'thisisa20bytechar123'

    SET @vHashInput02 = 'thisisa21bytechar1234'

    SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)

  • wta306 (4/25/2011)


    How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?

    Is it simply a matter of setting it to be equal to the largest HashByte function input value?

    DECLARE @vHashResults AS VARBINARY(40)

    DECLARE @vHashInput01 AS CHAR(20)

    DECLARE @vHashInput02 AS CHAR(20)

    SET @vHashInput01 = 'thisisa20bytechar123'

    SET @vHashInput02 = 'thisisa21bytechar1234'

    SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)

    I would be somewhat horrified if HashBytes('SHA1', <anything>) delived a result with any length other than 160 bits (20 bytes)! That's what MS thinks it delivers too, according to BoL.

    Tom

  • Tom.Thomson (4/26/2011)


    wta306 (4/25/2011)


    How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?

    Is it simply a matter of setting it to be equal to the largest HashByte function input value?

    DECLARE @vHashResults AS VARBINARY(40)

    DECLARE @vHashInput01 AS CHAR(20)

    DECLARE @vHashInput02 AS CHAR(20)

    SET @vHashInput01 = 'thisisa20bytechar123'

    SET @vHashInput02 = 'thisisa21bytechar1234'

    SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)

    I would be somewhat horrified if HashBytes('SHA1', <anything>) delived a result with any length other than 160 bits (20 bytes)! That's what MS thinks it delivers too, according to BoL.

    Tom - thank you for the response.

    Unless I'm mistaken, MD5 is only 128 bits (16bytes)...and yet the author of the original article stores the resulting values in a VARBINARY(100)? What am I missing?

  • wta306 (4/27/2011)


    Unless I'm mistaken, MD5 is only 128 bits (16bytes)...and yet the author of the original article stores the resulting values in a VARBINARY(100)? What am I missing?

    Yes, MD5 is 16 bytes. I suspect VARBINARY(100) was used because it will cater for using SHA1 (20 bytes) and, when SHA2 family of hash functions (SHA256 - 32 bytes, SHA512 - 64 bytes, SHA224 -28 bytes, and SHA384 - 48 bytes) are supported VARBINARY(100) will cater for all of them; and maybe it will even cater for SHA3. Presumably we'll know when NIST announces the winner of the SHA3 design contest next year; I'm pretty sure I've seen something about lengths for SHA3, so maybe we could know already, but I don't recall what so maybe not; I suspect VARBINARY(100) may turn out to be too short to cover all variants of SHA3 (but it will be years before SQL Server supports SHA3).

    To be certain why VARBINARY(100) was used in the article we'll have to hope for a response from the author; but I'm reasonably happy with my guess that it was intended to cope with known future growth in hash sizes (to 512 bits) and have a little bit to spare. After all, it has to be at least VARBINARY(20) not BINARY(16) to cope with all hashes supported by SQL Server 2008 unless one uses different data type depending on which hash is being used (which would make changing from one hash to another twice as hard) and there's no extra overhead in using VARBINART(64) or VARBINARY(100) so that it can cover SHA2 instead of restricting it to MDn and SHA1 by using VARBINARY(20). If I'd been writing something about hashing and wanted to cover future possibilities I might even have chosen VARBINARY(256) in the hope of catering for SHA4 when/if that happens!

    Tom

  • Tom.Thomson (4/27/2011)


    wta306 (4/27/2011)


    Unless I'm mistaken, MD5 is only 128 bits (16bytes)...and yet the author of the original article stores the resulting values in a VARBINARY(100)? What am I missing?

    Yes, MD5 is 16 bytes. I suspect VARBINARY(100) was used because it will cater for using SHA1 (20 bytes) and, when SHA2 family of hash functions (SHA256 - 32 bytes, SHA512 - 64 bytes, SHA224 -28 bytes, and SHA384 - 48 bytes) are supported VARBINARY(100) will cater for all of them; and maybe it will even cater for SHA3. Presumably we'll know when NIST announces the winner of the SHA3 design contest next year; I'm pretty sure I've seen something about lengths for SHA3, so maybe we could know already, but I don't recall what so maybe not; I suspect VARBINARY(100) may turn out to be too short to cover all variants of SHA3 (but it will be years before SQL Server supports SHA3).

    To be certain why VARBINARY(100) was used in the article we'll have to hope for a response from the author; but I'm reasonably happy with my guess that it was intended to cope with known future growth in hash sizes (to 512 bits) and have a little bit to spare. After all, it has to be at least VARBINARY(20) not BINARY(16) to cope with all hashes supported by SQL Server 2008 unless one uses different data type depending on which hash is being used (which would make changing from one hash to another twice as hard) and there's no extra overhead in using VARBINART(64) or VARBINARY(100) so that it can cover SHA2 instead of restricting it to MDn and SHA1 by using VARBINARY(20). If I'd been writing something about hashing and wanted to cover future possibilities I might even have chosen VARBINARY(256) in the hope of catering for SHA4 when/if that happens!

    wow, ok - that makes a ton of sense. thank you!!

  • Hi,

    We use SHA1 and store the result as Varbinary(MAX).

    Do you see any problems with this ?

    Thanks

  • Cyclone (4/27/2011)


    Hi,

    We use SHA1 and store the result as Varbinary(MAX).

    Do you see any problems with this ?

    Thanks

    Yes, there is a problem: a varbinary(max) column can't be a key column in an index, so if you want to use your hash in an index you shouldn't use varbinary(max). That's the only problem I can think of.

    Tom

  • Has anyone come across cases where SHA1 has produced output longer than 20 bytes, because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes, even though all the columns values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!

    But, upon looking closer, the HASHBYTE function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF4900

    But, the value in our varbinary(20) column in the table was storing this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49

    Notice the two zeros at the end of the function's output. For some reason, when SQL writes the output from the function to the table it truncates those last two "characters" at the end!

    We have no idea why it does that, but when we increase the varbinary column in the table from 20 to 40, SQL doesn't truncate the output anymore, and all is well.

    Anyway, if anyone has any insight on why SQL seems to be truncating the output from the function as it writes to the varbinary(20) column, please let me know, I'm curious!

    L' Eomot Inversé (4/26/2011)


    wta306 (4/25/2011)


    How do you determine what size to make your VARBINARY field to hold the results of the HASHBYTES function?

    Is it simply a matter of setting it to be equal to the largest HashByte function input value?

    DECLARE @vHashResults AS VARBINARY(40)

    DECLARE @vHashInput01 AS CHAR(20)

    DECLARE @vHashInput02 AS CHAR(20)

    SET @vHashInput01 = 'thisisa20bytechar123'

    SET @vHashInput02 = 'thisisa21bytechar1234'

    SET @vHashResults = HashBytes('SHA1', @vHashInput01 + @vHashInput02)

    I would be somewhat horrified if HashBytes('SHA1', <anything>) delived a result with any length other than 160 bits (20 bytes)! That's what MS thinks it delivers too, according to BoL.

  • Langston Montgomery (10/8/2011)


    Has anyone come across cases where SHA1 has produced output longer than 20 bytes, because I think I have. We are using HASHBYTES to detect changes in our data warehouse, and we were getting type 2 changes, even though all the columns values were exactly the same between the most current and previous rows; even the hashbyte value looked the same!

    But, upon looking closer, the HASHBYTE function was returning this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF4900

    But, the value in our varbinary(20) column in the table was storing this: 0xEE0816119692A1C4DCC2045EA1E37CD47EFF49

    I don't think "longer that 20 bytes" is relevant, as the number ending 00 is 20 bytes (the one without 00 is only 19).

    Notice the two zeros at the end of the function's output. For some reason, when SQL writes the output from the function to the table it truncates those last two "characters" at the end!

    We have no idea why it does that, but when we increase the varbinary column in the table from 20 to 40, SQL doesn't truncate the output anymore, and all is well.

    Anyway, if anyone has any insight on why SQL seems to be truncating the output from the function as it writes to the varbinary(20) column, please let me know, I'm curious!

    I tried a few things to see if I could reproduce this behaviour, using hashbytes to generate values for local variables, for columns in table variable, for columns in temp tables, and for columns in premanent tables, but got no truncation ever when using varchar(20). Maybe I'm on a different version - I'm using SQL 2008 R2.

    Tom

Viewing 15 posts - 76 through 90 (of 108 total)

You must be logged in to reply to this topic. Login to reply