HASHBYTES can help quickly load a Data Warehouse

  • Mike C (4/13/2010)


    Nice article! I implemented something very similar for loading data warehouses and datamarts a while back. One thing to keep in mind is that you can actually use a BINARY(16) column to hold MD5 hashes or BINARY(20) for SHA-1, since they're guaranteed to return a binary hash of exactly 128 bits or 160 bits, respectively. Also, HASHBYTES is restricted in what it can hash -- there's an 8,000-byte restriction on the input, so you can't effectively hash LOB data or columns wider than 8,000 bytes with it. You can work around that limitation by chopping up your input or by generating the hashes in your ETL tool of choice (SSIS script components allow direct access to the System.Security.Cryptography namespace, for example).

    Thanks for the positive feedback!

    The tricky thing is that keeping the output as binary makes it hard on the eyes when pasted into an article's text for illustration. No, it doesn't have to be converted in actual use. Also, not all MD5 functions return binary; it depends on the ETL tool. I used HASHBYTES to stay on topic with SQL Server.
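
    For anyone following along, a minimal sketch of Mike C's storage suggestion; the table and column names here are hypothetical, invented just for illustration:

    -- Hypothetical cross-reference table: MD5 always returns 16 bytes and
    -- SHA-1 always returns 20 bytes, so fixed-width BINARY columns fit exactly.
    CREATE TABLE dbo.CustomerXref
    (
        CustomerKey INT        NOT NULL PRIMARY KEY,
        RowHashMD5  BINARY(16) NOT NULL,  -- 128-bit MD5
        RowHashSHA1 BINARY(20) NULL       -- 160-bit SHA-1, if preferred
    );

    -- HASHBYTES input is capped at 8,000 bytes, so LOB columns or very wide rows
    -- have to be split up or hashed in the ETL tool instead.
    INSERT INTO dbo.CustomerXref (CustomerKey, RowHashMD5)
    SELECT CustomerKey,
           HASHBYTES('MD5', CustomerName)  -- a single (N)VARCHAR column hashes directly
    FROM   dbo.Customer;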

  • tarquin (4/14/2010)


    I need to use multiple columns in the hash index and so use a checksum over hashbytes, such as:

    checksum(hashbytes('MD5', column1), hashbytes('MD5', column2)). This works pretty well but is still not 100% unique across all records (in fact, even a hashbytes on a single column is not 100% unique across all values).

    I'd suggest getting rid of the checksum and using hashbytes with the columns concatenated together into a single call to the function. This should result in a (reasonably) unique entry. Different data types just need to be converted to a common type.
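
    A rough sketch of that suggestion, with made-up column names; each column is explicitly converted to character data before being concatenated into a single HASHBYTES call (NULL handling and separators, which come up further down the thread, are left out here):

    SELECT CustomerKey,
           HASHBYTES('MD5',
               CONVERT(VARCHAR(100), CustomerName) +
               CONVERT(VARCHAR(30),  CreditLimit)  +
               CONVERT(VARCHAR(30),  LastOrderDate, 121)  -- style 121 keeps the date text unambiguous
           ) AS RowHash
    FROM   dbo.Customer;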

  • rmd0 (4/14/2010)


    One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns. It would usually occur when the column that changed was a bit field.

    This is not the case. Please see the web page linked at the end of the article concerning the likelihood of MD5 collisions.

  • Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert various types of columns into a common type and concatenate them together in a function that would also process the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you or you need to come up with a way to shorten them somehow. There's no reason why an extremely wide column couldn't be put into two or three checksums in the XREF table.

    Obviously this isn't the perfect method for everyone but I think some people will find it useful.

    Hi magarity,

    Thanks for your reply. The concatenation option had already crossed my mind, and yes, I agree that it is not the best way. I'm afraid that for comparing based on hash values we have to choose the best solution from the not-so-great options available.

    I did a little googling and came up with the following options:

    * Checksum function

    * Hashbytes function

    * SSIS Multiple Hash (3rd party)

    * Checksum Transformation from Konesans (3rd party)

    The problem I have with 3rd-party (open) components is that when you hit a bug, you're stuck. Hmmm...

    Keep writing articles!

    Greetz,

    Hennie

  • A few comments with the underlying assumption that the desired result is a (potentially) more efficient method that is guaranteed to return the same results as the "long way".

    For guaranteed accuracy, use whatever hash you like as a first pass. Whenever the hashes are different, the data is different. Whenever the hashes are the same, run a field by field (or equivalent) "long way" comparison to see if the data is actually the same after all.

    Note that field concatenation comparisons do not work when variable-length types are mixed in (trivial example: "a " + "b" and "a" + " b" are not the same row, yet both concatenate to "a b"; likewise, "1" + "1/22/1988" and "" + "11/22/1988" are not the same, yet both concatenate to "11/22/1988").

    If you deliberately decide that some non-zero percentage of erroneous matches are acceptable, then test your hash vs a "full length" compare to see what your particular data results in. In reality, there are only real results. Probabilities let you predict probabilities, but most business data I've seen doesn't follow "random input" or normal distribution rules.

    That said, if you rely on probability, plan for what you'll do after a collision happens.
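
    A sketch of that two-pass pattern, assuming hypothetical Stage and Target tables that both carry a precomputed RowHash column (NULL-safe comparisons omitted for brevity):

    -- Pass 1: different hashes mean the row definitely changed.
    -- Pass 2: equal hashes are re-checked column by column, so a hash
    --         collision can never hide a real change.
    SELECT s.CustomerKey
    FROM   dbo.Stage  AS s
    JOIN   dbo.Target AS t
      ON   t.CustomerKey = s.CustomerKey
    WHERE  s.RowHash <> t.RowHash
       OR (s.RowHash  = t.RowHash
           AND (s.CustomerName  <> t.CustomerName
             OR s.CreditLimit   <> t.CreditLimit
             OR s.LastOrderDate <> t.LastOrderDate));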

  • Good article. I'll need to reread it in more depth when I have a bit more time.

  • magarity kerns (4/14/2010)


    rmd0 (4/14/2010)


    One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns. It would usually occur when the column that changed was a bit field.

    This is not the case. Please see the web page linked at the end of the article concerning the likelihood of MD5 collisions.

    In my book, any likelihood = a certainty (at some point down the line). Do you have anything in place to catch collisions should they occur, or are you just hoping they won't?

    Hope isn't a strategy 🙂

  • tarquin (4/14/2010)


    Wrapping hashbytes with checksum is not bad as long as you include all columns in your join as indicated and create the index including all columns required for the join. If you only use the result of the checksum in your join - yes it is bad!

    If you're planning to compare all of the columns anyway then you don't need to generate a hash code.
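
    For readers unfamiliar with the pattern tarquin is describing, a rough sketch with made-up table and column names; the checksum match only narrows the index seek, and it is the full column list in the join that guarantees correctness:

    -- Checksum kept as a persisted computed column with a covering index.
    ALTER TABLE dbo.Target
        ADD RowChecksum AS CHECKSUM(CustomerName, CreditLimit, LastOrderDate) PERSISTED;

    CREATE INDEX IX_Target_RowChecksum
        ON dbo.Target (RowChecksum)
        INCLUDE (CustomerName, CreditLimit, LastOrderDate);

    -- The checksum match finds candidate rows cheaply; the column comparisons
    -- weed out any checksum collisions.
    SELECT s.CustomerKey
    FROM   dbo.Stage  AS s
    JOIN   dbo.Target AS t
      ON   t.RowChecksum   = CHECKSUM(s.CustomerName, s.CreditLimit, s.LastOrderDate)
      AND  t.CustomerName  = s.CustomerName
      AND  t.CreditLimit   = s.CreditLimit
      AND  t.LastOrderDate = s.LastOrderDate;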

  • magarity kerns (4/14/2010)


    hennie7863 (4/14/2010)


    I have not read the article in detail, so correct me if I'm wrong, but hashbytes accepts only one column (of one character type), and in this article the hashbytes calculation is kept simple by using one column.

    Currently I'm using the far-from-perfect checksum calculation (plus an additional field comparison) for comparing records. It accepts multiple columns and types. When indexed, this will be faster than the hashbytes functionality.

    This article drew my attention because the checksum is not perfect: multiple records will lead to the same checksum result. I'm interested in a solution using hashbytes with multiple columns and multiple types. Still looking for a good solution.

    Yes, I did make the example overly simplified because I just wanted to trigger an ah-ha! moment. The article was already running long without getting into a lot of what-if situations. To address your concern, it's pretty easy to convert various types of columns into a common type and concatenate them together in a function that would also process the MD5 sum. If the resulting length exceeds the HASHBYTES function's input limit, then this method isn't for you or you need to come up with a way to shorten them somehow. There's no reason why an extremely wide column couldn't be put into two or three checksums in the XREF table.

    Obviously this isn't the perfect method for everyone but I think some people will find it useful.

    Since HASHBYTES accepts varbinary input, and not all data types are implicitly convertible to varbinary (but nearly all are explicitly convertible), I'd recommend using varbinary as the common data type. Another alternative to generating multiple hashes and storing them in multiple columns for LOB data is to generate and concatenate multiple hashes on substrings (straightforward with a recursive CTE and FOR XML) and then hash the resulting concatenated hashes. That way you always end up with a single hash value. This is actually similar to the methods commonly used in cryptography to combine hashes from multiple pieces of data (the difference being that after concatenation they usually perform some additional bitwise logical operations).
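
    A rough sketch of that chunk-and-rehash idea: dbo.Document and its columns are hypothetical, an inline numbers list stands in for the recursive CTE, and the binary-to-hex CONVERT style assumes SQL Server 2008 or later.

    -- Split each VARCHAR(MAX) body into 8,000-byte chunks, hash each chunk,
    -- concatenate the chunk hashes as hex text, then hash that once more
    -- so each document still ends up with a single hash value.
    ;WITH Numbers AS
    (
        SELECT TOP (100)  -- supports up to 100 x 8,000 bytes per document
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ChunkNo
        FROM   sys.all_objects
    ),
    Chunks AS
    (
        SELECT d.DocId, n.ChunkNo,
               SUBSTRING(d.Body, (n.ChunkNo - 1) * 8000 + 1, 8000) AS Chunk
        FROM   dbo.Document AS d
        JOIN   Numbers AS n
          ON   n.ChunkNo <= (DATALENGTH(d.Body) - 1) / 8000 + 1
    )
    SELECT d.DocId,
           HASHBYTES('MD5',
               (SELECT CONVERT(CHAR(32), HASHBYTES('MD5', c.Chunk), 2)
                FROM   Chunks AS c
                WHERE  c.DocId = d.DocId
                ORDER BY c.ChunkNo
                FOR XML PATH(''))  -- concatenates the per-chunk hashes
           ) AS DocumentHash
    FROM   dbo.Document AS d;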

  • magarity kerns (4/14/2010)


    Mike C (4/13/2010)


    Nice article! I implemented something very similar for loading data warehouses and datamarts a while back. One thing to keep in mind is that you can actually use a BINARY(16) column to hold MD5 hashes or BINARY(20) for SHA-1, since they're guaranteed to return a binary hash of exactly 128 bits or 160 bits, respectively. Also, HASHBYTES is restricted in what it can hash -- there's an 8,000-byte restriction on the input, so you can't effectively hash LOB data or columns wider than 8,000 bytes with it. You can work around that limitation by chopping up your input or by generating the hashes in your ETL tool of choice (SSIS script components allow direct access to the System.Security.Cryptography namespace, for example).

    Thanks for the positive feedback!

    The tricky thing is that keeping the output as binary makes it hard on the eyes when pasted into an article's text for illustration. No, it doesn't have to be converted in actual use. Also, not all MD5 functions return binary; it depends on the ETL tool. I used HASHBYTES to stay on topic with SQL Server.

    No problem, I think it's a great intro article and I hope you follow up with more!

    You don't have to convert single columns that are implicitly convertible to varbinary. HASHBYTES performs the implicit conversion for you. If you want to concatenate multiple columns, or have a data type that's not implicitly convertible, you do have to perform the explicit conversion yourself. Also if you concatenate columns you could get hit with a couple of conversions. For instance, you might explicitly convert an INT to a VARCHAR to concatenate it with a VARCHAR column, then HASHBYTES will still perform the implicit conversion to varbinary. For me it was a little easier to auto-generate the statements with the explicit conversions so I didn't have to worry about which types can be implicitly converted and which types can be concatenated.

    BTW, here's my old connect item on adding a "Better Collision Free Hash Function" to T-SQL:

    https://connect.microsoft.com/SQLServer/feedback/details/513376/a-better-collision-free-hash-function-for-comparing-rows-of-data

    Here's an article on using CLR to hash a BLOB: http://sqlblog.com/blogs/michael_coles/archive/2009/04/12/let-s-hash-a-blob.aspx
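
    A small sketch of what Mike C describes, using made-up columns (NULL handling omitted): every column is explicitly converted to VARBINARY, the one type almost everything converts to, so there is exactly one visible conversion per column and no intermediate VARCHAR step.

    SELECT CustomerKey,
           HASHBYTES('SHA1',
               CONVERT(VARBINARY(100), CustomerName)  +  -- variable-length string
               CONVERT(VARBINARY(4),   CustomerKey)   +  -- INT becomes a fixed 4 bytes
               CONVERT(VARBINARY(8),   LastOrderDate)    -- DATETIME becomes a fixed 8 bytes
           ) AS RowHash
    FROM   dbo.Customer;

    A side benefit of the fixed-width binary encodings is that the numeric and date columns can't run into the "1" + "11" versus "11" + "1" problem described in the next post; variable-length strings still need the separators and length prefixes discussed below.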

  • Mike C (4/14/2010)


    For instance, you might explicitly convert an INT to a VARCHAR to concatenate it with a VARCHAR column, then HASHBYTES will still perform the implicit conversion to varbinary.

    So, let's say we have two integer columns, with two rows of data:

    1 11

    11 1

    And we convert them to VARCHAR and concatenate them:

    "111"

    "111"

    Now we get collisions on the concatenated strings even before hashing. You can't safely convert and concatenate 4 bytes of data into fewer than 4 bytes of data without adding overhead of some sort.
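
    A quick way to see this in SQL Server, using literals only (no real table needed):

    -- Both expressions hash the same string, '111', even though the
    -- underlying pairs of integers are different.
    SELECT HASHBYTES('MD5', CONVERT(VARCHAR(10), 1)  + CONVERT(VARCHAR(10), 11)) AS HashA,
           HASHBYTES('MD5', CONVERT(VARCHAR(10), 11) + CONVERT(VARCHAR(10), 1))  AS HashB;
    -- HashA and HashB come back identical.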

  • Nadrek (4/14/2010)


    A few comments with the underlying assumption that the desired result is a (potentially) more efficient method that is guaranteed to return the same results as the "long way".

    For guaranteed accuracy, use whatever hash you like as a first pass. Whenever the hashes are different, the data is different. Whenever the hashes are the same, run a field by field (or equivalent) "long way" comparison to see if the data is actually the same after all.

    Note that field concatenation comparisons do not work when variable-length types are mixed in (trivial example: "a " + "b" and "a" + " b" are not the same row, yet both concatenate to "a b"; likewise, "1" + "1/22/1988" and "" + "11/22/1988" are not the same, yet both concatenate to "11/22/1988").

    If you deliberately decide that some non-zero percentage of erroneous matches are acceptable, then test your hash vs a "full length" compare to see what your particular data results in. In reality, there are only real results. Probabilities let you predict probabilities, but most business data I've seen doesn't follow "random input" or normal distribution rules.

    That said, if you rely on probability, plan for what you'll do after a collision happens.

    Field separators of some sort help in your examples. In the trivial example below I'm using "|" as a field separator although it could be anything (even non-printable characters or strings of characters):

    "a " + "|" + "b" = "a |b"

    Is not the same as

    "a" + "|" + " b" = "a| b"

    And

    "1" + "|" + "1/22/1998" = "1|1/22/1998"

    Is not the same as

    "11/22/1998"

    I also prepend the binary length of the string to each column, which provides additional differentiation and a useful way of dealing with NULLs. For instance:

    "a " + "b" ==> 0x00000002 + "a " + "|" + 0x00000001 + "b"

    When you have a NULL you can deal with it by passing in the binary equivalent of -1 as the length, or you can concatenate a length of 0 yet append one or more data bytes (0x00000000 + 0x00), or come up with some other scenario that cannot be duplicated by any other combination of length + data for any other data types.
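
    Putting the separator and length-prefix ideas together, a rough sketch with made-up columns (the exact NULL encoding is a design choice, as the post above notes):

    SELECT CustomerKey,
           HASHBYTES('SHA1',
               -- 4-byte length prefix (-1 flags NULL), then the data, then a separator
               CONVERT(BINARY(4), ISNULL(DATALENGTH(CustomerName), -1)) +
               ISNULL(CONVERT(VARBINARY(100), CustomerName), 0x) +
               CONVERT(VARBINARY(1), '|') +
               CONVERT(BINARY(4), ISNULL(DATALENGTH(LastOrderDate), -1)) +
               ISNULL(CONVERT(VARBINARY(8), LastOrderDate), 0x)
           ) AS RowHash
    FROM   dbo.Customer;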

  • james.wheeler10 (4/14/2010)


    magarity kerns (4/14/2010)


    rmd0 (4/14/2010)


    One issue I have with this solution is that MD5 *will* have collisions with moderately "wide" tables, especially when bit columns are involved. In fact, I would see as many as two or three collisions in as few as 1 million rows on a table with a dozen columns. It would usually occur when the column that changed was a bit field.

    This is not the case. Please see the web page linked at the end of the article concerning the likelihood of MD5 collisions.

    In my book, any likelihood = a certainty (at some point down the line). Do you have anything in place to catch collisions should they occur, or are you just hoping they won't?

    Hope isn't a strategy 🙂

    You're correct that hope isn't a strategy. But -- and this is particularly true in our domain -- science is a nice strategic tool.

    Calculate the odds of a hash collision with SHA-1. According to the birthday paradox, the 160-bit hash width means you can expect to generate on the order of 2^80 hashes (1,208,925,819,614,629,174,706,176) before you see a single collision. When you have a single table with that many rows in it, you can switch to SHA-256, where you can generate on the order of 2^128 hashes (3.4*10^38) before you generate a collision. Keep in mind that there are professional cryptanalysts who have been actively trying to force hash collisions with various algorithms for the past decade and haven't yet succeeded in most cases.

    So yes, you are correct: once you've generated 1,208,925,819,614,629,174,706,176 160-bit hashes in a single table, you're fairly likely to have a collision, and it's probably time to move up to a larger hash size. And we didn't even touch on the topic of 512-bit hashes yet.
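
    As a rough check of those figures, the usual birthday-bound approximation for the collision probability among n random b-bit hashes is:

    p(n) \approx 1 - e^{-n^{2}/2^{\,b+1}}, \qquad p(2^{80}) \approx 1 - e^{-1/2} \approx 0.39 \text{ for } b = 160

    In other words, generating about 2^80 SHA-1 hashes gives roughly a 40% chance of at least one collision, which is the usual reading of the 160-bit birthday bound.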

  • Great article. The idea behind using the hash to determine data change is a great one. Determining which hashing algorithm to use and where to use it (ETL, update existing data, etc.) is the hard part.

  • As long as you choose cryptographically secure hashes with long bit lengths (160 bits or more recommended) you'll do great. Where to hash depends on where you can do it most efficiently 🙂

