Use HASHBYTES to identify changes

  • Comments posted to this topic are about the item Use HASHBYTES to identify changes

  • Hi

    This would be really useful for speeding up our ETL process, but when I adapt the code to run against our main fact table, it fails with:

    "String or binary data would be truncated."

    I am guessing this is because the row has too much data in it?

    Is there any way I can get around this so that it still identifies changes to any column in the record? What is the maximum amount of data I can use this function with?

    Thanks

    Jack

  • Doesn't the column set returned by SELECT * include the rowDiffHash column?

    If so, the hash is non-deterministic, in the sense that repeating the UPDATE would produce a different result each time.
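    A minimal sketch of that feedback loop, using hypothetical table and column names (dbo.FactSales with a stored rowDiffHash column); the stored hash is deliberately included in the input to show the effect:

        -- The stored hash is part of the hashed input, so every
        -- UPDATE changes the input that the next UPDATE will see.
        UPDATE f
        SET rowDiffHash = HASHBYTES('SHA1',
                CONCAT(CAST(f.CustomerID AS varchar(20)), '|',
                       CAST(f.Quantity   AS varchar(20)), '|',
                       CONVERT(varchar(64), f.rowDiffHash, 1)))
        FROM dbo.FactSales AS f;
        -- Run it twice: the second pass hashes the value the first
        -- pass wrote, so the stored hashes keep changing even though
        -- the underlying data did not.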

  • With SHA1, the HASHBYTES function always returns a varbinary(20), but the input is limited to 8,000 bytes. So if your row adds up to more than 8,000 bytes, that would be the issue. Instead of doing a SELECT *, you could SELECT only the columns that you want to watch for changes.
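    For example, something along these lines (table and column names are hypothetical; CONCAT needs SQL Server 2012 or later, on older versions use ISNULL and +) keeps the input well under the limit:

        -- Hash only the columns you care about. Cast everything to
        -- text and use a delimiter so ('ab','c') and ('a','bc') do
        -- not produce the same input; CONCAT maps NULLs to ''.
        SELECT SaleID,
               HASHBYTES('SHA1',
                   CONCAT(CAST(CustomerID AS varchar(20)), '|',
                          CAST(Quantity   AS varchar(20)), '|',
                          CAST(UnitPrice  AS varchar(30)), '|',
                          PromoCode)) AS rowDiffHash
        FROM dbo.FactSales;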

    Jim

  • Yes, SELECT * will include rowDiffHash if it has a value. My process checks the record set received today against the record set received the day before, so for me, rowDiffHash is always NULL at that point. If you want to recalculate a row's hash, you will have to select specific columns and exclude the rowDiffHash column; otherwise it would throw off the process.
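    Once the hash is built from explicit columns, the day-over-day comparison stays simple. A rough sketch, with hypothetical staging tables and keys:

        -- Rows whose hash changed between yesterday's feed and
        -- today's; rowDiffHash is computed from data columns only,
        -- so the varbinary values can be compared directly.
        SELECT t.BusinessKey
        FROM dbo.Staging_Today AS t
        JOIN dbo.Staging_Yesterday AS y
             ON y.BusinessKey = t.BusinessKey
        WHERE t.rowDiffHash <> y.rowDiffHash;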

    Jim

  • Jim Youmans-439383 (11/25/2013)


    With SHA1, the HASHBYTES function always returns a varbinary(20), but the input is limited to 8,000 bytes. So if your row adds up to more than 8,000 bytes, that would be the issue. Instead of doing a SELECT *, you could SELECT only the columns that you want to watch for changes.

    Jim

    Agreed, Jim. Particularly with the FOR XML PATH approach, there's a huge amount of per-row overhead, so it's very easy to end up with truncation errors.
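    One way to see how quickly that overhead adds up is to measure the generated XML text per row before hashing. A sketch, again with hypothetical names:

        -- FOR XML wraps every column in markup, and the result is
        -- nvarchar (2 bytes per character), so a wide row can blow
        -- past the 8,000-byte HASHBYTES input limit easily.
        SELECT TOP (5)
               DATALENGTH((SELECT f.CustomerID, f.Quantity,
                                  f.UnitPrice, f.PromoCode
                           FOR XML RAW)) AS xmlBytes
        FROM dbo.FactSales AS f;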

    Additionally, where are the benchmarks (done with Profiler) that show performance differences, including the overhead of the extra column and HASHBYTES calculations, with good indexing?

    Many other pitfalls involved in using hashes to detect differences are discussed in:

    http://www.sqlservercentral.com/Forums/Topic902868-507-1.aspx

    Suffice it to say:

    If the hash is different, the data is different.

    If the hash is the same, you know nothing about the data for certain.

  • If you use HASHBYTES with SHA1, the birthday bound means you can expect a collision only after generating hashes for roughly 2^80 rows of data, about 1,208,925,819,614,629,174,706,176. That would take me a few million years. With MD5's shorter 128-bit output the bound drops to roughly 2^64, and deliberate MD5 collisions can be engineered, so MD5 collisions are far more plausible.

    Here is a really good article about it:

    http://sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx
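    For what it's worth, those figures are just the birthday bounds, which you can sanity-check in T-SQL (float arithmetic, since 2^80 overflows bigint):

        -- Expected number of hashes before a collision is roughly
        -- 2^(bits/2): 2^80 for SHA1 (160 bits), 2^64 for MD5 (128).
        SELECT POWER(CAST(2 AS float), 80) AS sha1_birthday_bound,  -- ~1.2e24
               POWER(CAST(2 AS float), 64) AS md5_birthday_bound;   -- ~1.8e19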

