RE: HASHBYTES can help quickly load a Data Warehouse

April 14, 2010 at 8:33 am

A few comments, all assuming that the desired result is a (potentially) more efficient method that is guaranteed to return the same results as the "long way".

For guaranteed accuracy, use whatever hash you like as a first pass. Whenever the hashes are different, the data is different. Whenever the hashes are the same, run a field-by-field (or equivalent) "long way" comparison to see whether the data is actually the same after all.
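To make the two-pass idea concrete, here is a minimal sketch in Python rather than T-SQL. The names `row_hash`, `is_changed` and `load_stored_row` are made up for illustration, and SHA-1 via `hashlib` simply stands in for whatever hash you like.

```python
import hashlib

def row_hash(row):
    # Hypothetical helper: repr() of a tuple keeps the field boundaries distinct.
    return hashlib.sha1(repr(tuple(row)).encode("utf-8")).digest()

def is_changed(stored_hash, incoming_row, load_stored_row):
    # First pass: different hashes prove the data is different.
    if row_hash(incoming_row) != stored_hash:
        return True
    # Second pass: equal hashes prove nothing, so fetch the stored row
    # and compare it field by field (the "long way").
    return tuple(load_stored_row()) != tuple(incoming_row)
```

The stored row is only fetched when the hashes agree, which is where any savings would come from; the result is the same as always doing the full comparison.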

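Note that comparing concatenated fields does not work when variable-length types are mixed in, because different rows can concatenate to the same string. Trivial example: "a " + "b" and "a" + " b" both give "a b", yet the rows are not the same. Likewise, "1" + "1/22/1988" and "" + "11/22/1988" both give "11/22/1988", yet the rows are not the same.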
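The concatenation problem is easy to reproduce. The sketch below is Python, not T-SQL, and both function names are made up for illustration. The length prefix in `length_prefixed_key` is only one possible way to keep field boundaries, added here as an assumption; it is not something the comparison above depends on.

```python
def naive_key(fields):
    # Plain concatenation: field boundaries are lost.
    return "".join(fields)

def length_prefixed_key(fields):
    # Assumed mitigation: prefix each field with its length so that
    # boundaries survive the concatenation.
    return "".join(f"{len(field)}:{field}" for field in fields)

# Different rows, identical concatenation (and therefore identical hash).
print(naive_key(["a ", "b"]) == naive_key(["a", " b"]))
print(naive_key(["1", "1/22/1988"]) == naive_key(["", "11/22/1988"]))

# With boundaries preserved, the same rows no longer collide.
print(length_prefixed_key(["a ", "b"]) == length_prefixed_key(["a", " b"]))
```

Whatever string you hash, two rows that produce the same string will always produce the same hash, so the ambiguity has to be removed before hashing.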

If you deliberately decide that some non-zero percentage of erroneous matches is acceptable, then test your hash against a "full length" compare to see what your particular data actually produces. In reality, there are only real results. Probabilities only let you predict likelihoods, and most business data I've seen doesn't follow "random input" or normal distribution rules.
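One way to run that test is to bucket your rows by hash and count how many distinct rows share a bucket. This is a rough Python sketch; `count_false_matches` and `weak_hash` are hypothetical names, and `weak_hash` is deliberately crippled to a single byte so that false matches actually show up in a small sample.

```python
import hashlib
from collections import defaultdict

def count_false_matches(rows, hash_fn):
    # Group the distinct rows (as hashable tuples) by their hash value.
    buckets = defaultdict(set)
    for row in rows:
        buckets[hash_fn(row)].add(row)
    # Every distinct row beyond the first in a bucket is one that a
    # hash-only comparison would wrongly treat as a match.
    return sum(len(distinct) - 1 for distinct in buckets.values())

def weak_hash(row):
    # Deliberately weak, for demonstration only: first byte of SHA-1.
    return hashlib.sha1(repr(row).encode("utf-8")).digest()[:1]

rows = [(i, "customer") for i in range(300)]
print(count_false_matches(rows, weak_hash))
```

Swap in your real hash and a representative extract of your real rows; the count you get is the real result for that data, not a prediction.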

That said, if you rely on probability, plan for what you'll do after a collision happens.