RE: Suggestion/solution for using Hashbytes across entire table

  • Kim Crosser (10/6/2015)


    ...

    So - I am using parallel tables that contain the primary keys of the source tables and a computed checksum, used to detect changes to the source tables since the last conversion run.

    The CHECKSUM function is *probably* sufficient to detect virtually all these changes, and has worked so far (that we can detect), but I would like to eliminate the "*probably*" part. 🙂 Plus, in future projects I would like to have another arrow in my quiver of useful techniques...

The datatype of CHECKSUM is a 4-byte integer, which isn't selective enough to be used as a reliable hash across all columns of a 200+ byte wide record, especially if the record contains unique integer columns. Years ago, I attempted to leverage CHECKSUM(*) as a quick-and-dirty means of identifying duplicate records within large tables and found that it returned too many false positives.
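
To make the false-positive risk concrete: CHECKSUM follows collation rules, so under a typical case-insensitive collation two visibly different strings produce the same value, and with only 2^32 possible values, random collisions become likely once a table reaches roughly 77,000 rows (the birthday bound). A minimal sketch:

    SELECT CHECKSUM('Hello World') AS cs1,
           CHECKSUM('HELLO WORLD') AS cs2;
    -- Under a case-insensitive collation (the common default), both columns
    -- return the same INT, so a pure CHECKSUM comparison would miss a change
    -- that only alters letter case.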

For what you're doing, you'll want to use the HASHBYTES() function with at least the MD5 (16-byte hash) algorithm. SHA2 can go up to a 64-byte hash, but of course the point of using a hash is to conserve space in the reference tables and provide fast lookups. However, there is no option for doing something like HASHBYTES(*) across all columns, so you'll have to include the full list of columns across which to hash.
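
Something along these lines works as the hash expression; this is only a sketch, and the table and column names (dbo.Customer, etc.) are hypothetical stand-ins for your own:

    -- CONCAT (SQL Server 2012+) implicitly converts each argument to a string
    -- and treats NULL as an empty string; the '|' delimiter keeps adjacent
    -- columns distinct, so ('ab','c') doesn't hash the same as ('a','bc').
    -- Note: prior to SQL Server 2016, HASHBYTES input is capped at 8000 bytes.
    SELECT c.CustomerID,
           HASHBYTES('SHA2_256',
                     CONCAT(c.FirstName, '|',
                            c.LastName,  '|',
                            c.Address1,  '|',
                            c.City,      '|',
                            c.ModifiedDate)) AS RowHash
    FROM dbo.Customer AS c;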

But that shouldn't be a problem, because I would think that the column structure for tables in the legacy database should be static at this point. So you're only coding this ETL step once for each table, right?
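
Once each table has its hash expression, the change-detection step described above reduces to a join against the parallel table, comparing the stored hash to a freshly computed one. Again a sketch with hypothetical names:

    -- etl.CustomerHash is the hypothetical parallel table holding
    -- (CustomerID, RowHash) from the previous conversion run.
    SELECT c.CustomerID
    FROM dbo.Customer AS c
    JOIN etl.CustomerHash AS h
      ON h.CustomerID = c.CustomerID
    WHERE h.RowHash <> HASHBYTES('SHA2_256',
              CONCAT(c.FirstName, '|', c.LastName, '|',
                     c.Address1,  '|', c.City,     '|',
                     c.ModifiedDate));
    -- Rows returned have changed since the last run; brand-new rows (no match
    -- in etl.CustomerHash) would need a LEFT JOIN and a NULL check instead.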

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho