Hi

    I've been developing a DWH for about three years and started using this technique about a year ago. I started off by trying CHECKSUM, but that was way off in terms of uniqueness. With SQL 2005, HASHBYTES improved this, but you can't run it over multiple columns. The MD5 hash is much quicker than the other algorithms but is less unique; I also tried SHA1, but it is very slow in comparison.

    I need to use multiple columns in the hash index, so I use a CHECKSUM over HASHBYTES, such as:

    checksum(hashbytes('MD5', column1), hashbytes('MD5', column2)) - this works pretty well but is still not 100% unique across all records (in fact, even HASHBYTES on a single column is not 100% unique across all values).
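
    Just to make that concrete, here is a minimal sketch of that expression in a staging query. The table and column names are placeholders, and both columns are assumed to be varchar, since HASHBYTES only accepts character or binary input:

        -- Placeholder names; column1/column2 are assumed to be (n)varchar columns,
        -- because HASHBYTES does not accept ints, dates, etc. directly.
        SELECT CHECKSUM(HASHBYTES('MD5', column1),
                        HASHBYTES('MD5', column2)) AS HashIndex
        FROM   Stage.SomeTable;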

    I suggest you wrap HASHBYTES in a UDF so that a change is easy to make across all ETL loads: instead of hashbytes('MD5', column1), call dbo.fnHash(column1).
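
    A possible shape for that wrapper (only the name dbo.fnHash comes from above; the parameter and return types are assumptions). Changing the algorithm later then only means editing this one function:

        -- Sketch only: wraps HASHBYTES so the algorithm is defined in one place.
        -- NVARCHAR(4000) keeps the input within the 8000-byte limit HASHBYTES has on SQL 2005.
        CREATE FUNCTION dbo.fnHash (@value NVARCHAR(4000))
        RETURNS VARBINARY(16)   -- MD5 produces 16 bytes
        AS
        BEGIN
            RETURN HASHBYTES('MD5', @value);
        END;

    The ETL loads can then call dbo.fnHash(column1) everywhere instead of hashbytes('MD5', column1).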

    A trick to improve uniqueness when using CHECKSUM is to convert dates to strings, e.g. checksum(convert(varchar(10), dateColumn, 102)). Beware of floats: they cause problems in checksums because even when two values are the same they may give different results, so convert them to decimal(38, 8). As a final point, put strings as the first column in the CHECKSUM: checksum(hashbytes('MD5', stringColumn), intColumn) is more unique than checksum(intColumn, hashbytes('MD5', stringColumn)).
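
    Putting those three points together, a hash expression might look something like this (column names are placeholders):

        -- Sketch only: string column first, date converted to a yyyy.mm.dd string,
        -- float converted to decimal so equal values checksum the same way.
        SELECT CHECKSUM(HASHBYTES('MD5', stringColumn),
                        CONVERT(VARCHAR(10), dateColumn, 102),
                        CONVERT(DECIMAL(38, 8), floatColumn)) AS HashIndex
        FROM   Stage.SomeTable;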

    When you do your update query into the DWH you must also include all hashed columns in your join, e.g. "join ... on Stage.Hash = DWH.Hash and Stage.Column1 = DWH.Column1" - this will always be 100% unique. This has implications for the index on your hash column: it must have the hash as the leading column and then include all the other columns that are part of the hash, e.g. create index index_name on table_name (HashIndex, Column1). A sketch of both is below.
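
    A rough sketch of what that looks like (table, column and index names are all placeholders):

        -- Join on the hash plus every column that went into it.
        UPDATE DWH
        SET    DWH.SomeAttribute = Stage.SomeAttribute
        FROM   dbo.DimCustomer AS DWH
        JOIN   Stage.Customer  AS Stage
               ON  Stage.HashIndex = DWH.HashIndex
               AND Stage.Column1   = DWH.Column1;  -- repeat for each hashed column

        -- The index leads on the hash, then carries the columns that are part of it.
        CREATE INDEX IX_DimCustomer_HashIndex
            ON dbo.DimCustomer (HashIndex, Column1);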

    This will save you a lot of pain.