• Like, what should be the minimum number of rows, or the minimum number of columns, after which the time consumed in calculating the hash outweighs the time saved in the comparison?

    Calculating the hash is extremely lightweight as data passes through an ETL tool. Lightweight enough that it's a foregone conclusion, and you can see from the other comments that plenty of people are already using one hash function or another. I think you've missed the key architectural point: the 3-column xref tables dramatically reduce disk IO, which is a much bigger bottleneck than the CPU computing a hash. The thing to compare is the width of the target table against the width of the xref table.
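
    As a rough sketch (the table and column names here are just hypothetical placeholders), the xref table only needs the business key, the row hash, and the surrogate key, and the hash itself is a single cheap pass over the incoming row, e.g. with HASHBYTES on SQL Server:

        -- Hypothetical 3-column xref table: business key, hash of the tracked
        -- columns, and the surrogate key of the current row in the target table.
        CREATE TABLE dbo.DimCustomer_Xref (
            CustomerBK  varchar(50)  NOT NULL PRIMARY KEY,
            RowHash     binary(32)   NOT NULL,   -- SHA2_256 output
            CustomerSK  int          NOT NULL
        );

        -- Hash the incoming rows; CONCAT treats NULLs as empty strings,
        -- so the hash expression itself never comes back NULL.
        SELECT
            s.CustomerBK,
            HASHBYTES('SHA2_256',
                      CONCAT(s.FirstName, '|', s.LastName, '|', s.City, '|', s.Phone)) AS RowHash
        FROM staging.Customer AS s;

    Detecting a change is then just a join between that result set and the narrow xref table on the business key, comparing the two hash values.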

    Many data warehouse tables can be quite wide. Reading a 50-column target table just to get the hash value is many times slower than reading the 3-column xref table. With SQL Server this can be mitigated with an 'included column' on an index for the target table, but some database engines don't have this, and earlier versions of SQL Server don't either. Also consider development time: it is faster to build one standardized routine than to mix and match techniques. Finally, data warehouses have a strong tendency to change, so when a target table that's only 10 columns wide suddenly needs an additional 25 columns, you would have to rewrite its ETL if you had picked the all-column compare method just because it started out narrow.
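
    For reference, the 'included column' mitigation looks roughly like this on SQL Server (again with hypothetical names): the hash is stored on the target table and carried in the index leaf, so the lookup can be answered from the index alone without touching the wide base row.

        -- Covering index on the wide target table: seek on the business key and
        -- read the stored hash from the index, never scanning the 50-column row.
        CREATE NONCLUSTERED INDEX IX_DimCustomer_BK_RowHash
            ON dbo.DimCustomer (CustomerBK)
            INCLUDE (RowHash);

    Either way the lookup stays narrow; the separate xref table just makes that narrowness explicit and keeps the routine portable to engines and versions that don't support included columns.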