We are exploring the idea of using a hash value as the surrogate key in Dimensions. Eg , Lets say we have a dimension table "user_dim" having 2 columns, "user_dim_id" and "user_name"
In the general approach
"user_dim_id" would likely be an identity key of integer data type.
While loading the fact , the "user_name" coming from source will be looked up on user_dim and the resultant user_dim_id will be stored in the fact table .
If we use hashing , this lookup on the dimension table while loading the fact can be saved. While loading the fact , hash the username using the same algorithm and we will get the same value as we would get while loading the dimension.
create table dbo.user_dim(user_dim_id varbinary(20) primary key , username varchar(100))
create table dbo.user_fact(user_fact_id int identity(1,1), user_dim_id varbinary(20), items_bought int)
insert into user_dim (user_dim_id, username)
select hashbytes('SHA1', 'Robin') , 'Robin'
insert into user_fact (user_dim_id, items_bought)
select hashbytes('SHA1', 'Robin') , 10
select * from user_dim
select * from user_fact
user_fact_id user_dim_id items_bought
1 0x8A1273623DCDFF1BC248AD8B98D2C842B6FB8E24 10
Now the question is , if this is so convenient , why does none of the Datawarehousing books recommend this ?
I could find these below answers by researching but need expert advise from you guys .
- Hash keys are in general random values and are not sequential, so benefits of Clustered Index can’t be utilized on hash and effect of this could be huge for big dimension tables
- All the hash functions have a collision risk . The risk differs by the algorithm used and the value , however the risk is always there .
- Collision might result in PK constraints violation.
-The reporting tool accessing the DW , might have a hard time in joining these hash values compared to joining on integer keys.