Using Hash Values as Dimension Keys in a Data Warehouse

  • All,

    We are exploring the idea of using a hash value as the surrogate key in dimensions. E.g., let's say we have a dimension table "user_dim" with two columns, "user_dim_id" and "username".

    In the general approach, "user_dim_id" would likely be an identity key of integer data type.

    While loading the fact, the "username" coming from the source is looked up in user_dim, and the resulting user_dim_id is stored in the fact table.

    If we use hashing, this lookup on the dimension table while loading the fact can be skipped: while loading the fact, hash the username using the same algorithm, and we get the same value we would get while loading the dimension.

    Code snippet:

    create table dbo.user_dim (user_dim_id varbinary(20) primary key, username varchar(100));
    create table dbo.user_fact (user_fact_id int identity(1,1), user_dim_id varbinary(20), items_bought int);

    -- Load the dimension: the key is the SHA-1 hash of the username.
    insert into dbo.user_dim (user_dim_id, username)
    select hashbytes('SHA1', 'Robin'), 'Robin';

    -- Load the fact: hash the same username again instead of looking up the dimension.
    insert into dbo.user_fact (user_dim_id, items_bought)
    select hashbytes('SHA1', 'Robin'), 10;

    select * from dbo.user_dim;

    select * from dbo.user_fact;

    user_dim:
    user_dim_id                                   username
    0x8A1273623DCDFF1BC248AD8B98D2C842B6FB8E24    Robin

    user_fact:
    user_fact_id    user_dim_id                                   items_bought
    1               0x8A1273623DCDFF1BC248AD8B98D2C842B6FB8E24    10

    Now the question is: if this is so convenient, why do none of the data warehousing books recommend it?

    I could find the answers below by researching, but I need expert advice from you guys.

    - Hash keys are in general random values, not sequential, so the benefits of a clustered index can't be utilized on a hash, and the effect of this could be huge for big dimension tables (see the sketch after this list).
    - All hash functions have a collision risk. The risk differs by the algorithm used and the values hashed, but the risk is always there.
    - A collision might result in a PK constraint violation.
    - The reporting tool accessing the DW might have a hard time joining on these hash values compared to joining on integer keys.
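
    On the first point, here is a minimal sketch of how the effect could be observed, assuming the dbo.user_dim table above has been loaded with a large number of rows. Because hash keys arrive in effectively random order, inserts land at random positions in the clustered index instead of appending at the end, which shows up as fragmentation:

    -- Check clustered index fragmentation after a large load (illustrative).
    -- With random hash keys, avg_fragmentation_in_percent tends to be high;
    -- with a sequential identity key it stays near zero.
    select index_id, avg_fragmentation_in_percent, page_count
    from sys.dm_db_index_physical_stats(
             db_id(), object_id('dbo.user_dim'), null, null, 'LIMITED');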

  • Hash values should not be used in the way you're describing above.

    It's usually implemented to help detect changes across multiple attributes in a dimension during the ETL process, but it was never meant to be used as a surrogate key, as there would be no benefit.
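
    For illustration, a minimal sketch of that change-detection pattern (staging.users and its email/country columns are hypothetical names, and concat_ws requires SQL Server 2017 or later): hash the concatenated attributes once, then compare the result to the hash stored on the dimension row to decide whether anything changed.

    -- Compute a row hash over the tracked attributes; the '|' delimiter
    -- guards against two different rows concatenating to the same string.
    select s.username,
           hashbytes('SHA1', concat_ws('|', s.username, s.email, s.country)) as row_hash
    from staging.users as s;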

  • Martin Schoombee - Wednesday, December 12, 2018 9:30 AM

    Hash values should not be used in the way you're describing above.

    It's usually implemented to help detect changes across multiple attributes in a dimension during the ETL process, but it was never meant to be used as a surrogate key, as there would be no benefit.

    Not completely true. I am reading about the Data Vault 2.0 methodology, and it actually uses MD5 hashes as SIDs. The MD5 hash is taken of the business key (usually the natural key for the table). There is more to this, and I would suggest researching it to learn more.
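
    For reference, a sketch of what such a hash key might look like in T-SQL; the upper-casing and trimming are common normalizations (assumed here, not taken from the post) so the same business key always produces the same hash:

    -- MD5 hash of the business key, Data Vault 2.0 style (illustrative).
    select hashbytes('MD5', upper(ltrim(rtrim('Robin')))) as user_hash_key;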

  • Lynn Pettis - Wednesday, December 12, 2018 9:50 AM

    Not completely true. I am reading about the Data Vault 2.0 methodology, and it actually uses MD5 hashes as SIDs. The MD5 hash is taken of the business key (usually the natural key for the table). There is more to this, and I would suggest researching it to learn more.

    Internally, SQL Server will sometimes use hashing when joining two tables, though if you try this yourself it is not as efficient as SQL Server's internal process.
    I don't see any advantage in the case the OP has suggested of storing the hashed username on the dim table as the ID. The hash is longer than the username and needs a hash calculation, so they could more easily store the username itself as the key.
    Also, what if the username exists more than once in the dim table? How would the fact table know which dim row to point to?
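
    As a side note on the internal hashing mentioned above, SQL Server's optimizer can pick a hash join on its own; a sketch like the following (illustrative only) forces one so it can be compared in the execution plan against a join on the stored hash key:

    -- Force a hash join between the sample tables and inspect the plan.
    select d.username, f.items_bought
    from dbo.user_dim as d
    join dbo.user_fact as f
      on f.user_dim_id = d.user_dim_id
    option (hash join);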

  • Jonathan AC Roberts - Wednesday, December 12, 2018 10:16 AM

    Internally, SQL Server will sometimes use hashing when joining two tables, though if you try this yourself it is not as efficient as SQL Server's internal process.
    I don't see any advantage in the case the OP has suggested of storing the hashed username on the dim table as the ID. The hash is longer than the username and needs a hash calculation, so they could more easily store the username itself as the key.
    Also, what if the username exists more than once in the dim table? How would the fact table know which dim row to point to?

    I can see its benefits in Data Vault 2.0. I'm not saying it should always be used, just that it is a tool and needs to be used appropriately.

  • Data Vault 2.0 does indeed require hashes to be fully compliant.

    It is an interesting read, with some parts not for the faint of heart, but it does work well, especially if there are high volumes of data, or if the implementation is done using Hadoop, for example.

    A primary aspect of DV is that the business key is unique: if there can be multiple records for the same entity (user) with the same business key (username), then that is not the correct BK.
    Using hashes means that both the fact and the dim can be loaded in parallel, as the hash is calculated on both loads and the result will always be the same.
    Using a generated ID would mean that the dim would need to be loaded first, and only then could the fact be loaded, which goes against the DV model.
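
    For contrast, a sketch of the generated-ID approach being argued against, using the sample tables from the first post (staging.sales is a hypothetical source table, and user_dim_id would be an int identity in this variant): the fact load has to join to the already-loaded dimension, which is exactly the ordering dependency the hash approach removes.

    -- Fact load under the identity-key approach: look up the surrogate key
    -- in the dimension, so the dimension must be fully loaded first.
    insert into dbo.user_fact (user_dim_id, items_bought)
    select d.user_dim_id, s.items_bought
    from staging.sales as s
    join dbo.user_dim as d
      on d.username = s.username;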

  • I use hash keys in the MPP data warehouse. I wouldn't use them in the SMP setting, though.

  • What I am interested to understand and learn is what issues one might face in going down this path:

    - Performance
    - ETL load time
    - Report run time
    - Any other scenario where a hash key might cause issues

  • koustav_1982 - Wednesday, December 12, 2018 5:37 PM

    What I am interested to understand and learn is what issues one might face in going down this path:

    - Performance
    - ETL load time
    - Report run time
    - Any other scenario where a hash key might cause issues

    Well, for example, I once had a hashed clustered index on a customer-ID-like field. Inserting millions of records slowed the ETL process by a factor of 1,000 because of the hashed value, with SQL Server having to order each hashed value among all the others. I had to go back and replace the hash with a sequential identifier, which still hurt ETL performance while waiting for the data to be cleaned.

    Now that I store data on the MPP side of things, hashed keys are better because, if there is enough uniqueness in the hashed value, you can distribute that hashed value across N compute nodes. Having the distribution allows ALL nodes to work together when querying the hashed value, as on, say, a JOIN. Having a sequential key leads to constant data movement (shifting of the data to evenly distribute it across the N compute nodes) with every ETL process, which decreases performance.
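
    A minimal sketch of that distribution idea in Azure SQL Data Warehouse / Synapse dedicated-pool syntax (this is MPP-specific syntax, not standard SQL Server, and the table name is illustrative):

    -- Distribute the fact on the hash key so rows spread evenly across
    -- compute nodes; joins against a dimension distributed the same way
    -- then avoid data movement.
    create table dbo.user_fact_mpp
    ( user_dim_id  varbinary(20) not null,
      items_bought int )
    with ( distribution = hash(user_dim_id),
           clustered columnstore index );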

  • Thanks for the help, guys.

  • Heh... hold the phone.... If you have a user ID (and assuming that it's unique), why do you need another surrogate key to begin with??? :blink:

    --Jeff Moden

  • It is typical to have surrogate keys in your data warehouse separate from those in your OLTP system. Consider that there could be cases, especially with slowly changing dimensions, where the row identified by username matches multiple rows in the dimension table. There may also be rows in the dimension that have no corresponding rows in the OLTP system, e.g. a row to represent "Unknown" values. Also, while it may not be a consideration here, a dimension table may have rows that come from multiple sources, in which case neither source key would be appropriate to represent all dimension rows.

    As for using hashing to generate the dimension key, I'd say be very careful with that, as you need to be absolutely sure that the source value you are hashing never changes. A username seems like something that could change over time, and you would probably want to keep the same dimension row key when the username changes, so that you don't have to update all of the existing fact records that use that hashed value.
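
    To make that concrete, a small sketch (the renamed value 'Robin_R' is hypothetical): once the source value changes, its hash changes with it, so existing fact rows would still carry the old key.

    -- The two hashes differ, so fact rows keyed on the old hash no longer
    -- match a dimension row keyed on the new one.
    select hashbytes('SHA1', 'Robin')   as old_key,
           hashbytes('SHA1', 'Robin_R') as new_key;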

  • Jeff Moden - Thursday, December 20, 2018 8:59 AM

    Heh... hold the phone.... If you have a user ID (and assuming that it's unique), why do you need another surrogate key to begin with??? :blink:

    Depends on the control, I guess. It's not uncommon to cleanse the keys from a source you don't control and generate new ones, so that the system controls the entire key generation process. That way, if the source key changes for an existing record (an existing ID gets assigned a new ID for some reason) and there is history of this change, you can still retain the same surrogate key you generated in your system.

    In other cases, like the key being a unique hash of some type, taking that hash and assigning a unique sequential ID to it is often a good approach if you want to index on that identifier. For example, applying an index to a long hash value with billions of records is hard to manage in most SMP systems. Reassigning a smaller sequential value is much easier. In that case, you would still keep the original hash key alongside the new identifier.

    When it comes to receiving hashes, it's often difficult to index them. Thus, generating new keys that are easier to index can improve the performance of inserts and reads. For example, I had a difficult time indexing billions of hashes versus just using a sequential identifier generated by the system in the SMP environment.

  • xsevensinzx - Friday, December 21, 2018 7:16 AM

    Depends on the control, I guess. It's not uncommon to cleanse the keys from a source you don't control and generate new ones, so that the system controls the entire key generation process. That way, if the source key changes for an existing record (an existing ID gets assigned a new ID for some reason) and there is history of this change, you can still retain the same surrogate key you generated in your system.

    In other cases, like the key being a unique hash of some type, taking that hash and assigning a unique sequential ID to it is often a good approach if you want to index on that identifier. For example, applying an index to a long hash value with billions of records is hard to manage in most SMP systems. Reassigning a smaller sequential value is much easier. In that case, you would still keep the original hash key alongside the new identifier.

    When it comes to receiving hashes, it's often difficult to index them. Thus, generating new keys that are easier to index can improve the performance of inserts and reads. For example, I had a difficult time indexing billions of hashes versus just using a sequential identifier generated by the system in the SMP environment.

    I guess I'm still amazed that we still have to do things like that in a DW. Heh... kind of a "the key is not the key unless we use it as a key" thing.

    In the example that you gave above, do you maintain a cross-reference table that you have to keep up for things like customer name spelling?

    --Jeff Moden

  • Jeff Moden - Friday, December 21, 2018 8:11 AM

    I guess I'm still amazed that we still have to do things like that in a DW. Heh... kind of a "the key is not the key unless we use it as a key" thing.

    In the example that you gave above, do you maintain a cross-reference table that you have to keep up for things like customer name spelling?

    Pretty much. We create reference tables for these, for sure. For example, I had an alphanumeric key from Google that was hard to manage performance on once you get into the billions. We would replace it with a surrogate key generated by the data warehouse, and a cross-reference table is created to maintain the original alongside the new. It also allows us to create additional analytics on top of it, like the first time we saw the key and how many repeats there were thereafter, as in cases like customer IDs and so forth.

    It's critical to do this because you want to try to unify all these disparate datasets. You also want one system doing the marrying of these datasets, so that it is uniform across all of them, instead of just creating more disparity that has no structure and, most importantly, no control.
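
    A sketch of that cross-reference pattern (all names are illustrative, not from the posts): keep the original source key, assign a compact sequential surrogate, and record first-seen metadata for the extra analytics mentioned above.

    -- Cross-reference table mapping a long external key to a small
    -- surrogate, with a first-seen timestamp for downstream analytics.
    create table dbo.customer_xref
    ( customer_sk int identity(1,1) primary key,
      source_key  varchar(200) not null unique,  -- e.g. the long key from Google
      first_seen  datetime2 not null default sysdatetime() );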
