Using Hash Values as Dimension Keys in a Data Warehouse

  • All,

    We are exploring the idea of using a hash value as the surrogate key in dimensions. E.g., let's say we have a dimension table "user_dim" with two columns, "user_dim_id" and "username".

    In the general approach, "user_dim_id" would likely be an identity key of integer data type.

    While loading the fact, the "username" coming from the source is looked up in user_dim, and the resulting user_dim_id is stored in the fact table.

    If we use hashing, this lookup on the dimension table while loading the fact can be skipped: while loading the fact, hash the username using the same algorithm, and we get the same value we would get while loading the dimension.

    Code snippet:

    create table dbo.user_dim (user_dim_id varbinary(20) primary key, username varchar(100));
    create table dbo.user_fact (user_fact_id int identity(1,1), user_dim_id varbinary(20), items_bought int);

    -- Load the dimension: the key is the SHA-1 hash of the username.
    insert into dbo.user_dim (user_dim_id, username)
    select hashbytes('SHA1', 'Robin'), 'Robin';

    -- Load the fact: hash the same username again instead of looking up the dimension.
    insert into dbo.user_fact (user_dim_id, items_bought)
    select hashbytes('SHA1', 'Robin'), 10;

    select * from dbo.user_dim;

    select * from dbo.user_fact;

    user_dim:
    user_dim_id                                   username
    0x8A1273623DCDFF1BC248AD8B98D2C842B6FB8E24    Robin

    user_fact:
    user_fact_id    user_dim_id                                   items_bought
    1               0x8A1273623DCDFF1BC248AD8B98D2C842B6FB8E24    10

    Now the question is: if this is so convenient, why do none of the data warehousing books recommend it?

    I could find the answers below by researching, but I need expert advice from you guys.

    - Hash keys are in general random values, not sequential, so the benefits of a clustered index can't be utilized on a hash, and the effect of this could be huge for big dimension tables (see the sketch after this list).
    - All hash functions have a collision risk. The risk differs by the algorithm used and the values hashed, but the risk is always there.
    - A collision might result in a PK constraint violation.
    - The reporting tool accessing the DW might have a hard time joining on these hash values compared to joining on integer keys.
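
    On the first point, here is a minimal sketch of how the effect could be observed, assuming the dbo.user_dim table above has been loaded with a large number of rows. Because hash keys arrive in effectively random order, inserts land at random positions in the clustered index instead of appending at the end, which shows up as fragmentation:

    -- Check clustered index fragmentation after a large load (illustrative).
    -- With random hash keys, avg_fragmentation_in_percent tends to be high;
    -- with a sequential identity key it stays near zero.
    select index_id, avg_fragmentation_in_percent, page_count
    from sys.dm_db_index_physical_stats(
             db_id(), object_id('dbo.user_dim'), null, null, 'LIMITED');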

  • Hash values should not be used in the way you're describing above.

    It's usually implemented to help detect changes across multiple attributes in a dimension during the ETL process, but it was never meant to be used as a surrogate key, as there would be no benefit.
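
    For illustration, a minimal sketch of that change-detection pattern (staging.users and its email/country columns are hypothetical names, and concat_ws requires SQL Server 2017 or later): hash the concatenated attributes once, then compare the result to the hash stored on the dimension row to decide whether anything changed.

    -- Compute a row hash over the tracked attributes; the '|' delimiter
    -- guards against two different rows concatenating to the same string.
    select s.username,
           hashbytes('SHA1', concat_ws('|', s.username, s.email, s.country)) as row_hash
    from staging.users as s;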

  • Martin Schoombee - Wednesday, December 12, 2018 9:30 AM

    Hash values should not be used in the way you're describing above.

    It's usually implemented to help detect changes across multiple attributes in a dimension during the ETL process, but it was never meant to be used as a surrogate key, as there would be no benefit.

    Not completely true. I am reading about the Data Vault 2.0 methodology, and it actually uses MD5 hashes as SIDs. The MD5 hash is taken of the business key (usually the natural key for the table). There is more to this, and I would suggest researching it to learn more.
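
    For reference, a sketch of what such a hash key might look like in T-SQL; the upper-casing and trimming are common normalizations (assumed here, not taken from the post) so the same business key always produces the same hash:

    -- MD5 hash of the business key, Data Vault 2.0 style (illustrative).
    select hashbytes('MD5', upper(ltrim(rtrim('Robin')))) as user_hash_key;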

  • Lynn Pettis - Wednesday, December 12, 2018 9:50 AM

    Not completely true. I am reading about the Data Vault 2.0 methodology, and it actually uses MD5 hashes as SIDs. The MD5 hash is taken of the business key (usually the natural key for the table). There is more to this, and I would suggest researching it to learn more.

    Internally, SQL Server will sometimes use hashing when joining two tables, though if you try this yourself it is not as efficient as SQL Server's internal process.
    I don't see any advantage in the case the OP has suggested of storing the hashed username on the dim table as the ID. The hash is longer than the username and needs a hash calculation, so they could more easily store the username itself as the key.
    Also, what if the username exists more than once in the dim table? How would the fact table know which dim row to point to?
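
    As a side note on the internal hashing mentioned above, SQL Server's optimizer can pick a hash join on its own; a sketch like the following (illustrative only) forces one so it can be compared in the execution plan against a join on the stored hash key:

    -- Force a hash join between the sample tables and inspect the plan.
    select d.username, f.items_bought
    from dbo.user_dim as d
    join dbo.user_fact as f
      on f.user_dim_id = d.user_dim_id
    option (hash join);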

  • Jonathan AC Roberts - Wednesday, December 12, 2018 10:16 AM

    Internally, SQL Server will sometimes use hashing when joining two tables, though if you try this yourself it is not as efficient as SQL Server's internal process.
    I don't see any advantage in the case the OP has suggested of storing the hashed username on the dim table as the ID. The hash is longer than the username and needs a hash calculation, so they could more easily store the username itself as the key.
    Also, what if the username exists more than once in the dim table? How would the fact table know which dim row to point to?

    I can see its benefits in Data Vault 2.0. I'm not saying it should always be used, just that it is a tool and needs to be used appropriately.

  • Data Vault 2.0 does indeed require hashes to be fully compliant.

    It is an interesting read, with some parts not for the faint of heart, but it does work well, especially if there are high volumes of data, or if the implementation is done using Hadoop, for example.

    A primary aspect of DV is that the business key is unique: if there can be multiple records for the same entity (user) with the same business key (username), then that is not the correct BK.
    Using hashes means that both the fact and the dim can be loaded in parallel, as the hash is calculated on both loads and the result will always be the same.
    Using a generated ID would mean that the dim would need to be loaded first, and only then could the fact be loaded, which goes against the DV model.
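
    For contrast, a sketch of the generated-ID approach being argued against, using the sample tables from the first post (staging.sales is a hypothetical source table, and user_dim_id would be an int identity in this variant): the fact load has to join to the already-loaded dimension, which is exactly the ordering dependency the hash approach removes.

    -- Fact load under the identity-key approach: look up the surrogate key
    -- in the dimension, so the dimension must be fully loaded first.
    insert into dbo.user_fact (user_dim_id, items_bought)
    select d.user_dim_id, s.items_bought
    from staging.sales as s
    join dbo.user_dim as d
      on d.username = s.username;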

  • I use hash keys in the MPP data warehouse. I wouldn't use them in the SMP setting, though.

  • What I am interested to understand and learn is what issues one might face in going down this path:

    - Performance
    - ETL load time
    - Report run time
    - Any other scenario where a hash key might cause issues

  • koustav_1982 - Wednesday, December 12, 2018 5:37 PM

    What I am interested to understand and learn is what issues one might face in going down this path:

    - Performance
    - ETL load time
    - Report run time
    - Any other scenario where a hash key might cause issues

    Well, for example, I once had a hashed clustered index on a customer-ID-like field. Inserting millions of records slowed the ETL process by a factor of 1,000 because of the hashed value, with SQL Server having to order each hashed value among all the others. I had to go back and replace the hash with a sequential identifier, which still hurt ETL performance while waiting for the data to be cleaned.

    Now that I store data on the MPP side of things, hashed keys are better because, if there is enough uniqueness in the hashed value, you can distribute that hashed value across N compute nodes. Having the distribution allows ALL nodes to work together when querying the hashed value, as on, say, a JOIN. Having a sequential key leads to constant data movement (shifting of the data to evenly distribute it across the N compute nodes) with every ETL process, which decreases performance.
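
    A minimal sketch of that distribution idea in Azure SQL Data Warehouse / Synapse dedicated-pool syntax (this is MPP-specific syntax, not standard SQL Server, and the table name is illustrative):

    -- Distribute the fact on the hash key so rows spread evenly across
    -- compute nodes; joins against a dimension distributed the same way
    -- then avoid data movement.
    create table dbo.user_fact_mpp
    ( user_dim_id  varbinary(20) not null,
      items_bought int )
    with ( distribution = hash(user_dim_id),
           clustered columnstore index );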

  • Thanks for the help, guys.

  • Heh... hold the phone.... If you have a user ID (and assuming that it's unique), why do you need another surrogate key to begin with??? :blink:

    --Jeff Moden

  • It is typical to have surrogate keys in your data warehouse separate from those in your OLTP system. Consider that there could be cases, especially with slowly changing dimensions, where the row identified by username matches multiple rows in the dimension table. There may also be rows in the dimension that have no corresponding rows in the OLTP system, e.g. a row to represent "Unknown" values. Also, while it may not be a consideration here, a dimension table may have rows that come from multiple sources, in which case neither source key would be appropriate to represent all dimension rows.

    As for using hashing to generate the dimension key, I'd say be very careful with that, as you need to be absolutely sure that the source value you are hashing never changes. A username seems like something that could change over time, and you would probably want to keep the same dimension row key when the username changes, so that you don't have to update all of the existing fact records that use that hashed value.
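
    To make that concrete, a small sketch (the renamed value 'Robin_R' is hypothetical): once the source value changes, its hash changes with it, so existing fact rows would still carry the old key.

    -- The two hashes differ, so fact rows keyed on the old hash no longer
    -- match a dimension row keyed on the new one.
    select hashbytes('SHA1', 'Robin')   as old_key,
           hashbytes('SHA1', 'Robin_R') as new_key;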

  • Jeff Moden - Thursday, December 20, 2018 8:59 AM

    Heh... hold the phone.... If you have a user ID (and assuming that it's unique), why do you need another surrogate key to begin with??? :blink:

    Depends on the control, I guess. It's not uncommon to cleanse the keys from a source you don't control and generate new ones, so that the system controls the entire key generation process. That way, if the source key changes for an existing record (an existing ID gets assigned a new ID for some reason) and there is history of this change, you can still retain the same surrogate key you generated in your system.

    In other cases, like the key being a unique hash of some type, taking that hash and assigning a unique sequential ID to it is often a good approach if you want to index on that identifier. For example, applying an index to a long hash value with billions of records is hard to manage in most SMP systems. Reassigning a smaller sequential value is much easier. In that case, you would still keep the original hash key alongside the new identifier.

    When it comes to receiving hashes, it's often difficult to index them. Thus, generating new keys that are easier to index can improve the performance of inserts and reads. For example, I had a difficult time indexing billions of hashes versus just using a sequential identifier generated by the system in the SMP environment.

  • xsevensinzx - Friday, December 21, 2018 7:16 AM

    Depends on the control, I guess. It's not uncommon to cleanse the keys from a source you don't control and generate new ones, so that the system controls the entire key generation process. That way, if the source key changes for an existing record (an existing ID gets assigned a new ID for some reason) and there is history of this change, you can still retain the same surrogate key you generated in your system.

    In other cases, like the key being a unique hash of some type, taking that hash and assigning a unique sequential ID to it is often a good approach if you want to index on that identifier. For example, applying an index to a long hash value with billions of records is hard to manage in most SMP systems. Reassigning a smaller sequential value is much easier. In that case, you would still keep the original hash key alongside the new identifier.

    When it comes to receiving hashes, it's often difficult to index them. Thus, generating new keys that are easier to index can improve the performance of inserts and reads. For example, I had a difficult time indexing billions of hashes versus just using a sequential identifier generated by the system in the SMP environment.

    I guess I'm still amazed that we still have to do things like that in a DW. Heh... kind of a "the key is not the key unless we use it as a key" thing.

    In the example that you gave above, do you maintain a cross-reference table that you have to keep up for things like customer name spelling?

    --Jeff Moden

  • Jeff Moden - Friday, December 21, 2018 8:11 AM

    I guess I'm still amazed that we still have to do things like that in a DW. Heh... kind of a "the key is not the key unless we use it as a key" thing.

    In the example that you gave above, do you maintain a cross-reference table that you have to keep up for things like customer name spelling?

    Pretty much. We create reference tables for these, for sure. For example, I had an alphanumeric key from Google that was hard to manage performance on once you get into the billions. We would replace it with a surrogate key generated by the data warehouse, and a cross-reference table is created to maintain the original alongside the new. It also allows us to create additional analytics on top of it, like the first time we saw the key and how many repeats there were thereafter, as in cases like customer IDs and so forth.

    It's critical to do this because you want to try to unify all these disparate datasets. You also want one system doing the marrying of these datasets, so that it is uniform across all of them, instead of just creating more disparity that has no structure and, most importantly, no control.
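
    A sketch of that cross-reference pattern (all names are illustrative, not from the posts): keep the original source key, assign a compact sequential surrogate, and record first-seen metadata for the extra analytics mentioned above.

    -- Cross-reference table mapping a long external key to a small
    -- surrogate, with a first-seen timestamp for downstream analytics.
    create table dbo.customer_xref
    ( customer_sk int identity(1,1) primary key,
      source_key  varchar(200) not null unique,  -- e.g. the long key from Google
      first_seen  datetime2 not null default sysdatetime() );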
