Is a durable key always necessary in a dim?

  • I want to use a Type 2 SCD SSIS transform on a dimension table that does not have a durable key.  However, the table does have a 4-column natural key.  Please bear in mind, these 4 columns are also foreign keys to other tables.  Is a single durable key necessary in this case?

  • It is recommended, yes...chances are that you most likely would want to add the durable key (or a surrogate key if that's what you meant) to a fact table as foreign key, and having a composite key will not be ideal from a performance or design perspective.

    I would also recommend that you don't use the Type-2 SCD component in SSIS...it's one of the worst-performing RBAR-type data flow tasks.

  • Durable keys are important because you don't want to have to update all the existing records in a Fact table if the key in the Dimension table changes.  Dimension table keys should also be separate from the "natural" key or foreign key used in the OLTP system because you will need to represent the data at different points in time for your type 2 slowly changing Dimension.

    Say you have a customer John Doe who is CustomerID 12345 in your OLTP system.  If he lives in Ohio in 2018, then moves to Kentucky in January 2019, you would want to know both of those addresses in your Demographics Dimension table.  Any orders he made in 2018 should be reflected in Ohio, and any orders he made in 2019 should be reflected in Kentucky, so different records in your Fact table will point to different records in your Dimension table based on what was current at the time the Fact occurred.

    If you try to use a multi-column key, you will end up making the Fact table wider than it should be.  Your Fact table will typically be narrow (few columns) and long (many rows), while your Dimension tables will typically be wide (many columns) and short (fewer rows).  You would also need a way to distinguish the different point in time records for the same "natural" key, so you might as well use a single surrogate key anyway.

    I also agree with Martin that there are better ways to handle type 2 SCD than the SSIS transformation task.  Take a look at what you can do with the MERGE statement OUTPUT clause:

    https://wateroxconsulting.com/archives/type-2-scd-with-sql-merge/

     

  • I totally agree that the SCD object is far from ideal.  However:  1) the data set is very small (just a few hundred megabytes of data), 2) we are under a tight deadline and the SCD object is the quickest to implement.

     

    That being said, shouldn't the SCD object automatically increment the Durable Key/Surrogate Key when it adds a new row?

  • SCD requires a Business Key in the source. In the target dimension table you can use an IDENTITY column for a surrogate key. You don't need to do anything in the SCD component to achieve that.

  • imani_technology wrote:

    I totally agree that the SCD object is far from ideal.  However:  1) the data set is very small (just a few hundred megabytes of data), 2) we are under a tight deadline and the SCD object is the quickest to implement.

    That being said, shouldn't the SCD object automatically increment the Durable Key/Surrogate Key when it adds a new row?

    Cutting corners will cost you in the long run, and I'd strongly advise against it. If the data set is small now, it doesn't mean that it will remain small in the long run.

    Durable keys and surrogate keys are different things. A surrogate key is unique for every row in the dimension table, and you can use an identity field for that. A durable key remains the same for all historic records of an entity, and gives you the ability to access all records of a customer etc. without the need for self-joins or complicated queries.

  • And surrogate keys can and usually are durable keys.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    And surrogate keys can and usually are durable keys.

    yes, true 🙂

  • Okay, so let's say we use the MERGE statement instead of an SCD object.  How do I get data from the preceding step in the SSIS package (a Derived Column transform, in this case) into the MERGE statement?

    Also, how would the MERGE statement deal with both the PK (business/surrogate key) and the Durable Key (used for joining the fact table to the dimension table)?

  • You cannot directly use SSIS data flow to drive a MERGE statement. The MERGE statement would have to be part of your control flow instead and you should use a staging table to supply the data for the MERGE.

    You may be confusing the idea of a durable key with a surrogate key. Durable key is Ralph Kimball's jargon for a value which is not unique (and therefore not really a key at all) in a Type 2 dimension but which is a surrogate for an entity in the business domain where there is no other stable business key. There is no method for maintaining such durable values unless you have some kind of indication that a business key value has changed and some other means to identify the old version of the row in the dimension - usually some alternative key from the source data.

  • So the Slowly Changing Dimension object turned out to be \"less than ideal\" (in other words, trash). Where is a good place for me to learn how to implement an SCD using a MERGE statement? Also, I will need to use hashbytes.

  • You can find lots of examples around:

    https://www.google.com/search?q=scd+type+2+merge+sql

  • imani_technology wrote:

    So the Slowly Changing Dimension object turned out to be \"less than ideal\" (in other words, trash). Where is a good place for me to learn how to implement an SCD using a MERGE statement? Also, I will need to use hashbytes.

    If you have SQL Server 2016, Temporal Tables work as Type 6 SCDs.  I wouldn't use MERGE on a bet for this.  Just use a nice little Temporal Table if you have 2016 or greater and a well written trigger to do the same thing as  a Temporal Table if you have less than 2016.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Fortunately, we are using SQL Server 2016.  Have you used temporal tables in a Type 2 SCD situation?  How well does it work?  Also, how does it impact reporting against dimension tables?

  • Offhand I see two hurdles to using a temporal table as a slowly changing dimension.  First, the related records in the temporal table would all have the same primary key value, which would make it more difficult to match a fact record to its corresponding point in time dimension record.  The typical methodology for handling this is that each row in the type 2 SCD would have a different key value so that the fact rows can each reference the dimension values to their specific point in time.  If you go back to my example from last month of customer John Doe, When he lived in Ohio the row in the dimension table may have the key DimCustomerKey = 4567, and then when he moves to Kentucky in 2019 he gets a new dimension row with a different key, say DimCustomerKey = 4632.  The old fact records that were generated for him prior would still have DimCustomerKey = 4567 and new fact records for 2019 would have DimCustomerKey = 4632.  In a temporal table, both records have the same key and can only be differentiated by using the clause "FOR SYSTEM_TIME ALL" and adding additional WHERE clauses to match to the specific StartTime - EndTime range.

    The second hurdle will be if not every attribute change requires history tracking.  I'm not sure that you can even specify which columns are constrained to create new history records in a temporal table and which are not significant.

     

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply