Capturing the Reason For Change for Type 2 changes

  • If I want to populate a ReasonForChange column in a dimension table for Type 2 changes, is it best practice to populate that column in the new row (the one with an EndDate of 9999-12-31 and IsCurrent = 'Y') or the old row (with EndDate = the date the attribute changed and IsCurrent = 'N')?

    Or both perhaps?

    I'm fine with writing the code to identify the ReasonForChange, just not sure where to write the value.

    Thanks

    Lempster

  • I'm by no means an expert of any kind on data warehousing so take this as my opinion.

    I would put it in the expired record since the change is the reason why the record expired. The current record has no reason for change since it hasn't changed.

  • It depends.....mostly on the level of tracking data that you want to keep. If you are storing something minor, then it is probably not an issue to keep it in the table. Just make sure you document and stay consistent on which record has the change info. Normally, I would put it in the new record and try not to touch the old record.

    We have systems where they want to search for which column changed, so we have a second table to store the changed data in column/value format. It may seem redundant, but it makes the searches run fast.

    I have seen approaches where the keep the change data in XML format. Any design that would use blob data should use a separate table.

  • Lempster (1/29/2014)


    If I want to populate a ReasonForChange column in a dimension table for Type 2 changes, is it best practice to populate that column in the new row (the one with an EndDate of 9999-12-31 and IsCurrent = 'Y') or the old row (with EndDate = the date the attribute changed and IsCurrent = 'N')?

    Or both perhaps?

    I'm fine with writing the code to identify the ReasonForChange, just not sure where to write the value.

    Thanks

    Lempster

    You don't need the pain of maintaining an "IsCurrent" column for Type 2 SCDs. If you have both a start and end date per TYPE 2 SCDs, the dates are good enough to tell you what is current especially since you were smart enough to NOT use a NULL end date. I'll also recommend that you don't actually use 9999-12-31 as an end date. Instead, use 9999-12-30 or even just '9999' (which will auto-magically convert to 9999-01-01) so you have at least 1 day of "headroom" for certain range calculations.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!

    Jeff, I don't see it as much of a pain to maintain a IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:

    ...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.

    So it's for ease of querying more than anything else. Thanls for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein lead to the Y2K problem didn't it? ;-))

  • Lempster (2/27/2014)


    Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!

    Jeff, I don't see it as much of a pain to maintain a IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:

    ...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.

    So it's for ease of querying more than anything else. Thanls for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein lead to the Y2K problem didn't it? ;-))

    As long as you are consistent on assigning the end of time value, that becomes your isCurrent flag.

    where endDate = '99991230'

  • EricEyster (2/27/2014)


    As long as you are consistent on assigning the end of time value, that becomes your isCurrent flag.

    where endDate = '99991230'

    Yeah, I get that.

  • I also maintain a pair of dates as well as a current flag. On the current records the end date is null however. On my agent tables, which are SCD2 because agent numbers are reused, I expose the flag as an active attribute. THere may be others where that is also usefully exposed as an attribute.

  • After reflection I'm going to change my answer to agree with EricEyster.

    People aren't going to look in the previously expired record to see why it expired. The current record is the one that's getting all the attention so it should have the necessary info as to why it's the current record.

  • It is common practice to include both RowStartDate, RowEndDate, and RowIsCurrent. Here is Kimball Group's (Warren Thornwaite) thinking regarding "RowChangedReason" along with the code to do it:

    http://www.kimballgroup.com/2006/06/01/design-tip-80-adding-a-row-change-reason-attribute/

  • Lempster (2/27/2014)


    Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!

    Jeff, I don't see it as much of a pain to maintain a IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:

    ...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.

    So it's for ease of querying more than anything else. Thanls for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein lead to the Y2K problem didn't it? ;-))

    Put an index on the "Current-Flag" and watch your GUI's timeout when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/28/2014)


    Put an index on the "Current-Flag" and watch your GUI's timeout when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.

    I'm talking about a Data Warehouse so there aren't going to be any GUIs to timeout, certainly not any doing inserts. There will of course be inserts on a daily basis due to the ETL process. The relational tables in the Data Warehouse will have multidimensional cubes built on them and it will be the cubes that are queried by end users, not the relational tables directly.

    I will of course undertake extensive testing, but at this point I'm inclined to follow Kimball best practice.

    Regards

    Lempster

  • Lempster (3/3/2014)


    Jeff Moden (2/28/2014)


    Put an index on the "Current-Flag" and watch your GUI's timeout when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.

    I'm talking about a Data Warehouse so there aren't going to be any GUIs to timeout, certainly not any doing inserts. There will of course be inserts on a daily basis due to the ETL process. The relational tables in the Data Warehouse will have multidimensional cubes built on them and it will be the cubes that are queried by end users, not the relational tables directly.

    I will of course undertake extensive testing, but at this point I'm inclined to follow Kimball best practice.

    Regards

    Lempster

    Things change a little if you are going to use SSAS. The DW becomes little more than a data store to facilitate the ETL processes. Sure, you need enough to also support your debugging when things go bump in the night, but the Kimball design assumes your users are getting data from the relational engine.

    If you want to display the isCurrent flag for testing or for ease of loading to SSAS, create a view to calculate the isCurrent flag using a case statement on the endDate.

  • Kimball design assumes your users are getting data from the relational engine

    Would you mind explaining further what you mean by this? It's possible that I'm not understanding something, but Kimball is geared to SSAS and SSAS is not pulling the data from the relational engine. What am I missing?

  • RonKyle (3/3/2014)


    Kimball design assumes your users are getting data from the relational engine

    Would you mind explaining further what you mean by this? It's possible that I'm not understanding something, but Kimball is geared to SSAS and SSAS is not pulling the data from the relational engine. What am I missing?

    Yes, follow the Kimball design in the SSAS database. Assuming you are using MOLAP, SSAS pulls the data from the relational engine during dimension/partition processing and does not touch it again. We have systems that rebuild a single partition each week, letting most the of data in the DW untouched until it is purged. no need for heavy indexing, etc, on the relational DW side to support end user queries. Instead, focus on optimizing the ETL processes for fast load, select for the partition, and purge.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply