Capturing the Reason For Change for Type 2 changes

Lempster

SSCoach

January 29, 2014 at 9:11 am

#285058

If I want to populate a ReasonForChange column in a dimension table for Type 2 changes, is it best practice to populate that column in the new row (the one with an EndDate of 9999-12-31 and IsCurrent = 'Y') or the old row (with EndDate = the date the attribute changed and IsCurrent = 'N')?
Or both perhaps?
I'm fine with writing the code to identify the ReasonForChange, just not sure where to write the value.
Thanks
Lempster

JustMarie SSCertifiable · Answer 1

I'm by no means an expert of any kind on data warehousing so take this as my opinion.

I would put it in the expired record since the change is the reason why the record expired. The current record has no reason for change since it hasn't changed.

EricEyster SSCrazy · Answer 2

It depends, mostly on the level of tracking data that you want to keep. If you are storing something minor, then it is probably not an issue to keep it in the dimension table. Just make sure you document which record holds the change info and stay consistent. Normally, I would put it in the new record and try not to touch the old record.

We have systems where they want to search for which column changed, so we have a second table to store the changed data in column/value format. It may seem redundant, but it makes the searches run fast.
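A minimal sketch of what such a column/value change table might look like. All table, column, and index names here are hypothetical; the post does not describe the actual schema.

```sql
-- Hypothetical change-log table: one row per changed column.
-- Names and data types are illustrative assumptions, not the poster's design.
CREATE TABLE dbo.DimCustomerChange
(
    ChangeID    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerKey int          NOT NULL,  -- surrogate key of the new dimension row
    ColumnName  sysname      NOT NULL,  -- which attribute changed
    OldValue    varchar(255) NULL,
    NewValue    varchar(255) NULL,
    ChangeDate  date         NOT NULL
);

-- Index to support "which rows changed column X?" searches.
CREATE INDEX IX_DimCustomerChange_ColumnName
    ON dbo.DimCustomerChange (ColumnName, ChangeDate);

-- Example search: every recorded change to the City attribute.
SELECT CustomerKey, OldValue, NewValue, ChangeDate
FROM   dbo.DimCustomerChange
WHERE  ColumnName = N'City';
```

Keeping this in a separate table means the dimension itself stays narrow, at the cost of one extra insert per changed column during the load.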

I have seen approaches where they keep the change data in XML format. Any design that would use blob data should use a separate table.

Jeff Moden SSC Guru · Answer 3

Lempster (1/29/2014)
If I want to populate a ReasonForChange column in a dimension table for Type 2 changes, is it best practice to populate that column in the new row (the one with an EndDate of 9999-12-31 and IsCurrent = 'Y') or the old row (with EndDate = the date the attribute changed and IsCurrent = 'N')?
Or both perhaps?
I'm fine with writing the code to identify the ReasonForChange, just not sure where to write the value.
Thanks
Lempster

You don't need the pain of maintaining an "IsCurrent" column for Type 2 SCDs. If you have both a start and end date per TYPE 2 SCDs, the dates are good enough to tell you what is current especially since you were smart enough to NOT use a NULL end date. I'll also recommend that you don't actually use 9999-12-31 as an end date. Instead, use 9999-12-30 or even just '9999' (which will auto-magically convert to 9999-01-01) so you have at least 1 day of "headroom" for certain range calculations.
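A short sketch of the "headroom" point, assuming the end date is stored as datetime. The statements below need no tables; the column aliases are illustrative only.

```sql
-- A four-digit string is read as a year, giving 1 January of that year.
SELECT CAST('9999' AS datetime) AS EndOfTime;

-- With 9999-12-30 as the end-of-time value, adding one day still fits.
SELECT DATEADD(day, 1, CAST('99991230' AS datetime)) AS NextDay;

-- With 9999-12-31 there is no headroom: the statement below raises an
-- overflow error, so it is left commented out.
-- SELECT DATEADD(day, 1, CAST('99991231' AS datetime));
```

Range logic of the form "EndDate plus one day" is the kind of calculation that benefits from the spare day.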

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Lempster SSCoach · Answer 4

Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!

Jeff, I don't see it as much of a pain to maintain an IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:

...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.

So it's for ease of querying more than anything else. Thanks for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein led to the Y2K problem, didn't it? ;-))

EricEyster SSCrazy · Answer 5

Lempster (2/27/2014)
Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!
Jeff, I don't see it as much of a pain to maintain an IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:
...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.
So it's for ease of querying more than anything else. Thanks for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein led to the Y2K problem, didn't it? ;-))

As long as you are consistent on assigning the end of time value, that becomes your isCurrent flag.

where endDate = '99991230'

Lempster SSCoach · Answer 6

EricEyster (2/27/2014)
As long as you are consistent on assigning the end of time value, that becomes your isCurrent flag.
where endDate = '99991230'

Yeah, I get that.

RonKyle SSC-Dedicated · Answer 7

I also maintain a pair of dates as well as a current flag. On the current records, however, the end date is NULL. On my agent tables, which are SCD2 because agent numbers are reused, I expose the flag as an active attribute. There may be other tables where it is also useful to expose it as an attribute.

JustMarie SSCertifiable · Answer 8

After reflection I'm going to change my answer to agree with EricEyster.

People aren't going to look in the previously expired record to see why it expired. The current record is the one that's getting all the attention so it should have the necessary info as to why it's the current record.
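A minimal sketch of a Type 2 change that writes the reason to the new row only, as suggested above. The dimension, its columns, and the sample values are hypothetical; the EndDate and IsCurrent conventions follow the original question.

```sql
-- Hypothetical dimension table; names and types are illustrative only.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey     int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID      int          NOT NULL,  -- business key
    City            varchar(50)  NOT NULL,  -- Type 2 tracked attribute
    StartDate       date         NOT NULL,
    EndDate         date         NOT NULL,
    IsCurrent       char(1)      NOT NULL,
    ReasonForChange varchar(100) NULL
);

-- Assumed inputs for one changed dimension member.
DECLARE @CustomerID int         = 42,
        @NewCity    varchar(50) = 'Leeds',
        @ChangeDate date        = '20140129';

BEGIN TRANSACTION;

-- Expire the old row; its ReasonForChange is left untouched.
UPDATE dbo.DimCustomer
SET    EndDate   = @ChangeDate,
       IsCurrent = 'N'
WHERE  CustomerID = @CustomerID
  AND  IsCurrent  = 'Y';

-- Insert the new current row and record why it was created.
INSERT INTO dbo.DimCustomer
       (CustomerID, City, StartDate, EndDate, IsCurrent, ReasonForChange)
VALUES (@CustomerID, @NewCity, @ChangeDate, '99991231', 'Y', 'City changed');

COMMIT TRANSACTION;
```

With this convention the reason on any row answers "why does this row exist?", and the expired row is only touched to close its date range.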

sneumersky SSCertifiable · Answer 9

It is common practice to include all three of RowStartDate, RowEndDate, and RowIsCurrent. Here is the Kimball Group's (Warren Thornthwaite's) thinking regarding "RowChangedReason", along with the code to do it:

http://www.kimballgroup.com/2006/06/01/design-tip-80-adding-a-row-change-reason-attribute/

Jeff Moden SSC Guru · Answer 10

Lempster (2/27/2014)
Thanks for the replies folks. As JustMarie and EricEyster made opposing arguments, I guess the most important point is to be consistent!
Jeff, I don't see it as much of a pain to maintain an IsCurrent flag; I agree that it could be seen as superfluous, but I'm just following Kimball best practice which states:
...the current-flag provides a rapid way to isolate exactly the set of dimension members that is in effect at the moment of the query.
So it's for ease of querying more than anything else. Thanks for the tip about the EndDate value though - I think I can afford to lose most of the year 9999 :-). (Although thinking in that vein led to the Y2K problem, didn't it? ;-))

Put an index on the "Current-Flag" and watch your GUIs time out when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.

Lempster SSCoach · Answer 11

Jeff Moden (2/28/2014)
Put an index on the "Current-Flag" and watch your GUIs time out when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.

I'm talking about a Data Warehouse, so there aren't going to be any GUIs to time out, certainly not any doing inserts. There will of course be inserts on a daily basis due to the ETL process. The relational tables in the Data Warehouse will have multidimensional cubes built on them, and it will be the cubes that are queried by end users, not the relational tables directly.

I will of course undertake extensive testing, but at this point I'm inclined to follow Kimball best practice.

Regards

Lempster

EricEyster SSCrazy · Answer 12

Lempster (3/3/2014)
Jeff Moden (2/28/2014)
Put an index on the "Current-Flag" and watch your GUIs time out when they try to do an INSERT because of the massive extent splits that will occur. 😉 I recommend just doing the dates correctly.
I'm talking about a Data Warehouse, so there aren't going to be any GUIs to time out, certainly not any doing inserts. There will of course be inserts on a daily basis due to the ETL process. The relational tables in the Data Warehouse will have multidimensional cubes built on them, and it will be the cubes that are queried by end users, not the relational tables directly.
I will of course undertake extensive testing, but at this point I'm inclined to follow Kimball best practice.
Regards
Lempster

Things change a little if you are going to use SSAS. The DW becomes little more than a data store to facilitate the ETL processes. Sure, you need enough to also support your debugging when things go bump in the night, but the Kimball design assumes your users are getting data from the relational engine.

If you want to display the isCurrent flag for testing or for ease of loading to SSAS, create a view to calculate the isCurrent flag using a case statement on the endDate.
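A minimal sketch of such a view, assuming '99991230' is the end-of-time value as in the earlier reply. The table, view, and column names are hypothetical.

```sql
-- Hypothetical dimension table with no stored IsCurrent column.
CREATE TABLE dbo.DimProduct
(
    ProductKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ProductID   int         NOT NULL,  -- business key
    ProductName varchar(50) NOT NULL,
    StartDate   date        NOT NULL,
    EndDate     date        NOT NULL   -- '99991230' marks the current row
);
GO

-- The view derives IsCurrent from EndDate, so there is no flag to maintain.
CREATE VIEW dbo.vDimProduct
AS
SELECT ProductKey,
       ProductID,
       ProductName,
       StartDate,
       EndDate,
       CASE WHEN EndDate = '99991230' THEN 'Y' ELSE 'N' END AS IsCurrent
FROM   dbo.DimProduct;
GO
```

SSAS or a test query can then read IsCurrent from the view while the ETL only has to manage the two dates.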

RonKyle SSC-Dedicated · Answer 13

Kimball design assumes your users are getting data from the relational engine

Would you mind explaining further what you mean by this? It's possible that I'm not understanding something, but Kimball is geared to SSAS and SSAS is not pulling the data from the relational engine. What am I missing?

EricEyster SSCrazy · Answer 14

RonKyle (3/3/2014)
Kimball design assumes your users are getting data from the relational engine
Would you mind explaining further what you mean by this? It's possible that I'm not understanding something, but Kimball is geared to SSAS and SSAS is not pulling the data from the relational engine. What am I missing?

Yes, follow the Kimball design in the SSAS database. Assuming you are using MOLAP, SSAS pulls the data from the relational engine during dimension/partition processing and does not touch it again. We have systems that rebuild a single partition each week, leaving most of the data in the DW untouched until it is purged. There is no need for heavy indexing, etc., on the relational DW side to support end user queries. Instead, focus on optimizing the ETL processes for fast load, select for the partition, and purge.