Typically, how often does an ETL process affect (through an update or an insert) a Fact table as a result of activity in a Dimension table?

  • Hey guys,
    Thanks in advance for your help. I'm going to try to describe the above situation, and it may conclude in a more fundamental design error than I would like to hear, but if that's the case, so be it....

    So, let's say you have a fact row (FR1) and it is related to a Dimension Row (DK1) through a foreign Key relationship. The relationship that is being documented here is: at the time of this particular Fact Event (FR1), the fact row was related to this particular state of the Dimension attributes(DK1). In our described situation, we are dealing with a typical Type 2 SCD with a date range. So, the next thing that happens is that a new row gets added to the Dimension to create DK2. The new row is the same Dimension object as in DK1, but something about the attributes in DK1 have changed, so we have added a new row for this new state of the dimension object. The SCD date ranges are handled appropriately and everything is fine to this point.

    The problem is: What do I have to do about the fact that the original FR1 row has a reference to DK1 and not to DK2? As one person stated, if the source system were inspected, the attribute data would not be the same. The reference would still point to the old attribute data. Additionally, there will be times where you would want to maintain that original relationship at the time of the event.

    One solution was, in an effort to maintain additivity, and for that additivity to function with no filters or caveats, was to add a row to the fact table that would back out the quantities associated in FR1, so the second row (FR2) would be like FR1 * -1. Then add a third row(FR3) for the new attribute values. Obviously, table growth over time would be an issue. This seems to be the most comprehensive solution, but it would seem like we would spend all of our time (both development-wise and resource-wise) maintaining these relationships. Is this what was really meant by the statement that "your fact tables will dwarf your dimension tables"?

    Another solution was to progress to a Type 3 SCD, which would be to add another column that contains the previous value of the Foreign Key Dimension relationship and still add another row to the fact table.

    Another mention was to go back to Type 1, which I doubt we would want to do.

    In any case, this problem seems very odd in that we don't see any way around it, but I don't really understand how I hadn't heard of the same problem or a solution before now in all of the research and preparation that we have done until now. Thanks for reading, hopefully I conveyed the question appropriately, but any knowledgeable answer or thoughts would be very much appreciated. Thanks for your time.

    BTW, yes, I did try to search/google for an answer for this, but nothing was specific enough to answer the question directly enough for me. Yes, this is our company's first DW effort, and we are using all of Kimball's books to guide us. But the question just seems so obvious that I feel like we are missing something that might be very apparent from another viewpoint. Thanks again.

  • cwe424 - Wednesday, March 7, 2018 5:01 PM

    Hey guys,
    Thanks in advance for your help. I'm going to try to describe the above situation, and it may conclude in a more fundamental design error than I would like to hear, but if that's the case, so be it....

    So, let's say you have a fact row (FR1) and it is related to a Dimension Row (DK1) through a foreign Key relationship. The relationship that is being documented here is: at the time of this particular Fact Event (FR1), the fact row was related to this particular state of the Dimension attributes(DK1). In our described situation, we are dealing with a typical Type 2 SCD with a date range. So, the next thing that happens is that a new row gets added to the Dimension to create DK2. The new row is the same Dimension object as in DK1, but something about the attributes in DK1 have changed, so we have added a new row for this new state of the dimension object. The SCD date ranges are handled appropriately and everything is fine to this point.

    The problem is: What do I have to do about the fact that the original FR1 row has a reference to DK1 and not to DK2? As one person stated, if the source system were inspected, the attribute data would not be the same. The reference would still point to the old attribute data. Additionally, there will be times where you would want to maintain that original relationship at the time of the event.

    One solution was, in an effort to maintain additivity, and for that additivity to function with no filters or caveats, was to add a row to the fact table that would back out the quantities associated in FR1, so the second row (FR2) would be like FR1 * -1. Then add a third row(FR3) for the new attribute values. Obviously, table growth over time would be an issue. This seems to be the most comprehensive solution, but it would seem like we would spend all of our time (both development-wise and resource-wise) maintaining these relationships. Is this what was really meant by the statement that "your fact tables will dwarf your dimension tables"?

    Another solution was to progress to a Type 3 SCD, which would be to add another column that contains the previous value of the Foreign Key Dimension relationship and still add another row to the fact table.

    Another mention was to go back to Type 1, which I doubt we would want to do.

    In any case, this problem seems very odd in that we don't see any way around it, but I don't really understand how I hadn't heard of the same problem or a solution before now in all of the research and preparation that we have done until now. Thanks for reading, hopefully I conveyed the question appropriately, but any knowledgeable answer or thoughts would be very much appreciated. Thanks for your time.

    BTW, yes, I did try to search/google for an answer for this, but nothing was specific enough to answer the question directly enough for me. Yes, this is our company's first DW effort, and we are using all of Kimball's books to guide us. But the question just seems so obvious that I feel like we are missing something that might be very apparent from another viewpoint. Thanks again.

    Tell you what, show us your use case with a simple set of tables and data that represents what you are talking about.  I used to be really good at word problems eons ago, now I find I am a much better visual problem solver and if you can show us with what may be considered a trivial example that will help me (and maybe others).

  • I second Lynn's comment...give us some examples of what you're seeing and what you're expecting to see. It almost sounds like you have additive numbers (measures) in your type 2 dimension, but it's not clear what the exact issue is without an example to help us see it better.

  • Lynn, this is a classic case where the dimension record is updated with new information that invalidates the fact record at the time of it's recording. He/she is asking if there is away to not having to update the fact table with the new FK's.

    In all Type 2 SCD scenarios, when the change happens to the dimensional record, the historical record is still valid with the historical fact record. For example, at the time of the fact record, a person lived in NC. The dimension record contains the NC attribute. Then he moved later in the year to CA. Thus, any new facts for that user should reflect the new dimension record for CA. Both the old and new facts are valid and have the correct FK's to the dimension. 

    In this case, the old has to be updated with the new key, which is expensive to the data warehouse. Type 2 SCD is not recommended here.

  • Onto the original question. 

    Do you need to reference the historic dimensional data? If not, then just update the existing record and generate no new keys. Why store data you will never use? I know that sounds silly being you're asking the question, but figured I would ask if you're just trying to do this proactively versus being told to do so?

    The recommended approach (as quoted from Ralph Kimballs forums), is to never change the dimensional key. In meaning, don't go back and update the fact table keys with the new keys once they are invalidated. I agree with this approach.

    To get another version of the dimension row, the original method was to perform a self-join of the dimension using it's natural key to locate any other version of the dimension row (usually to get the current version, which is why it has a current flag).

    Other techniques have evolved.  There is the notion of a stable surrogate key (basically a Type 1 key) that acts as an alternate key to the dimension.  You would either use that key for the self join (basically to improve join performance versus using a natural key) or you can store it as an additional foreign key on the fact to avoid the self join completely.

    You have the option of joining to the type 2 table using the alternate key and filtering on the current flag,  or, you can field a separate Type 1 table, depending on your needs.

    As far as how you update measures, it depends on your need.

    The self join method:

    The dimension has a type 2 primary key, its natural key, and a current row flag. The fact contains the type 2 foreign key. A FROM clause would look something like:

    FROM fact f 
    join dim d on f.dim_key = d.dim_key
    join dim cd on d.natural_key = cd.natural_key and current_flag (is true)

    The dim aliased by cd will contain the current version of the row. Alias d will contain the historical version of the row.

    The stable key method:

    The stable key is a surrogate for the natural key. Basically, when used in a type 2 dimension the process is to set this value to the primary key value when the dimension row is first created. You then carry this same value on all future versions of the dimension row. The dimension has a type 2 primary key, the stable key , and a current row flag. The fact contains the type 2 foreign key, and the stable key. A FROM clause to get the current row would look something like:

    FROM fact f
    join dim d on f.dim_stable_key = d.stable_key and current_flag

    This eliminates the dimension self join.

    Hope that helps!

  • xsevensinzx - Wednesday, March 7, 2018 5:46 PM

    Lynn, this is a classic case where the dimension record is updated with new information that invalidates the fact record at the time of it's recording. He/she is asking if there is away to not having to update the fact table with the new FK's.

    In all Type 2 SCD scenarios, when the change happens to the dimensional record, the historical record is still valid with the historical fact record. For example, at the time of the fact record, a person lived in NC. The dimension record contains the NC attribute. Then he moved later in the year to CA. Thus, any new facts for that user should reflect the new dimension record for CA. Both the old and new facts are valid and have the correct FK's to the dimension. 

    In this case, the old has to be updated with the new key, which is expensive to the data warehouse. Type 2 SCD is not recommended here.

    I understand that, but it would help me to see what is going on in his environment even if the example he provides is trivial in nature but shows what he is asking.

  • Hey guys,
    My hands were full yesterday evening and I could not get on to get a response in, but I want to thank Martin and Lynn for taking the time to be interested and helpful. This is honestly the most responsive and beneficial set of responses I have ever received on SSC and it is greatly appreciated. I was planning on honoring your requests for more info, but it turns out that I believe sevensinz entirely understood what I was asking. Clearly, this was my error in not being concise enough to get the point across without providing an entire diagram of what we are doing. My apologies for that. I agree that the best way to answer a specific question is with a completely whole view of the subject, but from my perspective, this was more of a general "conceptual" question, and I was hoping to only need to provide a vague scenario, so that we would not get bogged down in the details of our entire project. In the future, I should probably err on the side of more info.

    Do you need reference the historical dimensional data?
    Yeah, this has to do with a business model that includes contracts and agreements and their ongoing relationships and how they produce invoices and accounts receivable data. So, while I am going to skip a lot of context there, we are going to be interested in how those relationships begin, exist, expand, contract, and end, and the monetary impacts of those changes. I'm starting to believe that the real challenge here is that it is not just a static retail sales transaction that happens at a given moment and is then over. It is more of an ongoing lifetime situation. I don't know if that helps or if it affects your response, but the answer is a yes.

    So, after a long (2 hour plus) follow up discussion with my colleagues, and examination of any alternatives that were pointed out and that we could find on our own, I believe we are in a position to use the Static/Stable Key method. It will be very easy to locate the current row relative to the fact, as well as the original row, and then a small amount of SCD date comparisons if you need anything in between.

    Once again, thanks for everybody's input, it was great material for our discussion. The lesson for me here is making sure you understand all of the caveats related to the requirements concerning the historical dimensional data and it's relation to the fact data. Giving it a one time stamp is fine, if that's all you need, but if you want the full ongoing history, you are going to have to dig a little deeper. Thanks again!

    Clint

  • It's a tricky one for sure. I haven't myself dealt with it, but I could see it happening. To add some more insight for you. I imagined how I would solve this problem. I actually typed it out and responded (if you saw) before I deleted it. Then I went out to find a counter argument to that solution. Instead, I found the quoted article pretty much outlining the same solution from the Kimball forums.

    Figured I would mention that in case you find yourself in another tough situation. Sometimes just researching on your own proposal works too.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply