Fact Table Update (Design)

  • Hi,

Apologies to Jeff & Admin (and everyone this may help): I took down the original question because I felt it wasn't clear, so here goes 🙂

I think the root of the issue / question is: when is it considered OK to update a fact? Let me give a concrete example of what I'm facing and a design approach that I've been considering but am unsure of. I work in healthcare and we have (e.g.) a fact table where the grain is one fact per patient appointment. Example surrogate keys would be PatientSK, OutcomeSK (e.g. Attended, Cancelled, etc.), ContactDateSK, etc.

    Say after an initial load of the fact table on 01/01/2022 there are 1,000 appointments for PatientSK 1234. On 05/01/2022 there is a Type 2 change in Dim_Patient for this patient which creates a new PatientSK of 1235. In my mind this is essentially the same patient, but with a more up-to-date record in the dimension. I understand why in Kimball's SCD2 example you'd want to maintain the link between the historical fact and the original surrogate key, as it links to the older department, which would be crucial for reporting. But in the example I gave it could be the patient's gender or ethnicity that was updated, so it wouldn't really make sense to report older facts with a surrogate key pointing to EthnicityA and later facts with a PatientSK linking to EthnicityB for the same patient (we'd want this as a Type 2 change and not Type 1).

    I am trying to ascertain whether the process outlined below makes sense from a dimensional modelling standpoint. Because the target fact table will be temporal, the history is preserved even though the underlying surrogate keys are updated daily to the most recent active dimension members, and we could always query the fact table as it was on any particular day by winding back the temporal table (as long as the dimensions were never reloaded, so the surrogate key integrity remains intact). Please see the final question after the 3-step process below in relation to how my thoughts progressed on this.

    Step 1) After completing the initial load of the fact table in the target data warehouse, make the fact table temporal so that it can be wound back to any day in the past (so essentially this will be like a daily snapshot and preserve the surrogate keys as they were on any particular day)
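
    To make Step 1 concrete, something like the following is what I have in mind (table and column names are illustrative, not our real schema):

    ALTER TABLE dbo.Fact_Appointment
        ADD RowStartDT datetime2 GENERATED ALWAYS AS ROW START
                CONSTRAINT DF_FactAppt_RowStartDT DEFAULT SYSUTCDATETIME(),
            RowEndDT   datetime2 GENERATED ALWAYS AS ROW END
                CONSTRAINT DF_FactAppt_RowEndDT   DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
            PERIOD FOR SYSTEM_TIME (RowStartDT, RowEndDT);

    ALTER TABLE dbo.Fact_Appointment
        SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Fact_Appointment_History));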

Step 2) Create a staging fact table daily (loading all historical records into stage, as volume permits), joining to the underlying dimensions on IsActive = 1 (i.e. the latest currently active dimension member, not constraining the fact date between the RowStartDate & RowEndDate on the underlying dimensions), so the staging fact table is populated with the most recently available surrogate keys.
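
    A rough sketch of the Step 2 join, with made-up table and column names, would be:

    INSERT INTO stg.Fact_Appointment (AppointmentBK, PatientSK, OutcomeSK, ContactDateSK)
    SELECT  s.AppointmentBK,
            dp.PatientSK,   -- latest active member, not as-of the fact date
            oc.OutcomeSK,
            dd.DateSK
    FROM    src.Appointment AS s
    JOIN    dbo.Dim_Patient AS dp ON dp.PatientBK = s.PatientBK AND dp.IsActive = 1
    JOIN    dbo.Dim_Outcome AS oc ON oc.OutcomeBK = s.OutcomeBK AND oc.IsActive = 1
    JOIN    dbo.Dim_Date    AS dd ON dd.CalendarDate = s.ContactDate;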

Step 3) Use the staging fact table from Step 2) to merge into the target live data warehouse fact table and do a blanket update of all the surrogate keys (so, in my initial example, the surrogate key of 1234 for the 1,000 appointments would get updated to 1235).
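
    And Step 3 would be a blanket update along these lines (again, names illustrative), relying on the temporal history to keep the old key values:

    UPDATE  f
    SET     f.PatientSK     = s.PatientSK,
            f.OutcomeSK     = s.OutcomeSK,
            f.ContactDateSK = s.ContactDateSK
    FROM    dbo.Fact_Appointment AS f
    JOIN    stg.Fact_Appointment AS s ON s.AppointmentBK = f.AppointmentBK
    WHERE   f.PatientSK     <> s.PatientSK
        OR  f.OutcomeSK     <> s.OutcomeSK
        OR  f.ContactDateSK <> s.ContactDateSK;  -- skip no-op updates so the history table doesn't bloat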

Update: After thinking this through some more, I'm leaning towards updating the underlying facts only if a business key changes in the fact table (e.g. if the actual patient was switched, or if the outcome changed to something different). This way the historical surrogate keys would remain as-is for those facts whose business keys didn't change, but all surrogate keys would get updated for those facts that did change (a blanket update of the row). Is the proper method to update only the surrogate keys for the business keys that changed, rather than updating all surrogate keys for the row?

Apologies if my thought process is nonsensical 🙂 Any feedback is very much appreciated.

    Thanks in advance,

    N

     

  • Niggles wrote:

    Q

    Well now... that's pretty rude. 😉  The question you deleted may have helped others.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Hi Jeff & Admin, apologies, I didn't mean to be rude. I am just re-stating the question, as I wasn't really clear, so I temporarily took it down. As always I appreciate your feedback; you were very helpful to me in the past.

  • Thank you very much and apologies I didn't see this before

  • There's a lot to unpack here, so let me start with a few thoughts and clarifying questions:

    To answer your primary question of "Is it ok to update Fact table records?": Absolutely! In a perfect world transactions never change, but unless you're dealing with financial data it is very likely that source "transactions" could change. The business case and expected outcome should drive your decision here.

    For an attribute like ethnicity, why would you make it a type-2 change? A change to this type of attribute is most likely a correction to something that was incorrect in the first place, so why not make it a type-1 change and update all historic records with the new information? Is it required that you report on the changes of these types of attributes? If not, I would recommend not tracking the changes.
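
    Just to illustrate what I mean by a type-1 correction (names made up here): it's a plain overwrite against every version of the member, so history reads consistently afterwards:

    UPDATE dbo.Dim_Patient
    SET    Ethnicity = 'EthnicityB'
    WHERE  PatientBK = 'P1234';  -- hits all rows for the patient, current and historical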

My next question is about the choice of a temporal fact table: what is the reasoning behind tracking changes to appointments? If it's to see the change in status, would it not be better suited as a type-2 dimension (for appointments) with a factless fact to represent the association between all the different dimension members? I don't know enough about your use case to make that call, but you should have a very compelling reason before going down the path of a temporal fact, as it will introduce a bunch of other complexities in the model... hence the reason for asking.

  • Hi Martin,

Thanks for responding, much appreciated. Let me just provide a little more info:

In relation to tracking changes to the Appointment fact table, there is a requirement to make multiple submissions at various milestones during the month. There is a crucial requirement to be able to reproduce reporting data sets that were used for submission purposes at various dates in the past. There is a concept of a "Flex" and a "Freeze" submission, where the "Freeze" indicates a final version to which no further changes can be made, whereas the "Flex" is open to further changes (these data sets are produced on different dates). Currently the team are literally dumping these data sets into physical tables on the day of submission and naming them using the date they were submitted. I was thinking that making the Fact_Appointment table temporal would negate the need to make these snapshots, as we'd essentially be able to wind back the fact table to any prior date of submission to reproduce the state of affairs as of that day (winding back any updates made to facts in the interim).
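
    For example, instead of the dated physical snapshot tables, reproducing a "Freeze" submission would (I think) just be a point-in-time query like this (date and names illustrative):

    DECLARE @FreezeDate datetime2 = '2022-01-15';  -- the submission date in question

    SELECT  *
    FROM    dbo.Fact_Appointment
            FOR SYSTEM_TIME AS OF @FreezeDate;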

In relation to the Ethnicity attribute being a Type 1, I agree (maybe Gender would have been a better example). After discussing at length with the team earlier, it seems that having the most recent surrogate key (IsActive=1) would be preferred against all facts (past and present). I just feel that this is sacrilege in the dimensional modelling world, as it is essentially altering history daily. Point-in-time reporting seems to be the key requirement here (i.e. being able to reproduce historic data sets for auditing purposes).

    Thanks again for your input,

    N

     

  • Ok... based on your last post above, especially the part where you say there's a "requirement to make multiple submissions at various milestones during the month", here's what I think.  And, to be sure, I don't know how others do it but I can tell you how I maintain PiT History (Point-in-Time History) tables to do most of what you've identified.

    First of all, if you're not plagued with having to report on WHO or WHAT made a modification and the table is narrow enough to do "whole row auditing", then "system-versioned temporal tables" ("Temporal Tables" from here on) are the bomb and super easy to implement and do things like your milestone reporting, etc.

If you DO need to know who or what caused a change, then you still do NOT need the poor man's auditing of CREATED_BY, CREATED_DT, and MODIFIED_DT columns.  You only need a Modified_By column and, of course, you're going to need to maintain that, and also understand that the initial insertion of a row was a modification.  MS didn't make it easy, when it comes to Temporal Tables, to identify the person or thing that made the modification.

    Second of all, if you have or think you need an "IsActive" column, then there's probably something wrong with your design.  Go back to how they designed Temporal Tables and learn the right way to identify the active rows.
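
    For example (assuming Dim_Patient were made system-versioned), the "active" rows need no flag at all, because the base table of a system-versioned table only ever holds the current rows:

    SELECT * FROM dbo.Dim_Patient;  -- current rows only... no IsActive needed

    -- and the rows that were active on any given day come from a point-in-time query:
    SELECT * FROM dbo.Dim_Patient FOR SYSTEM_TIME AS OF '2022-01-05';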

    If you do "roll your own" to include a Modified_By column, do yourself a huge favor and stick with the same design as Temporal Tables whether you want the "history table" to be separate or not (I prefer them to be separate even though it's a fair bit of extra work with a view and possibly and Instead Of trigger).  The history table could actually be page compressed for the CI, the CI should never need to be defragmented, and can frequently even be in a different database to speed up DR restores if it comes to that.  Partitioning by month in the short term and by year in the long term and setting the related file groups (1 file per filegroup) to Read_Only can save a huge amount when it comes to backups.

    As a bit of a sidebar, a lot of people poo-poo Partitioned VIEWs in favor of Partitioned TABLEs.  I'm not one of those people.  I find that there are actually a huge number of advantages to Partitioned VIEWS that others don't consider as well as having most of the advantages of Partitioned Tables but under different "names".  I'll sometimes even make separate databases for individual table partitions because it's easy to include only certain months or years for restores to development environments and make it super easy to "get back in business" and then do restores for the legacy data without having to give up on backups until all the data has been restored like what a lot of folks have to do when they use Partitioned TABLEs.

    Also, to help prevent fragmentation caused by the Modified_By column being updated from a NULL, make sure that you always populate it during the initial insert and not after.

    And, whatever you do, don't use a "before" and "after" column(s) even if you go the route of "column changes only" type of history table.  It's just a doubling of space used.  If that's not possible, then default it to SPACE(N) where N is the number of characters that you would normally expect the column to be updated to.

    Heh... and, for goodness sake, don't make a copy of rows in the history table during inserts.  You're just instantly doubling your space requirements.  The current active rows should always be in the main table and that's it.  If you insert a row, it's the currently active row and it should go into the main table without being copied to the history table.  That original row should only be copied to the history table IF it suffers an update or delete.
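
    As a rough sketch of that "copy only on update or delete" idea for a roll-your-own history table (names and columns are just examples, not a drop-in):

    CREATE TRIGGER dbo.trg_FactAppointment_History
    ON dbo.Fact_Appointment
    AFTER UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- "deleted" holds the before-image of each affected row, which is the only
        -- version worth keeping... the current row already lives in the main table.
        INSERT INTO dbo.Fact_Appointment_History
                (AppointmentBK, PatientSK, OutcomeSK, ContactDateSK, Modified_By, StartDT, EndDT)
        SELECT  d.AppointmentBK, d.PatientSK, d.OutcomeSK, d.ContactDateSK, d.Modified_By,
                d.StartDT, SYSUTCDATETIME()  -- close the period at the moment of change
        FROM    deleted AS d;
    END;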

There's more but that basically describes what I consider when having to do audits and be able to reproduce the state of related tables for a given point in time.  It's mostly like a Type 6 SCD but without the horrors some folks build in.

    And, p.s. ... if you do "roll your own", make the end date for the currently active rows "9998" (equates to 9998-01-01) to give you some headroom for temporal calculations, which real MS Temporal Tables don't do.  And NEVER use NULL for the end date of anything.  It'll make you pay for it in performance almost immediately not to mention making your code more complicated.

    And, almost forgot... if you DO roll your own, you also have the advantage of being able to ignore updates that didn't actually update anything especially when it comes to "column-only" (frequently used on "insanely wide tables") instead of "full row" auditing.

    --Jeff Moden



  • Hi Jeff,

This is very helpful, thanks a million. First of all, just to be clear, I have the IsActive flag on the underlying dimensions and not on the Fact table itself. For my use case I don't need to be concerned with the "who" or "what" made the underlying change, so the stock Temporal table would suit me fine. The only gripe I have with temporal tables for my use case is that they don't automatically tag a row in the history table with an action (e.g. "Delete" or "Update"), but it's not a biggie.

    <b>Question please 🙂 </b>Do you think it would be a valid approach to make all changes in all dimensions a type1 change and then make the underlying dimensions temporal also? I'm thinking a view like the below could be created to query the model.

    Your thoughts are very much appreciated!

    N

CREATE VIEW dbo.vFact
    AS
    SELECT  t1.*,          -- fact columns (includes the fact's own period columns)
            t2.Ethnicity,  -- dimension columns listed explicitly, since a bare
            t2.Gender      -- SELECT * on both tables would duplicate PatientSK
                           -- (and possibly the period column names) and break CREATE VIEW
    FROM dbo.Fact AS t1
    INNER JOIN dbo.Dim_Patient AS t2
            ON t1.PatientSK = t2.PatientSK;
    GO

    DECLARE @Date datetime2 = '2022-01-01';
    SELECT * FROM dbo.vFact FOR SYSTEM_TIME AS OF @Date;

I agree that MS Temporal Tables don't have an explicit marking for the action taken on a row.  It's really easy to determine, though...  (and you probably already know this, but saying it "out loud" for some that may not know)...

1. INSERT: If the base table is set up correctly with a PK, you cannot have duplicates according to the PK.  That means that the first entry according to the system start DT for the row is the row where the INSERT occurred.
    2. DELETE: If a row is deleted, it will be the last row for the PK.  The PK will also no longer be available in the base table, but that's probably only an important consideration if you "roll your own" Temporal Tables.
    3. UPDATE: Quite literally, everything else (see the sketch after this list).
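
    Here's a quick sketch of 1 through 3 in code, assuming a temporal fact keyed on AppointmentBK with a period column named RowStartDT (adjust names to suit):

    WITH AllVersions AS
    (
        SELECT  fa.*,
                ROW_NUMBER() OVER (PARTITION BY fa.AppointmentBK ORDER BY fa.RowStartDT ASC)  AS VerFirst,
                ROW_NUMBER() OVER (PARTITION BY fa.AppointmentBK ORDER BY fa.RowStartDT DESC) AS VerLast
        FROM    dbo.Fact_Appointment FOR SYSTEM_TIME ALL AS fa
    )
    SELECT  v.*,
            CASE WHEN v.VerFirst = 1 THEN 'INSERT'                       -- first version per PK
                 WHEN v.VerLast  = 1
                  AND NOT EXISTS (SELECT 1 FROM dbo.Fact_Appointment AS b
                                  WHERE b.AppointmentBK = v.AppointmentBK) THEN 'DELETE'  -- PK gone from base
                 ELSE 'UPDATE'                                            -- everything else
            END AS InferredAction
    FROM    AllVersions AS v;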

The general PiT pattern view (again, PiT is my shortcut name for "Point-in-Time") you proposed as vFact above is in the right direction, with the understanding that it's a frequent practice to have identical column names in tables, especially since a lot of people accept the default names for the "System_Time" columns.  You'd have to expose those as separately named columns for each underlying table to provide the most flexibility if they're not named differently (since this is relatively new, design it so that they are named differently... perhaps in the pattern of StartDT_Tablename and EndDT_Tablename).
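
    In other words, a shape like this for the view (column names purely to show the pattern):

    CREATE VIEW dbo.vFact_PiT
    AS
    SELECT  f.AppointmentBK,
            f.PatientSK,
            f.RowStartDT AS StartDT_FactAppointment,  -- each table's period columns
            f.RowEndDT   AS EndDT_FactAppointment,    -- exposed under its own name
            p.Ethnicity,
            p.RowStartDT AS StartDT_DimPatient,
            p.RowEndDT   AS EndDT_DimPatient
    FROM dbo.Fact_Appointment AS f
    JOIN dbo.Dim_Patient      AS p ON p.PatientSK = f.PatientSK;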

As a bit of a sidebar, you'll find people that absolutely rail against the use of "SELECT *" in views because of issues that can occur if you change the underlying tables.  I'm NOT one of those people and will kowtow to the almighty "It Depends".  I find that, especially on tables with a large number of columns that all need to be returned for a given view, the use of "SELECT *" is actually more performant than a large list of discrete columns.

So, you have two choices there... either check all views for dependencies after any and all changes to the underlying tables as a matter of "Standard Development Practices" (every Dev group should have a set of such practices) to locate views that will need to be "refreshed" (which can actually be the easiest thing to do), or schema-bind all views (which prevents all possible oversight but is a pain, because then you actually have to drop the view(s), make the change(s) to the table, and then rebuild the views).  Avoiding SELECT * and listing precise column names also may not help you if you're changing something about the listed columns in a view, but "It Depends" is the correct term to describe that, too.

    I DO agree on using a specific list of columns where you have high column count tables and you really only need a few columns for the given view but this pattern for a general PiT view does beg for "SELECT *".

    And, just to warn out loud to everyone, NONE OF THIS WILL STAND UP IN A COURT OF LAW simply because anyone with the right level of privs can change it.  There are other methods that will stand up in Court.

    --Jeff Moden



  • Hi Jeff,

    Points 1,2 & 3 are helpful to me actually as I was trying to achieve the same in a different way but your approach is clear thanks.

    As a general rule I never use Select * in views and usually script them out using the underlying system tables so it takes care of all the heavy lifting in relation to listing out column names. Schema-Binding the views is a good idea thanks for the suggestion.

What I'm trying to ascertain here is whether it's feasible to make the whole model temporal (i.e. all fact tables & dimensions) so it can be rolled back in unison within a view. The other option would be a hybrid approach, keeping the dimensions as they are (i.e. non-temporal, and loaded using a hybrid Type 1 & Type 2 approach). With the aforementioned hybrid approach the fact table could still be wound back (temporal) and referential integrity maintained with the underlying dimensions. I'm leaning towards the "all temporal" approach but I feel uneasy about it. Any thoughts?

Edit: Just to add that all changes in the dimensions would be Type 1 in the all-temporal approach, as when the dimension is wound back the underlying surrogate key will always point to the current active row.

    Thanks again Jeff very much appreciated

    N

If you make all of your fact tables temporal, then any type-2 change in a dimension will have to drive a change to the fact table(s) as well. You can imagine how complicated this can get.

    If you want the option to use a dimensional model to either report at a point in time or current dimension values, using durable keys could help. <shameless plug /> I have a blog post on it here: https://martinschoombee.com/2020/12/01/the-value-of-durable-keys-in-type-2-dimensions/
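
    The short version of the durable-key pattern (names are made up here): keep the point-in-time surrogate key on the fact, and use the durable key to hop to the current version of the member whenever you want as-is reporting:

    SELECT  f.AppointmentBK,
            p_pit.Ethnicity AS EthnicityAsWas,  -- via the fact's surrogate key
            p_cur.Ethnicity AS EthnicityAsIs    -- via the durable key
    FROM    dbo.Fact_Appointment AS f
    JOIN    dbo.Dim_Patient AS p_pit ON p_pit.PatientSK = f.PatientSK
    JOIN    dbo.Dim_Patient AS p_cur ON p_cur.PatientDK = p_pit.PatientDK  -- durable key
                                    AND p_cur.IsActive  = 1;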

To be honest though, the more information you provide, the more it sounds like a good use case for a data vault, where all changes are tracked. You should look into that modelling technique as a potential solution given the requirements you listed here. If you'd like to stay with the star-schema model, I'd opt for daily snapshots instead of temporal fact tables unless data volumes are an issue. If they are, and temporal fact tables are the only/best option, be prepared to drive lots of changes from your type-2 dimensions to your fact tables.

    My two cents anyways, hope some of it is helpful.

  • Hi Martin,

    Thanks for chiming in. Ok let me present this scenario to you just so I'm clear;

    Scenario 1)

    On 01/01/2022 I create and load a dimension and assign all attributes as a Type1 change and make the dimension temporal. On this day Attribute(A) has a value of "AAAA" and a surrogate key of 1234. I then load the Fact table and make it temporal (currently all Fact rows joining to SK=1234 would return a value of "AAAA" for Attribute(A))

    On 02/01/2022 There is a change to the aforementioned dimension which triggers a Type1 overwrite of Attribute(A) to "BBBB". There is no update to the underlying Fact as the surrogate key remains unchanged (SK=1234). Querying the Fact table as of this day will return a value of "BBBB" for Attribute(A). However if I wound back the Fact table & Dimension to the previous day (using temporal table functionality) then the surrogate key of 1234 would again point to "AAAA" which is historically correct.
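
    So the Scenario 1 wind-back would just be something like this (illustrative names):

    DECLARE @AsOf datetime2 = '2022-01-01';

    SELECT  f.AppointmentBK,
            d.AttributeA  -- SK 1234 resolves to 'AAAA' as of this date, 'BBBB' today
    FROM    dbo.Fact_Appointment FOR SYSTEM_TIME AS OF @AsOf AS f
    JOIN    dbo.Dim_Example      FOR SYSTEM_TIME AS OF @AsOf AS d
            ON d.ExampleSK = f.ExampleSK;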

    Scenario 2)

    On 01/01/2022 I create and load a dimension and assign Attribute(A) as a Type 2 change (I don't make the dimension temporal). On this day Attribute(A) has a value of "AAAA" and a surrogate key of 1234. I then load the Fact table and make it temporal (currently all Fact rows joining to SK=1234 would return a value of "AAAA" for Attribute(A))

    On 02/01/2022 There is a change to the aforementioned dimension which triggers a Type2 change and a new dimension row with a surrogate key 1235 is created. Under my proposed methodology the Fact table would be updated and all keys previously pointing to a surrogate key of 1234 would get updated to a surrogate key of 1235 (i.e. IsActive=1). However if I wound back the Fact table to the previous day (using temporal table functionality) then the Fact table would reflect the older surrogate key of 1234 and return a value of "AAAA" for Attribute(A)

I don't think either of the above scenarios would be complex to implement and manage, to be honest. As our dimensions don't change a whole lot, there aren't that many Type 2 changes, so traffic isn't that big. I'm trying to understand if I'm losing anything by choosing Scenario 1 over Scenario 2. Scenario 2 is a kind of hybrid approach, whereas Scenario 1 is all-in on the temporal functionality.

Just reading up on your blog tonight, thanks for the link.

Edit: Just thinking that Scenario 1 would probably dramatically reduce the size of the Fact history table (temporal), as updates would be significantly fewer.

    Cheers,

    N

  • I think you're getting confused with some of the terminology (or I am). Type-1 dimensions do not track changes and are not temporal. Type-2 dimensions track changes and are temporal by definition.

Not 100% sure what you mean by "winding back" the fact table, and your scenario 2 is really just describing what a snapshot fact table would do. But if you update the fact table's surrogate key as you describe in scenario 2, how would you go back to a previous version? The only thing you can do in that case is to add a new record to the fact table that uses the surrogate key of the new dimension record, i.e. pointing to the new version. Again, this is really just describing what a snapshot fact will do, and I would go down that road unless data volumes are an issue.

    If you need to be able to see the version of your fact and dimension tables on any given day, i.e. for auditing purposes, then you're really talking about a data vault and I would recommend that as an option.

     

Edit: If you're talking about the specific feature of temporal tables in SQL Server, then we're talking about something completely different. Could you in theory use that feature to essentially implement change tracking on all your tables? Yes, if you can live with the restrictions... but my question would be: why? If it's because of auditing requirements, why not just use something like CDC in SQL Server?

  • Hi Martin,

If the dimension is temporal (by temporal I mean a system-versioned temporal table), then history would be tracked automatically. Having a Type 1 change would mean an overwrite, but the table could be "wound back" (i.e. using the system-versioned syntax introduced in SQL Server 2016, "FOR SYSTEM_TIME AS OF") to return the table to the state it was in on any particular day in the past. Hence no need for snapshots, as that is all handled under the hood by SQL Server. The reason to use temporal tables is having to meet submission milestones on a continual basis and having to be able to reconcile back to those datasets for auditing purposes. CDC would identify what the changes are but wouldn't allow us to wind back, if that makes sense.

    Thanks,

    N

     
