Questions about Fact Table structure

  • Hi Everyone,

    I created a data warehouse allowing to analyze data from the commodity trading business.

    I have multiple fact tables which are refreshed every night and for some customers up to 3 times per day.

    In all the fact tables, I have some measures and dimension keys and always a specific column called "ValuationDatecode" which store the date of the previous business day when the data warehouse is refreshed nightly or the current date when it is refreshed within the day.

    All the fact tables are behaving like snapshot fact tables, meaning that every single I store all fact rows...and as you can imagine it grows quickly. Most of the time I don't need to analyze data from the previous run and there is already a clean process in place which is deleting all records from previous valuation date and only keeping the ones from each end of month + the last 2 working days.

    Here was for the presentation of the context... So my question is: is this design correct?

    I was thinking to have fact tables behaving more like a slow changing dimension, like this I will store only records for which a measure has changed or dimension key as changed. It would a kind of incremental process...

    I would add a "IsCurrent" column and "StartDatetime" and "EndDatetime" columns to the fact tables...

    What do you think? Do you have any suggestions?

    Thanks

     

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Why would people working with commodities only be interested in only the "end of each month + 2 days"?  And if you're doing it to condense the data for yearly views, why would you have the +2 days?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The purpose of a data warehouse is to be able to query the data quickly. If you only insert values that have changed with start and end dates, the table will by much slower to query.

  • You are asking if this approach is a good design, and I would have to say, based on what you present here (which isn't much) it is not a good design approach.

    The three columns you suggest at the end are used for 2SCD type dimensions.  These kind of dimensions track the history of a dimension.  As an example, if you have a stores dimension and your stores are group in regions.  If a store is moved to a different region and you have a 1SCD dimension, the value is simply updated.  But all the fact measure group (FMG) rows associated with that store are now associated only with the new region.  A 2SCD dimension would close down the previous row by giving a value to the EndDateTime and IsCurrent is now zero.  A new row is created with all the same values except for the region.  The StartDateTime value matches the EndDateTime value of the previous row for that store and IsCurrent is one for that row.

    You sound like your ETL is taking a "flush and fill" approach.  This is the easiest and fastest ETL to design.  But as your data grows, it takes more and more time.  If you take the approach that I think you're taking, namely to load only the most recent data, then you are missing out on the best thing about a data warehouse: the ability to determine useful patterns over time.  Year over year and so one.  Incremental ETLs are much harder to design.  The are harder than designing the cubes.  But they are essential to a functioning long term data warehouse design.

    If you haven't already I would suggest you get and read one of Ralph Kimbell's books on the subject.  He's retired now and not writing anymore.  But his books have held up well.

  • Thank you very much to all of you for your first replies.

    I understand that loading the fact table with all the data every day is the easiest approach, however it takes a lot of disk space.

    Most of the time our customers are looking at data with current status, except from the accounting department which is interested by looking data at the end of month whereas we are already the 10th of the next month. They are interested to see data including corrections/changes which occurred after the cut-off date.

    My idea is if the data did not change why to load them again in the daily data warehouse refresh? I'm not very experienced in data warehouse design and I'm surely misunderstanding one point...

    Would you have any suggestion of design?

  • That is not enough information to suggest a design, and coming up with a design takes many hours of analysis and usually interviews to some degree.  I would suggest you start reading with the Kimball books  "The Data Warehouse Toolkit" is a good starting book.  We can help with some questions but asking for a design suggestion seems too much.  If you propose a design and ask some narrow questions we may be able to be more help.

  • Thanks RonKyle! I started to read the book few days ago already (Chapter 5 already 🙂 )

    In fact I read that there are 3 types of fact tables Transactional, Snapshots and accumulating snapshots. Currently in my data warehouse I do snapshots (every day) but it takes too much space and it is not used indeed.

    I did it because for me it was simplier, but now I'm looking to find another design which allow me to save disk space, and I am considering the Transactional type...

    From my first search, it looks like it behaves like a SCD type 2

  • You seem to be confusing fact and dimension tables.  A fact table cannot be 2SCD.  The "D" is 2SCD stands for dimension.

    That you do this as a snapshot does NOT make it a snapshot fact measure group table.  That is generally for inventories, balances and so on where the data is the status of the items on a given day.  A transactional FMG table is for things that are one and done.  Purchases, phone calls, etc.  An accumulating snapshot is for things that have a fixed life span and go through changes during their lives from creation to completion, such as work orders.  The name has NOTHING to do with the ETL method.

    You should re-read that section again more carefully.  Also re read the difference between the fact measure group tables, which measure a business process at a particular level and almost always contain quantifiable metrics, and dimensions, such as date, product, salesperson, things that are connected to the business processes measured by the fact measure groups.

  • Thanks RonKyle, actually I am not confusing fact and dimension and the explanations you provided me are exactly what I understood.

    However I am more confused about the ETL part and how the Fact tables should be loaded. What I am thinking to do:

    Since a dimension can evolve or a fact can change, the ETL should run every day, then if a fact value change or if a dimension key change within the fact table I should insert a new record.

    Do you agree with that?

  • No, I do not agree.

    Regarding the fact, if it can change, it is probably an accumulating snapshot.  You should update the changes.

    Regarding the dimension, you need to determine if tracking the historical change has business value.  If it does, then you close the old record and create a new one.  If it does not, and most of the time it does not, you simply overwrite the old data with the new.

    [edited to correct the misspelling of data]

    • This reply was modified 1 year, 5 months ago by  RonKyle.
  • I agree that tracking changes on every attribute of a dimension is not always relevant that's why sometimes I am using SCD7 (a mix of type 1 and 2)

    However I don't understand when you say "If it does not, and most of the time it does not, you simply overwrite the old date with the new"... Which date are you talking about?

    Would you have an example of any simple fact table?

  • If any part of the dimension is a type 2, it's probably a 2, and only a two.  An SCD7 would be difficult to implement (it didn't even exist until his 3rd edition book) and as you definitely seem like you are starting out, I wouldn't worry about it.  As you intend to flush and fill and only load two days of day, I can't even imagine why you would need a two as you are not preserving the old rows in the fact table.

    The sentence that confused you should have read "overwrite the old DATA with the new" not date.  I will edit the comment.

    The Kimball books have more than sufficient examples and it doesn't seem likely that providing you yet one more example is going to make a difference.  In the absence of context it may make it more confusing.  If you post one fact table design, I will take a look at it, but make no promises.  I'm glad to answer narrow questions on design.  But I'm not going to design it for you.

  • I was not asking to do the design for me but an example. Anyway that's not a problem,  and I will find the right solution/design and how to implement it.

    That's my first data warehouse so it's quite normal having some misconceptions and errors 🙂

    I will go-ahead with the design and probably I will post an example of one of the fact tables.

    Thanks again for your help.

  • I don't doubt that.  I remember my first data warehouse.  But I got my grounding by reading the Kimball books.  While I'm happy to teach my colleagues, I'm able to give them face to face instruction.  It's simply not possible to pass that kind of knowledge in this kind of format.

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply