Fact table with hundred of columns - Any alternative?

  • Hi everyone,

    I am designing a accumulating snapshot fact table to evaluate the performance of the product forecasts since they are approved until they are purchase.

    The process starts when the line managers estimate how many units they expected to be produced and then there are several process, in which some other calculations are done, like the sourcing costs, margins, expected demand, etc.

    There are also some milestones date, when the figures need to be frozen and reported.

    Business wants to have all of the parameter for the calculations, which are really a lot, around 120 (sign off price , Source Price, royalty fix, royalty variable, sales markup, target price, true target price, quoted price, retail markup, duty base, etc, etc, etc. )

    I have to big dilemmas:

    1- Should I really have to add all of theses parameters in the fact table and create a monster in our data mart. Business wants to be able to find this info but to be honest, the reports shown only the final calculations based on the margin and volumes and I can't imagine an user, playing with all these parameters in a cube.

    2- For the frozen values: I need to historized or version the rows in the fact table using valid from and to columns, because business wants to know the value of a forecasted volume at the miles stone date, but before and after this date this value can be changed and they also want to know the current value. How you would approach this?

    In my opinion we are always doing the things too complicated and in the end business are not gaining real value from our applications.

    Any comment would be appreciated.

    Kind regards,

    Paul Hernández
  • Ask yourself whether some of these cost and price related columns should really be attributes in a dimension, not measures on a fact table.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Paul Hernández (10/12/2016)


    ...I am designing a accumulating snapshot fact table...

    the line managers estimate how many units they expected to be produced and then there are several process, in which some other calculations are done, like the sourcing costs, margins, expected demand, etc.

    There are also some milestones date, when the figures need to be frozen and reported.

    Business wants to have all of the parameter for the calculations, which are really a lot, around 120 (sign off price , Source Price, royalty fix, royalty variable, sales markup, target price, true target price, quoted price, retail markup, duty base, etc, etc, etc. )

    I have to big dilemmas:

    1- Should I really have to add all of theses parameters in the fact table and create a monster in our data mart. Business wants to be able to find this info but to be honest, the reports shown only the final calculations based on the margin and volumes and I can't imagine an user, playing with all these parameters in a cube.

    2- For the frozen values: I need to historized or version the rows in the fact table using valid from and to columns, because business wants to know the value of a forecasted volume at the miles stone date, but before and after this date this value can be changed and they also want to know the current value. How you would approach this?

    Fact tables are usually narrow (not many columns) and long (have many many rows). Dimensions are usually wide (many columns) and short (fewer rows). For a fact table, you typically only want it to have the columns of the actual numbers that are going to be aggregated (called measures) and foreign key values pointing to your dimension tables. Dimension tables "describe" the data in the fact table, and a lot of the attributes you mention seem like they'd be in a slowly changing product dimension.

    There's a lot of good information about designing your fact and dimension tables at:

    http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/

  • Chris makes a good point. It's unclear from your initial post what a single row in the fact table represents. A transaction? A product price snapshot?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • For your second question, it seems like the scenario is very similar to a temporal fact table. As the other gentlemen mentioned, it seems like most the components for the calculations should already be stored in other fact or type-2 dimensions.

    My recommendation would be to ensure that each attribute is where it belongs...after that, the accumulating fact for the forecast really just contains what it needs to and in the format of a temporal fact table.

  • Hi everyone,

    thank you very much for your contributions.

    At the end the key point was to figure out as you mentioned, what is intended to be aggregated, the other let's say static or almost static parameters are stored as dimension attributes.

    The values in the fact table are also historized in the DW because the source system is not able to do it.

    We can transfer to our data mart different snapshots of this history and the meet all of the reporting requirements.

    Kind regards,

    Paul

    Paul Hernández
  • If you havn't already, then read Kimball's book "Data Warehouse Toolkit", particularly chapter 2 which covers various standard approaches to fact and dimensional table design.

    https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Definitely you need to know which attributes relate to Facts and which relate to Dimensions.

    As a starting point, if the attribute is numeric and the aggregated value of the attribute makes sense then it is likely to be part of a Fact. If the aggregated value does not make sense then it is likely to be part of a Dimension. If you SUM(OrderValue) you get a meaningful number and OrderValue is likely to be part of a Fact. If you SUM(DiscountPercentage) you do not get a meaningful number and DiscountPercentage is likely to be part of a Dimension. However AVERAGE(DiscountPercentage) can be meaningful, so you need to be careful about your logical design.

    Translating logical design to physical design should give a different result depending on if you are using row-format storage or column-format storage. As ever with database design, a guiding principle is to get a design that gives best performance.

    Row-format is what most people are used to, and is the traditional way of storing data in SQL Server. An entire row is stored on one page. For physical design you need to balance the cost of joins against the cost of processing the same data from disk multiple times. The end result is often a optimised Star Schema, where Dimensions and Facts have separate tables. The Fact table holds the code of the Dimension, and the Dimension holds the textual description. You may read 1m Fact pages but only 50 Dimension pages. The cost of the join is normally (far) less than the cost of including the textual description in the Fact, storing that 1m times, and reading the same set of 50 descriptions 1m times.

    The situation is very different in column-format. The database will only store unique column values, and maintains a set of vectors to say which value belongs to each row. In this situation the balance between cost of joins and cost of processing the same data from disk multiple times is much different compared to row-format. In column-format storage you are likely to still end up with a Star schema, but you are also likely to have merged many dimensions into the Fact tables. If you have 1m Fact rows and 50 Dimension rows, there is no storage cost penalty or IO penalty in merging the entire contents of those 50 Dimension rows into the Fact table. You may have 1m Fact rows but each of the 50 unique values from the Dimension is only stored once. When you run your query, the cost of retrieval from a single table will typically be lower than the cost of doing a join.

    Remember, there are no absolute rules with identification of Fact and Dimension attributes in logical design, and no absolute rules on how you should move from logical to physical design. You need to make informed decisions at each point, and be prepared to review your design when you hit Einstein's saying "In theory, theory and practice are identical. In practice, they are not". It depends.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • That you know it's an accumulated snapshot table shows you have some grasp of this. I'm assuming that this is correct, though from your description, it sounds correct. If so, you need to realize that contrary to one of the comments your fact table can be quite wide, even wider than the dimensions, and will likely have more dimensions associated with it than a transaction or balance fact table. Just the number of dates typically associated with this type can be numerous. You've already identified several of these. Only a data analysis can determine what items belong in the fact table and which are related in a way that they can be put in a dimension. If you have a lot, you may want to do some reading on junk dimensions. For dimensions for which there aren't many choices (think true/false or yes/no flags), these can be combined together even if they aren't strictly speaking related. The dimension table has every possible combination of the data, but in the dimension each is a separate attribute.

    In my opinion we are always doing the things too complicated and in the end business are not gaining real value from our applications.

    This is always a risk, and I've seen this before. You deliver what was asked in the simplest manner possible, but it is considered too difficult to use. You may have to help build workbooks for the users. Really a case of if you build it, they will use it.

  • For the frozen values: I need to historized or version the rows in the fact table using valid from and to columns, because business wants to know the value of a forecasted volume at the miles stone date, but before and after this date this value can be changed and they also want to know the current value. How you would approach this?

    Sorry I missed this. I assume that you're concerned about changes after each respective milestone date, as changes before a given date would not matter. At first blush I would have a column with the milestone date and a second for the data point at that moment. You would need the same column for that data point that would be current and could be changed of course. I don't know how many data points you need to preserve, and if it's a lot, that could be a problematic solution. But the grain of an accumulated snapshot generally represents something undergoing a process. This seems to me what you are describing.

  • As I see things like target price, target something ; plus sales markup and the like I think about the deal dimension and the promotion dimension that Kimball outlines in his book (The Data Warehouse Toolkit). I think those similar attributes (I am guessing attributes, refer to the prior post on if "can you sum it?") can be sourced out to their own dimension. Something like a Sale dimension or a target dimension that can host the group of related attributes.

    ----------------------------------------------------

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply