Fact table design issue - please advise!

  • Hi,

    I am trying to design a star schema (and OLAP cubes) using SQL Server relational tables.

    I am trying to determine what would ideally be the 'grain of the transaction', based on the Kimball design methodology.

    On the OLTP side there are currently two tables that appear as candidates; let's call them table #A and table #B. The problem is that the user inputs data into table #A. Table #B takes that input data, performs calculations on it and stores the results at a more granular level. Which one should I build my fact table from, table #A or #B... or is there something else, design-wise, that I am missing?

    The process is: the user inserts into #A -> triggers/lookup tables insert values into table #B for each week in the date range (between Startdate and Enddate) -> table #B is then used for reporting.
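For context, a minimal sketch of how that weekly expansion from #A into #B might be done (assuming a recursive CTE; the actual trigger/lookup-table logic is not shown in the post, and only the date columns are populated here):

```sql
-- Hypothetical sketch of the weekly breakdown (the real trigger/lookup-table
-- logic may differ): expand each intake row in #A into one row per week.
;WITH Weeks AS
(
    SELECT a.IntakeKey, a.Startdate AS WeekStart, a.Enddate
    FROM   #A a
    UNION ALL
    SELECT w.IntakeKey, DATEADD(week, 1, w.WeekStart), w.Enddate
    FROM   Weeks w
    WHERE  DATEADD(week, 1, w.WeekStart) <= w.Enddate
)
INSERT INTO #B (CourseIntakeKey, Startdate, Enddate)
SELECT IntakeKey, WeekStart, DATEADD(day, 6, WeekStart)
FROM   Weeks;
```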

    The structure of table #A:

    Create table #A(
    [IntakeKey] [int] IDENTITY(1,1) NOT NULL,
    [CourseID] [int] NULL,
    [PlanID] [int] NULL,
    [Startdate] [datetime] NULL,
    [Enddate] [datetime] NULL,
    [CoursePeriod] [int] NULL,
    [Learners] [int] NULL,
    CONSTRAINT [PK_CourseIntake] PRIMARY KEY CLUSTERED
    (
    [IntakeKey] ASC
    ) )

    Insert into #A
    ([CourseID],[PlanID],[Startdate],[Enddate],[CoursePeriod],[Learners])
    select '1','1','2010-06-01 00:00:00.000','2010-07-03 00:00:00.000','8','1'
    union all
    select '2','1','2010-06-01 00:00:00.000','2010-07-04 00:00:00.000','8','2'

    The structure of table #B (there are calculations that use the date period between the start and end date from table #A, together with Learners, and then break the data up for every week between the start and end date):

    Create table #B(
    [WeeklyCountID] [int] IDENTITY(1,1) NOT NULL,
    [CourseIntakeKey] [int] NULL,
    [LocationID] [int] NULL,
    [LearnersPerweek] [int] NULL,
    [LearningSpecialistsPerweek] [decimal](5, 2) NULL,
    [CoachesPerWeek] [int] NULL,
    [LearningSpecialistsPerMonth] [int] NULL,
    [CoachesPerMonth] [int] NULL,
    [LearnersPerWeekCV] [decimal](5, 2) NULL,
    [LearningSpecialistsPerWeekCV] [decimal](5, 2) NULL,
    [CoachesPerweekCV] [int] NULL,
    [Startdate] datetime,
    [Enddate] datetime,
    CONSTRAINT [PK_WeeklyCountID] PRIMARY KEY CLUSTERED
    (
    [WeeklyCountID] ASC
    ) )

    Insert into #B
    ([CourseIntakeKey],[LocationID],[LearnersPerweek],[LearningSpecialistsPerweek],[CoachesPerWeek],[LearningSpecialistsPerMonth],[CoachesPerMonth],[LearnersPerWeekCV],[LearningSpecialistsPerWeekCV],[CoachesPerweekCV],[Startdate],[Enddate])
    select '1','1','1','1','0','10','10','0','0','0','2010-06-01 00:00:00.000','2010-07-03 00:00:00.000'
    union all
    select '1','1','1','1','0','10','10','0','0','0','2010-06-08 00:00:00.000','2010-07-10 00:00:00.000'
    union all
    select '1','1','1','1','0','10','10','0','0','0','2010-06-15 00:00:00.000','2010-07-17 00:00:00.000'
    union all
    select '1','1','1','1','0','10','10','0','0','0','2010-06-22 00:00:00.000','2010-07-24 00:00:00.000'
    union all
    select '2','1','2','1','0','10','10','0','0','0','2010-06-02 00:00:00.000','2010-07-04 00:00:00.000'
    union all
    select '2','1','2','1','0','10','10','0','0','0','2010-06-09 00:00:00.000','2010-07-11 00:00:00.000'
    union all
    select '2','1','2','1','0','10','10','0','0','0','2010-06-16 00:00:00.000','2010-07-18 00:00:00.000'
    union all
    select '2','1','2','1','0','10','10','0','0','0','2010-06-23 00:00:00.000','2010-07-25 00:00:00.000'
    union all
    select '2','1','2','1','0','10','10','0','0','0','2010-06-30 00:00:00.000','2010-09-01 00:00:00.000'

    Hope that is enough information...let me know.

  • Hi,

    A good principle to follow when designing fact tables is to always store the data at the most granular level. You can always aggregate up, but never down, and you can create aggregated/summarized fact tables if it makes sense.

    Another point in support of this recommendation is that you are already doing all the work in the OLTP database to perform some of those lookups and calculations. It therefore doesn't make sense to do all of that again in the DW or reporting layer.

    Hope this helps.

    Martin.
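(To illustrate the aggregate-up point: if the fact table is kept at the weekly grain, a monthly summary is a simple rollup. FactLearnersWeekly and its columns are assumed names here, not tables from the original post.)

```sql
-- Hypothetical rollup of a weekly-grain fact table to a monthly summary.
SELECT CourseIntakeKey,
       YEAR(Startdate)      AS [Year],
       MONTH(Startdate)     AS [Month],
       SUM(LearnersPerweek) AS LearnersPerMonth
FROM   FactLearnersWeekly
GROUP BY CourseIntakeKey, YEAR(Startdate), MONTH(Startdate);
```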

  • Martin Schoombee (11/5/2010)


    Hi,

    A good principle to follow when designing fact tables is to always store the data at the most granular level. You can always aggregate up, but never down, and you can create aggregated/summarized fact tables if it makes sense.

    Another point in support of this recommendation is that you are already doing all the work in the OLTP database to perform some of those lookups and calculations. It therefore doesn't make sense to do all of that again in the DW or reporting layer.

    Hope this helps.

    Martin.

    From what I gathered in his post, the calculations only occur at the higher grain level, not the lowest. So he loses those calculations at the lower granularity. Maybe I misunderstood?

    But I completely agree that using the lower granularity is always the best option. It allows for the most flexibility. If management wants to start seeing daily reports you would have to make significant modifications and/or redesign your solution and rebuild your fact table. Or if you want to use that data to build additional fact tables you could run into problems if you are trying to use the summarized data.

    You definitely could leverage the code that creates the calculations and modify them for use on a "non-summarized" fact table.

    If you do want weekly or monthly reports, then I would suggest building DimWeek and DimPeriod (month) dimensions that contain the week and month date values; you could then build additional fact tables that store your summarized values for your weekly and monthly periods. This is especially helpful if you are dealing with large amounts of data - having the summarized fact tables can help with performance.
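A DimWeek along those lines might look something like this (a sketch; the table name, column names and the smart-key format are assumptions, not anything from the original post):

```sql
-- Hypothetical week dimension; grain = one row per calendar week.
CREATE TABLE DimWeek (
    WeekKey       int      NOT NULL PRIMARY KEY, -- e.g. 201023 = year * 100 + week number
    WeekStartDate datetime NOT NULL,
    WeekEndDate   datetime NOT NULL,
    [Year]        int      NOT NULL,
    WeekOfYear    int      NOT NULL
);
```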

  • tmitchelar (11/5/2010)

    From what I gathered in his post, the calculations only occur at the higher grain level, not the lowest. So he loses those calculations at the lower granularity. Maybe I misunderstood?

    At first I thought so too...but it looks like table "B" breaks the period into separate weeks, and the counts for a specific week.

  • Martin Schoombee (11/5/2010)


    tmitchelar (11/5/2010)

    From what I gathered in his post, the calculations only occur at the higher grain level, not the lowest. So he loses those calculations at the lower granularity. Maybe I misunderstood?

    At first I thought so too...but it looks like table "B" breaks the period into separate weeks, and the counts for a specific week.

    That's right: table #B is at a more granular level than table #A (one record for every week in the date range between the start and end dates).

    I guess the only question in my mind now is: should I stop using table #B at all when I create the OLAP framework, since the primary purpose of table #B (at the more granular level) is for OLTP reports? And for OLAP-based reports, should I stick to table #A?

    My reasoning is that a table #A record is the actual transaction stored in the table, and that supposedly should define the fact table record. However, I'm not even sure how I'd build the star schema to accommodate the granularity yet, or even whether it's possible.

    As I think through this, I am leaning heavily towards using table #B, as it is at the most granular level.

    Thanks for your inputs

I guess the only question in my mind now is: should I stop using table #B at all when I create the OLAP framework, since the primary purpose of table #B (at the more granular level) is for OLTP reports? And for OLAP-based reports, should I stick to table #A?

    Good point. If the plan is really to get rid of table #B, then it would make more sense to try to use table #A as the source for your OLAP query. Sometimes, though, we end up creating a table similar to table #B through some ETL process for better performance or better visibility.

    I would say give it a shot from table #A first, and if you get to the point where you realise it is not feasible, use table #B. I am assuming, of course, that you have some time to play around with this 🙂

  • Pac123 (11/5/2010)


    That's right: table #B is at a more granular level than table #A (one record for every week in the date range between the start and end dates).

    Here is the root cause of all the confusion. It is the other way around: if table "B" is an aggregated version of the data in table "A", then table "A" is more granular than table "B".

    Use table "A" as the source of your ETL to populate your FACT_tableA factual table.

    Always remember, FACT comes from "factual", facts are not aggregations but granular events.

    Hope this helps.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are given. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (11/5/2010)


    Pac123 (11/5/2010)


    That's right: table #B is at a more granular level than table #A (one record for every week in the date range between the start and end dates).

    Here is the root cause of all the confusion. It is the other way around: if table "B" is an aggregated version of the data in table "A", then table "A" is more granular than table "B".

    Use table "A" as the source of your ETL to populate your FACT_tableA factual table.

    Always remember, FACT comes from "factual", facts are not aggregations but granular events.

    Hope this helps.

    Table #B is more granular, meaning that it has records for every week within a date range. While table #A will have one record indicating 10 learners for the date range from the 1st of Nov to the 30th of Nov, table #B goes one step further and would have 4 records with 2.5 learners in each week. (Tables #A and #B are related by the intake key in table #A.)

    The users want reporting at that granularity (weekly), whereas the transaction itself can have a date range of a month (or 1.5 months, 2 months, etc.)... hence the need for table #B.

    But from what you say, it sounds like if table B is the most granular, the fact table should be built based on that.
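The allocation described above (10 learners over 4 weeks becoming 2.5 per week) is essentially the measure divided by the number of weeks in the range, something along these lines (a sketch only; the real trigger logic is not shown in the thread):

```sql
-- Hypothetical even allocation of Learners across the weeks of the intake.
-- NULLIF guards against a zero-length date range.
SELECT IntakeKey,
       CAST(Learners AS decimal(9,2))
           / NULLIF(DATEDIFF(week, Startdate, Enddate), 0) AS LearnersPerWeek
FROM   #A;
```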

  • Pac123 (11/8/2010)


    PaulB-TheOneAndOnly (11/5/2010)


    Pac123 (11/5/2010)


    That's right: table #B is at a more granular level than table #A (one record for every week in the date range between the start and end dates).

    Here is the root cause of all the confusion. It is the other way around: if table "B" is an aggregated version of the data in table "A", then table "A" is more granular than table "B".

    Use table "A" as the source of your ETL to populate your FACT_tableA factual table.

    Always remember, FACT comes from "factual", facts are not aggregations but granular events.

    Hope this helps.

    Table #B is more granular, meaning that it has records for every week within a date range. While table #A will have one record indicating 10 learners for the date range from the 1st of Nov to the 30th of Nov, table #B goes one step further and would have 4 records with 2.5 learners in each week. (Tables #A and #B are related by the intake key in table #A.)

    The users want reporting at that granularity (weekly), whereas the transaction itself can have a date range of a month (or 1.5 months, 2 months, etc.)... hence the need for table #B.

    But from what you say, it sounds like if table B is the most granular, the fact table should be built based on that.

    Sorry, but Paul B is right. How is table #B more granular? The column names would appear to give the clue, if not the smoking gun: "Weekly". Table #B is UPDATED by triggers on table #A and is AGGREGATING the data. QED.

    Therefore, a fact table based on table #A is the appropriate base point, as it should allow drilldown to the individual record (where/if required). One MAY have a summary table similar to table #B in a constellation schema, for example, for certain reports - but with the MS BI stack I'd avoid doing this in general, as it does not provide aggregate awareness in the same way a Business Objects universe does.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.
