Modelling grain and fact table key

  • Hi,

    I'm new to datawarehousing and am really working through a lot of concepts with no guidance or mentor. I have a question about modelling the fact table grain. I seem to have reached a contradiction however I may well be misunderstanding this concept completely:-

    As a simple example, a customer buys products on a certain date, so my dimensions are

    - Date

    - Customer

    - Product

    If my source is sales orders (headers and lines), the customer is attached to the header, the product is attached to the lines. I think that my grain is therefore the product on the sales line.

    A customer may place multiple orders for a product within a given day e.g

    Date, Customer, Product, Qty

    D1, C1, P1, 100

    D1, C1, P1, 500

    D1, C1, P1, 100

    In my fact table, should this be 3 separate fact rows as above, or one row with an aggregated quantity as below

    D1, C1, P1, 700

    The contradiction is as follows :-

    - The fact table should not store aggregated data. Then the first example with 3 rows honours this.

    - The primary key of a fact table should be a combination of all dimensions. The one row with a Qty of 700 honours this - but this would be aggregated data. It also limits my options for gracefully adding other dimensions to the fact table in future, e.g if 2 sales managers sell to the same customer.

    Is this a question that has one answer?

  • If you want to capture data at order level then you should capture enough information to identify individual orders: an order number for example. If there are no attributes of an individual order other than the order number then put the order number into the fact table (this is sometimes called a "degenerate dimension" meaning a dimension without non-key attributes).

  • If you're not interested in the order line level - and only then - you can aggregate the data to one row in your example.

    This is similar to aggregating measurements at the hour level up to the day level, because you're not interested in anything below one single day.

    However, typically I would create a fact table at the lowest level and a 2nd level fact table with the aggregated data. Just in case.

    If you are interested in the order line level, you need to add a degenerate dimension like sqlvogel suggested.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Just one small remark:

    henryKrinkle (10/2/2016)


    - The primary key of a fact table should be a combination of all dimensions.

    This should be:

    The primary key of a fact table should be a combination of all required dimensions (+ perhaps degenerate dimensions if needed).

    Not all dimensions are needed in the primary key. You can have optional dimensions as well, such as a promotion dimension (or maybe a weather dimension).

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Hi,

    Thank you both for your replies, I really appreciate your input. I see now there is no set rule and this is more of a requirement question. So even if the business requirement is at product level, aggregating this would limit the usefulness of the facts assuming the granularity may change in future?. I do have a further query that may be specific to my unfortunate circumstances, would you mind commenting if I am on the right train of thought? :-

    I got all excited as I thought something had really sunk in regarding this query but......

    I have profiled the source system and we have a bug where the line number per order can be duplicated for imported orders e.g a single order my have line 2 x 10 occurrences, even if the product choice is different :-(. I want to try and avoid bringing in the unique id from the sales line table (they are bigints and would be useless in the model) , so my proposed solution is to also bring in the timestamp the line was created down to second level (as a date/time key split YYYYMMDD MMSS) which would make each row unique and allow me to generate the PK across all dimensions, while maintaining the level of granularity to order line level and not having to aggregate any data.

    So as an order line cannot be created for the same order at the same time, the time dimension is what would bring uniqueness to this odd scenario my final design would become :-

    Date, Time, Customer, Product, SalesOrder (Degenerate), SalesLine (Degenerate), Qty

    D1, T1, C1, P1, ORDER1, LINE1, 100

    D1, T2, C1, P1, ORDER1, LINE1, 500

    D1, T3, C1, P1, ORDER1, LINE1, 100

    I guess you could also argue that the line number Degenerate may not be necessary now as well.

    (I am of course going to submit a bug fix to the application team!!)

    Thanks in advance.

  • Hi Henry,

    you mentioned that orders are imported? Maybe you can add the source as well as a degenerate dimension?

    For example, Source = S (system) for normal orders and Source = I (imported) for imported orders.

    If that's not possible, a timestamp might work indeed, although there might still be a very small chance that the timestamps are the same as well.

    Or are you 100% the imports are always done at a later point in time?

    I would not drop the degenerate dimension order line, even though it's not always unique.

    Such a column is very useful for debugging purposes.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Hi SSCoach,

    The orders are imported into the same transactional system and the system gives each imported order a unique internal sales order number.

    I see your point about the timestamp, this however would be very unlikely as it only allows 1 user to edit 1 sales order in 1 session . Profiling the system reveals no duplication down to timestamp minutes, which is a good sign, although I would go to seconds as they are available to me.

    I will keep the degenerate line number as at some point I am hopeful it will be correct and I agree will be very useful.

    Thanks again for all the input everybody, I feel like I have bridged the gap between declaring the grain, the physical design and the realities of what may be available to achieve both correctly.

    Henry

  • I want to try and avoid bringing in the unique id from the sales line table (they are bigints and would be useless in the model

    I would almost certainly keep this, and have unless it's an aggregated table. It's good that you seem conscious of the need to keep the row as small as possible and it's too bad the key is a bigints if that's really not necessary. You could downgrade it to an integer in the fact table though. I'd also include the order number. As someone has already said, this would be useful for debugging.

  • I forgot to mention that I have GUIDs in the source database that I bring into but store in a separate table so that they are not part of the warehousing solution. But in Kimball's 3rd edition book, he suggests using views for the fact and dimension tables. I've been experimenting with these and like the result. It's occurred to me that I could put the GUIDs in the fact table but leave them out of the view. This seems like it would be the best of both worlds. You could put the bigints in the table, but then base the fact table on a view that excludes it.

    I haven't yet done this in wide use, but if you go this way, please let me know how it works.

  • Thanks for your input.

    I think I'm at the stage where I'm realising the models and advice in my books are based on best case scenarios. Like you said, I could bring the id over - it would certainly help with traceability, but don't necessarily have to expose it to any models etc.

    Your comment on casting to int is interesting, would this not cause errors if the integer range is exceeded?

    Thanks

    Henry

  • would this not cause errors if the integer range is exceeded

    It would, which is why I said if the bigint wasn't really needed. I've seen too many developers "max out" the data types, just in case. Even the GUIDs that I have to deal with in some of my models was an unnecessary choice. An integer would have done just as well, and the GUIDs have caused all kinds of headaches.

  • I myself advocate for the lowest possible granularity unless you know for sure you wont need the details. For example, can one of those transactions be reversed at a later date if a customer returns a order?

    ----------------------------------------------------

  • D1, T1, C1, P1, ORDER1, LINE1, 100

    D1, T2, C1, P1, ORDER1, LINE1, 500

    D1, T3, C1, P1, ORDER1, LINE1, 100

    Sorry if I missed a detail here, would the line numbers not be increasing here? Line1, line2, ...

    If not you could add your own line item number within the group -->(date, customer, product, orderNumber, lineItem) as your key. Maybe you can even populate a calculated field to note the total number of line items within the transaction, placed on every line using a windowed function.

    ----------------------------------------------------

  • If not you could add your own line item number within the group -->(date, customer, product, orderNumber, lineItem) as your key. Maybe you can even populate a calculated field to note the total number of line items within the transaction, placed on every line using a windowed function.

    I would assume the lines should have incremented. But if not the use of Rowover with a break by order number would probably be the better way to make the line numbers unique to each order.

  • RonKyle (10/3/2016)


    If not you could add your own line item number within the group -->(date, customer, product, orderNumber, lineItem) as your key. Maybe you can even populate a calculated field to note the total number of line items within the transaction, placed on every line using a windowed function.

    I would assume the lines should have incremented. But if not the use of Rowover with a break by order number would probably be the better way to make the line numbers unique to each order.

    I think we are saying the same thing. Here is what I described >

    ... OrderID;lineNumber;numLinesInGroup

    --------------------------------------

    ... orderA,1,3

    ... orderA,2,3

    ... orderA,3,3

    The last two columns would be populated with a windowed function. The last one being optional.

    ----------------------------------------------------

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply