Need help resisting falling back to dimensional/snowflake schemas

  • In a structure for an International business with a SalesOrder and SalesOrderLine, a data model needs created that answers these questions:

    1. Show me which SalesOrders are in the sale region of South-East Asia

    2. Show me the status of each sale order item.

    3. Show me the total sales against each sales region compared to the budget for that region

    To do this in a star schema, the first question suggests that SalesOrder should be the fact table, and Customer and SalesRegion be dimensions.

    The second question suggests that SalesOrderLine should be the fact table, and SalesOrder, Customer, and SalesRegion should be dimension.

    The third question prevents SalesRegion from being included as attributes in the Customer dimension.

    To combine these three into a single working schema requires a relationship of [SalesOrderLine] <- SalesOrder_ID - [SalesOrder] <- SalesRegion_ID - [SalesRegion].

    We now have a snowflake schema, which is bad practice. this is how I've handled this situation in the past, and it's led to a regression into a partly normalized schema.

    Not all SalesOrders will have SalesOrderLine entries due to an eccentricity in the data.

    A possible solution to this is to add a dummy SalesOrderLine for each SalesOrder that doesn't have any SalesOrderLines, allowing SalesOrder to be merged into SalesOrderLine attributes as a degenerate dimension.

    Another possible solution, which I'm worried is over-the-top and could cause confusion for our end users, is to hold that question #1 and question #2 are measuring two different things, so those should be two separate star schemas, with a conformed SalesOrderRegion shared between them.

    I could use some counsel - am I understanding the tensions at play and a possible solution correctly, or there something off in my understanding?

    Best,

    Andrew

    • This topic was modified 1 year, 3 months ago by  Andrew P. Reason: Added clarification
  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Hi,

    You have a problem because you are not choosing the correct method. You are still normalizing in Kimball, which is a fully denormalized cache.

    Just add all the data in SalesOrder to SalesOrderLine and all problems go away. You may need more storage though.

    The way that issue is solved in more modern DWH's is by storing historized data in Data Vault format (normalized), then put views on top of that for the Kimball layer, and only materialize them where required for performance.

     

  • I'm not an expert at dimensional modeling by any means, but it seems to me you have a fundamental misunderstanding about how to model this data using the Kimball methodology (and please, someone, correct me if I'm misunderstanding, I probably am).

    I think the third question is the important one and is where you ought to start because it seems to establish what your fact table needs to be...I would call it "Sales" and I think it corresponds to your "SalesOrderLine", though I'm not certain. The granularity of the fact table would appear to be a "sale", or a "sale order item", but I would still call it "sale", because that's what you're asking for...total sales. The fact table would not contain a "sale order item", the fact table contains the measures that you want to quantify, not the sale order item itself (usually), so that's why I would call the fact table "Sales", not SaleOrderItem. If there are details to track based on a "sales order item", whatever that is, it would (usually) go in a dimension of its own, not in the fact table.

    Now for dimensions...clearly you need a "Region" (or "SalesRegion" dimension), because you're asking for total sales by region and what sales (by "sale order) that happened in a region. This, with your fact table, is needed to answer part of question #1 ("SaleOrder region South-East Asia"), and part of question #3 ("total sales" in question 3, but not the "compared to the budget" part...that could be an attribute in your Sales Region dimension, or a dimension on its own, I suppose (a "SalesRegionBudget" dimension, but that seems to be overkill if budget is one-to-one with "sales region").

    To completely answer questions 1 an 3, you also need a "SalesOrder" dimension.

    Question #2 would be answered by a "status" dimension. If you want to track it over time then it'd be a slowly changing dimension of the appropriate type for your situation.

    So, one fact table,  dimensions of Region , SalesOrder, and Status (and maybe "budget", though I don't think I'd make a dimension out of that unless it makes sense to do so. "budget" seems to be an attribute of a "region"), seems to be able to answer all three questions. It sounds like you also want a "Customer" dimension in there too since you mention that, but I don't know what you want to track by customer.

  • @dmbaker: Thanks for your reply - I'm sorry I hadn't replied earlier. I think you've directly hit on some of those misunderstandings, and I really appreciate you taking the time on this one.

    You've confirmed to me that I need to work out a model with just a single fact and dimensions. To do that, and get rid of the need to query against the "SalesOrder" as a fact table, I'm experimenting with creating a dummy "SalesOrderLine" row for each of the Sales Orders that don't have any, which means that now all Sales Orders are represented in the SalesOrderLine table.

    Thanks also heaps for explaining that the attributes of my "SalesOrderLine" shouldn't be in the fact table itself (so it shouldn't be called SalesOrderLine) - they should be in a separate dimension, even if that's 1:1 with the fact table. That answers a problem I'm having with my fact table getting wide when I know it should be tall and narrow.

    Best,

    Andrew

    • This reply was modified 1 year, 2 months ago by  Andrew P.
  • @ronaldkunenborg apologies for the delay in responding. Thanks for the suggestion to push SalesOrder down into SalesOrderLine. I'm planning on moving the SalesOrder details to a dimension from the fact table for the measures currently in SalesOrderLine, and ensuring that there is at least one row in that fact table for every SalesOrder. That will mean adding a dummy row to the fact table for each of the few SalesOrders that don't have any rows in that but still need included in the dataset.

    Thanks also for the push towards Data Vaults, I've been hearing a small few people say they're moving away from star schemas. I hope to get my design working properly in a star schema design before moving away from that. I'm definitely noticing the trend, the future is now!

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply