Design / performance question

  • I have a large fact table (several millions of rows) in my data warehouse. I want to add a dimension that stores if a customer is billable or not for a specific month. Should I store all the customers for a specific month in the table and indicate if the customer is billable or not for that month, or should I just store the non-billable customers (significantly less rows).

    The impact is that when I store all the customers, I can do an INNER JOIN between the tables (and easily filter on billable info), while if I store only the non-billable customers I have to do a LEFT OUTER JOIN and replace NULLs.

    I'd rather want to store only the non-billable customers, but I'm concerned the LEFT OUTER JOIN might be slower (for example to retrieve all billable customers). Is using IS NULL slower than using " = 'Billable'"?

    If the question is not clear or if you need more info, let me know!

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • If the only attribute you're interested in is whether the customer is billable or not, then a degenerate dimension straight off the fact is probably all that's required and you wouldn't need another table at all. E.g. just a column with a true/false value. Or, if you're not a fan of degenerate dimensions, a dimension table with two rows (Billable/Non-Billable) and a Key from the Fact table to the relevant value.

    Putting aside query performance for a second (it's going to depend on the specifics of the individual query), you generally don't want to have a NULLABLE key in a fact table, you'd always map unknown or N/A values to a dummy key value.

    If there are other attributes to a customer than whether they're billable or not (I'd assume there are), then you should definitely have a customer dimension that's mapped for all fact rows.

    However, if the purpose is to create a "Billable Customers" dimension and all the attributes only apply to billable customers (and not non-billable), then you could just populate the billable customers and have a single dummy row for "Non-Billable" that everything else maps to.

  • The problem is that billable information can change during the course of a month.

    If I use a degenerate key, I have to update all the records for that customer for the whole month (which might be a lot).

    Another problem is that the dimension is new. I can use existing columns of the fact table (date and customer SK) to create a compound key to the new dimension if I store customers in the dimension with their billing info (instead of just billable or not billable). If I create a "normal" dimension with two attributes, I have to add the new surrogate key column to the fact table and update it for all existing rows.

    However, since this is all done in the ETL, I might just live with all those updates. I'll have a longer ETL window, but the design will be cleaner.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I suspect that "ColX IS NULL" is a faster text than "ColX = 'Billable'" but that's not the point, if you do this with a left outer join you have to create the NULL before you test for it so the whole thing may be slower.

    However, if you have an efficient select statement that gives you customers which are not billable in the period concerned, you can use it as the right hand argument of an EXCEPT operator to get billable customers and that might be quite efficient. But for all I know SSAS could make that too difficult.

    Is the number of billable customers so large as to make it useful to save the space though? If you use two single bit columns, billable and non-billable, plus whatever the key needs to be, the design might be cleaner and not take up too much space. Of course if the point is not to save space but to apply the filter as early as possible to eliminate rows early in a complex query, then recording just the billable customers or just the non-billable ones may well be better.

    Tom

  • Always current month?

    Or rolling into a historical table?

    Dimension implies SSAS, which also can be an influence.

    I would tend to think in terms of NOT EXISTS and check performance.

    I'm sure you are aware of it's a boolean check and very fast most of the time.

  • L' Eomot Inversé (12/10/2013)


    I suspect that "ColX IS NULL" is a faster text than "ColX = 'Billable'" but that's not the point, if you do this with a left outer join you have to create the NULL before you test for it so the whole thing may be slower.

    However, if you have an efficient select statement that gives you customers which are not billable in the period concerned, you can use it as the right hand argument of an EXCEPT operator to get billable customers and that might be quite efficient. But for all I know SSAS could make that too difficult.

    Is the number of billable customers so large as to make it useful to save the space though? If you use two single bit columns, billable and non-billable, plus whatever the key needs to be, the design might be cleaner and not take up too much space. Of course if the point is not to save space but to apply the filter as early as possible to eliminate rows early in a complex query, then recording just the billable customers or just the non-billable ones may well be better.

    The number of billable customers is much higher than non-billable customers (otherwise the company would lose money :-D).

    The problem with storing only non-billable customers in the dimension is SSAS. I can solve this by using an unknown member, but this feels like a dirty hack to me.

    Performance wise a single bit column in the (large) fact table would be the fastest I guess, but since customers can change status during the month this leads to massive updates in the fact table (for that specific month).

    I think I'm going to store non-billable customers in a seperate table. I avoid all the updates, I don't have to change the structure of the fact tables and I can use EXCEPT or not exists to filter efficiently for billable customers if I'm querying directly on the DB. We have a view layer here, so I will adapt the view on top of the fact table to include the billable information as a degenerate dimension, so that SSAS doesn't throw a fit while processing.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (12/10/2013)


    L' Eomot Inversé (12/10/2013)


    I suspect that "ColX IS NULL" is a faster text than "ColX = 'Billable'" but that's not the point, if you do this with a left outer join you have to create the NULL before you test for it so the whole thing may be slower.

    However, if you have an efficient select statement that gives you customers which are not billable in the period concerned, you can use it as the right hand argument of an EXCEPT operator to get billable customers and that might be quite efficient. But for all I know SSAS could make that too difficult.

    Is the number of billable customers so large as to make it useful to save the space though? If you use two single bit columns, billable and non-billable, plus whatever the key needs to be, the design might be cleaner and not take up too much space. Of course if the point is not to save space but to apply the filter as early as possible to eliminate rows early in a complex query, then recording just the billable customers or just the non-billable ones may well be better.

    The number of billable customers is much higher than non-billable customers (otherwise the company would lose money :-D).

    The problem with storing only non-billable customers in the dimension is SSAS. I can solve this by using an unknown member, but this feels like a dirty hack to me.

    Performance wise a single bit column in the (large) fact table would be the fastest I guess, but since customers can change status during the month this leads to massive updates in the fact table (for that specific month).

    I think I'm going to store non-billable customers in a seperate table. I avoid all the updates, I don't have to change the structure of the fact tables and I can use EXCEPT or not exists to filter efficiently for billable customers if I'm querying directly on the DB. We have a view layer here, so I will adapt the view on top of the fact table to include the billable information as a degenerate dimension, so that SSAS doesn't throw a fit while processing.

    At the end of the day, you still have to answer is this an always current dimension / attribute for any given customer, or tied to each month historically. I suspected, although you did not say it directly, that SSAS would be part of the overall picture.

    Capturing change with triggers, then applying only those changed records, I don't see the massive updates.

    A table with customer, month, billable column, not the fact table, is where I would go.

    Kind of a slowly changing customer dimension....

    Depending on the defiintion of Billable, NON EMPTY comes to mind.

    Maybe I came from a completely different environment.

    We tended to look at customers based on volume of sales, and profit margin on those items sold.

    Year over Year ( or Month over Month) trending told the story.

  • Greg Edwards-268690 (12/10/2013)


    At the end of the day, you still have to answer is this an always current dimension / attribute for any given customer, or tied to each month historically. I suspected, although you did not say it directly, that SSAS would be part of the overall picture.

    Capturing change with triggers, then applying only those changed records, I don't see the massive updates.

    A table with customer, month, billable column, not the fact table, is where I would go.

    Kind of a slowly changing customer dimension....

    The billable/no-billable information is on a monthly basis. So I need to know for a specific month if a customer was billable or not.

    SSAS is indeed part of the overall picture.

    A slowly changing dimension would fit, but it would take some work to redesign the dimension and it would possibly make the dimension into a monster dimension, which is why an outrigger would better server our needs. Furthermore, since a customer can (unfortunately) change between status a few times a month (only last status is valid), it would make handling type 2 changes a nightmare.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • What does your current data model look like (as in a visual, if available).

    There are many solutions that come to mind, but I like to compare "what are the business questions you are trying to answer" versus what data model is currently in place vs how much time you have to implement the solution....

  • sneumersky (12/10/2013)


    What does your current data model look like (as in a visual, if available).

    There are many solutions that come to mind, but I like to compare "what are the business questions you are trying to answer" versus what data model is currently in place vs how much time you have to implement the solution....

    There's no visual I can share here.

    It's a pretty straightforward star schema with a few fact tables (some daily data, some are monthly snapshots). I need to add a new dimension, but I cannot put the attributes directly in the customer dimension.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Some "outer part of the box ideas" considering some of your design constraints.

    Option 1: IsBillable bit column in Fact table (good solution with column store index).

    I used to be completely against this design, but it can save you in a pinch (depending on situation)

    Option 2: Use a junk dimension (especially if you have other low cardinality flags hanging around).

    Option 3: Sometimes a table can be both a fact and a dimension (member eligibility comes to mind in healthcare). A periodic snapshot fact table for your customers.

    http://www.kimballgroup.com/2007/12/11/design-tip-97-modeling-data-as-both-a-fact-and-dimension-attribute/

    With this all said, a dimension outrigger, as your gut has led you, could be the way for you to go. Did I go off topic?

  • If a customer can change back and forth several times during the month, I'm missing some concept.

    Why would this freeze at the end of the month?

    It almost sounds like for any given day, what is the open amount per customer is the question a person is ultimately looking for. Or Paid YTD, Open YTD, etc. And Aging is another factor.

    Sorry to be such a pain, but have been down some similar paths before.

    So I am a bit curious how you are using it.

    Usually we would ask the harder questions, and make sure we could address those needs in the end design.

    If you want a real mind bender, I could PM an outline of Price Realization as we implemented.

  • Koen, have you also considered:

    1. Creating a mini-dimension for low cardinality customer attributes? Kimball calls it a "minidimension" (with customers sometimes referred to as a behavioral minidimension), and that minidimension acts as a separate entry point from the Fact table to the minidimension table. Common minidimensions are things like:

    A. Customer Demographics Dimension with its own Customer Demographics Key (FK) in the fact table (in addition to the MAIN customer dimension whose fact table foreign key is stored separately).

    B. Variable Width Attribute Set where:

    i. A global customer dimension has its own FK to a fact table.

    ii. An extended customer dimension that has a SUBSET of all customers (in your case BILLABLE CUSTOMERS).

    There is a FK from the Extended Customer Dimension to the Fact Table AND ALSO a FK from the global

    Customer Dimension to the Extended Customer Dimension. In this scenario you get the best of both worlds:

    A Dimension Outrigger AND A Minidimension.

    These are basically riffs on what most of us have already alluded to. For more info, see chapter 6 of the FIRST EDITION OF THE DWH Toolkit (the one that focuses on modeling).

  • Greg Edwards-268690 (12/10/2013)


    If a customer can change back and forth several times during the month, I'm missing some concept.

    Why would this freeze at the end of the month?

    Billing is at the month level, so the last status of billable/non-billable is what we need.

    Why does it change? I don't have that info yet, but I guess people change their mind a lot 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply