Something just is not clicking with me

  • As I am primarily a developer, but as a JOAT (jack of all trades) I have been tasked with developing a demographics data cube, which I have and it works except now when data is being analyzed I am being hit with SCD issues. First I need to say that I already have read up on changing the fact table to include current date and a IsCurrent column. Except what can I do at the Cube level as I do not live in the DB world here to be changing the fact tables?

    Here is a sample and example and you can see with the counts the issue I have.

    Client Id Distinct Count

    All 7208 <--GOOD NUMBER

    Adult DD 767

    Adult MI 4137

    Adult SA 309

    Child DD 150

    Child MI 1267

    Child SA 1

    Unknown 778

    The total of 7208 for the fiscal year ending is correct and should be 7208, however if you add the values of the categories it comes up to 7409 and when I analyzed the data from the lowest client level, 201 clients did change a category sometime during the year. We do not want to count the "older" category, but only the newest based on another dimension table which is Fiscal Dates (Year, Quarter, Month attributes).

    This is the query I use to get the above data

    select {[Measures].[Client Id Distinct Count]} on 0

    , NON EMPTY {[Population].[Population Name].MEMBERS} on 1

    from DemographicsCube

    Where [Dates].[Fiscal Dates].[Fiscal Year].&[2015]

    Is there a way just to count the current data and ignore the historical data for a client who has "changed" values in a dimension and do it at the cube level. Even if it is with a calculated member I am good with that. I just need some direction.

    Thanks!

  • Just to clarify, I'm hoping that you've included the effective dates and current indicator in your dimension table...not your fact table. These attributes should then be used to determine the "active" dimension record at the time of the fact record (or transaction), to correctly populate the fact table with the relevant dimension key.

    In terms of only counting customers once (or the current version as you have stated), there are a few ways of doing that:

    I typically prefer to have a type 1 dimension (for the same entity) in the database as well. This means that there would be a dimension for customer history (type 2 and tracking changes), and a customer dimension with only the most recent information for each customer. Each fact record will then have 2 foreign keys for customers, 1 that point to the type 2 dimension and the other to the type 1. In your cube you could then use the type 1 dimension with the most recent version of each customers attributes, to perform the kind of count you require here.

    It would also be possible to "fake" this in the DSV of the cube, i.e. create a dimension table in the DSV that only contains current customer records. You would have to jump through a few hoops to get the fact references to work, as your surrogate keys would no longer be enough.

    Another option would be a calculated measure, counting only current records for customers.

    Hope this helps.

  • Just to clarify, I'm hoping that you've included the effective dates and current indicator in your dimension table...not your fact table. These attributes should then be used to determine the "active" dimension record at the time of the fact record (or transaction), to correctly populate the fact table with the relevant dimension key.

    I will need to check that tomorrow. That would be done at the DSV level, correct? I have a Date table that is being pulled in for our date dimension and that is what the Calendar and Fiscal date dimensions are using.

    In terms of only counting customers once (or the current version as you have stated), there are a few ways of doing that:

    Actually that total that I listed is an accurate count of UNIQUE client IDs so that is not at issue. It is the SCD (for this case when a child becomes an adult during the year, but there are dozens of other dimensions where it could change for the year like zip code)

    I typically prefer to have a type 1 dimension (for the same entity) in the database as well. This means that there would be a dimension for customer history (type 2 and tracking changes), and a customer dimension with only the most recent information for each customer. Each fact record will then have 2 foreign keys for customers, 1 that point to the type 2 dimension and the other to the type 1. In your cube you could then use the type 1 dimension with the most recent version of each customers attributes, to perform the kind of count you require here.

    It would also be possible to "fake" this in the DSV of the cube, i.e. create a dimension table in the DSV that only contains current customer records. You would have to jump through a few hoops to get the fact references to work, as your surrogate keys would no longer be enough.

    Another option would be a calculated measure, counting only current records for customers.

    Either way you have given me different options to try. Like I said too, I am not the DBA and cannot change the fact table that the cube's data comes from so I need to do it either at the DSV or a calculated measure. Something like that. But whatever it is has to apply for not only the Population dimension but others as well. Thanks again for your response.

  • Brad Allison (11/10/2015)


    I will need to check that tomorrow. That would be done at the DSV level, correct? I have a Date table that is being pulled in for our date dimension and that is what the Calendar and Fiscal date dimensions are using.

    No, effective dates and current record indicators are usually added into the back-end dimension tables. These attributes are then used when populating the fact table and assigning foreign keys to the type-2 dimension.

  • Just to add a little here, recall that a FACT is just that. It captures something that happened at a point in time. There is really no updating here (unless you have something like an accumulating fact table or an error). To keep it simple, an event is captured at instance in time. The dimension relates descriptors to that event. In your case you are dealing with a changing dimension (not fact)

    Example//

    I sell widgets made of material A ... I sell 500 in the first six months of the year

    I update the widget to make it out of materialB ...I sell 1000 in the next six months , a definite plus to change the material of the widget as it helped sales.

    In the operational system the widget will still have an ID like 'mainWidget123'

    In the warehouse, you know these are not exactly the same ( though they are to the marketing department) , you will have Key values 100, 101 (lets say) associated with the 'mainWidget123' ID.

    So here a cube would show two versions of this widget. however sales according to the marketing dept would be 1500 for the year and not 1000 ( for the latest version of the product).

    Hope this helps you sort things out a bit.

    ----------------------------------------------------

  • Thanks and yes, it does help some. But here is what I need to figure out now. Using your example, the reporting department here needs to see some MainWidget123 demographic data in the form of a count (not an aggregated financial sum). So what my pivot tables are showing to the right now is MainWidget123 (being counted correctly once), but it shows one count for Material A and one count for Material B. My reporting department just wants to ignore that count for Material A and use the count for Material B. However, because Material A is in the fact table, it is being reported.

    I think I mentioned that I do NOT have the capability to change the underlying fact warehouse tables and need to make do with what they give me. So I do not have an "active record" flag to filter data from.

  • Typically a SCD type 2 has those validFrom and validTo fields in the dimension table. You can create your DSV by filtering on only those that have the validTo be NULL. These are necessary for this type of reporting. Of course changing the table is not enough even if you could. You would likely need to alter the actual process that populates the database to use those new fields as well.

    In you situation I am assuming your distinct count is 1000 but the individual sums total 1500 (using my example)? Are you using your own keys or the operation systems ID? Is there some inconsistency/error with the key values for the lines in the dimension table?

    ----------------------------------------------------

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply