Snowflaking Dimensions

  • I've got two related questions.

    Question 1: Building an insurance data warehouse, you have a DimPolicy table (for the insurance policy) and a FactPolicyTransaction table (for the transactions on that policy). There is an insurance agent that is responsible for the policy as a whole, and an insurance agent (which may be a different person) that is responsible for the transaction on that policy. There are a large number of attributes about the agent, which leads me to believe that putting all of the attributes in DimPolicy for the agent (where the agent is responsible for thousands of policies) doesn't make sense - therefore, you snowflake the agent, and create a DimAgent table which has it's PK as a FK within DimPolicy. Now you can tell which agent is responsible for the policy. Now, you also need a relationship between DimAgent and FactPolicyTransaction to denote which agent actually made the policy transaction. This creates a "triangle", since DimAgency has its key in both DimPolicy and FactPolicyTransaction, and DimPolicy also has its key in FactPolicyTransction. The question - is this "triangle" ok? It seems awkward to me, and against the philosophies of dimensional modeling.

    Question 2: You have a company (brokerage), and an agent works for a company. There are 30+ attributes for the brokerage, and 20+ attributes for an agent. It seems funny to create a DimAgent dimension which includes all 30 of the brokerage's attributes repeated over and over again. By doing that, you will constantly need to run a "SELECT DISTINCT" on the DimAgent table to get a valid list of brokerages (since they are repeated over and over again). Is this a prime time to snowflake and create a DimBrokerage dimension as well?

    Thanks in advance for your help.

  • Sideout1972 (2/9/2011)


    Question 1: Building an insurance data warehouse, you have a DimPolicy table (for the insurance policy) and a FactPolicyTransaction table (for the transactions on that policy). There is an insurance agent that is responsible for the policy as a whole, and an insurance agent (which may be a different person) that is responsible for the transaction on that policy. There are a large number of attributes about the agent, which leads me to believe that putting all of the attributes in DimPolicy for the agent (where the agent is responsible for thousands of policies) doesn't make sense - therefore, you snowflake the agent, and create a DimAgent table which has it's PK as a FK within DimPolicy. Now you can tell which agent is responsible for the policy. Now, you also need a relationship between DimAgent and FactPolicyTransaction to denote which agent actually made the policy transaction. This creates a "triangle", since DimAgency has its key in both DimPolicy and FactPolicyTransaction, and DimPolicy also has its key in FactPolicyTransction. The question - is this "triangle" ok? It seems awkward to me, and against the philosophies of dimensional modeling.

    Question 2: You have a company (brokerage), and an agent works for a company. There are 30+ attributes for the brokerage, and 20+ attributes for an agent. It seems funny to create a DimAgent dimension which includes all 30 of the brokerage's attributes repeated over and over again. By doing that, you will constantly need to run a "SELECT DISTINCT" on the DimAgent table to get a valid list of brokerages (since they are repeated over and over again). Is this a prime time to snowflake and create a DimBrokerage dimension as well?

    1.a - I do agree a DimAgent dimension table is needed.

    1.b - How about having two FactPolicyTransaction columns named AgentPolicy and AgentTransaction.

    I assume FactPolicyTransaction table will be sliced and diced by a handful of dimensions including AgentPolicy and AgentTransaction so it seams like a reasonable solution.

    2 - You need two tables, one describing Brokerage companies and one describing Agents. I'm pretty sure your ER model shows a "belong" relationship in between BrokerageCompanies and Agents.

    Hope this helps.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (2/9/2011)


    1.a - I do agree a DimAgent dimension table is needed.

    1.b - How about having two FactPolicyTransaction columns named AgentPolicy and AgentTransaction.

    I assume FactPolicyTransaction table will be sliced and diced by a handful of dimensions including AgentPolicy and AgentTransaction so it seams like a reasonable solution.

    2 - You need two tables, one describing Brokers and one describing Agents.

    1a: I think you're saying DimPolicy should be snowflaked into DimAgent and DimPolicy, where DimPolicy contains the FK of DimAgent. I'm thinking the same thing. However, on to 1b.

    1b: Makes sense, as long as all queries would go through the Fact table. In many cases, users will need a list of DimAgent(s) and the DimPolicy(ies) for those agents. Using your proposed solution, you're basically have DimAgency have a FK in both DimPolicy as well as FactPolicyTransaction. As long as I'm not breaking any bad rules by having this triangle between DimAgency, DimPolicy, and FactPolicyTransaction, that works for me.

    2: Based on your suggestion, that would mean the new DimBroker table wouldn't have its key stored anywhere except for DimAgent, and DimAgent would have its key used in any appropriate Fact tables. Is that correct?

    Thanks!

  • Sideout1972 (2/9/2011)


    PaulB-TheOneAndOnly (2/9/2011)


    1.a - I do agree a DimAgent dimension table is needed.

    1.b - How about having two FactPolicyTransaction columns named AgentPolicy and AgentTransaction.

    I assume FactPolicyTransaction table will be sliced and diced by a handful of dimensions including AgentPolicy and AgentTransaction so it seams like a reasonable solution.

    2 - You need two tables, one describing Brokers and one describing Agents.

    1a: I think you're saying DimPolicy should be snowflaked into DimAgent and DimPolicy, where DimPolicy contains the FK of DimAgent. I'm thinking the same thing. However, on to 1b.

    1b: Makes sense, as long as all queries would go through the Fact table. In many cases, users will need a list of DimAgent(s) and the DimPolicy(ies) for those agents. Using your proposed solution, you're basically have DimAgency have a FK in both DimPolicy as well as FactPolicyTransaction. As long as I'm not breaking any bad rules by having this triangle between DimAgency, DimPolicy, and FactPolicyTransaction, that works for me.

    2: Based on your suggestion, that would mean the new DimBroker table wouldn't have its key stored anywhere except for DimAgent, and DimAgent would have its key used in any appropriate Fact tables. Is that correct?

    I'm not convinced a snowflake is needed. Seeing ER model wouldn't hurt.

    In a Data Warehouse environment the data users want to see is the data stored in FACT tables, that data will be sliced and diced using the DIMension tables of the proper datamart.

    FACT table in this case will contain both an AgentPolicy and an AgentTransaction columns, each one pointing to the rigth DIMension table for further description.

    The use of Data Warehouse to get a "list of brokers" is not what Data Warehouse is meant for, that information should be provided by the OLTP system, nobody analyzes a list of brokers. Having said that, proposed structure even provides that.

    Last but not least, DimBroker and DimAgent look to me like slowly changing dimensions - how are you planning to address them? would they be treated as Type 2 SCD?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • The use of Data Warehouse to get a "list of brokers" is not what Data Warehouse is meant for, that information should be provided by the OLTP system, nobody analyzes a list of brokers. Having said that, proposed structure even provides that.

    Last but not least, DimBroker and DimAgent look to me like slowly changing dimensions - how are you planning to address them? would they be treated as Type 2 SCD?

    Interesting. Why shouldn't the DW be used to get the list of brokers? Why should the purpose of the data warehouse is to bring data together from all of the source systems into a conformed data model. Asking the OLTP system to do that, when there are multiple OLTP systems coming together to create the discrete "list of brokers", doesn't make a lot of sense to me. Especially if you are trying to use "Kimball Methodology". I'm looking to avoid the need for a relational model of all of the source systems with conformed data in it.

    As far as the SCDs - only a few fields in DimBroker will be slowly changing - the association of a broker into a broker group, which can change over time (kind of like a broker can change regions), so it will be type 2. DimAgent I believe will all by type 1.

  • Looks to me as if there are two fact tables: policies and transactions, both have agent relationships. The former is SCD. The latter probably isn't. Whether broker may or may not be useful probably depends on whether that dimension has information unique from the agent dimension, or if there is a direct relationship between a fact and the broker in the source system. Broker, if it is tied to an agent, could also be SCD (hopefully, very slowly changing).

    An ERD would help answer some of these questions.

  • ssills (2/9/2011)


    Looks to me as if there are two fact tables: policies and transactions, both have agent relationships. The former is SCD. The latter probably isn't. Whether broker may or may not be useful probably depends on whether that dimension has information unique from the agent dimension, or if there is a direct relationship between a fact and the broker in the source system. Broker, if it is tied to an agent, could also be SCD (hopefully, very slowly changing).

    An ERD would help answer some of these questions.

    Policy wouldn't be a fact table - it would be a dimension. With that said, an option COULD be to have a factless fact table, which will join all of the dimensions together so there wouldn't be any triangular relationships between the dimensions, and that could solve the "list" type report queries.

    As far as the ERD goes - are you talking about seeing the star schema or the OLTP data model? There are multiple source systems that are being pulled into staging tables, so there isn't an ERD to go by. It doesn't really matter how the source system databases are designed, since they do not reflect how we want the data to be reported.

  • Sideout1972 (2/9/2011)


    ssills (2/9/2011)


    Looks to me as if there are two fact tables: policies and transactions, both have agent relationships. The former is SCD. The latter probably isn't. Whether broker may or may not be useful probably depends on whether that dimension has information unique from the agent dimension, or if there is a direct relationship between a fact and the broker in the source system. Broker, if it is tied to an agent, could also be SCD (hopefully, very slowly changing).

    An ERD would help answer some of these questions.

    Policy wouldn't be a fact table - it would be a dimension. With that said, an option COULD be to have a factless fact table, which will join all of the dimensions together so there wouldn't be any triangular relationships between the dimensions, and that could solve the "list" type report queries.

    As far as the ERD goes - are you talking about seeing the star schema or the OLTP data model? There are multiple source systems that are being pulled into staging tables, so there isn't an ERD to go by. It doesn't really matter how the source system databases are designed, since they do not reflect how we want the data to be reported.

    ERD of your Data Warehouse - Star schema has to be based in something.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • ERD of your Data Warehouse - Star schema has to be based in something.

    I'm definitely missing something (which, of course, is off topic). The data warehouse IS the star schema(s). Other than staging/landing and interim tables, I don't plan to have any relationally-modeled tables that store source system data - that seems like a waste to me. The star schema should be modeled how the business itself understands the data.

    Back to being on-topic - I posted the same question on the Kimball Forum, if anyone is interested that group's responses.

    http://forum.kimballgroup.com/t957-snowflaking-dimensions

  • This was removed by the editor as SPAM

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply