Request for review of Star schema design

  • Hi all,

    I'm new to Data Warehousing, and could do with some comments on a star schema design I've created. This schema's purpose is to hold discrete contact or resource use events relating to a student at an educational institute. The grain is a discrete student-interaction related event, and the fact tables are factless fact tables. An example use-case is a timeline of events that indicate interactions with a student.

    I'm working through Kimball's The Data Warehouse Toolkit 3rd Edition and am familiar with SCD's, star and snowflake schemas.

    Could I please get some comments on any flaws or schema design principles I've violated in the attached design, or improvements I could make? As this is my first attempt at designing a schema myself I'm a little unconfident.

    Thanks,

    Andrew

  • There's certainly a lot to digest in that ER diagram. I have a couple of questions offhand, but they may just be my lack of understanding of the requirements you're working with.

    - Related to FACT_PRINT_EVENT, there's a dimension DIM_PRINT_DETAILS that has PAGES in it, I'd think that would be in the fact table so you could easily calculate usage.

    - Related to FACT_CMS_EVENT, there are 3 dimensions I'm not sure I understand why they are separate? DIM_EMAIL_ADDRESS, DIM_PHONE_NUMBER, DIM_ADDRESS. I'd think they might be attributes in DIM_PERSON or DIM_STAFF, maybe both? It's not clear what the business cases are for these.

    - Is DIM_NETWORK_LOGIN_EVENT a fact table? I didn't see it related to any fact table, just other dimensions.

  • Hi Chris, thanks for the feedback.

    Chris Harshman (12/20/2016)


    There's certainly a lot to digest in that ER diagram. I have a couple of questions offhand, but they may just be my lack of understanding of the requirements you're working with.

    You're right, it would have been better if I had only attached one of those schemas, they're all very similar. I had just split them from one giant "STUDENT_EVENT" fact table schema and wasn't thinking of them as separate designs.

    - Related to FACT_PRINT_EVENT, there's a dimension DIM_PRINT_DETAILS that has PAGES in it, I'd think that would be in the fact table so you could easily calculate usage.

    I didn't see any business requirements we'd have regarding doing calculations on that column, only knowing that as a context of a PRINT_EVENT. I suppose by putting that into a dimension I'm limiting it from being calculated against. Would you then leave the junk dimension "DIM_PRINT_DETAILS" containing the IS_COLOR boolean and PAGE_SIZE text, or move IS_COLOR back into the fact table as well, and change PAGE_SIZE into a dimension with a reference from FACT_PRINT_EVENT?

    - Related to FACT_CMS_EVENT, there are 3 dimensions I'm not sure I understand why they are separate? DIM_EMAIL_ADDRESS, DIM_PHONE_NUMBER, DIM_ADDRESS. I'd think they might be attributes in DIM_PERSON or DIM_STAFF, maybe both? It's not clear what the business cases are for these.

    Sometimes students may not call us from their current registered address, email or phone number - they could call us from a phone number they never use again (e.g. a relatives house). As it's advised text content like that not be in the fact table, I had moved them to their own dimension. Would you take an alternative approach?

    - Is DIM_NETWORK_LOGIN_EVENT a fact table? I didn't see it related to any fact table, just other dimensions.

    That's sloppy naming, my apologies.

    I also see those "Additional_Text" columns shouldn't be in the fact tables, so have split them out into their own dimensions.

    Can you recommend any improvements to the below, or issues in my thinking above?

    Thanks for the help,

    Andrew

  • To what entity to the e-mail, address, and phone number belong? Presumably a person. A student? They certainly don't look like they belong in the Fact table at all, but in some dimension table that describes a person of some kind.

    If they describe a Fact, what does a record in your fact table describe? I think that's the part you need to be really clear on. After you understand that ( and can describe it succintly ), the modeling questions pretty much answer themselves.

  • pietlinden (12/20/2016)


    To what entity to the e-mail, address, and phone number belong? Presumably a person. A student? They certainly don't look like they belong in the Fact table at all, but in some dimension table that describes a person of some kind.

    If they describe a Fact, what does a record in your fact table describe? I think that's the part you need to be really clear on. After you understand that ( and can describe it succintly ), the modeling questions pretty much answer themselves.

    FACT_CMS_EVENT describes an event where the business had email, phone, or postal contact with a student or prospective student. An example is "John Adams called from 534-3432-543 on 23 April 2016 at 09:00 and spoke to Carol Green, where he asked when his certificate would arrive. Carol let him know it would arrive by Friday". I've just identified I'm using an incorrect abbreviation and should have been using "CRM" not "CMS" - perhaps that would be better named "FACT_STUDENT_CONTACT_EVENT".

    Students may call or email us from an address that isn't their primary address, and they may not contact us from that phone number or email address again. We may post mail to an address not recorded in the student's record. Because those details can be isolated to a specific contact event, I think they are best referenced from the fact table.

  • Andrew P (12/20/2016)


    You're right, it would have been better if I had only attached one of those schemas, they're all very similar. I had just split them from one giant "STUDENT_EVENT" fact table schema and wasn't thinking of them as separate designs.

    Seems right to split them out, each of your fact tables models a specific business process, and has a different grain. If on occasion you need information from more than one of these fact tables, you could do what's called "drilling across" using your common dimensions.

    Would you then leave the junk dimension "DIM_PRINT_DETAILS" containing the IS_COLOR boolean and PAGE_SIZE text, or move IS_COLOR back into the fact table as well, and change PAGE_SIZE into a dimension with a reference from FACT_PRINT_EVENT?

    I'd probably just keep the dimension with IS_COLOR and PAGE_SIZE since IS_COLOR has such limited values.

    Sometimes students may not call us from their current registered address, email or phone number - they could call us from a phone number they never use again (e.g. a relatives house). As it's advised text content like that not be in the fact table, I had moved them to their own dimension.

    OK, that makes sense that you don't want to track all these possibly extraneous values in a slowly changing Person dimension. You may want to give the columns in FACT_CMS_EVENT a slightly more descriptive name so it's clear who's phone, e-mail, or address it is.

    I also see those "Additional_Text" columns shouldn't be in the fact tables, so have split them out into their own dimensions.

    Can you recommend any improvements to the below, or issues in my thinking above?

    Additional Text situations can be awkward to deal with, especially if it's free form text, so I understand splitting it out. That way it's only referenced when needed.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply