Loading Fact Tables - Step by Step Instructions Challenge

  • Dear All,

    This is my first Data warehouse project and I am having serious problems loading my fact table.

I have already created the dimension tables (over 20), each with a unique ID (an int IDENTITY column).

Apart from that, each table has an FK relationship with one particular table,

    e.g.

Delegate table

DelegateID (PK)

DelegateName

StaffGroup table

StaffGroupID (PK)

DelegateID (FK)

StaffGroupName

The problem is that I am unable to load my fact table using the Lookup transform in SSIS 2008 R2. I am using the DelegateID as the lookup key.

As the tables are different lengths etc. (the Date and measures tables, for instance), I am unable to complete the lookups successfully. I can use the No Match output, but with that many tables it will become impossible to follow.

    There are null values in my Date and Measures tables. I thought that they would be acceptable there.

    My request is can someone give step by step instructions on how to load a fact table?

I have scoured the internet for this, and there is nothing apart from a blog here. While it is informative, it is pretty hard to understand for a newbie.

It would be a great boon for me and others if someone in the community would take up this challenge. If not, at least point me in the right direction.

    MSDN does not have any realistic explanations and neither does BOL.

From your sample tables you don't really have a star schema. If possible you should denormalise your dimensions as much as possible, not split them off into snowflakes.

    You need to explain your problem - "I am unable to load my fact table using the LookUp function" - why?

Normally you get the source system data with its source system keys and load it into a staging table. Personally I don't use SSIS lookups; I use a bulk update to set the surrogate keys, then load the rows into the fact table.
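The staging-plus-bulk-update approach described above might look roughly like this. All table and column names here are illustrative assumptions, not taken from the original post:

```sql
-- Sketch only: staging table keeps the source key and has an empty slot
-- for the surrogate key to be filled in by the bulk update.
CREATE TABLE stg.FactTraining (
    SourceDelegateID INT          NOT NULL,  -- source system key
    DelegateKey      INT          NULL,      -- surrogate key, set below
    TrainingDate     DATE         NOT NULL,
    ComplianceScore  DECIMAL(5,2) NULL
);

-- Bulk update: resolve each source key to its dimension surrogate key.
UPDATE s
SET    s.DelegateKey = d.DelegateKey
FROM   stg.FactTraining AS s
JOIN   dbo.DimDelegate  AS d
       ON d.SourceDelegateID = s.SourceDelegateID;

-- Load only fully resolved rows into the fact table.
INSERT INTO dbo.FactTraining (DelegateKey, TrainingDate, ComplianceScore)
SELECT DelegateKey, TrainingDate, ComplianceScore
FROM   stg.FactTraining
WHERE  DelegateKey IS NOT NULL;
```

Rows left with a NULL surrogate key after the update are your "no match" cases, and they stay in staging where you can inspect them instead of failing mid-pipeline.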

Have you read the Kimball books or the Microsoft data warehousing books? These concepts come from books, not from technical articles (MSDN, BOL).

  • I agree.

The Microsoft Data Warehouse Toolkit, chapter 7, has great tips for loading fact tables, along with sample scripts and SSIS packages to go with it. The 2005 version gives you more of a step-by-step tutorial, whereas the newest version, the SQL Server 2008 R2 Edition, maps the SSIS product to the 34 subsystems of ETL and gives you advice on establishing standards.

My biggest tip: once you have mastered the art of the lookup transform, remove all foreign keys from your fact table. As the author states, they are "technically right, but practically wrong", because FKs add unnecessary overhead for a check you have just performed in the lookup transform.

    Book and code samples can be found here:

    http://www.kimballgroup.com/data-warehouse-and-business-intelligence-resources/data-warehouse-books/booksmdwt/

    The 34 subsystems of ETL can be found in the Kimball Group Reader or on their website: http://www.kimballgroup.com/2007/10/21/subsystems-of-etl-revisited/

One more thing: in the Microsoft Data Warehouse Toolkit, the actual use of the Lookup transform is covered in more depth in

    ETL Subsystem 13: Fact Table Builders

    ETL Subsystem 14: Surrogate Key Pipeline

    Technique #1: Using Cascading Lookups (via the lookup transform)

    Technique #2: Using database joins
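Technique #2 above can be sketched as a single set-based statement that resolves every surrogate key at once. The table and column names here are hypothetical, chosen to match the thread's scenario:

```sql
-- Sketch of the "database joins" technique: translate all source keys
-- to surrogate keys in one INSERT...SELECT, with no per-row lookups.
INSERT INTO dbo.FactCompliance (DateKey, DelegateKey, TopicKey, IsCompliant)
SELECT dd.DateKey,
       dl.DelegateKey,
       tp.TopicKey,
       s.IsCompliant
FROM   stg.Compliance  AS s
JOIN   dbo.DimDate     AS dd ON dd.FullDate         = s.EventDate
JOIN   dbo.DimDelegate AS dl ON dl.SourceDelegateID = s.SourceDelegateID
JOIN   dbo.DimTopic    AS tp ON tp.TopicName        = s.TopicName;
```

Note that inner joins silently drop rows with no dimension match; switch to LEFT JOINs with a default key if unmatched rows must be kept and investigated.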

My addition to this thread is to inquire why you used full integers for your surrogate keys. Do you REALLY need 4.2 billion possible values? I bet not. tinyint is one byte, smallint is two bytes. Review how many values you can have for each (256 and 65K+ respectively) and use the right one for every dimension. Believe it or not, those extra bytes (recall that an integer is FOUR bytes) on a fact table REALLY DO ADD UP!!
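Right-sizing the keys as suggested above is a one-line change per dimension. A minimal sketch, with assumed names and row-count estimates:

```sql
-- Illustrative only: pick the smallest integer type that fits each
-- dimension's realistic row count (tinyint: 0-255; smallint: up to 32,767
-- positive values with an identity seed of 1).
CREATE TABLE dbo.DimStaffGroup (
    StaffGroupKey  TINYINT IDENTITY(1,1) PRIMARY KEY,
    StaffGroupName VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.DimDelegate (
    DelegateKey  SMALLINT IDENTITY(1,1) PRIMARY KEY,
    DelegateName VARCHAR(100) NOT NULL
);
```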

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Completely agree: http://www.kimballgroup.com/2001/11/03/design-tip-30-put-your-fact-tables-on-a-diet/

I also forgot: NEVER, EVER, under ANY circumstances, leave null values in your dimension keys. It puts too much at risk for an end user. Look up "surrogate nulls" for more information 🙂
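The usual way to avoid NULL dimension keys is a reserved "unknown member" row. A sketch of the pattern, assuming a DimDelegate table with an IDENTITY surrogate key (names are illustrative):

```sql
-- Reserve a well-known row (key -1) for "unknown" so fact keys are
-- never NULL. IDENTITY_INSERT lets us choose the key value explicitly.
SET IDENTITY_INSERT dbo.DimDelegate ON;
INSERT INTO dbo.DimDelegate (DelegateKey, DelegateName)
VALUES (-1, 'Unknown');
SET IDENTITY_INSERT dbo.DimDelegate OFF;

-- During the fact load, map any unresolved key to the Unknown member.
UPDATE stg.FactTraining
SET    DelegateKey = -1
WHERE  DelegateKey IS NULL;
```

Reports then show an explicit "Unknown" bucket instead of silently dropping rows in inner joins.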

  • Dear All,

    Thank you for your comments. I will be getting rid of the NULL values and reducing the size of my int(s) pronto!

I have Amazoned the MS Data Warehouse Toolkit and should get it by the end of the week.

In the meantime I have attached a data schematic that I would like to convert to dimensions. I am trying to monitor compliance rates based on Topic and Job. A specific job will need to do specific topics. I will then need to slice by Organisation hierarchy.

As I need to delve into each individually, I can't put the topics (e.g. Consent or Dementia) all under one roof, so to speak.

    The relationship is centered on the Delegate / staff member, hence the link. Based on the business requirement matrix I have worked out who does what.

    I have tied this to each topic using the Delegate table PK as FKs in almost all the other tables.

    This is where I am encountering problems.

When I try to load the individual PKs into the fact table in SSIS I am getting a lot of No Match outputs. Perhaps (actually, I am pretty sure) I am doing the lookup incorrectly.

I am using the DelegateID as the lookup source. Should I be using something else? Is it because of the NULLs in the dimension tables, or a combination of both?

    Am I completely on the wrong track here?! Thoughts very welcomed.

I am quite eager to learn and will readily start over again if the general consensus is to do so.

    Be brutal if you must in a kind way!!

    Regards,

    Marvin.

I don't understand enough about your technical issue with lookups, but for starters be aware that under its default (full cache) settings the Lookup transform is case sensitive.

    You generally would not load a source system key into a fact table. Source system keys usually go into the dimensions (or a prior merge/mapping table before the dimension).

    What you should be doing is loading dimensions (which often will not necessarily map 1-1 with lookup tables in the source system), and saving the source system key in here, and generating a new 'surrogate key' in the process.

    That's your first step.

    Next step is to load transactions from the source, then work out what dimensions these transactions relate to and then load that into the fact table.

    The result is a fact table that only has surrogate keys in it. All of the complications around changing attributes, multiple source systems, data cleansing, multi column keys, strange data models etc. are handled by the ETL and the dimension table. All of these complications are boiled down to the single surrogate key that joins neatly to your fact.
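The dimension-first step Nick describes can be sketched as follows. The IDENTITY column generates the surrogate key, and the source system key is stored alongside it so later fact loads can translate source IDs into surrogates (all names here are assumptions):

```sql
-- Dimension with both keys: generated surrogate plus preserved source key.
CREATE TABLE dbo.DimDelegate (
    DelegateKey      INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    SourceDelegateID INT NOT NULL,                   -- source system key
    DelegateName     VARCHAR(100) NOT NULL
);

-- Load step: add only delegates not already present in the dimension.
INSERT INTO dbo.DimDelegate (SourceDelegateID, DelegateName)
SELECT s.DelegateID, s.DelegateName
FROM   src.Delegate AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimDelegate AS d
                   WHERE  d.SourceDelegateID = s.DelegateID);
```

Once every dimension is loaded this way, the fact load is just a translation of source keys to surrogate keys, and the fact table never needs to see a source key at all.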

    Hopefully the schema you posted is the source system, not the data warehouse? While you are waiting for your data warehouse book, there are plenty of resources online about star schemas and Kimball.

    When building your star schema you need to think about the business process, i.e. what events happen over time and in what way are they measured? - that's your fact(s).

    If your source system records these events in multiple tables in a complicated data model, your challenge is to boil this down to a nice fact/dimension representation.

  • Dear Nick,

I have already done the first step, so my dimension tables have both the source keys and the new surrogate keys.

The schema posted is not my data warehouse, but it may well be something similar.

I have thought about my measures, and they are at the lowest level of granularity as well.

It is just the loading that is posing a problem at present.

  • Is this an OLTP-like schema for the Quality Management function of a hospital?

  • For the most part, yes. It is geared to L&D rather than clinical systems.

  • Can you attach your target schema? That is, the dimensions surrounding the fact table you are loading along with the fact table itself?

    Also, think about what actual "events" you are measuring and the core business process(es) surrounding the event. More often than not, the "verbs" of your business represent the facts, and the dimensions represent the "nouns" or the "group bys" of your business.

More often than not, if you have to "group by" it, it is going into a dimension--a dimension that is usually denormalized. If you have to aggregate it, chances are it is a fact. While there are exceptions and advanced concepts where these rules do not apply, try to use these rules of thumb when mapping your schema back to your business requirements.
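The rule of thumb above shows up directly in the queries your users will write. A hypothetical example against the kind of schema discussed in this thread (names are illustrative):

```sql
-- Dimension attributes are what you GROUP BY; fact measures are what
-- you aggregate. Here: compliance rate per staff group.
SELECT d.StaffGroupName,                          -- dimension ("noun")
       AVG(CAST(f.IsCompliant AS FLOAT)) AS ComplianceRate  -- fact measure
FROM   dbo.FactCompliance AS f
JOIN   dbo.DimDelegate    AS d ON d.DelegateKey = f.DelegateKey
GROUP  BY d.StaffGroupName;
```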

  • I am struggling to understand how your diagram shows any relationships.

    Is it simply a statement of facts about people?

    Does the presence of a row in the HandHygiene table mean that the person either did or did not comply with hand hygiene rules?

    Would your fact table look something like this?

    DelegateID INT,

    ReportDate DATE,

HasConflictResolutionEvent TINYINT, -- 1 = yes, 0 = no

    CompliesWithHandWashing TINYINT,

    CompliesWithInfectionControl TINYINT,

    CompliesWithViolenceAggression TINYINT,

    HasMentalCapacityTraining TINYINT,

    etc.

    With Dimensions:

    Date

    Division/Department/Role/Delegate

The diagram is not of the dimensions and fact table but of the OLTP at the second stage.

    Stage 1. RawData into generic table

    Stage 2. Data cleaned and loaded which should transform to dimensions.

The fact table data has also been loaded into a table with a unique key as well.

    It is a statement of facts about people.

    The Fact Table will be exactly that.

    However, I will also need to aggregate on Department, StaffGroup etc.

    There will be KPIs and Trends for each of the topics. But that is the easy/ fun part (I hope :w00t:)

    Hope this helps.

  • What are the top 10 - 20 business questions you are attempting to answer?

