Populating Fact Tables

  • Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/vRainardi/2769.asp

  • Thank you for an interesting article on loading of fact tables. This is something I might be doing shortly.

  • Thank you, this is very helpful.  I am taking on a new assignment, and will be building a data warehouse.  This is a great example of how to monitor for pitfalls.

    I understand this is the 3rd in a series. How do I find the preceding two articles?

    Thanks, Mike


  • One of the problems I've run into is the size of the fact tables.  This is with a web site doing 70+ million hits per year.  I've found that the majority of reporting works on month/year as the major time dimension, so I created a function that changes the date to the first of the month.  I don't like the idea of choosing a particular day, but OLAP tools are looking for a date and not just a month/year combo.  This cuts the size of the data to about 1/30th of what it would be otherwise.

    CREATE FUNCTION [dbo].[GetRptDate] (@Date datetime)
    RETURNS datetime AS
    BEGIN
        -- Collapse any date to the first day of its month, e.g. 2006-08-17 -> 2006-08-01
        DECLARE @StrYear nvarchar(4)
        SET @StrYear = Year(@Date)
        DECLARE @StrMonth nvarchar(2)
        SET @StrMonth = Month(@Date)
        -- Build the string as '01/MM/YYYY' and convert with style 103 (dd/mm/yyyy)
        RETURN convert(datetime, '01/' + @StrMonth + '/' + @StrYear, 103)
    END

    I use this function as part of an ETL job that runs daily.  There is a lot more to it than this, but I wanted to see how other people are dealing with VLFT (Very Large Fact Tables - a new acronym?). Thanks.
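
    For illustration, the monthly roll-up part of that job looks roughly like this (stg_web_hits, hit_datetime and page_key are made-up names standing in for the real staging schema):

    -- Illustrative only: collapse detail dates to the first of the month before loading
    SELECT dbo.GetRptDate(h.hit_datetime) AS rpt_date,   -- e.g. 2006-08-17 -> 2006-08-01
           h.page_key,
           COUNT(*) AS hit_count
    FROM stg_web_hits h
    GROUP BY dbo.GetRptDate(h.hit_datetime), h.page_key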

  • Mike - If you click on the author's name at the top of the page, you can see all the articles he has written. It appears that the previous two articles are parts 1 and 2. It also appears that he has written several DW articles, so I will be digging into all of them.

    Thanks Vincent!



    Michelle

  • Thank you Vincent for taking the time to write this article. It is coherent, concise, and useful. Well done!

    In the section titled "Fact Table Partitioning" (near the bottom) I think there are some missing hyperlinks. I would like to read the one about partitioning in SQL 2005. The sentence reads as "This article provides a good overview on the partitioning on SQL Server 2005", but there is no hyperlink on "This". Or at least it's not showing up in my browser (IE7). Can you please reply with the link?

    Thank you!


    James Stover, McDBA

  • Hi James, the link to the excellent SQL 2005 partitioning article by Kimberly Tripp is http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql2k5partition.asp, whereas the one on SQL 2000 by Joy Mundy is http://msdn2.microsoft.com/en-us/library/aa902650(SQL.80).aspx

    Thank you Sam for the useful date converter function. On Oracle and SQL 2005 we deal with VLFT (good acronym, thx) using table partitioning. On SQL 2000, as there is no real partitioning, summary tables, indexing and purging are our weapons. We had a snapshot fact table with 90m rows and treated this with a summary table (which is only 300k rows) and purging (leaving daily snapshot data for the last 4 weeks and weekly snapshots for the last 2 years). Luckily the ETL is not too much of a problem as we mainly do inserts and no updates.
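
    For anyone who hasn't seen the SQL 2005 syntax, here is a bare-bones sketch of monthly range partitioning for a snapshot fact table; all object names are made up for illustration, and a real implementation would map the partitions to separate filegroups:

    -- Illustrative only: partition a snapshot fact table by month on snapshot_date
    CREATE PARTITION FUNCTION pf_snapshot_month (datetime)
    AS RANGE RIGHT FOR VALUES ('2006-01-01', '2006-02-01', '2006-03-01')

    CREATE PARTITION SCHEME ps_snapshot_month
    AS PARTITION pf_snapshot_month ALL TO ([PRIMARY])

    CREATE TABLE fact_account_snapshot
    ( snapshot_date datetime NOT NULL,
      account_key   int      NOT NULL,
      balance       money    NULL
    ) ON ps_snapshot_month (snapshot_date)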

    Thank you Michelle for pointing Mike to the two articles. The URL for my articles is http://www.sqlservercentral.com/columnists/vRainardi. Mike, Data Warehouse Loading is the first in the series, where I describe the general things; then I wrote one article focusing on dimension tables and another one focusing on fact tables. I haven't covered audit, control, restore, MDB and EAI (topics 7, 8, 9, 10 and 11 - see the planned content in the DW Loading article) as I've been busy writing a book and another article on using DW for CRM, but if a lot of people want it please tell me and I'll write it for you guys.

  • Hi,

    the article is good, but I missed the usage of the new possibilities of SQL Server 2005 SSIS. In SQL 2000 it was common to use a "staging area" in the form of a temporary table for the dimension mapping, calculations and so on. However, performance problems arose when dealing with tons of data, because every time you write data into your database you incur disk IO, which slows down performance dramatically.

    With SQL 2005 you can avoid this behaviour, because the SSIS architecture enables you to perform nearly all transformations in memory and then write the data into the fact table. This is much faster than using a staging table and is common best practice.

    Another thing I do not agree with is using the combination of all surrogate dimension keys in the fact table as the clustered index. How does this work in a table with xx million rows and 15 dimensions? It is a lot of effort for the DBMS to maintain such an index, and I do not believe it performs well.

    I have had good experiences using one of the date surrogate keys as the clustered index for due date reporting fact tables. Any other experiences?

    Best regards,

    Stefan


  • Hi Stefan, you are right that in SSIS we can do all transformations and lookups in memory and write directly into the fact table. The reason we dumped the source systems' data into a stage database in this particular project was that our access window to the operational systems was limited. In one system (the main one) we were allowed access only from 4 am to 5 am, and the whole ETL package took about 2 hours. The operations guys didn't want us to do any data extract during working hours (6 am to 8 pm) because of the impact on the online performance of the operational system, and from 8 pm to 4 am they had various housekeeping processes such as the overnight batch process and backup. So we used the limited time window only to get the data out of the source system into either a flat file or a collection of unindexed, unconstrained tables, and we managed to get the extract portion of the ETL down to 15-30 minutes. But as you said, the best practice is to perform in-memory transformations, because this produces the shortest overall end-to-end ETL execution time. In other projects where I used Informatica for ETL I also performed in-memory lookups and transformations, especially when we had a dedicated box for the ETL processes.
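
    To give an idea of that extract step, here is a bare-bones sketch; the stage and source object names are made up, and the real extract had more to it (file dumps, batching, etc.):

    -- Illustrative only: pull changed rows into an unindexed, unconstrained stage table
    -- so the time spent touching the source system is as short as possible
    IF OBJECT_ID('stage.order_header') IS NOT NULL
        DROP TABLE stage.order_header

    SELECT order_id, order_date, product_id, quantity, price, last_update
    INTO stage.order_header                           -- heap: no indexes, no constraints
    FROM source_db.dbo.order_header
    WHERE last_update >= DATEADD(day, -1, GETDATE())  -- yesterday's changes only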

    Your second point is very true: we should not use the combination of all the surrogate dimensional keys as the clustered index, but should use only several keys. I normally use either the 3 surrogate dimensional keys which make the fact table record unique, or a fact key column if there is one. In a particular project we had a customer profitability incremental snapshot fact table which had 19 surrogate keys. The grain of this fact table is 1 row for each customer requirement per day, and the clustered index was made of only 3 columns: date_key, customer_key and a degenerate dimension column called requirement number. We also had a transactional fact table, which recorded the visits to each customer, with 12 surrogate dimensional keys. The grain of this fact table is 1 row for each visit, and the clustered index is comprised of date_key, customer_key and the requirement number DD.

    I normally try to create a unique clustered index, and on many occasions it is the primary key columns. In certain cases I had a non-unique clustered index, for example when the number of surrogate key columns required to uniquely identify the fact table record is more than 4 and there is no fact table key. In these cases my preference is to create a fact table key and make it the clustered index. If that is not possible, as an alternative I would consider taking 2 or 3 PK columns as a non-unique clustered index. The first column is typically the date key, the main reason being the extremely high usage of this column in the application queries. The second column is typically the customer key, account key, product key or service key, depending on which mart I am working on, for the same reason, i.e. frequency of usage in the queries. For performance and storage reasons I tend to avoid having more than 3 columns in the clustered index, and for performance reasons I also tend to avoid having a fact table without a clustered index.
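
    To make that concrete, a simplified example; the table and index names are illustrative, and requirement_no stands for the degenerate dimension column mentioned above:

    -- Illustrative only: narrow unique clustered index on the 3 columns
    -- that make each customer profitability snapshot row unique
    CREATE UNIQUE CLUSTERED INDEX cix_fact_customer_profitability
    ON fact_customer_profitability (date_key, customer_key, requirement_no)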

    Thank you for sharing your experience about the date surrogate keys, Stefan. If anybody has other experiences with fact table indexes or in-memory transformations, be they contradictory or supportive of the above, I would appreciate it if you could share them so we can all learn from them.

    Kind regards,

    Vincent

  • Hi Vincent,

    Thanks a lot for this article (and its companions "Data Warehouse Loading" and "Upsert Dimension Tables"): a real gold mine 🙂 complementing, on the practical side, some points developed by Ralph Kimball in his excellent book "The Microsoft Data Warehouse Toolkit".

    "I haven't covered audit, control, restore, MDB and EAI (topics 7, 8, 9, 10 and 11 - see the planned content in the DW Loading article) as I've been busy writing a book and another article on using DW for CRM, but if a lot of people want it please tell me and I'll write it for you guys."

    Please do! I'm looking forward to reading it!

    Also another request: some tips or best practices concerning the use of indexes for both dimension and fact tables. I have read lots of articles on the subject, only to get lots of contradictory information. I would really appreciate the advice of a long-time practitioner ;).

    Luc

    MCTS SQL Server 2005 BI

    Luc Magnée
    MCTS BI
    Antwerp - Belgium

  • Hi Vincent

    Nice, overall concise article; however, I don't agree with your Unknown record solution. The product dimension is not unknown - only its attributes are unknown. By creating a "forced" record for the product_id you don't lose the product_id information, and the business can decide what they want as defaults for the product attributes. So your data will now look like this:

    Here is an example of a row in the source table with a product_id that does not exist in the dimension table:

    order_id  order_date  product_id  quantity  price  last_update
    358       19/08/2006  BGCKZ       3         2.99   31/10/2006

    "Force" the unknown product_id as an actual dimension record.

    product_key  product_id  product_name  description    min_level  valid_until  load_time
    231          BGCKZ       Unknown       Late Arriving  0          1/1/1900     2/10/2004

    Now your lookups will still be accurate.

    This is how fact_sales looks after that record is loaded:

    fact_key  date_key  product_key  order_id  quantity  price  load_time
    830937    2424      231          358       3         2.99   31/10/2006

    This solves the problem of late-arriving dimension records. Due to a bad/unusual business process you may receive the product details for BGCKZ a few days after the fact record; it is then just a matter of updating the existing dimension record with the correct details. This way both the business meaning and foreign key integrity are preserved.
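
    A rough T-SQL sketch of the two steps; dim_product and stage_product are made-up names, the columns follow the example above, and product_key is assumed to be an identity surrogate:

    -- Illustrative only: step 1 - "force" the late-arriving product into the dimension
    -- so the fact row can reference a real product_id
    INSERT INTO dim_product (product_id, product_name, description, min_level, valid_until, load_time)
    VALUES ('BGCKZ', 'Unknown', 'Late Arriving', 0, '19000101', GETDATE())

    -- Illustrative only: step 2 - when the real product details finally arrive,
    -- correct the existing dimension record in place
    UPDATE d
    SET d.product_name = s.product_name,
        d.description  = s.description,
        d.min_level    = s.min_level
    FROM dim_product d
    INNER JOIN stage_product s ON s.product_id = d.product_id
    WHERE d.product_name = 'Unknown'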

    Regards,

    Dudley

  • Yes, you are right Dudley - we can create the product dimension row based on the product ID to handle late-arriving dimension data.

  • Hi Vincent,

    Your article is great for me; it's the first time I'm working with data marts. I just created my first environment (still in development/test mode). I read your book as well. I will work in SSIS, but for now I'm creating my processes in SQL to understand the behavior of the records.

    My fact table has some aggregation columns (more than key columns). This fact table will be used for reporting. All the reports will use the information from this fact table, and they will be run with date range and client id parameters (date_key and client_key).

    How exactly will I call my report and my aggregates? I mean, is this a data mart's “natural” behavior? Some aggregate columns are percentages or probability calculations. I assume I will aggregate the aggregates in order to have a report for one month.

    Do you have an example of how to get the information (Report) from the fact table with SQL?

    Thank you

  • Hi MTY,

    "aggregate column which is more than key columns"

    I think what you mean is a measure column which is duplicated across rows. The Kimball term for this situation is that the measure doesn't match the grain of the fact table. Generally this means that the measure does not belong in this fact table and you would have to build another fact table for that measure, with the right grain.

    "how exactly I will call my report and my aggregates"

    A typical star join query used to retrieve measures from the fact table is

    select d1.attribute1, d2.attribute1, sum(f.measure1)
    from fact1 f
    inner join dimension1 d1 on d1.d1_key = f.d1_key
    inner join dimension2 d2 on d2.d2_key = f.d2_key
    group by d1.attribute1, d2.attribute1

    If your measure doesn't satisfy the grain of the fact table you will encounter double counting. Hence the recommendation is to move that measure column to another fact table so that it can be summed up correctly.

    If, however, you insist on putting that measure in this fact table, then you would have to take a max, min or average (or another aggregate SQL function, like rank), for example:
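
    For example, if a (hypothetical) measure2 column repeats within the grain:

    -- Illustrative only: measure2 repeats within the grain, so take max instead of sum
    select d1.attribute1, max(f.measure2)
    from fact1 f
    inner join dimension1 d1 on d1.d1_key = f.d1_key
    group by d1.attribute1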

    "some aggregate columns are percentages or probability calculations"

    With percentage measures we generally have to persist (store) them as 2 columns in the fact table: the numerator (A) and the denominator (B). The report or cube then does the calculation A/B to get the percentage, and the same applies to probability. For example:
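
    A minimal sketch of the monthly report query; dim_date, calendar_month, numerator_amount and denominator_amount are made-up names:

    -- Illustrative only: sum numerator and denominator separately, then divide
    declare @client_key int
    set @client_key = 123   -- report parameter

    select d.calendar_month,
           sum(f.numerator_amount) * 1.0 / nullif(sum(f.denominator_amount), 0) as pct   -- * 1.0 avoids integer division
    from fact1 f
    inner join dim_date d on d.date_key = f.date_key
    where f.client_key = @client_key
    group by d.calendar_month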

    Hope this helps,

    Vincent
