Thank you, this is very helpful. I am taking on a new assignment, and will be building a data warehouse. This is a great example of how to monitor for pitfalls.
I understand this is 3rd in a series. How do i find the preceding two articles?
One of the problems I've run into is the size of the fact tables. This is with a with a web site doing 70+ million hits per year. I've found that the majority of reporting works on month/year as the major time dimension, so I created a function that changes the date to the first of the month. I don't like the idea of chosing a particular day, but OLAP tools are looking for a date and not just a month/year combo. This cuts the size of the data into 1/30th of what it would be otherwise.
CREATE FUNCTION [dbo].[GetRptDate] (@Date datetime) RETURNS datetime AS BEGIN DECLARE @StrYear nvarchar(4)SET @StrYear = Year(@Date)DECLARE @StrMonth nvarchar(2)SET @StrMonth = Month(@Date) RETURN convert(datetime,(convert(nvarchar,@StrYear+'-01-'+@StrMonth, 103)), 103) END
I use this function as part of an ETL job that runs daily. There is a lot more to it than this, but I wanted to see how other people are dealing with VLFT (Very Large Fact Tables - a new acronym?). Thanks.
Thank you Vincent for taking the time to write this article. It is coherent, concise, and useful. Well done!
In the section titled "Fact Table Partitioning" (near the bottom) I think there are some missing hyperlinks. I would like to read the one about partitioning in SQL 2005. The sentence reads as "This article provides a good overview on the partitioning on SQL Server 2005", but there is no hyperlink on "This". Or at least it's not showing up in my browser (IE7). Can you please reply with the link?
Hi James, the link to the excellent SQL 2005 partitioning by Kimberly Tripp is http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql2k5partition.asp, where as the one on SQL 2000 by Joy Mundy is http://msdn2.microsoft.com/en-us/library/aa902650(SQL.80).aspx
Thank you Sam for the useful date converter function. On Oracle and SQL 2005 we deal with VLFT (good acronym, thx) using table partitioning. On SQL 2000 as there is no real partitioning, summary tables, indexing and purging are our weapons. We had a snapshot fact table with 90m rows and treated this with a summary table (which is only 300k rows), purging (leaving daily snapshot data for the last 4 weeks and weekly for last 2 years). Luckily the ETL is not too much of a problem as we maily do insert and no updates.
Thank you Michelle pointing the two articles for Mike. The URL for my articles is http://www.sqlservercentral.com/columnists/vRainardi. Mike, Data Warehouse Loading is the first in the series, where I describe the general things, then I wrote 1 article focusing on dimension tables and another one focusing on the fact tables. I haven't covered audit, control, restore, MDB and EAI (topic number 7, 8, 9, 10, 11 - see planned content on the DW Loading article) as I've been busy writing a book and another article on using DW for CRM but if a lot of people want it please tell me and I'll write it for you guys.
the article is good but I missed the usage of the new possibilities of SQL Server 2005 SSIS. In SQL 2000 it was a common way to use a "staging area" in form of a temporary table for the dimension mapping, calculations and whatever. However, performance problems arised when dealing with tons of data. Because every time you write data into your database you have disk IO which slows down performance dramatically.
With SQL 2005 you can avoid this behaviour, because the SSIS architecture enables you to perform nearly all transformations in memory and then writing the data into the fact table. This is much faster than using a staging table and common best practice.
Another thing I do not agree with is using the combination of all surrogate dimension keys in the fact table as clustered index. How does this work in a table with xx million rows and 15 dimensions? This is much afford for the dbms to maintain such an index and I do not believe that it works performantly.
I made good experiences by using one of the date surrogate keys as clustered index for due date reporting fact tables. Other experiences?
Hi Stefan, you are right that in SSIS we can do all transformations and lookups in memory and write directly into the fact table. The reason we dumped the source systems data into a stage database in this particular project was because our access window to the operational systems are limited. In one system (the main one) we were allowed access only from 4 am to 5 am and the whole ETL package took about 2 hours. The operation guys didn't want us to do any data extract during working hours (which was 6 am to 8 pm) because of the impact to the online performance of the operation system. From 8 pm to 4 am they have various house keeping processes such as overnight batch process and backup. So we just used the limited time window only to get the data out from the source system into either a flat file or a collection of unindex, unconstrained tables. We managed to get the extract portion of the ETL to 15-30 minutes. But as you said, the best practice is to perform in-memory transformations, because this produces overall end-to-end shortest ETL execution time. In other projects where I used Informatica for ETL I also performed in-memory lookup and transformations, especially when we have a dedicated box for ETL processes.
You second point is very true, we should not use the combination of all the surrogate dimensional keys as the clustered index, but should use only several keys. I normally use either 3 surrogate dimensional keys which make the fact table record unique, or a fact key column if there is one. In a particular project we had a customer profitability incremental snapshot fact table, which had 19 surrogate keys. The grain of this fact table is 1 row for each customer requirement per day. The clustered index was made of only 3 columns: date_key, customer_key and a degenerate dimension column called requirement number. We had a transactional fact table which record the visit to each customer with 12 surrogate dimensional keys. The grain of this fact table is 1 row for each visit. The clustered index is comprised of date_key, customer_key and requirement number DD. I normally try to create unique clustered index and in many occasions it is the primary key columns. In certain cases I had non-unique clustered index, for example when the number of surrogate key columns required to uniquely identify the fact table record is more than 4 and there is no fact table key. In these cases my preference is to create a fact table key and make it the clustered index. If it not possible to do that as an alternative I would consider taking 2 or 3 PK columns as non-unique clustered index. The first column is typically the date key, with the main reason being the tendency of extremely high usage of this column in the application queries. The second column is typically the customer key, account key, product key or service key depending on which mart I am working on, also because of the same reason e.g. frequency of usage in the queries. For performance and storage reasons I tend to avoid having more than 3 columns for clustered index and for performance reasons I also tend to avoid having a fact table without a clustered index.
Thank you for sharing your experience about the date surrogate keys, Stefan. If anybody have other experiences with fact table indexes or in-memory transformations, be it contradictory or supportive to the above, I would appreciate if you could share them so we can all learn from them.