• Hi Stefan, you are right that in SSIS we can do all transformations and lookups in memory and write directly into the fact table. The reason we dumped the source system data into a stage database in this particular project was that our access window to the operational systems was limited. In one system (the main one) we were allowed access only from 4 am to 5 am, and the whole ETL package took about 2 hours. The operations guys didn't want us to do any data extract during working hours (which were 6 am to 8 pm) because of the impact on the online performance of the operational system. From 8 pm to 4 am they had various housekeeping processes such as the overnight batch process and the backup. So we used the limited time window only to get the data out of the source system into either a flat file or a collection of unindexed, unconstrained tables. We managed to get the extract portion of the ETL down to 15-30 minutes. But as you said, the best practice is to perform in-memory transformations, because this produces the shortest overall end-to-end ETL execution time. In other projects where I used Informatica for ETL I also performed in-memory lookups and transformations, especially when we had a dedicated box for the ETL processes.
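
    To illustrate what I mean by unindexed, unconstrained tables, here is a minimal sketch in T-SQL (the table and column names are made up for illustration, not the actual project schema). The staging table is a plain heap: no primary key, no foreign keys, no indexes, so the bulk insert during the 4-5 am window is spent on raw copying only:

        -- Hypothetical table in the stage database: a heap with no indexes,
        -- no primary key and no foreign keys, so the bulk insert during the
        -- 4-5 am extract window runs as fast as possible.
        CREATE TABLE dbo.stg_CustomerVisit
        (
            visit_id       INT          NULL,  -- no PK; dedup happens downstream
            customer_code  VARCHAR(20)  NULL,  -- no FK back to the source system
            visit_date     DATETIME     NULL,
            requirement_no INT          NULL,
            loaded_at      DATETIME     NULL   -- audit column set by the load
        );
        -- Lookups, transformations and constraint checks run later, after the
        -- access window to the operational system has closed.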

    Your second point is very true: we should not use the combination of all the surrogate dimensional keys as the clustered index, but only a few of them. I normally use either 3 surrogate dimensional keys which make the fact table record unique, or a fact key column if there is one. In a particular project we had a customer profitability incremental snapshot fact table, which had 19 surrogate keys. The grain of this fact table was 1 row for each customer requirement per day. The clustered index was made of only 3 columns: date_key, customer_key and a degenerate dimension column called requirement number. We also had a transactional fact table which recorded the visits to each customer, with 12 surrogate dimensional keys. The grain of this fact table was 1 row for each visit. Its clustered index comprised date_key, customer_key and the requirement number degenerate dimension.

    I normally try to create a unique clustered index, and on many occasions it consists of the primary key columns. In certain cases I had a non-unique clustered index, for example when the number of surrogate key columns required to uniquely identify a fact table record is more than 4 and there is no fact table key. In these cases my preference is to create a fact table key and make it the clustered index. If that is not possible, as an alternative I would consider taking 2 or 3 of the PK columns as a non-unique clustered index. The first column is typically the date key, the main reason being the extremely high usage of this column in the application queries. The second column is typically the customer key, account key, product key or service key, depending on which mart I am working on, for the same reason, i.e. frequency of usage in the queries. For performance and storage reasons I tend to avoid having more than 3 columns in the clustered index, and for performance reasons I also tend to avoid having a fact table without a clustered index.
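
    As a concrete sketch of that indexing approach (again, the table and column names here are illustrative, not the actual project schema), the snapshot fact table's clustered index would look something like this in T-SQL:

        -- Hypothetical snapshot fact table; only 2 of the 19 surrogate keys
        -- go into the clustered index, because together with the degenerate
        -- dimension column they already identify the grain
        -- (1 row per customer requirement per day).
        CREATE TABLE dbo.FactCustomerProfitability
        (
            date_key        INT   NOT NULL,  -- surrogate key to the date dimension
            customer_key    INT   NOT NULL,  -- surrogate key to the customer dimension
            requirement_no  INT   NOT NULL,  -- degenerate dimension column
            -- ... the remaining 17 surrogate dimensional key columns ...
            profit_amount   MONEY NOT NULL   -- example measure
        );

        -- Unique clustered index on 3 columns only, leading with date_key
        -- because it is the most heavily used column in the application queries.
        CREATE UNIQUE CLUSTERED INDEX IX_FactCustomerProfitability
            ON dbo.FactCustomerProfitability (date_key, customer_key, requirement_no);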

    Thank you for sharing your experience with the date surrogate keys, Stefan. If anybody has other experiences with fact table indexes or in-memory transformations, whether contradictory or supportive of the above, I would appreciate it if you could share them so we can all learn from them.

    Kind regards,

    Vincent