Foreign key constraints in data warehouses

  • I am just learning about data warehousing and business intelligence, and reading Ralph Kimball’s excellent book “The Data Warehouse Toolkit, Second Edition”.

    In dimensional modeling, many of the fact table fields are foreign keys that link to the primary keys of dimension tables. My question is, when the data warehouse database is implemented in SQL Server 2005, is there any point to adding SQL Server foreign key constraints between the fact and the dimension tables? The data should have already been cleansed in the loading process, and adding foreign key constraints doesn’t automatically index the foreign key fields – the fields have to be indexed separately. It seems that foreign key constraints might add complexity, slow processing, and provide no benefit.

    Any opinions?

  • The cleansing routine is done during the ETL process. After the data loaded into the dimension table, it is not guaranteed that no one would delete it.

    To put a foreign key in the fact table is used for data integrity.

  • I understand your point about the data cleansing. Why should it be necessary? One, it will help analysis services when you import the tables because it will see the relationships. Two, just in case something is wrong somewhere, the FKs might show it. I recently had a case where having FKs showed an issue that would not have shown up otherwise.

  • I agree that you should always add a FK in your data warehouse.

    My other consideration is that you can always use the benefits of index between FK column in your fact table and PK in your dimension table if you want to drill through the data

  • During ETL, loading Facts we always do a lookup with the dimensions to ensure that data gets filtered. In other words fact tables will contain only that data whose reference is present in dimension tables. Adding primary key foreign key relationship will decrease the response time

  • Chuck Bevitt (2/7/2008)


    My question is, when the data warehouse database is implemented in SQL Server 2005, is there any point to adding SQL Server foreign key constraints between the fact and the dimension tables?

    I wouldn't do it. Foreign Keys are really antithetical to the purpose and philosophy of OLAP and data warehouses.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • During ETL, loading Facts we always do a lookup with the dimensions to ensure that data gets filtered. In other words fact tables will contain only that data whose reference is present in dimension tables. Adding primary key foreign key relationship will decrease the response time

    I suppose that for some downloads there would be no change to the dimension tables, but I haven't had that experience. There are always new dimension facts to download too, such as new technicians or new zip codes where the work is being done. There is the chance that something will go wrong with those downloads, in which case, you wouldn't want the fact table to also succeed. FKs prevent that second download from succeeded.

    There could be a case where the download time performance is absolutely crucial, but most downloads occur in the middle of the night, and 10 or 20 extra seconds isn't going to matter.

  • RonKyle (3/10/2008)


    I suppose that for some downloads there would be no change to the dimension tables, but I haven't had that experience. There are always new dimension facts to download too, such as new technicians or new zip codes where the work is being done. There is the chance that something will go wrong with those downloads, in which case, you wouldn't want the fact table to also succeed. FKs prevent that second download from succeeded.

    There could be a case where the download time performance is absolutely crucial, but most downloads occur in the middle of the night, and 10 or 20 extra seconds isn't going to matter.

    In general, I think that it is better to implement ETL features in the ETL itself and keep them out of the OLAP. And that is a large part of the whole philosophy of having an OLAP database that is separate from your OLTP database.

    Also, I have been in many environments where my customer was pushing the limits of their 4 to 6 hour ETL window and there have not been many cases where one FK would have added only 10 to 20 seconds, let alone one or more for every dimension.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Loner (2/7/2008)


    The cleansing routine is done during the ETL process. After the data loaded into the dimension table, it is not guaranteed that no one would delete it.

    To put a foreign key in the fact table is used for data integrity.

    Users and Apps should not be deleting records or making any other permanent modifications to an OLAP database. Only the ETL should be doing that.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Since data warehouse is not a mission critical, I completely agree with Ron. Added a FK would not decrease a process at all, because we do it in a batch. No users involved in this process

    Also imagine when you had to drillthrough the data. FK will speed up the process.

  • I'd love to be in Hendra's position...... a datawarehouse that isnt mission critical. What a joy that would be. We have several real time cubes that make our datawarehouse mission critical, not just for the users but they feed information back to the OLTP (OK, not the best scenario I know but the decision was taken out of my hands). With that in mind FK's have proved a their worth on a couple of ocassions. I'm not overly keen on them as they do prove a bit of a pain when developing and other things but they have proved useful in our situation. On several other projects in the same datawarehouse I have elected to remove FK's because I deemed them unnecessary, so I suppose my approach is horses for courses, what works in one case might not in another.

  • Let's stick with Kimball mathod for a minute.

    These PK/Fk pairs are surrogate keys not the natural keys from OLTP. They should be built during initial OLAP load and maintained in the subsequent ETL process.

    Which means in dimensional table, if it is type2, it gets a new set of PK with every update. Thus, comes with the need of remapping the fact table's FK to that new FK.

    Now, to implement this theory in SSIS successfully is something I would like to see.

    Anybody has the experiece implementing, say 3 to 5 fact tables with a dozen dimention tables and the total size of the db is in 100 GB range?

    How long will a historical load take?

  • These PK/Fk pairs are surrogate keys not the natural keys from OLTP. They should be built during initial OLAP load and maintained in the subsequent ETL process.

    Maybe I should have been clearer from my earlier entries that these are the FKs I'm using. The former PKs may be unique indexed, but all the dimension data is turned into a surrogate key integer, even if the original PK was an integer. I would not advocate trying to maintain the OLTP PK/FK relationship.

  • hongchen91 (3/11/2008)


    Anybody has the experiece implementing, say 3 to 5 fact tables with a dozen dimention tables and the total size of the db is in 100 GB range?

    How long will a historical load take?

    This is highly dependent on the environment & configuration, however in my experience, 3 hours would be the absolute minimum, and I have certainly had to wiat over 2 weeks for some to complete.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • I always declare fact to dimension foreign key constrains in SQL Server 2005.

    However I do not enforce the foreign key constraint, nor define any cascading event actions.

    Example :

    ALTER TABLE

    [dm].[Fact_Episode_Summary] WITH NOCHECK ADD CONSTRAINT

    [FK_Fact_Episode_Summary_Dim_Hospital] FOREIGN KEY([Hospital_DID])

    REFERENCES [dm].[Dim_Hospital] ([Hospital_DID]) NOT FOR REPLICATION

    GO

    ALTER TABLE [dm].[Fact_Episode_Summary] NOCHECK CONSTRAINT

    [FK_Fact_Episode_Summary_Dim_Hospital]

    GO

    I, like other noted here , do not enforce the constraint as the ETL process swill always gurantee that the relationship is valid. To do so would be an unnecessary overhead.

    However the relationship is very important to the query optimiser. It tells the optimiser that it can rely on the 1:n relationship. For very large databases this is critical as it allows the qury optimser to design a suitable parrallel query plan.

    I have been building data warehouses for about 10 years now.

Viewing 15 posts - 1 through 15 (of 35 total)

You must be logged in to reply to this topic. Login to reply