Foreign key constraints in data warehouses

  • That's an interesting idea with the NOCHECK constraint. Doing that lets you take advantage of SQL Server 2005's ability to file non-conforming rows into a kind of UNKNOWN bucket.

    I would still be inclined to leave the constraint enforced in the beginning, and after major changes, to catch errors in the transformation and load portion (as I said in a previous post, without it I would not have caught a load error), although I suppose once a row is committed to the UNKNOWN bucket there'd be nothing more to do about it. A sketch of the NOCHECK approach follows below.
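
    If it helps, here is a minimal sketch of the NOCHECK pattern; the table, column, and constraint names are just assumptions for illustration:

        -- Declare the FK without validating existing rows (hypothetical names).
        ALTER TABLE dbo.FactSales WITH NOCHECK
            ADD CONSTRAINT FK_FactSales_DimCustomer
            FOREIGN KEY (CustomerKey) REFERENCES dbo.DimCustomer (CustomerKey);

        -- Optionally stop checking future inserts/updates too; the constraint
        -- stays declared (documenting the relationship) but is untrusted.
        ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimCustomer;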

  • Kerry Bennett (3/11/2008)


    I always declare fact-to-dimension foreign key constraints in SQL Server 2005.

    However, I do not enforce the foreign key constraints, nor define any cascading event actions.

    I have not tried this, but I think that I would be OK with it.
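
    If you ever did want to turn enforcement back on, a declared-but-disabled constraint can be validated after the fact; a minimal sketch, reusing the hypothetical names from the sketch above:

        -- Re-check all existing rows and mark the constraint trusted again.
        -- The doubled CHECK is intentional: WITH CHECK validates existing data,
        -- and CHECK CONSTRAINT re-enables the constraint for future DML.
        ALTER TABLE dbo.FactSales WITH CHECK
            CHECK CONSTRAINT FK_FactSales_DimCustomer;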

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Which means that in a dimension table, if it is Type 2, a new PK is generated with every update. Thus comes the need to remap the fact table's FK to that new PK.

    Okay, I'm lost here. Why would you have to remap the PK/FK relationship between the dimension table and the fact table if you entered a new record in the dimension table? The value of the PK has changed, not the PK/FK relationship.

    😎

  • The PK on the dim table has changed to a new value. Its counterpart, the FK in the fact table, needs to be reset to that value. That is what I meant by "remap".

  • But that doesn't change the PK/FK relationship.
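
    To make the disagreement concrete, here is a minimal Type 2 sketch (table, column, and key values are hypothetical). A new dimension row means a new surrogate key value, but existing fact rows keep pointing at the old key; only facts loaded afterwards pick up the new one, and the declared PK/FK relationship itself never changes:

        -- Expire the current row for business key 'C1001' (hypothetical names).
        UPDATE dbo.DimCustomer
        SET    RowEndDate = GETDATE(), IsCurrent = 0
        WHERE  CustomerBusinessKey = 'C1001' AND IsCurrent = 1;

        -- Insert the new version, which receives a NEW surrogate key (say 42,
        -- where the old row was 17). Facts loaded earlier still reference 17;
        -- only facts loaded after this change map to 42 via the ETL lookup.
        INSERT INTO dbo.DimCustomer
               (CustomerBusinessKey, City, RowStartDate, IsCurrent)
        VALUES ('C1001', 'Newtown', GETDATE(), 1);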

  • After reading through some of the "suggestions" on this thread, and going through years of getting my butt handed to me by my own code, I'm curious: if you don't declare and enforce RI, how would you know your ETL processes are scrubbing the data properly? I can see unit testing, but when edge cases are taken into account and there's no RI, all hell generally breaks loose.

    After cleaning up messes made by some of the published authors in this field, I can assert from personal experience that RI is not optional. First, it'll shorten your time to release, as your ETL bugs will be immediately visible (no testing required, the code just fails), AND it will save your administrator hours of sleep, since an RI violation in production should trigger a batch rollback/reload, so no human intervention is required to clean up the errant data (a sketch of such a rollback follows below).

    Last, and probably most importantly, it'll save your users from making huge business mistakes. If a part is purchased from supplier A and the reporting system is asserting that it is purchased from supplier B, a bad decision is going to be made. Depending on the decision, it could be terribly detrimental to the company as a whole. If the company doesn't make money, neither do you 🙂

    There have been occasions where I've dropped RI due to physical limitations of a machine, but I'm also aware I do so at my peril. I'd suggest this is always the case.
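
    As a rough illustration of that rollback/reload point (all object names are assumptions, not anyone's actual schema), a fact load can be wrapped so that any FK violation aborts the whole batch:

        BEGIN TRY
            BEGIN TRANSACTION;

            -- If any row violates an enforced FK, this INSERT fails as a whole.
            INSERT INTO dbo.FactSales (DateKey, CustomerKey, Amount)
            SELECT DateKey, CustomerKey, Amount
            FROM   staging.Sales;

            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            -- Roll back the entire batch so it can simply be reloaded.
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

            DECLARE @msg NVARCHAR(2048);
            SET @msg = ERROR_MESSAGE();
            RAISERROR(@msg, 16, 1);  -- surface the failure to the job scheduler
        END CATCH;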

    Rob

  • 100% agree with ironpants. You should always start with RI enforced, only consider removing it in special cases (usually for performance issues that can't be solved any other way), and, when doing so, make the business fully aware of the risks. How many of us are absolutely certain that all processes that touch the DB are perfectly written and respect relationships that RI no longer enforces?

    Regards,

    Jacob

  • Had a case just yesterday where having the FKs in place led us to discover that a customer hadn't provided information on 5 zip codes for a process that had been added after the initial fielding. Without constraints, we would not have discovered it was an issue. I'm on the side that says you take a huge risk without them. You just never know what can happen, especially if you make changes.

  • Shouldn't those 5 missing zip codes have shown up via RI in the OLTP database?

    If the OLTP database lacks RI, the bad data will simply flow into the DW as well!

    I think the valid logical 'foreign key relation' is important in the DW, though it need not be enforced by an 'actual' foreign key.

    During loading, I use a lookup at the ETL level to assign fact table surrogate key values from the dimension tables, and if a value is not found (i.e. where RI is broken at a logical level) I assign a default value so as to avoid a NULL; in practice, that default value in the fact table flags a foreign key relation problem, presumably in the ETL region. If you need enforced foreign keys in the DW, then the data loading is defective. (A sketch of this lookup pattern follows at the end of this post.)

    It would be possible to apply RI during development and testing, to ensure that data is coming into the DW without RI problems, and to remove it once the DW load volume becomes significant. I can imagine that a fact table with perhaps 30 foreign keys to large dimension tables and millions of rows is going to be very slow on inserts and heavy on resources. To speed up querying on the DW, suitable indexing can be used and switched off during loading.

    Having to enforce foreign key relations on all keys of all fact tables must surely cripple performance.

    If the input data from the OLTP database is not 'clean', there is a problem 'outside' the DW itself.
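
    A minimal sketch of that lookup-with-default pattern, with hypothetical table, column, and index names (a pre-seeded dimension row with surrogate key -1 stands in for the unknown member):

        -- Resolve surrogate keys in the ETL; unmatched rows get the pre-seeded
        -- Unknown member (-1) instead of a NULL.
        SELECT  s.SaleID,
                COALESCE(d.CustomerKey, -1) AS CustomerKey
        FROM    staging.Sales AS s
        LEFT JOIN dbo.DimCustomer AS d
               ON  d.CustomerBusinessKey = s.CustomerCode
               AND d.IsCurrent = 1;

        -- Fact-table indexes can likewise be switched off for the load:
        ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;
        -- ... bulk load here ...
        ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales REBUILD;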

  • We've completely separated the dimension tables from the fact tables. They could be on separate servers if I wanted.

    This was a bit nerve-wracking at first, I will admit. Adding triggers to the dim tables to prevent deletes helped (see the sketch below).

    Also, access to the dim data is only through the SSIS package or the SSAS account.
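
    A hedged sketch of such a delete-blocking trigger (the dimension name is hypothetical); an INSTEAD OF trigger simply replaces the delete with an error:

        -- Block all deletes against the dimension table; the attempted
        -- delete is never performed, and the caller gets an error instead.
        CREATE TRIGGER trg_DimCustomer_NoDelete
        ON dbo.DimCustomer
        INSTEAD OF DELETE
        AS
        BEGIN
            RAISERROR('Deletes from dimension tables are blocked.', 16, 1);
        END;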

  • Do you mean separate databases for the fact and dimension tables?

    I am curious to know if this slows down joins between fact and dimension tables due to their being in separate DBs.

    It would also be possible to use separate disks (RAM disks?) for different parts of the DW (staging tables or lookup tables as well as dimensions or facts or even indices) but remaining within the same database.

    Does the advantage for you lie in being able to have separate servers for each part of the DW?

  • You know how I can tell you have no idea about relational integrity? 😉

    Do this at your own risk. After about 20 years, I've come to the conclusion that my employment is bound to the extent of my constraints. You just removed a bunch of yours. After it all falls apart, what do you think your job security will look like?

  • I agree with extensive use of foreign keys in an OLTP database (I worked on one recently which had NONE and it was a very scary experience). Personally I would not use FKs at all in an OLAP/DW.

    If you use FKs rigorously in a DW (that is, on virtually ALL key fields of a fact table), these tables are likely to be so slow on block inserts or updates that the DW is likely to become unusable as the fact tables grow over time.

    I think it is bad practice to use RI in the DW to detect problems in the ETL layer. It is true that the FK will show up an RI problem in the ETL, but I would suggest that this be handled another way (lookups).

    If a lookup surrogate key value is not found, this implies broken RI; it can be logged as an error if this is critical, or the lookups can be added (as with the 5 missing zip codes) either manually or by using code to update the lookup tables.

    This is however an ETL/staging problem. It should not be allowed to dictate constraints on the DW.

    The whole point of the DW is its speed of response both to data loading and to data querying. Orthodox use of FKs (as for OLTPs) works directly against this. Partial use of FKs seems pointless!

    To use the fact tables as a means of detecting ETL errors is getting the argument backwards.

    RI errors in the ETL should be managed BEFORE the data gets into the fact tables or dimensions. I would suggest writing good ETL procedures that handle missing lookup surrogate key values, or that automatically insert new values (e.g. the zip codes above) from the ETL feed data into the lookup tables. This way a value CANNOT be inserted into a fact table or dimension without its lookup value, since the lookup value itself has been checked beforehand and MUST be there. (A sketch of this pre-insert pattern follows below.)

    I am concerned that good-practice OLTP techniques, when applied to OLAP, become a recipe for an overloaded and finally dysfunctional DW. (I would also avoid the use of triggers and cascades in a DW star schema.)
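
    A minimal sketch of that pre-insert ('inferred member') approach, with hypothetical names: any code seen in staging but missing from the dimension is inserted as a flagged placeholder before the fact load, so every incoming value is guaranteed to resolve:

        -- Pre-insert zip codes present in staging but absent from the
        -- dimension, flagged as inferred for later manual enrichment.
        INSERT INTO dbo.DimZipCode (ZipCode, City, IsInferred)
        SELECT DISTINCT s.ZipCode, 'Unknown', 1
        FROM   staging.Sales AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.DimZipCode AS d
                           WHERE  d.ZipCode = s.ZipCode);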

  • Yes, separate databases. All the fact data lives in separate databases as well, and all these fact DBs share the common dimension DB. I don't worry about joins, since join speed depends on indexes, not on where the data is located; besides, almost all data access is through cubes. As for integrity, that is why you have proper backups: the ability to recover is what saves your job.

    I suppose separate DBs mean easier maintenance: if you were to use disk and table partitioning and separate tablespaces, wouldn't the result be similar? I can see that the modular approach you suggest might be simpler from an admin point of view. I suppose it would also be possible to get performance gains during ETL if the dimensions are on one machine and the facts on another, though there could be a problem synchronising the two.

    What were the factors that pushed you to try this solution?
