Foreign key constraints in data warehouses

  • We use Standard Edition, so partitioning is not possible. And yes, maintenance is easier. Also, we use schemas to separate all the dim data, for security purposes and for organization.

  • The primary factor in this approach was data-loading performance. We did have FKs on everything, since my background and schooling were OLTP, but loading a fact table with 25 FKs was not working. Also, we have been trying to take an agile approach to data warehousing as well as app dev, and the thought of having duplicate dim tables all over the place was not appealing, nor was having to recover a ginormous DW. The modular approach made more sense; of course there was added risk to integrity, but it was worth it. Development is easier and faster with all the dims in place and available to use as needed. One issue, though, is the possibility of facts showing up without dim data, since loading occurs at different times. This is where the inferred member is used: basically, just create the dim row if it's missing, and on the next dim load it is updated with all the attributes (a sketch of the pattern follows). Check out Project REAL for some great examples of this.
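
    A minimal T-SQL sketch of the inferred-member pattern (table and column names here are hypothetical; Project REAL implements this inside SSIS rather than as set-based SQL, but the equivalent looks like this):

        -- Create a placeholder dim row for any business key arriving in the
        -- fact staging data that the dimension does not yet know about.
        -- The IsInferred flag lets the next dim load finish these rows off.
        INSERT INTO dbo.DimCustomer (CustomerBusinessKey, CustomerName, IsInferred)
        SELECT DISTINCT s.CustomerBusinessKey, 'Unknown', 1
        FROM   staging.FactSales AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.DimCustomer AS d
                           WHERE  d.CustomerBusinessKey = s.CustomerBusinessKey);

        -- On the next dim load, overwrite inferred rows with full attributes.
        UPDATE d
        SET    d.CustomerName = src.CustomerName,
               d.IsInferred   = 0
        FROM   dbo.DimCustomer AS d
        JOIN   staging.DimCustomer AS src
               ON src.CustomerBusinessKey = d.CustomerBusinessKey
        WHERE  d.IsInferred = 1;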

    Hth Tom

  • My biggest tip: once you have mastered the art of the lookup transform, remove all foreign keys from your fact table. As the author states, they are "technically right, but practically wrong," since FKs add unnecessary overhead for a check you just performed in the lookup transform.

    This is from another forum, but it seems germane to this one. I would really like to understand this position. I absolutely advocate the use of standard RI in the database; in my view, it's not directly an ETL function. In theory there should be no issues, but in theory that's also true of OLTP applications. In practice, there are always records that slip through.

    If you choose not to use them, I wonder whether you do, as Kimball suggests, run something once a week that tests them, just to ensure there are no issues (a sketch of such a check follows). The bottom line for me: with RI constraints I'm sure there's no data issue; without them, I'm not sure. Yes, there is a cost, but there's also a possible cost of data issues without them. The standard for data issues due to design defects should be zero.
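
    Such a weekly test could be as simple as one anti-join per fact/dimension pair (names here are hypothetical):

        -- Report any fact rows whose surrogate key has no matching dim row.
        -- Run weekly; any rows returned indicate an RI defect to investigate.
        SELECT f.CustomerKey, COUNT(*) AS OrphanRows
        FROM   dbo.FactSales AS f
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.DimCustomer AS d
                           WHERE  d.CustomerKey = f.CustomerKey)
        GROUP BY f.CustomerKey;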

    I don't see this any differently from the argument for skipping transaction control where it's needed because it affects performance (and yes, that happens). Most of the time there is no issue, but sometimes there is. The performance hit to ensure these issues don't exist is worth it; and if it's not, then, assuming the queries have been written efficiently, the hardware needs to be improved.

  • A couple of points to add here. First, the ETL is the only process that can change this data. Remember, this is a data warehouse: one load process, and the rest are just reads. As long as the dim data can't be deleted, RI remains intact. Second, we build SSAS cubes on top of all this data, and SSAS is not happy when a key doesn't have a match in the dim database, so that acts as a line of defense protecting RI. Also, the dim keys that live in the fact data are NOT NULL, so the ETL process will fail if no key is found; and since we generate missing keys when the lookup returns NULL, there are no issues there (see the DDL sketch below).
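
    To illustrate the shape of this design (hypothetical DDL, not the poster's actual schema): the fact table carries NOT NULL surrogate keys but no FOREIGN KEY constraints, so a missed lookup fails the load instead of being re-checked row by row by a constraint:

        CREATE TABLE dbo.FactSales
        (
            DateKey      int   NOT NULL,  -- surrogate keys are NOT NULL, so the
            CustomerKey  int   NOT NULL,  -- load fails fast if a lookup misses
            ProductKey   int   NOT NULL,
            SalesAmount  money NOT NULL
            -- no FOREIGN KEY constraints: RI is enforced by the ETL lookups,
            -- the inferred-member step, and the SSAS cube build
        );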

    My biggest worry was some conversion we did from some old dim data and the possibility of having a matching key but for the wrong record. One way to prevent this is to use GUIDs, but GUIDs are big and slow and use a lot of space, so all keys are INTs; besides, FKs would not have helped with this anyway.

    However, I do agree that design points could be missed: the cube could be set up to ignore errors, and the dim keys could be set to allow NULL, in which case the idea of running a weekly dim check sounds like a great one.

    We do have databases for some custom apps, and trust me, RI is strictly enforced there... anytime we let a developer near the data, RI is a must!

  • Thanks, that helps. Perhaps what I should have made clear is that I'm trying to be open to the idea that in an OLAP environment typical RI constraints might not only be unnecessary but might even be costly. I don't think I'm stuck in old ways: I had no trouble grasping that the tables would not be normalized in the OLTP sense, and I made that change very readily. But this seems different. A factual argument will always persuade me, however, so if someone wants to run one by me, let me know.

    The comment that the lookup jobs can get around this is potentially persuasive to me, but I also find the comment about turning the checks off during a load interesting, as I've seen that done.

  • If we could have RI with a single dimension set and no performance hit, then by all means we would. So I don't look at it from the standpoint of RI being unnecessary or not needed; it's more that, to achieve a certain design pattern and get the performance we need, we are unable to have RI set up for data warehousing.

    But as I mentioned, we have taken steps and put things in place that will identify RI issues and keep us safe.
