Tables, Primary Keys and Clustered Indexes

  • I do not create a primary key on fact tables unless I need to relate those facts to another fact table (in which case I add a surrogate key, identity, to the fact table).

    I almost always have added the entity business key as the key provided it's been an integer. Kimball doesn't recommend it, but I've personally not found any issues and it's a times helpful as in our system duplicates can appear.

  • To Gail Shaw:

    It is not an easy task to maintain referential integrity in a data warehouse. We decided to consolidate the data from different sources into one data store with FULL referential integrity in place. After cleansing but before consolidation, all data is checked for integrity flaws (both primary and foreign keys). Invalid data is marked as outflow; any changes in this outflow appear in a daily report. This approach saved us some headaches after version changes in our sources, or errors in third party application causing data corruption and creating 'litter' as we call it: Dangling foreign keys, NULL values in both primary and foreign keys, zeros in places where NULLs shoud be used; some aplication developers still use a databse as 'just a bunch of tables'.

    But it IS hard. You need to solve all kinds of issues BEFORE data moves into the data store. Checking integrity takes additional time, especially when the 'dimension' is large (like all customers), tree-shaped (like organizational units) creating a data-dependency-chain, or the data contains a lot of cross references. However, in the end we (both me AND the users) are still happy with it, also because it signals problems in the source databases that would otherwise linger on unnoticed. Gail, I might be one of the few, but I am at your side on this.

  • But it IS hard.

    It sounds hard only because someone else didn't follow the correct methods at the source data. If that's been done correctly, it's not that hard. All too often, however, it is not done correctly.

  • hi Andy,

    We have several tables above the 10 billion rows mark. Luckily we have the Enterprise Edition, so we have been able to partition the tables.

    That helps maintainability, it does not automatically improve the speed of your code.

    It has been difficult to figure out the best Partition Function for us and it also took us a while to realise that the Clustered Key and the Primary Key were not necessarily the same.

    Kimberly Tripp has a few blog post about choosing the best Clustered Key. See http://www.sqlskills.com/blogs/kimberly/category/clustering-key/

    Best regards,

    Henrik

  • Lynn Pettis (6/21/2016)


    Good database design is needed in data warehouse and data mart systems just as it is needed OLTP systems.

    Data Warehouses pulling data from multiple sources should cleanse and standardize (not the actual term I am looking for, but close) the data for reporting.

    SSIS itself isn't what Gail was referring to regarding your ETL process leaving the data in an inconsistent state, it is the code you and your developers develop in the packages.

    I would love to hear other scenarios that is not just protecting the database from the code others created because in most cases, ETL process behind dimensionalizing data is really simplistic and likely not going to change to need hard RI. At least, not in my experience because it's such a straight forward process in the DW world it seems. OLTP on the other hand, seems a bit different.

  • likely not going to change to need hard RI

    Will have to disagree. I've seen DW systems that without the hard RI end up with issues. With all of Ralph Kimball's experience, that was his conclusion.

  • henrik staun poulsen (6/24/2016)


    We have several tables above the 10 billion rows mark. Luckily we have the Enterprise Edition, so we have been able to partition the tables.

    That helps maintainability, it does not automatically improve the speed of your code.

    First of all... THANK YOU for pointing out that partitioning isn't a performance panacea. Way too many people have been mislead to thinking that it is and don't understand that it can actually slow things down a bit because each partition essentially has it's own B-Tree to traverse. Hat's off to you, good Sir!

    On the subject of Partitioned Tables, I've learned to loath them (seriously... what the hell was Microsoft thinking when they wrote those?). As you've correctly pointed out, they require the "Expensive Edition" 😛 and that can get quite expensive, indeed. The reason that I prefer the Enterprise Edition is only because of the "online" capabilities for certain index maintenance such a rebuilds (although I've intentionally NOT done any index maintenance on my big databases since 18 Jan 2016 and have suffered no performance problems, but that's a whole 'nuther discussion).

    The reason why I loath Partitioned Tables is because of some major usage faults that a lot of people haven't had to consider or just put up with. For example, if you have some very large temporally-based (by month) audit tables (they necessarily take up 75% of a 600GB database we have), it would be nice to restore, say, only the last 3 months of those tables when doing a "refresh" of Development databases from Prod. That can't actually be done in a straight forward manner using restores even if, like our tables, each month is in a separate file in a separate file group. Oh sure, it'll let you restore the required file groups but you can't actually "RECOVER" the database until ALL of the file groups have been restored. That also means you can't do backups, etc, etc, and, for me, backups in Dev are just as important and backups in Prod. And, no, you can't change the Partition Schema or Partition Function to "ignore" the non-loaded filegroups until you've loaded the very file groups you want to ignore and not load.

    Then there's also the subject of data use. The newer partitions would perform better (especially for inserts) if they were indexed one way and while the older partitions will never be inserted into ever again and would perform better if they were indexed a different way. The only way to do that with Partitioned Tables is to have non-aligned indexes and that defeats nifty functionality such as SWITCH and other things.

    Then there's the idea of doing a real restore for DR. Yeah... it's handy that Partitioned Tables will allow the database to be used while you're loading older, non-critical filegroups to quickly "get back in business" BUT... there's that nasty little problem of not being able to do backups until ALL the filegroups have been restored. For our larger systems, that could actually take an entire day, which is way too long for us to go without Point-in-Time backups.

    It would take too long to explain here but I'm prepping our systems to convert our Partitioned Tables to Partitioned Views. There are methods unique to Partitioned Views that will suffer none of the previously mentioned disadvantages of Partitioned Tables while bringing advantages not possible in Partitioned Tables to bear.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply