No PKs on fact tables

  • I just inherited a data warehouse where none of the fact tables have a primary key (all fact tables are heaps).  The creator of the database is long gone, and there is no documentation to justify this decision.  I'm just doing forensics, looking at the schema.

    Has anyone ever seen this before?  I can't think of a good reason to do this.  Can you?

    By default I've always created a clustered primary key when I create a table.  Yes, I realize there are performance scenarios where the clustered index might not be the primary key.  I also know that many folks use heaps for their staging tables (though I typically add an identity clustered PK to these types of tables).  I'm puzzled.

    Thanks,
    Rob

  • You can have a HEAP with a PK because a PK does not have to be a clustered index.

    Shifting gears, yes... I've seen the problem that you describe many times.  It's normally because the designer had little clue about databases.  I will state, however, that HEAPs can be faster than clustered tables if everything works out perfectly.  They normally don't work out perfectly for very long if the table rows are ever expanded by an UPDATE.

    I've also had some folks say that they didn't add a clustered index just in case they ever need to port the database to a different database engine.  Total rubbish, IMHO.
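    To make the first point above concrete, here's a minimal T-SQL sketch (table and column names are made up for illustration) showing that a PK declared NONCLUSTERED leaves the table a heap:

```sql
-- A PK does not force a clustered index: declaring it NONCLUSTERED
-- leaves the table as a heap while still enforcing uniqueness.
CREATE TABLE dbo.FactSales
(
    SaleID     INT IDENTITY(1,1) NOT NULL,
    ProductKey INT NOT NULL,
    DateKey    INT NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL,
    CONSTRAINT PK_FactSales
        PRIMARY KEY NONCLUSTERED (SaleID)
);

-- index_id 0 (HEAP) confirms the base table has no clustered index;
-- the PK appears separately as a NONCLUSTERED row.
SELECT index_id, type_desc
FROM   sys.indexes
WHERE  object_id = OBJECT_ID('dbo.FactSales');
```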

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I have also seen this before, unfortunately.  At a previous job I had to deal with an architect who said keys on tables weren't necessary.  No, there is no good reason for that.  That said, don't make the PK the clustered index as a knee-jerk reaction.  Especially on fact tables, I may use the create date as the clustered index.  You may also have a fact table where you need a distinct count.  That's not common, but when it happens, you will want to cluster on the column that needs the distinct count or there will be a performance hit.
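    As one sketch of that approach (names are illustrative, not from the original post), the PK can be kept nonclustered so the create date is free to dictate physical order:

```sql
-- Clustered index on the load/create date; the PK stays NONCLUSTERED
-- so it doesn't dictate physical row order. The clustered index is
-- non-unique, so SQL Server adds a hidden uniquifier where needed.
CREATE TABLE dbo.FactShipments
(
    ShipmentID INT IDENTITY(1,1) NOT NULL,
    CreateDate DATETIME2(0) NOT NULL,
    ProductKey INT NOT NULL,
    Quantity   INT NOT NULL,
    CONSTRAINT PK_FactShipments PRIMARY KEY NONCLUSTERED (ShipmentID)
);

CREATE CLUSTERED INDEX CIX_FactShipments_CreateDate
    ON dbo.FactShipments (CreateDate);
```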

  • Jeff and Ron, thank you for your feedback.  This is something I'll have to correct over time.  As I get to know the usage better, I'll be able to determine what good clustered indexes are for each fact table.

    Thanks,
    Rob

  • robert.gerald.taylor - Monday, April 3, 2017 7:26 AM

    As I get to know the usage better, I'll be able to determine what good clustered indexes are for each fact table.

    No pun intended but that's definitely the key.

    Shifting gears a bit, and quite contrary to what most people will advise, having the clustered index on the right meaningful data column(s) is not a performance panacea, especially if the table has any substantial width to it.

    For example, we have some crazy-wide and very long tables at work, and someone decided they were going to move the clustered index away from the IDENTITY column to a column that most query criteria referenced.  First, it destroyed INSERT performance, because the CI key was no longer in ever-increasing order and page splits on the very wide table went nuts.  Second, performance of those most common queries actually got worse once the new CI came into play, because the optimizer had to load the entire width of the table rather than just what was needed.  In other cases, the optimizer recognized that problem and decided to scan the second column of many of the nonclustered indexes, because that was cheaper and faster than doing the seek and range scan the CI offered.

    --Jeff Moden



  • First, don't accept at face value anyone's claim that "duplicates are handled by the ETL process".  I've found that data warehouses in the wild are more prone to inadvertent duplicates than transactional line-of-business databases.  If you don't have unique keys on a DW table, there WILL be duplicates; it's inevitable.  If nothing else, a primary key is a way for the fact table to explicitly declare a single version of the truth regarding its internal logical uniqueness.  That's useful, because you'll find folks within IT and the business who disagree over basic questions like "How is this table logically unique?" even after the project has been deployed to production.  As a Data Architect or DBA you'll need to play the role of a supreme court judge and reconcile all the divergent opinions.

    Typically, the primary key on a fact table is a subset (not necessarily the complete set) of the dimensional foreign keys.  You'll want at least a unique index on that.  Whether you want that index clustered depends.  If it is clustered, you'll want to carefully ponder the logical order of the keys.  From a physical-modeling / performance-optimization perspective, I'd suggest by default ordering the clustered index keys so that they follow the order in which rows are originally inserted, because that helps minimize page splits and fragmentation.  But keeping the table as a heap with a nonclustered PK is also a valid approach, especially if the fact table needs 10+ foreign keys to make each row unique.
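    A minimal sketch of that pattern (all names here are invented for illustration): the PK is a subset of the dimension FKs, enforced nonclustered so the table stays a heap:

```sql
-- Fact table whose PK is a subset of its dimension foreign keys,
-- enforced with a NONCLUSTERED constraint on a heap.
-- DateKey leads the key, matching typical load order.
CREATE TABLE dbo.FactInventory
(
    DateKey        INT NOT NULL,
    StoreKey       INT NOT NULL,
    ProductKey     INT NOT NULL,
    QuantityOnHand INT NOT NULL,
    CONSTRAINT PK_FactInventory
        PRIMARY KEY NONCLUSTERED (DateKey, StoreKey, ProductKey),
    CONSTRAINT FK_FactInventory_Date
        FOREIGN KEY (DateKey)    REFERENCES dbo.DimDate (DateKey),
    CONSTRAINT FK_FactInventory_Store
        FOREIGN KEY (StoreKey)   REFERENCES dbo.DimStore (StoreKey),
    CONSTRAINT FK_FactInventory_Product
        FOREIGN KEY (ProductKey) REFERENCES dbo.DimProduct (ProductKey)
);
```

If the constraint were later converted to clustered, leading with DateKey keeps inserts in roughly ever-increasing key order, which is the page-split argument made above.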

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • robert.gerald.taylor - Friday, March 31, 2017 6:52 AM

    I just inherited a data warehouse where none of the fact tables have a primary key (all fact tables are heaps).  The creator of the database is long gone, and there is no documentation to justify this decision.  I'm just doing forensics, looking at the schema.

    Has anyone ever seen this before?  I can't think of a good reason to do this.  Can you?

    By default I've always created a clustered primary key when I create a table.  Yes, I realize there are performance scenarios where the clustered index might not be the primary key.  I also know that many folks use heaps for their staging tables (though I typically add an identity clustered PK to these types of tables).  I'm puzzled.

    Thanks,
    Rob

    Welcome to the real world, where people did not learn RDBMS!  And there is no good reason for this kind of crap other than ignorance.  By definition, a key is a subset of attributes of an entity which is unique.  This is straight textbook RDBMS.  It has nothing to do with clustered indexes, hashing, or any other yet-undiscovered access method.  Start with the logical model and make it logically valid.  After that, read about implementation.  Back in the old days of structured programming, we used to say: get it correct, then make it fast.

    Our rule in the old days was that you clustered on an attribute where the ordering mattered for performance.  For example, if most of my reports were done on the basis of states, then I would cluster on the state code column in the table.  This would let the GROUP BY and ORDER BY operators perform better.  But this is because most of us worked in products that used indexing.  If you worked with a product (Teradata) that used hashing, this heuristic did not apply.  In fact, in that environment a bigger key could give better performance.

    Let me make some really gross generalizations.  If you have a primary key that is a generated sequence (check numbers, invoice numbers, etc.), then a clustered key might improve performance, because presumably the generation sequence is also temporal and things will be resolved in the order received by the system.  However, if you have something like automobiles, where you have no idea how the VINs will arrive in the system, then clustering on them gains you no advantage, and frankly, you're probably going to look at the most recent entries ("What car did we just buy?") far more often than older stuff ("Which automobiles were we using in 1945?").

    The habit described of always creating a clustered primary key index is really bad.  It says you're confusing your logical model with your physical implementation.  Think about how the key in each table is actually used, and what it means in your data model.  Otherwise you're going to fall into the trap your predecessors did: seeing the world as a uniform creature with only one data model, based on a presumptive physical model of what the data is and what it means.  Essentially, you're falling into the punch-card trap.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Monday, April 3, 2017 12:55 PM

    robert.gerald.taylor - Friday, March 31, 2017 6:52 AM

    I just inherited a data warehouse where none of the fact tables have a primary key (all fact tables are heaps).  The creator of the database is long gone, and there is no documentation to justify this decision.  I'm just doing forensics, looking at the schema.

    Has anyone ever seen this before?  I can't think of a good reason to do this.  Can you?

    By default I've always created a clustered primary key when I create a table.  Yes, I realize there are performance scenarios where the clustered index might not be the primary key.  I also know that many folks use heaps for their staging tables (though I typically add an identity clustered PK to these types of tables).  I'm puzzled.

    Thanks,
    Rob

    Welcome to the real world, where people did not learn RDBMS!  And there is no good reason for this kind of crap other than ignorance.  By definition, a key is a subset of attributes of an entity which is unique.  This is straight textbook RDBMS.  It has nothing to do with clustered indexes, hashing, or any other yet-undiscovered access method.  Start with the logical model and make it logically valid.  After that, read about implementation.  Back in the old days of structured programming, we used to say: get it correct, then make it fast.

    Our rule in the old days was that you clustered on an attribute where the ordering mattered for performance.  For example, if most of my reports were done on the basis of states, then I would cluster on the state code column in the table.  This would let the GROUP BY and ORDER BY operators perform better.  But this is because most of us worked in products that used indexing.  If you worked with a product (Teradata) that used hashing, this heuristic did not apply.  In fact, in that environment a bigger key could give better performance.

    Let me make some really gross generalizations.  If you have a primary key that is a generated sequence (check numbers, invoice numbers, etc.), then a clustered key might improve performance, because presumably the generation sequence is also temporal and things will be resolved in the order received by the system.  However, if you have something like automobiles, where you have no idea how the VINs will arrive in the system, then clustering on them gains you no advantage, and frankly, you're probably going to look at the most recent entries ("What car did we just buy?") far more often than older stuff ("Which automobiles were we using in 1945?").

    The habit described of always creating a clustered primary key index is really bad.  It says you're confusing your logical model with your physical implementation.  Think about how the key in each table is actually used, and what it means in your data model.  Otherwise you're going to fall into the trap your predecessors did: seeing the world as a uniform creature with only one data model, based on a presumptive physical model of what the data is and what it means.  Essentially, you're falling into the punch-card trap.

    Actually, this question didn't need any DDL, nor did it need to follow any ANSI/ISO standards.

    By the way, when did ANSI/ISO issue any standards to use when asking for help?

  • FYI, we recently discovered a heap table, with no indexes at all, consuming 56 GB while holding only about 40,000 rows of 64 bytes each.
    (SQL 2014 SP2 CU3)
    After ALTER TABLE x REBUILD, consumption dropped back to 6 MB.
    Of course we then shrunk the file holding that table to a more reasonable size 🙂

    Load patterns also have an impact on performance.
    Maintenance can prevent a ton of sh*t.
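    For anyone who wants to hunt for the same problem, here's a sketch (the table name `x` follows the example above; everything else is illustrative) that compares allocated space to row count and then reclaims it:

```sql
-- Compare reserved space vs. row count for a suspect heap.
-- A huge reserved_mb with few small rows suggests unreclaimed pages.
SELECT  t.name,
        SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb,
        SUM(ps.row_count)                      AS row_count
FROM    sys.dm_db_partition_stats AS ps
JOIN    sys.tables AS t
        ON t.object_id = ps.object_id
WHERE   t.name = 'x'
GROUP BY t.name;

-- Rebuilding the heap deallocates empty pages and removes
-- forwarded records left behind by updates.
ALTER TABLE dbo.x REBUILD;
```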

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • ALZDBA - Tuesday, April 4, 2017 5:40 AM

    FYI, we recently discovered a heap table, with no indexes at all, consuming 56 GB while holding only about 40,000 rows of 64 bytes each.
    (SQL 2014 SP2 CU3)
    After ALTER TABLE x REBUILD, consumption dropped back to 6 MB.
    Of course we then shrunk the file holding that table to a more reasonable size 🙂

    Load patterns also have an impact on performance.
    Maintenance can prevent a ton of sh*t.

    Yes, that happens because heaps don't reliably release space freed by deletes and updates; emptied pages often stay allocated to the table, new rows keep getting appended, and a rebuild is the practical way to reclaim the space.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • So how would you reconcile the idea that a unique index on a fact table is non-negotiable with a product like Azure SQL Data Warehouse, where there are no constraints?
    Columnar structures don't compress well with unique values for each row.
    I try to cluster records on the date the data is extracted from the source system and avoid updates on fact tables as much as possible.  Better to model them as transaction events, where if fact A changes, fact AA inserts the required delta.

    [edited for typo]

  • daniel.bartley 78268 - Tuesday, April 4, 2017 11:06 PM

    So how would you reconcile the idea that a unique index on a fact table is non-negotiable with a product like Azure SQL Data Warehouse, where there are no constraints?
    Columnar structures don't compress well with unique values for each row.
    I try to cluster records on the date the data is extracted from the source system and avoid updates on fact tables as much as possible.  Better to model them as transaction events, where if fact A changes, fact AA inserts the required delta.

    SQL Server has been incrementally evolving the feature completeness of ColumnStore since it was originally introduced in v2012.  Starting with v2016 (and maybe v2014 too), you can now define unique nonclustered indexes and foreign keys on a Clustered ColumnStore table.

    https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-data-warehouse

    Each column is compressed separately, so the compression ratio will vary depending on its cardinality, but even a surrogate key with unique values in each row should compress significantly if it's sequentially incrementing.
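    A sketch of what that looks like in SQL Server 2016+ syntax (table and index names are invented): a clustered columnstore fact table with uniqueness enforced by a nonclustered B-tree constraint:

```sql
-- Clustered columnstore for compression and scan performance,
-- plus a nonclustered UNIQUE constraint to enforce the logical key.
CREATE TABLE dbo.FactOrders
(
    OrderKey    BIGINT NOT NULL,
    DateKey     INT NOT NULL,
    CustomerKey INT NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL,
    INDEX CCI_FactOrders CLUSTERED COLUMNSTORE,
    CONSTRAINT UQ_FactOrders UNIQUE NONCLUSTERED (OrderKey)
);
```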

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Late to the thread here.  I can give you a scenario where I have no PK.  I like being the black sheep.

    I have had large tables where 15+ columns are needed to ensure row-level uniqueness and updates rarely happen.  The ETL fully controls the row-level uniqueness of the table and has yet to fail.  Could primary keys still help?  Sure, but that would also be admitting the ETL could fail.  I have yet to see that happen in my case, especially when the only way to get data into the table is through ETL.

  • jcelko212 32090 - Monday, April 3, 2017 12:55 PM

    Welcome to the real world, where people did not learn RDBMS!  And there is no good reason for this kind of crap other than ignorance.  By definition, a key is a subset of attributes of an entity which is unique.  This is straight textbook RDBMS.  It has nothing to do with clustered indexes, hashing, or any other yet-undiscovered access method.  Start with the logical model and make it logically valid.  After that, read about implementation.  Back in the old days of structured programming, we used to say: get it correct, then make it fast.

    Our rule in the old days was that you clustered on an attribute where the ordering mattered for performance.  For example, if most of my reports were done on the basis of states, then I would cluster on the state code column in the table.  This would let the GROUP BY and ORDER BY operators perform better.  But this is because most of us worked in products that used indexing.  If you worked with a product (Teradata) that used hashing, this heuristic did not apply.  In fact, in that environment a bigger key could give better performance.

    Let me make some really gross generalizations.  If you have a primary key that is a generated sequence (check numbers, invoice numbers, etc.), then a clustered key might improve performance, because presumably the generation sequence is also temporal and things will be resolved in the order received by the system.  However, if you have something like automobiles, where you have no idea how the VINs will arrive in the system, then clustering on them gains you no advantage, and frankly, you're probably going to look at the most recent entries ("What car did we just buy?") far more often than older stuff ("Which automobiles were we using in 1945?").

    The habit described of always creating a clustered primary key index is really bad.  It says you're confusing your logical model with your physical implementation.  Think about how the key in each table is actually used, and what it means in your data model.  Otherwise you're going to fall into the trap your predecessors did: seeing the world as a uniform creature with only one data model, based on a presumptive physical model of what the data is and what it means.  Essentially, you're falling into the punch-card trap.

    I absolutely agree with all of that. Considering the partial statement that I've set in Bold above, what would you recommend the key for a "Personnel" table be for something like a National Bank?

    --Jeff Moden



  • Jeff Moden - Thursday, April 13, 2017 6:47 AM

    jcelko212 32090 - Monday, April 3, 2017 12:55 PM

     what would you recommend the key for a "Personnel" table be for something like a National Bank?

    Having done something like this in the financial industry years ago, I found that going into the accounting department and getting the license numbers for the personnel was the best approach.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

Viewing 15 posts - 1 through 15 (of 41 total)
