Worst Practices - Not Using Primary Keys and Clustered Indexes

  • Interesting that people don't like identity columns. It takes me back to a case some years ago where I added a seemingly useless identity column to a super-enormous table and made it the clustered index/PK. I took so much krp from people who said that I made the super-enormous table that much larger. They all were insisting that I use the existing "natural" key in the db, a compound of a varchar and a date. To shut up the screamers, I had to make a copy of the entire table for them, doing it their way and demonstrating the CLEAR VALUE OF A THIN (in this case, INT) PK. Most people (including dba's) don't get how db indexes work. I've seen no mention in this thread about the value of a thin PK.

    --Wayne Clifford - Seattle

  • Andy,

    I am curious as to what you would say to this:

    Our ERP vendor doesn't have any primary keys on trans_tables (journal load transaction transfer tables) thus eliminating the possibility of replication but that is another story.

    The trans table is simply a middle man table. It verifies when a flat file comes into the ERP that the flat file columns are the right size and data type via bcp process.

    Then the actual ERP program performs the load from the trans into a table with primary keys.

    The data remains in these trans tables no longer than 10 seconds and no data is ever stored into these tables.

    Would you say the vendor is lazy in db design on these tables or a true example why primary key and a clustered index is not needed?

    JTS

  • There's one thing to be aware of - the database ALWAYS access the row by primary key, even if it doesn't have one. What do I mean? I mean that each row has an implicit identify, either being the offset within the table or the row number. So in cases where no application distinction is required, I say, why?

    What is the benefit?

    Data integrity? This would not come into question for a standalone or temporary table.

    But that said, it's not so bad to always use one either!

  • I don't disagree with you.

    Every primary key can be changed even if you thought It will never change at first.

    The case I remember is user id, social security number, book's bar code.

    Some user wants to change there user id for some reason - hiding from paparachi, not compatible because it's unicode character.

    We korean have social security number that is unique when we born. but some of people share same number.

    Book's bar code can be same even if the book is renewed and republished.

    When you work long time, you can find everything visible, displayable, meaning can be changed.

    I think meaninglesss auto incremental number is only unchangeable, and usable for unique primary key.

  • Why not create a policy for this (must have clustered index and must have primary key) in SQL Server 2008, and really annoy your developer community and the less enlightened.

  • I agree with having a primary key on every table, that is like DB 101 ( or even DB 001).

    However, we use a lot of Uniqueidentifier (MS SQL server) as primary keys, but because of the nature of the Clustered index we do NOT make the primary a clustered index. So we end with no clustered index.

    What are you suggestion for that.

    Ints are easy.. keep the index in the order of the number added... but for guids you don't want to do that... it will kill the index rebuild every time a new record is added.

    Your ideas?

  • Andy -

    First - very good article.

    I pretty much agree on all points, with the exception of relatively small temp tables, or those generated by Inline/Multi-statement table functions - that, to me, seems like crossing the line into overkill.

    My only gripe with your piece, is that I do not think you really hammered home the great performance advantages (to other indexes) to be gained by ALWAYS including primary keys/clustered indexes on each and every persistent table.

    All non-clustered keys use the clustered index to point to the data rows: it really can make those additional indexes more efficient!

    Having clustered indexes built at the get-go, can also speed up rebuilds of/for other indexes; in addition to, saving DBA's on-going maintenance time, effort, and headaches.

    So, clustered indexes are a win-win situation: better query/app performance, better maintenance, better recovery, better service to the organization.

    Thanks, again.

    ~REpicurus

  • MattieNH (5/9/2008)


    Steve,

    Thanks for responding, I probably should have opened a new topic. I'm not arguing about the need for a primary key, and I realize that it has to be unique. (I've read enough posts titled 'how do I remove duplicates?' to convince me of that:)). So the only time I don't have one is in temp tables, and that's usually because I can't figure out what makes an entry unique.

    I was wondering about the operational difference between a unique constraint, and a unique index, and why I might choose to have one rather than the other.

    Mattie

    Mattie,

    I will try to explain.

    A unique constraint is a LOGICAL specification. It tells the database to assert a business rule: that the designated combination of columns is unique. Conceivably, there are many ways a database could implement a unique constraint.

    A unique index is a physical structure. Since it satisfies the requirements of a unique constraint, that's what the database uses to implement the constraint.

    In a practical sense, there are two differences:

    1. A unique index has performance implications that a unique constraint does not. As long as the unique constraint is implemented with a unique index, it's an academic issue, but we have no assurance that a future DB will implement the constraint the same way.

    2. A unique constraint can be referenced by a foreign key. A unique index cannot.

  • The article is really good for newbies. They should remember that primary key is MUST. For them this mistake specially the clustered one is very common.

    🙂

  • larry (5/9/2008)


    There's one thing to be aware of - the database ALWAYS access the row by primary key, even if it doesn't have one.

    In Sql server a table with a primary clustered index is sorts its pages by the clustering index. If I define the clustering index on a field other than my primary key then this field determines the page order not the primary key.

    If I have no clustering index the data is not organised in any order onto data pages and any non clustered index refers directly to data pages rather than index values in the clustering index had I created one. This is not very usefully for most on line transaction processing (OLTP) modes of access because we mostly retrieve data stored in some loose grouping associated to the order it was entered. Or rather our indexes grow at the end rather than the middle and over time and data remains grouped by the index fields it was when it was first inserted. (no need to page split old low level index pages on insert)

    When the majority of rows being inserted would need to be inserted in the middle of an index that is needed to optimise retrieval, (For example a list of the highest grossing movie releases) then putting a identity and clustering index on this list becomes an overhead because the order you enter the movies has nothing to do with the clustering desired for retrieving the top 10. As each movie inserted would end up on a different data page due to its order of insert (identity) anyway why maintain a clustered index ? the non clustered index used to retrieve the data is as likely to point to non-sequential data pages anyway.

  • Dennis,

    Thank you so much for the explanation. As Professor Higgins said about Lize Doolittle, 'by george, I think she's got it'.

    Mattie.

  • Every relation has a PRIMARY KEY and therefore, so must a table if we expect it to represent a relation. The purpose is to assure that each row is distinct and identifiable to the business. Considerations for a good primary key have absolutely nothing to do with performance; not having one is certainly a worst practice.

    Having said that, primary key is a logical constraint that can be implemented different ways in the physical deployment. Among the ways (and most commonly), we use a unique index, usually clustered and often on an identity column.

    The primary purpose of an index is query optimization. Clustering has additional implications with respect to fragmentation and insert optimization. The type of index can make a significant difference in performance, even to the point where an index can have an adverse performance benefit in some cases.

    Worst practices are:

    1. to get in the habit of letting the indexing strategy dictate the primary key

    2. to accept without question the indexing strategy that results by default from declaration of a primary key

    Said another way, declaring a surrogate key, identity column, primary key is a reasonable solution 90% of the time. But, it's hardly a worst practice to do things a differently if the circumstances call for it. Quite the contrary--I think it is a worst practice to do it for every table (something I run into often in the SQL Server community) without question.

  • I totally agree with the fact primary keys and clustered index should be created on a table. Not only do they ensure data integrity (which saves you a lot of trouble in the long run) but they also make the tables easy to maintain.

  • Andy,

    I would appreciate some consideration about this subject when we talk about indexing on Datawarehouse structured databases, where there are large search queries, very few updates, some deletes and a lot of inserts and truncates.

    In this kind of structure it's recomended the deletion and re-creation of indexes. In these cases what would be the advantages and disadvantages of using clustered indexes.

    Thanks in advance

    Martin.

  • Fact tables should have a clustered index on an ever-increasing identity column. That facilitates LOAD performance by eliminating page splits. It also reduces the need for free space, so you can use 100% fill factor and forget pad index. The end-result is a compact table that conserves disk space and minimizes I/O. The identity column needn't be designated as the primary key (surrogate key), but I see no reason not to make it so. FK's to dimension tables should also be indexed to optimize query performance. Since ad-hoc queries are the nature of the beast, generally most, if not all, dimension FK's should be indexed.

    Dimension tables should definitely have a surrogate key and an identity column is a good fit for that. In addition, the natural key should most certainly be indexed. I don't have a one-size-fits all argument about which of those indexes to cluster; it depends on a case-by-case analysis. In cases where the dimension is fairly static, clustering the the natural key index makes the most sense. The more inserts a dimension experiences, however, the more sense it makes to cluster the surrogate key.

Viewing 15 posts - 166 through 180 (of 184 total)

You must be logged in to reply to this topic. Login to reply