The Clustered Index is not the Primary Key

  • Comments posted to this topic are about the item The Clustered Index is not the Primary Key

  • A Primary Key should not necessarily need to be clustered.

    Few developers, and not all the dbas I have worked with actually know what a clustered index is. So for some this article would be a Post-Doctorate dissertation.

    The issue I always have is most developers place an identity column on the table and then make it the primary key and by default clustered. Whenever asked the response is:

    - it to make the row unique,

    - it the way we always do it,

    - it is for performance reasons,

    - its how we were taught at university,

    - but Stackoverflow says this is the best way.

    :crazy:

    I see the road is very,very long.

    I suppose on the bright side its not a heap.

  • Well, heaps do have a slight advantage in certain cases. Because SQL Server (unlike other SQL systems) had a provision for identity columns from day one, it has been a long standing practice to use this provision for a primary key. There are some benefits of a single integer as clustered primary key: little fragementation on insertion, ease of use when keeping a reference to a row, small indexes (both clustered and unclustered), and clear predictable behaviour especially when it is called [ID]. Only after thorough analysis of the actual every day workload one can make better choices for the clustered index, mainly to support common range queries on that table. Far too often other primary key candidates turn out to be not as unique as expected on design; reality seems to have its own twist on the actual data stashed away in this table. Business rules change every month or so, and developers somehow cannot understand that constraints do protect them from putting unwieldy combinations of values into the database. I did learn how to design a database properly both logically and physically based on a full specification of the data storage requirements of an application. But nowadays these requirements start changing right from the start of the first development cycle. Is a clustered index on an identiy column called [ID] always the best choice? Maybe not, but it is nearly always a save choice in the harsh environment of today software development.

  • Yet Another DBA (5/24/2016)


    A Primary Key should not necessarily need to be clustered.

    Few developers, and not all the dbas I have worked with actually know what a clustered index is. So for some this article would be a Post-Doctorate dissertation.

    The issue I always have is most developers place an identity column on the table and then make it the primary key and by default clustered. Whenever asked the response is:

    - it to make the row unique,

    - it the way we always do it,

    - it is for performance reasons,

    - its how we were taught at university,

    - but Stackoverflow says this is the best way.

    :crazy:

    I see the road is very,very long.

    I suppose on the bright side its not a heap.

    I guess I do that in general because I have read Kimberley Tripp. She says to make the clustered index unique, narrow, immutable and preferably ever increasing. I also do it because it is easy when you are a surrogate key fan (I am a surrogate key type because this makes life easy with ORMs). Obviously for smaller secondary entity tables it doesn't matter so I give it no thought.

    The only time I don't tend to do it is when a date field or similar (perhaps a period identifier), perhaps in combination with some other field, is likely to be heavily used. This seems to improve performance on larger primary tables.

    Perhaps a simple philosophy, but sometimes you just have to press on and get the job done!

  • Hi,

    there is one phrase I don't understand in this article:

    http://www.sqlskills.com/blogs/kimberly/ever-increasing-clustering-key-the-clustered-index-debate-again/

    What does this mean?

    If the clustering key is used as the lookup key from all non-clustered indexes

    Lookup key for non-clustered index?

    Thanks,

    Ryszard

  • Hi Ryszard,

    this is where a Select is partly satisfied because the predicates (FROM) are columns within a non clustered index, however some of or all of the columns defined within the SELECT portion of the Query do not exist within the non clustered index.

    A non clustered index, at the leaf level, contains the clustering key (hidden column/s at the end of the defined index).

    Say you have a table tstTable (Col_a int identity(1,1), Col_b char(5), Col_c char(5), Col_d Date)

    Col_a is defined as the clustered Primary Key for the table

    Cold_d is a non clustered key

    If you do a query as follows

    Select Col_b, Col_c from tstTable

    where Col_d = <some date>

    The non clustered index on Col_d satisfies the predicate so all rows equal to a date will be selected from it - but hang on - this index doesn't have the columns defined in the SELECT of the query - but the Col_d index does have the clustering key - so then the required data is extracted from the table by looking up on the clustered index column using ID's retrieved from Col_d's index - and the clustered index is the table so the information can be returned

    I hope that helps and makes sense to you

    Cheers

    Steve

  • richlion2 (5/24/2016)


    Hi,

    there is one phrase I don't understand in this article:

    http://www.sqlskills.com/blogs/kimberly/ever-increasing-clustering-key-the-clustered-index-debate-again/

    What does this mean?

    If the clustering key is used as the lookup key from all non-clustered indexes

    Lookup key for non-clustered index?

    Thanks,

    Ryszard

    The clustered index determines the physical arrangement of rows on disk. Columns of the clustered index are always part of a non-clustered index even if you don't specify them. In queries that have columns that aren't covered by the columns in a non-clustered index (edit: see stephen's example above), you'll see Key Lookups in the plan. Not sure if that's what Kimberley was referring to, but probably not far off.

    Really good discussion of it (with examples) here: https://www.brentozar.com/archive/2015/08/clustered-index-key-columns-in-nonclustered-indexes/

    ____________
    Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.

  • Also sometimes what appears to be the natural primary key, from the point of view of the data model and retrieval use cases, turns out to be the wrong primary key from the point of view of the storage model and how the data is generated.

    A real world example from when I worked for a large bank (who shall remain anonymous) who were creating an online banking platform for their commercial banking customers. Of course this contained a "Transactions" table (unsurprisingly the largest table in the db).

    In terms of the data model, the natural key for the transactions was, account_id, date, transaction_sequence_in_day (henceforth sequence for concision). In terms of the use case, it was pretty clear that the customers using the online banking app would want to retrieve their transactions in relation to a particular account, sorted by date and sequence in the day. So they created the cluster key based on this natural key, account_id, date, sequence, in that order. In testing the retrieval performance was great.

    So, what's the problem?

    The transactions were uploaded from the mainframe every night. So every 24 hours you added a days worth of transactions, for all accounts. Result? Massive fragmentation. To the extent that in production they found they had to rebuild the clustered index 2-3 times a week (!) to maintain acceptable performance. A total nightmare.

    Given the storage model and how the data was uploaded, they really needed to change the order of the natural key to - date, account_id, sequence. Then the fragmentation created by the nightly upload would have disappeared (relatively). They could have added a non-clustered index on account_id if necessary.

    Although, given the wideness of the key, there's an argument to be made that it might have been a good candidate for a surrogate key (internal) based, essentially on upload sequence. Even if you still kept the date, account_id, sequence composite key as the cluster key (with the increased size penalties for non-clustered indexes) such a surrogate key could be handy for partitioning and archiving purposes for future data management.

  • I'm a real fan of surrogate keys (using identities) and use them almost exclusively as both primary key and clustered index. My application's domain allows me to use 4 byte integers for the key, so I get narrow, ever increasing key that keeps fragmentation to a bare minimum. Of course that means (almost) every table has some kind of unique alternate key that's a combination of columns, but the narrowness of the primary key is a real boon when it comes to foreign keys.

    So I don't mind an extra index (or two) per table as an alternate key. 😀

  • vliet (5/24/2016)


    Well, heaps do have a slight advantage in certain cases. Because SQL Server (unlike other SQL systems) had a provision for identity columns from day one, it has been a long standing practice to use this provision for a primary key. There are some benefits of a single integer as clustered primary key: little fragementation on insertion, ease of use when keeping a reference to a row, small indexes (both clustered and unclustered), and clear predictable behaviour especially when it is called [ID]. Only after thorough analysis of the actual every day workload one can make better choices for the clustered index, mainly to support common range queries on that table. Far too often other primary key candidates turn out to be not as unique as expected on design; reality seems to have its own twist on the actual data stashed away in this table. Business rules change every month or so, and developers somehow cannot understand that constraints do protect them from putting unwieldy combinations of values into the database. I did learn how to design a database properly both logically and physically based on a full specification of the data storage requirements of an application. But nowadays these requirements start changing right from the start of the first development cycle. Is a clustered index on an identiy column called [ID] always the best choice? Maybe not, but it is nearly always a save choice in the harsh environment of today software development.

    Very good thoughtful response.

  • stephen.lawrenson (5/24/2016)


    Hi Ryszard,

    ...

    A non clustered index, at the leaf level, contains the clustering key (hidden column/s at the end of the defined index).

    ...

    Additionally, the clustering key is included at the non-leaf levels if the nonclustered index is not defined as UNIQUE.

    See the discussion at http://www.sqlservercentral.com/Forums/Topic1750713-2799-1.aspx and Kimberly Tripp's explanation at http://www.sqlskills.com/blogs/kimberly/nonclustered-indexes-lookup-key-btree/

    Cheers!

  • paul.bowman (5/24/2016)


    Also sometimes what appears to be the natural primary key, from the point of view of the data model and retrieval use cases, turns out to be the wrong primary key from the point of view of the storage model and how the data is generated.

    It seems you are confusing PK and clustered key, which was my point in the piece. The storage, usage/retrieval, have nothing to do with the PK. The PK is about identifying a field(s) that generate and enforce uniqueness.

    The storage, the impact on insert/retrieval, these are cluster/noncluster issues.

  • lshanahan (5/24/2016)


    richlion2 (5/24/2016)


    Hi,

    there is one phrase I don't understand in this article:

    http://www.sqlskills.com/blogs/kimberly/ever-increasing-clustering-key-the-clustered-index-debate-again/

    What does this mean?

    If the clustering key is used as the lookup key from all non-clustered indexes

    Lookup key for non-clustered index?

    Thanks,

    Ryszard

    The clustered index determines the physical arrangement of rows on disk. Columns of the clustered index are always part of a non-clustered index even if you don't specify them. In queries that have columns that aren't covered by the columns in a non-clustered index (edit: see stephen's example above), you'll see Key Lookups in the plan. Not sure if that's what Kimberley was referring to, but probably not far off.

    Really good discussion of it (with examples) here: https://www.brentozar.com/archive/2015/08/clustered-index-key-columns-in-nonclustered-indexes/

    Actually, the clustered index does NOT determine the physical arrangement of rows on disk, it determines the logical arrangement of data on disk. Page splits during insert/update operations will affect what pages contain what data. The logical order will be maintained, the physical order may not.

  • vliet (5/24/2016)


    Well, heaps do have a slight advantage in certain cases. Because SQL Server ...

    I really dislike opening statements like this. You are almost implying that someone reading this should really consider heaps every time, or at least, often.

    Every table should have a clustered index. Start with that. If you find you have a valid, rational, backed up by data reason, then you can use a heap. However, don't consider heaps from the start.

  • Steve Jones - SSC Editor (5/24/2016)


    paul.bowman (5/24/2016)


    Also sometimes what appears to be the natural primary key, from the point of view of the data model and retrieval use cases, turns out to be the wrong primary key from the point of view of the storage model and how the data is generated.

    It seems you are confusing PK and clustered key, which was my point in the piece. The storage, usage/retrieval, have nothing to do with the PK. The PK is about identifying a field(s) that generate and enforce uniqueness.

    The storage, the impact on insert/retrieval, these are cluster/noncluster issues.

    The point is a natural primary key that is a combined key actually doesn't have an order - so, in the abstract, the ordering is a clustering (or indexing) issue, and therefore separate.

    However, given you can't actually instantiate a primary key without a unique index (clustered or non-clustered) it is still possible to instantiate the primary key in the "wrong" order.

    Admittedly if the primary key is non-clustered, you only have to rebuild a non-clustered index 2-3 times a week, instead of the whole cluster.

    My bad for confusing the distinction being made in the OP.

Viewing 15 posts - 1 through 15 (of 36 total)

You must be logged in to reply to this topic. Login to reply