Surely one would base the choice of the clustering key for a clustered index on the needs of the most critical queries; the goal being to cluster on the key(s) that will gather together data that is most likely to be queried together, and to define the natural order of the data. However, in SQL Server, this advice seems to be heavily tempered by the need to consider index implementation issues, such as minimizing page splits, fragmentation and so on.
There is much sound advice on MSDN and elsewhere suggesting that every table should have a clustered index, and that narrow, ever-increasing integer columns, such as an IDENTITY column or some other naturally increasing integer, make the best candidates for the clustering key.
If you use an ever-increasing value in the clustered key column, each INSERT will be stored in the logical order dictated by the clustering key. Rows will therefore always be added to the end, making the scanning of the index faster and more efficient. Otherwise, data will be inserted randomly over the pages, leading to increased page splitting and fragmentation, and consequently to higher IO. A narrow clustering key such as an IDENTITY column also minimizes the cost of using the clustered key as the row locator in any secondary, non-clustered indexes.
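The conventional pattern can be sketched as follows; the table and column names here are hypothetical, invented purely for illustration:

```sql
-- A minimal sketch of the conventional advice: a hypothetical Orders table
-- clustered on a narrow, ever-increasing IDENTITY column.
CREATE TABLE dbo.Orders
(
    OrderID    INT IDENTITY(1,1) NOT NULL,
    CustomerID INT           NOT NULL,
    OrderDate  DATETIME2(0)  NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);
-- Every INSERT lands at the logical end of the index, so page splits are rare,
-- and the narrow 4-byte OrderID is the row locator carried in every
-- non-clustered index on the table.
```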
This advice comes from the viewpoint that the clustered index is there primarily to 'organize the table' from the point of view of the index itself, and its health, rather than the queries that run against it; the latter being best served by the non-clustered, covering, indexes. It's a viewpoint I understand, but find a little troubling.
If you create a clustering key on an IDENTITY column, all you're doing is ensuring that the next row to arrive is logically stored next to the previous one; in other words, you're organizing your data by time of arrival. This is nicknamed the 'sedimentary' approach. In the absence of deletes and updates, this is the same behaviour you'll get from a heap. If organizing your data by time-of-arrival is optimal for your critical queries, then all is well. However, if it is not, then one could argue that it hasn't really organized the data at all.
If you do have a lot of deletions, it's true that clustering on an IDENTITY-style key will minimize fragmentation as compared to using a more random key, but you will still get some: because new rows always go to the logical end of the index, the gaps left by deleted rows are never reused, so you will still need to rebuild the index on a regular basis.
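The periodic maintenance this implies might look something like the following sketch, again using the hypothetical dbo.Orders table and its PK_Orders index:

```sql
-- Check fragmentation and page density for the indexes on dbo.Orders.
-- SAMPLED mode is needed to populate avg_page_space_used_in_percent.
SELECT index_id,
       avg_fragmentation_in_percent,
       avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'SAMPLED');

-- A common rule of thumb: reorganize above roughly 10% fragmentation,
-- rebuild above roughly 30%.
ALTER INDEX PK_Orders ON dbo.Orders REBUILD;
```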
Ultimately, the primary objective of a clustered index is, or should be, to organize the data in such a way that the values that are likely to be queried together are stored together. It imposes an order on the data that corresponds to the natural way that we sequence and arrange it. Some of the advice that I read leads me towards the conclusion that we are introducing a best practice for clustered indexes in SQL Server that merely compensates for an implementation problem, rather than reflecting the requirements of the data within the application.
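As a hedged sketch of the query-driven alternative, suppose the critical queries fetch all of a customer's orders over a date range. One could then cluster the hypothetical dbo.Orders table on (CustomerID, OrderDate) instead, appending OrderID to make the key unique, and keep the primary key as a non-clustered index:

```sql
-- Cluster by the access pattern, not by time of arrival: rows that are
-- queried together (one customer's orders, in date order) are stored together.
CREATE TABLE dbo.Orders
(
    OrderID    INT IDENTITY(1,1) NOT NULL,
    CustomerID INT           NOT NULL,
    OrderDate  DATETIME2(0)  NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID)
);
CREATE UNIQUE CLUSTERED INDEX CX_Orders_Customer_Date
    ON dbo.Orders (CustomerID, OrderDate, OrderID);

-- A critical range query now reads a contiguous run of pages:
SELECT OrderID, OrderDate, Amount
FROM dbo.Orders
WHERE CustomerID = 42
  AND OrderDate >= '20120101' AND OrderDate < '20120201';
```

The trade-off, of course, is that inserts no longer arrive in clustered-key order, so this design accepts more page splits in exchange for cheaper critical reads.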
I'd love to hear from the Database Weekly community on this issue. How do you create your clustered indexes? How often do you rely on an IDENTITY column for the clustering key?