SQL Clone
SQLServerCentral is supported by Redgate
Log in  ::  Register  ::  Not logged in

Cluster That Index! Part Two

By Christoffer Hedgate,


A while back I wrote an article about clustered indexes (Cluster that index!), where I stated that in my opinion all tables should have a clustered index defined for them. Since I wanted to keep the focus specifically on the advantages and disadvantages of clustered indexes I intentionally left some information and discussion regarding indexes out of the article. In this article I would like to elaborate on these matters as well as responding to some comments by you readers to the previous article. If you haven't read that article I recommend you to do that prior to reading this one as much of this information relates to that of the other article.

Non-clustered indexes are not bad in any way!

First of all I want to clear any confusion regarding non-clustered indexes. I am not in any way saying they are bad or should not be used. What I am saying is that I think every table should have a clustered index, so if you're only going to have one index on a table it should be a clustered one. The focus on the prior article was on showing the differences between tables that have a clustered index and those that don't (heap tables, or simply heaps). In this article I will describe some more details regarding indexes and also try to give some advice on where to use what index type.

Indexes and modifications

One question I was asked was how indexes and their effect on modifications (INSERT, UPDATE and DELETE) fits in with the clustered/non-clustered pros and cons. First of all, I don't really see this as something that should define whether a table should have a clustered index or not. Consider table T1 below, where we have a column id that we'll probably want to define as primary key, and a second column name which is probably the argument used when searching this table.  
  id int identity(1,1) NOT NULL   , name varchar(25) NOT NULL)  
Even if id is sometimes specified as search argument (it might be an employee id for instance), it will probably not be used for range searches, i.e. WHERE id BETWEEN 1 AND 10. The second column, name, will be much more likely to do range searches with. Remember that a right-truncated search (using % as a suffix in a LIKE clause) is a range search, i.e. WHERE name LIKE 'a%'. In the prior article I showed that clustered indexes really excel in range queries, while they won't have any different effect (neither positive nor negative) than a non-clustered index for 'normal' queries. So, we can easily see that name is a good candidate for a clustered index, and id would not be a good choice. Remember however that id is the primary key of the table, and when you define a primary key for a table SQL Server will automatically create an index for that key. By default SQL Server will create a clustered index for a primary key, but you can specify explicitly that you want the index to be a non-clustered index. But as I showed in the prior article all tables should have a clustered index defined for them as that effectively changes the way the table is physically stored. As I described, since heap tables can suffer from some very performance heavy disadvantages, the biggest advantage of having a clustered index was not that it was great for range searches but rather the disadvantages avoided by having it.
So, to get back to the question, how does clustered/non-clustered indexes fit in with indexes and their disadvantages on modifications? As I said earlier, I don't think it is relevant at all. If you have a table that is heavily modified more or less all the time and you therefore don't want to have too many indexes on it, then let your the id column (which is the table's primary key) have a clustered index and be done with that. You probably won't be able to use the nice features of range searches with it, but at least you'll avoid the troubles of heap tables. However, since I work not only as a DBA but also as a search engine developer I tend to favor quick responses when searching, and therefore I like to index not only keys but also columns used in search arguments. And that's my recommendation, if you are going to have just a single index, make it a clustered index, but if you're going to have several indexes you should think hard about which one to choose as a clustered index. But always have a clustered index on every table.


Another question I got was how much disk space a clustered index takes up compared to the table it is indexing, and if this extra disk space is a disadvantage of clustered indexes. Before I answer this, let me just say that if I had to choose between the extra disk space a clustered index takes or leaving a table as a heap table I would probably choose a clustered index (with the extra disk space it takes) in 99% of all situations. Of course, every choice you make regarding performance and configuration should always be thought through and well tested, but disk space is so cheap today that I think it would be very unusual if you had to make that choice.
It is still interesting to calculate the size of an index, and as I said in some situations you might have to take this into consideration. Actually, a clustered index does not take that much extra disk space. First of all, remember that the leaves of the clustered index is the actual data, sorted in index order, so even if you drop the clustered index these pages will be (more or less) the same. The extra space usage instead comes from the levels above the leaf level in the index B-tree. Exactly how much space these extra levels use depends on the amount of data in the table and the size of the indexing key (the column(s) that the index is defined on). Each data page, and remember that in a clustered index this is the same as the leaf level pages of the index, requires one index row in the index page on the level directly above it. To calculate the number of index pages needed for the level directly above the leaf level we use the following formula:
  P / (8096 / K)
where P is the number of leaf level pages and K is the size (in bytes) of the index key. 8096 is the maximum amount of bytes that can be stored per index page. If this result is more than 1 we take this number and use it as P in the same formula to find out the number of index pages in the next level. When we end up with just one page we've found the root level. In Inside SQL Server 2000 Kalen Delaney shows an example of this. For a table with 10.000 data pages (each page containing 8 KB data) there is a clustered index defined on a fixed-length character column of bytes (char(5)). Each index key row in this index use 12 bytes (key size + overhead). This means that 15 index pages (10000 / (8096 / 12) = 15) is needed at the level directly above the leaf level. The level above this will consist of a single page, the root level. The extra size used by adding this clustered index is therefore 16/10000, i.e. less than 1%. As you can see, a clustered index doesn't really use very much extra space. 1% is normally a good estimate that you can use if you don't want to calculate it exactly, it will even be a bit high (as in the example). As always it is of course necessary to define the index key as small as possible.


One thing that I intentionally left out of the prior article was a description of fill factor and pad index, in order to keep focus on the problems of not having a clustered index on a table. One thing I did mention was that one slight disadvantage you might run into by having a clustered index on a table are the page splits that can occur when adding data (or sometimes when changing existing data) to the table. This is a part from that article:
"Because the data is stored in the order of the index, to insert a new row SQL Server must find the page with the two rows between which the new row shall be placed. Then, if there is not room to fit the row on that page, a split occurs and some of the rows get moved from this page to a newly created one."
As some of you readers commented, one way to counter this (at least to a degree) is to specify a value for the fill factor and possibly pad index for the clustered index. Fill factor specifies the percentage that SQL Server should fill every data page to when creating an index. Specifying a fill factor value of 50 therefore means that each data page will only be half full, in effect meaning that the storage space needed for the table is doubled (since there are twice as many data pages). The good thing is that page splits are not as necessary as with a higher fill factor, since there will usually be room on a page to insert a new row without splitting the page into two and moving rows between them. If fill factor is specified you can also specify the option to use pad index (it uses the same value as fill factor), which works in the same way as fill factor only it is used for index pages in the intermediate levels (between the root level and leaf level).
Note that both options are only used when creating or rebuilding an index. They are not maintained after initial 'filling' by SQL Server, as that would defeat the original purpose to use them, i.e. avoiding page splits (as SQL Server would have to use page splits to maintain them). Also note that these options apply for non-clustered indexes as well, the index pages of non-clustered indexes use page splits in the same way as clustered indexes of course. As always, think carefully and test the effects of changing the fill factor and pad index values.
Total article views: 9765 | Views in the last 30 days: 16
Related Articles

Clustered Indexes: Stairway to SQL Server Indexes Level 3

Now that we've seen the basics of indexing, and taken a deeper dive into Nonclustered Indexes, this ...


clustered index

clustered index


Stairway to Columnstore Indexes Level 12: Clustered or Nonclustered?

The previous levels of this stairway describe details, features, and limitations of columnstore inde...


Cluster Re-Index

Re-Index Cluster


Indexes for Booking Search

Indexes for Booking Search

sql server 7