Cluster That Index! Part Two



A while back I wrote an article about clustered indexes (Cluster

that index!), where I stated that in my opinion all tables should have a

clustered index defined for them. Since I wanted to keep the focus specifically

on the advantages and disadvantages of clustered indexes I intentionally left

some information and discussion regarding indexes out of the article. In this

article I would like to elaborate on these matters as well as responding to some

comments by you readers to the previous article. If you haven't read that

article I recommend you to do that prior to reading this one as much of this

information relates to that of the other article.

Non-clustered indexes are not bad in any way!

First of all I want to clear any confusion regarding non-clustered indexes. I am

not in any way saying they are bad or should not be used. What I am saying is

that I think every table should have a clustered index, so if you're only going

to have one index on a table it should be a clustered one. The focus on the

prior article was on showing the differences between tables that have a

clustered index and those that don't (heap tables, or simply heaps). In this

article I will describe some more details regarding indexes and also try to give

some advice on where to use what index type.

Indexes and modifications

One question I was asked was how indexes and their effect on modifications

(INSERT, UPDATE and DELETE) fits in with the clustered/non-clustered pros and

cons. First of all, I don't really see this as something that should define

whether a table should have a clustered index or not. Consider table T1 below,

where we have a column id that we'll probably want to define as primary

key, and a second column name which is probably the argument used when

searching this table.


  id int identity(1,1) NOT NULL   , name varchar(25) NOT NULL)  

Even if id is sometimes specified as search argument (it might be an employee id

for instance), it will probably not be used for range searches, i.e. WHERE id

BETWEEN 1 AND 10. The second column, name, will be much more likely to do range

searches with. Remember that a right-truncated search (using % as a suffix in a

LIKE clause) is a range search, i.e. WHERE name LIKE 'a%'. In the prior article

I showed that clustered indexes really excel in range queries, while they won't

have any different effect (neither positive nor negative) than a non-clustered

index for 'normal' queries. So, we can easily see that name is a good candidate

for a clustered index, and id would not be a good choice. Remember however that

id is the primary key of the table, and when you define a primary key for a

table SQL Server will automatically create an index for that key. By default SQL

Server will create a clustered index for a primary key, but you can specify

explicitly that you want the index to be a non-clustered index. But as I showed

in the prior article all tables should have a clustered index defined for them

as that effectively changes the way the table is physically stored. As I

described, since heap tables can suffer from some very performance heavy

disadvantages, the biggest advantage of having a clustered index was not that it

was great for range searches but rather the disadvantages avoided by having it.


So, to get back to the question, how does clustered/non-clustered indexes fit in

with indexes and their disadvantages on modifications? As I said earlier, I

don't think it is relevant at all. If you have a table that is heavily modified

more or less all the time and you therefore don't want to have too many indexes

on it, then let your the id column (which is the table's primary key) have a

clustered index and be done with that. You probably won't be able to use the

nice features of range searches with it, but at least you'll avoid the troubles

of heap tables. However, since I work not only as a DBA but also as a search

engine developer I tend to favor quick responses when searching, and therefore I

like to index not only keys but also columns used in search arguments. And

that's my recommendation, if you are going to have just a single index, make it

a clustered index, but if you're going to have several indexes you should think

hard about which one to choose as a clustered index. But always have a clustered

index on every table.


Another question I got was how much disk space a clustered index takes up

compared to the table it is indexing, and if this extra disk space is a

disadvantage of clustered indexes. Before I answer this, let me just say that if

I had to choose between the extra disk space a clustered index takes or leaving

a table as a heap table I would probably choose a clustered index (with the

extra disk space it takes) in 99% of all situations. Of course, every choice you

make regarding performance and configuration should always be thought through

and well tested, but disk space is so cheap today that I think it would be very

unusual if you had to make that choice.


It is still interesting to calculate the size of an index, and as I said in some

situations you might have to take this into consideration. Actually, a clustered

index does not take that much extra disk space. First of all, remember that the

leaves of the clustered index is the actual data, sorted in index order, so even

if you drop the clustered index these pages will be (more or less) the same. The

extra space usage instead comes from the levels above the leaf level in the

index B-tree. Exactly how much space these extra levels use depends on the

amount of data in the table and the size of the indexing key (the column(s) that

the index is defined on). Each data page, and remember that in a clustered index

this is the same as the leaf level pages of the index, requires one index row in

the index page on the level directly above it. To calculate the number of index

pages needed for the level directly above the leaf level we use the following



  P / (8096 / K)


where P is the number of leaf level pages and K is the size (in bytes) of the

index key. 8096 is the maximum amount of bytes that can be stored per index

page. If this result is more than 1 we take this number and use it as P in the

same formula to find out the number of index pages in the next level. When we

end up with just one page we've found the root level. In Inside SQL Server

2000 Kalen Delaney shows an example of this. For a table with 10.000 data

pages (each page containing 8 KB data) there is a clustered index defined on a

fixed-length character column of bytes (char(5)). Each index key row in this

index use 12 bytes (key size + overhead). This means that 15 index pages (10000

/ (8096 / 12) = 15) is needed at the level directly above the leaf level. The

level above this will consist of a single page, the root level. The extra size

used by adding this clustered index is therefore 16/10000, i.e. less than 1%. As

you can see, a clustered index doesn't really use very much extra space. 1% is

normally a good estimate that you can use if you don't want to calculate it

exactly, it will even be a bit high (as in the example). As always it is of

course necessary to define the index key as small as possible.


One thing that I intentionally left out of the prior article was a description

of fill factor and pad index, in order to keep focus on the problems of not

having a clustered index on a table. One thing I did mention was that one slight

disadvantage you might run into by having a clustered index on a table are the

page splits that can occur when adding data (or sometimes when changing existing

data) to the table. This is a part from that article:


"Because the data is stored in the order of the index, to insert a new row

SQL Server must find the page with the two rows between which the new row shall

be placed. Then, if there is not room to fit the row on that page, a split

occurs and some of the rows get moved from this page to a newly created one."


As some of you readers commented, one way to counter this (at least to a degree)

is to specify a value for the fill factor and possibly pad index for the

clustered index. Fill factor specifies the percentage that SQL Server should

fill every data page to when creating an index. Specifying a fill factor value

of 50 therefore means that each data page will only be half full, in effect

meaning that the storage space needed for the table is doubled (since there are

twice as many data pages). The good thing is that page splits are not as

necessary as with a higher fill factor, since there will usually be room on a

page to insert a new row without splitting the page into two and moving rows

between them. If fill factor is specified you can also specify the option to use

pad index (it uses the same value as fill factor), which works in the same way

as fill factor only it is used for index pages in the intermediate levels

(between the root level and leaf level).


Note that both options are only used when creating or rebuilding an index. They

are not maintained after initial 'filling' by SQL Server, as that would defeat

the original purpose to use them, i.e. avoiding page splits (as SQL Server would

have to use page splits to maintain them). Also note that these options apply

for non-clustered indexes as well, the index pages of non-clustered indexes use

page splits in the same way as clustered indexes of course. As always, think

carefully and test the effects of changing the fill factor and pad index values.


5 (3)




5 (3)