Indexing Strategies

,

Introduction

When we run a query, the choice of whether to use an index or

not is decided by the query optimizer. Using appropriate indexes on tables so

that the query optimizer chooses the most efficient strategy can yield

significant performance gains . We will try to examine a few scenarios to

familiarize ourselves with the way indexes are used to retrieve data and also

try to figure out which index clustered or non clustered is good on each

occasion.

Let us create a sample table so that we can use it for our

purpose.

CREATE TABLE MEMBERS

( member_number   INT NOT NULL,

  account_number   INT NOT NULL,  

  member_SSN      INT  NOT NULL,

  account_balance  MONEY NOT NULL, 

  member_name     VARCHAR(30) NOT NULL,

  member_address  VARCHAR(60) NOT NULL )

We will use a number of scenarios here.

Retrieving a single row from the members table

Select account_balance from members where member_number =

9000

Let us say there is a clustered index on the member_number

column.

SQL Server will first obtain the page number of the root page of

the index from the sysindexes table. Then it will traverse through the index

pages looking for a key value that is not greater than the current key value. In

a clustered index the index consists of index key and a pointer which is the

Page ID. So just one level above the leaf level it gets the page pointer to the

data page holding the particular row and it scans the data page for the row

containing the key.

Let us say there is a non clustered index on the

member_number column.

In case of a non clustered index too the traversal works in a

similar manner, the pointer to the key value we want to retrieve is found in the

leaf level which is the Row ID of the data row. Using the Row ID sql server

pulls the data row.

The difference here is, the non clustered index has one more

logical read. The performance difference is very less between the two indexes

here.

Retrieving a range of rows

Select account_balance from members where member_number

between 9000 and 10000

In case of a clustered index on member_number column sql

server will look again for the highest key value not greater than the lowest key

we wish to retrieve. It will traverse through the index and will find the data

page that contains the lowest key value which is 9000 in our case. It scans the

data page for 9000 and when it finds it, it then retrieves the rows sequentially

until it reaches 10000, as we know in a clustered index the data rows are in

arranged in the same order as the clustered index key.

In case of a non clustered index on member_number

column the traversal of the index happens in a similar way. But once the

leaf level is reached and the key value is found, the index entry contains the

Row ID of the data row to be found which in this case will be the Row ID for

member_number 9000. Using this Row ID, sql server will pull the data page

holding that row. In a similar way it has to pull all the other rows for

member_numbers between 9001 and 10000. We have to note a difference here. The

key values are in sequence in a non clustered index, but the data pages are not.

Let us say we have 50 rows satisfying the query and approximately 10 rows are

present in each data page. So sql server has to retrieve the 50 rows in 50/10 =

5 logical reads to the data pages using a clustered index and sql server will

need 50 logical reads to the data pages using non clustered index. This is a

significant performance difference.

So it is always better to create a clustered index when we are

accessing a range of rows.

Covered queries

Select account_balance from members where member_number

between 9000 and 10000

Same query as before , but this time let us say that the members

table has a composite non clustered index on member_number and

account_balance columns. The query is called a covered query since the items

in the where clause and select clause belong to the index. The query optimizer

will realize that this is a covered query. In this case the sql server does not

have to go to data level to work on the query. It can find the results in the

leaf level of the non clustered index. It traverses the index based on the

member_number and when it reaches the leaf level , it finds the account_balance

information next to it in the key itself with out having to access the data page

through the pointer.

Let us say there are 500 rows that satisfy this query and there

are 100 index entries per data page. So to get the results sql server has to do

5 logical reads of data pages. This is more efficient than having a clustered

index on the member_number column since we do not have to access data pages at

all.

So sometimes a covered non clustered index is more efficient

than an equivalent clustered index.

Retrieving rows with clustered index and non clustered index on

the table

Select account_balance from members where member_number

between 9000 and 10000

Here we will examine a different scenario where there is a

non clustered index on the member_number column and clustered index on

the account_number column. The main thing to note here is that the members

table will now has leaf level index entries containing the clustered index key

as a pointer instead of Row ID when there is no clustered index on the table. So

in order to access the data rows, sql server has to traverse through the

clustered index. In our example in order access row with member_number 9000, sql

server will traverse through the non clustered index and in the leaf level it

will find the index entry for 9000 having a clustered index key which will be

the account_number. Based on this account_number it traverses through the

clustered index and finds the data row. Suppose we have 100 rows to fetch, we

have to access the clustered index 100 times to fetch the data pages containing

the rows.

This is very inefficient and probably the query optimizer will

choose a table scan instead of an index scan.

In the same scenario let us change the query and execute the

following query.

Select account_number from members where member_number

between 9000 and 10000

So sql server will traverse through the non clustered index as

before until the leaf level where it finds the clustered index key which is the

account_number in this case. So sql server need not go fetch the data pages

traversing through the clustered index again, since it found the member_number

and account_number side by side in its index entries.

So the query optimizer will choose the index scan since it is

very efficient in this case.

Retrieving rows with multiple non clustered indexes on the

table

Select * from members where account_balance between 500 and

5000 and member_number between 9000 and 9500

Let us say there is a non clustered index on account_balance

column and another non clustered index on the member_number column with out

any clustered index on the table. In this case the query optimizer can use both

the indexes. Since there is no clustered index,  the leaf level of the indexes

has Row IDs as pointers to the rows. So the query optimizer examines both

indexes and gets the sets of Row IDs that match the given selection.Then it gets

the intersection of both the sets of Row IDs. So instead of pulling the

individual data pages, the query optimizer first gets a result set of all the

RowIDs that match both the criteria. Based on how many rows satisfy the

condition it estimates whether to make use of the indexes or to do a table scan

to retrieve the rows. Let us say there are 100 rows that satisfy this query and

there are 10,000 rows spread over 500 data pages (20 rows per page) in the

table, it may be efficient to use the index scan and do 100 logical reads to get

the data pages than scanning all the 500 data pages.

Conclusion

The decision of the query optimizer depends on the cost it would

incur or in other words how many logical or physical reads it has to do to

retrieve the result of a query. Hopefully these different scenarios will help a

little bit in better understanding the usage of indexes by query optimizer.

Any comments are welcome.

References

Microsoft SQL Server 2000 Performance Optimization and

Tuning Hand Book by Ken England.

Rate

5 (4)

Share

Share

Rate

5 (4)