SQLServerCentral Article

Understanding Indexes

,

Introduction

This paper provides an introduction to database table indexes using Microsoft SQL Server. The intended audience is the junior database administrator (DBA) or others with little or no experience in the design, administration, or optimization of database systems.

A Simple Index

Webster defines an index as an alphabetical or classified list, as one placed at the end of a book, for facilitating reference to material within the body of the text (The Webster Reference Dictionary of the English Language, 1981). In other words, the index of a book is a sorted listing of topics contained within the book that is placed on additional pages in the back of the book. The index provides a quick and easy reference mechanism for retrieving sought after information contained in the book. The index has two limitations; it requires additional pages in the book, and it requires maintenance. The amount of information indexed in a book is a function of the granularity desired for the index. However, the index must be maintained when there is any addition of new text such as during an edition update; the index requires new entries to reference the additional information. The new index entries must be stored in sorted order within the existing index . the index of the new edition of the book must be regenerated.

There is good reason why the use of this example is pervasive in the discussion of database indexing; it is remarkably analogous to indexing data in a database system. The indexing system in a database requires additional storage and maintenance when new information is added or removed from the database. The index of a database is, in simple terms, a reference which is easier and faster to search than the information itself. In the following sections the paper will discuss the use of indexes in a database as well as index support in SQL Server; index types, their performance advantages and disadvantages, and index selection criteria.

Index Utilization in Database Systems

In a world where data collection is growing at rates that were unfathomable a decade ago, the ability to store and retrieve data efficiently becomes paramount. It is projected that just four years from now, the world's information base will be doubling in size every 11 hours. So rapid is the growth in the global stock of digital data that the very vocabulary used to indicate quantities has had to expand to keep pace. A decade or two ago, professional computer users and managers worked in kilobytes and megabytes. Now schoolchildren have access to laptops with tens of gigabytes of storage, and network managers have to think in terms of the terabyte (1,000 gigabytes) and the petabyte (1,000 terabytes). Beyond those lie the exabyte, zettabyte and yottabyte, each a thousand times bigger than the last.h (IBM Global Technology Services, 2006).

The constraints on performance of a database system require a mechanism to allow search functions to find the sought after data without scanning the entire data table. This is the purpose of the index structure in a database system; to provide an efficient mechanism for finding data. As we will see in subsequent sections, the use of indexes in a database system can, if carefully applied, increase the performance of data retrieval functions. However, there is 'a rub', other data functions such as updating, deleting and inserting data may be negatively affected by the use of indexes. In short, an index on a table uses the data in the row's columns to assist the database engine when executing data manipulations on the table.

SQL Server Index Types

An index is a data structure providing fast access to data stored in the database. Formally, "An index of a relation r over relation schema R is a data structure which helps in locating rows of r faster under some conditions." (Mannila & Raiha, 1992) An index is a data structure containing a collection of attributes of the data and a reference to the location in memory where the data is stored. "An index I is based on some index attributes {A1,c,An} º R. Conceptually, one can consider an index to be a data structure containing pairs ((a1,...,an), b), where (a1,c,an) is the tuple of values for the attributes A1,c,An and b is the address of the tuples in r with these values, that is, a pointer to the set ,c, " (Mannila & Raiha, 1992)

If the index values of the data are unique, then the storage location returned by the index will contain at most one reference to a location in memory where the data is stored; otherwise the index will return a set of locations. SQL server supports two index types: 1) clustered, and 2) nonclustered (technically, this is not accurate; however, discussion of other index sub-types will be deferred). The implementation details of SQL Server indexes is, again, beyond the scope and intent of this paper; however, it should be noted that both clustered and nonclustered index data structures are implemented as B-trees in SQL Server. When visualizing the B-tree with a clustered index the data and the index will be found sorted by the index key at the leaf level; whereas, with a nonclustered index the leaf level contains the address to the data pages contained in the search. Simply put, a clustered index is stored and sorted with the data in the clustered index key order. While the nonclustered index may be thought of as a pointer to the location of the data to which it refers.

The following sections discuss the various criteria for selecting and defining indexes to maximize performance of a database.

Indexing a Table

In general, database systems contain a wide variety of data necessary for operating a business; however, on a day-to-day, hour-by-hour, or even a minute-by-minute basis what is necessary to service the needs of the business is limited to a significantly smaller portion of the data in the database. With this understanding it is safe to surmise that most "queries on tables use only a subset of the data contained in the table; therefore, creating an index on the frequently searched table data increases the performance of the query." (Silberschatz, 2006) The question then is how many indexes and on which columns should they be built.

As a precursor to the definition and creation of indexes the performance effect of the new index should be carefully considered. Although an index may, generally, be expected to increase performance of the queries against a table it is necessary to ensure that the patterns of data access on the soon-to-be indexed table are well understood. An important metric to collect associated with an indexing strategy include the ratio of queries that read data as opposed to queries that update the data. In addition, the performance of the database gas-ish prior to the creation of any new objects must be captured as a baseline performance measurement. Because adding an index to a table is not a guarantee of increased performance there are important issues to examine in order to ensure that new indexes will lead to expected increased performance results, thus any changes to database objects, indexes in this case, may be measured against known performance characteristics of the database.

The goal of creating indexes on a table is to increase performance by decreasing input/output (I/O) requests to SQL Server while fulfilling the query issued against the table. Discussion of the various tools available for the analysis of database I/O specifically and performance in general are outside the scope of this paper and will be deferred. Whichever tool is used for performance measurement the general metrics to be applied to measure the effect of an index is I/O and index maintenance load on the database. The performance of the database will be enhanced by an index when the number of I/O requests to the database are reduced; however, the same index may reduce performance by requiring maintenance when data is updated in the table. It is this contention between high-performance data retrieval and data insertion index maintenance that requires careful analysis when planning an indexing technique.

Below are a few general indexing patterns:

  • When a table is smallh the cost of using an index may actually decrease performance by
  • increasing the number of I/O requests when compared to a simple table scan (the database
  • simple searches all rows of the table for the data in the query).
  • Avoid creating an index in response to the poor performance of a single query.
  • When queries against a table do not include a WHERE clause there is no benefit to using a
  • nonclustered index.
  • Avoid the creation of an index on a column in a table when the data stored in that column has a
  • large number of repeated data values are stored in the column.
  • If the database supports a transaction processing system, then avoid using indexes because the
  • performance of the insertion of new transactions will be degraded.
  • Avoid the creation of too many indexes on a table on which INSERT, UPDATE, DELETE, and
  • MERGE statements are applied because the need to maintain the indexes will significantly
  • impact performance. (Microsoft)
  • Do create many indexes on a table that is not frequently updated in order to provide the Query
  • Optimizer with more choices to determine the best execution path. (Microsoft)

Considerations for Selecting Indexes

The responsibilities of a DBA include database integrity, security, and performance; therefore, an effective DBA will have the ability to create appropriate performance enhancing indexes for a database. SQL Server allows each table to have a single clustered index defined upon it while SQL Server 2000 and SQL Server 2005 allow 249 nonclustered indexes and SQL Server 2008 R2 allows 999 nonclustered indexes; however, it is important to keep in mind that the size of any one index is limited to 900 bytes (technically, this is not accurate; however, discussion of the exceptions to the key record size limitations will be deferred). The considerations for defining a clustered or nonclustered index on a table are dependent on the queries executed against the table as well as other application and data factors; however, several generic patterns may be defined.

Below are a few general index selection patterns:

  • Clustered Index
  • Define a clustered index on every table (technically, not accurate; however, discussion of the exceptions
  • to this rule will be deferred).
  • Multi-column clustered indexes should be defined with as few columns as possible.
  • Define a clustered index on the column that appears most frequently in the WHERE clause of SELECT
  • statements.
  • Select columns that are unique such as an IDENTITY column or contain many distinct values.
  • Select columns whose data are accessed sequentially.
  • Select columns whose data are frequently used to sort data during the evaluation of a query.
  • Avoid columns that change frequently.

Nonclustered Index

  • Define a nonclustered index on columns used in the JOIN, WHERE, ORDER BY, or GROUP BY clause of a
  • SELECT statement.
  • Define a nonclustered index on the columns that coverh a query. That is, all columns in the SELECT
  • statement are defined in the nonclustered index. There is an added benefit to this approach because all
  • of the data necessary to satisfy the query is located in the index; therefore, no data page redirection is
  • necessary.
  • Define a nonclustered index on columns with many distinct values.

Defining indexes is as much art as science requiring careful consideration of the impact of the indexes on performance recognizing that an index may or may not increase performance of the query; and in the worst case will degrade the performance of the system.

Summary

Database management systems use indexes to provide a faster path to the data they contain. Unfortunately,the definition of indexes on data tables is not a purely objective exercise, but must consider many subjective metrics. After careful consideration of the data, the queries executed against the data, system performance,storage requirements, etc. a set of indexes are defined by the DBA only to find that over time as the data within the system changes the indexes must be reexamined, possibly redefined. Regardless, indexes are not static objects within the system; they require regular maintenance in order to provide performance improvements.

There is but one Golden Rule of indexes: test; establish a baseline performance measure against which any new index or changes to existing indexes may be compared. A set of properly selected, configured, and maintained indexes will provide significant query execution performance improvements; on the other hand, indexes may degrade performance due to their maintenance requirements and may, in fact, be ignored by the Query Optimizer.

As an introductory paper many topics were deferred including SQL Server's indexed views which provide a powerful tool for enhancing query performance. Subsequent papers will discuss the tools available in SQL Server for identifying missing indexes and measuring the performance of indexes, indexed views, and other performance optimization topics in SQL Server.

References

IBM Global Technology Services. (2006). The toxic terabyte. London: IBM Corporation.Kadlec, J. (2006). SQL Server Indexes Made Handy.Mannila, H., & Raiha, K.-J. (1992). The Design of Relational Databases. Great Britain: Addison-Wesley.Microsoft. (n.d.). General Index Design Guidelines. Retrieved 8 2010, from Sql Server Books Online:http://msdn.microsoft.com/en-us/library/ms191195.aspxSchwarz, L. (2007). Clustered and Non-Clustered Indexes in SQL Server.Silberschatz, A. (2006). Database System Concepts. New York: McGraw-Hill.The Webster Reference Dictionary of the English Language. (1981). Baltimore: Publishers United Guild.

Rate

2.56 (77)

You rated this post out of 5. Change rating

Share

Share

Rate

2.56 (77)

You rated this post out of 5. Change rating