There are two kinds of database indexes: unique and non-unique. Unique indexes and non-unique indexes are different. They are as different as night and day. They are as different as men and women. They are as different as code and data. In fact, unique indexes are code and non-unique indexes are data, and that's how they should be treated.
Primay keys, unique constraints, and unique indexes are part of the database code. They affect the database's behavior: the same INSERT statement that succeeds without a unique index fails with one. They enforce business rules and data integrity. They allow the developers to make assumptions about the data when writing the rest of the code. My code relies on the existence of my unique indexes, and it won't work correctly without them.
Database developers can be trusted to write good unique indexes. They're the ones writing the rest of the code, after all. They know what the business rules are; they know what the data should look like, and they know what data should not be allowed. It is rightfully the developers' responsibility to create and maintain the proper unique indexes for the database. The unique indexes, just like all the rest of the code, should be captured in source control, tested by a quality assurance team, and deployed to the production environment, where they can be expected to produce the exact same behavior that they produced in the development and test environments.
Non-unique indexes, on the other hand, are completely different. They are not code; they do not perform calculations or do any data manipulation; they do nothing more than provide the server with more options for generating query execution plans. To select data from a table with no indexes at all, the server must perform a table scan, but if the table has indexes, the server can do index scans and index seeks and such to get the same data. The more indexes there are, the more ways there are for the server to retrieve the data.
There are several characteristics about non-unique indexes that make them significantly different from unique indexes:
- They do not affect the database's behavior at all; they only affect its performance. The same query will give you the same results no matter what indexes exist, as long as it doesn't time out; the only difference will be how long it takes to get the results. Missing indexes don't make the code break, so most developers won't notice when they're missing, and they won't remember to go back later and add some.
- Unless query execution is noticeably slow, the only way to see the impact of a non-unique index is to run a query in SQL Server Management Studio before adding an index, run the same query again after adding the index, generate execution plans both times, and visually compare them. Let's face it; those execution plans are not that easy to read, so developers don't have much incentive to look at them.
- The usage of a non-unique index is optional. Just because an index exists doesn't mean that the server will use it.
- A non-unique index will not be used if the data is not of sufficient quantity. If there are only a dozen rows in a table, then the server will ignore indexes because it's faster to scan the table than it is to use an index. There must be just enough overhead involved in an index seek or an index scan to make it not worthwhile for small tables. I experimented recently and found that the server only started using indexes when there were tens of thousands of rows in the table.
- A non-unique index will not be used if the data is not of sufficient quality. The server keeps statistics on the data; it can guess by what percentage an index seek will reduce the number of results. If half of the last names in your Person table are "Smith" and the other half are "Gupta", and you try to select all the Smiths, the server will know that using an index will only reduce the results by 50%, and it may decide that it's not worth the bother and just do a table scan. An index must have the potential to significantly filter the data before the server will use it.
- A non-unique index will not be used if you never execute a query that can benefit from it. If you never search for people by middle initial, then the server won't use an index on the Middle_Initial column. You might think this doesn't need to be said, but it is significant.
- Indexes have a cost. Not only do they take up disk space, they can have a performance cost, too; they make SELECTs faster, but they make INSERTs, UPDATEs, and DELETEs slower because the server has to change the data in the table and in the indexes. Unique indexes cost disk space and CPU time, too, but the code won't work without them; non-unique indexes are optional, so there are times when they're not worth the resources they use.
- There are more ways to index a table than you can count. A Person table can have one index on First_Name and another on Last_Name, or it can have one index that includes both columns. There can be two indexes on First_Name and Last_Name, one with First_Name first and the other with Last_Name first. You can add Middle_Initial to an index as a key column or as an included column. You can put a filter on the index, and on and on it goes. There are guidelines for guessing what kinds of queries will benefit from what kinds of indexes, but they are complex enough to make guessing accurately really hard.
All of these facts about non-unique indexes lead to some important consequences:
A non-unique index will have a different impact on the production server than it has on the development servers. A typical developer has a nearly empty copy of the database on his machine for development. He will only generate the bare minimum amount of test data required to make sure that the program works, and the test data that he generates will likely look like "User 1", "User 2", "User 3" for a Person table. Development data will certainly have neither the quantity nor the quality sufficient to make the server use indexes.
A non-unique index will have a different impact on the production server than it has on the test server. The Quality Assurance team, if there is one, may make the same mistake and use a nearly empty database for testing, but even if they do the right thing and routinely restore a backup of the production database onto the test server for testing, they won't be executing the same queries against that data that happen in production. Trust me; nobody can predict the crazy ways that the end user will try to query the database. The testers can diligently verify that the server can search for people quickly by either first or last name, and a user will complain that the web site times out when he searches by middle initial. As soon as the test team starts testing the performance of queries on the middle initial, someone will start trying to query by the tens digit in the year of the person's birthday.
A non-unique index will have a different impact on one production database than it has on another production database. I work for a company that creates lots of similar web sites; we keep the data for most of our sites in one database, but we have a handful of sites that are busy enough for each one to have a database to itself. The tables, views, procedures, functions, and other database objects are identical in all of those databases, but the amount of data in each one is vastly different, as are the queries that get executed against each one daily. Each database needs a different set of indexes to perform well; the same indexes that enhance the performance of one database can degrade the performance of another.
These consequences lead to inescapable conclusions:
- It doesn't make sense to maintain one set of indexes for all copies of the database. Each database needs its own.
- It doesn't do any good for the development team or the test team to test the indexes. You can test the unique indexes and all the rest of the code in a test environment, because they will behave the same in test and production, but it's a waste of time to test the non-unique indexes in a test environment. They will not perform identically in production. Even if they do today, they won't tomorrow.
- It also doesn't do any good to just index everything. You can't anyway, because there are more possible indexes than you can count, and indexes have a cost, so creating too many indexes can do more harm than good.
It all adds up to this: you can not expect the database developers to maintain the non-unique indexes. They just can't do it. There is no way to create them in the development environment and know that they'll work in production. The QA team wouldn't be able to test them in the test environment and come up with any meaningful results. The only way to know what indexes are needed is to observe the activity on the production database. In most organizations, the developers and testers don't have access to the production database, so the best any developer can do is guess what indexes might be needed.
So if they developers can't do it, who can? The only person who can is the database administrator. It is the DBA's job to keep the production database up and running smoothly at all times, and he's the only one with access to it, so the maintenance of non-unique indexes is the DBA's responsibility.
How can the DBA do it? I've seen documentation that says something about running a trace in the SQL Server Profiler to capture a day's worth of activity and analyzing the results with the Database Engine Tuning Advisor. Don't do that; it severly bogs down the server (guess how I know), and there's an easier way. SQL Server provides some nifty little dynamic management views that really help.
Every time SQL Server executes a query, it generates an execution plan (or retrieves one from a cache). Part of that generation process involves analyzing the query to see what indexes would produce the fastest results, looking to see what indexes are available, and matching them up. If it finds an index it needs, it uses it; if it doesn't, then it has a missing index -- a table name and a list of column names that don't have an index where the server could have used one. That is valuable information, and the server saves it where we can view it later. Search the SQL Server Books Online for "Index Related Dynamic Management Views and Functions." I got excited when I found out that these views exist:
The names can be misleading; I thought at first that a missing index group contained information about a group of missing indexes, but it turns out that there is only one missing index in each group. Microsoft could just as easily have put all the columns into one view instead of three. I made my own view, based on these, to tell me which missing indexes could generate the most potential benefit:
CREATE SCHEMA Maintenance; GO /* This view shows the server's suggestions for missing indexes that should be created to improve performance based on actual queries executed. The formula for potential_user_benefit came straight from the SQL Server Books Online documentation for the index-related dynamic management view sys.dm_db_missing_index_group_stats: http://msdn.microsoft.com/en-us/library/ms345421.aspx I don't completely understand it, but it seems to work. */CREATE VIEW Maintenance.missing_indexes AS SELECT mid.index_handle, mid.database_id, mid.statement, mid.equality_columns, mid.inequality_columns, mid.included_columns, migs.user_seeks, migs.user_scans, migs.avg_total_user_cost, migs.avg_user_impact, migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) AS potential_user_benefit FROM sys.dm_db_missing_index_details AS mid INNER JOIN sys.dm_db_missing_index_groups AS mig ON mid.index_handle = mig.index_handle INNER JOIN sys.dm_db_missing_index_group_stats AS migs ON mig.index_group_handle = migs.group_handle WHERE (mid.database_id = DB_ID()) GO SELECT TOP 10 * FROM Maintenance.missing_indexes ORDER BY potential_user_benefit DESC;
With this information, I can create the indexes that I know the server most needs.
The other side of the story is, how can I tell which existing indexes the server is not using, so I can drop them? SQL Server can tell me that, too. Every time it uses an index to perform a seek, a scan, or a lookup, it increments a counter associated with that index; every time it updates data in an index, it increments another counter. All four counters are visible in the dynamic management view
sys.dm_db_index_usage_stats, along with the date and time of the last seek, scan, lookup, and update. By looking at the date stamps, I can tell if an index has been used recently, and by comparing the number of reads to the number of writes, I can tell how much good the index is doing.
-- return the maximum of 3 dates CREATE FUNCTION Maintenance.max3datetimes ( @value1 DATETIME, @value2 DATETIME, @value3 DATETIME ) RETURNS DATETIME AS BEGIN DECLARE @value DATETIME; SELECT @value = MAX(value) FROM (SELECT @value1 value UNION SELECT @value2 value UNION SELECT @value3 value) AS allvalues; RETURN @value; END; GO /* This view shows all non-unique indexes in the current database and how much they have been used. Indexes are selected for dropping by considering how long it has been since they were last read and by comparing the number of reads to the number of writes. */CREATE VIEW Maintenance.droppable_indexes AS SELECT schemas.name AS schema_name, tables.name AS object_name, indexes.name AS index_name, ISNULL(ius.user_seeks + ius.user_scans + ius.user_lookups - ius.user_updates, 0) AS usefulness, Maintenance.max3datetimes(ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup) AS last_user_read, ius.last_user_update FROM sys.schemas AS schemas INNER JOIN sys.tables AS tables ON schemas.schema_id = tables.schema_id INNER JOIN sys.indexes AS indexes ON tables.object_id = indexes.object_id LEFT OUTER JOIN sys.dm_db_index_usage_stats AS ius ON ius.database_id = DB_ID() AND indexes.object_id = ius.object_id AND indexes.index_id = ius.index_id WHERE (indexes.is_unique = 0) GO -- find the most-written-to and least-read-from indexes SELECT TOP 10 * FROM Maintenance.droppable_indexes ORDER BY usefulness; -- find the least-recently-used indexes SELECT TOP 10 * FROM Maintenance.droppable_indexes ORDER BY last_user_read;
After using these views for a while and spending some time creating and dropping indexes, it occurred to me that I could write procedures to do the work for me. I've had lots of practice writing SQL code that writes SQL code:
CREATE FUNCTION Maintenance.missing_index_name (@index_handle INT) RETURNS SYSNAME AS BEGIN; DECLARE @index_name SYSNAME; SELECT @index_name = 'IX_' + obj.name FROM sys.objects AS obj INNER JOIN sys.dm_db_missing_index_details AS mid ON obj.object_id = mid.object_id WHERE (mid.index_handle = @index_handle) AND (mid.database_id = DB_ID()); SELECT @index_name = @index_name + CASE WHEN column_usage = 'INCLUDE' THEN '#' ELSE '_' END + column_name FROM sys.dm_db_missing_index_columns(@index_handle); RETURN @index_name; END; GO /* This procedure creates the missing index that should have the most performance impact on the server. */CREATE PROCEDURE Maintenance.create_index AS BEGIN; DECLARE -- @object_name SYSNAME, @index_name SYSNAME, @equality_columns NVARCHAR(4000), @inequality_columns NVARCHAR(4000), @included_columns NVARCHAR(4000), @SQL NVARCHAR(4000); SELECT TOP (1) -- @object_name = statement, @index_name = Maintenance.missing_index_name(index_handle), @equality_columns = equality_columns, @inequality_columns = inequality_columns, @included_columns = included_columns FROM Maintenance.missing_indexes ORDER BY potential_user_benefit DESC; IF @@ROWCOUNT > 0 BEGIN; SELECT @SQL = 'CREATE INDEX ' + QUOTENAME(@index_name) + ' ON ' + @object_name + ' (' + COALESCE(@equality_columns + ', ' + @inequality_columns, @equality_columns, @inequality_columns) + ')' + ISNULL(' INCLUDE (' + @included_columns + ')', ''); PRINT @SQL; EXEC (@SQL); END; END; GO /* This procedure finds the least used non-unique index in the current database with the most overhead and drops it. */CREATE PROCEDURE Maintenance.drop_index AS BEGIN; DECLARE -- @schema_name SYSNAME, @object_name SYSNAME, @index_name SYSNAME, @SQL NVARCHAR(4000); SELECT TOP (1) -- @schema_name = schema_name, @object_name = object_name, @index_name = index_name FROM Maintenance.droppable_indexes ORDER BY last_user_read, usefulness; IF @@ROWCOUNT > 0 BEGIN; SELECT @SQL = 'DROP INDEX ' + QUOTENAME(@index_name) + ' ON ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@object_name) + ';'; PRINT @SQL; EXEC (@SQL); END; END; GO
It wasn't very long before I was calling these procedures on a schedule from a SQL Server Agent job. Now I have a process on each database in the development, test, and production servers that creates custom indexes just for that database and gets rid of indexes that aren't being used.
This is a simplified version of my actual code. There are lots of ways to customize it: I can create tables to keep track of the indexes that the process created and dropped recently. I can check to make sure that I didn't just create an index before dropping it. I can protect certain indexes from being dropped based on whatever criteria is appropriate. I can add error handling and reporting. I can create a similar process to defragment indexes when they need it. I can change the criteria for determining which index to drop. I can compare an index's disk usage to the table's data's disk usage when determining which index to drop. I can include views when creating and dropping indexes. I can exclude system tables or change-data-capture tables. I can leave clustered indexes alone if they happen to be non-unique. I can change the formula for making up index names. I can create the indexes online if I'm on SQL Server Enterprise. I can create the indexes in their own file group. I've watched the process for a long time now and made several changes along the way to fine-tune it, and I will probably continue to tweak it in the future.
The point is, my indexes weren't created by a developer; they are the output of an automated process. That makes them data, not code. They never see the inside of our source control repository; they are managed directly on the production server.
Why is this a big deal? Because we use Red Gate SQL Compare to automate our database deployments, and SQL Compare does not distinguish between unique and non-unique indexes. It treats them all the same. It gives you only one option: you can include all indexes in a comparison, or you can ignore them all. Ignoring them all isn't an option -- can you imagine what would happen if you didn't deploy your primary keys? Including them all, however, means that the deploy process will always make the production database have only the indexes that developers have checked into source control, which means there's a good chance there won't be any at all. I don't have to imagine what would happen then; I know: I would spend all of my time looking at profiler traces and execution plans to see why the database keeps timing out. So our database deployment process has to generate a change script with SQL Compare and then edit it before running it to strip out all of the CREATE INDEX and DROP INDEX statements. That's not very efficient.
I've brought this issue up with Red Gate support, but I'm having trouble getting them to understand why I want to treat unique and non-unique indexes differently. I'm writing this to see if this viewpoint makes any sense to other database developers and administrators.