SQLServerCentral is supported by Redgate

Index on Key vs Included

Indexing can be quite confusing at times if you're not 100% sure what to do. Do you index only on key columns, or do you add included columns? It depends on what you are doing and what type of performance improvement you are looking for.

Let's start with some history. SQL Server 2005 added the ability to include non-key columns in a non-clustered index. In SQL Server 2000 and earlier, all columns defined for a non-clustered index were key columns, which meant they were part of every level of the index, from the root down to the leaf level. When a column is defined as an included column, it is part of the leaf level only.

The benefits of included columns:

  • They can be data types not allowed as index key columns.
  • They are not considered by the Database Engine when calculating the number of index key columns or index key size.

For example, a varchar(max) column cannot be part of an index key, but it can be an included column. Further, that varchar(max) column doesn't count against the 900-byte (or 16-column) limit imposed on the index key (raised to 1,700 bytes for non-clustered indexes in SQL Server 2016).
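As a quick sketch of this rule (the dbo.Orders table and its Notes column here are hypothetical, purely for illustration):

-- This fails: varchar(max) is not allowed as an index key column
-- CREATE NONCLUSTERED INDEX [IX_Orders_Notes]
-- ON [dbo].[Orders] ([Notes]);

-- This works: varchar(max) is allowed as an included column,
-- and it does not count against the index key size limit
CREATE NONCLUSTERED INDEX [IX_Orders_CustomerID]
ON [dbo].[Orders] ([CustomerID])
INCLUDE ([Notes]);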

An index with non-key columns can significantly improve query performance when all columns in the query are included in the index, either as key or non-key columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; the table or clustered index data is not accessed, resulting in fewer disk I/O operations.

We can infer that whether the covering columns are key columns or non-key columns, we get an improvement in performance compared to when some of the query's columns are missing from the index. But is there a performance difference between the two variations?

I restored a copy of AdventureWorks2016CTP3 to my SQL Server instance and ran a check on Sales.SalesOrderHeader to verify the indexes.

EXEC sp_helpindex N'Sales.SalesOrderHeader'


We’ll start with a straight-forward query for testing that retrieves data from multiple columns:

SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[SalesOrderHeader]
WHERE [CustomerID] BETWEEN 11000 and 11200;


(In Management Studio, you can see the I/O metrics by enabling SET STATISTICS IO ON; before running the query.)
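If you want to capture those metrics yourself, one way is to wrap the query like this (logical reads and CPU times appear in the Messages tab):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the query under test here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;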

The SELECT has a warning icon, because the optimizer recommends an index for this query:

USE [AdventureWorks2016CTP3]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Sales].[SalesOrderHeader] ([CustomerID])
INCLUDE ([SalesOrderID],[OrderDate],[ShipDate],[SalesPersonID],[SubTotal])

Test 1
We will first create the index the optimizer recommends (named NCI1_included), as well as the variation with all the columns as key columns (named NCI1):

CREATE NONCLUSTERED INDEX [NCI1]
ON [Sales].[SalesOrderHeader]([CustomerID], [SubTotal], [OrderDate], [ShipDate], [SalesPersonID]);
GO

CREATE NONCLUSTERED INDEX [NCI1_included]
ON [Sales].[SalesOrderHeader]([CustomerID])
INCLUDE ([SubTotal], [OrderDate], [ShipDate], [SalesPersonID]);
GO

If we re-run the original query, once hinting it with NCI1, and once hinting it with NCI1_included, we see a plan similar to the original, but this time there’s an index seek of each nonclustered index, with equivalent values for Table I/O, and similar costs (both about 0.006):

(The scan count is still 1 because the index seek is actually a range scan in disguise.)

Now, the AdventureWorks2016CTP3 database isn't representative of a production database in terms of size, and if we look at the number of pages in each index, we see they're exactly the same:

SELECT [Table]    = N'SalesOrderHeader'
	  ,[Index_ID] = [ps].[index_id]
	  ,[Index]    = [i].[name]
	  ,[ps].[used_page_count]
	  ,[ps].[row_count]
FROM [sys].[dm_db_partition_stats]  [ps]
JOIN [sys].[indexes]				[i] ON [ps].[index_id] = [i].[index_id] 
										AND [ps].[object_id] = [i].[object_id]
WHERE [ps].[object_id] = OBJECT_ID(N'Sales.SalesOrderHeader');


If we’re looking at performance, it’s ideal (and more fun) to test with a larger data set.

Test 2
I have a copy of the AdventureWorks2016CTP3 database that has a SalesOrderHeader table with over 200 million rows (script HERE), so let’s create the same nonclustered indexes in that database and re-run the queries:

CREATE NONCLUSTERED INDEX [Big_NCI1]
ON [Sales].[Big_SalesOrderHeader](CustomerID, SubTotal, OrderDate, ShipDate, SalesPersonID);
GO
 
CREATE NONCLUSTERED INDEX [Big_NCI1_included]
ON [Sales].[Big_SalesOrderHeader](CustomerID)
INCLUDE (SubTotal, OrderDate, ShipDate, SalesPersonID);
GO
 
SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1))
WHERE [CustomerID] between 11000 and 11200;
 
SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1_included))
WHERE [CustomerID] between 11000 and 11200;


Now we get some data. The query returns over 6 million rows, and seeking each index requires just over 32,000 reads, and the estimated cost is the same for both queries (31.233). No performance differences yet, and if we check the size of the indexes, we see that the index with the included columns has 5,578 fewer pages:

SELECT [Table]    = N'Big_SalesOrderHeader'
	  ,[Index_ID] = [ps].[index_id]
	  ,[Index]    = [i].[name]
	  ,[ps].[used_page_count]
	  ,[ps].[row_count]
FROM [sys].[dm_db_partition_stats]  [ps]
JOIN [sys].[indexes]				[i] ON [ps].[index_id] = [i].[index_id] 
										AND [ps].[object_id] = [i].[object_id]
WHERE [ps].[object_id] = OBJECT_ID(N'Sales.Big_SalesOrderHeader');


If we dig into this a bit further and check sys.dm_db_index_physical_stats, we can see that the difference exists in the intermediate levels of the index:

SELECT	 [ps].[index_id]
		,[Index] = [i].[name]
		,[ps].[index_type_desc]
		,[ps].[index_depth]
		,[ps].[index_level]
		,[ps].[page_count]
		,[ps].[record_count]
FROM [sys].[dm_db_index_physical_stats](DB_ID(), OBJECT_ID('Sales.Big_SalesOrderHeader'), 5, NULL, 'DETAILED')	[ps]
JOIN [sys].[indexes]																							[i] ON [ps].[index_id] = [i].[index_id] 
																													AND [ps].[object_id] = [i].[object_id];
 
SELECT	 [ps].[index_id]
		,[Index] = [i].[name]
		,[ps].[index_type_desc]
		,[ps].[index_depth]
		,[ps].[index_level]
		,[ps].[page_count]
		,[ps].[record_count]
FROM [sys].[dm_db_index_physical_stats](DB_ID(), OBJECT_ID('Sales.Big_SalesOrderHeader'), 6, NULL, 'DETAILED') AS [ps]
JOIN [sys].[indexes]																							[i] ON [ps].[index_id] = [i].[index_id] 
																													AND [ps].[object_id] = [i].[object_id];


The difference between the intermediate levels of the two indexes is 43 MB, which may not be significant, but I’d probably still be inclined to create the index with included columns to save space – both on disk and in memory. From a query perspective, we still don’t see a big change in performance between the index with all the columns in the key and the index with the included columns.

Test 3
For this test, let’s change the query and add a filter for [SubTotal] >= 100 to the WHERE clause:

SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1))
WHERE CustomerID = 11091
AND [SubTotal] >= 100;
 
SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1_included))
WHERE CustomerID = 11091
AND [SubTotal] >= 100;


Now we see a difference in I/O (95 reads versus 1,560), cost (0.848 vs 1.55), and a subtle but noteworthy difference in the query plan. When using the index with all the columns in the key, the seek predicate is the CustomerID and the SubTotal:

Because SubTotal is the second column in the index key, the data is ordered and the SubTotal exists in the intermediate levels of the index. The engine is able to seek directly to the first record with a CustomerID of 11091 and SubTotal greater than or equal to 100, and then read through the index until no more records for CustomerID 11091 exist.

For the index with the included columns, the SubTotal only exists in the leaf level of the index, so CustomerID is the seek predicate, and SubTotal is a residual predicate (just listed as Predicate in the screen shot):

The engine can seek directly to the first record where CustomerID is 11091, but then it has to look at every record for CustomerID 11091 to see if the SubTotal is 100 or higher, because the data is ordered by CustomerID and SalesOrderID (clustering key).

Test 4
We’ll try one more variation of our query, and this time we’ll add an ORDER BY:

SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1))
WHERE CustomerID = 11091
ORDER BY [SubTotal];
 
SELECT	 [CustomerID]
		,[SalesPersonID]
		,[SalesOrderID]
		,[DaysToShip] = DATEDIFF(DAY, [OrderDate], [ShipDate])
		,[SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1_included))
WHERE CustomerID = 11091
ORDER BY [SubTotal];


Again we have a change in I/O (though very slight), a change in cost (1.5 vs 9.3), and a much larger change in the plan shape; we also see a larger number of scans (1 vs 9). The query requires the data to be sorted by SubTotal; when SubTotal is part of the index key it is already sorted, so when the records for CustomerID 11091 are retrieved, they are already in the requested order.

When SubTotal exists only as an included column, the records for CustomerID 11091 must be sorted before they can be returned to the user, so the optimizer adds a Sort operator to the plan. As a result, the query that uses the index Big_NCI1_included also requests (and is given) a memory grant of 29,312 KB, which is notable (and can be found in the properties of the plan).
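If you'd rather not dig through plan properties, one way to observe grants is to query sys.dm_exec_query_memory_grants from another session while the query is executing (this DMV only shows currently executing requests that have asked for a grant):

SELECT	 [session_id]
		,[requested_memory_kb]
		,[granted_memory_kb]
		,[used_memory_kb]
FROM [sys].[dm_exec_query_memory_grants];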

Summary
The original question we wanted to answer was whether we would see a performance difference when a query used the index with all columns in the key, versus the index with most of the columns included in the leaf level. In our first set of tests there was no difference, but in our third and fourth tests there was. It ultimately depends on the query. We only looked at two variations – one had an additional predicate, the other had an ORDER BY – many more exist.

What developers and DBAs need to understand is that there are some great benefits to including columns in an index, but they will not always perform the same as indexes that have all columns in the key. It may be tempting to move columns that are not part of predicates and joins out of the key, and just include them, to reduce the overall size of the index. However, in some cases this requires more resources for query execution and may degrade performance. The degradation may be insignificant; it may not be…you will not know until you test. Therefore, when designing an index, it’s important to think about the columns after the leading one – and understand whether they need to be part of the key (e.g. because keeping the data ordered will provide benefit) or if they can serve their purpose as included columns.

As is typical with indexing in SQL Server, you have to test your queries with your indexes to determine the best strategy. It remains an art and a science – trying to find the minimum number of indexes to satisfy as many queries as possible.

The SQL Girl

I have always had a passion for programming and SQL. I fell in love with SQL back in 2009 and have been working with it ever since. I have obtained my MCSE in BI and MCSE in Data Management and Analytics. I enjoy solving puzzles with SQL code. If you are interested you can find hundreds of puzzles on Project Euler. When I am not coding I spend my time with my animals and doing creative things, such as painting and mosaics.

Comments

Leave a comment on the original post at thesqlgirl.com.