Managing SQL Server Statistics

Accurate statistics about the data held in tables are used by the query optimizer to choose the best execution strategy for SQL queries, but if the statistics don't accurately reflect the current contents of the table, you'll get a poorly-performing query. How do you find out whether statistics are correct, and what can you do if the automatic update of statistics isn't right for the way a table is used?

Statistics are critical metadata for SQL Server’s query optimizer, and they influence the execution plan selected for a query. The optimizer obtains its knowledge of the data, its distribution, and the number of rows a given query is likely to return from the available statistics. Based on this knowledge, it decides the optimal access path, making choices such as whether to scan a table or perform an index seek, use a nested loop join or a hash join, and so on.

If statistics are out of date, or do not exist, the optimizer can make poor choices and execution plan quality, and consequently query performance, can suffer. SQL Server can automatically maintain statistics, periodically refreshing them based on its tracking of data modifications. However, for some tables, such as those subject to significant changes in distribution, or those with skewed values, it’s possible that SQL Server’s automatic statistics update will be inadequate to maintain consistently high levels of query performance.

In this article, I’ll describe, briefly, when SQL Server creates statistics and its criteria for performing automatic statistics updates. I’ll then explain, in more detail, why automatic updates to statistics may not be sufficient, and what data you can gather and monitor to help you manage statistics manually, and proactively, rather than in response to query performance issues.

Statistics and Database Maintenance Tasks

Database backups, integrity checks, and performance optimizations form the core of a DBA’s regular maintenance tasks. Backups are usually at the top of the list, as data is one of the most important, if not the most important, facets of a business. If there is any problem with the production database, it must be possible to recover the data. Integrity checks are also essential as database corruption must be found and corrected as quickly as possible, to mitigate downtime and data loss.

Finally, a DBA has to ensure that performance is optimal, and ensuring that statistics remain up-to-date and accurate is an important part of this task. However, perhaps because SQL Server does update statistics automatically, whereas it doesn’t, for example, automatically defragment indexes, it’s easy for a DBA to overlook this task.

The problem is that, in certain cases, SQL Server’s automatic statistics maintenance may update the statistics too infrequently, or may not give the optimizer enough information to describe the data distribution properly. This is when the DBA may need to step in and manage statistics manually.

When SQL Server Creates Statistics

First, SQL Server will create an associated Statistics object every time we create an index. We refer to these as index statistics. These statistics will exist as long as the index exists. Note that since an index rebuild operation recreates the index, SQL Server will also update the index Statistics object.

Second, assuming the database option Auto Create Statistics is enabled, which it is by default, SQL Server will create single-column statistics whenever a column, which is not already the leading column in an existing index, is used in a query predicate (e.g. in the search condition of a WHERE clause, or in a JOIN condition). We refer to these as column statistics. We can also use the CREATE STATISTICS command (http://msdn.microsoft.com/en-us/library/ms188038.aspx) to create single- and multi-column statistics manually.

In essence, statistics are a histogram describing the data distribution in a column. The statistics are stored in system tables in the database. In a multi-column statistic, whether index or column-level, the histogram only exists for the first column (they are “left-based”). For example, if the index is on (ID, LastName, FirstName), the histogram for the index statistic is on ID. If we write a query, SELECT * FROM dbo.table WHERE LastName = 'Smith', then SQL Server will create a column statistic on LastName only (assuming LastName is not the leading column in another existing index).
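
As a simple illustration (a sketch only; the Person.Person table and the statistic name are used here purely for demonstration), the following creates a multi-column statistic manually and then lists the statistics objects on the table. Only LastName, the leading column, gets a histogram:

USE AdventureWorks2012;
GO
-- Manually create a multi-column statistic; the histogram is built on LastName only
CREATE STATISTICS st_Person_LastName_FirstName
ON Person.Person (LastName, FirstName)
WITH FULLSCAN;
GO
-- List the statistics objects that now exist on the table
SELECT s.name, s.auto_created, s.user_created
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'Person.Person');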

How SQL Server Manages Statistics

After working with clients for several years, I realized there was some confusion surrounding indexes and statistics, particularly among “accidental” DBAs, or those that were new to database administration. While many recognized that indexes and statistics were two distinct entities, they didn’t always understand when each would update.

As we add, delete or modify rows in a table, SQL Server adds, deletes or modifies corresponding rows in the indexes. For example, if we modify the value for ProductID for a record, SQL Server will also modify the corresponding record in any indexes that contain that column. Some DBAs assume SQL Server updates statistics in a similar fashion, i.e. every time we add, remove or modify data. In fact, this is not how it works. Assuming the Auto Update Statistics database option is enabled, SQL Server will automatically update the statistics, but only after a certain “volume threshold” of changes to the data.

Every time we modify a record in a table, SQL Server tracks it via the rcmodified column in a hidden system table. SQL Server 2005 tracked this information in the sys.rowsetcolumns table; in SQL Server 2008 (and later), sys.rowsetcolumns merged with sys.syshobtcolumns to become sys.sysrscols (covered in more detail later in the article). In general, when 20% of the rows in a table, plus 500, have been modified (so, for a one-million-row table, roughly 200,500 modifications), SQL Server considers the associated column and/or index statistics objects to be ‘stale’. SQL Server will update each invalidated object automatically, the next time the query optimizer needs to use that statistic.

When SQL Server auto-updates statistics

For more details on when, generally, SQL Server will automatically update statistics objects, see my article, Understanding when statistics will automatically update. For details on one or two other cases where SQL Server will update statistics, please see Statistics Used by the Query Optimizer in Microsoft SQL Server 2008.

What this means is that over time, as business processes modify data in a table, the associated statistics can become progressively ‘stale’ and less accurate in their representation of the data. At a certain volume threshold of modifications, as described above, SQL Server will automatically update the statistics. However, in some cases it’s possible that the optimizer will start to make poor choices for an execution plan, before the automatic update occurs.

The second important point to consider, in addition to the volume of changes that will trigger an automatic update, is the degree of accuracy of the auto-updated statistics. When we create or rebuild (not reorganize, just rebuild) an index, SQL Server generates the statistics with a FULLSCAN, i.e. it scans all the rows in the table to create a histogram that represents the distribution of data in the leading column of the index. Likewise, SQL will auto-create column statistics with a full sample.

Index rebuilds on partitioned tables

Note that in SQL Server 2012, index rebuilds for partitioned tables do not perform a full scan. See http://blogs.msdn.com/b/psssql/archive/2013/03/19/sql-server-2012-partitioned-table-statistics-update-behavior-change-when-rebuilding-index.aspx for more details.

However, when SQL Server automatically updates a statistic, if the table is more than 8 MB in size then it uses a default sample, which is less than the full sample that SQL Server uses when creating the statistic (this is non-configurable). The automatic update samples only from selected pages (the sampling is not actually random; see http://blogs.msdn.com/b/conor_cunningham_msft/archive/2008/07/24/statistics-sample-rates.aspx) in the index or table. Those samples may not capture the interesting details about a column or set of columns. As a result, the default sample may not create a histogram that accurately represents the distribution of data in the column.

Are SQL Server’s automatic updates to statistics adequate?

Many systems rely solely on SQL Server to update statistics automatically, and many of those systems consistently perform well. However, there are always exceptions. Large tables, tables with uneven data distributions, tables with ever-increasing keys and tables that have significant changes in distribution often require manual statistics updates.

If a table is very big, perhaps more than 100 million rows, then waiting for 20% of rows to change before SQL Server automatically updates the statistics could mean that millions of rows are modified, added or removed before it happens. Depending on the workload patterns and the data, this could mean the optimizer is choosing substandard execution plans long before SQL Server reaches the threshold at which it invalidates statistics for a table and starts to update them automatically. In such cases, you might consider updating statistics manually for those tables on a defined schedule (while leaving Auto Update Statistics enabled so that SQL Server continues to maintain statistics for other tables).

In cases where you know data distribution in a column is “skewed”, it may be necessary to update statistics manually with a full sample, or create a set of filtered statistics (not covered further in this article), in order to generate query plans of good quality. Remember, however, that sampling with FULLSCAN can be costly for larger tables, and must be scheduled so that it does not affect production performance.

It is quite common to see an ascending key, such as an IDENTITY column or a date/time column, used as the leading column in an index. In such cases, the statistic for the key rarely matches the actual data, unless we update the statistic manually after every insert (not likely). For example, consider the Sales.SalesOrderHeader table in the AdventureWorks2012 database, and imagine that we wanted to retrieve the CustomerID, SalesPersonID, and TotalDue for all orders with an OrderDate in the current week.

We could create a non-clustered index with OrderDate as the leading column, and add the CustomerID, SalesPersonID, and TotalDue columns as included columns, as sketched below. If we had a scheduled job that updated statistics on Sunday morning at 1:00 AM, the highest value in the histogram would be for the most recent order placed, i.e. Sunday at 1:00 AM at the latest.
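
A sketch of such an index (the index name is illustrative, and the index is assumed to be created on Sales.SalesOrderHeader):

-- Nonclustered index with the ascending OrderDate key; the index statistic's histogram is built on OrderDate
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_OrderDate
ON Sales.SalesOrderHeader (OrderDate)
INCLUDE (CustomerID, SalesPersonID, TotalDue);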

As customers place orders during the week, new rows are added to the table. By the end of the week, it might contain thousands of new rows but, if the table is big enough, this volume of change still may not trigger SQL Server’s automatic statistics update, and according to the statistic, there would only be one row for any date beyond Sunday at 1:00 AM. It is very common to see large differences between the estimated and actual rows in a statistic when an ascending key exists.

How You Can Manage Statistics

In cases where we feel that the frequency or accuracy of SQL Server’s automatic statistics updates is inadequate for a table, we have the option to maintain those statistics manually. For example, we can create a maintenance job that runs the UPDATE STATISTICS (http://msdn.microsoft.com/en-us/library/ms187348.aspx) command on a regular schedule. We can run the command for a table, in which case it will update all index statistics and column statistics on that table, or we can target specific index statistics or column statistics. However, we must consider carefully when to execute this job.
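
For example, the job might run statements along these lines (a sketch; the choice of tables, statistics and sample sizes would depend on the system):

-- Update all index and column statistics on a table, using the default sample
UPDATE STATISTICS Sales.SalesOrderDetail;

-- Update one specific statistic, scanning every row
UPDATE STATISTICS Sales.SalesOrderDetail IX_SalesOrderDetail_ProductID WITH FULLSCAN;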

Many teams simply incorporate manual statistics updates into their regular index maintenance schedule. Some DBA teams rebuild all of their indexes on a regular schedule, say once a week, in which case they are also updating index statistics on a weekly basis since, as discussed, rebuilding an index will update the statistics with a FULLSCAN. Note that, for this reason, it is a waste of resources to rebuild indexes and then immediately run a job to update statistics for those indexes. I’ve even seen cases where immediately after index rebuilds, a scheduled job updated the statistics, but using the default sample, so in effect reducing the sampling size, and therefore accuracy of the statistics. A better option, after a rebuild of all indexes, would be to update only column statistics, i.e. those auto-created by SQL Server, as these are unaffected by index rebuilds.

Overall, though, regular rebuilding of all the indexes in a database is a poor way to manage indexes. You’ll inevitably waste resources rebuilding indexes that are not fragmented. In addition, it can affect the availability of a system, particularly if you’re running Standard Edition where index rebuilds are performed offline (although for some systems, this option works because the database does not have to be online 24×7 and it completes in the available maintenance window). Note that there are additional considerations, even in Enterprise Edition, that affect the ability to rebuild indexes online (e.g. LOB and XML data).

A better, and more common, approach is to implement a maintenance task that rebuilds or reorganizes indexes based on fragmentation. Indexes that we rebuild will have their statistics updated as part of that task. For other statistics, such as those for indexes that we reorganize, or column statistics, we can update the statistics as part of a separate task, or leave SQL Server to manage them automatically.

Collecting Data for Proactive Management of Statistics

In general, the manual management of statistics occurs as a reactive response to specific performance problems. In other words, performance of certain queries starts to suffer due to poor execution plans generated by the optimizer. The DBA suspects the cause is inaccurate and outdated statistics and implements a manual statistics update regime for the affected tables.

However, this reactive approach to optimization is often stressful. Ideally, manual statistic updates are targeted and occur before issues arise, but the practice of proactively monitoring statistics is rare and often an after-thought.

Certain indicators can tell you, ahead of time, that SQL Server’s automatic statistics updates are likely to be inadequate for a table. It comes down to knowing your data: for example, the size of the table, column density (to try to predict data skew), the volatility of the data, whether the table is subject to large data loads, the presence of ascending keys, and so on.

We can get some of this information by examining the existing statistics for our tables. In addition to this, we can start actively tracking changes in the volume of data in our tables, and the rate of changes of rows within the table, and then target our manual statistics updates at the tables that really need it.

Let’s discuss each of these elements in turn, first how to examine existing statistics, and then how to start tracking data volume changes, and rate of data modifications, on our tables.

Examining Statistics

Let’s look at the Sales.SalesOrderDetail table in the AdventureWorks2012 database. It has a clustered index, PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID, and two non-clustered indexes, AK_SalesOrderDetail_rowguid and IX_SalesOrderDetail_ProductID.


Listing 1: Indexes on the SalesOrderDetail table in AdventureWorks2012
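
A query along these lines lists them (a sketch; sp_helpindex 'Sales.SalesOrderDetail' would work equally well):

SELECT i.name, i.type_desc, i.is_unique
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'Sales.SalesOrderDetail')
  AND i.index_id > 0
ORDER BY i.index_id;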

There is a Statistics folder for the SalesOrderDetail table in SSMS Object Explorer and in it we can see four statistics objects.


Three of these represent the statistics associated with the table’s three indexes. The fourth, whose name begins with _WA_Sys_, is a column statistic on SalesOrderDetailID, which SQL Server created automatically.

However, a much better way to view statistics is via the sys.stats catalog view. Those of you on older versions of SQL Server, or those of you who have been using SQL Server for a long time, may still be using the system stored procedure sp_helpstats. I like sp_helpstats because it lists the columns in the statistic, but it is a deprecated feature and may be removed in a future release of SQL Server. In addition, the sys.stats catalog view provides significantly more information about a statistic, as shown in Listing 2.


Listing 2: Output of sp_helpstats compared to querying sys.stats
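
The two can be compared with something like the following (a sketch; run in the AdventureWorks2012 database):

-- Deprecated, but still available: lists statistics and their columns
EXEC sp_helpstats N'Sales.SalesOrderDetail', 'ALL';

-- The catalog view exposes more metadata about each statistic
SELECT s.name, s.stats_id, s.auto_created, s.user_created, s.no_recompute
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'Sales.SalesOrderDetail');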

If you’re using SQL Server 2005 or higher, I highly recommend taking advantage of sys.stats, and using sys.stats_columns to list the columns for a statistic.
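
For example, a join such as this (a sketch) returns each statistic with its columns in order:

SELECT s.name AS statistic_name,
       sc.stats_column_id,
       c.name AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON s.object_id = sc.object_id
 AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
  ON sc.object_id = c.object_id
 AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID(N'Sales.SalesOrderDetail')
ORDER BY s.name, sc.stats_column_id;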

Once we know what statistics exist for an object, one way to get more details on that object, including when it was last updated, is via DBCC SHOW_STATISTICS.


Listing 3: Statistics for IX_SalesOrderDetail_ProductID
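
The command takes the table (or indexed view) name and the statistic or index name, for example:

DBCC SHOW_STATISTICS (N'Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID);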

The Updated column in the statistics header displays the last time the statistic was updated, whether via an automatic or a manual update. We can use the Rows and Rows Sampled columns to determine whether the sample rate was 100% (FULLSCAN) or something less.

The density vector, which is the second set of information, provides detail regarding the uniqueness of the columns. Density is calculated as 1 / (number of distinct values in the column); for example, if the leading column contains 266 distinct values, its density is 1/266, or roughly 0.0038. The optimizer can use the density value for the left-based column(s) in the key to estimate how many rows, on average, would return for any given value in the table.

The final section of the output, the histogram, provides detailed information, also used by the optimizer. A histogram can have a maximum of 200 steps, and it displays the number of rows equal to each step value, as well as the number of rows between steps, the number of distinct values between steps, and the average number of rows per value between steps.

Data Volume Changes and Rate of Modification

In order to understand when statistics should be updated manually, versus automatically by SQL Server, you have to know your data. For a DBA who manages hundreds of databases, this can be a challenge, but information is available in SQL Server that can help you make informed decisions about updates. Two critical pieces of data you can capture are the net change in the volume of data in a table, and the rate of change of rows within the table. If we know this, we can target first the large tables that are subject to rapid change.

Volume of data

To monitor the change in the number of rows for a table over time, use any of the following catalog or Dynamic Management Views:

  • sys.partitions – provides the number of rows in the table
  • sys.dm_db_partition_stats – provides row counts plus page counts, per partition
  • sys.dm_db_index_physical_stats – provides the number of rows and pages, plus information about fragmentation, forwarded rows and more

By tracking changes in row counts over time, we simply measure the net change in the volume of data. The sys.partitions and sys.dm_db_partition_stats views provide the easiest route to trending the change in the number of rows. However, if you’re already using sys.dm_db_index_physical_stats to manage index fragmentation, then it should be simple to modify existing scripts to capture the number of rows as well. Note that you must use the DETAILED option with sys.dm_db_index_physical_stats to return the number of rows in the table, and be aware that running this function generates significant real-time overhead that can affect system performance.


Listing 4: Using sys.partitions or sys.dm_db_index_physical_stats to track data volume in a table
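
A sketch of both approaches (the DETAILED scan, as noted, is expensive on large tables):

-- Net row count from sys.partitions (index_id 0 = heap, 1 = clustered index)
SELECT SUM(p.rows) AS row_count
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'Sales.SalesOrderDetail')
  AND p.index_id IN (0, 1);

-- Rows, pages and fragmentation from sys.dm_db_index_physical_stats
SELECT ips.index_id, ips.record_count, ips.page_count, ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'AdventureWorks2012'),
         OBJECT_ID(N'Sales.SalesOrderDetail'),
         NULL, NULL, 'DETAILED') AS ips
WHERE ips.index_level = 0;   -- leaf level only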

Rate of Change

Changes in data volume, while valuable, do not provide a complete picture of database activity. For staging tables, nearly the same number of records can be added to and deleted from the table each day, and a snapshot of the number of rows would suggest the table is static. However, the records added and deleted may have very different values, which can dramatically alter the distribution of data in a database, rendering statistics meaningless, possibly before the optimizer would label them as invalidated and update them automatically.

In such cases, it helps to track the number of modifications for a table and we can do this in several ways, although for reasons we’ll discuss, only the last two are good approaches:

  • rowmodctr in sys.sysindexes
  • rcmodified in sys.sysrscols
  • modified_count in sys.system_internals_partition_columns
  • leaf_*_count columns in sys.dm_db_index_operational_stats
  • modification_counter in sys.dm_db_stats_properties

While the sys.sysindexes system table is still available in SQL Server 2012, Microsoft plans to remove it in a future release, so don’t use it to track modifications unless you have no choice. Anyone still using SQL Server 2000 could utilize this option.

sys.sysrscols is a hidden system table and accessing it requires the Dedicated Admin Connection, which is not conducive to scheduled maintenance. As such, it is not a practical option for tracking modification. For more information on rcmodified in sys.sysrscols, please see the article, How are per-column modification counts tracked?, which explains how it could be used to examine modifications in detail.

sys.system_internals_partition_columns is a system internal view, which means that it is reserved for Microsoft SQL Server. It tracks modifications in a similar fashion to sys.sysrscols, but you don’t need the DAC to view it. While a better option than sys.sysrscols, its information is still not a good basis on which to build maintenance strategy, because Microsoft makes no promises that the behavior of hidden system tables will be consistent from one release to the next.

Finally, we arrive at the two viable options. We can track the number of modifications using sys.dm_db_index_operational_stats by summing the leaf_insert_count, leaf_update_count, leaf_delete_count and leaf_ghost_count columns. However, the count columns do not track modifications in the same manner as sys.sysrscols and sys.system_internals_partition_columns, as we’ll see in the next section.

For systems running SQL Server 2008R2 SP2 or SQL Server 2012 SP1, we can also use the new sys.dm_db_stats_properties DMF to monitor table modifications, using the modification_counter column; we’ll examine this approach next.

How SQL Server tracks data modifications internally (demo)

One of the easiest ways to illustrate how SQL Server tracks modifications is to walk through an example of how the system tables and DMVs change, in response to record modifications, using sys.system_internals_partition_columns, sys.dm_db_index_operational_stats or sys.dm_db_stats_properties. We’ll then discuss how you can use this information to start measuring the rate of change of data in your tables.

First, create a copy of the Sales.SalesOrderDetail table in the AdventureWorks2012 database and insert 100 rows, as shown in Listing 5. The CHECKPOINT statement after the INSERT statement ensures that we flush from cache any modifications to the system tables, so that they appear in the hidden system table output.

Listing 5: Creating and populating Sales.TestSalesOrderDetail
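
A sketch of the setup (simplified: the ISNULL expression strips the IDENTITY property from SalesOrderDetailID so that later inserts into the copy are straightforward, and the computed LineTotal column becomes an ordinary column):

-- Create an empty copy of the table, as a heap
SELECT SalesOrderID,
       ISNULL(SalesOrderDetailID, 0) AS SalesOrderDetailID,
       CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID,
       UnitPrice, UnitPriceDiscount, LineTotal, rowguid, ModifiedDate
INTO Sales.TestSalesOrderDetail
FROM Sales.SalesOrderDetail
WHERE 1 = 0;
GO

-- Insert the first 100 rows
INSERT INTO Sales.TestSalesOrderDetail
SELECT TOP (100) SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty,
       ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, LineTotal, rowguid, ModifiedDate
FROM Sales.SalesOrderDetail
ORDER BY SalesOrderID, SalesOrderDetailID;
GO

-- Flush in-memory modification counters to the system tables
CHECKPOINT;
GO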

The first two queries in Listing 6 interrogate sys.system_internals_partition_columns and sys.dm_db_index_operational_stats, respectively, to return the number of recorded modifications for the Sales.TestSalesOrderDetail table. The former tracks modifications per column and, as expected, each column shows 100 modifications. The latter simply returns the total number of modifications, whether they are inserts, updates or deletes.

The final query against sys.dm_db_stats_properties returns no rows, confirming that no statistics currently exist. In this example, we use a statistic with stats_id of 1, but the query returns no rows regardless of the stats_id used.


Listing 6: Querying row changes and statistics for Sales.TestSalesOrderDetail
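
A sketch of the three queries (database and object names as above):

-- 1. Per-column modification counts (hidden internal view; no DAC required)
SELECT p.index_id, pc.partition_column_id, pc.modified_count
FROM sys.system_internals_partition_columns AS pc
JOIN sys.partitions AS p
  ON pc.partition_id = p.partition_id
WHERE p.object_id = OBJECT_ID(N'Sales.TestSalesOrderDetail');

-- 2. Total leaf-level modifications, summed from the operational stats DMF
SELECT ios.index_id,
       ios.leaf_insert_count + ios.leaf_update_count
       + ios.leaf_delete_count + ios.leaf_ghost_count AS leaf_modifications
FROM sys.dm_db_index_operational_stats(
         DB_ID(N'AdventureWorks2012'),
         OBJECT_ID(N'Sales.TestSalesOrderDetail'),
         NULL, NULL) AS ios;

-- 3. Statistics properties (returns no rows until a statistic exists)
SELECT sp.stats_id, sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.dm_db_stats_properties(OBJECT_ID(N'Sales.TestSalesOrderDetail'), 1) AS sp;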

Now, let’s create a clustered index, insert 100 more rows, then re-run the same three queries.


Listing 7: Create a clustered index on Sales.TestSalesOrderDetail and insert 100 more rows
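
A sketch (the index name is illustrative, and is reused in the later listings):

-- The heap becomes a clustered index (index_id changes from 0 to 1)
CREATE CLUSTERED INDEX CIX_TestSalesOrderDetail
ON Sales.TestSalesOrderDetail (SalesOrderID, SalesOrderDetailID);
GO

-- Insert 100 more rows
INSERT INTO Sales.TestSalesOrderDetail
SELECT TOP (100) SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty,
       ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, LineTotal, rowguid, ModifiedDate
FROM Sales.SalesOrderDetail
ORDER BY SalesOrderID DESC, SalesOrderDetailID DESC;
GO
CHECKPOINT;
GO
-- ...then re-run the three queries from Listing 6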

First, notice that the Index ID value changed; in the original output from sys.system_internals_partition_columns it was 0, because the table was a heap. Now that it’s a clustered index, the Index ID is 1.

Second, notice that the output from sys.dm_db_index_operational_stats shows a value of 100 for modifications, but sys.system_internals_partition_columns shows a value of 200 for modifications. This is because even though the Index ID changed, the modification tracking was maintained in sys.system_internals_partition_columns. The output from sys.dm_db_index_operational_stats is the result of directly reading the data, and here we went from a heap to a clustered index, which caused the data to physically move.

Let’s see how the counters change when we selectively update columns.


Listing 8: Update OrderQty for one row on Sales.TestSalesOrderDetail
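
A sketch (any single row will do; TOP (1) without an ORDER BY picks an arbitrary row):

-- Modify a non-key column in one row
UPDATE TOP (1) Sales.TestSalesOrderDetail
SET OrderQty = OrderQty + 1;
GO
CHECKPOINT;
GO
-- ...then re-run the three queries from Listing 6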

In sys.system_internals_partition_columns the Modifications column (i.e. the modified_count counter) increased by one for Index_Column_ID 4. Likewise, in sys.dm_db_index_operational_stats, the modifications increased by one.

However, the modification_counter in sys.dm_db_stats_properties did not change. Let’s see what happens if we modify a column that’s in the clustering key.


Listing 9: Update a clustered index key for one row in Sales.TestSalesOrderDetail
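
A sketch, modifying a column that belongs to the clustering key defined in the Listing 7 sketch:

-- Modify a clustering-key column in one row
UPDATE TOP (1) Sales.TestSalesOrderDetail
SET SalesOrderID = SalesOrderID + 1;
GO
CHECKPOINT;
GO
-- ...then re-run the three queries from Listing 6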

Now we see that the counter changed in sys.dm_db_stats_properties, as did the counter in sys.dm_db_index_operational_stats and the counter for Index_Column_ID 1 in sys.system_internals_partition_columns.

The counter in sys.dm_db_stats_properties will only change if the modification occurs for any column in the index key. If you look closely, you might notice that the modification counter in sys.dm_db_index_operational_stats jumped from 101 to 103 after this update, even though we only updated one row. If we split out the columns that we have been aggregating in sys.dm_db_index_operational_stats, as shown in Listing 10, we see that we have an insert and a ghost record.


Listing 10: Splitting out leaf level updates, deletes and ghosts
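
A sketch that returns the individual leaf-level counters rather than their sum:

SELECT ios.index_id,
       ios.leaf_insert_count,
       ios.leaf_update_count,
       ios.leaf_delete_count,
       ios.leaf_ghost_count
FROM sys.dm_db_index_operational_stats(
         DB_ID(N'AdventureWorks2012'),
         OBJECT_ID(N'Sales.TestSalesOrderDetail'),
         NULL, NULL) AS ios;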

A ghost record is one that is marked as deleted but has yet to be removed. Once the ghost cleanup process removes it, the count will decrease by one, and the leaf level delete column will increase. This occurs because we modified the cluster key, and even though it was an update of the column, because it is part of the index key, it is actually a delete and an insert. Paul Randal demonstrates this in detail in his post, Do changes to index keys really do in-place updates?, and I talk about this further in my post, SQL University – Internals and Updates.

The important point is that sys.system_internals_partition_columns, sys.dm_db_index_operational_stats, and sys.dm_db_stats_properties track modifications differently. If you plan to use any of these to track modifications for a statistic, and then decide whether or not to update a statistic based on the number of modifications, it is critical that you understand the differences.

Within sys.dm_db_index_operational_stats, SQL Server tracks all modifications for a table, regardless of the column modified. Updates of columns in the key will show up twice if you are aggregating all the leaf level counters, which will inflate the modification counter.

Within sys.dm_db_stats_properties, SQL Server tracks only changes to columns in the key. It will not track any changes to columns that are not in the key, even for a clustered index. However, for purposes of statistics, this is acceptable, as the query optimizer is using the statistic specific to a column or set of columns. Changes to column 3 do not in any way affect a statistic for columns 1 and 2.

Note that updating a statistic will not reset counters in sys.dm_db_index_operational_stats, but will reset the modification_counter in sys.dm_db_stats_properties, as demonstrated in Listing 11.


Listing 11: Effect on modification counters of updating statistics
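
A sketch, using the illustrative index name from Listing 7:

UPDATE STATISTICS Sales.TestSalesOrderDetail CIX_TestSalesOrderDetail WITH FULLSCAN;
GO

-- modification_counter is now back to 0
SELECT sp.stats_id, sp.last_updated, sp.modification_counter
FROM sys.dm_db_stats_properties(OBJECT_ID(N'Sales.TestSalesOrderDetail'), 1) AS sp;

-- ...but the leaf-level counters in sys.dm_db_index_operational_stats are unchanged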

Rebuilding a clustered index will reset the counters in both sys.dm_db_index_operational_stats and sys.dm_db_stats_properties.

Listing 12: Rebuilding the clustered index resets all row modification counters
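
A sketch (again using the illustrative index name from Listing 7):

ALTER INDEX CIX_TestSalesOrderDetail ON Sales.TestSalesOrderDetail REBUILD;
GO
-- Re-running the queries from Listings 6 and 10 now shows both sets of counters reset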

Tracking modifications for a statistic is very easy using sys.dm_db_stats_properties, and is the route I recommend if you’re using SQL Server 2008R2 SP2 or SQL Server 2012 SP1.

How you can start measuring rate of modification

In my previous post about this DMF, New Statistics DMF in SQL Server 2008R2 SP2, I included a query that lists statistics information, including modifications, for all statistics in a database. Listing 13 shows two variations of the query; the first returns information for all statistics that exist for user objects, with the output ordered oldest statistics first. The second lists statistics for which more than 10% of the total number of rows have been modified (based on the modification_counter). Adjust this threshold as needed (e.g. 5% of rows modified).

Listing 13: Statistics information, including modifications, for all statistics in a database
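
The queries below are a sketch of the two variations (the 10% threshold is hard-coded and can be adjusted):

-- 1. All statistics on user tables, oldest first
SELECT OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
       OBJECT_NAME(s.object_id) AS table_name,
       s.name AS statistic_name,
       sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.last_updated;

-- 2. Statistics where more than 10% of the rows have changed since the last update
SELECT OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
       OBJECT_NAME(s.object_id) AS table_name,
       s.name AS statistic_name,
       sp.last_updated, sp.rows, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
  AND sp.rows > 0
  AND sp.modification_counter > 0.1 * sp.rows
ORDER BY sp.modification_counter DESC;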

Whether you’re looking for statistics with the highest number of modifications, the largest percentage of modifications (modification_counter/rows), or simply those that have not been updated in a while, the sys.dm_db_stats_properties DMF can simplify this data collection and help automate your analysis.

Summary

Statistics play a vital role in performance, and the introduction of the sys.dm_db_stats_properties DMF gives DBAs an easy way to capture additional information about database statistics. While many systems rely on SQL Server to update statistics automatically, and consistently perform well, exceptions such as very large tables (where the automatic update occurs too infrequently), tables with uneven data distributions, tables with ever-increasing keys, and tables with significant changes in distribution often require manual updates. Tracking modifications for those tables, and then updating statistics when the modifications reach a specific threshold, can prevent performance problems from occurring and reduce the manual effort required of the Database Administrator to manage statistics.