This is part 2 of an 8 part series of articles about speeding up access to a SQL Server database. This series is based on chapter 8 "Speeding up Database Access" of my book ASP.NET Site Performance Secrets, available at amazon.com and other book sites.
In part 1, we saw how to pinpoint missing indexes and expensive queries. In this part 2, we'll pinpoint a number of other bottlenecks, including locking issues, lack of execution plan reuse, fragmentation and hardware issues. Parts 3 through 8 will show how to fix the bottlenecks we found in parts 1 and 2:
Part 1 Pinpointing missing indexes and expensive queries
Part 2 Pinpointing other bottlenecks
Part 3 Fixing missing indexes
Part 4 Fixing expensive queries
Part 5 Fixing locking issues
Part 6 Fixing execution plan reuse
Part 7 Fixing fragmentation
Part 8 Fixing memory, disk and CPU issues
In a database with lots of queries executing, some queries may try to access the same resource, such as a table or index. You wouldn't want one query to read a resource while another is updating it, otherwise you could get inconsistent results.
To stop a query from accessing a resource, SQL Server locks the resource. This will inevitably lead to some delays as queries wait for a lock to be released. To find out whether these delays are excessive, check the following performance counters on the database server with perfmon:
Total Latch Wait Time (ms) - Total wait time in milliseconds for latches in the last second.
Lock Timeouts/sec - Number of lock requests per second that timed out. This includes requests for NOWAIT locks.
Lock Wait Time (ms) - Total wait time in milliseconds for locks in the last second.
Number of Deadlocks/sec - Number of lock requests per second that resulted in a deadlock.
Perfmon comes with Windows, so you already have it on your computer. Issue the command "perfmon" from the command prompt. After it has loaded, click the Plus button in the toolbar, then select the category - such as SQLServer:Latches - in the Performance object drop down, and then add the counter - such as Wait Time (ms).
A high number for Total Latch Wait Time (ms) indicates that SQL Server is waiting too long for its own synchronization mechanism. Lock Timeouts/sec should be 0 during normal operation and Lock Wait Time (ms) very low. If they are not, queries are waiting too long for locks to be released.
Finally, Number of Deadlocks/sec should be 0. If not, you have queries waiting on each other to release a lock, preventing either to move forward. SQL Server eventually detects this condition and resolves it by rolling back one of the queries, which means wasted time and wasted work.
If you found locking issues, in part 5 you'll see how to determine which queries cause the excessive lock wait times, and how to fix the problem.
Execution Plan Reuse
Before a query is executed, the SQL Server query optimizer compiles a cost effective execution plan. This takes many CPU cycles. Because of this, SQL Server caches the execution plan in memory, in the plan cache. It then tries to match incoming queries with those that have already been cached.
In this section you'll see how to measure how well the plan cache is being used. If there is room for improvement, you'll see in part 6 how to fix this.
Start by checking the following performance counters on the database server with perfmon:
Category: Processor (_Total)
% Processor Time - The percentage of elapsed time that the processor is busy.
Category: SQL Server:SQL Statistics
SQL Compilations/sec - Number of batch compiles and statement compiles per second. Expected to be very high after server start up.
SQL Re-Compilations/sec - Number of recompiles per second.
These counters will show high values at server start up as every incoming query needs to be compiled. The plan cache sits in memory, so doesn't survive a restart. During normal operation in a system where the data doesn't change much, you would expect compilations per second to be less than 100, and re-compilation per second to be close to zero.
However, in a system with very volatile data, it would be normal for these numbers to be much higher. As you'll see in Part 3 Fixing missing indexes, the most optimal execution plan for a query depends on the actual data in the tables the query accesses. So when that data changes often, it makes sense for the execution plans to be recompiled often as well so they remain optimal.
Additionally, when you change the schema, you would expect the execution plans affected by that change to be recompiled as well.
Alternatively, look at the time spent by the server optimizing queries. Because query optimizations are heavily CPU bound, this is almost all time spent by a CPU.
The dynamic management view (DMV) sys.dm_exec_query_optimizer_info gives you the number of query optimizations since the last server restart, and the elapsed time in seconds it took on average to complete them:
occurrence AS [Query optimizations since server restart],
value AS [Avg time per optimization in seconds],
occurrence * value AS [Time spend optimizing since server restart in seconds]
WHERE counter='elapsed time'
Run this query, wait a while and then run it again, to find the time spent optimizing in that period. Be sure to measure the time between runs, so you can work out what proportion of time the server spends optimizing queries.
The DMV sys.dm_exec_cached_plans provides information on all execution plans in the plan cache. You can combine this with the DMV sys.dm_exec_sql_text to find out how often the plan for a given query has been reused. If you get little reuse for an otherwise busy query or stored procedure, you are getting too little benefit out of the plan cache:
SELECT ecp.objtype, ecp.usecounts, ecp.size_in_bytes,
REPLACE(REPLACE(est.text, char(13), ''), char(10), ' ') AS querytext
FROM sys.dm_exec_cached_plans ecp
cross apply sys.dm_exec_sql_text(ecp.plan_handle) est
WHERE cacheobjtype='Compiled Plan'
The column objtype is 'Proc' for stored procedures and 'Adhoc' for ad hoc queries, while column usecounts shows how often a plan has been used.
In part 1 you saw how to identify busy queries and stored procedures.
The data and indexes in a database are organized on disk in 8KB pages. A page is the smallest unit that SQL Server uses to transfer data to or from disk.
When you insert or update data, a page may run out of room. SQL Server then creates another page, and moves half of the contents of the existing page to the new page. That leaves free space in not only the new page but the original page as well. That way, if you keep inserting or updating data in the original page, it doesn't split again and again.
This means that after many updates and inserts, and deletes as well, you'll wind up with lots of pages with empty space. This takes more disk space than needed, but more importantly also slows down reading, because SQL Server now has to read more pages to access data. The pages may also wind up in a different physical order on disk than the logical order in which SQL Server needs to read them. As a result, instead of simply reading each page sequentially right after each other, it needs to wait for the disk head to reach the next page - meaning more delays.
To establish the level of fragmentation for each table and index in the current database, use the following query which uses the DMV dm_db_index_physical_stats:
SELECT o.name AS TableName, i.name AS IndexName, ips.index_type_desc,
DB_ID(), NULL, NULL, NULL, 'Sampled') ips
JOIN sys.objects o ON ips.object_id = o.object_id
JOIN sys.indexes i ON (ips.object_id = i.object_id) AND (ips.index_id = i.index_id)
WHERE (ips.page_count >= 1000) AND (ips.avg_fragmentation_in_percent > 5) AND
(ips.alloc_unit_type_desc <> 'LOB_DATA') AND (ips.alloc_unit_type_desc <> 'ROW_OVERFLOW_DATA')
ORDER BY o.name, i.name
This gives you all tables and indexes that take over 1000 pages and that are more than 5% fragmented. Looking at fragmentation of tables and indexes taking less than a few thousand pages tends to be a waste of time. Fragmentation of 5% may sound small, but the fragmentation might be in a heavily used area of the table or index.
When you see index type CLUSTERED INDEX in an entry, it really refers to the actual table, because the table is part of the clustered index. Index type HEAP refers to a table without a clustered index. This is explained further in Part 3 Fixing missing indexes.
If you found any tables or indexes that are over 5% fragmented and take over a few thousand pages, this may be an issue. Part 7 will show how to defragment tables and indexes. Note that defragmenting is not a silver bullet - it will not necessarily solve any or all performance issues.
To see whether lack of memory is slowing down the database server, check these counters in perfmon:
Pages/sec - When the server runs out of memory, it stores information temporarily on disk, and then later reads it back when needed - which is very expensive. This counter indicates how often this happens.
Category: SQL Server:Buffer Manager
Page Life Expectancy - Number of seconds a page will stay in the buffer pool without being used. The greater the life expectancy, the greater the chance that SQL Server will be able to get a page from memory instead of having to read it from disk.
Buffer Cache Hit Ratio - Percentage of pages that were found in the buffer pool, without having to read from disk.
These counters can tell you whether SQL Server has too little memory. This will cause it to store more data on disk, leading to excessive disk i/o, causing greater stress on the CPU and disk.
If Pages/sec is consistently high, or Buffer Cache Hit Ratio is consistently below say 90%, adding memory may lighten the load on the disk system - see part 8.
If Page Life Expectancy suddenly becomes much lower than what it normally is on your system and stays low, that would be a cause for concern. You should be able to ignore temporary dips in Page Life Expectancy.
A low Page Life Expectancy is often caused by for example queries doing table scans. During a table scan, SQL Server reads the pages making up a table in memory to scan every record, and expels the pages from memory when they are no longer needed - this page churning leads to shorter page life expectancies. Note that the solution here would not be to add memory, but to avoid the table scans by for example adding indexes - see Part 3 Fixing missing indexes.
However, if your system is getting busier over time with Page Life Expectancy decreasing over time, it may be time to add memory - see part 8.
In addition to the performance counters shown above, you could use the DMV dm_os_sys_memory (SQL Server 2008 and higher):
The column system_memory_state_desc shows in human readable form the state of physical memory.
SQL Server is heavily disk bound, so solving disk bottlenecks can make a big difference. If you found memory shortages in the previous section, fix those first, because a memory shortage can lead to excessive disk usage in itself. Otherwise check these counters for each of your disks to see if there is a disk bottleneck for some other reason.
Categories: PhysicalDisk and LogicalDisk
Avg. Disk sec/Read - Average time, in seconds, of a read of data from the disk.
% Disk Time - Percentage of elapsed time that the selected disk was busy reading or writing.
Avg. Disk Queue Length - Average number of read and write requests queued during the sample interval.
Current Disk Queue Length - Current number of requests queued.
If Avg. Disk sec/Read is consistently higher than 0.02 seconds, your disk system would need attention, especially if it is consistently higher than 0.05 seconds. If it goes higher during short periods, that could simply be a case of a lot of work coming in at the same time, so wouldn't be an immediate cause of concern.
If your system has only 1 disk, than if % Disk Time is consistently over 85%, the disk system is severely stressed. However, if you use a RAID array or SAN, % Disk Time could be consistently over 100% without it being a problem.
Avg. Disk Queue Length and Current Disk Queue Length refer to the number of tasks that are queued at the disk controller or are being processed. These counters are less useful for SANs, where queue lengths are difficult to interpret as okay or too high. Otherwise, counter value that are consistently higher than 2 would be an issue. If you use a RAID array where the controller is attached to several disks, multiply this by the number of individual disks in the array - so if there are 2 disks in the array, look for counter values that are consistently higher than 4. Keep in mind that some operations dump a lot of IO transactions in the queue in one go and than do something else while waiting for the disk system.
Part 8 will show how to fix disk issues.
If you found memory or disk issues in the previous sections, fix those first because they will stress the CPU as well. Otherwise check the counters below to see whether the CPU is stressed for another reason.
% Processor Time - Proportion of time that the processor is busy.
Processor Queue Length - Number of threads waiting to be processed.
If % Processor Time is consistently over 75%, or Processor Queue Length consistently greater than 2, the CPU is probably stressed. Part 8 will show how to resolve this.
In this part, we saw how to pinpoint a number of bottlenecks, including locking issues, lack of execution plan reuse, fragmentation issues and memory, disk and CPU issues. In part 3, we'll start to fix bottlenecks by fixing any missing indexes. We'll have an in depth look at how indexes work under the hood and when and when not to use them.
From the Book
This article is taken from Chapter 8 of ASP.NET Site Performance Secrets. Click the image or link to get your copy at Amazon.
Matt Perdeck is the author of the book and this article, and is a senior ASP.NET developer based in Sydney, Australia. He regular participates in building software at rentacoder.com.