Signal Waits vs. Resource Waits
During my presentation at SQLSaturday#59, I spoke about the categories of wait types, such as Resource Waits, Signal Waits, External Waits, and Queue Waits.
So just to define them here, as I did in my session:
Resource Waits occur when a worker requests access to a resource that is not available because it is being used by another thread, or not yet available. These are the most common types of waits, and surface as locks, latches, network and I/O.
Signal Waits are the time a session id spends waiting for the CPU to become available to run it. The time spent here is known as the runnable queue. Since there is no specific counter to measure CPU Pressure, one must measure the signal wait time. signal_wait_time is the difference between time the waiting thread was signaled and when it started running
Queue waits occur when a worker is idle, waiting for work to be assigned. This wait type is most typically seen with system background tasks such as the deadlock monitor and deleted record cleanup tasks.
External Waits occur when a SQL Server worker is waiting for an external event, such as an extended stored procedure call or a linked server query, to finish.
We typical measure Resource Waits, all the threads that are put into a suspended queue and waits there until the resource becomes available, and Signal Waits, again the time waiting for the cpu.
Because the column values in the sys.dm_os_wait_stats dmv are present for total wait time and signal wait time, we can get the total resource wait time by subtracting the signal wait from wait time.
sys.dm_os_wait_stats DMV: http://technet.microsoft.com/en-us/library/ms179984.aspx
During my recent presentation on Waits & Queues, we were fortunate to have Thomas LaRock, aka SQLRockStar in the room. When I began to talk about signal vs. resource waits, I demoed a simple script which derived the overall percentage of signal waits vs. resource waits aggregated since the last SQL Server restart. The result of the script was 15% signal waits vs. 85% resource waits. To the audience, this possibly misrepresented that the fact the overwhelming percentage of resource waits indicated this was an area of concern. (When in fact this is not necessarily the case.)
Tom asked a question to try to put this in perspective, as to what this actually is showing, and why this is useful data. My intention was to simply show the percentage ratio of signal waits to resource waits on a server. Since there is no counter to specifically measure CPU pressure, the percentage of signal wait time is the proper way to measure this. Although I may have been between LaRock and a hard place (how many times has he heard that one before :-), he made a legitimate observation. We continued to discussion after the presentation, and I definitely could have explained it better.
Although we can specifically say that a percent threshold starting around 35% or higher begins to indicate there is pressure on the CPU, as threads in the runnable queue wait longer periods of time for the CPU to process the queries. Rather than waiting for a resource to become available, the high wait time for the CPU, shows excessive pressure there. It is the time waiting for the processor to actually do the work.
The correct way to explain it would be that, in a healthy OLTP system, there will ALWAYS be a waiting queue for resources, and it is the individual high wait times for each resource wait that indicates a bottleneck, whether it is I/O, Memory, Disk, Network etc.
So, the higher percentage of Resource Waits, as compared to the Signal Waits, is indication of a healthy OLTP SQL Server. However, if the percentage is much higher on the Signal waits, then we know we the bottleneck is the CPU processors themselves. It could indicate that there are too few cores, to keep up with a given workload.
At this point, we must consider more or faster CPU’s to keep up with the requests. We also can take a look at our indexing strategy to avoid unneeded sorts, order and group bys’ in our queries. We also want to reduce the number of excessive joins and compilations that force the query to utilize more cpu.
So, in sum, to understand the output of this simple query (shown below), we want to see more resource waits than signal waits. It is ALL the resource waits vs. signal waits.
To understand how resource waits indicate a contention issue, we must go further to understand where the issue lies, and how long and why it’s waiting on the particular resource.
Here is the script, I used for reference in my demo
--Signal Waits vs. Resource Waits.
,'%signal waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))
,resourceWaitTimeMs=sum(wait_time_ms - signal_wait_time_ms)
,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))
----------------------------------------------------------- End Script
Final note. Tom and I both agree that using the Waits & Queues methodology (also more specifically known as Wait Time Analysis) is the surefire way to go when troubleshooting slow performance issues.
This I made clear in my presentation, and offered many ways to demonstrate this.