Over the last many weeks we have written quite a few blog post about query debugging, using Extended Events, Blocking sampling and other techniques. Today we are taking a few steps back, and demonstrate how you could start looking at performance problems on a system that you don’t know anything about.
One of the first things that I always start looking at is wait stats, by looking at a systems wait statistics I get a quick idea about where to start looking next – is it IO (Slow IO Sub system or missing indexes), is it memory or something complete different. To do this I use sys.dm_os_wait_stats, this gives me instance wide wait stats, accumulated since SQL Server was last restarted or since the stats was cleared. You can read more about the sys.dm_os_wait_stats DMV here.
Looking at the raw output from sys.dm_os_wait_stats doesn’t give you much to work with, you need to apply some kind of black magic. I am not going to take the credits for the following script, it comes from a blog that belongs to guy called Glen Berry. Glen publishes a collection of scripts every months that are very useful. Here is the magic:
WITH Waits AS ( SELECT wait_type, wait_time_ms / 1000. AS wait_time_s, 100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct, ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn FROM sys.dm_os_wait_stats WHERE wait_type NOT IN('KSOURCE_WAKEUP', 'SLEEP_BPOOL_FLUSH', 'BROKER_TASK_STOP', 'XE_TIMER_EVENT', 'XE_DISPATCHER_WAIT', 'FT_IFTS_SCHEDULER_IDLE_WAIT', 'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'BROKER_EVENTHANDLER', 'LAZYWRITER_SLEEP', 'BAD_PAGE_PROCESS', 'BROKER_TRANSMITTER', 'CHECKPOINT_QUEUE', 'DBMIRROR_EVENTS_QUEUE', 'LAZYWRITER_SLEEP', 'ONDEMAND_TASK_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'LOGMGR_QUEUE', 'SLEEP_TASK', 'SQLTRACE_BUFFER_FLUSH', 'CLR_MANUAL_EVENT', 'BROKER_RECEIVE_WAITFOR', 'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS', 'BROKER_TO_FLUSH') ) SELECT W1.wait_type, CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s, CAST(W1.pct AS DECIMAL(12, 2)) AS pct, CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct FROM Waits AS W1 INNER JOIN Waits AS W2 ON W2.rn <= W1.rn GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct HAVING SUM(W2.pct) - W1.pct < 95;
and if we take a look at the output that this script gives on my no-action laptop, it look like this:
the wait types that you see here is more or less useless, there is no SQL Server activity on my laptop, but try running the script on your system, and hopefully you’ll se something way different.
To get more info about Wait Types, and what they mean to your system, here is an excellent link that explains that in detail.
Wait stats over time
The above example show how you can look at wait stats here and now, but how about the user that calls You Monday morning and complains about poor performance during the weekend – the above script is not really going to help you with that. I a situation like that is would be really nice with wait stat information from the weekend, even better if they could be broken down into small timeslots. If you could see the change in the wait stats between 11:00 and 11:15 that might give you an idea where to start to look for the problems.
Lets try to build a solution that makes such information available for you. First off all, I will create a table to store the sample data in.
Create table InstanceWideWaitStats ( id bigint identity(1,1) primary key, wait_type nvarchar(120), waiting_tasks_count bigint, wait_time_ms bigint, max_wait_time_ms bigint, signal_wait_time_ms bigint, SampleTime datetime2 default sysdatetime() )
Now let’s write some TSQL that writes some data into this table:
insert into InstanceWideWaitStats (wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms) select wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms from sys.dm_os_wait_stats
This code is super simple, the only thing that I actually do is to sample the DMV that I mentioned earlier in this blog post. What I normally do it the schedule this to run every 15 minutes with the SQL Server Agent. When that is done we are ready to have a look at the data.
The information that I am interested in is the column called wait_time_ms – the same column from the last sample that we did. This gives me the amount of wait time of a given Wait Type between two samples, which is 15 minutes if You run the job every 15 minutes. Let’s have a look at the TSQL:
with cte as ( select id, wait_type, wait_time_ms, SampleTime, ROW_NUMBER() OVER(PArtition by wait_type order by SampleTime) as rn from InstanceWideWaitStats ) select t1.wait_type, t2.wait_time_ms - t1.wait_time_ms as WaitTimeInMS, t2.SampleTime from cte t1 inner join cte t2 on (t1.wait_type = t2.wait_type) and (t1.rn = t2.rn -1) where t2.wait_time_ms - t1.wait_time_ms <> 0
This is SQL Server 2008 syntax, SQL Server 2012 comes with a new feature called LAG / LEAD. If you are running 2012 your code would be much more simple, and could look something like this:
SELECT wait_type, SampleTime, wait_time_ms - LAG(wait_time_ms, 1, null) OVER(PArtition by wait_type order by SampleTime) FROM InstanceWideWaitStats where wait_type = 'BROKER_TO_FLUSH'
In a real world examples you would consider to have a where clauses or something on the SampleTime column, to avoid to get to much data and only to look at the period you need. Let’s have a look at the output from the 2008 syntax query:
wuupa, here goes the magic – with this little simple setup we are now able to look at the wait stats and how they looked earlier. As mentioned earlier, this is not a silver bullet, this is just a very good place to start looking when you have performance issues with your SQL Server.
In the next blog post I’ll demonstrate how you can visualize this data, and make it quite nice and easy to look at with Excel.