Why is that SQL Server Instance under stress?

There are several reliable indications, using SQL queries, of what is causing SQL Server performance problems. Some of these are fairly obvious, but others aren’t. Grant shows how you can get clues from any SQL Server as to the cause of stress.

When you hit performance problems with a SQL Server instance, a few measures will tell you enough about their general nature that you can then quickly focus in on the actual cause. There are many different metrics that you can use to understand what’s up with your SQL Server instance, and each of them will give you a positive indication of a particular kind of stress: internal or external memory pressure, undue or uneven CPU loading, or IO bottlenecks. Once you have a positive indication of where to look for the problem, you can drill in to the detail. If, for example, you determine that you have excessive CPU stress in SQL Server, the next step is to determine which query is consuming the most CPU.

So what are these general indications? It is useful to start with straightforward information such as total processor time, a percentage measure of how heavily you’re using the CPUs on your machine, or the average read time for your disks, which tells you how long it takes to read information from them. When you’ve done these easy ones, however, there are other methods of finding out what’s happening on your system that aren’t as easy, but which yield even more useful information. Let’s explore a few of these in more detail.

Is your instance experiencing memory stress?

Sys.dm_os_ring_buffers

The ring buffers within the operating system are simply a collection point for certain types of system messages. Many of these are obscure, often related to the management of the ring buffers themselves, but a few of the buffer messages are extremely interesting. For example, the operating system is aware of when it is running low on memory. When this happens, an entry is made into the ring buffers. You can query the dynamic management object sys.dm_os_ring_buffers to find out whether you have had a memory alert.

There are actually two kinds of memory alerts. You can get a memory alert for the physical memory of the machine you’re running on. From a SQL Server point of view, this is referred to as external memory, since it’s not the memory managed by the SQL Server service. You also get alerts for the virtual memory: the memory that is managed by the SQL Server service. We usually refer to this as internal memory. When either is running low, you will see an alert posted into the ring buffers. You’ll also see alerts when you have enough memory, or if you have a large increase in memory.

You can simply query the DMO like this:
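
-- a straight SELECT against the DMO, with no filtering yet
SELECT dorb.ring_buffer_address,
       dorb.ring_buffer_type,
       dorb.timestamp,
       dorb.record
FROM sys.dm_os_ring_buffers AS dorb;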

That will return the information available, but you will soon find that the majority of the interesting information returned is in the “record” column. This is a text column that stores XML. To really pull together interesting information you’ll want to go for a more thorough query along these lines:
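
-- A sketch of such a query. The XML elements pulled out here (Notification,
-- the indicator flags, the MemoryNode and MemoryRecord values) are typical
-- of RING_BUFFER_RESOURCE_MONITOR records, but the exact set available can
-- vary between versions of SQL Server.
WITH RingBuffer
AS (SELECT CAST(dorb.record AS XML) AS xRecord,
           dorb.timestamp
    FROM sys.dm_os_ring_buffers AS dorb
    WHERE dorb.ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR')
SELECT xr.value('(ResourceMonitor/Notification)[1]', 'varchar(75)') AS RmNotification,
       xr.value('(ResourceMonitor/IndicatorsProcess)[1]', 'tinyint') AS IndicatorsProcess,
       xr.value('(ResourceMonitor/IndicatorsSystem)[1]', 'tinyint') AS IndicatorsSystem,
       -- convert the tick-based timestamp to a readable date and time
       DATEADD(ss, CONVERT(INT, (rb.timestamp - dosi.ms_ticks) / 1000), GETDATE()) AS RmDateTime,
       xr.value('(MemoryNode/TargetMemory)[1]', 'bigint') AS TargetMemory,
       xr.value('(MemoryNode/CommittedMemory)[1]', 'bigint') AS CommittedMemory,
       xr.value('(MemoryRecord/TotalPhysicalMemory)[1]', 'bigint') AS TotalPhysicalMemory,
       xr.value('(MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS AvailablePhysicalMemory
FROM RingBuffer AS rb
CROSS APPLY rb.xRecord.nodes('Record') AS nd(xr)
CROSS JOIN sys.dm_os_sys_info AS dosi;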

With this query I’ve first created a common table expression (CTE) called RingBuffer. In it I do only two things. First, I filter for a particular ring buffer type, ‘RING_BUFFER_RESOURCE_MONITOR’; this is where the memory messages are going to occur. Second, I CAST the ‘record’ column from text to XML. From there I reference the CTE within the main query and use XQuery commands to pull out all the interesting information from the XML data.

As an extra complication, the timestamp column in sys.dm_os_ring_buffers isn’t a human-readable datetime value; it’s based on CPU ticks, so you’ll have to use a conversion formula, like the DATEADD calculation in the query above, to turn that data into a readable date and time.

To use sys.dm_os_ring_buffers as part of a monitoring process, you just have to look for the two events, RESOURCE_MEMPHYSICAL_LOW and RESOURCE_MEMVIRTUAL_LOW. These are available in the ResourceMonitor/Notification property within the XML. They are an absolute indicator of a low memory condition on your machine, so if you get the alert, you were low on either external/OS/physical memory or internal/SQL Server/virtual memory.

Is the system under load?

One question constantly comes up: ‘is the system under load or not?’

There are a number of different ways to try to understand this, but only a few that will let you know with certainty whether you’re under load.

Sys.dm_os_workers

One of my favorite ways of determining precisely how much work is going on within the system is to look at sys.dm_os_workers. This measure won’t tell you what it is that is placing load on the system; nor will it allow you to understand what the implications of the load are. However, it’s a perfect measure of load on the system.

There is a lot of information returned from sys.dm_os_workers. The DMO returns information about the worker threads within the SQL Server operating system (SQLOS). You can view information about each worker, such as its last wait type, whether or not it has an exception, how many context switches it has experienced; all sorts of stuff. The Books Online entry for the DMO even shows you how to determine how long a process has been in a runnable state.

But to use this as a measure for load, your query is extremely easy:
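
-- a simple count of the workers that are currently running
SELECT COUNT(*) AS RunningWorkers
FROM sys.dm_os_workers AS dow
WHERE dow.state = 'RUNNING';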

It really is as simple as that. As this number goes up or down, the load on your system is going up or down. If there are lots of ‘RUNNING’ worker processes today that weren’t there yesterday, you have an increased load on the system. But bear in mind, you need to be able to compare this over time. Simply capturing a single number won’t tell you anything; you have to be able to compare between two values.

Sys.dm_os_schedulers

Another way to measure system load is to take a look at the schedulers. These are the processes that manage the worker processes. Again, this is an absolute measure of load on your server. It can tell you how much work is being done within the system.

Querying the schedulers yields a lot of interesting information about how the system is being managed. You can see how many workers are being handled by any particular scheduler. You can see how many times that scheduler has yielded the CPU (given up access so that another process can run, because each process gets only limited time on a CPU), a count of the workers that are currently active within the scheduler, and several other details.

But to see a measure of load you can run a very simple query:
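
-- One reasonable version, assuming runnable_tasks_count (tasks that are
-- waiting for time on the CPU) as the measure of load, summed across the
-- visible online schedulers that perform user work.
SELECT SUM(dos.runnable_tasks_count) AS RunnableTasks
FROM sys.dm_os_schedulers AS dos
WHERE dos.status = 'VISIBLE ONLINE';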

Once more, this number is only meaningful if you can compare it to previous values. Using either the workers or the schedulers as a measure of system load is only as accurate as your baseline values, but if you maintain a collection of these values over time, you’ll be able to determine load on the operating system.

Has SQL Server got enough memory?

DBCC MEMORYSTATUS

This is a simply amazing collection of data. What you get is an output of all the various pieces of memory management within SQL Server. You’ll be able to see where every bit of memory has been allocated and managed within SQL Server. The command is frequently used when you’re working with Microsoft customer support to troubleshoot specific issues. But it’s another way of absolutely determining how well you’re doing with memory on your system.

If you simply run the command:
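
DBCC MEMORYSTATUS;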

You will see all the various memory allocation and management processes within SQL Server. All of them. In fact, there’s so much information there that it quickly becomes pointless to try to make sense of it all. The good news is, it’s possible to target a few particular pieces of information. If we specifically go after the Target Committed value and the Current Committed value, we can determine whether SQL Server has enough memory. It’s simple: if the Target value is higher than the Current value, you don’t have the memory you need within SQL Server. But retrieving these values is a little bit of a pain. Here’s one way to do it:
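
-- A sketch of the approach described below. It assumes that each result
-- set returned by MEMORYSTATUS WITH TABLERESULTS has two columns, and that
-- the rows of interest are labeled 'Target Committed' and 'Current
-- Committed'; both details are subject to change between versions.
DECLARE @MemoryStatus TABLE
    (ItemName VARCHAR(100),
     ItemValue VARCHAR(100));

INSERT INTO @MemoryStatus
    (ItemName, ItemValue)
EXEC ('DBCC MEMORYSTATUS WITH TABLERESULTS');

WITH MemVal
AS (SELECT ROW_NUMBER() OVER (ORDER BY ms.ItemName DESC) AS RowNum,
           ms.ItemName,
           CAST(ms.ItemValue AS BIGINT) AS ItemValue
    FROM @MemoryStatus AS ms
    WHERE ms.ItemName IN ('Target Committed', 'Current Committed'))
-- descending name order puts 'Target Committed' at RowNum 1 and 'Current
-- Committed' at RowNum 2, so the result is negative when the target is
-- higher than the memory currently committed
SELECT cur.ItemValue - tgt.ItemValue AS CurrentVsTargetKB
FROM MemVal AS tgt
JOIN MemVal AS cur
    ON cur.RowNum = tgt.RowNum + 1;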

I’m creating a table variable and then loading all the output from MEMORYSTATUS into it, using the TABLERESULTS option to ensure that the output is a table. By using a CTE to define the information selected from the table variable, I can reference it twice in the SELECT statement and just JOIN the two values based on the ROW_NUMBER. It really works. If you get a negative value, you’re looking at memory issues.

Understand, DBCC MEMORYSTATUS is meant as a support mechanism for Microsoft. It’s not a part of the standard tool set. This means that it absolutely is subject to unpublished changes from one version of SQL Server to the next or even one service pack to the next. Use it to diagnose your memory issues with this understanding firmly in place.

Do I need a better, faster disk system?

Sys.dm_io_virtual_file_stats

This dynamic management object returns statistics about the behavior of the files that make up your databases. The most interesting pieces of information here are the stalls, or waits, that it gathers and makes available. If you simply run the query:
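
SELECT divfs.*
FROM sys.dm_io_virtual_file_stats(DB_ID(DB_NAME()), NULL) AS divfs;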

You must pass it two pieces of information: the database id, which I’m getting by using DB_NAME to identify the database that I’m currently attached to and then passing that to DB_ID, and the file id. For the file id you can pass the value NULL, as I did, which will return all the files for that database.

The information returned is excellent, especially four columns: sample_ms, io_stall_read_ms, io_stall_write_ms, and io_stall. Let’s take a look at what these represent and you’ll quickly understand how interesting they are to you as a DBA. The sample_ms is very straightforward. It’s the time since SQL Server was restarted. It provides you with the measuring information for understanding all the other values. Next is io_stall_read_ms. This represents the amount of time that processes were forced to wait on reads from this file. If you combine io_stall_read_ms and sample_ms, you get a precise measure of the percentage of time that your apps are waiting to read from a particular file on a particular database. You also get io_stall_write_ms, which represents the amount of time that processes have had to wait for writes. You can gather this metric over time to see how it grows, or use sample_ms in the same way you did with the reads. Finally, io_stall shows the amount of time that waits occurred on that file for any IO operation. Again, you can gather it over time to see how it’s growing (because it will only ever grow) or you can get a percentage of time spent waiting on disks by comparing it to sample_ms.

These measures will tell you exactly how severe the issues are regarding IO waits on your system. But they won’t pinpoint specific queries. Instead, this measure is primarily focused on determining whether there are issues with your system: do you need more disks, faster disks, and so on?

How well is the CPU performing?

Sys.dm_os_wait_stats

I have this listed as an obscure method for gathering performance metrics, but it really shouldn’t be. By now everyone should have heard that understanding what the server is waiting on is a great way to understand what is causing the server to run slow. But I still see lots of people being shocked that they can find this information.

Sys.dm_os_wait_stats shows you an aggregated view of what the server has been waiting on since the last time it was started (or since the wait stats were cleared). This information is broken down by particular wait types, some of which are obscure indeed. I won’t attempt to document them here; even Microsoft doesn’t supply complete documentation on these, and you will need to rely on internet searches to identify what some wait types represent. Others are documented in Books Online, so please use that excellent resource.

To query sys.dm_os_wait_stats you run a query like this:
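
-- all five columns, with the biggest cumulative waits first
SELECT *
FROM sys.dm_os_wait_stats AS dows
ORDER BY dows.wait_time_ms DESC;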

The output is only five columns: wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms and signal_wait_time_ms. The only one of these that is not immediately understandable from its name is signal_wait_time_ms. This column represents the amount of time between the moment a thread was signaled that it could run and the moment it actually started to execute. This time is included in the total time, wait_time_ms. The signal_wait_time_ms, then, is actually a measure of the time spent waiting to get at the CPU. This makes it an excellent measure of how much load your CPU is carrying. Because of this, while you should absolutely look at wait statistics in general, you should always look at signal_wait_time_ms separately in order to understand how well your CPU is performing. You can get quite sophisticated with this, or you can simply focus in like the following:
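
-- the cumulative time that threads have spent waiting for the CPU since
-- the counters were last cleared
SELECT SUM(dows.signal_wait_time_ms) AS TotalSignalWaitTime
FROM sys.dm_os_wait_stats AS dows;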

This will represent a cumulative total of all the waits on CPU that have occurred on the system. It’s an excellent indicator of load over time. You will need to compare one measure to another to see how it’s growing.

Conclusion

These are just a few examples of the general areas of ‘stress’ that you can check in order to quickly focus in on particular aspects of the system and understand what’s going on. With these measures you can quickly confirm or eliminate possible causes of performance problems.

Each of these measures provides a sufficiently positive indication to give you confidence that, say, you have memory pressure or your CPU is under load. Once you understand the general nature of the stress, you’ll need to go to other, more standard metrics in order to understand specifics, such as which query is consuming the most CPU.