Toolbox - IO, IO, Why are You So Slow?

This is the next in a series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA. I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles. I will always give attribution back to the original source and note when I have made modifications.

Troubleshooting performance in SQL Server (or almost any other system) is often an iterative process of discovering a bottleneck, fixing it, and then discovering the next bottleneck.

https://memegenerator.net/instance/65795220/grumpy-cat-5-bottleneck-sighting-optimize-the-system

SQL Server bottlenecks frequently fall into one of three resource categories:

CPU
Memory
I/O (Disk)

For example, you may find that "your instance is slow" (Client-speak for "I'm unhappy - FIX IT!") and upon investigation you see in your Perfmon collector (you have a Perfmon collector running, right???) that CPU is consistently at 90%+. You consider the evidence and realize a set of queries is poorly written (darn SharePoint) but you also realize that in this case you can fix them (whew!), and this helps the CPU numbers drop significantly. You're the hero, right?

...until you realize the instance is still "slow."

https://hakanforss.files.wordpress.com/2014/08/lawofbottlenecks.png

You look at the numbers again, and now you find that disk latency, which had previously been fine, is now completely in the tank during the business day, showing that I/O delays are through the roof.

What happened?

This demonstrates the concept of shifting bottleneck - while CPU use was through the roof, the engine so bogged down that it couldn't generate that much I/O, but once the CPU issue was resolved queries started moving through more quickly until the next choke point was met at the I/O limit. Odds are once you resolve the I/O situation, you would find a new bottleneck.

How do you ever defeat a bad guy that constantly moves around and frequently changes form?

https://movietalkexpress.files.wordpress.com/2016/08/3.jpg

The next concept in this series is the concept of "good enough" - all computing systems have theoretical top speeds (speed of electrons through the ether, etc.) but you never get there. At the end of the day the important question is:

How fast does your system need to be to meet the business SLAs and keep your customers happy (aka QUIET)?

Unfortunately in many cases the answer seems to be this:

https://i.kinja-img.com/gawker-media/image/upload/s--Wtfu6LxV--/c_fill,fl_progressive,g_center,h_450,q_80,w_800/1344998898279631762.jpg

Once you can determine this "good enough" speed (the whole discussion of "Do you have SLAs?" could fill dozens of blog posts) you can determine where an acceptable end point is to your iterative process.

I have a small handful of things I check when a system is "slow" and one of them is disk latency. At its core, disk latency is a calculation of how much delay there is between when SQL Server asks for data and when it gets it back.

The data regarding latency comes from the DMV sys.dm_io_virtual_file_stats, and the relevant metrics are io_stall, io_stall_read_ms, and io_stall_write_ms (table from Technet):

Column name	Data type	Description
database_id	smallint	ID of database.
file_id	smallint	ID of file.
sample_ms	int	Number of milliseconds since the computer was started. This column can be used to compare different outputs from this function.
num_of_reads	bigint	Number of reads issued on the file.
num_of_bytes_read	bigint	Total number of bytes read on this file.
io_stall_read_ms	bigint	Total time, in milliseconds, that the users waited for reads issued on the file.
num_of_writes	bigint	Number of writes made on this file.
num_of_bytes_written	bigint	Total number of bytes written to the file.
io_stall_write_ms	bigint	Total time, in milliseconds, that users waited for writes to be completed on the file.
io_stall	bigint	Total time, in milliseconds, that users waited for I/O to be completed on the file.
size_on_disk_bytes	bigint	Number of bytes used on the disk for this file. For sparse files, this number is the actual number of bytes on the disk that are used for database snapshots.
file_handle	varbinary	Windows file handle for this file.

How do you turn these numbers into something meaningful? As is almost always the case - someone has already done that work - you just need to find it!

Paul Randal (blog/@PaulRandal) has a great blog post "How to examine IO subsystem latencies from within SQL Server" and in that post he references a query created by Jimmy May (blog/@aspiringgeek). Jimmy was at Microsoft for many years and is currently at the SanDisk Data Propulsion Lab.

Jimmy's query is so useful that it is also included in Glenn Berry's (blog/@glennalanberry) great Diagnostic Information (DMV) Queries. Here is my *very slightly* modified version of that query:

-- Drive level latency information (Query 19) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT @@ServerName as instanceName, [Drive], volume_mount_point,
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN io_stall_write_ms = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN io_stall_write_ms = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, UPPER (volume_mount_point) as volume_mount_point
,SUM(num_of_reads) AS num_of_reads,
SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.FILE_ID)
GROUP BY LEFT(UPPER(mf.physical_name), 2),UPPER (volume_mount_point)) AS tab
ORDER BY [Overall Latency] DESC OPTION (RECOMPILE);
-- Shows you the drive-level latency for reads and writes, in milliseconds
-- Latency above 20-25ms is usually a problem

The query performs a calculation comparing the stall in milliseconds to the number of disk operations to create a latency in milliseconds. A rule of thumb (as noted in the query comments) is that anything >=20ms in latency is pretty bad and worthy of investigation.

Here is a sample output of the query:

instanceName	Drive	volume_mount_point	Read Latency	Write Latency	Overall Latency	Avg Bytes/ Read	Avg Bytes/ Write	Avg Bytes/ Transfer
INSTANCE35	F:	F:\SPARE-01\	69	7	64	1002982	236026	940840
INSTANCE35	F:	F:\DATA01\	26	1	25	55005	2629	54384
INSTANCE35	F:	F:\280532-02_SECOND\	8	5	8	105124	8257	105114
INSTANCE35	T:	T:\	2	25	3	65483	64899	65187
INSTANCE35	E:	E:\LOGS01\	7	0	0	20080	51327	51198

On this instance you can see that the SPARE-01 and DATA01 mount points on the F: drive have significant overall latency (64ms and 25ms respectively) - significant enough that the users are almost certainly experiencing I/O-related impact.

If you look at the query you will see that the "overall" latency covers both reads and writes and as such is functionally a weighted average of the read and write latencies. For example you can see that the DATA01 does significantly more reads that writes since the overall latency of 25 is almost equal to the read latency of 26.

One more point from the sample numbers above - look at the T: drive. The write latency is 25 (bad) but the overall latency is only 3. Is this a problem? Usually it is not, because this overall latency shows that this drive does relatively few writes - there are so many reads comparatively that the good read latency heavily outweighs the bad write latency. In most cases the overall latency is the most important statistic of the three.

(Of course as always this is a big #ItDepends - if you have a known issue with write performance - even for the small number of writes being done on the drive - then the write latency is important!)

What causes latency? There are a wide variety of issues - some inside SQL but many outside the engine as well. In my experience I usually follow-up on both sides, tasking infrastructure teams to investigate numbers on the storage back-end *and* the storage network while the DBA investigates SQL Server-side I/O-related issues such as poor query plans and missing or bad indexes that cause excessive unnecessary I/O. Sometimes you will read guidance to investigate SQL first (Paul makes that recommendation in his post referenced above) but I like starting both investigations at once so that you can get both sets of results in a more timely fashion.

Disk Latency is one of the easiest things to find - run the single simple query shown above and look at the numbers. Once you determine that you *do* have a latency issue, you can move forward with internal SQL Server and external infrastructure investigations.

Remember - at the end of the day, you want your customer to think this:

http://imgur.com/waLAsxg

Hope this helps!

Toolbox - IO, IO, Why are You So Slow?

Rate

Share

Share

Rate

Toolbox - IO, IO, Why are You So Slow?

Rate

Share

Share

Rate

Related content

Book Review: Big Red - Voyage of a Trident Submarine

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

Inserting Markup into a String with SQL

Networking - Part 4

Speaking at Community Events - More Thoughts