This is the next in a series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA. I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.
Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles. I will always give attribution back to the original source and note when I have made modifications.
--
Troubleshooting performance in SQL Server (or almost any other system) is often an iterative process of discovering a bottleneck, fixing it, and then discovering the next bottleneck.
https://memegenerator.net/instance/65795220/grumpy-cat-5-bottleneck-sighting-optimize-the-system |
SQL Server bottlenecks frequently fall into one of three resource categories:
- CPU
- Memory
- I/O (Disk)
https://hakanforss.files.wordpress.com/2014/08/lawofbottlenecks.png |
You look at the numbers again, and now you find that disk latency, which had previously been fine, is now completely in the tank during the business day, showing that I/O delays are through the roof.
https://movietalkexpress.files.wordpress.com/2016/08/3.jpg |
The next concept in this series is the concept of "good enough" - all computing systems have theoretical top speeds (speed of electrons through the ether, etc.) but you never get there. At the end of the day the important question is:
https://i.kinja-img.com/gawker-media/image/upload/s--Wtfu6LxV--/c_fill,fl_progressive,g_center,h_450,q_80,w_800/1344998898279631762.jpg |
--
Once you can determine this "good enough" speed (the whole discussion of "Do you have SLAs?" could fill dozens of blog posts) you can determine where an acceptable end point is to your iterative process.
Column name | Data type | Description |
database_id | smallint | ID of database. |
file_id | smallint | ID of file. |
sample_ms | int | Number of milliseconds since the computer was started. This column can be used to compare different outputs from this function. |
num_of_reads | bigint | Number of reads issued on the file. |
num_of_bytes_read | bigint | Total number of bytes read on this file. |
io_stall_read_ms | bigint | Total time, in milliseconds, that the users waited for reads issued on the file. |
num_of_writes | bigint | Number of writes made on this file. |
num_of_bytes_written | bigint | Total number of bytes written to the file. |
io_stall_write_ms | bigint | Total time, in milliseconds, that users waited for writes to be completed on the file. |
io_stall | bigint | Total time, in milliseconds, that users waited for I/O to be completed on the file. |
size_on_disk_bytes | bigint | Number of bytes used on the disk for this file. For sparse files, this number is the actual number of bytes on the disk that are used for database snapshots. |
file_handle | varbinary | Windows file handle for this file. |
Paul Randal (blog/@PaulRandal) has a great blog post "How to examine IO subsystem latencies from within SQL Server" and in that post he references a query created by Jimmy May (blog/@aspiringgeek). Jimmy was at Microsoft for many years and is currently at the SanDisk Data Propulsion Lab.
Jimmy's query is so useful that it is also included in Glenn Berry's (blog/@glennalanberry) great Diagnostic Information (DMV) Queries. Here is my *very slightly* modified version of that query:
--
-- Drive level latency information (Query 19) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT @@ServerName as instanceName, [Drive], volume_mount_point,
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN io_stall_write_ms = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN io_stall_write_ms = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, UPPER (volume_mount_point) as volume_mount_point
,SUM(num_of_reads) AS num_of_reads,
SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.FILE_ID)
GROUP BY LEFT(UPPER(mf.physical_name), 2),UPPER (volume_mount_point)) AS tab
ORDER BY [Overall Latency] DESC OPTION (RECOMPILE);
-- Shows you the drive-level latency for reads and writes, in milliseconds
-- Latency above 20-25ms is usually a problem
--
Here is a sample output of the query:
--
instanceName | Drive | volume_mount_point | Read Latency | Write Latency | Overall Latency | Avg Bytes/ Read | Avg Bytes/ Write | Avg Bytes/ Transfer |
INSTANCE35 | F: | F:\SPARE-01\ | 69 | 7 | 64 | 1002982 | 236026 | 940840 |
INSTANCE35 | F: | F:\DATA01\ | 26 | 1 | 25 | 55005 | 2629 | 54384 |
INSTANCE35 | F: | F:\280532-02_SECOND\ | 8 | 5 | 8 | 105124 | 8257 | 105114 |
INSTANCE35 | T: | T:\ | 2 | 25 | 3 | 65483 | 64899 | 65187 |
INSTANCE35 | E: | E:\LOGS01\ | 7 | 0 | 0 | 20080 | 51327 | 51198 |
On this instance you can see that the SPARE-01 and DATA01 mount points on the F: drive have significant overall latency (64ms and 25ms respectively) - significant enough that the users are almost certainly experiencing I/O-related impact.
If you look at the query you will see that the "overall" latency covers both reads and writes and as such is functionally a weighted average of the read and write latencies. For example you can see that the DATA01 does significantly more reads that writes since the overall latency of 25 is almost equal to the read latency of 26.
One more point from the sample numbers above - look at the T: drive. The write latency is 25 (bad) but the overall latency is only 3. Is this a problem? Usually it is not, because this overall latency shows that this drive does relatively few writes - there are so many reads comparatively that the good read latency heavily outweighs the bad write latency. In most cases the overall latency is the most important statistic of the three.
(Of course as always this is a big #ItDepends - if you have a known issue with write performance - even for the small number of writes being done on the drive - then the write latency is important!)
--
What causes latency? There are a wide variety of issues - some inside SQL but many outside the engine as well. In my experience I usually follow-up on both sides, tasking infrastructure teams to investigate numbers on the storage back-end *and* the storage network while the DBA investigates SQL Server-side I/O-related issues such as poor query plans and missing or bad indexes that cause excessive unnecessary I/O. Sometimes you will read guidance to investigate SQL first (Paul makes that recommendation in his post referenced above) but I like starting both investigations at once so that you can get both sets of results in a more timely fashion.
Disk Latency is one of the easiest things to find - run the single simple query shown above and look at the numbers. Once you determine that you *do* have a latency issue, you can move forward with internal SQL Server and external infrastructure investigations.
--
Remember - at the end of the day, you want your customer to think this:
http://imgur.com/waLAsxg |
--
Hope this helps!