SQL Server I/O requests taking longer than 15 seconds & slow checkpoints

  • For the last couple of weeks we've been trying to get to the root cause of these I/O issues and the slowdown of the checkpoints. At first glance it looked like a clear I/O subsystem problem, and the SAN admin was the obvious suspect, but we recently moved the SAN to all-flash and as of today the error still pops up. I have no clue why, since every metric I run on the SQL Server side to check whether SQL Server is the culprit (wait stats and everything else, along the lines of the check sketched after the error messages below) comes back normal. It doesn't really add up. It could also be that something else is chewing up the disk and SQL Server is the victim here, but I am not able to find out what. The databases are in an AlwaysOn availability group, and when these events occur we also see role changes and failovers along with timeouts. Any help in figuring this out would be highly appreciated. Let me know if any further details are needed.

    Error messages below:

    SQL Server has encountered 14212 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\ABC.mdf] in database [ABC] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000641262c000

    SQL Server has encountered 5347 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\XYZ.mdf] in database [XYZ] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000506c060000

    FlushCache: cleaned up 111476 bufs with 62224 writes in 925084 ms (avoided 19 new dirty bufs) for db 7:0

    average throughput: 0.94 MB/sec, I/O saturation: 55144, context switches 98407

    last target outstanding: 10240, avgWriteLatency 14171

    FlushCache: cleaned up 5616 bufs with 3126 writes in 248687 ms (avoided 3626 new dirty bufs) for db 6:0

    average throughput: 0.18 MB/sec, I/O saturation: 10080, context switches 20913

    last target outstanding: 2, avgWriteLatency 3
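
    For reference, the wait-stats check that keeps coming back normal is along these lines (a minimal sketch; the wait types listed are just the usual I/O-related suspects):

    -- Cumulative I/O-related waits since the last restart (or since the
    -- stats were last cleared). High PAGEIOLATCH_* / WRITELOG averages
    -- would line up with the 15-second I/O warnings above.
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms,
           wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM   sys.dm_os_wait_stats
    WHERE  wait_type IN (N'PAGEIOLATCH_SH', N'PAGEIOLATCH_EX', N'PAGEIOLATCH_UP',
                         N'WRITELOG', N'IO_COMPLETION', N'ASYNC_IO_COMPLETION')
    ORDER BY wait_time_ms DESC;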

  • You want to look at the entire I/O subsystem, not just the disks: anything in the path from SQL Server to the disks. I've had that type of thing happen before with multipathing and a bad port on the switch.

    You may want to start by getting some stats from sys.dm_io_virtual_file_stats to see whether it's one particular database, file, drive, etc. There's a good example in this article (and a quick sketch of the approach below):

    https://www.sqlskills.com/blogs/paul/capturing-io-latencies-period-time/

    Then you need to work through the entire path in the I/O subsystem. The role switching isn't unusual - when I had the issue on clusters, they would flip back and forth while the I/O stalls were being logged.
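
    A minimal sketch of that two-snapshot approach (the 30-minute window and the #io_before temp table name are just illustrative; the version in the article adds file names, sizes, and more detail):

    -- First snapshot of the cumulative per-file I/O counters.
    SELECT database_id, file_id, num_of_reads, io_stall_read_ms,
           num_of_writes, io_stall_write_ms
    INTO   #io_before
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL);

    WAITFOR DELAY '00:30:00';   -- sample window

    -- Second snapshot, diffed against the first: average read/write
    -- latency per file for just that window.
    SELECT DB_NAME(a.database_id) AS db_name,
           a.file_id,
           (a.io_stall_read_ms  - b.io_stall_read_ms)
               / NULLIF(a.num_of_reads  - b.num_of_reads, 0)  AS avg_read_ms,
           (a.io_stall_write_ms - b.io_stall_write_ms)
               / NULLIF(a.num_of_writes - b.num_of_writes, 0) AS avg_write_ms
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS a
    JOIN   #io_before AS b
      ON   b.database_id = a.database_id
     AND   b.file_id = a.file_id
    ORDER BY avg_write_ms DESC;

    DROP TABLE #io_before;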

    Sue

  • Thanks, Sue! I will grab that information and loop back on this thread. Also, we have a team meeting tomorrow with the storage admin, networking team, etc. Do you have any thoughts on what to ask them? The following are the details provided by the systems guy:

    We separate out high-I/O-intensive workloads (such as databases) so that we only have one per host.

    The specs for the current host are a Dell R730 with 16 cores of Xeon E5-2620 (2 sockets), 512 GB of RAM, and 2x10G interconnects for storage.

    No other VM on the cluster or host is experiencing these issues.

    Storage for VMs and workloads is on Pure FA-x20.

  • I've provided the output of the virtual file stats at the link below. It was sampled exactly as described in the article, over a period of 30 minutes.

    https://imgur.com/a/HlW2Ohb

  • You'd also want to check all of the other major components for the SQL Server instance. For example, you can end up with more disk activity if you are limited on memory - less data in memory means more trips to disk. Check that sort of thing and double-check all of the configuration settings (a quick sketch of that kind of check is below). And review the VM settings - there are best practices for SQL Server published by VMware which are a good starting point.

    And you still want to review all the pieces in the I/O subsystem - basically a full systems review, since you have some pretty bad numbers for the I/O stalls. What's interesting is that no other VM guests or hosts seem to be affected, which initially points to something specific to this VM guest with the I/O stalls.
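
    On the memory and configuration point, a quick sketch of that kind of sanity check (the setting and counter names are the standard ones; what counts as a healthy page life expectancy depends on how much memory the box has):

    -- Memory-related instance settings: an unlimited or badly sized
    -- max server memory is a common cause of extra trips to disk.
    SELECT name, value_in_use
    FROM   sys.configurations
    WHERE  name IN (N'max server memory (MB)', N'min server memory (MB)');

    -- Page life expectancy: how long pages are staying in the buffer pool.
    -- A value that keeps collapsing means the workload is being pushed
    -- back out to the I/O subsystem.
    SELECT object_name, counter_name, cntr_value AS ple_seconds
    FROM   sys.dm_os_performance_counters
    WHERE  counter_name = N'Page life expectancy'
      AND  object_name LIKE N'%Buffer Manager%';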

    Sue
