I/O requests taking longer than 15 seconds to complete

  • This is a Win2008R2 SQL2008R2 SP2 active/passive cluster.

    I consistently get these messages when I run UPDATE STATISTICS on a large table:

    SQL Server has encountered 907 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [V:\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb2.ndf] in database [tempdb] (2). The OS file handle is 0x00000000000011D0. The offset of the latest long I/O is: 0x00000139500000
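    For anyone collecting these from the error log, here is a minimal Python sketch that pulls the useful fields out of the message quoted in this post. The regular expression and the function name `parse_long_io` are my own assumptions based only on the wording of this one message; other builds may word it differently.

```python
import re

# Message text copied from the post; everything else here is illustrative.
MESSAGE = (
    "SQL Server has encountered 907 occurrence(s) of I/O requests taking "
    "longer than 15 seconds to complete on file "
    r"[V:\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb2.ndf] in database "
    "[tempdb] (2). The OS file handle is 0x00000000000011D0. "
    "The offset of the latest long I/O is: 0x00000139500000"
)

# Assumed pattern, written against the wording of the message above.
PATTERN = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests taking "
    r"longer than (?P<seconds>\d+) seconds to complete on file "
    r"\[(?P<file>[^\]]+)\] in database \[(?P<db>[^\]]+)\] \((?P<dbid>\d+)\)"
    r".*?offset of the latest long I/O is: (?P<offset>0x[0-9A-Fa-f]+)"
)


def parse_long_io(message):
    """Return the fields of a long-I/O warning as a dict, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    offset_bytes = int(m.group("offset"), 16)
    return {
        "count": int(m.group("count")),
        "threshold_seconds": int(m.group("seconds")),
        "file": m.group("file"),
        "database": m.group("db"),
        "database_id": int(m.group("dbid")),
        "offset_bytes": offset_bytes,
        # Byte offset expressed in MiB, handy for seeing where in the file it is.
        "offset_mib": offset_bytes / (1024 * 1024),
    }


if __name__ == "__main__":
    info = parse_long_io(MESSAGE)
    print(info["count"], info["file"], info["offset_mib"])
```

    Counting occurrences per file and per time window this way may make it easier to line the warnings up with whatever the SAN team is graphing.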

    The storage guys have shown me specifically that the storage is responding to these requests in less than 10 ms, so it isn't the storage itself. That means it has to be something in between, but what? Has anyone had this issue, or does anyone have ANY ideas on how I can narrow down what is causing it? I have four other SQL Servers with this issue as well. However, we also have some quite sizeable SQL Servers here that do not have the problem, so it doesn't affect every server.

    Any help would be appreciated.

  • I recently got a bunch of those, and they were caused by a network issue in the connection to the SAN.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Interesting. However, the SAN team tells me that if we had problems with the SAN or connectivity, many other larger systems not on SQL Server would be complaining up and down, and there has been none of that.

  • What monitoring stats have you collected so far to supply to your SAN and infrastructure teams?

    In between the SAN and the SQL Server you have all the other software\hardware components, and these all need to be checked. Capture disk latency and throughput as a starter and discuss these with the SAN\Infrastructure teams.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • The only time I see these is when I run Update Stats on large tables or integrity checks. In Activity Monitor, the Data File I/O response time (ms) goes way up in spikes. I have shown them this. However, they show me the monitor from the SAN showing disk response of 8 ms and under. They say the SAN is performing great and other Oracle and ESSBASE systems that hit the SAN harder are not having any issues. I just don't know what else to show them to convince them there is an issue, or what to even look at.

  • As I said, you have all the components in between the SAN and the SQL instance.

    HBAs

    HBA drivers

    multipath drivers

    Fibre channel switches

    Etc, etc

    Initially, check the Queue depth settings on the HBA(s)
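    To show why queue depth is worth checking even when the array itself reports fast service times, here is a rough back-of-the-envelope sketch in Python using Little's law. It is a deliberate simplification: the queue depth of 32 and the burst of 4096 outstanding requests are made-up illustrative numbers, the 8 ms service time is the figure the SAN team quoted in this thread, and the function names are hypothetical.

```python
import math


def max_iops(queue_depth, service_time_s):
    """Little's law: throughput = requests in flight / time per request."""
    return queue_depth / service_time_s


def worst_case_wait_s(outstanding, queue_depth, service_time_s):
    """Rough time until the last of `outstanding` requests completes when the
    host will only keep `queue_depth` of them in flight at any moment.

    Requests are treated as going out in 'waves' of queue_depth, each wave
    taking one service time. Real stacks are messier than this.
    """
    waves = math.ceil(outstanding / queue_depth)
    return waves * service_time_s


if __name__ == "__main__":
    depth = 32           # assumed HBA queue depth, not taken from the thread
    service = 0.008      # 8 ms, the response time reported by the SAN team
    burst = 4096         # assumed number of I/Os issued at once

    print("ceiling IOPS:", max_iops(depth, service))
    print("last request waits (s):", worst_case_wait_s(burst, depth, service))
```

    With these assumed numbers the path tops out at about 4000 IOPS, and the last request in the burst waits roughly a second even though every individual I/O took only 8 ms at the array. The point is only that time spent queued on the host side is invisible to a SAN-side monitor, which would be consistent with both teams' numbers being right.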


  • Perry Whittle (11/7/2012)


    As I said, you have all the components in between the SAN and the SQL instance.

    HBAs

    HBA drivers

    multipath drivers

    Fibre channel switches

    Etc, etc

    Initially, check the Queue depth settings on the HBA(s)

    They told me they checked all of that and all was OK....

  • Thanks for the ammunition. I will see what I can find out about this.

  • So I am using PerfMon to monitor the

    PhysicalDisk performance object

    Avg. Disk sec/Read counter

    Avg. Disk sec/Write counter

    Avg. Disk sec/Transfer counter

    based on this article:

    http://blogs.technet.com/b/askcore/archive/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon.aspx

    but I cannot figure out what the information is telling me: what is good, what is bad, and what I should take to our SAN/Server engineers.

    Can anyone help with this?

  • Markus (11/7/2012)


    So I am using PerfMon to monitor the

    PhysicalDisk performance object

    Avg. Disk sec/Read counter

    Avg. Disk sec/Write counter

    Avg. Disk sec/Transfer counter

    based on this article:

    http://blogs.technet.com/b/askcore/archive/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon.aspx

    but I cannot figure out what the information is telling me: what is good, what is bad, and what I should take to our SAN/Server engineers.

    Can anyone help with this?

    What numbers are you seeing for the above counters when you run your update stats query?

  • Very difficult to tell you... it fluctuates.

    The left side of the graph from 0 to 100 goes up and down, but if I highlight one of the lines below, the Last and Maximum values do not agree with where the bar is on the 0 to 100 scale either. The bar graphs for all of the counters are all over the place.

    When I see the slow I/Os, Avg. Disk sec/Transfer is pegged at 100, and for Last it reports 12.5. Does that mean 12.5 milliseconds?

  • Markus (11/7/2012)


    http://blogs.technet.com/b/askcore/archive/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon.aspx

    Alongside this, also capture throughput using the following counters:

    PhysicalDisk: Disk Read Bytes/sec

    PhysicalDisk: Disk Write Bytes/sec

    These will show the throughput of data transferred to and from disk.
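    Once the counters are logged to a CSV file, a few lines of Python can reduce each column to an average and a peak, which tends to be easier to hand to another team than a PerfMon graph. This is only a sketch: the host name `SQL01`, the disk instance `5 V:` and all sample values below are invented for illustration, and the function name `summarize_counters` is hypothetical. It assumes the usual PerfMon CSV layout of one timestamp column followed by one column per counter.

```python
import csv
import io

# Made-up sample in PerfMon CSV layout; none of these values come from the thread.
SAMPLE = r"""
"(PDH-CSV 4.0) (Eastern Standard Time)(300)","\\SQL01\PhysicalDisk(5 V:)\Avg. Disk sec/Transfer","\\SQL01\PhysicalDisk(5 V:)\Disk Read Bytes/sec"
"11/07/2012 10:15:00.000","0.008","52428800"
"11/07/2012 10:15:05.000","0.012"," "
"11/07/2012 10:15:10.000","12.5","1048576"
""".strip()


def summarize_counters(csv_text):
    """Return {counter_name: (average, maximum)} for every counter column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    names = header[1:]              # first column is the timestamp
    values = {name: [] for name in names}
    for row in reader:
        for name, cell in zip(names, row[1:]):
            try:
                values[name].append(float(cell))
            except ValueError:
                continue            # blank or non-numeric sample, skip it
    return {
        name: (sum(v) / len(v), max(v))
        for name, v in values.items()
        if v
    }


if __name__ == "__main__":
    for name, (avg, peak) in summarize_counters(SAMPLE).items():
        print(name, "avg:", avg, "max:", peak)
```

    The maximum is the number to watch here, since a handful of multi-second stalls can hide inside a healthy-looking average.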


  • Markus (11/7/2012)


    Very difficult to tell you... it fluctuates.

    The left side of the graph from 0 to 100 goes up and down, but if I highlight one of the lines below, the Last and Maximum values do not agree with where the bar is on the 0 to 100 scale either. The bar graphs for all of the counters are all over the place.

    When I see the slow I/Os, Avg. Disk sec/Transfer is pegged at 100, and for Last it reports 12.5. Does that mean 12.5 milliseconds?

    Ignore the left side of the graph, that's just a scale. If you highlight the counter on the bottom and Last says 12.5, that's 12.5 seconds and is not good.

    Is it possible tempdb is trying to grow but can't grow fast enough? What size is your tempdb, and how much is in use? Otherwise, it looks like it could be a connection issue. Do your SAN guys monitor your server HBA too, and everything in between? I would think they can only see the SAN performance. Is this Fibre Channel or iSCSI? Are there any switches in between? Run the counters, try a manual copy of a large file to that drive, and see what the numbers are then.
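    Since the Avg. Disk sec/... counters report seconds, the unit conversion is easy to get wrong when reading the Last column. A small Python sketch of the conversion follows. The thresholds used for the labels are rule-of-thumb assumptions of mine, not values from this thread or from any official guidance, and `classify_latency` is a hypothetical helper name.

```python
# Assumed rule-of-thumb bands in milliseconds; adjust to whatever your
# storage team considers acceptable.
BANDS_MS = [
    (10.0, "good"),
    (20.0, "acceptable"),
    (50.0, "slow"),
]


def classify_latency(avg_disk_sec):
    """Convert a PerfMon 'Avg. Disk sec/...' value (seconds) to milliseconds
    and attach a rough label based on the assumed bands above."""
    ms = avg_disk_sec * 1000.0
    for upper_bound_ms, label in BANDS_MS:
        if ms < upper_bound_ms:
            return ms, label
    return ms, "severe"


if __name__ == "__main__":
    # 12.5 as shown in the Last column is 12.5 seconds, not 12.5 ms.
    print(classify_latency(12.5))
    # A reading of 0.0125 would be the 12.5 ms case.
    print(classify_latency(0.0125))
    # 0.008 matches the 8 ms the SAN-side monitor reports.
    print(classify_latency(0.008))
```

    Seen this way, a Last value of 12.5 is three orders of magnitude worse than the 8 ms the SAN monitor shows, which is why the gap has to be somewhere between the host and the array.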

  • scogeb (11/7/2012)


    Markus (11/7/2012)


    Very difficult to tell you... it fluctuates.

    The left side of the graph from 0 to 100 goes up and down, but if I highlight one of the lines below, the Last and Maximum values do not agree with where the bar is on the 0 to 100 scale either. The bar graphs for all of the counters are all over the place.

    When I see the slow I/Os, Avg. Disk sec/Transfer is pegged at 100, and for Last it reports 12.5. Does that mean 12.5 milliseconds?

    Ignore the left side of the graph, that's just a scale. If you highlight the counter on the bottom and Last says 12.5, that's 12.5 seconds and is not good.

    Is it possible tempdb is trying to grow but can't grow fast enough? What size is your tempdb, and how much is in use? Otherwise, it looks like it could be a connection issue. Do your SAN guys monitor your server HBA too, and everything in between? I would think they can only see the SAN performance. Is this Fibre Channel or iSCSI? Are there any switches in between? Run the counters, try a manual copy of a large file to that drive, and see what the numbers are then.

    TEMPDB is 32 GB in size and there are two data files. The TEMPDB data and log files are on their own drive letter, and this is the drive that is getting the slow I/Os. It is not growing.

    I have no idea what they are monitoring, but I will find out. I don't know if there are any switches in between either. I will try running the counters while doing a large file copy and see what happens.

    Thanks gang!
