SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds

  • Hi there - Last night our application experienced a brief outage due to the following error, which I found in the error logs:

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [O:\logs\DATABASE.ldf] in database [DATABASE] (12). The OS file handle is 0x0000000000000D58. The offset of the latest long I/O is: 0x00000013f80000

    Does anyone have any idea what this means? And what may have caused it? Thanks

    --------------------------------------------

    Laughing in the face of contention...

  • It means you probably have an issue somewhere in your storage subsystem: either the disks themselves have a problem, or you somehow saturated them, which is a problem as well.

    You can learn more about it here, http://support.microsoft.com/kb/2137408
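    If you want to see whether the log drive in the message really is the slow one, a rough starting point (this is a general-purpose latency check, not something taken from that KB article) is the per-file I/O stall that SQL Server itself records in sys.dm_io_virtual_file_stats:

    -- average stall per read/write in ms, cumulative since the last restart
    SELECT  DB_NAME(vfs.database_id)                              AS database_name,
            mf.physical_name,
            vfs.num_of_reads,
            vfs.num_of_writes,
            vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_stall_ms,
            vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_stall_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    ORDER BY avg_write_stall_ms DESC;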

  • Are there any disk scans, defrags, or anti-virus scans running? Any of these could cause it as well.

    The message means that SQL Server sent a request to your storage for data and it took more than 15 seconds to return the results... it should be less than 15 milliseconds.

    I have battled these in the past and it is VERY difficult to determine the cause. I even had Microsoft looking into this, and all they came back with was that it was the storage. For one incident we DID find it was the storage; other times we never determined the cause.
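    If you want to see how often it has been happening, you can search the error log for the warning text. A minimal sketch using xp_readerrorlog (it's undocumented, so treat the parameter list as a best guess: 0 = current log, 1 = SQL Server log, then an optional search string):

    -- list every long-I/O warning in the current error log
    EXEC master.dbo.xp_readerrorlog 0, 1, N'I/O requests taking longer', NULL;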

  • Part of the reason why this problem is so difficult to nail down is that, in my humble opinion, people don't look in the right spot. Instead, they go through the throes of proving that the disk system doesn't actually have a problem and that none of the usual suspects (defrags, Backup Exec, etc.) is to blame. That's because the error message is really just a symptom of a more insidious problem than hardware or the normal things you would expect SQL Server to do.

    This isn't a rare problem, either. I've found this problem on virtually every system that I've been assigned to work on, and few are willing to fix the real cause because it's "too expensive". So they throw more hardware at it and the problem might go away for a while, but it almost always comes back to haunt you because you haven't fixed the real cause of the problem.

    How can you find the real cause?

    Open SSMS and, if it isn't already open, open the Object Explorer by pressing the {F8} key. Right click on the instance and select {Reports}, {Standard Reports}, {Performance - Top Queries by Total I/O}. That will show you the top 10 problems that are the usual cause of this problem. You need to bite the bullet and fix those problems. Once you fix those, fix the next 10 and then the 10 after that (although the first 10 will usually give you the biggest bang for the buck).

    After that, you can probably disconnect the garden hose from the CPU cooling and turn off the industrial fan you're using to cool the SAN and the apps will start running just fine with no more such outages.
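    If you'd rather not click through the GUI, a rough DMV approximation of that report (this is not the report's own query, just a commonly used substitute built on sys.dm_exec_query_stats, and it only covers plans still in cache) looks something like this:

    -- top 10 cached statements by total logical I/O since they were compiled
    SELECT TOP (10)
           qs.total_logical_reads + qs.total_logical_writes AS total_io,
           qs.execution_count,
           SUBSTRING(st.text,
                     (qs.statement_start_offset / 2) + 1,
                     ((CASE qs.statement_end_offset
                            WHEN -1 THEN DATALENGTH(st.text)
                            ELSE qs.statement_end_offset
                       END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY total_io DESC;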

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/14/2014)


    How can you find the real cause?

    Open SSMS and, if it isn't already open, open the Object Explorer by pressing the {F8} key. Right click on the instance and select {Reports}, {Standard Reports}, {Performance - Top Queries by Total I/O}. That will show you the top 10 problems that are the usual cause of this problem. You need to bite the bullet and fix those problems. Once you fix those, fix the next 10 and then the 10 after that (although the first 10 will usually give you the biggest bang for the buck).

    The latest one that is causing us to see slow I/Os is only a distribution SQL Server. I have no control over how Microsoft has set up replication.

  • Understood and appreciated. Sometimes there are problems like that. Just to be sure, though, none of the systems that I've encountered this on had replication or anything similar running.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQL Server doesn't tell porky pies; when you get this message it's because an I/O request has taken longer than 15 seconds. Storage saturation is common with SAN-attached disks, but you generally won't see it at the array level, which may be showing no issues. Don't forget that an I/O on a SAN has to traverse all the FC switches to the storage processor and then come back. When the request leaves the SP it could take forever due to network flooding. An FC network is similar to a TCP/IP network in that a large number of requests active on the network can cause response issues.

    One thing to check is the queue depth setting on the HBAs; if it's too high it could cause issues.
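    If you want to catch the problem in the act, one general check (not specific to any particular SAN) is sys.dm_io_pending_io_requests; any row sitting there with a large io_pending_ms_ticks is an I/O that the OS, HBA and storage stack still owe SQL Server:

    -- I/Os currently outstanding, oldest first, mapped back to the file they belong to
    SELECT  pio.io_pending,
            pio.io_pending_ms_ticks,
            mf.physical_name
    FROM    sys.dm_io_pending_io_requests AS pio
    JOIN    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
            ON vfs.file_handle = pio.io_handle
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    ORDER BY pio.io_pending_ms_ticks DESC;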

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Hi,

    What some of the other guys here have mentioned is very valid, as it could be one of many different issues. Here are a few extra things to check.

    Is your server a VM? If so, check your underlying disk subsystem to see whether any other servers are sharing it. It could be that those servers are very busy and overloading the underlying disks.

    It could be a basic configuration issue. In your first post you mentioned the log file. How big is it? Are there many VLFs? Do all your data and log files reside on the same disk? What is the RAID level on the disk?

    At the time of the alerts are there any DBCCs or index rebuilds running?

    Regarding the replicated server, if you could describe the setup maybe I can give you some pointers as to what to check there, i.e. is it both the Publisher and the Distributor? How many Subscribers are there, and are they pull or push subscriptions? What type of replication is it? Are you getting the alerts on the distribution database or the published database?
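    To check the VLF count mentioned above, a quick sketch using the undocumented DBCC LOGINFO (one row is returned per VLF; the database name here is just the one from the original error message):

    -- row count = number of VLFs; thousands of rows usually mean years of tiny log autogrowths
    DBCC LOGINFO (N'DATABASE');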

    Hope this helps...

  • Hi there - thanks to everyone for their input on this.

    After numerous calls with our hosting partners I can confirm that it was in fact a problem with our network.

    As our database is highly transactional and does not contain any high I/O-bound queries, I knew it had to be an issue with the subsystem. We performed several tests on the disk and found that it was performing as expected. We then monitored the time it took for information to pass between SQL and the disk, and found that the network was faulty and not handling the requests as expected, so we changed the network to a higher tunnel link.
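    For anyone wanting to make that kind of measurement from inside SQL Server, one hedged sketch is the cumulative I/O-related wait statistics (these are totals since the last restart, so compare a snapshot taken before the slow period with one taken after):

    -- time spent waiting on data-file reads (PAGEIOLATCH*) and log writes (WRITELOG)
    SELECT  wait_type,
            waiting_tasks_count,
            wait_time_ms,
            wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM    sys.dm_os_wait_stats
    WHERE   wait_type LIKE 'PAGEIOLATCH%'
       OR   wait_type = 'WRITELOG'
    ORDER BY wait_time_ms DESC;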

    Thanks again.

    --------------------------------------------

    Laughing in the face of contention...

  • arrjay (11/20/2014)


    We performed several tests on the disk and found that it was performing as expected. We then monitored the time it took for information to pass between SQL and the disk, and found that the network was faulty and not handling the requests as expected, so we changed the network to a higher tunnel link.

    Classic case scenario; thanks for posting back.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Incidentally, is this an iSCSI storage system or FC?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • We added additional tempdb data files to match our 8 cores. We were expecting a performance improvement after adding the files, but instead the jobs are taking longer than before. We always hit the long I/O warnings on the drive where the tempdb files reside.

    I need help with this issue.
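    In case it helps anyone looking at something similar, a quick way to sanity-check the layout (nothing system-specific, just a query against tempdb.sys.database_files) is below; for multiple tempdb data files to help, they generally need to be the same size with the same autogrowth settings:

    -- tempdb data files: equal sizes and matching growth settings are what you want to see
    SELECT  name,
            physical_name,
            size / 128 AS size_mb,
            growth,
            is_percent_growth
    FROM    tempdb.sys.database_files
    WHERE   type_desc = 'ROWS';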

  • mchitturi - Wednesday, September 12, 2018 9:46 AM

    We added additional tempdb data files to match our 8 cores. We were expecting a performance improvement after adding the files, but instead the jobs are taking longer than before. We always hit the long I/O warnings on the drive where the tempdb files reside.

    I need help with this issue.

    You should create your own thread rather than posting this on an unrelated four-year-old thread.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA
