Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

I/O You a followup - More about I/O, Performance Tuning & Troubleshooting....

Yesterday, I blogged about basic I/O usage with some useful scripts, DMVs, and links to more in-depth details on examining these statistics.

Since my blog was blasted to many LinkedIN SQL Server groups, I received many, many of your positive comments and feedback - and for that, I/O you a big thanks! :-)

(Please post all comments directly to the Blog, so all Internet travelers can see your thoughtful comments.)

I wanted to follow-up with some ways to help identify I/O bottlenecks, performance tuning, tips and troubleshooting. 

First using performance monitor, here are the two I/O related counters that can help you understand if there is any I/O bottleneck:

PhysicalDisk(_Total)\Avg. Disk Queue Length
- if this counter is more than Avg. 2 per disk (or spindle if using SAN) then there is an issue

PhysicalDisk(_Total)\Avg. Disk sec/Transfer
- if this is counter value is more than 15 milli seconds this also indicates I/O issues.

One I/O related error message that you may have often seen and wondered about in the SQL Error Log is:

 'SQL Server has encountered n occurrence(s) of I/O requests taking longer than 15 seconds to complete on file <filename> in database <dbname>.'

I believe this was added to SQL Server 2005 SP2, to help identify issues that are affecting SQL Server's performance, as related to a poor disk subsystem, misconfiguration, driver/HW problem connection between the server and the SAN.  It could also indicate fragmentation, excessive I/O requests not being handled properly by the disk, or data files not optimally placed on disk.

The reason why it is so important to understand I/O performance, especially for DBA's, is that it is often misunderstood by those who are NOT DBA's.  System and SAN adminstrators are often quick to blame the database server, when in actuality, SQL Server is simply being helpful in identifying a potential Disk/HW problem. And, managers are more apt to viscerally accept ther analysis over the DBA. So, the more you understand about this, the more knowledge you will have in leading the Systems folks to the real underlying problem.

This in fact, happened recently at a client of mine, and it took some serious caucusing and long meeting hours to prove that it was in fact a disk problem.  I identified the cause right away, but skeptics needed more convincing and disk and SAN diagnostic tests.  This "fiddling while Rome burns" technique caused several incidents of crashing clusters going down at regular intervals.  Turns out, because of a flawed driver and improper configuration, the subsystem could not keep up with the I/O requests generated from the SQL Server.

I have found this excellent article/link, that delves into the origins and causes of the infamous 'SQL Server has encountered n occurrence(s) of I/O requests taking longer than...' and worth adding to your DBA favorites list:

http://www.sql-server-pro.com/i-o-requests-taking-longer-than-15-seconds-to-complete.html

Continue monitoring and watching this blog space for more exciting info and helpful tips!

I will begin running a regular feature called: 'MVP Thoughts of the Day' - great tips, info and insights from known industry sql gurus, experts, and MVPs on all things relating to SQL Server. 

Any of you MVP's out there, feel free to send me your comments and thoughts at rsp05@pearlknows.com, and I'll post them right here.  Thank you all for reading and your contributions!

Try SQLCentric at:
http://www.pearlknows.com

Comments

Posted by scotthervieux on 20 August 2009

Hey Robert.  Just a follow up about the I/O taking longer than 15 seconds issue....  I have had this happen at 2 of my clients,  followed by numerous latch 2 and latch 4 errors.  The problem is, in the event viewer - these appear at informational events (wish they were warning or errors, so that they stood out).  

After experiencing two corrupted databases, due to SQL deciding not to wait any longer, these errors occurring over and over again, then causing issues to show up in the integrity checker (that could not be resolved), I always tell my clients:

1. Watch your event logs (application and system) and report any errors or warnings.

2. If you start seeing a lot of informational events on MSSQL server, that is a warning that you should view those.  these events were occuring every few seconds to every minute.

In both my cases, the I/O latency was actually caused by SQL aware backup solutions.  One was using Lagato and the other I think was using redgate or Bridgehead.  I was not involved in the configuration of the backups, but it appears that the issues started when they implemented the backup procedure, and then stopped when they temporarily discontinued that backup procedure.  It appears they had possibly not configured their backups correctly.  I think one site was on a SAN, and the other was a physical drive in the server.

These are serious errors that can corrupt a database very quickly if left unresolved.  We determined the database to be unrecoverable when it went into recovery mode (after a restart of SQL) and did not progress any further after receiving one of these "I/O taking longer than...." errors.  the database never came out of recovery.  Luckily the client had a good backup to go to (although it was a month old), and we were able to recover the data loss.

Posted by Anonymous on 21 August 2009

Do you consider the I/O whenever a database design project is assigned? It is one of the primary design

Leave a Comment

Please register or log in to leave a comment.