I/O requests taking more than 15 seconds

  • I've read through a lot of posts regarding this error message. Unfortunately, none of them seem to fit my situation.

    Here's the specs of my Virtual Server:

    - SQL Server 2005 Std Ed SP2

    - Microsoft Windows Server 2003 R2 Std Ed SP2 Operating System

    - 2 processors @2.33GHz each

    - 3 VMware Virtual disk SCSI Disk Devices (C, E (data), and F (logs))

    - 3.75GB of RAM

    - Sharing resources with other virtual machines

    - 14 drives on the SAN

    - This virtual machine has it's own Logical Path to the storage

    Here are the database specs on this virtual server

    - DBA, 60 MB, SIMPLE recovery, hardly no I/O, only used during nightly maintenance windows

    - fasuite, 468 MB, SIMPLE recovery, little I/O

    - McGann, 1069.25 MB, FULL recovery. between 2 and 4501 bytes, users connected throughout the day

    - McGann_Removed, 2.94MB, SIMPLE recovery, hardly no I/O,

    - wp1281, 114.50 MB, FULL recovery, minimum I/O, users connected throughout the day

    - wslogdb61, 6000 MB, SIMPLE recovery, highest I/O at around 923897 bytes, this databases logs Internet visits from all networked computers

    *Note: I just recently added the last database - these I/O errors were happening before I added it so I do not attribute it's size or process directly to the problem.

    Ok, so when I monitor my server using Perfmon I see normal ranges on all counters. I confirmed this as well with my SAN admin.

    However, I continue to get the SQL Error Message that an I/O requests is taking longer than 15 seconds! I notice this occurs on the Lazy Writer most of the time, writing to the F (log) drive.

    I also notice that it doesn't exactly coincide with my Transaction Log backups, but it comes close MOST of the time. Sometimes, however, it is random.

    Also, it does not always happen on the same database request.

    I would really appreciate any suggestions for what I need to monitor or look at.

    Thanks in advance for your assistance!

    sandy

  • Sandy,

    Take a look at this blog entry, it might be relevant.

    http://blogs.msdn.com/chrissk/archive/2008/06/19/i-o-requests-taking-longer-than-15-seconds-to-complete-on-file.aspx

  • A couple of queries as follows;

    --Help to isolate physical disk(s) that are involved in the I/O bottlenecks

    select

    db_name(t1.database_id) AS DatabaseName,

    t1.database_id,

    t1.file_id,

    substring(mf.physical_name, 1, 2) as DriveLetter,

    io_stall,

    io_pending_ms_ticks,

    scheduler_address

    from

    (select * from sys.dm_io_virtual_file_stats(NULL, NULL)) t1

    join sys.dm_io_pending_io_requests t2

    on t1.file_handle = t2.io_handle

    join sys.master_files mf

    on mf.database_id = t1.database_id and mf.file_id = 1

    --This is the Data File I/O output as seen in Activity Monitor with 2008 tools (captured with trace)

    --Run the create temp table script first then run the query, first run will yield no results.

    DECLARE @current_collection_time datetime;

    SET @current_collection_time = GETDATE();

    /*

    CREATE TABLE #am_dbfilestats

    (

    collection_time datetime

    , [Database] varchar(150)

    , [File] varchar(500)

    , [Total MB Read] numeric(18,2)

    , [Total MB Written] numeric(18,2)

    , [Total I/O Count] numeric(18,2)

    , [Total I/O Wait Time (ms)] numeric(18,2)

    , [Size (MB)] numeric(18,2)

    )

    */

    ---- Grab a snapshot

    INSERT INTO #am_dbfilestats

    SELECT

    @current_collection_time AS collection_time,

    d.name AS [Database],

    f.physical_name AS [File],

    (fs.num_of_bytes_read / 1024.0 / 1024.0) [Total MB Read],

    (fs.num_of_bytes_written / 1024.0 / 1024.0) AS [Total MB Written],

    (fs.num_of_reads + fs.num_of_writes) AS [Total I/O Count],

    fs.io_stall AS [Total I/O Wait Time (ms)],

    fs.size_on_disk_bytes / 1024 / 1024 AS [Size (MB)]

    FROM sys.dm_io_virtual_file_stats(default, default) AS fs

    INNER JOIN sys.master_files f ON fs.database_id = f.database_id AND fs.file_id = f.file_id

    INNER JOIN sys.databases d ON d.database_id = fs.database_id;

    -- Get the timestamp of the previous collection time

    DECLARE @previous_collection_time datetime;

    SELECT TOP 1 @previous_collection_time = collection_time

    FROM #am_dbfilestats

    WHERE collection_time < @current_collection_time

    ORDER BY collection_time DESC;

    DECLARE @interval_ms int;

    SET @interval_ms = DATEDIFF (millisecond, @previous_collection_time, @current_collection_time);

    -- Return the diff of this snapshot and last

    SELECT

    cur.[Database],

    cur.[File] AS [File Name],

    CONVERT (numeric(28,1), (cur.[Total MB Read] - prev.[Total MB Read]) * 1000 / @interval_ms) AS [MB/sec Read],

    CONVERT (numeric(28,1), (cur.[Total MB Written] - prev.[Total MB Written]) * 1000 / @interval_ms) AS [MB/sec Written],

    -- protect from div-by-zero

    CASE

    WHEN (cur.[Total I/O Count] - prev.[Total I/O Count]) = 0 THEN 0

    ELSE

    ROUND((cur.[Total I/O Wait Time (ms)] - prev.[Total I/O Wait Time (ms)])

    / (cur.[Total I/O Count] - prev.[Total I/O Count]), 1)

    END AS [Response Time (ms)]

    FROM #am_dbfilestats AS cur

    INNER JOIN #am_dbfilestats AS prev ON prev.[Database] = cur.[Database] AND prev.[File] = cur.[File]

    WHERE cur.collection_time = @current_collection_time

    AND prev.collection_time = @previous_collection_time

    ORDER BY

    [Response Time (ms)] DESC;

    -- Delete the older snapshot

    DELETE FROM #am_dbfilestats

    WHERE collection_time != @current_collection_time;

    --DROP TABLE #am_dbfilestats

    Ultimately if you are having disk I/O issues this will bring it to light and show you which database / file is suffering the most.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • IOs are one area that cause major problems for virtual servers.

    How many virtuals are hosted on that physical server? How do those virtual server's drives map to the physical server's drives? What RAID level is the server using, and are the disks in the LUNs dedicated or shared?

    The perfmon counters you need to be looking at (and personally I'd look both on the virtual and on the physical machine) are the Avg sec/read and the avg sec/write. Break them down by drive so you can see which are problematic. Anywhere over 50ms is cause for concern.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • What is the frequency of your Transaction log backups ?

    I bet you are experiencing autogrows!


    * Noel

  • Thank you all for your feedback!

    I will try your suggestions and post back the results.

    As for LUNs, we are actually using Blades. I spoke with our SAN admin and he stated that the SQL Server virtual server is on it's own blade so it is not sharing resources.

    I checked the logspace, and all of the logs are under 50% space used so no growth issues it would seem. I wish though, it would be an easy enough solution to fix! LOL

    I have also started a new perfmon trace to capture the specific drive performance.

    One question - is it possible that SQL Server's alerting is a bit wacky when the server is virtual, and on a SAN to boot? Just curious.

    Thanks again!

    Sandy

  • Sandra Skaar (1/13/2009)


    Thank you all for your feedback!

    One question - is it possible that SQL Server's alerting is a bit wacky when the server is virtual, and on a SAN to boot? Just curious.

    Not so much that it's the Virtual Server or the SAN, but the CPU and how SQL is keeping track of time could cause an erroneous error.

  • Ah but you have the classic combination for poor performance, VM + SAN.

    Monitor the io completion latency in perfmon .. Avg Disk sec/write , Avg Disk sec/read for each disk(lun) Ignore the actual average value of this counter watch the peaks as io hits the drives, I bet under load you see the write times rise, especially on the t log drive, but probably the data drive too. These counters will also indicate contention if the graphs peak without increase in load, use the transfers/sec or disk idle time to monitor disk usuage.

    You could also install the performance dashboard which will help you monitor performance.

    Typically for a good performing system I'd expect io reads times to be almost unreadable 20ms is bad ) however if each write is taking 20ms and you are pushing a reasonable volume of data into your db ( oltp ) then I'd say performnce would be pretty dire. I've used this counter with sans etc. for many many years, it's your first base.

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Sandra Skaar (1/13/2009)


    One question - is it possible that SQL Server's alerting is a bit wacky when the server is virtual, and on a SAN to boot? Just curious.

    No. If the 15 sec message appears in the logs it does mean that at least 15 sec passed between SQL issuing an IO request to the OS and it getting a reply back from the OS. Unless the clock is running at a different speed on the virtual, it's a legit warning.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • A little more information:

    I have a dedicated pipe to the SAN, however the server is sharing resources with other processes. So at least I am not sharing request threads, however, I am sharing processing at the disk level.

    Perhaps I do not need to be concerned here? If I was having real contention issues and I/O problems, wouldn't the read/write time be high more often, and during other periods of activity? I would also think I might see some Wait latches and or blocking, I have not observed this as of yet.

    I get spikes of high read/write times and usually no more than 4 times a day but they result in this wonderful error message.

    Is it possible that another process is using up resources from the SAN, and causing my I/O requests to take longer during that time?

    During last night's perfmon I monitored my Data and Log drives. Here are the results from a 1 minute spike that resulted in a 15 second error message: (for some reason the reads/sec did not get captured per drive, just for _total drives)

    F Drive:

    % Disk Write Time: 192.484

    % Disk Read Time: 0

    E Drive:

    % Disk Write Time: 68.546

    % Disk Read Time: .225

    Total Drives (C, E, and F):

    Disk Writes/Sec: 10.74

    Disk Reads/Sec: .067

    Thanks again for everyone's assistance!

  • You stated that you have a dedicated pipe to the SAN but is it possible that the volume group (SAN level) in question is shared across more than one drive / server?

    I didn't see a clear answer to that in the previous posts.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • Sandra Skaar (1/14/2009)


    Total Drives (C, E, and F):

    Disk Writes/Sec: 10.74

    Disk Reads/Sec: .067

    Not so interested in the reads/sec. What does the avg sec/read and avg sec/write look like?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I monitored again last night, and again, last night I the error.

    One of the error times recorded is 7:04PM

    This is the monitor results from 6:58:50PM to 7:04:47PM (5 minutes, 57 seconds)

    I noticed that the maximum % disk time was quite high. All the other numbers appear within reason, right?

    My trace failed because I forgot to set up the file path, ugh. I will run the trace tonight and post those results.

    F Drive: (Log)

    -----------------

    Avg. Disk sec/Write: Last(.001), Average(.252), Minimum (0.000), Maximum(2.500)

    Avg. Disk sec/Read: Last(.000), Average(.020), Minimum (0.534), Maximum(11.141)

    Disk Writes/sec: Last(6.738), Average(5.919), Minimum (0.000), Maximum(9.740)

    Disk Reads/sec: Last(.000), Average(.003), Minimum (0.000), Maximum(.067)

    % Disk Time: Last(.607), Average(30.627), Minimum (0.227), Maximum(199.471) ******

    E Drive: (Data)

    -----------------

    Avg. Disk sec/Write: Last(.010), Average(.022), Minimum (0.000), Maximum(.412)

    Avg. Disk sec/Read: Last(.000), Average(.005), Minimum (0.000), Maximum(.040)

    Disk Writes/sec: Last(.334), Average(.267), Minimum (0.000), Maximum(2.668)

    Disk Reads/sec: Last(.000), Average(.039), Minimum (0.000), Maximum(.334)

    % Disk Time: Last(.343), Average(.758), Minimum (.000), Maximum(10.988)

  • Just one more thought on this before I post more results tomorrow.

    I notice this error is also occurring on my Log drive in Development, which is also a virtual server.

    Could it be that my virtual server drives would be better off not split, so all my data and all of my logs go to the same drive? Since really, they are stored underneath on multiple SAN drives.

    Just a thought...

  • Sandra Skaar (1/15/2009)


    Avg. Disk sec/Write: Last(.001), Average(.252), Minimum (0.000), Maximum(2.500)

    Avg. Disk sec/Read: Last(.000), Average(.020), Minimum (0.534), Maximum(11.141)

    That is bad!

    The sec/read and sec/write should be below 10ms. You've got an average write time there of a quarter of a second. That means, every time a log record is written to disk, it's taking on average .25 seconds just to get the IO done. That's an extra .25 seconds that every transaction has to wait before it's considered complete.

    The average of the sec/read is better, but the maximum is about the worst I've ever seen. 11 seconds to read 64 kb from the disk?

    I'm guessing those are spikes. Can you check the data from perfmon and get the times that the spikes occurred, and see if you can correlate them with activity on your databases? If you've got good SAN performance monitoring software, see if you can correlate them with any strange activity on the SAN.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 1 through 15 (of 36 total)

You must be logged in to reply to this topic. Login to reply