I/O requests taking more than 15 seconds

Question

Post reply

I/O requests taking more than 15 seconds

Hawkeye_DBA

SSCarpal Tunnel

Points: 4365
More actions
January 13, 2009 at 12:26 pm

#133622

I've read through a lot of posts regarding this error message. Unfortunately, none of them seem to fit my situation.
Here's the specs of my Virtual Server:
- SQL Server 2005 Std Ed SP2
- Microsoft Windows Server 2003 R2 Std Ed SP2 Operating System
- 2 processors @2.33GHz each
- 3 VMware Virtual disk SCSI Disk Devices (C, E (data), and F (logs))
- 3.75GB of RAM
- Sharing resources with other virtual machines
- 14 drives on the SAN
- This virtual machine has it's own Logical Path to the storage
Here are the database specs on this virtual server
- DBA, 60 MB, SIMPLE recovery, hardly no I/O, only used during nightly maintenance windows
- fasuite, 468 MB, SIMPLE recovery, little I/O
- McGann, 1069.25 MB, FULL recovery. between 2 and 4501 bytes, users connected throughout the day
- McGann_Removed, 2.94MB, SIMPLE recovery, hardly no I/O,
- wp1281, 114.50 MB, FULL recovery, minimum I/O, users connected throughout the day
- wslogdb61, 6000 MB, SIMPLE recovery, highest I/O at around 923897 bytes, this databases logs Internet visits from all networked computers
*Note: I just recently added the last database - these I/O errors were happening before I added it so I do not attribute it's size or process directly to the problem.
Ok, so when I monitor my server using Perfmon I see normal ranges on all counters. I confirmed this as well with my SAN admin.
However, I continue to get the SQL Error Message that an I/O requests is taking longer than 15 seconds! I notice this occurs on the Lazy Writer most of the time, writing to the F (log) drive.
I also notice that it doesn't exactly coincide with my Transaction Log backups, but it comes close MOST of the time. Sometimes, however, it is random.
Also, it does not always happen on the same database request.
I would really appreciate any suggestions for what I need to monitor or look at.
Thanks in advance for your assistance!
sandy

Viewing 15 posts - 1 through 15 (of 36 total)

You must be logged in to reply to this topic. Login to reply

Todd Engen SSCrazy Eights Points: 9869 More actions · Answer 1

Sandy,

Take a look at this blog entry, it might be relevant.

http://blogs.msdn.com/chrissk/archive/2008/06/19/i-o-requests-taking-longer-than-15-seconds-to-complete-on-file.aspx

David Benoit SSC-Dedicated Points: 34562 More actions · Answer 2

A couple of queries as follows;

--Help to isolate physical disk(s) that are involved in the I/O bottlenecks

select

db_name(t1.database_id) AS DatabaseName,

t1.database_id,

t1.file_id,

substring(mf.physical_name, 1, 2) as DriveLetter,

io_stall,

io_pending_ms_ticks,

scheduler_address

from

(select * from sys.dm_io_virtual_file_stats(NULL, NULL)) t1

join sys.dm_io_pending_io_requests t2

on t1.file_handle = t2.io_handle

join sys.master_files mf

on mf.database_id = t1.database_id and mf.file_id = 1

--This is the Data File I/O output as seen in Activity Monitor with 2008 tools (captured with trace)

--Run the create temp table script first then run the query, first run will yield no results.

DECLARE @current_collection_time datetime;

SET @current_collection_time = GETDATE();

/*

CREATE TABLE #am_dbfilestats

(

collection_time datetime

, [Database] varchar(150)

, [File] varchar(500)

, [Total MB Read] numeric(18,2)

, [Total MB Written] numeric(18,2)

, [Total I/O Count] numeric(18,2)

, [Total I/O Wait Time (ms)] numeric(18,2)

, [Size (MB)] numeric(18,2)

)

*/

---- Grab a snapshot

INSERT INTO #am_dbfilestats

SELECT

@current_collection_time AS collection_time,

d.name AS [Database],

f.physical_name AS [File],

(fs.num_of_bytes_read / 1024.0 / 1024.0) [Total MB Read],

(fs.num_of_bytes_written / 1024.0 / 1024.0) AS [Total MB Written],

(fs.num_of_reads + fs.num_of_writes) AS [Total I/O Count],

fs.io_stall AS [Total I/O Wait Time (ms)],

fs.size_on_disk_bytes / 1024 / 1024 AS [Size (MB)]

FROM sys.dm_io_virtual_file_stats(default, default) AS fs

INNER JOIN sys.master_files f ON fs.database_id = f.database_id AND fs.file_id = f.file_id

INNER JOIN sys.databases d ON d.database_id = fs.database_id;

-- Get the timestamp of the previous collection time

DECLARE @previous_collection_time datetime;

SELECT TOP 1 @previous_collection_time = collection_time

FROM #am_dbfilestats

WHERE collection_time < @current_collection_time

ORDER BY collection_time DESC;

DECLARE @interval_ms int;

SET @interval_ms = DATEDIFF (millisecond, @previous_collection_time, @current_collection_time);

-- Return the diff of this snapshot and last

SELECT

cur.[Database],

cur.[File] AS [File Name],

CONVERT (numeric(28,1), (cur.[Total MB Read] - prev.[Total MB Read]) * 1000 / @interval_ms) AS [MB/sec Read],

CONVERT (numeric(28,1), (cur.[Total MB Written] - prev.[Total MB Written]) * 1000 / @interval_ms) AS [MB/sec Written],

-- protect from div-by-zero

CASE

WHEN (cur.[Total I/O Count] - prev.[Total I/O Count]) = 0 THEN 0

ELSE

ROUND((cur.[Total I/O Wait Time (ms)] - prev.[Total I/O Wait Time (ms)])

/ (cur.[Total I/O Count] - prev.[Total I/O Count]), 1)

END AS [Response Time (ms)]

FROM #am_dbfilestats AS cur

INNER JOIN #am_dbfilestats AS prev ON prev.[Database] = cur.[Database] AND prev.[File] = cur.[File]

WHERE cur.collection_time = @current_collection_time

AND prev.collection_time = @previous_collection_time

ORDER BY

[Response Time (ms)] DESC;

-- Delete the older snapshot

DELETE FROM #am_dbfilestats

WHERE collection_time != @current_collection_time;

--DROP TABLE #am_dbfilestats

Ultimately if you are having disk I/O issues this will bring it to light and show you which database / file is suffering the most.

David

@SQLTentmaker

“He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

IOs are one area that cause major problems for virtual servers.

How many virtuals are hosted on that physical server? How do those virtual server's drives map to the physical server's drives? What RAID level is the server using, and are the disks in the LUNs dedicated or shared?

The perfmon counters you need to be looking at (and personally I'd look both on the virtual and on the physical machine) are the Avg sec/read and the avg sec/write. Break them down by drive so you can see which are problematic. Anywhere over 50ms is cause for concern.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

noeld SSC Guru Points: 96590 More actions · Answer 4

What is the frequency of your Transaction log backups ?

I bet you are experiencing autogrows!

* Noel

Hawkeye_DBA SSCarpal Tunnel Points: 4365 More actions · Answer 5

Thank you all for your feedback!

I will try your suggestions and post back the results.

As for LUNs, we are actually using Blades. I spoke with our SAN admin and he stated that the SQL Server virtual server is on it's own blade so it is not sharing resources.

I checked the logspace, and all of the logs are under 50% space used so no growth issues it would seem. I wish though, it would be an easy enough solution to fix! LOL

I have also started a new perfmon trace to capture the specific drive performance.

One question - is it possible that SQL Server's alerting is a bit wacky when the server is virtual, and on a SAN to boot? Just curious.

Thanks again!

Sandy

Todd Engen SSCrazy Eights Points: 9869 More actions · Answer 6

Sandra Skaar (1/13/2009)
Thank you all for your feedback!
One question - is it possible that SQL Server's alerting is a bit wacky when the server is virtual, and on a SAN to boot? Just curious.

Not so much that it's the Virtual Server or the SAN, but the CPU and how SQL is keeping track of time could cause an erroneous error.

colin.Leversuch-Roberts SSC Guru Points: 52643 More actions · Answer 7

Ah but you have the classic combination for poor performance, VM + SAN.

Monitor the io completion latency in perfmon .. Avg Disk sec/write , Avg Disk sec/read for each disk(lun) Ignore the actual average value of this counter watch the peaks as io hits the drives, I bet under load you see the write times rise, especially on the t log drive, but probably the data drive too. These counters will also indicate contention if the graphs peak without increase in load, use the transfers/sec or disk idle time to monitor disk usuage.

You could also install the performance dashboard which will help you monitor performance.

Typically for a good performing system I'd expect io reads times to be almost unreadable 20ms is bad ) however if each write is taking 20ms and you are pushing a reasonable volume of data into your db ( oltp ) then I'd say performnce would be pretty dire. I've used this counter with sans etc. for many many years, it's your first base.

[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 8

Sandra Skaar (1/13/2009)
One question - is it possible that SQL Server's alerting is a bit wacky when the server is virtual, and on a SAN to boot? Just curious.

No. If the 15 sec message appears in the logs it does mean that at least 15 sec passed between SQL issuing an IO request to the OS and it getting a reply back from the OS. Unless the clock is running at a different speed on the virtual, it's a legit warning.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Hawkeye_DBA SSCarpal Tunnel Points: 4365 More actions · Answer 9

A little more information:

I have a dedicated pipe to the SAN, however the server is sharing resources with other processes. So at least I am not sharing request threads, however, I am sharing processing at the disk level.

Perhaps I do not need to be concerned here? If I was having real contention issues and I/O problems, wouldn't the read/write time be high more often, and during other periods of activity? I would also think I might see some Wait latches and or blocking, I have not observed this as of yet.

I get spikes of high read/write times and usually no more than 4 times a day but they result in this wonderful error message.

Is it possible that another process is using up resources from the SAN, and causing my I/O requests to take longer during that time?

During last night's perfmon I monitored my Data and Log drives. Here are the results from a 1 minute spike that resulted in a 15 second error message: (for some reason the reads/sec did not get captured per drive, just for _total drives)

F Drive:

% Disk Write Time: 192.484

% Disk Read Time: 0

E Drive:

% Disk Write Time: 68.546

% Disk Read Time: .225

Total Drives (C, E, and F):

Disk Writes/Sec: 10.74

Disk Reads/Sec: .067

Thanks again for everyone's assistance!

David Benoit SSC-Dedicated Points: 34562 More actions · Answer 10

You stated that you have a dedicated pipe to the SAN but is it possible that the volume group (SAN level) in question is shared across more than one drive / server?

I didn't see a clear answer to that in the previous posts.

David

@SQLTentmaker

“He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 11

Sandra Skaar (1/14/2009)
Total Drives (C, E, and F):
Disk Writes/Sec: 10.74
Disk Reads/Sec: .067

Not so interested in the reads/sec. What does the avg sec/read and avg sec/write look like?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Hawkeye_DBA SSCarpal Tunnel Points: 4365 More actions · Answer 12

I monitored again last night, and again, last night I the error.

One of the error times recorded is 7:04PM

This is the monitor results from 6:58:50PM to 7:04:47PM (5 minutes, 57 seconds)

I noticed that the maximum % disk time was quite high. All the other numbers appear within reason, right?

My trace failed because I forgot to set up the file path, ugh. I will run the trace tonight and post those results.

F Drive: (Log)

-----------------

Avg. Disk sec/Write: Last(.001), Average(.252), Minimum (0.000), Maximum(2.500)

Avg. Disk sec/Read: Last(.000), Average(.020), Minimum (0.534), Maximum(11.141)

Disk Writes/sec: Last(6.738), Average(5.919), Minimum (0.000), Maximum(9.740)

Disk Reads/sec: Last(.000), Average(.003), Minimum (0.000), Maximum(.067)

% Disk Time: Last(.607), Average(30.627), Minimum (0.227), Maximum(199.471) ******

E Drive: (Data)

-----------------

Avg. Disk sec/Write: Last(.010), Average(.022), Minimum (0.000), Maximum(.412)

Avg. Disk sec/Read: Last(.000), Average(.005), Minimum (0.000), Maximum(.040)

Disk Writes/sec: Last(.334), Average(.267), Minimum (0.000), Maximum(2.668)

Disk Reads/sec: Last(.000), Average(.039), Minimum (0.000), Maximum(.334)

% Disk Time: Last(.343), Average(.758), Minimum (.000), Maximum(10.988)

Hawkeye_DBA SSCarpal Tunnel Points: 4365 More actions · Answer 13

Just one more thought on this before I post more results tomorrow.

I notice this error is also occurring on my Log drive in Development, which is also a virtual server.

Could it be that my virtual server drives would be better off not split, so all my data and all of my logs go to the same drive? Since really, they are stored underneath on multiple SAN drives.

Just a thought...

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 14

Sandra Skaar (1/15/2009)
Avg. Disk sec/Write: Last(.001), Average(.252), Minimum (0.000), Maximum(2.500)
Avg. Disk sec/Read: Last(.000), Average(.020), Minimum (0.534), Maximum(11.141)

That is bad!

The sec/read and sec/write should be below 10ms. You've got an average write time there of a quarter of a second. That means, every time a log record is written to disk, it's taking on average .25 seconds just to get the IO done. That's an extra .25 seconds that every transaction has to wait before it's considered complete.

The average of the sec/read is better, but the maximum is about the worst I've ever seen. 11 seconds to read 64 kb from the disk?

I'm guessing those are spikes. Can you check the data from perfmon and get the times that the spikes occurred, and see if you can correlate them with activity on your databases? If you've got good SAN performance monitoring software, see if you can correlate them with any strange activity on the SAN.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass