Large Waits: EventID 833 plus PAGELATCH_SH and LCK_M_U

  • I have one SQL Server 2008 R2 server (10.50.4339 - SP2 with the 3045312 Security Update). It hosts several databases from a particular 3rd-party vendor whose database schemas are not the best. They rely heavily on RBAR and Dynamic SQL processes because much of their schema is created programmatically. For example, tables are 'partitioned' by creating new tables appended with timestamps or incremental index values, so the table names cannot be known before runtime; ergo they have to look up values and generate Dynamic SQL statements at runtime. Sporadically (read: infrequently) the SQL Server instance will generate an EventID 833 in the Windows Application Log:

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [xxxxxxxx] in database [xxxxxxxx] (x). The OS file handle is 0x0000000000000824. The offset of the latest long I/O is: 0x00008d5f3b4000

    My first thought is of course that this is somehow I/O related, but the evidence in SolarWinds DPA for the most recent example does not support this.

    This shows the unexpected spike in a ten minute period (001-Waits.jpg):

    This drills into the time period when the spike occurs (002-Waits.jpg):

    This shows the top waits for the time period when the spike occurs (003-Waits.jpg):

    The top two waits by a significant margin are PAGELATCH_SH and LCK_M_U. I have read that PAGELATCH_SH can be indicative of contention within tempdb, but the available history from DPA also does not support this. I added an alert to monitor sys.dm_os_waiting_tasks where the wait_type is like 'PAGE%LATCH_%' and resource_description like '2:%'.
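
    For reference, the alert condition is roughly the following (an untested sketch; the '2:%' filter on resource_description matches pages in tempdb, database_id 2):

        -- Sketch of the tempdb latch alert: any task currently waiting on a
        -- PAGELATCH/PAGEIOLATCH against a page in tempdb (database_id 2).
        SELECT wt.session_id,
               wt.wait_type,
               wt.wait_duration_ms,
               wt.resource_description
        FROM   sys.dm_os_waiting_tasks AS wt
        WHERE  wt.wait_type LIKE 'PAGE%LATCH_%'
          AND  wt.resource_description LIKE '2:%';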

    Any other ideas? I am kind of shooting in the dark here at the moment. Thank you!

  • You need to look for the 1 occurrence(s) of [an] I/O request that the message reports. That I/O request lasted at least 15 seconds. A 10 second SolarWinds polling interval is barely adequate to detect even part of such a wait. In 003-Waits.jpg, PAGEIOLATCH_SH accumulated ~150 seconds of waits from a multitude of I/O requests. It should not be hard to imagine one 15 second wait for one I/O out of that multitude. It may be the only such wait, or it may be that hundreds of other spids waited 14.9 seconds or less. I suspect the former is closer to what happened than the latter.

    To catch a 15 second wait, the polling interval should ideally be 5 seconds or less, both for perfmon counters and for sys.dm_os_wait_stats (assuming that is what SolarWinds collects). The perfmon counters to watch (at a 5 second or smaller interval) are Disk Bytes/sec, Avg. Disk Bytes/Transfer, Avg. Disk sec/Transfer, and Avg. Disk Queue Length. Because an averaged counter's peak values (such as a 15 second wait) get averaged with the sub-15-second waits in the same interval, it is unlikely you will ever catch a full 15 second wait, even with a 5 second polling interval. Instead of, or in addition to, the /Transfer counters, you might also want to collect their /Read and /Write equivalents.
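
    If SolarWinds cannot poll that often, something along these lines can capture sys.dm_os_wait_stats every 5 seconds so you can diff consecutive snapshots afterwards (an untested sketch; dbo.WaitStatsSnapshot is a scratch table you would create yourself, not anything built in):

        -- Sketch: snapshot cumulative wait stats every 5 seconds.
        -- dbo.WaitStatsSnapshot is an assumed scratch table.
        CREATE TABLE dbo.WaitStatsSnapshot
        (
            capture_time        DATETIME     NOT NULL,
            wait_type           NVARCHAR(60) NOT NULL,
            waiting_tasks_count BIGINT       NOT NULL,
            wait_time_ms        BIGINT       NOT NULL,
            signal_wait_time_ms BIGINT       NOT NULL
        );

        WHILE 1 = 1
        BEGIN
            INSERT dbo.WaitStatsSnapshot
            SELECT GETDATE(), wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
            FROM   sys.dm_os_wait_stats
            WHERE  wait_type IN (N'PAGEIOLATCH_SH', N'PAGEIOLATCH_EX', N'WRITELOG',
                                 N'PAGELATCH_SH', N'LCK_M_U');

            WAITFOR DELAY '00:00:05';   -- 5 second polling interval
        END;

    Differencing consecutive rows per wait_type gives the per-interval waits, which is what you need to spot a single 15 second spike.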

    Singular waits are usually caused by a singular event, such as a data file autogrowth or (likely worse) a log file autogrowth. Be sure to check. Also notice that WRITELOG and PAGEIOLATCH_EX waits are present (I suggest initially lumping all I/O waits together).
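
    One way to check for autogrowths around the spike is to read the default trace, which records them unless someone has disabled it (an untested sketch; it only reads the current rollover file onward):

        -- Sketch: pull file auto-grow / auto-shrink events from the default trace.
        -- EventClass 92/93 = data/log file auto grow, 94/95 = data/log file auto shrink.
        DECLARE @trace_path NVARCHAR(260);

        SELECT @trace_path = [path]
        FROM   sys.traces
        WHERE  is_default = 1;

        SELECT t.StartTime,
               t.DatabaseName,
               t.FileName,
               te.name                  AS event_name,
               t.Duration / 1000        AS duration_ms,     -- Duration is in microseconds
               t.IntegerData * 8 / 1024 AS growth_mb        -- IntegerData is 8 KB pages
        FROM   sys.fn_trace_gettable(@trace_path, DEFAULT) AS t
        JOIN   sys.trace_events AS te
               ON te.trace_event_id = t.EventClass
        WHERE  t.EventClass IN (92, 93, 94, 95)
        ORDER BY t.StartTime DESC;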

    Let's imagine one spid clocking up a 15 second wait for one I/O while holding incompatible locks. Let's next imagine multitudes of other spids piling up behind that spid, waiting to update. They would clock up many times more LCK_M_U wait time than that one spid spent waiting on its PAGEIOLATCH. Seems reasonable.

    As far as the PAGELATCH_SH goes, you probably have a hot page that is being frequently read (and rarely fetched from disk). PAGELATCH_SH may be irrelevant to the I/O stall; it may just be extreme "noise" around what is a very weak (1 occurrence) "signal". The same could apply to LCK_M_U - it might just be normal noise in here. On the other hand, it could be a heavily read and infrequently written page that finally needed to be flushed, but got stuck for 15 seconds.
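
    To see which page the PAGELATCH_SH waiters are stacked up on, something like this can help (an untested sketch; resource_description is formatted dbid:fileid:pageid, and DBCC PAGE, while undocumented, can then tell you which object owns that page):

        -- Sketch: group current PAGELATCH waits by the page being contended.
        SELECT wt.wait_type,
               wt.resource_description,          -- dbid:fileid:pageid
               COUNT(*)                 AS waiting_tasks,
               SUM(wt.wait_duration_ms) AS total_wait_ms
        FROM   sys.dm_os_waiting_tasks AS wt
        WHERE  wt.wait_type LIKE 'PAGELATCH%'
        GROUP BY wt.wait_type, wt.resource_description
        ORDER BY total_wait_ms DESC;

        -- Then, for a specific page, identify the owning object. DBCC PAGE is
        -- undocumented; the "Metadata: ObjectId" value in its output can be
        -- resolved with OBJECT_NAME(). The numbers below are hypothetical.
        DBCC TRACEON (3604);
        DBCC PAGE (5, 1, 12345, 0);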

    If you want to determine whether I/O is stalled in Windows (and thus SQL Server) or in the hardware (including its Windows drivers), you need to monitor waits in storport.sys. Follow http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx. When SQL Server next raises its stalled I/O message, if (at the same time) the storport trace catches more-than-generous 15 millisecond waits for 10 byte SCSI commands, the I/O wait is being accumulated in the hardware (or its Windows driver). On the other hand, if the storport trace catches no waits when SQL Server raises the stalled I/O warning, the I/O wait is being accumulated in Windows (and thus SQL Server).

    Before implementing a storport trace, I recommend you ensure database autogrowths are reserved for emergency use alone, and are set to a fixed size in MB (not a percent). Because all I/O to a transaction log file must halt while an autogrowth is happening, it is important to determine the I/O throughput capability of the disk hosting the log file. If the disk can handle 100 MB/sec, a 500 MB log file growth will last 5 seconds. If your clients impose a 5 second timeout on their transactions, they will fail. If your clients instead use a more typical 30 second timeout, an additional 5 second wait may (or may not) be noticed by them. If databases or their files are being shrunk only to grow again, I would characterize I/O throughput as being played with like a yo-yo (and an occasionally stalled I/O could be the result). To avoid autogrowths, schedule manual growths, and don't repeatedly shrink.
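
    To review the current growth settings across all databases in one pass (an untested sketch; size and growth are stored in 8 KB pages unless is_percent_growth = 1):

        -- Sketch: list file growth settings; look for percent growth and tiny fixed growths.
        SELECT DB_NAME(mf.database_id) AS database_name,
               mf.name                 AS logical_file_name,
               mf.type_desc,
               mf.size * 8 / 1024      AS current_size_mb,
               CASE WHEN mf.is_percent_growth = 1
                    THEN CAST(mf.growth AS VARCHAR(10)) + ' percent'
                    ELSE CAST(mf.growth * 8 / 1024 AS VARCHAR(10)) + ' MB'
               END                     AS growth_setting
        FROM   sys.master_files AS mf
        ORDER BY database_name, mf.type_desc;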

    If SolarWinds catches statement or batch durations, and if the vast majority of statements last less than 15 seconds, any statement or batch that lasted 15 seconds or more should be considered in light of the 3 bar charts' views of sys.dm_os_wait_stats. Such statements or batches could be the culprit, the victim, or both.
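
    If SolarWinds does not capture durations, the plan cache can give a rough, cache-lifetime-only view of statements whose worst execution exceeded 15 seconds (an untested sketch; times in sys.dm_exec_query_stats are in microseconds):

        -- Sketch: cached statements whose worst observed duration exceeded 15 seconds.
        SELECT TOP (20)
               qs.max_elapsed_time / 1000000.0 AS max_elapsed_sec,
               qs.execution_count,
               qs.last_execution_time,
               SUBSTRING(st.text,
                         qs.statement_start_offset / 2 + 1,
                         (CASE qs.statement_end_offset
                               WHEN -1 THEN DATALENGTH(st.text)
                               ELSE qs.statement_end_offset
                          END - qs.statement_start_offset) / 2 + 1) AS statement_text
        FROM   sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        WHERE  qs.max_elapsed_time > 15000000   -- 15 seconds in microseconds
        ORDER BY qs.max_elapsed_time DESC;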

  • Thanks SoHelpMeCodd! I'll review your response once I'm back in the office next week. Till then Happy New Year!!
