Waiting tasks quickly shows you where to look for the bottlenecks. This is only one of the many times that I didn’t have to go beyond identifying the waiting tasks to quickly troubleshoot a production performance issue.
I have encountered a case where an application filter got stuck in loading status and not returning any data (well, it’s stuck). This was impacting the overall performance of the application. Normally, I would start with my simple heuristic troubleshooting method to get an idea of what’s going on in the backend. But since the issue is quite specific and the application module is identified in this case, I can fairly conclude that the reason for the stuck procedure is due to requests waiting for resources or whatever they are waiting for.
Thanks to Paul Randal (B | T) for his sys.dm_os_waiting_tasks script. Running that script revealed something like the Waiting Tasks you see below (snapshot only, showed different wait time, status and waits at different points):
Paul was even kind enough to share a quick insight into what could possibly be causing all these locks:
Looks like you’ve got an offline index operation happening maybe. Something holding a SCH_M lock.
— Paul Randal (@PaulRandal) February 2, 2018
So, that’s the first thing that I checked. Using Paul’s comprehensive SQL Server Wait Types Library, I looked for the definition of the waits that popped out:
IO_COMPLETION – “This wait type represents a variety of synchronous read and write operations in data files are not related to tables, plus reads from the transaction log.” From the snapshot returned by the query, I couldn’t really make anything out of this wait. From what I saw, this wait is incurring lots of reads and CPU resources. The SPID was running for more than 4 hours at this point. I looked up for more info on this wait, and it usually occurs in any of the following situations: Long-running I/O-bound operations, including BACKUP, ALTER DATABASE, CREATE DATABASE or database auto-growth. There were no such database maintenance operations within that timeframe though.
What the Waiting Tasks revealed, though, was that the top SPID wasn’t stuck. The task was still running but it’s just taking time and, therefore, taking substantial resources with it. At the rate it was going, it could have run for another 4-6 hours which could bring the SQL Server to its knees and render the application unusable from that point on.
LCK_M_SCH_S – “This wait type is when a thread is waiting to acquire a Schema Stability (also called Schema Share) lock on a resource and there is at least one other lock in an incompatible mode granted on the resource to a different thread.” This wait on the SPID shown in the query snapshot is quite obvious. This is a lock caused by the SPID on top of the chain.
LCK_M_IS – “This wait type is when a thread is waiting to acquire an Intent Shared lock on a resource and there is at least one other lock in an incompatible mode granted on the resource to a different thread.” Again, just like the above lock wait, this is caused by what’s on top of the blocking chain.
So what could have possibly caused the above waits? The primary cause of the blocking chain is the SPID on top that was on that IO_COMPLETION wait at the time the query snapshot was taken (see screen capture above). One of the stored procedures involved in this issue is writing data into a text file (orders). I cannot say that has something to do with the IO_COMPLETION.
I checked the metrics in Solarwinds DPA, and it showed a somewhat high signal wait percentage around the time this issue has occurred. The percentage around the same time was about 42%. Solarwinds has an info about this signal wait percentage that says “Anything over 20% would indicate that there is a possible CPU resource bottleneck.”
The spike in resource demand due to high usage of the application at the same time the database was writing data into the text files can certainly affect the CPU resources. Plus, the top SPID that caused all these blocking did not only incur most of the resources but also blocked a table that was also used by the other stored procedures down the chain, thus that explained the LCK_M_SCH_S & LCK_M_IS waits.
At that point, the action to take to fix the performance issue was becoming obvious. The SPID on top of the blocking chain didn’t seem to indicate that it would complete soon. There was only one command to make the application work again: KILL SPID.
Once the KILL command was issued against the top SPID, all the other dependent locks were released and the application was performing well again almost instantly after that.
But of course, you have to take extra precautions before you pull the plug on any task. If I kill the task, will it spawn another data issue? Will it cause other processes to fail? That’s your call.
KILL has saved the day again.