I recently had an incident where I was looking into the cause of a long running process for a client. It was a batch process that ran overnight, and execution time had been growing until it was now taking over 4 hours.
The database involved is a kind of staging area for MI, so I wanted to look at any heavy queries being executed against the database – I knew that these would all be part of the batch process. I asked them to run the following query for me:
SELECT TOP 50 DB_NAME(qt.dbid) AS DatabaseName, qt.TEXT, qs.execution_count, qs.total_worker_time, qs.total_physical_reads, qs.total_logical_writes, qs.total_logical_reads, qs.total_elapsed_time FROM sys.dm_exec_query_stats qs CROSS apply sys.dm_exec_sql_text(qs.sql_handle) qt WHERE ISNULL(qt.dbid,0) = DB_ID();
I got less than 50 rows back so figured I had everything covered, but the total elapsed time across all the queries was less than ten minutes, I knew the server hadn’t been rebooted for about a month so potentially that could be as little as 20 seconds of query execution a night. Even if the full 10 minutes was from the last 24 hours that still didn’t account for the long run times.
So, I convinced myself it wasn’t the queries and started looking at other things they could check. It was a long list. Lots of theories revolved around the idea that something else was running on the host at the same time.
As part of that I decided to get them to run an extended events trace so I could look at exact timings and waits during the batch process (and anything else that might be going on in SQL). I got a lot of data back which took a fair while to index and load. Figuring I’d already ruled out query execution times for the batch process itself, I expected the analysis was going to be tricky to drill down on the problem, however as soon as I looked for heavy queries running during the execution window I found one specific stored procedure with the following metrics:
Thing is, these figures were for a stored procedure that was part of the batch run. I date for this in the original output but the figures for query execution duration and CPU were tiny in comparison to these. So, were the query execution stats wrong?
No, they weren’t, dm_exec_query_stats only captures figures for cached query plans, so you might have a query that’s not caching a plan for whatever reason – or might have something that’s doing work that’s not actually a query, imagine a while loop that’s trying to find all the prime numbers under a trillion without using sets or queries. That would consume a lot of CPU and duration but wouldn’t show in in the query execution stats – because it’s not a query.
In this case it turned out the stored procedure was:
- Trying to process over 2 million rows using a cursor
- For each row it would attempt to execute a stored procedure that didn’t exist using dynamic SQL
- It would catch the failure and update an error message against the row – using the cursor
- Every day it would attempt to process the failed rows again
- Every day more rows would be inserted to the table that would fail
In particular it was the use of the cursor that was most expensive, and not being captured in the execution stats.
To be kind to the original developer, it was a clever generic solution to a specific problem. It wasn’t optimal from a performance point of view, but they only expected it to have to deal with a few rows daily.
Then the volume of data increased.
Then a subsequent developer made a change but didn’t know they needed to add a new supporting stored procedure. The resulting issue didn’t get picked up in testing.
The number of failed rows had been growing every night for a couple of years, and the execution time creeping up until it went exponential.
This post isn’t about bad code though, it’s about remembering that not all resource consumption on our SQL boxes comes from cached queries, so when trying to troubleshoot remember that query execution stats are one place to look, but don’t cover everything that might be being executed in T-SQL.