
Debugging query timeouts by sampling blocking – part 2


In part 1 I explained how I was hunting for the root cause of some query timeouts that happened every once in a while. I knew for certain that the affected queries were waiting for 30 seconds for an intent exclusive lock on something before the timeout occurred.
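
As a quick aside: if you just need a point-in-time look at what a request is waiting on right now, sys.dm_exec_requests already exposes the wait type, wait time and blocking session id. Something along these lines (just a sketch – it only catches blocking that is in flight at the exact moment you run it):

--Point-in-time view of requests that are currently blocked
SELECT
    req.session_id,
    req.blocking_session_id,
    req.wait_type,
    req.wait_time,      --milliseconds spent on the current wait
    req.wait_resource,
    req.command
FROM sys.dm_exec_requests req
WHERE req.blocking_session_id <> 0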

Our good old friend sys.sysprocesses actually gives us this kind of information. Let me use the same setup as I did in part 1 to trigger a blocking scenario. From connection 1 I’m executing the following, which leaves an open transaction that holds a lock on the DummyTable:

 

CREATE TABLE DummyTable (col1 INT)
INSERT INTO DummyTable (col1) VALUES (1)
GO
BEGIN TRAN
GO
DELETE FROM DummyTable
WAITFOR DELAY '00:01:00'
GO

 

From connection 2 I’m executing the following, which will keep running until connection 1 either performs a commit or a rollback:

DELETE FROM DummyTable
GO

 

From a third connection I’m running this query against sys.sysprocesses:

SELECT
    procs.spid,
    procs.blocked as BlockingSPID,
    DB_NAME(procs.dbid) as DatabaseName,
    procs.program_name,
    procs.loginame,
    procs.waitresource,
    procs.lastwaittype
FROM sys.sysprocesses procs
WHERE spid IN (54,55)

 

[Screenshot: sys.sysprocesses output for SPID 54 and SPID 55]

SPID = 54 is my connection 1, and SPID = 55 is my connection 2.

The output shows that my connection 2 (SPID 55) is blocked by SPID 54 while waiting to acquire a LCK_M_U lock on the resource RID:6:1:169:0. This is pretty useful information – if we could just get the SQL text returned as well. And luckily we can.
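
Before we get to the SQL text, a quick detour: if you want to look at the lock itself, sys.dm_tran_locks shows both the granted lock held by connection 1 and the waiting request from connection 2. A minimal sketch (the SPID values are of course the ones from my example):

SELECT
    locks.request_session_id,
    locks.resource_type,
    locks.resource_description,
    locks.request_mode,     --e.g. X for the holder, U for the waiter
    locks.request_status    --GRANT vs. WAIT
FROM sys.dm_tran_locks locks
WHERE locks.request_session_id IN (54, 55)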

We can extend the query with an OUTER APPLY of sys.dm_exec_sql_text. This will give us the text of the queries, so let’s see how that goes:

 

SELECT
    procs.spid,
    procs.blocked as BlockingSPID,
    DB_NAME(procs.dbid) as DatabaseName,
    procs.program_name,
    procs.loginame,
    procs.waitresource,
    procs.lastwaittype,
    text
FROM sys.sysprocesses procs
OUTER APPLY sys.dm_exec_sql_text(sql_handle)
WHERE spid IN (54,55)

 

[Screenshot: the same output, now including the text column with the blocking statement]

Now we are actually able to see the exact query that is causing the blocking.
Because I knew that the blocking was going on for at least 30 seconds, I simply needed to sample this data every 15 seconds until I had caught my culprit. I created a temporary table to hold the data, and then created a loop that would run until I stopped it manually. By creating a recursive query, I could also find the blocking chain – so the final query looked like this:

CREATE TABLE #Blocking
(
    Timestamp DATETIME,
    SPID SMALLINT,
    BlockingSPID INT,
    DatabaseName VARCHAR(255),
    LoginName VARCHAR(255),
    ProgramName VARCHAR(255),
    BlockingStatement VARCHAR(MAX),
    Waitresource VARCHAR(500),
    lastwaittype VARCHAR(500),
    RowNo BIGINT,
    LevelRow INT
)
--Run until manually stopped
WHILE (1=1)
BEGIN
    ;WITH Processes
    AS
    (
    SELECT
        procs.spid,
        procs.blocked AS BlockingSPID,
        DB_NAME(procs.dbid) AS DatabaseName,
        procs.program_name AS ProgramName,
        procs.loginame AS LoginName,
        procs.waitresource,
        procs.lastwaittype,
        CAST(text AS VARCHAR(MAX)) AS Text
    FROM
        sys.sysprocesses procs
        CROSS APPLY sys.dm_exec_sql_text (sql_handle)
    WHERE
        procs.spid > 50
    ), Blocking
    AS
     (
        --Define base of recursive query
        SELECT
            procs.SPID,
            procs.BlockingSPID,
            procs.DatabaseName,
            procs.LoginName,
            procs.ProgramName,
            procs.Text,
            procs.Waitresource,
            procs.lastwaittype,
            ROW_NUMBER() OVER(ORDER BY procs.SPID) as RowNo,
            0 AS LevelRow
        FROM
            Processes procs
            JOIN Processes procs2 ON procs.SPID = procs2.BlockingSPID
        WHERE
            procs.BlockingSPID = 0
        UNION ALL
        --UNION the recursive step
        SELECT
            ProcsR.SPID,
            ProcsR.BlockingSPID,
            ProcsR.DatabaseName,
            ProcsR.LoginName,
            ProcsR.ProgramName,
            ProcsR.Text,
            ProcsR.Waitresource,
            ProcsR.lastwaittype,
            d.RowNo,
            d.LevelRow + 1
        FROM
            Processes ProcsR
            JOIN Blocking d ON ProcsR.BlockingSPID = d.SPID
         WHERE
            ProcsR.BlockingSPID > 0
    )
    INSERT INTO #Blocking
    SELECT GETDATE() as Timestamp, * FROM Blocking
    --Wait for 15 seconds before starting over
    WAITFOR DELAY '00:00:15'
END

 

This query uses multiple levels of CTEs. The first level, named Processes, is pretty much identical to the query shown before. The next CTE, called Blocking, creates the recursive structure that makes it possible to identify the blocking chain. My connections 1 and 2 were still running, so running the above query for just a few seconds sampled the data I wanted:

--After stopping the sampling manually:
SELECT * FROM #Blocking
ORDER BY Timestamp, RowNo, LevelRow

 

[Screenshot: contents of #Blocking, ordered by Timestamp, RowNo and LevelRow]

All rows from the same sample share the same value in the Timestamp column. Within a sample, every statement in the same blocking chain shares the same RowNo, and LevelRow orders the statements within that chain.
In my example I only ran the sampling query long enough to get one sample, and the only blocking happening on the system was my two connections, so only one blocking chain was captured – the one with RowNo = 1.
LevelRow = 0 is the root of the blocking chain, and in my example this is my connection 1.
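
On a busy system with many samples you will probably not want to eyeball every row. A simple follow-up query along these lines (just a sketch – adjust it to whatever you are hunting for) lists the root blockers and how many samples they showed up in:

SELECT
    BlockingStatement,
    COUNT(DISTINCT Timestamp) AS SamplesSeenIn,
    MIN(Timestamp) AS FirstSeen,
    MAX(Timestamp) AS LastSeen
FROM #Blocking
WHERE LevelRow = 0   --only the root of each blocking chain
GROUP BY BlockingStatement
ORDER BY SamplesSeenIn DESC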

So by running the above query on the production system, and once again waiting for the timeouts to occur, I was actually able to identify the root cause of all the timeouts I was debugging. The blocking query was holding a table lock for a long time, which perfectly explained what was happening. Fixing the root cause is another story, and out of scope for this post.

Wrap up

A system was experiencing intermittent query timeouts. My debugging process was:

  1. Use a classic server side trace to catch the queries that timed out, to figure out whether they were using up resources for the full 30 seconds or waiting for something.
  2. It turned out they were waiting for something, so I ran an XEvent session to track all wait events with a duration > 200 ms (part 1 of this blog post series).
  3. The XEvent session revealed that the queries were waiting to acquire an intent exclusive lock, so the next step was to sample blocking chains to find the root query (this blog post).
  4. The blocking sampling revealed the query that was causing the heavy table locks, and from there it was easy to go ahead and fix the problem.

But remember…

This debugging process will not hunt down the root cause in every scenario. The outcome of each step determines what the next step should be. If my initial server side trace had shown heavy CPU usage for the entire 30 seconds, my next step would have been to look into execution plans and the like – something along the lines of the sketch below – and it wouldn’t have made sense to use an XEvent session to capture wait_info and continue down that path.
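
For example, a sketch of looking at the most CPU-hungry cached statements together with their plans (this was not part of the actual debugging session, it just illustrates that branch of the decision tree):

SELECT TOP (20)
    qs.total_worker_time / 1000 AS TotalCpuMs,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS StatementText,
    qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_worker_time DESC
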
Nevertheless, I hope this little blog post series gave you an idea of how you could hunt down your own nasty queries.
