Recursive CTE issue when there is a chain of blocking SPIDs

  • Scenario: When one SPID is blocking the second, and the second is blocking the third ... and then the 33rd is blocking the first one, thus forming circular blocking chain.

    QUESTION 1:  is such situation POSSIBLE? Some experts say no, and each circular blocking gets quickly resolved by a Deadlock/victim process. I believe it is possible, that is why my 1st SQL below hangs forever, EATING UP server resources:

    MaxRecursion option - 0 (zero) means INDEFINITE, and thus the execution of the CTE Part travels through very very long (but are they circular???) chain of blocking and blocked SPIDs.

    I am running the below CTE against the WhoisActiveB table (that is populated every 2 minutes via this call of sp_whoIsActive:

    This is how the table WhoisActiveB is populated:

    EXEC sp_WhoIsActive @get_outer_command = 1,  @get_plans = 1,@find_block_leaders = 1,

    @sort_order = '[blocked_session_count] DESC', @destination_table = WhoIsActiveTBL;

    And then the following query  (#1) ris un against the results saved in table WhoIsActiveTBL, to return 2 resultsets: 1 - only blocking CHAINS,  2 = non-blocking SPIDs:

    (the second block of code is our fix to the first one)

    1. (before changing option Max Recursion to 50)
       BEGIN
      ; WITH parents_CTE AS (
      SELECT P.[dd hh:mm:ss.mss]
      ,P.[session_id]
      ,P.[sql_text]
      ,P.[sql_command]
      ,P.[login_name]
      ,P.[wait_info]
      -- ,CAST(LTRIM(RTRIM(P.[CPU])) AS BIGINT) 'CPU'
      ,LTRIM(RTRIM(P.[CPU])) 'CPU'
      ,P.[tempdb_allocations]
      ,P.[tempdb_current]
      ,P.[blocking_session_id]
      ,P.[reads]
      ,P.[writes]
      ,P.[physical_reads]
      ,P.[query_plan]
      ,P.[used_memory]
      ,P.[status]
      ,P.[open_tran_count]
      ,P.[percent_complete]
      ,P.[host_name]
      ,P.[database_name]
      ,P.[program_name]
      ,P.[start_time]
      ,P.[login_time]
      ,P.[request_id]
      ,P.[collection_time]
      FROM WhoIsActiveB P(nolock)
      WHERE ISNULL(blocking_session_id,0)=0
      AND ISNULL(open_tran_count,0)>0
      AND (p.start_time> @SPID_StartTime)
      AND (p.start_time < @SPID_EndStartTime)
      AND (@Collection_timeStart is null OR collection_time > @Collection_timeStart)
      AND (@Collection_timeEnd is null OR collection_time < @Collection_timeEnd)
      -- AND (p.[database_name] IS NULL OR p.[database_name] = ISNULL(@DatabaseName, p.[database_name]))

      UNION ALL
      SELECT C.[dd hh:mm:ss.mss]
      ,C.[session_id]
      ,C.[sql_text]
      ,C.[sql_command]
      ,C.[login_name]
      ,C.[wait_info]
      -- ,CAST(LTRIM(RTRIM(C.[CPU])) AS BIGINT) 'CPU'
      ,LTRIM(RTRIM(C.[CPU])) 'CPU'
      ,C.[tempdb_allocations]
      ,C.[tempdb_current]
      ,C.[blocking_session_id]
      ,C.[reads]
      ,C.[writes]
      ,C.[physical_reads]
      ,C.[query_plan]
      ,C.[used_memory]
      ,C.[status]
      ,C.[open_tran_count]
      ,C.[percent_complete]
      ,C.[host_name]
      ,C.[database_name]
      ,C.[program_name]
      ,C.[start_time]
      ,C.[login_time]
      ,C.[request_id]
      ,C.[collection_time]
      FROM WhoIsActiveB C(nolock), parents_CTE P
      WHERE P.session_id = C.blocking_session_id
      AND (c.start_time > @SPID_StartTime)
      AND (c.start_time < @SPID_EndStartTime)
      AND (@Collection_timeStart is null OR c.collection_time > @Collection_timeStart)
      AND (@Collection_timeEnd is null OR c.collection_time < @Collection_timeEnd)
      --AND (c.[database_name] IS NULL OR c.[database_name] = ISNULL(@DatabaseName, c.[database_name]))
      )
      SELECT * FROM parents_CTE
      WHERE ISNULL(blocking_session_id,0)>0
      ORDER BY collection_time DESC OPTION (MAXRECURSION 0)
      ;
      END ---------------------------------------------------------------------------
      ?

      2. This is what we have After:

      BEGIN
      BEGIN TRY
      ; WITH parents_CTE AS (
      SELECT P.[dd hh:mm:ss.mss]
      ,P.[session_id]
      ,P.[sql_text]
      ,P.[sql_command]
      ,P.[login_name]
      ,P.[wait_info]
      -- ,CAST(LTRIM(RTRIM(P.[CPU])) AS BIGINT) 'CPU'
      ,LTRIM(RTRIM(P.[CPU])) 'CPU'
      ,P.[tempdb_allocations]
      ,P.[tempdb_current]
      ,P.[blocking_session_id]
      ,P.[reads]
      ,P.[writes]
      ,P.[physical_reads]
      ,P.[query_plan]
      ,P.[used_memory]
      ,P.[status]
      ,P.[open_tran_count]
      ,P.[percent_complete]
      ,P.[host_name]
      ,P.[database_name]
      ,P.[program_name]
      ,P.[start_time]
      ,P.[login_time]
      ,P.[request_id]
      ,P.[collection_time]
      FROM WhoIsActiveB P(nolock)
      WHERE ISNULL(blocking_session_id,0)=0
      AND ISNULL(open_tran_count,0)>0
      AND (p.start_time> @SPID_StartTime)
      AND (p.start_time < @SPID_EndStartTime)
      AND (@Collection_timeStart is null OR collection_time > @Collection_timeStart)
      AND (@Collection_timeEnd is null OR collection_time < @Collection_timeEnd)
      -- AND (p.[database_name] IS NULL OR p.[database_name] = ISNULL(@DatabaseName, p.[database_name]))

      UNION ALL
      SELECT C.[dd hh:mm:ss.mss]
      ,C.[session_id]
      ,C.[sql_text]
      ,C.[sql_command]
      ,C.[login_name]
      ,C.[wait_info]
      -- ,CAST(LTRIM(RTRIM(C.[CPU])) AS BIGINT) 'CPU'
      ,LTRIM(RTRIM(C.[CPU])) 'CPU'
      ,C.[tempdb_allocations]
      ,C.[tempdb_current]
      ,C.[blocking_session_id]
      ,C.[reads]
      ,C.[writes]
      ,C.[physical_reads]
      ,C.[query_plan]
      ,C.[used_memory]
      ,C.[status]
      ,C.[open_tran_count]
      ,C.[percent_complete]
      ,C.[host_name]
      ,C.[database_name]
      ,C.[program_name]
      ,C.[start_time]
      ,C.[login_time]
      ,C.[request_id]
      ,C.[collection_time]
      FROM WhoIsActiveB C(nolock), parents_CTE P
      WHERE P.session_id = C.blocking_session_id
      AND (c.start_time > @SPID_StartTime)
      AND (c.start_time < @SPID_EndStartTime)
      AND (@Collection_timeStart is null OR c.collection_time > @Collection_timeStart)
      AND (@Collection_timeEnd is null OR c.collection_time < @Collection_timeEnd)
      --AND (c.[database_name] IS NULL OR c.[database_name] = ISNULL(@DatabaseName, c.[database_name]))
      )
      SELECT * FROM parents_CTE
      WHERE ISNULL(blocking_session_id,0)>0
      ORDER BY collection_time DESC OPTION (MAXRECURSION 50)
      ;
      END TRY
      BEGIN CATCH

      DECLARE @ErrorMessage varchar(MAX) = ERROR_MESSAGE(),
      @ErrorSeverity int = ERROR_SEVERITY(),
      @ErrorState smallint = ERROR_STATE();

      SELECT
      @ErrorMessage = ERROR_MESSAGE(),
      @ErrorSeverity = ERROR_SEVERITY(),
      @ErrorState = ERROR_STATE();

      RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState);

      END CATCH

      END ---------------------------------------------------------------------------?

      Table DDL:

    2. CREATE TABLE [dbo].[WhoIsActiveB](
      [dd hh:mm:ss.mss] [varchar](8000) NULL,
      [session_id] [smallint] NOT NULL,
      [sql_text] [xml] NULL,
      [sql_command] [xml] NULL,
      [login_name] [nvarchar](128) NOT NULL,
      [wait_info] [nvarchar](4000) NULL,
      [CPU] [varchar](30) NULL,
      [tempdb_allocations] [varchar](30) NULL,
      [tempdb_current] [varchar](30) NULL,
      [blocking_session_id] [smallint] NULL,
      [blocked_session_count] [varchar](30) NULL,
      [reads] [varchar](30) NULL,
      [writes] [varchar](30) NULL,
      [physical_reads] [varchar](30) NULL,
      [query_plan] [xml] NULL,
      [used_memory] [varchar](30) NULL,
      [status] [varchar](30) NOT NULL,
      [open_tran_count] [varchar](30) NULL,
      [percent_complete] [varchar](30) NULL,
      [host_name] [nvarchar](128) NULL,
      [database_name] [nvarchar](128) NULL,
      [program_name] [nvarchar](128) NULL,
      [start_time] [datetime] NOT NULL,
      [login_time] [datetime] NULL,
      [request_id] [int] NULL,
      [collection_time] [datetime] NOT NULL
      ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
      GO

     

  • To confirm, you SEE this blocking?  The reason I ask is that a SINGLE SQL query, even a recursive one, will use the SAME SPID for the lifetime of the query.

    So if the query is jumping out to be parallel, then it could be that the query becomes self blocking, but an loop of blocking is highly unlikely to not cause a deadlock and force your query to rollback.

    My guess isn't that you are having a blocking loop, but that your recursive query has no end case and is recursing forever which is a LOT more of a likely scenario in the first case and when you put a limit on the recursion, it is likely hitting the limit.

    My guess is that if you look at the messages window, you will see that when you limit it to 50, you are being told you ran through the 50 recursions and didn't hit the end of things.

    But lets look at your recursive CTE.  The way the recursion works is it is going to be looping through until the recursive bit of the query returns 0 rows.  Your only changing value with the recursive CTE (ie the only part that is not comparing variables) is:

    P.session_id  = C.blocking_session_id

    so lets say the parent table has a session ID of 1 that is blocked by session ID 2.  First time through the loop, you see session ID 1 is blocked by session ID 2.  Next time through the loop you see session ID 1 is blocked by session ID 2... and so on.  so recursion of 0 means you are going to have that looping happening indefinitely.

    TL;DR - you aren't having a "looped blocking" problem, you are having an infinite recursion problem.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • the sttatement itself is not blocking anything. it searches through the results of sp_whoisactive saved to permanent table. so the parent child SPIDs are data in that table.

    and, if what you said is so, how come most times , for most time periods,  i do not get max recursion reached and see 2 normal resultsets. e en tho the amount of data 4 that period is 10 times more in table than in the slice of data that belongs to a difdeent much dmaller set that leads to infinite recursion?

     

     

     

     

    Likes to play Chess

  • That I am not entirely sure.  Best guess - I am wrong on how it is matching things up with the recursive query which I was actually able to prove on my desktop.

    My understanding of your recursive query was incorrect and it is not an infinite loop in most cases.

    I did a quick and dirty version of your recursive CTE that looks like this:

    DECLARE @test TABLE (id INT, blocking_id INT)
    INSERT INTO @test
    (
    [id]
    , [blocking_id]
    )
    VALUES
    (
    1-- id - int
    , 2-- blocking_id - int
    ), (
    2-- id - int
    , 3 -- blocking_id - int
    ), (
    3-- id - int
    , NULL -- blocking_id - int
    ),
    (
    4-- id - int
    , 1 -- blocking_id - int
    ),
    (
    5-- id - int
    , 4 -- blocking_id - int
    );
    WITH cte AS (
    SELECT id, [blocking_id]
    FROM @test
    WHERE ISNULL(blocking_id,0) = 0
    UNION ALL
    SELECT [@test].id, [@test].blocking_id
    FROM @test, [cte]
    WHERE [cte].[id] = [@test].[blocking_id])
    SELECT *
    FROM cte
    WHERE ISNULL([cte].[blocking_id],0) >0
    OPTION (MAXRECURSION 4)

    I say quick and dirty because I removed a lot of the stuff that wasn't being used in the recursion.  Running that with 5 rows of data comes back pretty quickly.  Reducing the maxrecursion below 4 results in an error about it exceeding the recursion limit, which is expected.  Looking at the way the recursion would work, 5->4->1->2->3 which is the initial (5) plus 4 recursion (4, 1, 2, 3).  But, since I am not using any data from the recursion, the recursion is really just spinning up resources and throwing the data away.  I believe the same is true of your query from looking at it, but maybe I am missing something (like how I was missing the infinite recursion).

    The way that you could get infinite recursion would be if SPID A is blocking itself.  This can happen if a query goes parallel and then ends up blocking another thread of itself.  So Thread 1 of SPID A needs to wait for Thread 2 of SPID A to complete before it can continue.  To simulate this with the recursive CTE mentioned above, change the original data set in my example to:

    INSERT INTO @test
    (
    [id]
    , [blocking_id]
    )
    VALUES

    (
    5-- id - int
    , 5 -- blocking_id - int
    ),
    (
    5-- id - int
    ,NULL -- blocking_id - int
    );

    in which case you get infinite recursion.  And this is a perfectly valid case for SQL Server as well.  A query that goes parallel that shouldn't go parallel as it causes self blocking and loses all benefits of parallelism, and the recursive CTE will get stuck in an infinite loop.

    To summarize, if a thread goes parallel and does self blocking, your query may get stuck in an infinite loop.

    But to your original point, if SPID 1 blocks SPID 2 which blocks 3 and so on until SPID N blocks SPID 1, that is a deadlock.  SQL will see that the queries can never end and will kill off the one that is the easiest to roll back or the one with the lowest priority.  There are some pretty nasty deadlock graphs that you can find that have 100's of queries (ie SPIDs) and SQL still rolls it back.

    Now, if I am mistaken on it being an infinite recursion (which is likely since setting max recursion to 50 fixes it for you), my next thought is to set max recursion to 50 and grab an actual execution plan. Then set it to 0 and grab an estimated one (as it never completes, right?).  Now compare those.  It could be that with the max recursion set to 0, SQL is estimating a HUGE amount of memory to be needed and is spilling to tempdb which will be a LOT slower than if you can do it in memory.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply