SPID Status SUSPENDED with MaxDOP = 1? I need to look back...

  • Greetings to the Collective Minds of SQL Server Central! It has been a while since I posted, but not since I have utilized the collective wisdom.

    I recently had a problem with an Application that I support where the Application basically halted because there were 163 (yes, you read that correctly, one-hundred-sixty-three) Unique SPIDS essentially running the same Select statement for the same application user. This is attributed to the user clicking the operation, waiting a bit for a response, and when one was not presented, clicking the button again. I know what you are thinking, "Hello? Business logic?" The problem here is that there was no Blocking in the Database. None!

    Because this is a production system I needed to get the issue resolved so I wrote a Cursor to kill the SPIDs. This resolved the current problem, but what I forgot to do is actually take a picture of the processes to get the wait type. Because of the MaxDOP setting, I am confident that it was not CXPACKET, but I don't know that for sure.

    Short of waiting for this to happen again, which I believe it will, what can I do to look back at those processes to find what caused them to be suspended? It could be disk I/O or something else, but I don't know.

    Any guidance is appreciated.

    Regards, Irish 

  • Waits can be viewed at an aggregate level on the server, and on an individual basis for executing queries. Duration and worker time and IO are tracked for query executions. But I'm not sure you can see the specific "why a spid was suspeded" details like you want at the individual execution level for non-currently-running queries.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • ... This is attributed to the user clicking the operation, waiting a bit for a response, and when one was not presented, clicking the button again. I know what you are thinking, "Hello? Business logic?" The problem here is that there was no Blocking in the Database. None! ...

    Blocking the user from clicking the submit button multiple times before getting a response back should definately be a function of the application, and it simply can't be implemented at the database level. The purpose of blocking in SQL Server isn't to prevent users from doing things but rather to prevent a request from reading various forms of uncommitted data depending on the isolation level.

    ... Short of waiting for this to happen again, which I believe it will, what can I do to look back at those processes to find what caused them to be suspended? It could be disk I/O or something else, but I don't know. ...

    I don't see why it matters what the specific wait state for each of these duplicate requests were. If have a situation where a user has multiple requests executing for the same query, and you're wanting to kill all but one of them, then perhaps use sys.dm_exec_requests to determine which request has the first start time and keep that one. But killing spids is a hack way of dealing with this situation. The best thing to do is have your application disable the submit button after the first request, and then keep it disabled until the first request returns.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Jeffrey Irish - Wednesday, February 22, 2017 10:25 AM

    Greetings to the Collective Minds of SQL Server Central! It has been a while since I posted, but not since I have utilized the collective wisdom.

    I recently had a problem with an Application that I support where the Application basically halted because there were 163 (yes, you read that correctly, one-hundred-sixty-three) Unique SPIDS essentially running the same Select statement for the same application user. This is attributed to the user clicking the operation, waiting a bit for a response, and when one was not presented, clicking the button again. I know what you are thinking, "Hello? Business logic?" The problem here is that there was no Blocking in the Database. None!

    Because this is a production system I needed to get the issue resolved so I wrote a Cursor to kill the SPIDs. This resolved the current problem, but what I forgot to do is actually take a picture of the processes to get the wait type. Because of the MaxDOP setting, I am confident that it was not CXPACKET, but I don't know that for sure.

    Short of waiting for this to happen again, which I believe it will, what can I do to look back at those processes to find what caused them to be suspended? It could be disk I/O or something else, but I don't know.

    Any guidance is appreciated.

    Why not build it into the application?  Our in-house software, when making a database call, will disable the button that would request data until the data is received or sent.  This allows us to just have 1 database connection from the software at a time.  
    Could you do the same steps the end user did on a test environment and throw a waitfor into the SP so it runs slower and thus gives you a more of a chance to do the multiple clicks?
    I am also not sure why you would expect there to be blocking in the database when a user clicks a button.  Unless the button is doing an insert or update or delete, and even then I would expect row or page level blocking not table level.

    I was just checking one of our live systems and the only suspended status I see is something sitting in a waitfor from our SQL monitoring software (3rd party).  If you don't have any 3rd party tools installed for monitoring SQL, it can be very hard to see any historical issues.

    I would try to replicate this in a test system and if possible, try to prevent it from happening at the application level.  There are a LOT of things that could cause it to go into a suspended state (waiting for resources (CPU, memory, disk, etc), waiting for a lock to be released, waiting for some other task to finish (but this should show up in the BlkBy column), etc).

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thank you all for the quick replies.

    @sqlguru - Okay. I'd like to view that aggregate data if I can. Mostly I'm curious what the wait type was more than anything else. I don't need to drill into why as much as the wait type. It will give me a starting place. I don't know if I can get that from the "top SQL" scripting I have or not, but I have not tried yet either.

    @eric M Russel - Don't get me started on the application pieces of this to prevent the user from submitting multiple requests in succession! The Dev team is aware and they have promised to take care of it...eventually. As for the why it matters of the wait type is simply because I am trying to figure where to start on figuring this out. Honestly I'm kicking myself for not looking at it before killing the processes.

    @bmg002 - Yes, I want it in the application! Sadly I'm just the guy who gets strung up when performance tanks. Sometimes you're the hero and sometimes you're the goat.

    Regards, Irish 

  • Jeffrey Irish - Wednesday, February 22, 2017 11:10 AM

    Thank you all for the quick replies.

    @sqlguru - Okay. I'd like to view that aggregate data if I can. Mostly I'm curious what the wait type was more than anything else. I don't need to drill into why as much as the wait type. It will give me a starting place. I don't know if I can get that from the "top SQL" scripting I have or not, but I have not tried yet either.

    @eric M Russel - Don't get me started on the application pieces of this to prevent the user from submitting multiple requests in succession! The Dev team is aware and they have promised to take care of it...eventually. As for the why it matters of the wait type is simply because I am trying to figure where to start on figuring this out. Honestly I'm kicking myself for not looking at it before killing the processes.

    @bmg002 - Yes, I want it in the application! Sadly I'm just the guy who gets strung up when performance tanks. Sometimes you're the hero and sometimes you're the goat.

    I would wait for it to happen again and then look at the blocking OR replicate this on a test environment.  
    If the process caused some company downtime or potential data loss, I'd put more pressure on your dev team to get the changes in place as they shouldn't be all that complicated.  
    Without seeing the TSQL code, I'd guess that the blocking was probably that the application blocked itself.  What I mean is that they clicked the button 163 times, the application would stall waiting for the 163rd query to complete.  SQL would handle them in the order they came in and they are each blocking each other.  So if the query took 1 second to complete, the end user would need to wait 163 seconds for it to complete.  That is presuming other people didn't click that button in the application too.
    Mind you the above assumes that each query is making an exclusive lock.  If it was a shared lock then they should all complete in nearly the same time unless something was making an exclusive lock.

    Now, if it isn't blocking, it could be disk I/O, resource (cpu/memory) pressure, etc.  Once the queries are killed, it is nearly impossible to figure out what happened unless you have some monitoring tools in place.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • From what you've already described, this sounds like a direct SQL select. 

    However, if this were a stored procedure call, then you would have more options for mitigating this with server side logic. I've encountered scenarios similar to this in the past where users would frequently refresh screens in a dashboard style application where the data was mostly static. I implemented a job that refreshed a persisted summary table every five minutes, and then the stored procedure would query from that, rather than the operational database directly. The data aggregation was granular enough to support whatever filtering was specified by the input parameters of each procedure call. The users never noticed that their data was cached every few minutes, they only knew that screen refreshes were responding a lot faster after Eric worked his magic.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply