A lot of THREADPOOL waits, find worker thread blocking the scheduler

  • Hello, 

    I have a very strange problem. From time to time on one of our production servers the number of THREADPOOL waits increases drastically (I can see that from sys.dm_os_waiting_tasks). But these waits have no session_id, so I can't find any blocker to terminate in order to release the resource. When I look for blocking transactions in sys.dm_exec_sessions, I can't find anything there either. The number of THREADPOOL waits keeps increasing; is there a way I can release the resource on which these waits occur?

    Thanks in advance

  • tttcoding - Tuesday, June 12, 2018 4:26 AM


    It's not necessarily just blocking that can cause this, and since querying sys.dm_exec_sessions didn't report any blocking, that route isn't going to help you.
    Checking the number of connections as well as checking for blocking would be good. For connections, that would just be sys.dm_exec_connections.
    If you can see the waits in sys.dm_os_waiting_tasks, this dmv also has blocking information. Some other scripts to check for blocking: 
    Quickest Way to Identify Blocking Query
    Understanding SQL Server Blocking
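    As a minimal first pass, something along these lines against sys.dm_os_waiting_tasks should surface any waiter that does have a blocker recorded (the ordering and column choice here are just a suggestion):

    ```sql
    -- Waiting tasks that have blocking info recorded, longest waits first
    SELECT wt.session_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.blocking_session_id,
           wt.resource_description
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE wt.blocking_session_id IS NOT NULL
    ORDER BY wt.wait_duration_ms DESC;
    ```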

    Sue

  • Sue_H - Tuesday, June 12, 2018 10:54 AM


    Hello Sue, 

    thanks for replying. Yes, I can see the waits in sys.dm_os_waiting_tasks, but there is no blocking information for any of the waiting tasks. The records just contain the task address and how long each task has been waiting at the moment I pull the info, nothing more. And there are more and more of them; the number just keeps increasing. I had a situation previously where there was blocking information in this DMV and I was able to terminate the session that was causing the blocking, but in this situation I don't have any info about a blocker.

    Is there any way for me to see which worker thread is stuck and blocking a scheduler? Maybe from the sys.dm_os_workers DMV, so I could then terminate whatever it is doing, or whatever task is running?

    Thanks,
    Tomi

  • tttcoding - Tuesday, June 12, 2018 2:55 PM


    Hi Tomi -

    You can't reliably figure out which worker is stuck, mostly because there is no "stuck" status. And it may have nothing to do with blocking anyway.
    Your earlier incident was due to blocking, but that doesn't mean this one is as well. You would also want to see what the other high wait types are, as that can help identify issues.
    In addition to blocking, you can see the same thing when there are a lot of connections, too many processes running in parallel, many or long executions of extended stored procedures, etc. That's why checking some of the other waits can help, as can checking the SQL Server error log for any other errors. It doesn't sound like you have thread starvation if you can log in without the DAC and run the queries. And if nothing is logged, it really doesn't sound like thread starvation.
    Is the resource_description populated in sys.dm_os_waiting_tasks? That can give you an idea of what objects are involved if associatedObjectId has a value.
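    To pull just the THREADPOOL rows together with whatever resource detail is populated, a quick check might look like this:

    ```sql
    -- THREADPOOL waits with whatever resource detail is populated
    SELECT waiting_task_address,
           wait_duration_ms,
           resource_description
    FROM sys.dm_os_waiting_tasks
    WHERE wait_type = N'THREADPOOL'
    ORDER BY wait_duration_ms DESC;
    ```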
    Do you have any baselines for what type of load you get on that server? You can get an idea of that by seeing how many workers are in each state, particularly RUNNING:
    SELECT state, COUNT(*)
    FROM sys.dm_os_workers
    GROUP BY state

    Did you check the number of active workers and work queue counts for the schedulers:
    SELECT current_workers_count, active_workers_count, work_queue_count
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE'

    There are quite a few things you'd want to check:
    Check the CPU usage as well, since the combination of low CPU and high THREADPOOL usually indicates blocking, so you'd want to double-check for blocking.
    Check/monitor the number of connections, and see whether any users are complaining about problems connecting or you are seeing login timeouts.
    Check the top wait types in addition to THREADPOOL.
    Check whether connections are being closed and cleaned up by the applications.
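    For the last two checks, a couple of quick queries (the excluded wait types below are just a small illustrative set of benign waits, not a complete list):

    ```sql
    -- Top waits since the last restart (or since wait stats were cleared)
    SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                            N'XE_TIMER_EVENT', N'CHECKPOINT_QUEUE')
    ORDER BY wait_time_ms DESC;

    -- Connection counts per client, to spot an unexpected pile-up
    SELECT client_net_address, COUNT(*) AS connection_count
    FROM sys.dm_exec_connections
    GROUP BY client_net_address
    ORDER BY connection_count DESC;
    ```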

    Sue

  • Sue_H - Wednesday, June 13, 2018 12:00 PM


    Hello Sue, 

    thanks for your thorough answer. I sensed that something was happening on the server when my connection to the DB host (from my local PC through VPN) started to get slow. I was experiencing slowness running very simple diagnostic queries. Then I connected to the machine with RDP and tried to connect from there; it was slow too.

    I first thought that the server might be under high load, so I checked the CPU and memory usage. The CPU usage was dropping with every minute, and the memory usage stayed at an average of about 55%, which is usual for our production system. Then my connection via SSMS got disconnected; I tried a couple of times, but it wouldn't let me in. We were informed that the client sites were down. I was able to connect to the server using the DAC, and that's where I took the statistics from and realized what was happening: low CPU, normal memory usage, and a steadily growing number of THREADPOOL waits. The number of active workers was more than 500, the work_queue_count was about 300, and I had about 1100 records in sys.dm_os_waiting_tasks. I didn't check the number of connections because it was a time of day when we don't expect a spike in connections.

    The SQL Server error log contains messages only for login timeouts, nothing more. These are the same timeouts I experienced while trying to log into the DB engine. There is nothing like a Deadlocked Schedulers message or a similar error. The combination of low CPU and high THREADPOOL waits got me thinking that some kind of blocking must be happening, but I wasn't able to find any blocker. The THREADPOOL waits kept piling up and the DB engine got stuck.

    The resource_description contained information about the threadpool owner: threadpool id=scheduler<hex-address>. Apart from that field, the only other fields that had values were waiting_task_address, wait_duration_ms, and wait_type.

    I confirmed with the developers that the connections are definitely being closed by the applications.

    Best Regards,
    Tomi

  • tttcoding - Wednesday, June 13, 2018 12:58 PM


    Hi Tomi -
    If you regularly have a heavy load on that server and you aren't experiencing any other significant waits, you MAY want to look at increasing max worker threads. It is one of those settings that is best left alone most of the time, but with the extra information you provided, it does sound like this could be a case where changing it is warranted. If you do experiment with it, you'd want to be very conservative and increase it in small increments. Increasing max worker threads isn't a cure-all, and it comes with more memory being used, usually more context switching, schedulers taking on more load, etc. Here are a few links you'd want to read up on before making any changes:
    Max. Worker Threads and when you should change it
    Max Worker Threads – Lower and Upper Bounds – The Correct Formula
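    Purely as an illustration of the mechanics (the 960 below is an arbitrary example value, not a recommendation; derive your own number from the formulas in the links above):

    ```sql
    -- 'max worker threads' is an advanced option, so expose it first
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Show the configured value (0 = auto-calculated default) and running value
    EXEC sp_configure 'max worker threads';

    -- Example only: uncomment after you've done the math for your hardware
    -- EXEC sp_configure 'max worker threads', 960;
    -- RECONFIGURE;
    ```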

    You'd want to watch things closely to make sure the change doesn't make anything else worse. As for the login issues, often the best way to monitor those is by querying the OS ring buffers, as that seems to be the more reliable way to catch any of those failures. This link has a lot of the queries, with the XML shredding that makes things easier. Especially take a look at the first two examples:
    Inside sys.dm_os_ring_buffers
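    As a starting point before any shredding, the raw connectivity records can be pulled like this (the record column holds XML describing each connectivity event):

    ```sql
    -- Raw connectivity ring buffer entries; record is XML you can then shred
    SELECT ring_buffer_type,
           timestamp,
           CAST(record AS XML) AS record_xml
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_CONNECTIVITY';
    ```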

    It would be great if you could post back any progress, improvements, changes, etc. I'm very interested in seeing how this turns out for you. 

    Sue

