Too many workers?

  • Hi,
    I think have a problem with one of my SQL 2012 SP3 CU4 servers. Twice now, our two principal servers have stopped responding due to not enough workers. So I raised the threshold of "Max Worker Threads" to prevent another crash while I investigate the issue and get to understand the root problem. When I started graphing the values of Active Workers, Current Tasks and Current Workers I got a surprise to say the least.
    On one of the servers (DB-01 in the picture) the Active Workers and Current Tasks values are between 140 and 160 with a Current Workers value between 260 and 360 - this seems OK to me. However, on the other server (DB-11 on the picture) the Active Workers and Current Tasks values are between 200 and 270 with a Current Workers value between 750 and 950! Already I am very surprised to have that many workers for the amount of tasks. The thing that really gets my attention is that all of a sudden this servers Current Workers will go up to 1300 - 1400 without there being any more tasks to show and the count will stay there for a few days until it drops to its old value just as suddenly.
    I need help to know where I should start troubleshooting this behaviour.

    System specs:
    DB-01 (Good system)
    - Dual Xeon / 6 cores/CPU / HT = 24 logical processors
    - 96 GB RAM (SQL Server limited to 64GB)
    - Windows 2012 R2
    - SQL Server 2012 Std. Ed. SP3 CU4 x64
    - Max DOP = 8
    - Cost Threshold = 50
    - Max Worker Threads = 1664

    DB-11 (Bad system)
    - Dual Xeon / 12 cores/CPU / HT = 48 logical processors
    - 128 GB RAM (SQL Server limited to 64GB)
    - Windows 2012 R2
    - SQL Server 2012 Std. Ed. SP3 CU4 x64
    - Max DOP = 8
    - Cost Threshold = 50
    - Max Worker Threads = 1500

    The picture below shows the graphs of DB-11 and DB-01 (the two principal servers) since I started graphing these values. For clarity I removed the graphs for the mirror servers.

    Since we're currently in a peak and I don't know how long it will last I hope to get some good queries to run to see what is causing this behaviour.

    Many thanks for your attention

    Yomet10

  • First off: If you want something fixed "now" you really need to get someone on the phone

    to see what is connected/running the best thing I know of is sp_whoisactive

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • yomet10 - Monday, January 30, 2017 2:35 PM

    Hi,
    I think have a problem with one of my SQL 2012 SP3 CU4 servers. Twice now, our two principal servers have stopped responding due to not enough workers. So I raised the threshold of "Max Worker Threads" to prevent another crash while I investigate the issue and get to understand the root problem. When I started graphing the values of Active Workers, Current Tasks and Current Workers I got a surprise to say the least.
    On one of the servers (DB-01 in the picture) the Active Workers and Current Tasks values are between 140 and 160 with a Current Workers value between 260 and 360 - this seems OK to me. However, on the other server (DB-11 on the picture) the Active Workers and Current Tasks values are between 200 and 270 with a Current Workers value between 750 and 950! Already I am very surprised to have that many workers for the amount of tasks. The thing that really gets my attention is that all of a sudden this servers Current Workers will go up to 1300 - 1400 without there being any more tasks to show and the count will stay there for a few days until it drops to its old value just as suddenly.
    I need help to know where I should start troubleshooting this behaviour.

    System specs:
    DB-01 (Good system)
    - Dual Xeon / 6 cores/CPU / HT = 24 logical processors
    - 96 GB RAM (SQL Server limited to 64GB)
    - Windows 2012 R2
    - SQL Server 2012 Std. Ed. SP3 CU4 x64
    - Max DOP = 8
    - Cost Threshold = 50
    - Max Worker Threads = 1664

    DB-11 (Bad system)
    - Dual Xeon / 12 cores/CPU / HT = 48 logical processors
    - 128 GB RAM (SQL Server limited to 64GB)
    - Windows 2012 R2
    - SQL Server 2012 Std. Ed. SP3 CU4 x64
    - Max DOP = 8
    - Cost Threshold = 50
    - Max Worker Threads = 1500

    The picture below shows the graphs of DB-11 and DB-01 (the two principal servers) since I started graphing these values. For clarity I removed the graphs for the mirror servers.

    Since we're currently in a peak and I don't know how long it will last I hope to get some good queries to run to see what is causing this behaviour.

    Many thanks for your attention

    Yomet10

    The queries you should run would be the same ones you would use to see if you should increase the worker threads. It's not recommended to change that value and to let SQL Server configure it dynamically as it does based on cpu and architecture.
    You would generally monitor sys.dm_os_schedulers. Work_queue_count is generally used as an indicator but it's good to monitor other metrics for the schedulers to see if you are hitting issues with CPU or disk such as runnable_task_count, load_factor, pending_disk_io_count. You would also monitor dm_os_wait stats and then dm_os_waiting_tasks when the server is getting hit. You generally would do this before changing max worker threads. Every time I've dealt with scheduler related errors it has never been an issue with needing to increase max worker threads. 

    Sue

  • What about blocking and cost threshold for parallelism on the troubled server?
    😎

  • TheSQLGuru - Monday, January 30, 2017 3:14 PM

    First off: If you want something fixed "now" you really need to get someone on the phone

    to see what is connected/running the best thing I know of is sp_whoisactive

    Hi Kevin,

    Thanks for suggestiong that I download, compile and run sp_whoisactive it seems like a really nice tool. I am not used to it yet but using the suggested values by Brent Ozar
    exec sp_whoisactive @get_plans = 1, @get_locks = 1, @get_task_info = 2
    I get 92 rows as a result, of these only 4 rows are not BROKER_RECEIVE_WAITFOR. These 4 queries that are running are simple SELECT queries. Also none of the sessions listed is blocked, i.e. blocking_session_id is NULL.

    I probably need to get better aquainted with sp_whoisactive but these are the results right now.

    Yomet10

  • Eirikur Eiriksson - Tuesday, January 31, 2017 1:17 AM

    What about blocking and cost threshold for parallelism on the troubled server?
    😎

    Hi Eirikur,

    As you will see int my answer to Kevin there are no blocks reported. Also, as I stated in my original post the cost threshold for parallellism is set to 50.

    Yomet10

  • Sue_H - Monday, January 30, 2017 5:35 PM

    The queries you should run would be the same ones you would use to see if you should increase the worker threads. It's not recommended to change that value and to let SQL Server configure it dynamically as it does based on cpu and architecture.
    You would generally monitor sys.dm_os_schedulers. Work_queue_count is generally used as an indicator but it's good to monitor other metrics for the schedulers to see if you are hitting issues with CPU or disk such as runnable_task_count, load_factor, pending_disk_io_count. You would also monitor dm_os_wait stats and then dm_os_waiting_tasks when the server is getting hit. You generally would do this before changing max worker threads. Every time I've dealt with scheduler related errors it has never been an issue with needing to increase max worker threads. 

    Sue

    Hi Sue,

    Thanks for your input. When I started this ordeal the server was running with the default max_worker_threads of 0 (zero) and the actual running value was 832, as you can see we're running way above that value right now. Since the formula used by SQL server uses the number of cores I thought that maybe I had to double that since we're using hyper-threading, so I got a value of 1216 and added a bit more as a buffer. Hence I am running at 1500.

    Here are the current results from some of the queries that I ran when trying to see if I should increase the worker threads.

    SELECT scheduler_id,current_tasks_count,current_workers_count,active_workers_count,work_queue_count 
    FROM sys.dm_os_schedulers
    WHERE status = 'Visible Online';

    scheduler_id  current_tasks_count current_workers_count active_workers_count  work_queue_count
      0       9       35        9        0
      1       9       39        8        0
      2       8       35        8        0
      3       7       36        7        0
      4       8       38        8        0
      5       8       35        8        0
      6       8       39        8        0
      7       8       35        8        0
      8       7       41        7        0
      9       9       41        9        0
      10      6       32        6        0
      11      6       42        6        0
      12      8       41        8        0
      13      5       39        5        0
      14      4       34        4        0
      15      8       34        8        0
      16      5       39        5        0
      17      7       45        7        0
      18      6       36        6        0
      19      7       36        7        0
      20      8       41        8        0
      21      6       35        6        0
      22      9       34        9        0
      23      10      36        9        0
      24      16      43        16        0
      25      17      43        16        0
      26      15      42        15        0
      27      16      46        16        0
      28      13      47        13        0
      29      14      44        14        0
      30      16      42        16        0
      31      15      40        14        0

    I find it weird that each and every scheduler has 3 to 5 workers per task, but maybe that's the way it's supposed to be.

    Another one:
    SELECT is_preemptive,state, last_wait_type, COUNT(*) as NumWorkers
    FROM sys.dm_os_workers 
    GROUP BY state,last_wait_type,is_preemptive 
    ORDER BY COUNT(*) DESC

    is_preemptive    state    last_wait_type    NumWorkers
    0    SUSPENDED    MISCELLANEOUS    1116
    0    SUSPENDED    BROKER_RECEIVE_WAITFOR    94
    0    SUSPENDED    BROKER_TASK_STOP    38
    0    SUSPENDED    CXPACKET    22
    0    SUSPENDED    DBMIRROR_EVENTS_QUEUE    18
    0    SUSPENDED    DBMIRRORING_CMD    17
    0    SUSPENDED    BROKER_TRANSMISSION_TABLE    6
    0    RUNNING    SOS_SCHEDULER_YIELD    6
    1    RUNNING    MISCELLANEOUS    3
    0    RUNNABLE    CXPACKET    3
    0    RUNNABLE    BROKER_TASK_STOP    2
    1    RUNNING    PREEMPTIVE_OS_DECRYPTMESSAGE    2
    0    SUSPENDED    LAZYWRITER_SLEEP    2
    0    SUSPENDED    LOGMGR_QUEUE    1
    0    SUSPENDED    DBMIRROR_DBM_EVENT    1
    0    SUSPENDED    ONDEMAND_TASK_QUEUE    1
    1    SUSPENDED    REQUEST_FOR_DEADLOCK_SEARCH    1
    1    SUSPENDED    SP_SERVER_DIAGNOSTICS_SLEEP    1
    0    SUSPENDED    SQLTRACE_INCREMENTAL_FLUSH_SLEEP    1
    1    SUSPENDED    XE_TIMER_EVENT    1
    1    RUNNING    PREEMPTIVE_OS_REPORTEVENT    1
    1    RUNNING    XE_DISPATCHER_WAIT    1
    1    INIT    MISCELLANEOUS    1
    0    RUNNABLE    SOS_SCHEDULER_YIELD    1
    0    RUNNING    MISCELLANEOUS    1
    0    SUSPENDED    BROKER_TRANSMITTER    1
    0    SUSPENDED    CHECKPOINT_QUEUE    1
    0    SUSPENDED    CLR_AUTO_EVENT    1
    1    SUSPENDED    CLR_AUTO_EVENT    1
    0    SUSPENDED    DIRTY_PAGE_POLL    1
    0    SUSPENDED    FT_IFTS_SCHEDULER_IDLE_WAIT    1
    0    SUSPENDED    FT_IFTSHC_MUTEX    1
    0    SUSPENDED    HADR_FILESTREAM_IOMGR_IOCOMPLETION    1
    0    SUSPENDED    KSOURCE_WAKEUP    1

    Here I am wondering what the wait_type MISCELLANEOUS represents. It seems unreal that it should have 1116 workers waiting for MISC...

    Best

    Yomet10

  • A) I don't think I ever said anything about blocking being an issue. It wouldn't with this wait type.

    B) Adam Machanic, the mastermind behind the awesome sp_whisactive, did a 30-day blog post series on SQLBlog.com. I recommend you dig into that.

    C) I ALMOST CERTAINLY FOUND YOUR PROBLEM!! And I apologize for missing it initially if it turns out to be it. You are on SQL Server 2012 STANDARD EDITION, which has a limit of 4 CPU sockets or TWENTY CORES!!! So your hyperthreaded 48-logical-core machine is having EVERYTHING shoved onto one CPU, and not even all cores of that.

    https://msdn.microsoft.com/en-us/library/ms143760(v=sql.110).aspx

    Here's the fix:  

    1) Turn off hyperthreading

    2) Use affinity masking to allow SQL Server to run on 10 PHYSICAL cores on each CPU that you have. I would personally start with the third core on each CPU and use the remaining 9. I feel that lots of things seem to tack onto the first core. And logically using them in sequence gets you a better CPU cache-use picture.

    Bet this fixes you right up.

    BTW, your DB-01 system is suboptimal too, with 24 cores in play.

    Good on you for giving the server 128GB RAM too!! That extra can make a difference.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • yomet10 - Tuesday, January 31, 2017 4:06 AM


    I find it weird that each and every scheduler has 3 to 5 workers per task, but maybe that's the way it's supposed to be.


    Here I am wondering what the wait_type MISCELLANEOUS represents. It seems unreal that it should have 1116 workers waiting for MISC...

    You have no work queue count. There is nothing unreal about the miscellaneous work type. Other than you probably have too many worker threads now.
    The old Wait Type Repository that Bob Ward had up on the CSS site is gone but what he had listed for Miscellaneous is the following:

    This really should be called "Not Waiting".
    This may have been used in SQL 2000 but for 2005/2008, it is not used for any valid wait. It is simply the default wait in a list and isn't used to indicate any real waiting. This type shows up twice in sys.dm_os_wait_stats in SQL 2008 but the "other" instance is an older unused wait type in the code. We should be able to remove it.

    And I would guess your runnable task count is 0. It looks like there is absolutely nothing wrong with workers other than having a ton of workers doing nothing. It's highly doubtful that you needed to increase worker threads and more likely that you have other issues.
    You got an error, didn't know what caused it and changed a setting that is recommended NOT to be changed. If you do change those, you really should know them in depth.

    Sue

  • Sue_H - Tuesday, January 31, 2017 7:04 AM

    You have no work queue count. There is nothing unreal about the miscellaneous work type. Other than you probably have too many worker threads now.
    The old Wait Type Repository that Bob Ward had up on the CSS site is gone but what he had listed for Miscellaneous is the following:

    This really should be called "Not Waiting".
    This may have been used in SQL 2000 but for 2005/2008, it is not used for any valid wait. It is simply the default wait in a list and isn't used to indicate any real waiting. This type shows up twice in sys.dm_os_wait_stats in SQL 2008 but the "other" instance is an older unused wait type in the code. We should be able to remove it.

    And I would guess your runnable task count is 0. It looks like there is absolutely nothing wrong with workers other than having a ton of workers doing nothing. It's highly doubtful that you needed to increase worker threads and more likely that you have other issues.
    You got an error, didn't know what caused it and changed a setting that is recommended NOT to be changed. If you do change those, you really should know them in depth.

    Sue

    Hi Sue,

    "You got an error, didn't know what caused it and changed a setting that is recommended NOT to be changed" you are totally correct about this and I humbly realize that I shouldn't have changed a setting I didn't know exactly what it was doing.

    Thanks for the insight into the MISC wait type.

    "you probably have too many worker threads now" that is exactly the title of this thread and the reason why I posted in the first place...   🙂

    I have learned that I shouldn't start fooling around with advanced settings until I know what they do and that next time I should probably post a question first.

    Best regards

    Yomet10

  • TheSQLGuru - Tuesday, January 31, 2017 5:51 AM

    A) I don't think I ever said anything about blocking being an issue. It wouldn't with this wait type.

    B) Adam Machanic, the mastermind behind the awesome sp_whisactive, did a 30-day blog post series on SQLBlog.com. I recommend you dig into that.

    C) I ALMOST CERTAINLY FOUND YOUR PROBLEM!! And I apologize for missing it initially if it turns out to be it. You are on SQL Server 2012 STANDARD EDITION, which has a limit of 4 CPU sockets or TWENTY CORES!!! So your hyperthreaded 48-logical-core machine is having EVERYTHING shoved onto one CPU, and not even all cores of that.

    https://msdn.microsoft.com/en-us/library/ms143760(v=sql.110).aspx

    Here's the fix:  

    1) Turn off hyperthreading

    2) Use affinity masking to allow SQL Server to run on 10 PHYSICAL cores on each CPU that you have. I would personally start with the third core on each CPU and use the remaining 9. I feel that lots of things seem to tack onto the first core. And logically using them in sequence gets you a better CPU cache-use picture.

    Bet this fixes you right up.

    BTW, your DB-01 system is suboptimal too, with 24 cores in play.

    Good on you for giving the server 128GB RAM too!! That extra can make a difference.

    Hi Kevin,

    I will definitely dig into Adam Machanic's blog series about sp_whoisactive - I need more tools in my belt.

    Thanks for all these insights. I thought that SQL Server just wouldn't use those other cores, or logical processors, now I will start by doing exactly what you told me to and post a reply when I see what happens. I will also set the max_worker_threads back to 0 (zero) and let SQL Server handle that part.

    The DB-01 system is soon to be retired so I won't start working on that one, however, the replacement system might need some tuning and I will take care of that as soon as I see how DB-11 reacts.

    Thanks again

    Yomet10

    EDIT:  A little question has come to my attention when re-reading the MS article you linked to. In that article it says that the max for SQL 2012 Std. is the "lesser of 4 Sockets or 16 cores" which makes me think that if I don't turn off HT I can use 32 logical processors (1 core = 2 logical processors), however, when I look in the affinity maks settings in SSMS I see CPU0 - CPU47 listed, i.e. they list the logical processors, not the cores.
    Should I then keep HT on and set the affinity as:
    ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 8 TO 23, 32 TO 47
    or turn off HT and set the affinity as:
    ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 4 TO 11, 16 TO 23

  • Thanks for posting back. I know you mentioned it's being retired but I would still go ahead and implement the changes that Kevin suggested. It just puts more learning under your belt. I would highly suspect CPUs being the issue and you won't really know unless those changes are made. The server does have your time and attention these days due to the problems you have seen so it's not really a wasted effort. And the changes with turning off HT and setting affinity mask isn't time consuming.
    Another consideration is that servers being retired can end up in that status for way longer than intended. I've seen plenty of those "soon to be retired" servers run for over a year.
    If you do go ahead with the changes, it would be great if you can post back here what you're seeing after the changes.

    Sue

  • Hi Sue,

    LOL, I know about "soon to be retired" taking much longer than expected - that is the case with our DB-01 server...   However, the new servers are already in place and about to receive SQL server 2016, then the migration will start.
    You are correct in that it will give me more learning experience and I will probably have fun with those settings once the old Prod servers have made their way into Lab but the client using that server is not happy about changing configurations on the server "Too much can go wrong".

    Yomet10

  • Sorry for the confusion, Yomet. I mentioned 20, which is the "issue" and limit that crops up in a similar situation related to an ENTERPRISE license. I had recently fixed that one for a client and simply spaced out when typing the message. 🙂

    The limit for 2012 Standard is SIXTEEN cores, and you should provision 8 on each CPU. I would still DEFINITELY DO PHYSICAL cores since you are doing affinity settings. Preferentially this box would be dedicated to just SQL Server after that too since you are messing with affinities.

    Two more things:

    1) You REALLY REALLY should use Glenn Berry's online advice when provisioning new hardware for SQL Server. You have rather inappropriate and possibly overly costly hardware for your edition of SQL Server.

    2) You mentioned SQL Server 2016. Note that it's license limit was increased to TWENTY FOUR cores (4 CPUs max). Adjust affinities appropriately on that hardware.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for the quick reply Kevin.

    All our SQL Server are dedicated machines, the only other things that are running are what's needed to send logs and server statistics to our monitoring servers.

    Since you suggest turning off HT, SQL 2016 Std. will maximize the usage of the new servers which are also running on Dual Xeon 12 Core CPUs.

    I am looking into Glenn Barry's site now but will need to continue later since I am also reading "A Month of Activity Monitoring" on Adam Machanic's site  🙂  New HW will not come in soon so the Activity Monitoring is definitely more important.

    I'll get back when I have fixed DB-11.

    Yomet10

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply