January 30, 2017 at 2:35 pm
Hi,
I think have a problem with one of my SQL 2012 SP3 CU4 servers. Twice now, our two principal servers have stopped responding due to not enough workers. So I raised the threshold of "Max Worker Threads" to prevent another crash while I investigate the issue and get to understand the root problem. When I started graphing the values of Active Workers, Current Tasks and Current Workers I got a surprise to say the least.
On one of the servers (DB-01 in the picture) the Active Workers and Current Tasks values are between 140 and 160 with a Current Workers value between 260 and 360 - this seems OK to me. However, on the other server (DB-11 on the picture) the Active Workers and Current Tasks values are between 200 and 270 with a Current Workers value between 750 and 950! Already I am very surprised to have that many workers for the amount of tasks. The thing that really gets my attention is that all of a sudden this servers Current Workers will go up to 1300 - 1400 without there being any more tasks to show and the count will stay there for a few days until it drops to its old value just as suddenly.
I need help to know where I should start troubleshooting this behaviour.
System specs:
DB-01 (Good system)
- Dual Xeon / 6 cores/CPU / HT = 24 logical processors
- 96 GB RAM (SQL Server limited to 64GB)
- Windows 2012 R2
- SQL Server 2012 Std. Ed. SP3 CU4 x64
- Max DOP = 8
- Cost Threshold = 50
- Max Worker Threads = 1664
DB-11 (Bad system)
- Dual Xeon / 12 cores/CPU / HT = 48 logical processors
- 128 GB RAM (SQL Server limited to 64GB)
- Windows 2012 R2
- SQL Server 2012 Std. Ed. SP3 CU4 x64
- Max DOP = 8
- Cost Threshold = 50
- Max Worker Threads = 1500
The picture below shows the graphs of DB-11 and DB-01 (the two principal servers) since I started graphing these values. For clarity I removed the graphs for the mirror servers.
Since we're currently in a peak and I don't know how long it will last I hope to get some good queries to run to see what is causing this behaviour.
Many thanks for your attention
Yomet10
January 30, 2017 at 3:14 pm
First off: If you want something fixed "now" you really need to get someone on the phone
to see what is connected/running the best thing I know of is sp_whoisactive
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 30, 2017 at 5:35 pm
yomet10 - Monday, January 30, 2017 2:35 PMHi,
I think have a problem with one of my SQL 2012 SP3 CU4 servers. Twice now, our two principal servers have stopped responding due to not enough workers. So I raised the threshold of "Max Worker Threads" to prevent another crash while I investigate the issue and get to understand the root problem. When I started graphing the values of Active Workers, Current Tasks and Current Workers I got a surprise to say the least.
On one of the servers (DB-01 in the picture) the Active Workers and Current Tasks values are between 140 and 160 with a Current Workers value between 260 and 360 - this seems OK to me. However, on the other server (DB-11 on the picture) the Active Workers and Current Tasks values are between 200 and 270 with a Current Workers value between 750 and 950! Already I am very surprised to have that many workers for the amount of tasks. The thing that really gets my attention is that all of a sudden this servers Current Workers will go up to 1300 - 1400 without there being any more tasks to show and the count will stay there for a few days until it drops to its old value just as suddenly.
I need help to know where I should start troubleshooting this behaviour.System specs:
DB-01 (Good system)
- Dual Xeon / 6 cores/CPU / HT = 24 logical processors
- 96 GB RAM (SQL Server limited to 64GB)
- Windows 2012 R2
- SQL Server 2012 Std. Ed. SP3 CU4 x64
- Max DOP = 8
- Cost Threshold = 50
- Max Worker Threads = 1664DB-11 (Bad system)
- Dual Xeon / 12 cores/CPU / HT = 48 logical processors
- 128 GB RAM (SQL Server limited to 64GB)
- Windows 2012 R2
- SQL Server 2012 Std. Ed. SP3 CU4 x64
- Max DOP = 8
- Cost Threshold = 50
- Max Worker Threads = 1500The picture below shows the graphs of DB-11 and DB-01 (the two principal servers) since I started graphing these values. For clarity I removed the graphs for the mirror servers.
Since we're currently in a peak and I don't know how long it will last I hope to get some good queries to run to see what is causing this behaviour.
Many thanks for your attention
Yomet10
The queries you should run would be the same ones you would use to see if you should increase the worker threads. It's not recommended to change that value and to let SQL Server configure it dynamically as it does based on cpu and architecture.
You would generally monitor sys.dm_os_schedulers. Work_queue_count is generally used as an indicator but it's good to monitor other metrics for the schedulers to see if you are hitting issues with CPU or disk such as runnable_task_count, load_factor, pending_disk_io_count. You would also monitor dm_os_wait stats and then dm_os_waiting_tasks when the server is getting hit. You generally would do this before changing max worker threads. Every time I've dealt with scheduler related errors it has never been an issue with needing to increase max worker threads.
Sue
January 31, 2017 at 1:17 am
What about blocking and cost threshold for parallelism on the troubled server?
😎
January 31, 2017 at 3:27 am
TheSQLGuru - Monday, January 30, 2017 3:14 PMFirst off: If you want something fixed "now" you really need to get someone on the phoneto see what is connected/running the best thing I know of is sp_whoisactive
Hi Kevin,
Thanks for suggestiong that I download, compile and run sp_whoisactive it seems like a really nice tool. I am not used to it yet but using the suggested values by Brent Ozar exec sp_whoisactive @get_plans = 1, @get_locks = 1, @get_task_info = 2
I get 92 rows as a result, of these only 4 rows are not BROKER_RECEIVE_WAITFOR. These 4 queries that are running are simple SELECT queries. Also none of the sessions listed is blocked, i.e. blocking_session_id is NULL.
I probably need to get better aquainted with sp_whoisactive but these are the results right now.
Yomet10
January 31, 2017 at 3:28 am
Eirikur Eiriksson - Tuesday, January 31, 2017 1:17 AMWhat about blocking and cost threshold for parallelism on the troubled server?
😎
Hi Eirikur,
As you will see int my answer to Kevin there are no blocks reported. Also, as I stated in my original post the cost threshold for parallellism is set to 50.
Yomet10
January 31, 2017 at 4:06 am
Sue_H - Monday, January 30, 2017 5:35 PMThe queries you should run would be the same ones you would use to see if you should increase the worker threads. It's not recommended to change that value and to let SQL Server configure it dynamically as it does based on cpu and architecture.
You would generally monitor sys.dm_os_schedulers. Work_queue_count is generally used as an indicator but it's good to monitor other metrics for the schedulers to see if you are hitting issues with CPU or disk such as runnable_task_count, load_factor, pending_disk_io_count. You would also monitor dm_os_wait stats and then dm_os_waiting_tasks when the server is getting hit. You generally would do this before changing max worker threads. Every time I've dealt with scheduler related errors it has never been an issue with needing to increase max worker threads.Sue
Hi Sue,
Thanks for your input. When I started this ordeal the server was running with the default max_worker_threads of 0 (zero) and the actual running value was 832, as you can see we're running way above that value right now. Since the formula used by SQL server uses the number of cores I thought that maybe I had to double that since we're using hyper-threading, so I got a value of 1216 and added a bit more as a buffer. Hence I am running at 1500.
Here are the current results from some of the queries that I ran when trying to see if I should increase the worker threads.
SELECT scheduler_id,current_tasks_count,current_workers_count,active_workers_count,work_queue_count
FROM sys.dm_os_schedulers
WHERE status = 'Visible Online';
scheduler_id current_tasks_count current_workers_count active_workers_count work_queue_count
0 9 35 9 0
1 9 39 8 0
2 8 35 8 0
3 7 36 7 0
4 8 38 8 0
5 8 35 8 0
6 8 39 8 0
7 8 35 8 0
8 7 41 7 0
9 9 41 9 0
10 6 32 6 0
11 6 42 6 0
12 8 41 8 0
13 5 39 5 0
14 4 34 4 0
15 8 34 8 0
16 5 39 5 0
17 7 45 7 0
18 6 36 6 0
19 7 36 7 0
20 8 41 8 0
21 6 35 6 0
22 9 34 9 0
23 10 36 9 0
24 16 43 16 0
25 17 43 16 0
26 15 42 15 0
27 16 46 16 0
28 13 47 13 0
29 14 44 14 0
30 16 42 16 0
31 15 40 14 0
I find it weird that each and every scheduler has 3 to 5 workers per task, but maybe that's the way it's supposed to be.
Another one:SELECT is_preemptive,state, last_wait_type, COUNT(*) as NumWorkers
FROM sys.dm_os_workers
GROUP BY state,last_wait_type,is_preemptive
ORDER BY COUNT(*) DESC
is_preemptive state last_wait_type NumWorkers
0 SUSPENDED MISCELLANEOUS 1116
0 SUSPENDED BROKER_RECEIVE_WAITFOR 94
0 SUSPENDED BROKER_TASK_STOP 38
0 SUSPENDED CXPACKET 22
0 SUSPENDED DBMIRROR_EVENTS_QUEUE 18
0 SUSPENDED DBMIRRORING_CMD 17
0 SUSPENDED BROKER_TRANSMISSION_TABLE 6
0 RUNNING SOS_SCHEDULER_YIELD 6
1 RUNNING MISCELLANEOUS 3
0 RUNNABLE CXPACKET 3
0 RUNNABLE BROKER_TASK_STOP 2
1 RUNNING PREEMPTIVE_OS_DECRYPTMESSAGE 2
0 SUSPENDED LAZYWRITER_SLEEP 2
0 SUSPENDED LOGMGR_QUEUE 1
0 SUSPENDED DBMIRROR_DBM_EVENT 1
0 SUSPENDED ONDEMAND_TASK_QUEUE 1
1 SUSPENDED REQUEST_FOR_DEADLOCK_SEARCH 1
1 SUSPENDED SP_SERVER_DIAGNOSTICS_SLEEP 1
0 SUSPENDED SQLTRACE_INCREMENTAL_FLUSH_SLEEP 1
1 SUSPENDED XE_TIMER_EVENT 1
1 RUNNING PREEMPTIVE_OS_REPORTEVENT 1
1 RUNNING XE_DISPATCHER_WAIT 1
1 INIT MISCELLANEOUS 1
0 RUNNABLE SOS_SCHEDULER_YIELD 1
0 RUNNING MISCELLANEOUS 1
0 SUSPENDED BROKER_TRANSMITTER 1
0 SUSPENDED CHECKPOINT_QUEUE 1
0 SUSPENDED CLR_AUTO_EVENT 1
1 SUSPENDED CLR_AUTO_EVENT 1
0 SUSPENDED DIRTY_PAGE_POLL 1
0 SUSPENDED FT_IFTS_SCHEDULER_IDLE_WAIT 1
0 SUSPENDED FT_IFTSHC_MUTEX 1
0 SUSPENDED HADR_FILESTREAM_IOMGR_IOCOMPLETION 1
0 SUSPENDED KSOURCE_WAKEUP 1
Here I am wondering what the wait_type MISCELLANEOUS represents. It seems unreal that it should have 1116 workers waiting for MISC...
Best
Yomet10
January 31, 2017 at 5:51 am
A) I don't think I ever said anything about blocking being an issue. It wouldn't with this wait type.
B) Adam Machanic, the mastermind behind the awesome sp_whisactive, did a 30-day blog post series on SQLBlog.com. I recommend you dig into that.
C) I ALMOST CERTAINLY FOUND YOUR PROBLEM!! And I apologize for missing it initially if it turns out to be it. You are on SQL Server 2012 STANDARD EDITION, which has a limit of 4 CPU sockets or TWENTY CORES!!! So your hyperthreaded 48-logical-core machine is having EVERYTHING shoved onto one CPU, and not even all cores of that.
https://msdn.microsoft.com/en-us/library/ms143760(v=sql.110).aspx
Here's the fix:
1) Turn off hyperthreading
2) Use affinity masking to allow SQL Server to run on 10 PHYSICAL cores on each CPU that you have. I would personally start with the third core on each CPU and use the remaining 9. I feel that lots of things seem to tack onto the first core. And logically using them in sequence gets you a better CPU cache-use picture.
Bet this fixes you right up.
BTW, your DB-01 system is suboptimal too, with 24 cores in play.
Good on you for giving the server 128GB RAM too!! That extra can make a difference.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 31, 2017 at 7:04 am
Here I am wondering what the wait_type MISCELLANEOUS represents. It seems unreal that it should have 1116 workers waiting for MISC...
You have no work queue count. There is nothing unreal about the miscellaneous work type. Other than you probably have too many worker threads now.
The old Wait Type Repository that Bob Ward had up on the CSS site is gone but what he had listed for Miscellaneous is the following:
This really should be called "Not Waiting".
This may have been used in SQL 2000 but for 2005/2008, it is not used for any valid wait. It is simply the default wait in a list and isn't used to indicate any real waiting. This type shows up twice in sys.dm_os_wait_stats in SQL 2008 but the "other" instance is an older unused wait type in the code. We should be able to remove it.
And I would guess your runnable task count is 0. It looks like there is absolutely nothing wrong with workers other than having a ton of workers doing nothing. It's highly doubtful that you needed to increase worker threads and more likely that you have other issues.
You got an error, didn't know what caused it and changed a setting that is recommended NOT to be changed. If you do change those, you really should know them in depth.
Sue
January 31, 2017 at 11:47 am
Sue_H - Tuesday, January 31, 2017 7:04 AMYou have no work queue count. There is nothing unreal about the miscellaneous work type. Other than you probably have too many worker threads now.
The old Wait Type Repository that Bob Ward had up on the CSS site is gone but what he had listed for Miscellaneous is the following:This really should be called "Not Waiting".
This may have been used in SQL 2000 but for 2005/2008, it is not used for any valid wait. It is simply the default wait in a list and isn't used to indicate any real waiting. This type shows up twice in sys.dm_os_wait_stats in SQL 2008 but the "other" instance is an older unused wait type in the code. We should be able to remove it.And I would guess your runnable task count is 0. It looks like there is absolutely nothing wrong with workers other than having a ton of workers doing nothing. It's highly doubtful that you needed to increase worker threads and more likely that you have other issues.
You got an error, didn't know what caused it and changed a setting that is recommended NOT to be changed. If you do change those, you really should know them in depth.Sue
Hi Sue,
"You got an error, didn't know what caused it and changed a setting that is recommended NOT to be changed" you are totally correct about this and I humbly realize that I shouldn't have changed a setting I didn't know exactly what it was doing.
Thanks for the insight into the MISC wait type.
"you probably have too many worker threads now" that is exactly the title of this thread and the reason why I posted in the first place... 🙂
I have learned that I shouldn't start fooling around with advanced settings until I know what they do and that next time I should probably post a question first.
Best regards
Yomet10
January 31, 2017 at 11:53 am
TheSQLGuru - Tuesday, January 31, 2017 5:51 AMA) I don't think I ever said anything about blocking being an issue. It wouldn't with this wait type.B) Adam Machanic, the mastermind behind the awesome sp_whisactive, did a 30-day blog post series on SQLBlog.com. I recommend you dig into that.
C) I ALMOST CERTAINLY FOUND YOUR PROBLEM!! And I apologize for missing it initially if it turns out to be it. You are on SQL Server 2012 STANDARD EDITION, which has a limit of 4 CPU sockets or TWENTY CORES!!! So your hyperthreaded 48-logical-core machine is having EVERYTHING shoved onto one CPU, and not even all cores of that.
https://msdn.microsoft.com/en-us/library/ms143760(v=sql.110).aspx
Here's the fix:
1) Turn off hyperthreading
2) Use affinity masking to allow SQL Server to run on 10 PHYSICAL cores on each CPU that you have. I would personally start with the third core on each CPU and use the remaining 9. I feel that lots of things seem to tack onto the first core. And logically using them in sequence gets you a better CPU cache-use picture.
Bet this fixes you right up.
BTW, your DB-01 system is suboptimal too, with 24 cores in play.
Good on you for giving the server 128GB RAM too!! That extra can make a difference.
Hi Kevin,
I will definitely dig into Adam Machanic's blog series about sp_whoisactive - I need more tools in my belt.
Thanks for all these insights. I thought that SQL Server just wouldn't use those other cores, or logical processors, now I will start by doing exactly what you told me to and post a reply when I see what happens. I will also set the max_worker_threads back to 0 (zero) and let SQL Server handle that part.
The DB-01 system is soon to be retired so I won't start working on that one, however, the replacement system might need some tuning and I will take care of that as soon as I see how DB-11 reacts.
Thanks again
Yomet10
EDIT: A little question has come to my attention when re-reading the MS article you linked to. In that article it says that the max for SQL 2012 Std. is the "lesser of 4 Sockets or 16 cores" which makes me think that if I don't turn off HT I can use 32 logical processors (1 core = 2 logical processors), however, when I look in the affinity maks settings in SSMS I see CPU0 - CPU47 listed, i.e. they list the logical processors, not the cores.
Should I then keep HT on and set the affinity as:ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 8 TO 23, 32 TO 47
or turn off HT and set the affinity as:ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 4 TO 11, 16 TO 23
January 31, 2017 at 12:19 pm
Sue
January 31, 2017 at 12:48 pm
LOL, I know about "soon to be retired" taking much longer than expected - that is the case with our DB-01 server... However, the new servers are already in place and about to receive SQL server 2016, then the migration will start.
You are correct in that it will give me more learning experience and I will probably have fun with those settings once the old Prod servers have made their way into Lab but the client using that server is not happy about changing configurations on the server "Too much can go wrong".
Yomet10
January 31, 2017 at 1:58 pm
Sorry for the confusion, Yomet. I mentioned 20, which is the "issue" and limit that crops up in a similar situation related to an ENTERPRISE license. I had recently fixed that one for a client and simply spaced out when typing the message. 🙂
The limit for 2012 Standard is SIXTEEN cores, and you should provision 8 on each CPU. I would still DEFINITELY DO PHYSICAL cores since you are doing affinity settings. Preferentially this box would be dedicated to just SQL Server after that too since you are messing with affinities.
Two more things:
1) You REALLY REALLY should use Glenn Berry's online advice when provisioning new hardware for SQL Server. You have rather inappropriate and possibly overly costly hardware for your edition of SQL Server.
2) You mentioned SQL Server 2016. Note that it's license limit was increased to TWENTY FOUR cores (4 CPUs max). Adjust affinities appropriately on that hardware.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 31, 2017 at 2:27 pm
Thanks for the quick reply Kevin.
All our SQL Server are dedicated machines, the only other things that are running are what's needed to send logs and server statistics to our monitoring servers.
Since you suggest turning off HT, SQL 2016 Std. will maximize the usage of the new servers which are also running on Dual Xeon 12 Core CPUs.
I am looking into Glenn Barry's site now but will need to continue later since I am also reading "A Month of Activity Monitoring" on Adam Machanic's site 🙂 New HW will not come in soon so the Activity Monitoring is definitely more important.
I'll get back when I have fixed DB-11.
Yomet10
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply