Hi Experts,
Seeking input on a high-CPU scenario.
The app team had started running a workload and came back saying the DB was slow. We were seeing 100% CPU on one of our production environments. I checked whether any jobs were running on the server and found a full backup that had started at 12:00 AM PST and had been running for over 2 hours. Next, I ran sp_whoisactive to see what active processes were running on the server: there were about 20 SPIDs that had started just 2 minutes earlier, but all of them were in the RUNNABLE state. Then I used sp_whoisactive to check for sleeping SPIDs and found almost 570 that had been sleeping for more than 4 days with open_txn = 1.
What does this indicate? Is this a worker-thread starvation scenario? How do I prove it, and how should I troubleshoot it? Should we be monitoring user connections here, or batch requests? What should we check when we bump into such a scenario?
Eventually, in consultation with the app team, we had to KILL the sleeping SPIDs that had been open for over 2 days. Once that was done, the processes in the RUNNABLE state started executing, and CPU utilization slowly came down to 50%.
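For reference, the sleeping-SPID check can be approximated without sp_whoisactive using the standard DMVs (a sketch; the 4-day threshold is just illustrative):

-- Sleeping sessions holding open transactions
-- (open_transaction_count is available on SQL Server 2012 and later)
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       s.last_request_end_time,
       s.open_transaction_count
FROM   sys.dm_exec_sessions AS s
WHERE  s.status = 'sleeping'
  AND  s.open_transaction_count > 0
  AND  s.last_request_end_time < DATEADD(DAY, -4, GETDATE())
ORDER BY s.last_request_end_time;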
Env: SQL Server 2012 EE
Regards,
Sam
November 28, 2019 at 10:23 am
Query your plan cache to see which queries are using the most CPU. Are you updating statistics regularly? If not, the optimizer could be compiling inappropriate, resource-intensive plans. Make sure your cost threshold for parallelism is set appropriately for your workload; the out-of-the-box default is far too low in many cases. How many processors do you have, and what is the max degree of parallelism? You probably don't want individual queries to be free to eat all the processors, so consider lowering MAXDOP if you haven't changed it from its default. Make sure you have sufficient memory: spilling to disk can be expensive in terms of processor utilisation.
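A starting point for that plan cache query might be something like this (a sketch against sys.dm_exec_query_stats; adjust TOP (20) to taste):

-- Top CPU consumers currently in the plan cache
SELECT TOP (20)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       (qs.total_worker_time / qs.execution_count) / 1000 AS avg_cpu_ms,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;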
John
November 28, 2019 at 1:11 pm
It sounds like either your apps are not closing connections, or you are not using connection pooling correctly.
If 100% of your CPU is spent managing context switching and handling the network connections, you will suffer.
MVDBA
November 28, 2019 at 6:36 pm
To me, too, it looks like they aren't closing the connections. I say this because all the SPIDs have been in a sleeping state for more than 4 days.
As a DBA, what is the proper way to prove that a SPID opened a connection, opened a transaction, finished its work, but never closed the SQL connection?
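For example, a check along these lines (a sketch against the transaction DMVs) would show an open transaction on a session that has no active request, plus how long the transaction has been open:

-- Sessions with an open transaction but nothing currently executing
SELECT s.session_id,
       s.status,
       s.last_request_end_time,
       at.transaction_begin_time
FROM   sys.dm_exec_sessions             AS s
JOIN   sys.dm_tran_session_transactions AS st ON st.session_id = s.session_id
JOIN   sys.dm_tran_active_transactions  AS at ON at.transaction_id = st.transaction_id
LEFT JOIN sys.dm_exec_requests          AS r  ON r.session_id = s.session_id
WHERE  r.session_id IS NULL             -- no active request: the work is done
ORDER BY at.transaction_begin_time;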
November 28, 2019 at 6:40 pm
John Mitchell-245523 wrote: Query your plan cache to see which queries are using the most CPU. […]
There are 12 logical CPUs. We update stats on a weekly basis. MAXDOP is set to 1; I don't know why, but it's been that way for years, so it's not excessive parallelism here.
Is there a way to tell whether spilling is happening during a high-CPU scenario? How do I identify it?
November 28, 2019 at 6:49 pm
John Mitchell-245523 wrote: Query your plan cache to see which queries are using the most CPU. […]
There are 12 logical CPUs. We update stats on a weekly basis. MAXDOP is set to 1; I don't know why, but it's been that way for years, so it's not excessive parallelism here.
Is there a way to tell whether spilling is happening during a high-CPU scenario? How do I identify it?
What is the reason for MAXDOP 1?
😎
November 28, 2019 at 7:12 pm
It's been that way for more than 4 years, and even my senior DBA hasn't changed it.
I know it's weird, but that's how it is. I asked him once; he said MAXDOP was initially at the default of 0, and he saw cases where the server ran out of threads with high CXPACKET waits, so he has left it at 1 ever since.
Is there any query to prove that SQL Server might be running out of worker threads during a sustained period of 100% CPU?
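The sort of evidence I'm after would look something like this, I think (a sketch against the scheduler and wait-stats DMVs; THREADPOOL waits are the classic symptom of running out of workers):

-- Configured maximum vs. workers actually in use
SELECT (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers,
       SUM(current_workers_count) AS current_workers,
       SUM(active_workers_count)  AS active_workers,
       SUM(runnable_tasks_count)  AS runnable_tasks
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE';

-- THREADPOOL waits accumulate when requests cannot get a worker thread
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type = 'THREADPOOL';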
November 29, 2019 at 9:19 am
There are 12 logical CPUs. We update stats on a weekly basis. MAXDOP is set to 1; I don't know why, but it's been that way for years, so it's not excessive parallelism here.
Is there a way to tell whether spilling is happening during a high-CPU scenario? How do I identify it?
You're right that there won't be excessive parallelism with MAXDOP 1 - there won't be any parallelism! I recommend setting that to a sensible number - maybe start at 4 or 6 and see how that goes. And, as I said before, make sure you've changed cost threshold for parallelism from its default.
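For reference, changing both is just a couple of sp_configure calls. The numbers below are illustrative starting points, not universal recommendations; test on your workload first:

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max degree of parallelism', 4;        -- starting point; tune from here
EXEC sys.sp_configure 'cost threshold for parallelism', 50;  -- well above the default of 5
RECONFIGURE;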
You can look at the total_grant_kb, last_grant_kb, etc. columns in sys.dm_exec_query_stats to find out which queries are getting insufficient memory grants. Those are the ones that will be spilling to tempdb. Conversely, queries that are getting excessive memory grants will be hogging memory that other processes may need.
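You can also watch grants on a live system with something like this (a sketch against sys.dm_exec_query_memory_grants; if I remember rightly, the *_grant_kb columns in sys.dm_exec_query_stats need a late service pack on 2012, whereas this DMV is always there):

-- Current memory grants: large requested-vs-granted gaps and long waits
-- point at the queries most likely to spill or to starve others
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       max_used_memory_kb,
       wait_time_ms
FROM   sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;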
Consider updating your statistics more often than once a week, and updating with FULLSCAN at least once a week.
John
Sam wrote: The app team had started running a workload and came back saying the DB was slow. We were seeing 100% CPU on one of our production environments. […] I checked for sleeping SPIDs and found almost 570 that had been sleeping for more than 4 days with open_txn = 1. […] Eventually, in consultation with the app team, we had to KILL the sleeping SPIDs that had been open for over 2 days. Once that was done, the processes in the RUNNABLE state started executing, and CPU utilization slowly came down to 50%.
To me, that's all an indication of a "connection leak" caused by the new process. It's hard to prove those and even harder to find but it sounds like the story you told to support your post has teeth to it. And you've also already taken the first step as to what "to check when we bump up into such a scenario".
I did the following search on Google and it came up with a bunch of decent suggestions for this problem...
https://www.google.com/search?client=firefox-b-1-d&q=how+to+find+a+connection+leak+in+code
Unfortunately, a good number of them require code outside of SQL Server to find the offending code. One of them, however, does seem to be a decent bit of code that you can run through T-SQL. It can be found at the following URL:
https://sqlperformance.com/2017/07/sql-performance/find-database-connection-leaks
Although the code doesn't pinpoint the code that started the connection, it does provide the last bit of code executed by the connection. This is just as important because either that code should be closing the connection OR the code that calls that code should be closing the connection. Either way, providing the code to the front-enders should allow them to easily track down where in their process this code occurs. The first bit of code also acts as proof that the code actually does have a connection issue of one sort or another.
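The heart of that approach boils down to something like this (my sketch in the spirit of the linked article, not a copy of its code): grab the most recent statement off each sleeping connection that still holds a transaction, and hand the text to the front-enders.

-- Last statement executed by each sleeping session with an open transaction
SELECT s.session_id,
       s.login_name,
       s.program_name,
       s.last_request_end_time,
       t.text AS last_sql
FROM   sys.dm_exec_sessions    AS s
JOIN   sys.dm_exec_connections AS c ON c.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS t
WHERE  s.status = 'sleeping'
  AND  s.open_transaction_count > 0
ORDER BY s.last_request_end_time;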
--Jeff Moden
Change is inevitable... Change for the better is not.
December 2, 2019 at 5:03 am
Many thanks, Jeff. This has become a recurring problem. The app team says there is nothing wrong with the application. The blame game goes on for a day, and eventually they agree to KILL sessions that have been sitting idle, doing nothing, for the last 2-4 days.
December 2, 2019 at 6:10 am
Many thanks, Jeff. This has become a recurring problem. The app team says there is nothing wrong with the application. The blame game goes on for a day, and eventually they agree to KILL sessions that have been sitting idle, doing nothing, for the last 2-4 days.
Ok... hold on just a minute. You're telling me that Developers have the privs to KILL sessions on your production box??? That's just not right. Developers should, at best, have read and view privs on prod and that's all.
Also, a serious part of what people are calling DevOps is to remember that DBAs and Developers are actually part of the same team working for the same company. Blame games are for children.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 2, 2019 at 2:18 pm
Jeff, the dev team has no permissions on PROD to KILL sessions or do any admin-related tasks.
What I meant is that the app team and the DBA team get caught up in a blame game for a few days whenever such a scenario arises. Eventually, things calm down.
December 2, 2019 at 2:21 pm
Ah. Got it.
--Jeff Moden
Change is inevitable... Change for the better is not.