SQL CPU maxing out - need some guidance

• What is the best way in SQL Server 2016 to identify the cause of a sudden and prolonged CPU spike that takes a 24-core hyper-threaded server, which normally runs at 30-40% under full load, up to 100%?

Most monitoring tools and DBA scripts only report a request's cumulative CPU time since the query started. That is unfortunately useless information when the CPU is maxing out and a query that normally takes 800 ms takes 30 or 40 seconds.

Our system continuously pulls in huge amounts of data from more than 500 sources via several multi-threaded processes on a server farm.
That data is then processed, matched, and merged into user-searchable tables.
Our website users perform searches on that data, upwards of 30,000 per hour.

Here's some more information that should head off the usual initial troubleshooting suggestions:
* Indexes and queries are tuned regularly, kept up to date, and indexes are rebuilt once a week
* Statistics are updated nightly
* The problem is not due to locking/blocking/deadlocks
* We have ALLOW_SNAPSHOT_ISOLATION enabled, and all our searches use snapshot isolation
* Utilization and queue length on the super-fast NVMe PCIe drives are absolutely fine
* There is plenty of memory, and buffer cache/page life expectancy are both good
* Server MAXDOP and cost threshold for parallelism are both set appropriately, and none of our searches go parallel (see the quick check after this list)
* The server has quite a few tempdb data files, which have been increased in blocks of four as needed
* Plan compilations/recompilations per second at the time of the issue are the same as before
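For completeness, here's the quick sanity check I run for those last few items. Nothing environment-specific here, just the standard sys.configurations entries and the tempdb file list:

-- Parallelism settings currently in effect
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN (N'max degree of parallelism', N'cost threshold for parallelism');

-- tempdb data file count and sizes (size is in 8 KB pages)
SELECT name, type_desc, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;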

From all the monitoring tool data and the historical performance data we capture, along with the sp_WhoIsActive captures during the issue, we have not been able to identify a single query, or group of queries, to blame.
The currently active processes look like our normal data load.

When the issue occurs, the searches slow down and more and more of them pile up, making the website unusable. That also makes identifying a culprit in the sp_WhoIsActive logs rather tedious, as we cannot distinguish between cause and effect.

I have analysed our typical cumulative wait statistics for the hour before the issue and then during the issue, and I can see a sharp rise in several wait types in the specific minute when the issue starts.
There's also a sharp rise in the transactions/sec counter, which leads me to believe the issue must be related to data changes. I have reviewed all the running queries related to those transactions and their execution plans, and they were all the same as normal.
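For what it's worth, this is roughly how I diff the waits (a minimal sketch; the one-minute delay and the temp table name are arbitrary):

-- Snapshot cumulative waits, wait a minute, then diff to get waits in the window
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #waits_before
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:01:00';

SELECT w2.wait_type,
       w2.waiting_tasks_count - w1.waiting_tasks_count AS tasks_delta,
       w2.wait_time_ms        - w1.wait_time_ms        AS wait_ms_delta,
       w2.signal_wait_time_ms - w1.signal_wait_time_ms AS signal_ms_delta  -- high signal waits point at CPU pressure
FROM sys.dm_os_wait_stats AS w2
JOIN #waits_before AS w1
  ON w1.wait_type = w2.wait_type
WHERE w2.wait_time_ms > w1.wait_time_ms
ORDER BY wait_ms_delta DESC;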

After a bit more research I found that some of the waits that increased sharply are PAGELATCH_XX waits, which can indicate tempdb contention. We use loads of temporary tables throughout our system, which made this my prime suspect. Unfortunately, the historical performance data we capture does not include the resource_description, so I cannot tell whether the PAGELATCH_XX waits relate to 2:1:1 (PFS) or 2:1:3 (SGAM). I will be sure to try and capture this the next time it happens, but in the meantime I have added another four tempdb data files.
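For next time, this is the sort of capture I have in mind (a sketch only; the PARSENAME trick pulls the page number out of resource_description, and the modulo tests are the documented PFS/GAM/SGAM allocation-page intervals):

SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description,
       CASE
           WHEN pg.page_id = 1 OR pg.page_id % 8088 = 0         THEN 'PFS'
           WHEN pg.page_id = 2 OR pg.page_id % 511232 = 0       THEN 'GAM'
           WHEN pg.page_id = 3 OR (pg.page_id - 1) % 511232 = 0 THEN 'SGAM'
           ELSE 'Not an allocation page'
       END AS page_type
FROM sys.dm_os_waiting_tasks AS wt
CROSS APPLY (SELECT TRY_CAST(PARSENAME(REPLACE(wt.resource_description, ':', '.'), 1) AS int) AS page_id) AS pg
WHERE wt.wait_type LIKE N'PAGELATCH%'
  AND wt.resource_description LIKE N'2:%';  -- database_id 2 = tempdb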

I still have all the captures, query plans, and historical data at my disposal, and can also produce reports from our monitoring tool, SentryOne, should there be specific items to investigate further.

I am at the limits of my knowledge on this, so I have come to ask for some fresh eyes on the matter. The problem is unfortunately intermittent and seemingly random: we have days with increased data load and user searches without issue, followed by a much quieter day when it suddenly strikes. There have been three catastrophic outages in the last two weeks.

    Many thanks and looking forward to your insights.

• By any chance, did you use Entity Framework (EF) to create the GUIs?  This sounds precisely like something we went through recently.  It turned out to be the connection strings for the apps, along with a touch of bad code, that kept the CPUs busy at 30-40% under normal conditions.  The nasty and difficult part of the problem to find was that, despite what all of the books seem to say, EF connection strings default to having M.A.R.S. (Multiple Active Result Sets) turned on rather than off (and they SHOULD be off).  Step 1 is to check all connection strings and make sure that M.A.R.S. is turned off.

    The next part is to find and fix the code that's bad to begin with.  Let me know the outcome of the M.A.R.S. thing and we can shoot from there.
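If it helps, you can also check for MARS from the server side without touching the apps. As I understand it, MARS logical sessions show up in sys.dm_exec_connections with net_transport = 'Session' (a quick sketch, not gospel):

SELECT c.session_id,
       c.net_transport,
       c.parent_connection_id,   -- the physical connection the MARS logical session rides on
       s.program_name,
       s.host_name
FROM sys.dm_exec_connections AS c
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = c.session_id
WHERE c.net_transport = N'Session';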

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

• Hi Jeff, thanks for the reply. No, we didn't use EF, and we don't have MARS in our connection strings.
    We also had another occurrence of the issue yesterday afternoon, which means my prime suspect of tempdb contention was incorrect, as the additional files didn't help.

  • FritsD - Tuesday, October 17, 2017 3:16 AM


Ok... the next logical suspect is a "runaway query" that has at least one accidental many-to-many join (also known as a data-driven CROSS JOIN).  You need to capture the code that's running when one of these incidents occurs, along with some CPU and I/O usage stats, so that you can figure out which code needs to be fixed.
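To illustrate what I mean (hypothetical table and column names), if the join column isn't unique on either side, every duplicated key value multiplies out before anything trims it back down:

-- If 1,000 rows in each table share the same JoinCol value, this one key
-- produces 1,000,000 internal rows... a hidden CROSS JOIN per key value.
SELECT a.SomeCol, b.OtherCol
FROM dbo.TableA AS a
JOIN dbo.TableB AS b
  ON b.JoinCol = a.JoinCol;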

If it was bad enough to overwhelm the resource usage of "normal" code over time, you might be able to find it in the "standard" instance-level reports built into the "Reports" pulldown when you right-click the instance in the Object Explorer window.  If not, you need something like Adam Machanic's sp_WhoIsActive to catch such a thing while it's running.
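The handy bit for your "elapsed time" complaint is that sp_WhoIsActive can sample twice and report the delta, so you see the CPU burned during the window instead of since the request started. Something like this (parameter names are from Adam's documentation; adjust the interval to taste):

EXEC dbo.sp_WhoIsActive
     @delta_interval = 10,                 -- sample twice, 10 seconds apart; _delta columns show usage in that window
     @get_plans      = 1,                  -- include query plans
     @sort_order     = '[CPU_delta] DESC'; -- heaviest CPU consumers in the window first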

--Jeff Moden

  • Jeff Moden - Tuesday, October 17, 2017 6:01 AM


Hi Jeff, thanks - I use Adam's scripts and have captures from before, during, and after the incident. I have gone through them with a fine-tooth comb, "RBAR" as you put it, and nothing jumps out at me.
    All our code uses stored procedures, which are good for analysing the query cache and execution history, and I have compared query plans during the issue with plans from before; they are identical.
    I will also take a look at the instance reports, just in case there's something there that is not reported in SentryOne.
    Coming back to my question: is there no way to see real-time/snapshot CPU usage per query that is not "time elapsed" based?

  • FritsD - Tuesday, October 17, 2017 6:24 AM


If you have SentryOne monitoring the server, go and see what was running at the time of the incident.  In the SQL Sentry client there's a tab for Top SQL.  Change your date/time window and see what was running.  If it was captured, you should be able to identify what the runaway query was doing.
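As for your DMV question: you can get window-based CPU per query yourself by snapshotting sys.dm_exec_query_stats and diffing total_worker_time (microseconds of CPU). A rough sketch, with an arbitrary 30-second window; note it only sees plans still in cache:

SELECT sql_handle, plan_handle, statement_start_offset, statement_end_offset,
       total_worker_time, execution_count
INTO #qs_before
FROM sys.dm_exec_query_stats;

WAITFOR DELAY '00:00:30';

SELECT TOP (20)
       (q2.total_worker_time - ISNULL(q1.total_worker_time, 0)) / 1000 AS cpu_ms_in_window,
       q2.execution_count - ISNULL(q1.execution_count, 0)              AS execs_in_window,
       SUBSTRING(st.text, (q2.statement_start_offset / 2) + 1,
                 ((CASE q2.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE q2.statement_end_offset
                   END - q2.statement_start_offset) / 2) + 1)           AS statement_text
FROM sys.dm_exec_query_stats AS q2
LEFT JOIN #qs_before AS q1
  ON  q1.plan_handle            = q2.plan_handle
  AND q1.statement_start_offset = q2.statement_start_offset
  AND q1.statement_end_offset   = q2.statement_end_offset
CROSS APPLY sys.dm_exec_sql_text(q2.sql_handle) AS st
ORDER BY cpu_ms_in_window DESC;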
