CPU frequently Maxing out 100%

  • Hi over the last 2 months we have frequently see our Production Server's CPU Max out at 100% (occurs twice a week).

    When this happens the only way to get it back down to normal operating conditions is to fail over.

    Killing the session's from the  query's  consuming the most CPU has no effect.

    This has so far only happened between the hours of 7am and 7pm.

    Server spec.

    CPU 48 Cores.

    Ram 760 GIG.

    Server Activity

    -------------------------

    Production server services our website which gets approximately 50,000 hits a day

    Most queries hitting the server are Linq query's.

    We have reporting services on the box all of which use Stored procs (Currently in the process of Moving them over to the data warehouse).

    We have adhoc data loads SSIS Hitting the box through out the day

    We have adhoc Queries (Data Science guys)

    We have a number of Sql Agent Jobs Running throughout the day.

    Maintenance Jobs run overnight Index Rebuilds, full backups etc.

    Under this load the CPU would normally be running at 30 -40 %

     

    Attempts to find the source of the problem

    Prior to CPU spike,

    Using Idera, we see a rise in the Hash Sort Index area of memory.

    I believe this is looking at granted memory in the dmv sys.dm_exec_query_memory_grants.

    Looking at this table hasn't highlighted anything that would raise suspicion.

    We've looked at the top query's from sys.dm_exec_query_stats and again we can't isolate a culprit.

    We've turned off reporting services and also SQL server agent which in turn stopped the replication. Still the CPU remains at

    a constant 100%

    Following a fail over we cant see in the logs any transactions rolling back.

    We are now at a complete loss at what else we can do to try and isolate this.

    Any suggestions as to what we could try next would be very much appreciated.

     

     

     

  • Prior to doing a fail over, see if there are any Killed/Rolled Rollbacks.

    If there are, I'd say there's a 99.5 % chance that you've been bitten by a problem with connections.  Every book and article on the subject states that EF defaults connections to MARS (Multiple Active Result Sets) being deactivated.  That's not what happened for us and I have several friends that, at my request, did a check and found that, indeed, MARS was enabled.

    And, it's MARS that causes this problem.  MARS should only be enabled in connections for a very narrow, specific set of conditions and almost no one actually meets those conditions.

    You need to check ALL of the connection strings for ALL of your applications and ensure that MARS is NOT enabled.

    There is one new possibility for this, as well.  See if any of your software is using sp_GetAppLock.  It doesn't work the way most people think.  It doesn't just block a proc from being run more than once.  It has to start a session first and then it checks.  It's a huge waste of CPU time and, because of the reasons why most people think they want to use that bloody proc, it's normally and arduous and expensive process that can rival rollbacks.

    The other thing that it could be is some serious crap code that is suddenly doing accidental cross joins in the form of many-to-many joints.  You can normally identify these because they use the word DISTINCT in them.

    p.s.  And, YES... I'm actually a victim/survivor of all 3 of these problems.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • EF is the likely the reason, but it's not easy to fix. Your server resource(48 cores, 768GB memory) is more than enough for your workload(50,000 hits a day). Inside EF, the default data type is string or nchar/nvarchar in database terms, if data type in your database is char/varchar, especially for primary key/join columns, table scan/clustered index scan is the most possible way.

    You can ask your developers to explicitly set the data type to varchar instead of default nvarchar, like this

    [MaxLength(30), Required, Column(TypeName = "varchar")]

    public string firstName { get; set; }

  • Hi Jeff,

    Thanks for your quick response.

    A couple of things, firstly I have seen sp_GetAppLock running during 100% CPU utilization,

    I stupidly dismissed this as nothing to worry about.

    Secondly I spoke with our application architect and he has informed me that we

    have(MARS)enabled all over the place.

    Thirdly we do have some very crap legacy code which runs every 25 mins between 7am and 7pm , RBAR inside cursors, utilising temporary tables,

    I think this could be causing recursive hash joins and possibly even bailout. This code is going to become deprecated shortly so there is no motivation from the business to optimize it.

    However I'm still wondering why this occurs Intermittently and killing the process of the above doesn't reduce the CPU utilization

    and why would it only manifest itself around 2 months ago. All of the above have been in place for years?

    We certainly have some things we can look into thanks for your help

     

     

  • It did the same thing for us... went for 7 years with no incident and then it apparently hit a tipping point... and the tipping point wasn't gentle... it was an sudden avalanche, just like yours.

    There are some places where sp_GetAppLock runs fine but it's a pain in the patooti to prove where it is or is not a problem. When it IS a problem, it'll take a while to prove it.  And, no... it doesn't work like trying to call a job more than once.  It's much worse because the attempt is actually cued as if it were going to run.

    you have much larger problems with MARS being enabled.  Even MS states that's a bad idea unless you have a very specific system that can actually benefit from it.  Most do not.  Most people think that it must be enabled to return multiple result sets to the GUI... not true with .NET.

    If things were done correctly, it takes almost no time at all to change the connection strings for applications.  Even if someone could prove that it's not currently a problem, I'd still spend the small bit of time to ensure that MARS were forced to be off so that it doesn't become a problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for your help with this Jeff,

    We now have a sprint to disable all the MARS connections.

  • One other thing it could be is missing indexes, the additional I/O load on the disk can cause high CPU. Have you checked the activity of the disks while you are getting high CPU?

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply