Sudden CPU % Jump to 100% Usage

  • We run a Java application on SQL Server. We are load testing on this machine:

    HP DL380 G8 duel E5-2650 CPU

    16 CPU Hyperthreaded to 32 cores

    32GB Memory

    Dedicated san lun for tempdb, data and log.

    Running on SQL Server 2008 R2, the load test ran fine. We ramped up users and other jobs with no issues over the course of a 4 1/2 hour test, CPU% usage stayed at about 15%. We then upgraded to SQL 2012 (SP1 CU9) on the exact same hardware. Now, when we run the test it runs fine for about 20 minutes (the ramp up) with CPU% staying consistently at 10-15% and then suddenly, over the course of 2-3 seconds, the entire machine -- all 32 cores -- spikes to 100% CPU usage.

    Anybody have any ideas what could be happening? Most postings point to possible heavy queries, but I'd expect to see CPU % climbing not instantly spiking like that. In addition, it's the exact same test, the exact same queries, the exact same database. Yet on SQL 2008 fine, on 2012, we get the spike. We can see that the system is overloaded with SOS_SCHEDULER_YIELD locks, but all I've read just says "That means you're CPU bound." One posting suggested spinlocks, and I found we are way high on SOS_CACHESTORE_CLOCK spinlocks, but we're not really sure if that has any meaning at all.

    Instant spiking like that screams that we've hit some sort of threshold, but we can't figure out what it could possibly be. Any thoughts about what could be happening? Again -- same tests, same machine, same queries, same data on SQL 2008 are great. On SQL 2012 runs great and then spikes from 10% CPU usage to 100% CPU use instantly, over the course of a couple of seconds.

    Thanks in advance!

  • After the upgrade to SQL 2012, did you change the compaitbility level of the database, update the index statistics?

  • ssinger-815504 (5/20/2014)


    We run a Java application on SQL Server. We are load testing on this machine:

    HP DL380 G8 duel E5-2650 CPU

    16 CPU Hyperthreaded to 32 cores

    32GB Memory

    Dedicated san lun for tempdb, data and log.

    Running on SQL Server 2008 R2, the load test ran fine. We ramped up users and other jobs with no issues over the course of a 4 1/2 hour test, CPU% usage stayed at about 15%. We then upgraded to SQL 2012 (SP1 CU9) on the exact same hardware. Now, when we run the test it runs fine for about 20 minutes (the ramp up) with CPU% staying consistently at 10-15% and then suddenly, over the course of 2-3 seconds, the entire machine -- all 32 cores -- spikes to 100% CPU usage.

    Anybody have any ideas what could be happening? Most postings point to possible heavy queries, but I'd expect to see CPU % climbing not instantly spiking like that. In addition, it's the exact same test, the exact same queries, the exact same database. Yet on SQL 2008 fine, on 2012, we get the spike. We can see that the system is overloaded with SOS_SCHEDULER_YIELD locks, but all I've read just says "That means you're CPU bound." One posting suggested spinlocks, and I found we are way high on SOS_CACHESTORE_CLOCK spinlocks, but we're not really sure if that has any meaning at all.

    Instant spiking like that screams that we've hit some sort of threshold, but we can't figure out what it could possibly be. Any thoughts about what could be happening? Again -- same tests, same machine, same queries, same data on SQL 2008 are great. On SQL 2012 runs great and then spikes from 10% CPU usage to 100% CPU use instantly, over the course of a couple of seconds.

    Thanks in advance!

    1) Did the test finish more quickly with CPUs pegged? That could be an indication that things are running 6X faster. Not likely, but not unheard of either.

    2) Were statistics updated with a full scan on all stats after the upgrade? Not likely to be cause of anything, but personally I still like to do it on version upgrade.

    3) Same edition (std-std or ent-ent)?

    4) Is this repeatable?

    5) I would track waits, spinlocks, IO, memory and various perf-mon counters every 5-10 seconds during the test. Probably a >90% chance that your reason(s) will be revealed there somewhere.

    6) Maybe your original system was hit with this bug, http://support.microsoft.com/kb/2526959, and performance sucked because of it. Now things can open up (after some spool-up time) in 2012?? I have seen this happen too (in production systems, not tests).

    7) Can you verify that NO system or database settings changed between the 2008R2 run and the upgraded-to-2012 run??

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (5/20/2014)


    1) Did the test finish more quickly with CPUs pegged? That could be an indication that things are running 6X faster. Not likely, but not unheard of either.

    2) Were statistics updated with a full scan on all stats after the upgrade? Not likely to be cause of anything, but personally I still like to do it on version upgrade.

    3) Same edition (std-std or ent-ent)?

    4) Is this repeatable?

    5) I would track waits, spinlocks, IO, memory and various perf-mon counters every 5-10 seconds during the test. Probably a >90% chance that your reason(s) will be revealed there somewhere.

    6) Maybe your original system was hit with this bug, http://support.microsoft.com/kb/2526959, and performance sucked because of it. Now things can open up (after some spool-up time) in 2012?? I have seen this happen too (in production systems, not tests).

    7) Can you verify that NO system or database settings changed between the 2008R2 run and the upgraded-to-2012 run??

    Thanks for the reply. In order:

    1. No, once the CPUs pegged performance went into the crapper. Was running fine, then terrible.

    2. I believe they were all updated; but we can definitely try it.

    3. Same edition Standard to Standard

    4. Yes, we've repeated it probably a couple of dozen times.

    5. Are all these counters on Windows 2012 Perfmon? Is there a link that can tell us what is outside an acceptable limit for each one?

    6. We looked at the KB, but since SQL 2008 seemed to be good and this didn't apply to SQL 2012, we figured it couldn't be it.

    7. So far as we can tell, every setting we can possibly think of is the same. I agree that it is likely that something changed, but it's no setting we're aware of.

    Thanks for responding!

  • If you can, do a profiler trace of the load test on sql 2008 and compare to a profiler trace on SQL 2012. You will likely find on SQL 2012 certain quries or stored procedures are running with significantly higher reads. Have a look at the execution plans for these queries, and see what could be happening. I have seen queries with subqueries in the select list that were much faster in SQL 2000 than in SQL 2008.

  • ssinger-815504 (5/20/2014)


    TheSQLGuru (5/20/2014)


    1) Did the test finish more quickly with CPUs pegged? That could be an indication that things are running 6X faster. Not likely, but not unheard of either.

    2) Were statistics updated with a full scan on all stats after the upgrade? Not likely to be cause of anything, but personally I still like to do it on version upgrade.

    3) Same edition (std-std or ent-ent)?

    4) Is this repeatable?

    5) I would track waits, spinlocks, IO, memory and various perf-mon counters every 5-10 seconds during the test. Probably a >90% chance that your reason(s) will be revealed there somewhere.

    6) Maybe your original system was hit with this bug, http://support.microsoft.com/kb/2526959, and performance sucked because of it. Now things can open up (after some spool-up time) in 2012?? I have seen this happen too (in production systems, not tests).

    7) Can you verify that NO system or database settings changed between the 2008R2 run and the upgraded-to-2012 run??

    Thanks for the reply. In order:

    1. No, once the CPUs pegged performance went into the crapper. Was running fine, then terrible.

    2. I believe they were all updated; but we can definitely try it.

    3. Same edition Standard to Standard

    4. Yes, we've repeated it probably a couple of dozen times.

    5. Are all these counters on Windows 2012 Perfmon? Is there a link that can tell us what is outside an acceptable limit for each one?

    6. We looked at the KB, but since SQL 2008 seemed to be good and this didn't apply to SQL 2012, we figured it couldn't be it.

    7. So far as we can tell, every setting we can possibly think of is the same. I agree that it is likely that something changed, but it's no setting we're aware of.

    Thanks for responding!

    It is certainly possible you have hit some form of query plan regression by the new optimizer. But I would still like to see the metrics I mentioned trended on a pretty frequent basis during the run. These things are not perf mon counters mostly, but rather information that SQL Server tracks. An experienced tuner would be able to help you here. Watching profiler could be useful too as someone else suggested, but during a performance run you could be affecting the run itself by doing said monitoring. Hey, the Heisenberg Uncertainty Principle in SQL Server!! 🙂 And yes, I do have a degree in Physics so I can use that term without risk of my head exploding! 😀 You could do profiler-to-disk on the server and just turn it off and on regularly until you capture enough of the badness once it starts to do some analytics. Here again you may need some experience to best make use of the trace information.

    Are you running the benchmark code on the SQL box too?? Is it possible that that is somehow involved here? Hmm, and I wonder if auto stats is kicking off and doing something wonky? Reaching here as it is past midnight and my brain is fried! Best hit the sack - long day tomorrow, and the next day, lather/rinse/repeat!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • It is true, that using Profiler may extend the duration of the test, as it will take up some resources, but in this case, I think that is acceptable. The use of Profiler will not affect the query plans used by SQL Server, or the number of reads reported by any query. we already know the performance is bad, now we need to figure out the why of that.

  • Matt Crowley (5/21/2014)


    It is true, that using Profiler may extend the duration of the test, as it will take up some resources, but in this case, I think that is acceptable. The use of Profiler will not affect the query plans used by SQL Server, or the number of reads reported by any query. we already know the performance is bad, now we need to figure out the why of that.

    1) Depending on the type of load you are running you could get an EXTRAORDINARY amount of data captured by a profiler trace, even for a very restricted set of capture definitions. I have come across several clients that drop over 1MB/second and one that does almost 15MB/second for a VERY tight performance tuning template I regularly use!!

    2) Again we get to the Heisenberg Uncertainty Principle - profiler WILL show different reads for the first and second+ execution of each new query under normal setups. The first will include a variety of work associated with compilation (and not obviously associated with profiler itself). This can significantly skew the work you think is happening for small queries. Over a large run this effect should be minimal though, unless you are using an ORM or code generator that throws a lot of 2-3X-repeated-distinct-strings at the database.

    Best of luck with your forensics!! I love to work on problems like you are having!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • On the first point, I certainly agree. I am beginning to hate applications based on Java for this reason (among several others).

    The second point, I have no reason to doubt you. I just don't have any systems that push the limits of the hardware quite so close (not sure if that makes me lucky, or not in this case). I can certainly understand the CPU numbers getting skewed by the generation of extra query plans, but I can't grasp how the reads may get affected. In any case, if I were the OP, I would be looking for differences in orders of magnitude (I still suspect bad query plans are at the root here).

    As an alternative to Profiler, would the data in sys.dm_exec_query_stats be a better way to get at whether certain queries/stores procs are suddenly getting worse plans? I ask, because I have only recently started playing with this view.

  • Matt Crowley (5/21/2014)


    On the first point, I certainly agree. I am beginning to hate applications based on Java for this reason (among several others).

    I've learned that I can save huge amounts of time if I wake up every morning hating all applications (front-end, back-end, whatever) and making exceptions from there. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Matt Crowley (5/21/2014)


    ...

    As an alternative to Profiler, would the data in sys.dm_exec_query_stats be a better way to get at whether certain queries/stores procs are suddenly getting worse plans? I ask, because I have only recently started playing with this view.

    You can certainly roll your own on this front. But I HATE to reinvent the wheel. There are several very good performance monitoring tools out there that will do this automatically for you, and even alert you when key ones start to head south or make a step-change. Your environment sounds like one that should already have such a tool in place.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Matt Crowley (5/21/2014)


    As an alternative to Profiler, would the data in sys.dm_exec_query_stats be a better way to get at whether certain queries/stores procs are suddenly getting worse plans? I ask, because I have only recently started playing with this view.

    Not really. The DMV's data is only for plans in cache. When a plan is removed from cache, all the stats are discarded. So unless you're polling the DMV on a regular basis and have some complex queries to work with the saved results you may well miss stuff.

    It's a very useful DMV, but it's not a replacement for profiler/extended events.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • ssinger-815504 (5/20/2014)


    We run a Java application on SQL Server. We are load testing on this machine:

    HP DL380 G8 duel E5-2650 CPU

    16 CPU Hyperthreaded to 32 cores

    32GB Memory

    Dedicated san lun for tempdb, data and log.

    Running on SQL Server 2008 R2, the load test ran fine. We ramped up users and other jobs with no issues over the course of a 4 1/2 hour test, CPU% usage stayed at about 15%. We then upgraded to SQL 2012 (SP1 CU9) on the exact same hardware. Now, when we run the test it runs fine for about 20 minutes (the ramp up) with CPU% staying consistently at 10-15% and then suddenly, over the course of 2-3 seconds, the entire machine -- all 32 cores -- spikes to 100% CPU usage.

    Anybody have any ideas what could be happening? Most postings point to possible heavy queries, but I'd expect to see CPU % climbing not instantly spiking like that. In addition, it's the exact same test, the exact same queries, the exact same database. Yet on SQL 2008 fine, on 2012, we get the spike. We can see that the system is overloaded with SOS_SCHEDULER_YIELD locks, but all I've read just says "That means you're CPU bound." One posting suggested spinlocks, and I found we are way high on SOS_CACHESTORE_CLOCK spinlocks, but we're not really sure if that has any meaning at all.

    Instant spiking like that screams that we've hit some sort of threshold, but we can't figure out what it could possibly be. Any thoughts about what could be happening? Again -- same tests, same machine, same queries, same data on SQL 2008 are great. On SQL 2012 runs great and then spikes from 10% CPU usage to 100% CPU use instantly, over the course of a couple of seconds.

    Thanks in advance!

    Did you find a solution to this? I'm seeing pretty much this exact same thing.

    Anyone else have any other ideas? Perhaps more people have experienced this.

    Thanks!

  • Please start a new thread and give us as much information as possible.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply