High CPU, but failing the mirrored database over and/or back resolves things

  • Trying to get to the bottom of an issue with a load test going through one of our environments.

    CPU quickly hits 100% for SQL Server (no blocking; it seems to just be the volumes involved). What seems to fix it is failing the mirrored database over to its failover partner - with the testing still running, that SQL box then shows normal CPU. Failing the database back to the original server with the testing still in flight also gives normal CPU.

    Obviously, if a reindex or statistics update were required, then the failback to the original server would have seen the CPU spike again - correct? :Whistling:

    So, the question is - what happens within SQL Server when a high availability database is failed over that could be a factor? I am thinking this may be related to other factors, so I am trying to rule SQL Server out as a cause.

  • I assume that you are experiencing bad parameter sniffing and getting inefficient execution plans.

    Generally speaking, updating stats (with 100% sample for skewed data distributions) fixes it (for a while). Permanent fixes are described here: http://www.scarydba.com/wp-content/uploads/2011/01/LuckyConnections.pdf
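
    As a rough sketch (assuming a hypothetical table dbo.Orders with a skewed column), a full-scan stats update looks something like this:

        -- Hypothetical example: rebuild statistics for one table from a full
        -- scan rather than a sample, so skewed distributions are captured.
        UPDATE STATISTICS dbo.Orders WITH FULLSCAN;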

    Want to rule out SQL Server? Which process is using the CPU?

    -- Gianluca Sartori

  • But I suppose my issue with the execution plans/statistics explanation is that if they are rubbish, they should persist when the database is failed back to the primary partner (the server showing the issue in the first place) - is that not correct?

    When the problem occurs, of the 99% CPU on the database server, 98% is SQL Server.

  • Execution plans are cleared from the cache when you make the mirror the principal.

    See for reference: http://www.sqlsoldier.com/wp/sqlserver/doesamirroringfailovercleartheprocedurecache
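
    One way to watch it happen (a rough sketch, nothing mirroring-specific) is to count what is in the plan cache before and after the failover:

        -- Count cached plans by type; run before and after the failover
        -- (or a DBCC FREEPROCCACHE) to see the cache being emptied.
        SELECT cp.objtype, COUNT(*) AS cached_plans
        FROM sys.dm_exec_cached_plans AS cp
        GROUP BY cp.objtype;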

    -- Gianluca Sartori

  • OK, that makes sense (the only problem is that after rebooting both servers a couple of days ago the same issue happened - presumably a reboot also flushes the proc cache in a similar fashion to the failover). I am still confused about the nature of a test load that would use a poorly performing query plan but then start using a better one after a flush.

  • You can get bad plans for a number of reasons.

    Stale statistics, to name one.

    Parameter sniffing is another common cause.
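
    A quick sketch for checking the first one (again assuming a hypothetical dbo.Orders table): list each statistic and when it was last updated.

        -- When were the statistics on this table last updated?
        SELECT s.name AS stats_name,
               STATS_DATE(s.object_id, s.stats_id) AS last_updated
        FROM sys.stats AS s
        WHERE s.object_id = OBJECT_ID('dbo.Orders');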

    -- Gianluca Sartori

  • How do I check whether parameter sniffing is being used? :unsure:

    Look in the plan XML for the parameter list.

    Here's an article on the subject: http://sqlperformance.com/2013/08/t-sql-queries/parameter-sniffing-embedding-and-the-recompile-options
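
    If it helps, here is a rough sketch that digs the compiled parameter values out of the cached plan XML (output column names are illustrative):

        -- Pull the ParameterList out of cached plans to see which parameter
        -- values each plan was compiled ("sniffed") with.
        WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
        SELECT TOP (20)
               st.text AS query_text,
               pr.n.value('@Column', 'nvarchar(128)') AS parameter_name,
               pr.n.value('@ParameterCompiledValue', 'nvarchar(200)') AS compiled_value
        FROM sys.dm_exec_cached_plans AS cp
        CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
        CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
        CROSS APPLY qp.query_plan.nodes('//ParameterList/ColumnReference') AS pr(n)
        WHERE cp.objtype IN ('Proc', 'Prepared');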

    -- Gianluca Sartori

  • I think a possible cause has been found. Overnight, a job changes MAXDOP from 1 to 4, and a bit later it sets it back from 4 to 1 - it looks like the cache flush that should happen as part of that is unreliable. Manually running DBCC FREEPROCCACHE afterwards is needed (see the sketch at the end of this post). I suspect SQL Server is somehow retaining a footprint from the MAXDOP 4 setting while it compiles the query plans.

    This would explain why a failover fixed the issue, as it forced the cache flush.

    Fingers crossed that has got to the bottom of it.
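
    For reference, this is roughly what the overnight step plus the manual workaround looks like (a sketch, not the actual job):

        -- Change MAXDOP back, then explicitly flush the plan cache instead of
        -- relying on the implicit flush the reconfigure should trigger.
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max degree of parallelism', 1;
        RECONFIGURE;
        DBCC FREEPROCCACHE;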

  • Any idea why the cache flush from the MAXDOP change isn't fully working, while the DBCC FREEPROCCACHE command seems successful? Possibly a bug within SQL 2005?
