AlwaysOn High Availability and Statistics

  • Hi All

     I have a 4 node AlwaysOn environment running the latest service pack but not on the latest cumulative update as yet. We've been running this environment a few years now. We are now having an issue when failing over to any of the secondary's were we see a performance dip (high CPU) which we can then only resolve by updating statistics on certain large and heavily used tables.

    As mentioned this environment has been running for a few years and we've never had this issue before. The only noticeable change is the size of the data. Dbs that were initially around 400GB have more than doubled.

    In-terms of statistics maintenance...we have daily and weekly jobs for them (also we only see this performance issue after a failover)

    If I understand MS documentation correctly,this should not be the case because statistics update transactions on the primary should be replicated to the secondary's. After a failover I can see a difference between the tables statistics last updated dates.
    Just curious if there is anyone else experiencing this.

  • Denesh Naidoo - Tuesday, September 25, 2018 7:18 AM

    Hi All

     I have a 4 node AlwaysOn environment running the latest service pack but not on the latest cumulative update as yet. We've been running this environment a few years now. We are now having an issue when failing over to any of the secondary's were we see a performance dip (high CPU) which we can then only resolve by updating statistics on certain large and heavily used tables.

    As mentioned this environment has been running for a few years and we've never had this issue before. The only noticeable change is the size of the data. Dbs that were initially around 400GB have more than doubled.

    In-terms of statistics maintenance...we have daily and weekly jobs for them (also we only see this performance issue after a failover)

    If I understand MS documentation correctly,this should not be the case because statistics update transactions on the primary should be replicated to the secondary's. After a failover I can see a difference between the tables statistics last updated dates.
    Just curious if there is anyone else experiencing this.

    Sounds like you could be running into this issue:
    FIX: Bad query plan created on secondary replicas after statistics updated via FULLSCAN option on primary replica in SQL Server 2012 or 2014

    Sue

  • Thanks Sue_H

    I have come across that link...but I understood it a different way...issues related to queries that are run on secondary server (for eg: if you have redirected your reporting queries off the primary onto the secondary). It doesn't mention anything related to when the secondary becomes the new primary.

    However I see the fix they talk about is in CU5 and I only have CU4 so it's worth a try.

    Will check this out and get back.

    Thanks again

  • I can't explain the differences in the statistics (and I've reached out to some people to see if someone can). However, I can explain performance hits on failover. You're going to have to compile every single query after failover because they won't be in memory, explaining a very high CPU depending on your system because compiling plans is expensive. You're also going to have to warm up the cache because none of the data that's been living there in the primary will be there in the secondary. You may also occasionally see performance differences because the stats in the secondary that are used to compile a plan, while being the same stats as in the primary, are newer than the stats used to compile the plan in the primary (stats changed in the primary but the threshold for a recompile event was not reached). Any or all of these things will explain lots of performance problems on a failover.

    Again though, a literal difference in the statistics, I'm at a loss. We'll see what the people I reached out to say.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • As much as anything this can be a (albeit odd) parameter sniffing issue. Your compiled plan on the old primary is likely happy and full of good things for your workload, even as things evolve. When you failover the stats are out of date (but have never hit the limit to cause them to be rebuilt), so your query comes along, it goes to compile a new plan, the stats don't support it well, and so it goes off the rails in an unexpected way. This is particularly applicable when you are using dates or incremental keys in the predicates.

    To evaluate if this is the case I would grab the plans off of each server for the procs that are going sideways and compare them. See where things are different. It's likely that an alternate index is being used on the new replica prior to your stats rebuild.



    Shamless self promotion - read my blog http://sirsql.net

  • Grant Fritchey - Thursday, September 27, 2018 6:57 AM

    I can't explain the differences in the statistics (and I've reached out to some people to see if someone can). However, I can explain performance hits on failover. You're going to have to compile every single query after failover because they won't be in memory, explaining a very high CPU depending on your system because compiling plans is expensive. You're also going to have to warm up the cache because none of the data that's been living there in the primary will be there in the secondary. You may also occasionally see performance differences because the stats in the secondary that are used to compile a plan, while being the same stats as in the primary, are newer than the stats used to compile the plan in the primary (stats changed in the primary but the threshold for a recompile event was not reached). Any or all of these things will explain lots of performance problems on a failover.

    Again though, a literal difference in the statistics, I'm at a loss. We'll see what the people I reached out to say.

    Thank you Grant for your response.
     If I understand you correctly, after a failover we should see high CPU as all queries start to recompile...but this should start getting better.
    My other concern is why are we only seeing this behavior now. As mentioned we've been running this configuration since 2015 and never had performance issues on a failover. 

    Just to add...we are testing the latest CU in QA. Once approved we will install on production and check again

  • Nicholas Cain - Thursday, September 27, 2018 9:00 AM

    As much as anything this can be a (albeit odd) parameter sniffing issue. Your compiled plan on the old primary is likely happy and full of good things for your workload, even as things evolve. When you failover the stats are out of date (but have never hit the limit to cause them to be rebuilt), so your query comes along, it goes to compile a new plan, the stats don't support it well, and so it goes off the rails in an unexpected way. This is particularly applicable when you are using dates or incremental keys in the predicates.

    To evaluate if this is the case I would grab the plans off of each server for the procs that are going sideways and compare them. See where things are different. It's likely that an alternate index is being used on the new replica prior to your stats rebuild.

    Thanks Nicholas...I will compare the plans

  • Denesh Naidoo - Thursday, October 4, 2018 1:12 AM

    Thank you Grant for your response.
     If I understand you correctly, after a failover we should see high CPU as all queries start to recompile...but this should start getting better.
    My other concern is why are we only seeing this behavior now. As mentioned we've been running this configuration since 2015 and never had performance issues on a failover. 

    Just to add...we are testing the latest CU in QA. Once approved we will install on production and check again

    There could be a number of factors on this. It's hard to know for certain. 3 years worth of growth and changes to your data and the usage patterns of the app could explain it. It might be due to SP/CU changes. Do you have performance tracking from 3 years ago that shows it was better or just more complaints now than before? If you have the data, drill down into the details, wait statistics especially, to see where the differences come from.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply