Poor read-only AlwaysOn replica performance

  • Hi All

    We have a 2-node SQL Server 2012 AlwaysOn setup, in synchronous mode with readable replica databases. We have a particular reporting query that we're running against one of the read-only replica databases, which is taking in excess of over 2 hours to complete on the replica node. However, when we run the same query against the primary database on the primary node, it completes in 5 seconds.

    I've attached part of the 2 of the execution plans shown in Redgate SQL Sentry (Primary.png is the quick slow query, Replica.png is the slow query). These 4 rows are the first stages of the query (clustered index scans), which I believe are the cause of the slowness in the slow-running replica database query.

    You can see that the 'Actual rows', 'Actual Executions' and 'Actual Data Size' are significantly higher in the replica database query. I'm struggling to see how this can be, if the databases are replicas of eachother.

    Both nodes (VMs) are very similar in spec. However, I believe that the underlying storage is slower (SaS as opposed to SSD) on the replica - I'll will confirm this later.

    Any help would be appreciated, as would any pointers as to where I should spend my troubleshooting time.

    Thanks,

    Innerise

  • Innerise (7/7/2015)


    Hi All

    We have a 2-node SQL Server 2012 AlwaysOn setup, in synchronous mode with readable replica databases. We have a particular reporting query that we're running against one of the read-only replica databases, which is taking in excess of over 2 hours to complete on the replica node. However, when we run the same query against the primary database on the primary node, it completes in 5 seconds.

    I've attached part of the 2 of the execution plans shown in Redgate SQL Sentry (Primary.png is the quick query, Replica.png is the slow query). These 4 rows are the first stages of the query (clustered index scans), which I believe are the cause of the slowness in the slow-running replica database query.

    You can see that the 'Actual rows', 'Actual Executions' and 'Actual Data Size' are significantly higher in the replica database query. I'm struggling to see how this can be, if the databases are replicas of eachother.

    Both nodes (VMs) are very similar in spec. However, I believe that the underlying storage is slower (SaS as opposed to SSD) on the replica - I'll will confirm this later.

    Any help would be appreciated, as would any pointers as to where I should spend my troubleshooting time.

    Thanks,

    Innerise

    1) Do an IO stall analysis during report runs. That can be a HUGE factor.

    2) Most people don't know that when you make a secondary readable the PRIMARY DATA STRUCTURES are all modified on write to have a 14-byte pointer added so the secondary can do snapshotting to give a consistent read there. This is a horrible thing because it a) fragments pages because everyone has 0 fill factor on all their indexes, b) creates lots more dirty pages and tlog activity that has to be mirrored over to secondary and replayed, c) hits tempdb for snapshotting stuff, d) additional network activity, etc.

    3) Do some binoogling on statistics on readable secondaries.

    4) I didn't review query plans, but did you check compiled/executed values for parameters as well as actual/estimated row counts between the plans?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • on top of what Kevin has said, try this link and this link

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Thanks very much, guys. I'll do some reading and testing over the next couple of days and will let you know how I get on.

  • Perry - those links suggest that making the latest statistics available on a readable secondary (via the secondary replica creating and maintaining temporary statistics for secondary databases in tempdb) should be an automatic process.

    However, if I do a select * from sys.stats on the secondary replica, I see no results with either the _readonly_database_statistic suffix or is_temporary = 1. If there is no user intervention necessary, I'm not sure why the temporary stats aren't getting created...but this may be the cause of the slow query.

    In fact, scrub all that - I can see the temp stats getting created now that I've re-run the query 😀

    So, would I be right in thinking that as long as the temp stats are actually getting created, I can rule this out as being the problem?

  • Innerise (7/8/2015)


    Perry - those links suggest that making the latest statistics available on a readable secondary (via the secondary replica creating and maintaining temporary statistics for secondary databases in tempdb) should be an automatic process.

    However, if I do a select * from sys.stats on the secondary replica, I see no results with either the _readonly_database_statistic suffix or is_temporary = 1. If there is no user intervention necessary, I'm not sure why the temporary stats aren't getting created...but this may be the cause of the slow query.

    In fact, scrub all that - I can see the temp stats getting created now that I've re-run the query 😀

    So, would I be right in thinking that as long as the temp stats are actually getting created, I can rule this out as being the problem?

    Note that the CREATION of the stats can be CAUSING performance problems. That is a LOT of IO each time they are created or updated, especially on large tables. That would need to be taken into account when measuring perf. Note that you can see them being created by tracing the event in profiler or extended events.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • On further inspection, it looks like the execution plans for the same query are different when run on the primary and secondary replica databases. So, I guess I can narrow this down to possible differences in statistics or indexes. However, I feel that one of the last comments in Perry's link:

    "...you will be happy to know that you will never run into stale or missing statistics when running queries on readable secondary"

    may be slightly misleading. OK, there may be a severe delay when waiting for the temp statistics to be created, but even if this is the case, I'm struggling to see what is causing the execution plans be different when the configuration is identical.

    **Update** I can also see that this long-running query is using the version store with a generation rate of 3.41 KB/sec (and a cleanup rate of 0.00 KB/sec), which sounds slow to me.

  • I asked this before but didn't see an answer: What are the compiled and actual parameter values on both plans?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Sorry Kevin - there are no parameters in this query. There are hard-coded values - the query is the result of a particular BI reporting tool.

    I've attached both execution plans if it helps.

    Thanks so much for the assistance.

  • HAH!! Dealing with that mess is many hours of work. Sorry - I don't do that for free. 😎

    I will say that your secondary plan has the dreaded "estimated rowcount 1, actual rowcount LOTS, so lets do a kajillion logical IOs in a nested loop plan" scenario. MANY things cause that estimated rowcount 1, chief among them UDFs, table variables and yet, bad stats.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • in fact without details of the schema it's going to be fairly difficult, there are some hints to possible missing indexes, whilst these should be taken with a pinch of salt initially they may help to pinpoint problem areas

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • 😛 Thanks, Kevin. I'm very grateful for the help you've given me so far, for free.

    Perry - Yes, there's one index in particular that is suggested as 'missing', but the query runs in 1 second (instead of 6+ hours) against the primary replica database without that index.

    I sense a lot of reading and testing on its way... it's fun, though 🙂

  • have you fully digested the links I posted regarding stats on a readable secondary

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply