Temp statistics builds on AlwaysOn Readable Secondary causing query blocking issues

  • At my job we have a very read-heavy workload on our synchronous readable secondary SQL Server: not just reporting, but high-throughput, fast SELECTs. Every once in a while, a couple of fairly high-volume procedures (around 50 executions/sec) will suddenly start blocking each other and timing out for a short period, 2-5 minutes or so, until (it appears) a new plan is generated and locked in. I have tried OPTION (USE PLAN) on the queries (as a last resort) and T-SQL optimizations to deter the problem as much as I could while I investigate. But I think I have finally caught why, and possibly what, is causing this.

    We had this incident occur again yesterday, and I was able to log into the server in time and use sp_WhoIsActive to see the blocking. I tried recompiles, and in the end it cleared on its own after a couple of minutes. (I don't believe the recompiles worked, because there was no plan, because there was no memory; more on that in a second.) This time, however, I was also able to capture the current memory grants, and I found a ton of StatsMan queries running, building temporary statistics across many tables, including the tables the procedures were querying.
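    For anyone else chasing the same symptom: the temporary statistics SQL Server auto-creates on a readable secondary show up in sys.stats with is_temporary = 1 (their names carry a _readonly_database_statistic suffix). A quick check you can run on the secondary, as a sketch, might look like:

    ```sql
    -- Run on the readable secondary: list auto-created temporary statistics.
    -- Temporary stats live in tempdb; sys.stats flags them with is_temporary = 1.
    SELECT
        OBJECT_NAME(s.object_id) AS table_name,
        s.name                   AS stats_name,
        s.auto_created,
        s.is_temporary,
        sp.last_updated,
        sp.rows,
        sp.modification_counter
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.is_temporary = 1
    ORDER BY sp.last_updated DESC;
    ```

    Sorting by last_updated should make it obvious whether a burst of temp-stats builds lines up with the start of an incident.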

    There was a substantial increase in IOPS across our two data disks at the start of the incident, and it continued throughout. The main data drives jumped from averaging about 1K IOPS to roughly 7K IOPS EACH! The free memory on the SQL instance itself was quickly and severely depleted at the same time, going from roughly 14 GB available to 0.

    Since there was no memory available, it leads me to believe SQL Server resorted to plan cache eviction, which left these procedures unable to generate and cache a new execution plan, which is why they were blocking each other while waiting on a plan.

    We currently do have Auto Update Statistics on.

    My question is: what are your thoughts on a solution to this problem? I've thought about updating statistics on the primary more often. Currently we update outdated statistics once per day, but doing it more frequently seems like it could just create another ticking time bomb of I/O and recompiles (though maybe appropriately so, and for the better).
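    One shape that idea could take, as a rough sketch on the primary: instead of a bulk once-a-day pass, update only the statistics whose modification counter has crossed some threshold. The threshold and sample rate below are purely illustrative, not recommendations:

    ```sql
    -- Sketch (run on the PRIMARY): update only stats with heavy churn,
    -- rather than everything once per day. Threshold is hypothetical.
    DECLARE @sql nvarchar(max) = N'';

    SELECT @sql += N'UPDATE STATISTICS '
        + QUOTENAME(SCHEMA_NAME(o.schema_id)) + N'.' + QUOTENAME(o.name)
        + N' ' + QUOTENAME(s.name)
        + N' WITH SAMPLE 25 PERCENT;' + CHAR(10)
    FROM sys.stats AS s
    JOIN sys.objects AS o
        ON o.object_id = s.object_id
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE o.is_ms_shipped = 0
      AND sp.modification_counter > 50000;  -- illustrative cutoff

    EXEC sys.sp_executesql @sql;
    ```

    Run on a schedule (e.g. hourly), this spreads the recompile cost out instead of concentrating it, though it obviously trades that for more frequent, smaller I/O hits.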

    As far as I know, there is no way to disable temporary statistics on a readable secondary, as this is an internal SQL Server optimization that assumes the workloads on the primary and secondary replicas may differ. It would be nice if there were a way to throttle temporary statistics builds, because it looks like SQL Server built them in parallel across MANY tables at the same time, which is what caused the I/O storm.

    For now I will resort to logging occurrences so I can correlate any further incidents, and I'll look into updating statistics more proactively.

    Any thoughts/insight is greatly appreciated! Thanks

  • If you're getting problematic plans, it's probably because of out-of-date statistics, so getting more up-to-date statistics is usually the solution. USE PLAN might not work that well here; instead you may need to use RECOMPILE (although that means no plan reuse and, depending on the size and complexity of the plans in question, a lot of overhead). The core issue seems to be one of memory. It's a severe problem if there is insufficient memory to generate a plan. Either the plan is huge, or you're really short on memory, or both. I think I'd focus first on the memory issue (unless we're talking about gigantic plans here, the kind that won't come back from sys.dm_exec_query_plan and are only retrievable via sys.dm_exec_text_query_plan; then that's the focus).
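    One way to gauge whether you're in that "gigantic plan" territory, as a sketch: sys.dm_exec_query_plan returns NULL in its query_plan column when a plan is too deeply nested to represent as XML, while sys.dm_exec_text_query_plan still returns the plan as text. Comparing the two side by side:

    ```sql
    -- Largest cached plans by text size; plan_xml is NULL for plans
    -- that exceed the XML nesting limit of sys.dm_exec_query_plan.
    SELECT TOP (20)
        qs.plan_handle,
        DATALENGTH(tqp.query_plan) AS plan_text_bytes,
        qp.query_plan              AS plan_xml
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, 0, -1) AS tqp
    OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
    ORDER BY DATALENGTH(tqp.query_plan) DESC;
    ```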

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the reply, Grant.

    I don't think it's getting a bad plan. I think the temporary statistics being auto-built on the secondary caused a huge I/O storm as SQL Server scanned those pages and pulled them into memory. I believe that depleted the instance's free memory (this SQL Server consistently has 10+ GB of free SQL instance memory, with another 15 GB free at the OS level), and at the same time, because of the updated statistics, triggered recompiles on the procedures, which then didn't have enough memory to cache a new plan.
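    To pin that down next time, here's the kind of memory snapshot I plan to capture on the secondary when an incident starts (a sketch using standard DMVs):

    ```sql
    -- OS-level view: free physical memory and the memory state signal.
    SELECT
        available_physical_memory_kb / 1024 AS os_free_mb,
        system_memory_state_desc
    FROM sys.dm_os_sys_memory;

    -- Instance-level view: free vs. total vs. target server memory.
    SELECT counter_name, cntr_value / 1024 AS value_mb
    FROM sys.dm_os_performance_counters
    WHERE counter_name IN (N'Free Memory (KB)',
                           N'Total Server Memory (KB)',
                           N'Target Server Memory (KB)');
    ```

    If os_free_mb and Free Memory both crater at the same timestamp the StatsMan queries appear, that would back up the depletion theory.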

    That's my best hunch from investigation.
