AlwaysOn synchronous - restarting secondary causes long pauses on primary due to HADR_SYNC_COMMIT

  • Hi all,

    I'm having a bit of a weird issue here that I just managed to get more info on. 

    Setup:
    2-node physical Windows cluster, Windows 2016 + SQL 2017 - standalone SQL instances on each node (local storage) with synchronous AlwaysOn for our primary DB.
    Located in the same DC, racks next to each other, all fairly new hardware with 20+ GB networking between them.

    Issue: 
    We have a fairly high-throughput application, and when the passive node restarts, transactions on the primary node pile up on the HADR_SYNC_COMMIT wait type. This lasts upwards of a minute in some cases, and our application ends up timing out.

    I suspect this is because the primary is trying to harden log records on the secondary, but because the secondary is shutting down it's unable to. At some point the primary becomes aware the secondary is down, stops trying to harden to it, and the waits stop - but the damage is already done.
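
    For anyone wanting to watch it happen, a couple of generic DMV queries are enough to see the waits pile up on the primary - nothing below is specific to our setup:

    -- Sessions currently stuck on HADR_SYNC_COMMIT (run on the primary)
    SELECT r.session_id,
           r.wait_type,
           r.wait_time AS wait_time_ms,
           r.command,
           DB_NAME(r.database_id) AS database_name
    FROM   sys.dm_exec_requests AS r
    WHERE  r.wait_type = N'HADR_SYNC_COMMIT';

    -- Cumulative HADR_SYNC_COMMIT waits since the last restart or stats clear
    SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
    FROM   sys.dm_os_wait_stats
    WHERE  wait_type = N'HADR_SYNC_COMMIT';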

    Soft-Resolution:
    Suspend data movement prior to the restart - no issues observed. This is great, so we can do scheduled maintenance on our secondaries without affecting the primary.
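
    For what it's worth, the suspend/resume itself is just a database-level command, something like the following (the database name is a placeholder):

    -- Before the restart: suspend data movement for the AG database.
    -- Run on the secondary to suspend only that copy, or on the primary
    -- to suspend movement to every secondary for this database.
    ALTER DATABASE [YourAGDatabase] SET HADR SUSPEND;

    -- After the secondary is back up: resume so it can catch up.
    ALTER DATABASE [YourAGDatabase] SET HADR RESUME;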

    BUT I'm worried that an unscheduled hardware failure/whatever hits our secondary in the middle of the night and our 24/7 app goes down due to timeouts.  

    Anyone run into this before, or have thoughts on how we can minimize the impact of stopping synchronous secondaries on our primary node (outside of scheduled windows)?

    Thanks!

  • It's a known issue/expected behaviour. If you don't need synchronous commit mode*, you can mitigate some of the issues by switching to asynchronous mode.

     I suspect this is because the primary is trying to harden log records on the secondary, but because the secondary is shutting down it's unable to. At some point the primary becomes aware the secondary is down, stops trying to harden to it, and the waits stop

    Yes, but make sure you have good monitoring on the drives hosting your transaction log files, because while the primary is unable to sync to a secondary, the transaction logs cannot be truncated (even if you are taking regular t-log backups), so they may grow rapidly, depending on your system. The problem can get worse very quickly, and then suddenly your databases are read-only, or worse.
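
    As a rough sketch of what to alert on (no particular database assumed): log_reuse_wait_desc will show AVAILABILITY_REPLICA for databases whose log is being held by an unreachable or suspended replica, and the log-space DMV shows how close you are to trouble.

    -- Which databases cannot truncate their log, and why
    SELECT name, log_reuse_wait_desc
    FROM   sys.databases;

    -- Log size and percentage used for the current database
    SELECT total_log_size_in_bytes / 1048576 AS total_log_size_mb,
           used_log_space_in_percent
    FROM   sys.dm_db_log_space_usage;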

    * are you really synchronising to a server in the next rack, by the way?

    Thanks for taking the time to reply, Beatrix Kiddo!

    I was afraid this was the expected norm 🙁  My only hope now is that there's some sort of timeout value we can tune - I'll be opening a case with MS to see if it's possible to tweak.

    I hear you on monitoring the t-log - we have some async AlwaysOn on different boxes syncing to a secondary DC, and network interruptions sometimes cause us to build up hundreds of GB of log on the primary - to the point where we're hovering over the button to drop the replica. Re-establishing the replica of a 4 TB DB across different continents isn't much fun though, so in the past we've expanded the primary log to make sure there's enough space while the network recovers, then shrunk the log back to its normal size. Not ideal, but it works in a pinch.
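
    For anyone else in the same boat, that grow-then-shrink dance is just the standard file commands - the logical file name and sizes below are made up:

    -- Pre-grow the primary's log so it isn't auto-growing in small chunks
    -- while the replica catches up (name and size are examples only)
    ALTER DATABASE [YourAGDatabase]
    MODIFY FILE (NAME = N'YourAGDatabase_log', SIZE = 500GB);

    -- Once the replica has caught up and the log has been truncated,
    -- shrink it back towards its normal size (target is in MB)
    USE [YourAGDatabase];
    DBCC SHRINKFILE (N'YourAGDatabase_log', 102400);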

    And you are correct, we can switch to async, and this may be the way we have to go. But there are more issues with that, as async does not support automatic failover 🙁
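
    If we do end up going that route, it should just be a replica property change along these lines (AG and replica names are placeholders), with failover mode dropped to manual since automatic failover needs synchronous commit:

    -- Run on the primary.  Automatic failover requires synchronous commit,
    -- so switch the failover mode first, then the availability mode.
    ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SECONDARYNODE'
    WITH (FAILOVER_MODE = MANUAL);

    ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SECONDARYNODE'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);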

    Regarding syncing to the next rack - yes, we are... This is a pretty normal HA cluster setup when you don't have a shared SAN - there isn't really another way to keep the data in sync without third-party tools, as far as I'm aware.

    Cheers,

    Jeff

  • jkelly95z@yahoo.com - Friday, February 15, 2019 5:09 AM

    Regarding syncing to the next rack - yes, we are... This is a pretty normal HA cluster setup when you don't have a shared SAN - there isn't really another way to keep the data in sync without third-party tools, as far as I'm aware.

    Really? That's not the way we have it set up.

    We have a 4-node cluster across 2 data centers, with 2 nodes in each. DC1 has nodes A & B: A is Primary 1 (OLTP) with B as its secondary, and B is Primary 2 (reporting) with A as its secondary. A & B are synchronous with each other. DC2 has nodes C & D, both asynchronous secondaries to A & B. None of these use a shared SAN for the AG, and they're in different racks as far as I know.
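
    If you ever want to sanity-check a layout like that, the replica DMVs spell it out - this is a generic query, nothing specific to my environment:

    -- Role, commit mode and failover mode for every replica in every AG
    SELECT ag.name AS ag_name,
           ar.replica_server_name,
           ars.role_desc,
           ar.availability_mode_desc,
           ar.failover_mode_desc,
           ars.synchronization_health_desc
    FROM   sys.availability_groups AS ag
    JOIN   sys.availability_replicas AS ar
      ON   ar.group_id = ag.group_id
    LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
      ON   ars.replica_id = ar.replica_id;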

    Brandie Tarvin, MCITP Database Administrator. LiveJournal blog: http://brandietarvin.livejournal.com/ - on LinkedIn, Google+, and Twitter. Freelance writer: Shadowrun; Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

    Sorry for the confusion - I just meant using AGs with local storage and synchronous commit for the 2 nodes in one DC - sounds like we are the same there.

    I was thinking of the 2 main HA options for the primary DC: either a shared SAN with 1 instance, or 2 instances with local storage + an AG.

  • jkelly95z@yahoo.com - Monday, February 25, 2019 5:49 AM

    Sorry for the confusion - I just meant using AGs with local storage and synchronous commit for the 2 nodes in one DC - sounds like we are the same there.

    I was thinking of the 2 main HA options for the primary DC: either a shared SAN with 1 instance, or 2 instances with local storage + an AG.

    ACK! Please do not use a shared SAN for an AG. First, I don't think it will let you use the same drive (the way a failover cluster does); second, it's not really an AG if you base everything off a shared SAN. If the SAN dies, you have no AG - you have no secondary databases to flip to. That way lies madness. Please, please, please do not let anyone build a 2-node AG with a shared drive for the AG.

    hehe - one or the other, Brandie; no shared SAN for an AG - as you say, that's madness...

    unless each node of the AG is on a separate SAN, at which point... maybe.

  • Sheesh. You had me scared for a minute! @=)

    I'm both happy and sad to report that we found the cause of this issue!

    "The session-timeout period is a replica property that controls how many seconds (in seconds) that an availability replica waits for a ping response from a connected replica before considering the connection to have failed. By default, a replica waits 10 seconds for a ping response. This replica property applies only the connection between a given secondary replica and the primary replica of the availability group. "

    So, it turns out that our session timeout had somehow been increased to 120 seconds, from the default of 10.  We are unsure how this got set and/or how we missed it during our initial investigation.

    But we have it reset to 10 seconds now. 
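
    In case anyone needs it, checking the value and putting it back are both quick - the AG and replica names below are placeholders:

    -- Check the configured session timeout for each replica
    SELECT ag.name AS ag_name,
           ar.replica_server_name,
           ar.session_timeout
    FROM   sys.availability_replicas AS ar
    JOIN   sys.availability_groups AS ag
      ON   ag.group_id = ar.group_id;

    -- Put it back to the 10-second default (run on the primary)
    ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SECONDARYNODE'
    WITH (SESSION_TIMEOUT = 10);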

    I'm still not super happy that this can cause active transactions to wait for up to 10 seconds while SQL internally determines that a replica is dead before switching over to async mode. But I understand the design constraints, and this is a trade-off we have to work with versus the old-school shared-storage/active-passive HA designs.
