Rebooting secondary replica of AG causes some databases to stop synchronizing

  • Hi Steal and Rothj

    Right before rebooting the secondary server, set the following in the AG properties on the primary server:

    Availability Mode: Asynchronous Commit

    Failover Mode: Manual

    Seeding Mode: Manual

    Session Timeout: 900
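
    For anyone who prefers T-SQL to the properties dialog, the settings above could be applied roughly like this. It is only a sketch: [AGNAME] and 'SECONDARYNODE' are placeholder names, not values from this thread, and the statements are meant to run on the primary replica. MODIFY REPLICA takes one option per statement, and failover mode goes to manual first because automatic failover is not allowed with asynchronous commit.

    ```sql
    -- Run on the primary replica.
    -- [AGNAME] and N'SECONDARYNODE' are placeholders for your own AG and replica names.

    -- Record the current values so they can be restored after the reboot.
    SELECT replica_server_name,
           availability_mode_desc,
           failover_mode_desc,
           seeding_mode_desc,
           session_timeout
    FROM sys.availability_replicas;

    -- One option per MODIFY REPLICA statement.
    ALTER AVAILABILITY GROUP [AGNAME]
        MODIFY REPLICA ON N'SECONDARYNODE' WITH (FAILOVER_MODE = MANUAL);

    ALTER AVAILABILITY GROUP [AGNAME]
        MODIFY REPLICA ON N'SECONDARYNODE' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

    ALTER AVAILABILITY GROUP [AGNAME]
        MODIFY REPLICA ON N'SECONDARYNODE' WITH (SEEDING_MODE = MANUAL);

    ALTER AVAILABILITY GROUP [AGNAME]
        MODIFY REPLICA ON N'SECONDARYNODE' WITH (SESSION_TIMEOUT = 900);
    ```

    To change the properties back, run the same statements with the values recorded by the first query, setting the availability mode back to synchronous commit before re-enabling automatic failover.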

    After the reboot, check the AG dashboard or the error log for progress.
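
    If the dashboard is slow to load with many databases, a DMV query is one alternative way to watch progress. This is a minimal sketch, assuming it is run on the rebooted secondary; the column choice is just a suggestion.

    ```sql
    -- Run on the secondary replica after it comes back up.
    -- Lists every local AG database with its synchronization state and queue sizes.
    SELECT DB_NAME(drs.database_id)      AS database_name,
           drs.synchronization_state_desc,
           drs.synchronization_health_desc,
           drs.database_state_desc,
           drs.is_suspended,
           drs.log_send_queue_size,      -- KB of log not yet sent to this secondary
           drs.redo_queue_size           -- KB of log received but not yet redone
    FROM sys.dm_hadr_database_replica_states AS drs
    WHERE drs.is_local = 1
    ORDER BY database_name;
    ```

    Databases that stay in NOT SYNCHRONIZING, or that show a state other than ONLINE, are the ones to look at in the error log.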

    Once it's all back to normal, change the properties back.

    You can also try setting these with Set-DbaAgReplica from dbatools: https://docs.dbatools.io/Set-DbaAgReplica

     

    Alex S
  • Wow, this sounds a bit like the issue we just had. We were on CU18 and just moved to CU20.

    https://www.sqlservercentral.com/forums/topic/alwayson-issue

     

     

  • It does sound similar. I just moved from CU18 to CU19 and am waiting to see if it helped. I am thinking about switching off automatic seeding before the next restart.

  • We are on CU19 and will go to CU20 in about 3 weeks.

    Alex S - thanks for the suggestion.  It will be difficult to test as we don't have a test system with enough data/load to reproduce the problem (so we would have to test in prod) but I'll discuss with my DBA team.

    We have a ticket open with Microsoft that hasn't proven helpful thus far. Their current theory is that the problem is due to log backups running. I disagree, as the timing simply doesn't line up, but regardless we have another test scheduled in a couple of weeks with all backups disabled.

    I'm continuing to press MS so will share here if I learn anything useful.

  • The log backup timing doesn't line up with the problem on my AG either.

  • Because of the huge number of databases, you run short of worker threads during certain operations (failover, a secondary node reboot, backing up all databases from the secondary node at the same time, etc.).

    There are a few ways to fix or reduce the effect of this issue:

    1) Increase the number of CPU cores (which is, of course, a pretty expensive fix).

    2) Remove unused databases or move some databases to another server.

    3) Reduce worker thread usage on the passive node (split backups into separate groups with different time windows, disable read-only requests to the secondary node, and so on).

    4) Raise SESSION_TIMEOUT for each AG from the default of 10 seconds to a larger value.
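
    To see how close an instance is to running out of worker threads, something like the following could be sampled during a failover or a secondary restart. This is a rough sketch using standard DMVs, not a diagnostic Microsoft prescribed in this thread.

    ```sql
    -- Compares the configured worker thread ceiling with current usage.
    -- A non-zero queued_tasks value means tasks are waiting for a free worker.
    SELECT (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers_count,
           SUM(s.current_workers_count) AS current_workers,
           SUM(s.active_workers_count)  AS active_workers,
           SUM(s.work_queue_count)      AS queued_tasks
    FROM sys.dm_os_schedulers AS s
    WHERE s.status = N'VISIBLE ONLINE';
    ```

    If current_workers sits near max_workers_count while databases drop out of sync, that would support the worker thread explanation.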

    Workaround:

    Create a SQL job that monitors AG state: it looks for databases in the RECOVERY_PENDING state and brings them back up, on the secondary node only. The job should first check that the local AG replica is currently the secondary, then apply the following two commands to each database in the RECOVERY_PENDING state. You can schedule it to run on SQL Server restart, or add another mechanism that checks the AG failover state.

    ALTER DATABASE [DBNAME] SET HADR OFF;

    ALTER DATABASE [DBNAME] SET HADR AVAILABILITY GROUP = [AGNAME];
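
    A job step along the lines described above might look like the sketch below. It is an illustration under assumptions, not a tested script: it assumes every AG database stuck in RECOVERY_PENDING on a secondary should be removed from and rejoined to its AG, and it picks up the database and AG names from the catalog views rather than hard-coding them. Rejoining only works while the log chain from the primary is still intact, so try it on a non-critical database first.

    ```sql
    -- Intended to run on the secondary replica (for example as a SQL Agent job step).
    DECLARE @db sysname, @ag sysname, @sql nvarchar(max);

    DECLARE pending CURSOR LOCAL FAST_FORWARD FOR
        SELECT d.name, ag.name
        FROM sys.databases AS d
        JOIN sys.availability_databases_cluster AS adc
            ON adc.group_database_id = d.group_database_id
        JOIN sys.availability_groups AS ag
            ON ag.group_id = adc.group_id
        JOIN sys.dm_hadr_availability_replica_states AS ars
            ON ars.group_id = ag.group_id
           AND ars.is_local = 1
        WHERE ars.role_desc = N'SECONDARY'          -- only act when this node is the secondary
          AND d.state_desc = N'RECOVERY_PENDING';   -- only touch stuck databases

    OPEN pending;
    FETCH NEXT FROM pending INTO @db, @ag;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Remove the local secondary database from the AG ...
        SET @sql = N'ALTER DATABASE ' + QUOTENAME(@db) + N' SET HADR OFF;';
        EXEC sys.sp_executesql @sql;

        -- ... then join it back to the same AG.
        SET @sql = N'ALTER DATABASE ' + QUOTENAME(@db)
                 + N' SET HADR AVAILABILITY GROUP = ' + QUOTENAME(@ag) + N';';
        EXEC sys.sp_executesql @sql;

        FETCH NEXT FROM pending INTO @db, @ag;
    END;

    CLOSE pending;
    DEALLOCATE pending;
    ```

    Because the role check is part of the query, the job does nothing when the node happens to be primary, which makes it safe to schedule on both nodes.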

     

  • Thanks Per4inka, I'll share this in our DBA team meeting tomorrow.

    BTW we STILL have an open ticket with MS on this so I'll be sure to post back with whatever we do (or don't) learn from that.

  • Just an update that we are still at a dead-end on this with Microsoft.  They claim that it's due to our primary and secondary nodes being on different patch levels, but we absolutely get some databases that fail to synchronize after restarting secondary nodes even when all nodes have the same CU.

  • <sigh>  Thanks for the update.

    I can confirm I also get some databases out of sync even though both nodes are on the same patch level.

  • I applied CU21 late in July. I had my first restarts this last weekend and both my nodes came up clean with no secondary out-of-sync issues. Have you guys applied this CU and still have issues? It might be a fluke and I'm not holding my breath yet. I didn't see anything in the CU fix list related to thread usage improvements. I did hear SQL Server 2022 has some improvements and uses fewer threads at startup, so upgrading might help us. Thanks

  • We upgraded to CU21 in mid-July and we are still having the issue.  Fingers crossed for CU22 which we will apply in early/mid September.

    MS is recommending we double our CPU (huge cost) and triple our RAM.  We have doubled CPU on one cluster for a test but still saw the problem in our last round of reboots.  We will continue to test.

    We have a large number of databases, which is likely contributing to the problem, so we are working on an initiative/re-design to decrease the database count, but that is not a quick fix.

  • Bummer, I figured mine was a fluke.  Thanks for the update.

    I hear you on the licensing...mine is SPLA so it is very expensive to bump up.
