Rebooting secondary replica of AG causes some databases to stop synchronizing

AlexSQLForums SSChampion Points: 14277 · Answer 1

Hi Steal and Rothj,

Right before rebooting the secondary server, set the following in the AG properties on the primary server:

Availability Mode: Asynchronous Commit

Failover Mode: Manual

Seeding Mode: Manual

Session Timeout: 900 (seconds)

After the reboot, check the AG dashboard or the error log for progress.

Once everything is back to normal, change the properties back.

You can also try setting these with dbatools: https://docs.dbatools.io/Set-DbaAgReplica
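
For anyone who prefers T-SQL over the properties dialog, the settings above can be sketched roughly as follows. This is only a sketch under assumptions: [MyAG] and N'SQLNODE2' are placeholder names for the availability group and the secondary replica being rebooted, and it assumes the settings are applied to that secondary replica's entry, which the post does not spell out.

```sql
-- Run on the PRIMARY replica.
-- [MyAG] and N'SQLNODE2' are placeholders (assumptions), not names from this thread.

-- MODIFY REPLICA accepts one option per statement.
-- Failover mode is changed first, because an asynchronous-commit replica
-- cannot be configured for automatic failover.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQLNODE2' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQLNODE2' WITH (SEEDING_MODE = MANUAL);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQLNODE2' WITH (SESSION_TIMEOUT = 900);  -- seconds

-- After the reboot: list databases that are neither synchronized nor synchronizing.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
WHERE drs.synchronization_state_desc NOT IN (N'SYNCHRONIZED', N'SYNCHRONIZING');
```

To change the properties back, run the same statements with your original values. If the replica was synchronous with automatic failover, restore the availability mode before the failover mode.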

This reply was modified 1 year ago by AlexSQLForums.

Alex S

rudy_rosa Old Hand Points: 328 · Answer 3

Wow, this sounds a bit like the issue we just had. We were on CU18 and just moved to CU20.

https://www.sqlservercentral.com/forums/topic/alwayson-issue

This reply was modified 1 year ago by rudy_rosa. Reason: Adding link

rothj Hall of Fame Points: 3349 · Answer 4

It does sound similar. I just moved from CU18 to CU19 and am waiting to see if it helped. I am thinking about switching off auto seed before the next restart.

steal SSC Eights! Points: 827 · Answer 5

We are on CU19 and will go to CU20 in about 3 weeks.

Alex S, thanks for the suggestion. It will be difficult to test, as we don't have a test system with enough data/load to reproduce the problem (so we would have to test in prod), but I'll discuss it with my DBA team.

We have a ticket open with Microsoft that hasn't proven helpful thus far. Their current theory is that the problem is caused by log backups running. I disagree, as the timing simply doesn't line up, but regardless we have another test scheduled in a couple of weeks with all backups disabled.

I'm continuing to press MS so will share here if I learn anything useful.

rothj Hall of Fame Points: 3349 · Answer 6

May 4, 2023 at 2:41 pm

The log backup timing doesn't line up with the problem on my AG either.

per4inka Grasshopper Points: 15 · Answer 7

Because of the huge number of databases, you run out of worker threads during certain operations (failover, a secondary node reboot, backing up all databases from the secondary node at the same time, etc.).

There are a few ways to fix or reduce the effect of this issue:

1) Increase the number of CPU cores (of course, a pretty expensive fix).

2) Remove unused databases or move some databases to another server.

3) Reduce worker thread usage on the passive node (split backups into groups that run in different time windows, disable read-only requests to the secondary node, and so on).

4) Raise SESSION_TIMEOUT for each AG from the default of 10 seconds to a larger value.
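
Before picking one of these options, it may help to confirm that worker threads really are the bottleneck. The queries below are a minimal sketch using standard DMVs and catalog views. They only read state, and what counts as "too close to the limit" is left to your judgment.

```sql
-- Configured ceiling for worker threads on this instance.
SELECT max_workers_count
FROM sys.dm_os_sys_info;

-- Workers currently allocated, workers active, and tasks queued waiting for a worker.
SELECT SUM(current_workers_count) AS current_workers,
       SUM(active_workers_count)  AS active_workers,
       SUM(work_queue_count)      AS queued_tasks
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';

-- Cumulative THREADPOOL waits since the last restart; a growing count
-- suggests tasks have been waiting for a free worker thread.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'THREADPOOL';

-- Current session timeout (in seconds) for every replica of every AG.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ar.session_timeout
FROM sys.availability_replicas AS ar
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id;
```

Running the first three queries on the secondary shortly after a restart, while databases are still recovering, is likely to be the most informative moment.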

Workaround:

Create a SQL job that monitors AG state: it checks for databases in the 'Recovery_Pending' state and brings them back up, on the secondary node only. The job should confirm that the local AG replica is currently a secondary and then apply the following two commands to each database in the 'Recovery_Pending' state. You can schedule it to run when SQL Server restarts, or add another mechanism that checks the AG failover state.

ALTER DATABASE DBNAME SET HADR OFF;

ALTER DATABASE DBNAME SET HADR AVAILABILITY GROUP = [AGNAME];
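
The detection half of such a job could look roughly like the sketch below. It is an assumption-laden starting point, not the poster's actual job. It only generates the two commands for review and does not execute them, and it assumes that sys.databases still reports the replica_id for a database stuck in RECOVERY_PENDING.

```sql
-- Run on the replica that was restarted.
-- Returns one row per AG database that is in RECOVERY_PENDING while the
-- local replica is in the SECONDARY role, with the rejoin commands as text.
SELECT d.name AS database_name,
       ag.name AS ag_name,
       N'ALTER DATABASE ' + QUOTENAME(d.name) + N' SET HADR OFF; '
     + N'ALTER DATABASE ' + QUOTENAME(d.name)
     + N' SET HADR AVAILABILITY GROUP = ' + QUOTENAME(ag.name) + N';' AS rejoin_commands
FROM sys.databases AS d
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = d.replica_id          -- local replica hosting this database
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
   AND ars.is_local = 1
WHERE d.state_desc = N'RECOVERY_PENDING'
  AND ars.role_desc = N'SECONDARY';          -- never touch databases on the primary
```

A job step could loop over rejoin_commands and run each one with sp_executesql, but reviewing the generated text by hand first seems prudent on a production system.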

This reply was modified 11 months, 3 weeks ago by per4inka.

steal SSC Eights! Points: 827 · Answer 8

Thanks Per4inka, I'll share this in our DBA team meeting tomorrow.

BTW we STILL have an open ticket with MS on this so I'll be sure to post back with whatever we do (or don't) learn from that.

steal SSC Eights! Points: 827 · Answer 9

Just an update that we are still at a dead-end on this with Microsoft. They claim that it's due to our primary and secondary nodes being on different patch levels, but we absolutely get some databases that fail to synchronize after restarting secondary nodes even when all nodes have the same CU.

rothj Hall of Fame Points: 3349 · Answer 10

<sigh> Thanks for the update.

I can confirm I also get some out of sync even though both nodes are on the same patch level.

rothj Hall of Fame Points: 3349 · Answer 11

I applied CU21 late in July. I had my first restarts this past weekend, and both my nodes came up clean with no out-of-sync issues on the secondary. Have you guys applied this CU and still have issues? It might be a fluke, and I'm not holding my breath yet. I didn't see anything in the CU fix list related to thread usage improvements. I did hear that SQL Server 2022 has some improvements and uses fewer threads at startup, so upgrading might help us. Thanks

steal SSC Eights! Points: 827 · Answer 12

We upgraded to CU21 in mid-July and we are still having the issue. Fingers crossed for CU22, which we will apply in early/mid September.

MS is recommending that we double our CPU (a huge cost) and triple our RAM. We doubled the CPU on one cluster as a test but still saw the problem in our last round of reboots. We will continue to test.

We have a large number of databases, which is likely contributing to the problem, so we are working on an initiative/re-design to decrease the database count, but that is not a quick fix.

rothj Hall of Fame Points: 3349 · Answer 13

Bummer, I figured mine was a fluke. Thanks for the update.

I hear you on the licensing. Mine is SPLA, so it is very expensive to bump up.