Re-creating AG after server loss

  • We recently had to rebuild servers that were configured with AG. I built everything from scratch, but I am wondering if there is any way to recover AG configuration from master db? os backups...

    Thank you.

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • This was removed by the editor as SPAM

  • I can only comment theoretically as I don't have access to an AAG currently, but the AAG configuration I believe is all stored in AD.

    I would 100% recommend against trying to restore system databases to anything clustered SQL server that you intend on using. If they have something you need in them, restore to a new box, get it out then put it into new one.

    You would need to recover the failover cluster instance first. I would probably add the new nodes to the existing failover cluster then remove the dead nodes. You potentially could recycle the computer accounts of the original members for the new servers to avoid permissions problems but this may be a confusing procedure unless you know failover clusters very well.

    Then you need to enable AAG in each of the new instances, Add the servers to the AAG, restore databases from backup to one of the nodes leaving them in restoring state and then add a replica with join only.

    If there are more dead nodes than there are new good nodes, you may need to force quorum to get the failover cluster to come online. If you go the route of adding new nodes to the cluster, would not delete the old ones from the cluster until the new ones are added and online.

    If you have good OS backups containing system state, you could restore the OS and then restore the most current SQL backups and it should come back online, but your sql databases will have to be restored to exactly the same file paths as they were before.

    Personally, I only back up operating systems on SQL servers for political reasons, I won't use a Windows backup to do a recovery of a SQL server. I can get a new instance installed, patched and permissions and jobs redeployed more quickly than I can get a bare metal recovery working correctly again to still be stuck with the same problems the old server may have had.

  • Ideally all your configuration like this should be stored in source control so you can easily recreate these things when they go wrong.

    I would be looking at getting yourself familiar with DBATools and using this to script all your configuration/jobs/logins etc etc etc out to .SQL files and storing them in Git or whatever other source control you use, then everything is just a click away from having it all ready to just recreate on the back of a script.

     

    The main issues you are going to get is down to GUID's in AD, for the replicas, cluster name, listener names as any slight change to these and you may as well just delete all the computer objects and start again, build a new WSFC and then use the scripts you have to recreate the AG.

    The AG is so heavily tied to these GUID's that any change to them caused the AG to be none operational.

    Another thing about having everything in source control is that I can use DevOps pipelines to rebuild an entire AG setup in just a few minutes from spinning up new replicas, installing SQL, recreating the AG, jobs, logins etc, Terraform, Ansible, DBATools scripts to recreate this from code is far easier than restoring anything like this from backups.

  • The cluster and listeners are definitely stored in AD, you only need to join a new node for SQL to pull the configuration into itself. If you are having to look at the identifiers used by the guid, something has probably been broken pretty badly by an administrator.

    I would disagree that using devops to build an availability group is a good idea - the privileges required to build an Active Directory availability groups are high that would have to be granted to devops service accounts. There are also tons of directory permissions and object attributes than can drift pretty quickly over time. Rebuilding an AAG means discarding all those permissions and attributes that applications and users rely on when there is nothing wrong with the cluster objects in AD at all.

    Planning on rebuilding an AAG instead of scaling out and to do maintenance or recovery, means planning on giving up on many of the benefits that an AAG has in DR, rolling upgrades, zero downtime server rebuilds, etc. It may take only a few minutes with adequate automation, but it is still a few minutes of downtime and will certainly incur some amount of reliability loss while active directory and DNS replicate to all sites. The benefit of AAGs is for the business and end users, not the administrators. Am also skeptical about being able to rebuild in a few minutes. Perhaps an empty AAG can be built in a few minutes but even small databases only in the 10s of gigabytes and its logs can take minutes to restore from backup on SSD.

  • Since the OP mentioned rebuilding from scratch, if that was me it would all be automated and replayable multiple times so that you know what you get each time you need to build that environment is the same each time and nothing is missed.

    Based also on rebuilding from scratch I interpreted that as indeed the cluster and everything was in a pretty bad state and needed to all be deleted and recreated, so simply adding a node here to an existing WSFC wasn't possible.

    Giving your build service accounts the needed permissions in AD to create computer objects isn't that much of a big deal.  If you follow the principle of least privileged access and only give the needed accounts the needed permissions you can mitigate the need for accounts to have super elevated/domain admin privileges to create the computer objects.

    Additionally unsure what other permissions and attributes you are referring to here, once the cluster and listener object are created there should be no need at all to ever touch these objects in AD ever again, and should always be the default permissions of the nodes manager the cluster object and the cluster managed the listener object.

  • Can I ask why you needed to rebuild the AG's?   We had issues where SQL dropped a Primary replica from the AG, the only solution was to drop and recreate the AG.  I'm just trying to get a feel if we are the only users with this issue or if there is a trend for this kind of problem.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply