Bit if an unusual one here, and you’ll have to forgive me for the lack of examples, it was found while recovering from a big failure in our production environment.
We use SQL Server 2016 with Availability Groups, and we’re configured with a readable secondary which is where we take our backups from. I’m aware of certain nuances around this, but we have ~15TB of data which takes a while to backup and ship into online storage.
Our cluster configuration was altered while introducing a third node to the cluster. Basically, the disks were reconfigured as cluster resources, which took every node in the cluster down.
After a certain amount of poking, we were about to bring both nodes backup, albeit as separate instances. Primary databases came back up in a restoring state, and came online easily. These were added into an availability group. Secondary databases however, came back up with a “Recovery pending” state.
One small and fairly static database rejoined the AG easily, but the other more transactional ones all errored when we attempted the same, giving messages that the LSNs were out of sync. We attempted to restore transaction logs onto the secondary databases, but all were telling us that even our latest transaction log backup was too early to be applied.
In the end we were able to recover the secondary databases by taking a fresh full backup from the primary and restoring it across, but I’m sure you can imagine that added considerable time to our recovery. Given the errors we’d received, I had no faith in the log chain we already had, and didn’t want to spend 5 hours bringing the secondary up if it was just going to give us the same error that we’d already been receiving. But this has given me doubts that our secondary backups are good enough in this case. I’ll need to try to replicate what we saw in a development environment.
But I have a few questions:
1) Does the recovery pending state actually allow transaction log backups to be restored on top? If no, I suppose the remaining questions are somewhat moot.
2) Since SQL Server was telling us our latest log backup was too early to be applied, could we have been victim of the failure happening at the exact time a log backup was being taken? If every disk became unavailable at the point of failure, it seems unlikely SQL Server would be able to rollback the transaction log backup, although I’ll admit to not knowing exactly what happens under the covers in situations like these.
3) Is this just another nuance of taking backups from a readable secondary that is in asynchronous commit?
We need to decide what to do with our backups, but I have no faith that they will help us should we lose only a secondary node in the future. They’re currently being taken from the Primary as a temporary measure.
Many thanks for any responses, apologies for the vague nature of the post.