AlwaysOn HA - secondary node failed and restored from VM image - can't get databases back in sync

Question

AlwaysOn HA - secondary node failed and restored from VM image - can't get databases back in sync

Maxer

SSCrazy Eights

Points: 8964
More actions
February 17, 2016 at 10:35 am

#324918

SQL 2014 AlwaysOn availability group (this is for a DR test, haven't rolled to prod yet)
We have two nodes, Red and Blue.
Red was primary, died (and we then failed over to Blue making it primary).
We restored the VM image of Red from about 12 hours earlier.
Now we tried to resume data sync with Blue and Red.
However, even though they all "see" each other, Red who we just VM restored stays in the not synchronizing state and when we tried a resume we get:
----------
AlwaysOn Availability Groups data movement for database 'MyExampleDatabase' has been suspended for the following reason: "system" (Source ID 7; Source string: 'SUSPEND_FROM_REVALIDATION'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
---------
I think that in this case, we have to delete the databases off Red (who was the VM restored one that died) and restore them back to current and then resync and rejoin them to the AG?
Is there anyway to join them back to the AG WITHOUT having to do a full restore plus logs up to current and then re-joining and synching to the AG?
I can't find much info on that error but it seems to be saying that the logs that were on the AG are now so out of date they no longer have all the needed info in the current log chain (backups were taken and log checkpointed and truncated) and we have to restore from backups to bring all the logs back and up to chain?
Trying to fully understand what is going on here.
Thanks!

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Perry Whittle SSC Guru Points: 233921 More actions · Answer 1

Maxer (2/17/2016)
SQL 2014 AlwaysOn availability group (this is for a DR test, haven't rolled to prod yet)
We have two nodes, Red and Blue.
Red was primary, died (and we then failed over to Blue making it primary).

Ok so far, that's what AGs are designed for

Maxer (2/17/2016)
We restored the VM image of Red from about 12 hours earlier.

Oh dear, this is where the problems start. Do you understand what you have done here?

Maxer (2/17/2016)
Now we tried to resume data sync with Blue and Red.

This is not going to work as the database on Red is at an earlier LSN point than the new Primary. Why did you restore the complete VM image??

I'm seeing flaws in your DR plan after just a few lines, can you explain a bit more about your intended DR plan

Maxer (2/17/2016)
However, even though they all "see" each other, Red who we just VM restored stays in the not synchronizing state and when we tried a resume we get:
----------
AlwaysOn Availability Groups data movement for database 'MyExampleDatabase' has been suspended for the following reason: "system" (Source ID 7; Source string: 'SUSPEND_FROM_REVALIDATION'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
---------
I think that in this case, we have to delete the databases off Red (who was the VM restored one that died) and restore them back to current and then resync and rejoin them to the AG?

You didn't need to do this, why restore the VM image, what are you trying to prove?

Maxer (2/17/2016)
Is there anyway to join them back to the AG WITHOUT having to do a full restore plus logs up to current and then re-joining and synching to the AG?

Yes just bring the node backup without restoring from backup\image

Maxer (2/17/2016)
I can't find much info on that error but it seems to be saying that the logs that were on the AG are now so out of date they no longer have all the needed info in the current log chain (backups were taken and log checkpointed and truncated) and we have to restore from backups to bring all the logs back and up to chain?
Trying to fully understand what is going on here.
Thanks!

Yes, the LSN points no longer match

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

Maxer SSCrazy Eights Points: 8964 More actions · Answer 2

Thanks, I agree with all your statements.

Our IT architect wanted to restore from VM image and bring back into sync.

My thinking was that we had to restore from backups and bring each database back up to the most recent log backup with the still functioning node and then we could join the databases back to the avail group.

Only reason to restore the VM image is it is FASTER than doing a from scratch windows install and re-doing SQL server (at least in our case, our VMs restore rapidly).

That said, the ONLY proper course is to restore last full backup and then apply logs back up to most recent log, correct?

THEN re-join these databases back to the avail group by sync'ing them (not doing the full restore where it backs up the active node and restores it to the passive readonly but where it just tries to sync the log files).

The other option is to let the avail group do the full join where it backs up, then copies over, then restores, but that takes longer than just restoring from our existing database backups. (Unless I'm missing something there).

Thanks again.

Perry Whittle SSC Guru Points: 233921 More actions · Answer 3

are you trying to test the process required to recover from a complete server failure, one that's completely lost?

Are you testing other more realistic DR scenarios too, don't just concentrate on the worst case scenario

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

Maxer SSCrazy Eights Points: 8964 More actions · Answer 4

Perry Whittle (2/18/2016)
are you trying to test the process required to recover from a complete server failure, one that's completely lost?
Are you testing other more realistic DR scenarios too, don't just concentrate on the worst case scenario

Yes 🙂

To both.

Long story short the infrastructure group seems to think that VM snapshots are all we need for SQL backups.

This concerns me for countless reasons, the least of which is that IMHO regardless of what setup is being used, SQL should always have direct access to native SQL backup files (.bak, etc... logs).

So that in any given situation, you can take your tape or hard drive or cloud access key, whatever, and just pull down native SQL backup files.

If our entire server farm is nuked from orbit, we should still be able to run out to Best Buy, grab some crappy desktop computers, and log into MSDN/technet and install Windows server, SQL server, and then take our native SQL backups and start restoring and run the entire company off those desktops and a cable modem connection 😉

Obviously...not ideal, but you get the point.

I don't believe that we should have to rely on our entire VM infrastructure being resurrected, then those images restores, then our backup agent being reconfigured, then it mounting its proprietary backup drive and formats and extracting out our SQL backup files and start restoring them or recovering from VM snapshots.

This entire exercise has just been a giant "what if" test.

Back when I was a DBA (this is for a different situation, I'm thankfully not the DBA on this one but have to work with these systems) I always kept a separate backup of our prod systems and stored it at a separate offsite secure cloud location.

Copy only full backup weekly outside of the other backup routine, and then a weeks worth of copies of the existing backup files as native SQL backup files (not odd third party compressed and manipulated agent run backups).

Anyway, that's my soap box. 🙂

For my own purposes, I draw up a matrix of every possible way our systems could fail and then start testing each of those failures.

Node 1, Node 2, witness, network fail, database file deleted, database dropped, table truncated, restore from backup, restore from log, restore point in time, etc, etc, etc... all servers lost and pull tape from offsite backup and rebuild from scratch, etc...

Takes a few weeks to process through all that, but well worth it so you can document and in some cases script out recovery actions.

Then twice a year, test whatever the approved DR plan is and weekly test restoring backups of prod to dev to ensure the backups actually worked.

Record time to restore for all of those.

Year 1 maybe it takes 3 hours to bring all back online, awesome. Then year 3 we had so much data growth it takes 20 hours to bring all back from restore... suddenly things are a lot less awesome and we need to start having discussions about acceptable time to recovery windows, and costs to improve vs cost of downtime, vs acceptable risks, etc...

Again, that's just me.

I know there are other ways and approaches but that was what I did when it was my hat to wear 🙂

jasona.work SSC Guru Points: 50084 More actions · Answer 5

Maxer (2/18/2016)
Long story short the infrastructure group seems to think that VM snapshots are all we need for SQL backups.
This concerns me for countless reasons, the least of which is that IMHO regardless of what setup is being used, SQL should always have direct access to native SQL backup files (.bak, etc... logs).
So that in any given situation, you can take your tape or hard drive or cloud access key, whatever, and just pull down native SQL backup files.

Mm, yeah, I think your infrastructure guys need a lesson in RPO / RTOs with SQL...

Perry Whittle SSC Guru Points: 233921 More actions · Answer 6

Maxer (2/18/2016)
Long story short the infrastructure group seems to think that VM snapshots are all we need for SQL backups.

Theyre good for the OS drives but not for sql server databases.

Take sql server backups (native or 3rd party) would be my recommendation.

The VM snapshots do not take a transactionally consistent backup of the database and its only recoverable to the time its snapshotted (if at all).

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉