Windows Failover Cluster - Vmware Snapshot

  • I am hoping someone can provide some insight into this subject

    I have been working with SQL Server for several years, but have not had the pleasure of working much with the Windows Failover Cluster Service and Availability Groups.  In my new role I am responsible for both WFCS and AG and have a question

    So I need to patch SQL Server and usually take VMWare snapshots in case there is a problem.  It the past I have been using a stand alone SQL installation and no WFCS and I now have WFCS using used by SQL 2012, 2014 and 2016 Availability Groups

    My question:

    1: Will I be able to revert to snapshot when using a WFCS and AG just as easy as it was using non WFCS and Stand alone instances - my concern is with the cluster and AG synchronizing if I had to revert to the snapshot

    2: What is the best process for taking a snapshot, more specifically

    Should I snapshot both server before patching and revert both servers if even one fails

    or

    should I use a process recommended by one of the server opt team member.  They said use the following steps

    A: Shut down SQL Server Service on the Secondary Node and snapshot it

    B: Install your patch, reboot and test as needed

    C: Fail the AG over to the Secondary

    D: Shut down SQL Server Service on the first node, snapshot and then patch and test that node.

    E: Fail everything back.

    Your feedback is appreciated.

    • This topic was modified 3 years, 7 months ago by  jayoub.
    • This topic was modified 3 years, 7 months ago by  jayoub.

    Jeff

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Our network admin created a SQL 2017 HA cluster  (vmware using shared disks seetings - 2 nodes). He takes care of the hardware I take care of the SQL. Anyway, when we were setting it up, I wanted to take a snapshot so I can rollback if there was an issue with my SQL install. He said that he won't be able to do that for the cluster and it's a VMware limitation.  Not sure if you have the same setup.

    Already in production for several months, I have patched our cluster 2x already. So far so good!

    Maybe others would have a different input.

  • Thank you for the feedback

    In my case I do not have any shared storage, so i wonder if it would be the same problem.  I am running Windows 2012 R2, SQL Server 2012 and using only Availability Groups

    Please if anybody can provide more feedback I would appreciate it.

    Jeff

  • First - I'd recommend against crossposting.  I found your post on here, Microsoft and VMWare.  This is the only one with replies though.

    I do not have failover clustering set up at my workplace, but I did find this post on SSC:

    https://www.sqlservercentral.com/forums/topic/vmware-snapshot-breaks-windows-2012-cluster

    That makes it sound like snapshots are iffy with WFC.

    My approach though would be to stop anything on the secondary that would be monitored by the failover clustering (ie anything that needs to be in-sync for the failover to occur), then do the snapshot, make the changes, test everything and then put it back into the failover and force a failover and repeat on primary.

    I'd be doing that on a test environment and document the process so future people have a tried and tested process.  It might be that something in your server doesn't work the way you expect with failover and snapshotting, so that is worth checking out.

    Even if you get advice from a random stranger on the internet that everything is going to be OK, we don't know your environment (as ARPRINCE pointed out with shared disks) and we could VERY easily miss something critical in how you have things set up that we were not expecting and didn't think to ask.  In the end, if random stranger from the internet says it's OK and it goes sideways, your boss is coming to you and you telling him "This free internet forum told me it would be fine" is not going to be a valid excuse to not test this.

    My advice, no matter what the internet tells you, if it isn't coming from an official support channel or partner, it is your neck on the line if the advice is incorrect.  And if it were me, I'd much rather fight with IT to get a few small VM's set up and spend a few hours getting a similar environment set up so I can test what I am doing.  The last thing you want is for your secondary to fail due to something you read on a forum and then primary to have a failure and needs the failover to be in place.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thank you very much for the feedback.  Below are my comments and follow up questions

    First - I'd recommend against cross posting. I found your post on here, Microsoft and VMWare. This is the only one with replies though.

    -Understood. I did not know that the Microsoft forum would not reply.  I am trying to search as many places as I can.  I even have a post on the VMWare forum and also did not get a response.  SSC is my go to

    I do not have failover clustering set up at my workplace, but I did find this post on SSC:

    https://www.sqlservercentral.com/forums/topic/vmware-snapshot-breaks-windows-2012-cluster

    That makes it sound like snapshots are iffy with WFC.

    -Thanks I read through the entire post and found some tips.

    ----------------------------------------------------------------------------------------------------------

    My approach though would be to stop anything on the secondary that would be monitored by the failover clustering (ie anything that needs to be in-sync for the failover to occur),

    -Question: So this would be the SQL Server Service to stop the Availability Group from synchronizing. Correct?

    or would we want to stop the Role (which is the Availability Group) in the cluster?

    then do the snapshot - Got it

    make the changes - Install my patch

    test everything - OK

    then put it back into the failover and force a failover and repeat on primary - OK

    Question: if the patch fails and I am not able to fix it would I at this point revert the server to the snapshot and turn everything back on?

    ----------------------------------------------------------------------------------------------------------

    I'd be doing that on a test environment and document the process so future people have a tried and tested process. It might be that something in your server doesn't work the way you expect with failover and snapshotting, so that is worth checking out.

    -Understood

    ----------------------------------------------------------------------------------------------------------

    Even if you get advice from a random stranger on the internet that everything is going to be OK, we don't know your environment (as ARPRINCE pointed out with shared disks) and we could VERY easily miss something critical in how you have things set up that we were not expecting and didn't think to ask. In the end, if random stranger from the internet says it's OK and it goes sideways, your boss is coming to you and you telling him "This free internet forum told me it would be fine" is not going to be a valid excuse to not test this.

    -Understood

    ----------------------------------------------------------------------------------------------------------

    My advice, no matter what the internet tells you, if it isn't coming from an official support channel or partner, it is your neck on the line if the advice is incorrect. And if it were me, I'd much rather fight with IT to get a few small VM's set up and spend a few hours getting a similar environment set up so I can test what I am doing. The last thing you want is for your secondary to fail due to something you read on a forum and then primary to have a failure and needs the failover to be in place.

    -Good idea.  I will try contacting VMWare tech support for this

    ----------------------------------------------------------------------------------------------------------

    Jeff

  • Thank you very much for the help on this subject and sorry it has taken me so long to reply to this, but I was able to run a test and the results are good.

    In essence I am taking the advice from Mr. Brian Gale where he says

    My approach though would be to stop anything on the secondary that would be monitored by the failover clustering (ie anything that needs to be in-sync for the failover to occur), then do the snapshot, make the changes, test everything and then put it back into the failover and force a failover and repeat on primary

    I did two tests and both worked successfully.

    Test 1: Suspending Data movement on the Primary Replica

    A: On the primary replica I suspended data movement on all the databases in the Availability Group

    B: Powered down and snapshot the secondary server (power down is what VMWare recommends when snapshots are used on a SQL Server)

    C: Powered up the secondary, patched it with a Service Pack and at the same time added data to the primary replica

    D: I then reverted the secondary replica to the snapshot and powered up.

    E: I then resumed data movement and it worked

    Test 2: Same as above, but without suspending data movement

    I believe that Test 2 worked because of how a synchronous commit secondary replica resynchronizes with the primary replica when brought back on line

    ·         When the secondary replica goes off line the status changes from Synchronized to Not Synchronizing

    ·         Once the secondary replica is brought back on line it establishes a connection with the primary and sends its End of Log LSN to the primary replica.

    ·         On receiving this, the primary then starts sending it the log blocks that it hardened after the EOL LSN

    See my concern was if I revert the secondary replica to a snapshot and there were already hardened log blocks sent to the Secondary - the primary would not be able to synchronize again.

    I still appreciate any feedback on this subject.

    Jeff

  • This was removed by the editor as SPAM

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply