VMware snapshot breaks Windows 2012 cluster

  • Hi,

    We have set up a SQL Server 2012 AlwaysOn availability group on Windows Server 2012. It runs and fails over successfully. Recently, VMware took a snapshot of the primary and it broke the cluster. We saw the following errors in the cluster log.

    In the VMware settings, the quiesce option is turned off for these VMs. Also, we configured the cluster setting :\Windows\system32> (Get-Cluster).SameSubnetThreshold = 10 (Relaxed)

    Cluster node 'Test1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or

    File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\Test\TEstQuorum'. Please ensure that file share '\\Test\TestQuorum' exists and is accessible by the cluster.

    In the AlwaysOn error log:

    A connection timeout has occurred on a previously established connection to availability replica 'Test1' with id [CAD40D99-E333-457E-9993-BBE977D2CDA2]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

    I am pulling my hair out. Any thoughts?

    Thanks

    AyeMya

  • Are you using RDMs or normal VMDKs? If you are using VMDKs then you'll want to turn the quiesce option back on. If you are using RDMs then just stop taking snapshots of the VMs. Why the change to the cluster (SameSubnetThreshold)? What version of vSphere? Does this happen when you snapshot the primary replica, or the secondary replica, or any replica?

    Odds are your answers will lead me to ask more questions before giving you an answer.

  • It is a normal VMDK backup. The VMware version is 5.1. It is taking a snapshot of the primary replica. We changed SameSubnetThreshold to 10 (Relaxed) to be more tolerant of failures, following this link:

    http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx

    Every time VMware takes a snapshot, we get a connection timeout error.

    A connection timeout has occurred on a previously established connection to availability replica 'Test' with id [399ED765-5052-4448-86B1-02818E038E45]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

    After 30 seconds, the connection is re-established.

    A connection for availability group 'AG' from availability replica 'Test' with id [DE46449A-072C-4274-9E48-ABD821D815B6] to 'Test' with id [399ED765-5052-4448-86B1-02818E038E45] has been successfully established. This is an informational message only. No user action is required.

  • OK, well the first problem is that with the quiesce option turned off, the snapshots are useless for backups: SQL Server hasn't flushed its buffers, so the transaction log and the database files may not be in sync. Turn the quiesce option on, take the snapshot again, and see if that resolves the problem.

  • We already turned quiesce back on and the connection timeout still occurred.

    "connection timeout has occurred on a previously established connection to availability replica "

    How high would you suggest setting the SameSubnetThreshold value?

    Thanks.

    AyeMya
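
For context on the threshold question above: the cluster drops a node after a number of consecutive missed heartbeats, so the tolerance is roughly the heartbeat delay multiplied by the threshold. The sketch below only illustrates that arithmetic. The 1000 ms delay and the default threshold of 5 are assumed Windows Server 2012 defaults, not values reported in this thread, and the function names are hypothetical.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate seconds of missed heartbeats before a node is removed."""
    return delay_ms * threshold / 1000.0


def tolerates_pause(pause_seconds: float, delay_ms: int, threshold: int) -> bool:
    """True if a VM pause of the given length stays under the tolerance."""
    return pause_seconds < heartbeat_tolerance_seconds(delay_ms, threshold)


# Assumed values: SameSubnetDelay = 1000 ms, default SameSubnetThreshold = 5.
default_tolerance = heartbeat_tolerance_seconds(1000, 5)
# The "Relaxed" setting used in this thread: SameSubnetThreshold = 10.
relaxed_tolerance = heartbeat_tolerance_seconds(1000, 10)

print(f"default: {default_tolerance} s, relaxed: {relaxed_tolerance} s")

# A hypothetical 30-second pause would exceed both settings.
print(tolerates_pause(30, 1000, 5), tolerates_pause(30, 1000, 10))
```

Under these assumptions, raising the threshold from 5 to 10 only doubles the tolerance, so a snapshot stun that lasts longer than that would still remove the node from the cluster.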

  • The AlwaysOn availability group connection timeout occurs whenever a snapshot is created or removed.

  • Our VM admin set the virtual disk mode to independent on the drives that hold the SQL Server data and log files, the tempdb data and log files, and the system database data and log files. They took a snapshot and no errors occurred. In this case only the OS is included in the snapshot. If the server crashes, will we be able to start the SQL Server service after an OS snapshot is restored? The drives that hold the system database and tempdb files are not included in the snapshot, but we back up the system and user databases to two different locations.

  • You may need to restore the master, model, and msdb databases and then the user databases, but yes, you should be able to recover with just an OS snapshot.

  • Hi All,

    We are also facing the same issue when the VMware backup triggers. Have you found a solution?

    Or can we change or increase the timeout values to delay failover until the snapshot completes?

    Thanks

    sandeep

  • We haven't found a solution yet. We have a workaround.

    First, we set the cluster threshold to the relaxed setting.

    Second, we set the data/log/temp disks to independent mode in VMware.

    We only back up the OS (C:\) and Apps drives.

    http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx

    Hope it works for you too.

    Thanks

    ayemya

  • Hi ,

    Thank you for providing the workaround. I need to work with my VMware team to set this up and try the backups.

    I have another question: in VMware we can choose to back up either only the OS drive or the data drive. How can we configure VMware to back up only the OS and Apps drives?

    I will test this and update you.

    Thanks,

    sandeep

  • Hi,

    We are experiencing the exact same issue with an AG on SQL Server 2012 and VMware 5.5. Snapshots always cause the AG to fail over. Increasing the timeout and only taking a snapshot of the C: drive alleviates the problem somewhat, but it can still cause a failover intermittently. Did you find a resolution to this? We really want to back these boxes up with VDP but can't until we resolve this issue.

    Cheers,

    Joe

  • The quiesce option is turned off in VMware for the servers since we are not backing up the SQL drives.

  • Do not include the memory when taking the snapshot.

    It stuns the server and causes the cluster to fail over.

  • ayemya - Thursday, March 20, 2014 3:16 PM

    We havent found solution yet. We have workaround. First, we set to relax setting in Cluster Threshold. Second, set data/log/temp disk to independent mode in VMWare. We only back up C:\OS and Apps drive http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx Hope it works for you too. Thanks ayemya

    Please let me know where I should change the timeout settings: on the primary or on the secondary? I see error 35206 (connection timeout) on the primary and, as a result, error 976 (broken connection) on the secondary.
