Always On Failover Issue

  • Hello, hopefully someone can provide some help for me.

    I have an Always On availability group set up with 4 nodes, two of them in an HA pairing with synchronous commit.  A day ago the AG automatically failed over to the secondary node in the HA pairing, twice.  The errors I have seen are -

    From Failover Cluster Manager - 'Cluster node 'hh-###-###' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.'

    And from the cluster logs for that node  -

    Local endpoint: 10.##.#.###:~3343~

    Remote endpoint: 10.##.#.###:~3343~

    [Operational] 0000#####.0000#####::2020/06/10-10:34:31.486 INFO Microsoft Failover Cluster Virtual Adapter (NetFT) has missed more than 40 percent of consecutive heartbeats.

    Local endpoint: 10.##.#.###:~3343~

    Remote endpoint: 10.##.#.###:~3343~

    [Operational] 0000####.0000####::2020/06/10-10:34:36.908 INFO Cluster has lost the UDP connection from local endpoint 10.##.#.###:~3343~ connected to remote endpoint 10.##.#.###:~3343~.

    This looks network related to me, and as I am a DBA I passed it on to the Networks team to have a look. All they have said is 'The loss of connection to the remote endpoint is almost certainly a server side issue.' and they are not helping any further.

    It happened a couple of times a couple of days ago, but has been fine since.  Being a DBA, I am not sure what to look for or whether it is something to be worried about.
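
For anyone triaging similar entries, here is a minimal Python sketch that pulls the timestamps out of cluster log lines like the two [Operational] entries quoted above and measures the gap between the heartbeat warning and the lost UDP connection. The regular expression and the substrings it matches are taken only from those two quoted lines; the function name `parse_events` and the event labels are made up for illustration, and other cluster log entries may be laid out differently.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the quoted lines, e.g. 2020/06/10-10:34:31.486,
# followed by a level word (INFO) and the message text.
TS_RE = re.compile(r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+(.*)")

# Substrings copied from the two messages in the post; the labels are illustrative.
PATTERNS = {
    "heartbeat_missed": "has missed more than 40 percent of consecutive heartbeats",
    "udp_lost": "Cluster has lost the UDP connection",
}


def parse_events(lines):
    """Return a time-sorted list of (timestamp, label) for recognised lines."""
    events = []
    for line in lines:
        match = TS_RE.search(line)
        if not match:
            continue  # not a timestamped log entry (e.g. an endpoint line)
        ts = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        message = match.group(3)
        for label, needle in PATTERNS.items():
            if needle in message:
                events.append((ts, label))
    return sorted(events)


# The two entries from the post, with the redacted fields left as posted.
sample = [
    "Local endpoint: 10.##.#.###:~3343~",
    "[Operational] 0000#####.0000#####::2020/06/10-10:34:31.486 INFO Microsoft "
    "Failover Cluster Virtual Adapter (NetFT) has missed more than 40 percent "
    "of consecutive heartbeats.",
    "[Operational] 0000####.0000####::2020/06/10-10:34:36.908 INFO Cluster has "
    "lost the UDP connection from local endpoint 10.##.#.###:~3343~ connected "
    "to remote endpoint 10.##.#.###:~3343~.",
]

events = parse_events(sample)
gap_seconds = (events[1][0] - events[0][0]).total_seconds()
print(events[0][1], "->", events[1][1], "after", gap_seconds, "seconds")
```

Run against a full cluster log, the same function gives a list of timestamps that can be lined up with backups, VM activity, or network monitoring for the same period.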

    • This topic was modified 3 years, 10 months ago by  garryha.

  • Do you know if the heartbeat is using the same network as other data?

    I once had an issue where the database backup was using all of the bandwidth and stopping the heartbeat from working long enough for a failover to occur. Or, if you are running in a virtual environment, another VM guest could have used the bandwidth.

    As for whether it's something to worry about: ideally, I think it's worth trying to find out what caused it if possible, since it could repeat or get worse. I think the failover would have disconnected all of the client applications?
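
One way to test the backup theory is to check whether the cluster log timestamps fall inside a backup window. The Python sketch below does that comparison. The two event times are the ones quoted in the original post, but the backup windows are entirely hypothetical placeholders; in practice they might be taken from `backup_start_date` and `backup_finish_date` in `msdb.dbo.backupset`, or from the VM host's own job history. The function name and the 30-second slack are also assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def events_in_windows(event_times, windows, slack=timedelta(seconds=30)):
    """Return (event, window) pairs where the event falls inside a window.

    Each window is widened by `slack` on both sides, since a backup can
    start loading the network slightly before or after its logged times.
    """
    hits = []
    for event in event_times:
        for start, end in windows:
            if start - slack <= event <= end + slack:
                hits.append((event, (start, end)))
    return hits


# Timestamps of the two cluster log entries quoted in the original post.
events = [
    datetime(2020, 6, 10, 10, 34, 31, 486000),  # heartbeats missed
    datetime(2020, 6, 10, 10, 34, 36, 908000),  # UDP connection lost
]

# HYPOTHETICAL backup windows, invented purely to exercise the function.
backups = [
    (datetime(2020, 6, 10, 10, 30, 0), datetime(2020, 6, 10, 10, 40, 0)),
    (datetime(2020, 6, 10, 22, 0, 0), datetime(2020, 6, 10, 22, 15, 0)),
]

hits = events_in_windows(events, backups)
for event, (start, end) in hits:
    print(f"{event} falls within backup window {start} to {end}")
```

If no events land in any real backup window, the backup explanation becomes less likely and attention can move to other traffic on the shared network.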

  • I do believe the heartbeat is using the same network as everything else.

    I use a monitoring tool as well, and at the same time this happened to the cluster it picked up this -

    Cannot connect to SQL Server instance '######' :

    A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.) : The semaphore timeout period has expired [121] (requires acknowledgement)

    From what I have read, this error can relate to network adapter issues, so I will ask my Network guys to have another look at this. I can't see any error messages that don't point towards the network.

    • This reply was modified 3 years, 10 months ago by  garryha.
  • I've seen issues like this related to VSS backups. During these backups the server is frozen for a short time, including its network cards. In one AG environment the primary went into a pending state for a couple of seconds but came back online afterwards without a failover. On other days a failover occurred.
