Availability Replica is Disconnected

  • Yesterday afternoon the AG Listener name for a two-node SQL Server 2014 Enterprise AG became unavailable.  We determined that the DNS entry for the Listener no longer existed for some reason.  We manually recreated the Listener name in DNS and could then connect to SQL again with that name.  A bit later I noticed that the node that owned the Windows cluster was not the node that owned the AG resources.  Since it was late in the day and everything was connecting, I waited until this morning to address that.  To resolve it, I rebooted the Windows cluster owner, which happened to be the secondary in the AG.  That failed over the cluster, but now I have a message in SSMS on the secondary (the machine I rebooted just a bit ago) stating that the availability replica is disconnected.  As a result, no synchronization is happening.

    How do I resolve this?

  • I'm also seeing this in the SQL Server Error Log:

    A connection timeout has occurred while attempting to establish a connection to availability replica availability_replica_name with id availability_replica_id. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint for the host instance.

    I've looked at the endpoint info on both machines and it looks like this on both:  TCP://MachineName.domainname:5022

    I tried restarting SQL Server on the secondary per this article, but that did not resolve the problem.  I have not restarted SQL Server on the primary, which the article suggests is sometimes necessary, because that would make the applications unavailable.  The article does indicate that the race condition it blames for this was resolved in SP2 CU4 of SQL Server 2014.  The instance is currently on SP1 CU6.

  • While investigating this, we've realized that there are DNS issues: an A record for one of the nodes didn't exist. We manually created it, but we're still having issues. The point is, this isn't a SQL thing causing the problem.

  • lmarkum - Thursday, August 2, 2018 7:29 AM

      I then noticed a bit later that the node that owned the Windows cluster was not the node that owned the AG resources.

    This doesn't usually matter though. It's not like in an FCI. 

    So what state is your cluster in now; do you have two SQL servers/instances and a file-share witness, something like that? Are they all up and running if you check in Server Manager > Tools > Failover Cluster Manager? 

    How is your Primary replica; is it up and running and accepting connections?

  • Beatrix Kiddo - Thursday, August 2, 2018 9:30 AM

    lmarkum - Thursday, August 2, 2018 7:29 AM

      I then noticed a bit later that the node that owned the Windows cluster was not the node that owned the AG resources.

    This doesn't usually matter though. It's not like in an FCI. 

    So what state is your cluster in now; do you have two SQL servers/instances and a file-share witness, something like that? Are they all up and running if you check in Server Manager > Tools > Failover Cluster Manager? 

    How is your Primary replica; is it up and running and accepting connections?

    I did learn that this scenario isn't like an FCI in that regard.  Everything had always been showing online and fine in the Failover Cluster Manager.  The primary was/is up and accepting connections.  The issue was that one of the nodes no longer had an entry in DNS for some reason and that was preventing SQL Server from communicating on the endpoint.  That in turn killed the synchronization.  We have everything working now after manually adding the DNS record for the node.
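
For anyone hitting the same symptom, the DNS side of the check described above can be sketched in a few lines of Python. This is only a rough illustration, not something from the thread: `parse_endpoint` and `check_endpoint` are hypothetical helper names, and the script only confirms that the host in an endpoint URL of the form TCP://MachineName.domainname:5022 resolves in DNS. It does not test the firewall or the SQL Server endpoint itself.

```python
import socket
from urllib.parse import urlsplit


def parse_endpoint(url):
    """Split an endpoint URL such as 'TCP://host.domain:5022' into (host, port).

    Hypothetical helper. urlsplit lowercases the host name, which is
    harmless for a DNS lookup.
    """
    parts = urlsplit(url)
    if parts.scheme != "tcp" or not parts.hostname or parts.port is None:
        raise ValueError(f"not a TCP://host:port endpoint URL: {url!r}")
    return parts.hostname, parts.port


def check_endpoint(url, resolve=socket.getaddrinfo):
    """Report whether the host named in an endpoint URL resolves in DNS.

    `resolve` defaults to socket.getaddrinfo and is injectable so the
    function can be exercised without a live DNS server.
    """
    host, port = parse_endpoint(url)
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # This is the failure a missing A record for a node would produce.
        return f"{host}: DNS lookup failed ({exc})"
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved IP address.
    addresses = sorted({info[4][0] for info in infos})
    return f"{host}: resolves to {', '.join(addresses)}"


# Example call (the host name is a placeholder, not a real server):
# print(check_endpoint("TCP://MachineName.domainname:5022"))
```

Running it on each replica against the other replica's endpoint URL shows whether name resolution works in both directions, which is where this thread's problem turned out to be.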
