Cannot connect after Availability Group automatic failover

  • Hi

    We have a SQL Server setup with Availaibility Groups, in the cluster we have four nodes, two in one datacenter and two in another datacenter.

    There are four AGs on these cluster nodes, each primarially on one of the four nodes, with one other node in the same datacenter for syncronous HA and the other two for asyncronous in the other datacenter for DR.

    We have one listener for each AG.

    One AG seems to automatically failover at certain times due to various Network issues, however the AG seems to failover fine - the HA secondary becomes primary and all databases come online however applications seem unable to connect and we see various connection error type messages such as,

    '"Cannot connect to SQL Server instance 'HH-SQL-D11' :

    A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 -

    The semaphore timeout period has expired.) : The semaphore timeout period has expired [121] (requires acknowledgement)"

    and

    'named pipes provider error 40 the network path was not found'

    Yet when we failover manually the connection is fine and works as expected. Other AGs on the cluster have automatically failed over and have worked fine. I have checked various things like the SQL Instance is up, TCP/IP and Named pipes enabled, remote connections enabled. I have asked the networks team to check things like firewalls and they say they are all ok.

    I have tried to read the cluster logs but I am not sure what to look for in here and I am at a lost as to what to look at to try and fix this issue. Its a problem as it means that HA is not operating and if it does failover automatically applications connecting will not work.

    Any help would be greatly appreciated.

  • To make my questions easier to follow I have used EX-Listener-02 as the listener name for one of the other AGs.

    Also, apologises, as you have covered at least some of this in your question, I just want to make sure I understand.

    If I understand correctly you have the following setup (I'm ignoring the async to the second data centre for simplicity)?:

    Node 1 - Is normally the primary for HH-SQL-D11 and the secondary for EX-Listener-02

    Node 2 - Is normally the primary for EX-Listener-02 and the secondary for HH-SQL-D11?

    When HH-SQL-D11 attemps an automatic failover to node 2. Does EX-Listener-02 still work on node 2?

    Is there anything in the SQL logs? There is some information here https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/sql-server-error-log-always-on-availability-groups?view=sql-server-2016

    The availability group dashboard may also help. There is information here https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/use-the-always-on-dashboard-sql-server-management-studio?view=sql-server-ver15. The item "Synchronization mode and state" maybe particularly useful

    I'm wondering if the network issue causes the syncronisation to fall behind and then as it gets worse it causes the failover attempt. I can't find a documentated reference at the moment but, from memory, if the sync is behind then the automatic failover won't work.

     

  • Node 1 is the Primary for HH-SQL-D11 AG with a listener, this has a secondary replica on HH-SQL-D10.

    Node 2 is primary for HH-SQL-D10 AG with a listener, this has a secondary on HH-SQL-D11.

    Yes when HH-SQL-D11 attempts a failover the listener on node 2 (for HH-SQL-D10) still works.  Indeed, the replica HH-SQL-D10 has failed over from primary D10 to D11 becoming primary automatically perfectly fine recently.

    I have found loads of messages in the sql logs on HH-SQL-D11 regarding lease expiry, I understand this to be the heartbeat being missed hence the failover.

    Two messages which do concern me are these -

    2020-07-18 16:02:19.15 spid55s     The local availability replica of availability group 'HH_SQL_AG2' is in a failed state.  The replica failed to read or update the persisted configuration data (SQL Server error: 41005).  To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.

     

    2020-07-18 16:02:21.24 spid55s     Failed to obtain the Windows Server Failover Clustering (WSFC) resource handle for cluster resource with name or ID 'XXXX-XXXX-XXXX-XXXX-XXXXXXXX' (Error code 1722).  The WSFC service may not be running or may not be accessible in its current state, or the specified cluster resource name or ID is invalid.  For information about this error code, see "System Error Codes" in the Windows Development documentation.

  • Thanks for the additional information. Sorry I can't be sure about the cause of this. I've had similar issues with AG's but not this one.

    If you haven't seen it the information here https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitoring-of-availability-groups-sql-server?view=sql-server-2016 might help you to identify what's going on.

    My instinct is either:

    The sync gets behind so the automatic failover fails

    There is something different in the network configuration between the AGs. Are the listener IP's on the same subnet? I haven't used multi subnet clusters so I can't help you with this, I'm just trying to work out why it's just one AG that has an issue.

     

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply