July 20, 2020 at 7:12 pm
To make my questions easier to follow I have used EX-Listener-02 as the listener name for one of the other AGs.
Also, apologises, as you have covered at least some of this in your question, I just want to make sure I understand.
If I understand correctly you have the following setup (I'm ignoring the async to the second data centre for simplicity)?:
Node 1 - Is normally the primary for HH-SQL-D11 and the secondary for EX-Listener-02
Node 2 - Is normally the primary for EX-Listener-02 and the secondary for HH-SQL-D11?
When HH-SQL-D11 attemps an automatic failover to node 2. Does EX-Listener-02 still work on node 2?
Is there anything in the SQL logs? There is some information here https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/sql-server-error-log-always-on-availability-groups?view=sql-server-2016
The availability group dashboard may also help. There is information here https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/use-the-always-on-dashboard-sql-server-management-studio?view=sql-server-ver15. The item "Synchronization mode and state" maybe particularly useful
I'm wondering if the network issue causes the syncronisation to fall behind and then as it gets worse it causes the failover attempt. I can't find a documentated reference at the moment but, from memory, if the sync is behind then the automatic failover won't work.
July 22, 2020 at 1:06 pm
Node 1 is the Primary for HH-SQL-D11 AG with a listener, this has a secondary replica on HH-SQL-D10.
Node 2 is primary for HH-SQL-D10 AG with a listener, this has a secondary on HH-SQL-D11.
Yes when HH-SQL-D11 attempts a failover the listener on node 2 (for HH-SQL-D10) still works. Indeed, the replica HH-SQL-D10 has failed over from primary D10 to D11 becoming primary automatically perfectly fine recently.
I have found loads of messages in the sql logs on HH-SQL-D11 regarding lease expiry, I understand this to be the heartbeat being missed hence the failover.
Two messages which do concern me are these -
2020-07-18 16:02:19.15 spid55s The local availability replica of availability group 'HH_SQL_AG2' is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.
2020-07-18 16:02:21.24 spid55s Failed to obtain the Windows Server Failover Clustering (WSFC) resource handle for cluster resource with name or ID 'XXXX-XXXX-XXXX-XXXX-XXXXXXXX' (Error code 1722). The WSFC service may not be running or may not be accessible in its current state, or the specified cluster resource name or ID is invalid. For information about this error code, see "System Error Codes" in the Windows Development documentation.
July 24, 2020 at 2:19 pm
Thanks for the additional information. Sorry I can't be sure about the cause of this. I've had similar issues with AG's but not this one.
If you haven't seen it the information here https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitoring-of-availability-groups-sql-server?view=sql-server-2016 might help you to identify what's going on.
My instinct is either:
The sync gets behind so the automatic failover fails
There is something different in the network configuration between the AGs. Are the listener IP's on the same subnet? I haven't used multi subnet clusters so I can't help you with this, I'm just trying to work out why it's just one AG that has an issue.
Viewing 3 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply