Availability Group Errors

  • I've set up a test availability group with a test database. No application connects to the database. Sometimes daily, sometimes every two or three days, I receive various repeating errors in the SQL error log relating to the health of the AG. Curiously, they all occur outside normal business hours (business hours being 8am-6pm). Each batch of errors spans a few seconds or a few minutes, and several batches can occur during the same out-of-hours period.

    In the AG Dashboard Always On health report I get error_reported, availability_replica_state_change, and availability_replica_manager_state_change events. The actual messages typically include:

    AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.

    AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to come online. This is an informational message only. No user action is required.

    AlwaysOn Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.

    AlwaysOn Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.

    AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.

    Server is listening on [ 'any' <ipv4> XXXX] (XXXX being one of three port numbers)

    A connection for availability group 'MyTestAG' from availability replica 'ServerName\MyPrimaryReplica' with id [guid] to 'ServerName\MySecondaryReplica' with id [guid] has been successfully established. This is an informational message only. No user action is required.

    AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.

    The WSFC that the two standalone SQL Server replicas sit in (each with its own storage) has many heartbeat problems, for reasons I won't go into in this post. Basically, the heartbeat often fails. The out-of-hours heartbeat failures seem to be linked to known network issues: the network is hammered out of hours, and we think heartbeats occasionally fail each day during this time.

    The failover cluster log has many of the following errors at the same time as the SQL error log has the AG messages above.

    - Cluster has established a UDP connection from local endpoint ThisNodeIPAddress:3343 connected to remote endpoint OtherNodeIPAddress:3343.

    - Cluster has missed two consecutive heartbeats for the local endpoint .....to remote endpoint...

    - Clustered role 'MyAGClusteredRole' is moving to cluster node "PrimaryReplicaServerName".

    - Cluster has lost the UDP connection from local endpoint....to remote endpoint....

    - The Cluster service is attempting to bring the clustered role 'CAUMyServernameMku1' offline.

    - Cluster resource 'CAUMyServerNameMku1Resource' in clustered role 'CAUMyServernameMku1' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'CAUMyServerNameMku1Resource' is waiting on the following resources: .

    - Cluster resource 'CAUMyServerNameMku1Resource' in clustered role 'CAUMyServernameMku1' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.

    - Other similar errors relating to nodes, roles, and various CAU resources.

    Failover Cluster Manager shows errors at the same time that relate to attempts to fail over the role that the AG setup wizard created. However, again for reasons I won't bother explaining here, the 02 node is down, and always is, so no failovers of the resource occur, only restarts on the 01 node.

    The SQL instances on each replica remain up and running. They don't restart.

    The cluster has passed its validation report.

    All of the errors found in the SQL error log tend to end with "No user action is required." Is that to be believed?! They sound too ominous to me. I'm not using a listener, by the way; none is even configured. My concern is that, during these frequent but brief bursts of errors, an application connecting to the instance and database would have its connection interrupted, and that this would be perceived as connection blips leading to lost work.

    The AG was set up using the AG wizard, and the setup was 100% successful. There were no errors or blips, and no tweaks or workarounds were needed.

    I could really do with an explanation of this, as it has me stumped. Can I change the tolerances/timeouts for heartbeat failures? Any other ideas?

    Thanks!

  • Is it a virtual environment?

    Is this a same-site or cross-site cluster?

    How many NICs does each node have?

