availability group failover - urgent

Question

SQLAssAS

SSCertifiable

Points: 7246
November 18, 2016 at 6:03 am

#315924

So last night we had a failover on our availability group, but it didn't fully fail over.
I am not sure why this happened, but we are currently looking into it.
Do these errors mean anything to anyone?
1)
AlwaysOn: The local replica of availability group 'AG1' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.
2)
The state of the local availability replica in availability group 'AG1' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The state changed because the availability group is going offline. The replica is going offline because the associated availability group has been deleted, or the user has taken the associated availability group offline in Windows Server Failover Clustering (WSFC) management console, or the availability group is failing over to another SQL Server instance. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
This happened during a big data load. I am thinking that CPU may have spiked to 100% and the lease timeout or health check timeout wasn't long enough, so the cluster missed a response from SQL Server and tried to fail over.
But because a big data load was in progress, the transaction had to be rolled back, and due to some other timeout setting there wasn't enough time to finish the rollback, so the failover didn't complete properly.
If anyone could shed some light on this that would be great!
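
When going through an exported SQL Server error log for messages like the second one above, a small script can list each role transition in order. This is only a rough sketch: it assumes the log has been saved as plain text and that each message follows the wording quoted in this post. The name `find_state_changes` and the sample text are made up for illustration and are not part of any SQL Server tooling.

```python
import re

# Pattern based on the wording of message 2 quoted in this post (an assumption
# about the log format, not an official specification).
STATE_CHANGE = re.compile(
    r"availability group '(?P<ag>[^']+)' has changed from "
    r"'(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def find_state_changes(log_text):
    """Return (group, old_state, new_state) tuples in the order they appear."""
    return [
        (m.group("ag"), m.group("old"), m.group("new"))
        for m in STATE_CHANGE.finditer(log_text)
    ]


# Sample text copied from the message in this post.
sample = (
    "The state of the local availability replica in availability group 'AG1' "
    "has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The state changed "
    "because the availability group is going offline."
)

for ag, old, new in find_state_changes(sample):
    print(f"{ag}: {old} -> {new}")
```

Running the same function over the error logs from both replicas should show whether the other node ever left the secondary role, which may help tell a stalled failover apart from one that was never attempted.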

Beatrix Kiddo SSC-Dedicated Points: 32407 · Answer 1

I think you're possibly right about the lease timeout, but is there any more information in the cluster log?

SQLAssAS SSCertifiable Points: 7246 · Answer 2

Nothing very useful, no. It's quite frustrating that it tells me it has failed, but I can't find anything anywhere to tell me why.

1st error in cluster events:

Cluster resource 'AG1' of type 'SQL Server Availability Group' in clustered role 'AG1' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

2nd error in cluster events:

Cluster resource 'AG1_IP' of type 'IP Address' in clustered role 'AG1' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

--------------------------------

In error 2, AG1_IP is the DR listener IP address, which should be offline in the cluster resources anyway, as we are not in DR?
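
To see at a glance which resources failed in which clustered role, the cluster events can be grouped with a short script. Again, this is just a sketch: it assumes the events have been exported as plain text with the exact wording of the two errors above, and `failed_resources_by_role` is a hypothetical helper name, not a cluster API.

```python
import re
from collections import defaultdict

# Pattern based on the two cluster events quoted above (assumed wording).
FAILED = re.compile(
    r"Cluster resource '(?P<resource>[^']+)' of type '(?P<type>[^']+)' "
    r"in clustered role '(?P<role>[^']+)' failed"
)


def failed_resources_by_role(events_text):
    """Map each clustered role to the (resource, type) pairs that failed in it."""
    by_role = defaultdict(list)
    for m in FAILED.finditer(events_text):
        by_role[m.group("role")].append((m.group("resource"), m.group("type")))
    return dict(by_role)


# The two events from this post.
events = """\
Cluster resource 'AG1' of type 'SQL Server Availability Group' in clustered role 'AG1' failed.
Cluster resource 'AG1_IP' of type 'IP Address' in clustered role 'AG1' failed.
"""

for role, resources in failed_resources_by_role(events).items():
    for name, kind in resources:
        print(f"role {role}: {name} ({kind}) failed")
```

If a resource that is expected to stay offline, such as the DR listener IP here, shows up in that list, it may be worth checking how it is configured in the role, since the event text alone does not say why it was brought online.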