SQL 2016 alwayOn AG failover failure when primary site went down

Question

Post reply

SQL 2016 alwayOn AG failover failure when primary site went down

ricfors

Say Hey Kid

Points: 665
More actions
October 16, 2018 at 10:46 am

#375271

Hi,
I have an issue that happened during a disaster testing.
The setup is the following.
2 sites geographically separated with AD on both sites replicated between
SQL 2016 AlwaysOn Availability group setup with synchronous commit
Primary sql server on site 1
Secondary sql server on site 2
File share witness on a file server on site 2.
The disater exercise looked as follows:
The power to site 1 was cut off. The AD on site 2 took over the domain responsibilities. All applications worked well with only one site available. Except for the SQL servers in the AlwaysOn setup.
The databases on server 1 went down but no failover was done.
When the disaster exercise was over and power to site 1 was turned on the dbs were available again.
So I wonder
- Is this how it should work? Or is it some config that needs to be fixed?
- Shouldn't a failure like this (one site is lost) trigger an automatic failover of AG to the other db server on the other site?
What am I missing?
If you have any input or you could point me to a resource for me too check, I would really appreciate it.
Thanks,
rico

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply

goher2000 SSCertifiable Points: 5479 More actions · Answer 1

goher2000

SSCertifiable

Points: 5479

October 16, 2018 at 11:00 am

#2009918

check ag group properties

ricfors Say Hey Kid Points: 665 More actions · Answer 2

Thanks for the tip but it is already set to automatic.

So the questions are:

- When the server goes down, the AG fails over to node 2.
- But what should happen if all primary site goes down? Should the AG failover to site 2? Is this how it works? Or is it that in this case no failover will occur?

rico

e4d4 SSCertifiable Points: 5777 More actions · Answer 3

Start from cluster level, the cluster went down after site 1 had gone down? If not what was DBs and replica state, for automatic failover they should be Synchronized, you can check this on report or in DMV
select * from sys.dm_hadr_database_replica_states select * from sys.dm_hadr_availability_replica_states
column is_failover_ready in sys.dm_hadr_database_replica_cluster_states should be 1 for databases that can failover

ricfors Say Hey Kid Points: 665 More actions · Answer 4

Thanks, but this happened a few days ago so I can't see the status from then. The dbs and AGs are up and running right now.
I'm troubleshooting the incident.

I try to understand the underlying functionality of a cluster to see if it's by design or actually a failure.

As I have understood, the network has to be seen by both nodes in the cluster, is that correct?
So in my case when all site 1 was lost then the databases in the AG went down because the node 1 in site 1 could not see the network and therefore it didn't attempt to failover. Is this correct?

/rico

Perry Whittle SSC Guru Points: 234065 More actions · Answer 5

ricfors - Tuesday, October 16, 2018 10:46 AM
Hi,
I have an issue that happened during a disaster testing.
The setup is the following.
2 sites geographically separated with AD on both sites replicated between
SQL 2016 AlwaysOn Availability group setup with synchronous commit
Primary sql server on site 1
Secondary sql server on site 2
File share witness on a file server on site 2.
The disater exercise looked as follows:
The power to site 1 was cut off. The AD on site 2 took over the domain responsibilities. All applications worked well with only one site available. Except for the SQL servers in the AlwaysOn setup.
The databases on server 1 went down but no failover was done.
When the disaster exercise was over and power to site 1 was turned on the dbs were available again.
So I wonder
- Is this how it should work? Or is it some config that needs to be fixed?
- Shouldn't a failure like this (one site is lost) trigger an automatic failover of AG to the other db server on the other site?
What am I missing?
If you have any input or you could point me to a resource for me too check, I would really appreciate it.
Thanks,
rico

The secondary replica should have transitioned to the resolving state, was the group synch state healthy when the failure occurred?

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

ricfors Say Hey Kid Points: 665 More actions · Answer 6

I don't know the state it was before the disaster test. I was involved after, when everything was back up again and working.
Can I check that afterwards?

/rico

e4d4 SSCertifiable Points: 5777 More actions · Answer 7

What errors do you have in error log? There should be many errors and information regard this failover that will tell if this was cluster fault or some synchronization errors in SQL.

Beatrix Kiddo SSC-Dedicated Points: 32407 More actions · Answer 8

Was the primary replica shut down gracefully or was the power cable yanked out, or something? The reason I ask is that it surprised me a few years ago to learn that a graceful shutdown of the primary replica (i.e. when the services are stopped cleanly) does not trigger an automatic failover. I really thought it would.

ricfors Say Hey Kid Points: 665 More actions · Answer 9

Beatrix Kiddo - Friday, October 19, 2018 4:26 AM
Was the primary replica shut down gracefully or was the power cable yanked out, or something? The reason I ask is that it surprised me a few years ago to learn that a graceful shutdown of the primary replica (i.e. when the services are stopped cleanly) does not trigger an automatic failover. I really thought it would.

The power was cutoff to the whole site.

/rico

ricfors Say Hey Kid Points: 665 More actions · Answer 10

e4d4 - Friday, October 19, 2018 3:50 AM
What errors do you have in error log? There should be many errors and information regard this failover that will tell if this was cluster fault or some synchronization errors in SQL.

In the windows eventlog - Applications and Services Logs - Microsoft - Windows - FailoverClustering - Operational

First it says
- Cluster has missed two consecutive heartbeats for the local endpoint 127.0.0.1 connected to remote endpoint 127.0.0.1
- Cluster has lost the UDP connection from local endpoint 127.0.0.1 connected to remote endpoint 127.0.0.1

Then it goes on
- Cluster resource 'File Share Witness' in clustered role 'Cluster Group' has transitioned from state Online to state ProcessingFailure.
and goes through all the stages
from state ProcessingFailure to state WaitingToTerminate
to state Terminating
to state DelayRestartingResource
to state OnlineCallIssued
to state WaitingToTerminate
to state Terminating
to state Failed
The Cluster service is attempting to fail over the clustered role 'Cluster Group' from node 'A' to node 'B'
... and then going through different states until
Clustered role 'Cluster Group' is moving to cluster node 'B'

In the SQL server log it says
- A connection timeout has occurred on a previously established connection to availability replica 'B' with id [EF8F981F-BB63-4744-B058-75768896FF19]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
- Always On Availability Groups connection with secondary database terminated for primary database 'database1' on the availability replica 'B' with Replica ID: {ef8f981f-bb63-4744-b058-75768896ff19}.
... and repeats the same for all the databases in the AAG, which are 6.

Then comes this
- SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication. [CLIENT: xx.xx.xx.xx]
- SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication. [CLIENT: yy.yy.yy.yy]
- Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. [CLIENT: xx.xx.xx.xx]
- Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. [CLIENT: yy.yy.yy.yy]

.. and repeats that until everything is back up again.

I can't find any information that the AAG group has failed over to the other node. Only the cluster group seems to have failed over. But not the AAG.
Is this a config error, a expected behavior? Shouldn't the AAG also fail over in this case? Can you clarify this, please
According to the latest errors the application couldn't connect to the database with windows authentication because it couldn't find it's domain controler. But I can't see if the application tries to connect to the AAG listener or if it's the server A.
So I'm really confused about the errors.

/rico

goher2000 SSCertifiable Points: 5479 More actions · Answer 11

make sure the following
3434 tcp & udp ports are enabled between nodes..
135 tcp rpc port is open between nodes
137 udp port open between nodes
49152 - 65635 upd port range is open between nodes. (make sure this range is defined in component services > my computer properties > default protocol > connection-oriented TCP properties )
ICMP (ping) is open between nodes.

check if you have adequate bandwidth available between nodes

The errors below are due to (1) either SPN is not set properly or (2) SQL is located in prod domain and client is trying to login using other domain e.g. dev\user and dont have access to sql (3) windows auth is disabled in sql

- SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication. [CLIENT: xx.xx.xx.xx]
- SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication. [CLIENT: yy.yy.yy.yy]
- Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. [CLIENT: xx.xx.xx.xx]
- Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. [CLIENT: yy.yy.yy.yy]

ImYourDBA SSCertifiable Points: 5654 More actions · Answer 12

I would check your quorum setup, it sounds like there weren't enough voting members to adequately vote to make the second site the primary. I typically recommend a file share witness that has nothing to do with either site or an Azure Cloud witness if on Windows 2016.

Owner & Principal SQL Server Consultant
Im Your DBA, Inc.
https://www.imyourdba.com/

ricfors Say Hey Kid Points: 665 More actions · Answer 13

goher2000 - Friday, October 26, 2018 11:57 AM
make sure the following
3434 tcp & udp ports are enabled between nodes..
135 tcp rpc port is open between nodes
137 udp port open between nodes
49152 - 65635 upd port range is open between nodes. (make sure this range is defined in component services > my computer properties > default protocol > connection-oriented TCP properties )
ICMP (ping) is open between nodes.
check if you have adequate bandwidth available between nodes
The errors below are due to (1) either SPN is not set properly or (2) SQL is located in prod domain and client is trying to login using other domain e.g. dev\user and dont have access to sql (3) windows auth is disabled in sql
- SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication. [CLIENT: xx.xx.xx.xx]
- SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication. [CLIENT: yy.yy.yy.yy]
- Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. [CLIENT: xx.xx.xx.xx]
- Login failed. The login is from an untrusted domain and cannot be used with Windows authentication. [CLIENT: yy.yy.yy.yy]

Hi, the cluster is up and running fine.
All this happened while doing a disaster drill exercise. So the suggestions are not relevant in this case.
Thanks anyway.

/rico

ricfors Say Hey Kid Points: 665 More actions · Answer 14

ImYourDBA - Friday, October 26, 2018 8:46 PM
I would check your quorum setup, it sounds like there weren't enough voting members to adequately vote to make the second site the primary. I typically recommend a file share witness that has nothing to do with either site or an Azure Cloud witness if on Windows 2016.

Hi,
This is were I get stuck in trying to understand the cluster behavior.
The voting members are node 1, node 2 and quorum file share.
The file share is located on a 3rd server. This server with the file share is on site 2. The same goes for node 2 (replica server).
On site one was (and is) node 1 (primary server) with the AAG group. On this server, the "cluster group" was also active (this group contains the file share resource).

When site 1 was cut off, the cluster group failed over to node 2 (on site 2). But the AAG did not. The connected sessions complained about not being able to access the db due to non-authorized domain.
So to me it sounds that something couldn't access the dns and/or the Domain controller. But the cluster group had no issues in failing over. So is this by design or what caused this behavior?

I will propose to set up a file share for the quorum on a server in a site 3, if possible. But, the people responsible for their application would still want a explanation to why it happened. So there's my dilemma.

/rico