Manual failover of SQL Server 2014 HA nodes not functioning

  • I'm running into an interesting little problem on a new cluster I've set up. I've done a number of these and not had this happen before. Hopefully someone (Perry, the HA guru) can offer a few trouble shooting steps I can follow.

    I have 3 node HA set up:

    -MS Server 2012 R2.

    -SQL 2014 EE.

    -All windows updates.

    -Clean install of both OS and SQL, all 3 nodes are identical.

    -SQL Server is running on an alternate port, which I've opened in the firewall. Connections from all network locations are working swimmingly; including connections between all 3 nodes.

    -I've got the groups up and running; Listener is set up correctly. Connections work great.

    -One node is synchronous, one is asynchronous. Both show synchronized, and synchronizing respectively.

    -Data added at the primary node is moved across to all 3 with lightning speed.

    When I attempt a manual failover it hangs..and hangs...then pops up an error 41131 and rolls back the failover. Leaving the cluster perfectly intact and working just as it did prior to the failover attempt.

    What I've checked so far:

    -There is absolutely NOTHING in the cluster events log.

    -Windows event log shows no errors, just the standard stuff of the primary nodes state changing from primary to resolving and then back again.

    -SQL Event log has a few things in there, but nothing that's leading me to a solution, I've attached the log from start to finish on an attempted manual failover:

    Names have been changed to protect the innocent.

    Thanks guys.

  • I'd make sure the account under which the cluster service is running on those machines has a non-disabled login in SQL Server. Just being a member of the public server role should be fine.

    I know this was a common problem when setting up AGs (see https://support.microsoft.com/en-us/kb/2847723 for an example of that), and it will also cause failover to time out as you described. I ran into this with a lab machine where the cluster service was running using the Local System account, and the NT AUTHORITY\SYSTEM login in SQL Server had been disabled at some point after setting up the AG.

    I hope this helps!

  • Hello,

    I really appreciate the reply, thanks. Unfortunately I have checked all that; The cluster service is running under the local service account, but I have triple checked it has all the needed permissions in sql. With Windows Server 2012, it's no longer recommended (or easy) to use a domain account for the cluster service. I've built this exact set up a number of times before with no issue; tis a very strange problem indeed.

    Hope we get some light on the problem.

    Thanks again!

  • Please clarify the following

    Ozzmodiar (6/29/2015)


    I have 3 node HA set up

    Each node has a standalone install of sql server?

    You're implementing an alwayson availability group?

    Ozzmodiar (6/29/2015)


    Clean install of both OS and SQL, all 3 nodes are identical.

    All named instances of SQL server?

    Ozzmodiar (6/29/2015)


    -SQL Server is running on an alternate port, which I've opened in the firewall.

    What other ports have you opened?

    Ozzmodiar (6/29/2015)


    Connections from all network locations are working swimmingly; including connections between all 3 nodes

    How have you verified this?

    Have you run a cluster validation?

    If so please provide it.

    Ozzmodiar (6/29/2015)


    Listener is set up correctly. Connections work great.

    Again how have you verified?

    Please provide details of the read only routing scripts you used

    Ozzmodiar (6/29/2015)


    -One node is synchronous, one is asynchronous. Both show synchronized, and synchronizing respectively.

    Confused, you mention 3 nodes above but only 2 replicas here.

    For synch replication to work between 2 replicas they must both be set to synchronous

    Ozzmodiar (6/29/2015)


    When I attempt a manual failover it hangs..and hangs...then pops up an error 41131 and rolls back the failover. Leaving the cluster perfectly intact and working just as it did prior to the failover attempt.

    Cluster validation report please, also export the cluster log and the sql server error log.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Hi Perry,

    SQL Log is attached to the original post; I will gather up the other reports and logs you requested when I have a few minutes today.

    To answer some of your other questions:

    [Quote]Each node has a standalone install of sql server?

    You're implementing an alwayson availability group?[/Quote]

    Yes, each node is a standalone install of SQL Server, I am implementing an AlwaysOn Availability Group.

    [Quote]All named instances of SQL server?[/Quote]

    Yes, all named instances of SQL Server. Instance names are the same, server names are different. Also, all drive configurations are identical as well.

    [Quote]What other ports have you opened?[/Quote]

    I have opened a port for the HADR listener as well.

    [Quote]Please provide details of the read only routing scripts you used.[/Quote]

    To test:

    -I've connected from a local desktop PC from SSMS. (Listener name)

    -I've used a small test app I have, with a read-intent only connection string. (Listener name)

    -I've verified both connections are being sent to the correct server but checking the active connections report after executing.

    -Data is moving between all 3 nodes, I can see the data I've written to Primary on the Reporting node, I haven't verified the data is on the Failover node, as I can't fail to it.

    [Quote]

    Confused, you mention 3 nodes above but only 2 replicas here.

    For synch replication to work between 2 replicas they must both be set to synchronous[/Quote]

    Sorry Perry, I didn't catch that in my proofreading. There are 3 nodes, a Primary, Fail-over, and Reporting:

    Primary: Synchronous

    Fail-over: Synchronous

    Reporting: Asynchronous

    I'll get that cluster verification report as soon as I have a few spare minutes today. Thanks again Perry.

  • Hi Perry,

    I've had few production issues come up in the past few days; I'll get that report off to you as soon as I can.

    Thanks again!

  • Wow! busy couple weeks, are you still availabe to assist me with this Perry?

    I'm still having the same issues.

    Let me know and I'll provide you with what you need.

    Thanks!

  • Well,

    if anyone ever comes across this post I thought it best I put the solution in here.

    The issue was a ridiculous one. In the planning and setup of this new state-of-the art cluster, maximum security was goal number 1. Our DBA team came up with the brilliant idea of setting the default database of general accounts to tempdb, rather than the standard master. This seemed a safe and secure way to ensure the master db remained pristine.

    That's it guys, that's was the issue. The service account running SQL Server had a default db set to tempdb, and the OOB code to failover the node does not have a USE statement in it.

    Upon changing the default db for the service account to master everything worked swimmingly.

    As always, thanks for all the help everyone.

  • Thanks for the update! That is rather strange indeed.

    I'm glad you figured it out!

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply