Availability Group Failover Stops Working After First Failure.

  • Hello-

    I've set up a two-node failover cluster (non-shared storage) with a file share witness. I'm testing the different failover scenarios to confirm that everything is working properly, and everything works fine until I test failure of the SQL Server service.

    When I stop the SQL Server service on the primary server, the availability group fails over to the secondary server as expected. I then start the service on the (now) secondary server and it comes back online as the secondary. Next I test that the AG will fail back when I stop the service on the new primary. However, when I stop the service, the secondary server shows "Resolving" and the AG never comes back online. When I bring the service back up on the primary server, the secondary shows as secondary again instead of Resolving.

    To rule out a problem with failing over in one particular direction, I do a manual failover to make the original primary server the primary again, so everything is back as it was originally. I then stop the service on the primary server, but the secondary again sits in Resolving and the AG will not become available until I start the service back up on the primary server.

    It seems that when I first configured the quorum, the first failover scenario worked fine and then failover stopped working. I then added the file share witness, and failover worked the first time again, but not after that. For some reason, after the initial failover the AG won't automatically fail over again.

    Any ideas on why this might be happening?

    Thanks in advance.

    Config:

    Servers: Windows Server 2012 Standard

    SQL: SQL Server 2012 Enterprise SP1
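
    For reference, something like this from PowerShell will confirm the quorum configuration and node state before and after each test (the cluster name below is just a placeholder):

      # Requires the FailoverClusters module (installed with the clustering feature)
      Import-Module FailoverClusters

      # Confirm the quorum model and that the file share witness is configured
      Get-ClusterQuorum -Cluster "SQLCLUSTER01"

      # Confirm both nodes are Up before and after each failover test
      Get-ClusterNode -Cluster "SQLCLUSTER01" | Format-Table Name, State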

  • Are the build numbers the same on both sides of the cluster? A mismatch should cause an immediate failure during failover.

    Also, is the (former) primary in synchronous-commit mode? Automatic failover partners cannot be asynchronous.

  • Thanks for the reply, Matt-

    Both servers are running SQL Server build 11.00.3128, with the same service pack level and software updates. I verified this again through the Validate a Configuration Wizard.

    Both nodes are set to Automatic Failover with Synchronous commit.
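
    In case it helps, a check like this against each node will confirm the build and the replica settings (node names below are placeholders; Invoke-Sqlcmd comes with the SqlServer or older SQLPS module):

      Import-Module SqlServer

      # T-SQL that returns the build plus each replica's commit and failover mode
      $query = "SELECT SERVERPROPERTY('ProductVersion') AS build,
                       replica_server_name,
                       availability_mode_desc,  -- expecting SYNCHRONOUS_COMMIT
                       failover_mode_desc       -- expecting AUTOMATIC
                FROM sys.availability_replicas;"

      # Run it against both nodes (server names are placeholders)
      foreach ($node in 'SQLNODE1', 'SQLNODE2') {
          Invoke-Sqlcmd -ServerInstance $node -Query $query | Format-Table
      }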

    Any other ideas?

  • Well, I figured it out. By default, the clustered role will only fail over once within a six-hour period (the "maximum failures in the specified period" setting on the role). I upped the maximum to three and could fail back over to the original primary from the secondary. Not sure why this doesn't seem to apply when the server hard fails, but that's a question for a later time.
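
    For anyone hitting the same thing, the setting can be viewed and bumped up with something like this from PowerShell (the role name "MyAG" is a placeholder for whatever the availability group's clustered role is called):

      Import-Module FailoverClusters

      # Show the current failover limits for the AG's clustered role
      Get-ClusterGroup -Name "MyAG" | Format-List Name, FailoverThreshold, FailoverPeriod

      # Allow up to three failovers within the failure period (default period is six hours)
      (Get-ClusterGroup -Name "MyAG").FailoverThreshold = 3

      # The period itself (in hours) can be adjusted the same way if needed
      # (Get-ClusterGroup -Name "MyAG").FailoverPeriod = 6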

    Thanks!

  • This is the default setting for all cluster resources, just something to be aware of.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Yep, I have that listed on my implementation sheet for when we build the production system.

    Thanks
