Problem with Failback of FCI node that's part of an Availability Group

  • Hi,

    I was wondering if anyone could help with a problem I am experiencing with the failback of a SQL Server 2014/Windows Server 2012 R2 FCI that is part of an Availability Group.

    Overview of the set up:

    3-node WSFC (Windows 2012 R2), 2 nodes within same datacenter and the other node in a second datacenter.

    The 2 nodes in the same datacenter are configured as an FCI that acts as one AlwaysOn AG replica, and the node in the second datacenter is a standalone AG replica. Currently configured as FCI (Primary) and Standalone (Secondary) with manual failover mode and asynchronous-commit availability mode.

    Overview of problem:

    Failover at the Availability Group level works fine: I can happily switch between primary and secondary and back again using the appropriate commands within SSMS. The problem comes when we want to test running from the second node of the FCI. Within Failover Cluster Manager I moved the SQL Server role from Node 1 to Node 2 and this works OK; it causes the AG role to fail over as well. When I come to fail back I repeat the same process, this time moving the SQL Server role from Node 2 back to Node 1, and the AG role does not move. I try to move it manually, but it fails, saying the selected node is not a possible owner of the AG cluster resource.

    If I go into the properties of the AG cluster resource, it shows only the current node (Node 2) as a possible owner. If I select Node 1 as well, I can then move the resource back. The problem is that this property is reset each time it fails over, so only the current node is selected.

    Is anyone able to provide insight into why this might be happening? I know you're not supposed to control/configure AG roles through Failover Cluster Manager, as it's not aware of the synchronization state of the AG replicas, but I can't see how else you would control a failover between FCI nodes.

    Many thanks....

  • MrG78 (5/18/2015)


    with a manual failover mode

    That is correct, manual failover is the only possible selection when integrating an FCI as a replica.

    See more in my stairway to alwayson series starting at this link

    http://www.sqlservercentral.com/articles/FCI/107536/

    MrG78 (5/18/2015)


    The problem comes when we want to test running from the second node of the FCI. Within Failover Cluster Manager I moved the SQL Server role from Node 1 to Node 2 and this works OK; it causes the AG role to fail over as well.

    It won't cause a failover, as that is a manual process, but it will take the AG cluster resource offline. There is a difference.

    MrG78 (5/18/2015)


    When I come to fail back I repeat the same process, this time moving the SQL Server role from Node 2 back to Node 1, and the AG role does not move. I try to move it manually, but it fails, saying the selected node is not a possible owner of the AG cluster resource.

    You'll need to provide more detail of the steps you're taking here as it seems that some parts are missing.

    MrG78 (5/18/2015)


    The problem being that this property is reset each time it fails over so only the current node is selected.

    Not a problem, it's by design. This is all detailed in my stairway series.

    MrG78 (5/18/2015)


    Is anyone able to provide insight into why this might be happening? I know you're not supposed to control/configure AG roles through Failover Cluster Manager, as it's not aware of the synchronization state of the AG replicas, but I can't see how else you would control a failover between FCI nodes.

    Many thanks....

    Can you provide any more detail and screenshots of what you're seeing? Error log messages may be helpful too.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Hi Perry, thanks for your prompt reply. I have read your 'Stairways to AlwaysOn' series and found it extremely helpful in the set up of our AlwaysOn solution.

    I have attached a number of files that show the cluster events from both Node 1 and Node 2 of the FCI within our AG configuration.

    Two of the files show the events when we move the SQL Server role from Node 1 to Node 2 within Failover Cluster Manager. They show that immediately after this role moves, the associated AG role is taken offline and started on Node 2 as well (with no interaction from me).

    Whereas when I move the SQL Server role from Node 2 to Node 1, this role moves over but the associated AG role doesn't move and remains online on Node 2. The error I receive when I try to move the AG role manually to Node 1 is the following:

    The operation has failed

    The action 'Move' did not complete

    The operation failed because either the specified cluster node is not the owner of the group, or the node is not a possible owner of the group

    When I add Node 1 to the list of 'possible owners' of the AG cluster resource, it moves OK. So I don't understand why both roles move over fine when I move from Node 1 to Node 2, but when I move from Node 2 to Node 1 the AG role doesn't move (or go offline) until I manually set Node 1 as a 'possible owner'.

    How would you manually switch between the nodes of an FCI that is an AG replica? For instance, if you need to apply Windows patches and want to patch the inactive nodes of the FCI first.

  • Apologies, here is the attachment.

  • Here's some more info showing config of AG and SQL Server cluster roles and resources.

    It also shows the steps I took to move the role between the nodes of the FCI. Again, just to clarify, this is purely about moving between the nodes of the FCI (which is also an AlwaysOn replica) and not about failing over AlwaysOn between replicas.

    In the examples, Node 1 and Node 2 are the two nodes of the FCI and Node 3 is the standalone secondary replica hosted in a separate DC.

  • So from what I can understand, you have 2 FCIs and both won't come online on the same node, correct?

    Are you using any startup parameters for the instances?

  • Yes, that's right. I only specified one in the detail to keep things simple. No startup trace flags/parameters are set.

    The SQL Server cluster roles SQL Server (CORE) and SQL Server (NONCORE) move between the two underlying nodes OK. It's just the behavior of the two AG roles, AG-CORE1 and AG-NONCORE1, that differs.

    When I move SQL Server (CORE) and/or SQL Server (NONCORE) from Node 1 to Node 2 both these roles and the AG roles move over ok.

    When I move SQL Server (CORE) and/or SQL Server (NONCORE) from Node 2 to Node 1 these roles move over ok but the AG roles don't move over and remain online on Node 2. Trying to move them explicitly to Node 1 results in the following error:

    The operation has failed

    The action 'Move' did not complete

    The operation failed because either the specified cluster node is not the owner of the group, or the node is not a possible owner of the group

    So I'm not sure why the behavior of the AG roles is different going from Node 2 to Node 1
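
A toy model may help make the reported state concrete. The sketch below is plain Python, not the WSFC API: `ClusterResource`, `NotPossibleOwnerError` and `demo` are hypothetical names invented for illustration, and the only rule it encodes is the one described in the posts above (a move is rejected when the target node is not in the resource's possible-owner list). It does not explain why the list was not updated on failback.

```python
from dataclasses import dataclass, field


class NotPossibleOwnerError(Exception):
    """Raised when a move targets a node outside the possible-owner list."""


@dataclass
class ClusterResource:
    """Hypothetical stand-in for a cluster resource such as AG-CORE1."""
    name: str
    owner: str
    possible_owners: set = field(default_factory=set)

    def move_to(self, node: str) -> None:
        # Mirrors the check behind the 'Move' error quoted in the thread.
        if node not in self.possible_owners:
            raise NotPossibleOwnerError(
                f"{node} is not a possible owner of {self.name}"
            )
        self.owner = node


def demo() -> list:
    """Replay the failback attempt and the manual workaround."""
    events = []
    # State after the first move: the AG resource sits on Node2 and
    # Node2 is its only possible owner.
    ag = ClusterResource("AG-CORE1", owner="Node2", possible_owners={"Node2"})

    # Failback attempt: the SQL Server role is already on Node1, but the
    # AG resource cannot follow because Node1 is not a possible owner.
    try:
        ag.move_to("Node1")
    except NotPossibleOwnerError as exc:
        events.append(f"move rejected: {exc}")

    # Workaround from the thread: add Node1 as a possible owner, then move.
    ag.possible_owners.add("Node1")
    ag.move_to("Node1")
    events.append(f"owner is now {ag.owner}")
    return events


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In this model the first move is rejected and the second succeeds once Node1 is added, which matches the manual workaround described above.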

  • So you have 2 different availability groups, one each for the CORE and NONCORE instances, but they both use the same standalone instance?

  • Yes that's correct

  • And you're using SQL Server 2014?

  • That's correct, SQL Server 2014 and Windows Server 2012 R2

  • We encountered the same issue. The primary replica is hosted on an FCI. When we fail over the cluster resource, the AG is automatically moved to the designated node. However, when we fail back, the AG does not move back and is left in the RESOLVING state.

    When we encounter the issue, we have to use SQL Server Management Studio and run the command "ALTER AVAILABILITY GROUP test FAILOVER" to fix the problem. But this is not automatic, and it causes extended downtime.

    MrG78: Do you have a better solution now?

  • Finally, it turned out to be a bug. KB 2687741 fixes the issue.

  • Did we find a resolution to this issue? I have the exact same problem, and that KB article is for Windows 2008 R2, whereas we are on Windows 2012 R2.

  • I would like to bump this back up as well.

    We just experienced the same issue today with our FCI with a single AG. We manually failed over the SQL instance for patching, and the AG resource stayed up and running on the current node and went to RESOLVING. That's the first time I have seen it happen, and I am pretty certain we have fully patched SQL 2014 and Windows 2012 R2.

    I would love to know what caused it because when these kinds of things happen, we end up getting asked by 50 different people what happened and why.
