2008 failover clustering: failover unsuccessful when NIC disabled or network disconnected

  • Running a 2-node (disk-only quorum) multi-instance cluster (active/active)

    Failovers are successful with server reboot testing.

    Manual failover is successful.

    The "Cluster Group" and SQL groups fail when the switch is powered down; this should trigger an automatic failover.

    Disabling all NICs to force a failover is successful.

    Disabling only the public NIC is unsuccessful.

    System Log events: 1126, 1127, 1129, 1130

    Failover Clustering log events: 1566, 1538, 1537, 1281, 1280, 1204, 1203, 1201, 1200, 1153, 1132, 1131, 1128, 1125, 1062
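
    For reference, the System log entries can be pulled with wevtutil, and "cluster log /g" regenerates the full cluster debug log (assuming the stock 2008 tooling; adjust the event IDs to suit):

      rem Pull the System log entries for the event IDs above (run elevated)
      wevtutil qe System /q:"*[System[(EventID=1126 or EventID=1127 or EventID=1129 or EventID=1130)]]" /f:text

      rem Regenerate the full cluster debug log (lands in %windir%\Cluster\Reports)
      cluster log /g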

    Testing by modifying the NIC negotiation (Auto, 10/Full) for the heartbeat has also been unsuccessful.

    The heartbeat is a direct connection between the nodes via a crossover cable (i.e., no switch in between).
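
    In case it's useful for anyone comparing setups, cluster.exe can show how the cluster has classified each network; Role is, I believe, the relevant property (0 = not used by the cluster, 1 = internal/heartbeat only, 3 = client and internal):

      rem List the properties of every cluster network, including Role
      cluster network /prop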

    Anyone seen a similar issue or have any ideas?

  • I believe this is the crux of the issue, although I cannot figure out why. I've tried all duplex modes and transfer rates without success:

    Resource            Group          Node    Status
    ------------------  -------------  ------  ------
    Cluster IP Address  Cluster Group  %1      Failed

  • Windows 2003 or Windows 2008 cluster?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • 2008

  • I know this thread has been dormant for a long time now, but just in case it helps anyone...

    I've been having the exact same problem as outlined here on a new system I'm building, and many hours of searching turned up very little to help. I then happened to try a test failover of my SQL cluster after a day or so of inactivity - and it worked.

    That has since led me to discover that there's a set of failover parameters for the cluster hiding in the Failover Cluster Manager console.

    If you go to <MSCS cluster name>/Services and Applications, right-click on <SQL cluster name> and choose Properties from the context menu, you should see two tabs - General and Failover.

    If you click on the Failover tab, there are two configurable parameters:

    "Maximum failures in the specified period" and "Period (hours)"

    The failures parameter will be set to n-1, where n is the number of nodes in your cluster. The period was set to 6 hours in my case, which I think is the default, but your mileage may vary. More than n-1 failures within the period will put the clustered service in a permanently failed state, to stop a problematic service endlessly bouncing between cluster nodes.

    What this meant in my case, with a 2-node cluster, was that the cluster would only automatically fail over once in any 6-hour period. Any further failures in that time would put the cluster in a state of permanent failure. Manually moving the service between nodes doesn't count as a failure, so it can be done at any time.

    As this cluster is not live yet, I've since changed the options to allow 10 failures every hour for testing, and I can now trigger an automatic test failover from one node to the other simply by disabling the Windows-facing network interface, as long as there are no more than 10 test failures in an hour. So it looks like my cluster was actually behaving itself all along!
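
    For what it's worth, the same two settings can be read and changed from the command line; as far as I can tell, FailoverThreshold and FailoverPeriod are the property names behind those two GUI fields (substitute your own clustered service's group name for mine below):

      rem Show the current failover properties for the SQL group
      cluster group "SQL Server (MSSQLSERVER)" /prop

      rem Allow up to 10 failures in any 1-hour window (my testing values)
      cluster group "SQL Server (MSSQLSERVER)" /prop FailoverThreshold=10
      cluster group "SQL Server (MSSQLSERVER)" /prop FailoverPeriod=1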

    Hope that helps someone.

    Cheers

    Graham
