2008 failover clustering: failover unsuccessful when NIC disabled or network disconnected

  • Running a 2-node (disk-only quorum) multi-instance cluster (active/active)

    Failovers are successful with server reboot testing.

    Manual failover is successful.

    The "Cluster Group" and SQL groups fail when the switch is powered down; this should trigger an automatic failover.

    Disabling all NICs to force a failover is successful.

    Disabling only the public NIC is unsuccessful.

    System Log events: 1126, 1127, 1129, 1130

    Failover Clustering log events: 1566, 1538, 1537, 1281, 1280, 1204, 1203, 1201, 1200, 1153, 1132, 1131, 1128, 1125, 1062
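
    For reference, the System log entries can be pulled with wevtutil, and "cluster log /g" regenerates the full cluster debug log (assuming the stock 2008 tooling; adjust the event IDs to suit):

      rem Pull the System log entries for the event IDs above (run elevated)
      wevtutil qe System /q:"*[System[(EventID=1126 or EventID=1127 or EventID=1129 or EventID=1130)]]" /f:text

      rem Regenerate the full cluster debug log (lands in %windir%\Cluster\Reports)
      cluster log /g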

    Testing by modifying the NIC negotiation (Auto, 10/Full) for the heartbeat has also been unsuccessful.

    The heartbeat is a direct connection between the nodes via a crossover cable (i.e., no switch in between).
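
    In case it's useful for anyone comparing setups, cluster.exe can show how the cluster has classified each network; Role is, I believe, the relevant property (0 = not used by the cluster, 1 = internal/heartbeat only, 3 = client and internal):

      rem List the properties of every cluster network, including Role
      cluster network /prop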

    Anyone seen a similar issue or have any ideas?

  • I believe this is the crux of the issue, although I cannot figure out why. I've tried all duplex modes and transfer rates without success:

    Resource            Group          Node    Status
    ------------------  -------------  ------  ------
    Cluster IP Address  Cluster Group  %1      Failed

  • Windows 2003 or Windows 2008 cluster?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • 2008

  • I know this thread has been dormant for a long time now, but just in case it helps anyone...

    I've been having the exact same problem as outlined here on a new system I'm building, and many hours of searching turned up very little to help. I then happened to try a test failover of my SQL cluster after a day or so of inactivity - and it worked.

    That has since led me to discover that there's a set of failover parameters for the cluster hiding in the Failover Cluster Manager console.

    If you go to <MSCS cluster name>/Services and Applications, right-click on <SQL cluster name> and choose Properties from the context menu, you should see two tabs - General and Failover.

    If you click on the Failover tab, there are two configurable parameters:

    "Maximum failures in the specified period" and "Period (hours)"

    The failures parameter will be set to n-1, where n is the number of nodes in your cluster. The period was set to 6 hours in my case, which I think is the default, but your mileage may vary. More than n-1 failures within the period will put the clustered service in a permanently failed state, to stop a problematic service endlessly bouncing between cluster nodes.

    What this meant in my case, with a 2-node cluster, was that the cluster would only automatically fail over once in any 6-hour period. Any further failures in that time would put the cluster in a state of permanent failure. Manually moving the service between nodes doesn't count as a failure, so it can be done at any time.

    As this cluster is not live yet, I've since changed the options to allow 10 failures every hour for testing, and I can now trigger an automatic test failover from one node to the other simply by disabling the Windows-facing network interface, as long as there are no more than 10 test failures in an hour. So it looks like my cluster was actually behaving itself all along!
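
    For what it's worth, the same two settings can be read and changed from the command line; as far as I can tell, FailoverThreshold and FailoverPeriod are the property names behind those two GUI fields (substitute your own clustered service's group name for mine below):

      rem Show the current failover properties for the SQL group
      cluster group "SQL Server (MSSQLSERVER)" /prop

      rem Allow up to 10 failures in any 1-hour window (my testing values)
      cluster group "SQL Server (MSSQLSERVER)" /prop FailoverThreshold=10
      cluster group "SQL Server (MSSQLSERVER)" /prop FailoverPeriod=1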

    Hope that helps someone.

    Cheers

    Graham
