Server down

  • Yesterday one of our primary servers went down and the services could not fail over to the second node. Checking the logs, we found:

    Cluster network 'Public' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    We also found:

    Event ID 2003: A Windows Firewall setting in the Domain profile has changed.

    Is this event responsible for shutting down the server?

    What could cause this situation?

  • ramyours2003 (3/27/2015)


    Event ID 2003: A Windows Firewall setting in the Domain profile has changed.

    Is this event responsible for shutting down the server?

    What could cause this situation?

    Possibly, depending on the firewall setting that was changed.

    *IF* it was a firewall settings change, it could've been done at the domain level with Group Policy. Otherwise, someone would have had to log into the server(s) to make the change.

    One question, as well: do you have a dedicated "heartbeat" network between the cluster nodes, or is all the network communication over the public network? If you don't have one, a heartbeat network would POSSIBLY have kept the cluster alive and happy, unless the firewall change also affected that network.

  • We have a monitoring tool which provides alerts on the heartbeat.

  • OK, but do you have a dedicated heartbeat network for the cluster servers?

    So, for example, each server in the cluster would have the following:

    1x NIC for the public network (this would be using your "public" IP range)

    1x NIC for the private (heartbeat) network (allocated as cluster traffic ONLY in cluster manager) (this would be using a "private" IP range, and ideally a separate *physical* network from the public network, even if only a cross-over cable between 2 nodes)

    1x NIC for the storage network (if needed, such as shared storage)

    Or, is your configuration more like this:

    1x NIC for the public network

    1x NIC for the storage network (if needed, such as shared storage)

    The idea behind having a dedicated heartbeat network is that if one of the servers loses its public connection, it can still hand off any cluster resources over the heartbeat network to the other node(s). If the heartbeat network goes down, then (unless configured otherwise) the nodes can still heartbeat each other over the public network. It's really the same reason you have a cluster in the first place: redundancy.
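
    To make the redundancy argument concrete, here is a toy model in Python. It is only a sketch of the reasoning, not how the cluster service actually decides anything (it ignores quorum, routing, and firewall rules). The node names, the network names, and the `partitioned_pairs` helper are all made up for illustration.

```python
from itertools import combinations


def partitioned_pairs(nodes, networks):
    """Return the node pairs that share no healthy network.

    networks maps a (hypothetical) network name to the set of nodes
    whose link on that network is currently up.
    """
    isolated = []
    for a, b in combinations(sorted(nodes), 2):
        # Two nodes can talk if at least one network has both links up.
        reachable = any(a in up and b in up for up in networks.values())
        if not reachable:
            isolated.append((a, b))
    return isolated


nodes = {"node1", "node2"}

# Public network only, and node1 has lost its public link.
single = {"Public": {"node2"}}

# Same failure, but a second (private) network is still healthy.
redundant = {"Public": {"node2"}, "Private": {"node1", "node2"}}

print(partitioned_pairs(nodes, single))     # [('node1', 'node2')]
print(partitioned_pairs(nodes, redundant))  # []
```

    With a single network, one failed link is enough to partition the pair. With a second network, the same failure leaves the nodes able to reach each other.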

    (Also, please bear in mind that I've not set up a cluster in quite a while; my SQLs (for now) are VMs on a beefy VMware cluster.)

  • The root cause of this failure was that you either a) built a cluster that wasn't properly validated/tested from the start or b) allowed changes to your production environment that broke that valid server pair. Either way, it is a failure of process and/or operations.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • jasona.work (3/27/2015)


    OK, but do you have a dedicated heartbeat network for the cluster servers?

    You no longer need a dedicated heartbeat network. You don't push the heartbeat traffic through a specific network; the networks aren't used in the same way they were in Windows 2003.

    You do, however, need multiple redundant networks.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • ramyours2003 (3/27/2015)


    Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    The first thing to do is run a cluster validation and address any issues it reports.
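
    If you would rather script the validation than click through the wizard, something along these lines might work. It is a rough sketch that assumes the `Test-Cluster` cmdlet from the FailoverClusters PowerShell module is available on the host; the node names `node1` and `node2` are placeholders, and both Python helpers are hypothetical wrappers written for this example.

```python
import subprocess


def build_validation_command(nodes, include=("Network",)):
    """Build a PowerShell command line that runs Test-Cluster.

    nodes and include are joined into quoted, comma-separated lists.
    """
    node_list = ",".join(f"'{n}'" for n in nodes)
    include_list = ",".join(f"'{c}'" for c in include)
    script = (
        "Import-Module FailoverClusters; "
        f"Test-Cluster -Node {node_list} -Include {include_list}"
    )
    return ["powershell.exe", "-NoProfile", "-Command", script]


def run_validation(nodes):
    """Run the validation. Only meaningful on a Windows cluster node."""
    return subprocess.run(
        build_validation_command(nodes), capture_output=True, text=True
    )


# Placeholder node names; print the command instead of running it.
print(build_validation_command(["node1", "node2"]))
```

    Limiting the run to the network tests keeps it short. Drop the `-Include` part to run the full validation, and check the generated report for warnings.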


  • Perry Whittle (3/27/2015)


    jasona.work (3/27/2015)


    OK, but do you have a dedicated heartbeat network for the cluster servers?

    You no longer need a dedicated heartbeat network. You don't push the heartbeat traffic through a specific network; the networks aren't used in the same way they were in Windows 2003.

    You do, however, need multiple redundant networks.

    Ah, OK. The last cluster I worked with was a Server 2003 cluster, and the last one I built (4-ish years back) was for Hyper-V 2008 R2 (which means, I think, the dedicated network was for live migrations of VMs).

    I think it's time for me to set up my test cluster at home and get familiar with it again...
