2012 Cluster Failures

  • lmacdonald

    Hall of Fame

    Points: 3763

    Hi,

    We are relatively new to clustering. We are experiencing a lot of failovers, and I was wondering if someone could help me figure out where to start looking for the issue. We see regular errors in the cluster manager. The remote desktop experience is also poor: it is very laggy, taking 20-30 seconds to drag windows around the screen. Disk latency is rather high too. For example, the average read latency for each drive is 70 milliseconds, and I read that above 20 milliseconds you may start seeing problems. I'm not sure whether this would cause a failover or whether what I read was accurate. The average read stall time is several hundred milliseconds. The most common wait types are PAGEIOLATCH_SH, LATCH_EX and SOS_SCHEDULER_YIELD.
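
    For reference, an average read latency figure like the 70 ms above is usually derived by dividing the growth in cumulative read stall time by the growth in read count between two samples. The following is only a minimal sketch of that arithmetic. The field names are modeled on the `num_of_reads` and `io_stall_read_ms` columns of `sys.dm_io_virtual_file_stats`; the `FileIoSample` class, the file name and the sample numbers are made up for illustration, and the 20 ms threshold is the rule of thumb quoted in the post, not an official limit.

```python
from dataclasses import dataclass

# Rule-of-thumb threshold quoted in the post; an assumption, not an official limit.
READ_LATENCY_WARN_MS = 20.0


@dataclass
class FileIoSample:
    """Cumulative I/O counters for one database file (hypothetical shape)."""
    name: str
    num_of_reads: int       # total reads since the instance started
    io_stall_read_ms: int   # total time spent waiting on those reads, in ms


def avg_read_latency_ms(before: FileIoSample, after: FileIoSample) -> float:
    """Average read latency over the interval between two samples."""
    reads = after.num_of_reads - before.num_of_reads
    stall = after.io_stall_read_ms - before.io_stall_read_ms
    # No reads in the interval means there is no latency to report.
    return stall / reads if reads > 0 else 0.0


def is_slow(latency_ms: float, threshold_ms: float = READ_LATENCY_WARN_MS) -> bool:
    """True when the latency exceeds the rule-of-thumb threshold."""
    return latency_ms > threshold_ms


# Made-up counters: 1,000 reads and 70,000 ms of stall during the interval.
before = FileIoSample("data.mdf", 1_000, 50_000)
after = FileIoSample("data.mdf", 2_000, 120_000)

latency = avg_read_latency_ms(before, after)
print(f"{after.name}: {latency:.1f} ms average read latency, slow={is_slow(latency)}")
```

    Sampling twice and taking the difference matters because the counters are cumulative, so a single reading averages over the whole uptime and can hide a recent degradation.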

    I also ran the network cluster validation test and did get some warnings. They are below.

    -Adapters iscsi_vlan192_slot6_top and iscsi_vlan192_slot7_top on node serverfqn have IP addresses on the same subnet.

    -Adapters iscsi_vlan197_slot6_bottom and iscsi_vlan197_slot7_bottom on node serverfqn have IP addresses on the same subnet.

    -The RegisterAllProvidersIP property for network name 'Name: serverfqn' is set to 1. For the current cluster configuration this value should be set to 0.
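
    The first two warnings say that two adapters on the same node have addresses in the same subnet. As a rough illustration of what that check means, the sketch below groups adapters by the network their address belongs to, using Python's standard `ipaddress` module. The adapter names come from the warnings, but the IP addresses and prefix lengths are purely hypothetical, since the real ones are not given here.

```python
import ipaddress
from collections import defaultdict


def adapters_sharing_subnet(adapters):
    """Return {network: [adapter names]} for networks used by more than one adapter."""
    by_network = defaultdict(list)
    for name, cidr in adapters.items():
        # ip_interface keeps the host address and exposes the containing network.
        network = ipaddress.ip_interface(cidr).network
        by_network[network].append(name)
    return {
        str(network): sorted(names)
        for network, names in by_network.items()
        if len(names) > 1
    }


# Hypothetical addressing; only the adapter names are taken from the warnings.
adapters = {
    "iscsi_vlan192_slot6_top": "192.168.192.11/24",
    "iscsi_vlan192_slot7_top": "192.168.192.12/24",
    "iscsi_vlan197_slot6_bottom": "192.168.197.11/24",
    "iscsi_vlan197_slot7_bottom": "192.168.197.12/24",
}

overlaps = adapters_sharing_subnet(adapters)
for network, names in overlaps.items():
    print(f"{network}: {', '.join(names)}")
```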

    I sent this info to the systems team who built the cluster but did not get a response. Below are some more errors, this time from the cluster manager. As a DBA who doesn't have much control over the Windows cluster, I am wondering whether there are other places I can look to help narrow down the bottleneck. Thanks for any help.

    System

    Event ID: 1129

    Level: Error

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network 'Cluster' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    ----------------------------------------------------------

    System

    Event ID: 1126

    Level: Warning

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network interface 'bnhbiscl05-01 - Cluster' for cluster node 'bnhbiscl05-01' on network 'Cluster' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    ---------------------------------------------------------

    System

    Event ID: 1130

    Level: Error

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network 'Cluster' is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    ----------------------------------------------------

    System

    Event ID: 1127

    Level: Error

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network interface 'bnhbiscl05-01 - Cluster' for cluster node 'bnhbiscl05-01' on network 'Cluster' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
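
    When many of these events pile up, it can help to count them by ID and level to see which condition dominates. The sketch below is one possible way to do that for text pasted in the same "Event ID / Level" layout as above. The `tally_events` helper and the sample text are illustrative assumptions, not part of any cluster tooling.

```python
import re
from collections import Counter

# Matches an "Event ID: <n>" line followed (after any whitespace) by "Level: <word>".
EVENT_RE = re.compile(r"Event ID:\s*(\d+)\s+Level:\s*(\w+)")


def tally_events(text):
    """Count occurrences of each (event id, level) pair in pasted event text."""
    return Counter((int(event_id), level) for event_id, level in EVENT_RE.findall(text))


# Shortened sample in the same layout as the events pasted in this post.
sample = """
Event ID: 1129

Level: Error

Task: Network Manager

Event ID: 1126

Level: Warning

Task: Network Manager

Event ID: 1129

Level: Error
"""

counts = tally_events(sample)
for (event_id, level), n in counts.most_common():
    print(event_id, level, n)
```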

  • lmacdonald

    Hall of Fame

    Points: 3763

    Let me add that the MDF and LDF files are on the same drive. I told the people who created the database not to do this, since it's bad practice and the two file types have different I/O access patterns, but they are convinced the drive is so fast that it shouldn't matter.

  • Perry Whittle

    SSC Guru

    Points: 233859

    lmacdonald (1/25/2016)



    Firstly, are these virtual machines?

    Secondly, how many network cards does each node have and how are they configured?

    How many nodes are in the WSFC?

    Are the nodes on separate geographical sites?

    What quorum configuration are you using?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • lmacdonald

    Hall of Fame

    Points: 3763

    I'll answer the questions I know. I'll have to ask systems for the other information.

    There are only two nodes in the cluster. I believe the quorum is a disk share.

    They are not in separate geographical locations.

    I believe they are virtualized, but I have sent this and the networking questions to someone else.
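
    On the quorum question, the usual reasoning for a two-node cluster with a witness is simple majority arithmetic: three votes in total, two needed to stay up. The sketch below shows only that static arithmetic. It deliberately ignores any dynamic vote adjustment that Windows Server may apply, and the function names are hypothetical.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(nodes_up: int, witness_up: bool,
               total_nodes: int = 2, has_witness: bool = True) -> bool:
    """Static majority check: one vote per node plus one for the witness."""
    total_votes = total_nodes + (1 if has_witness else 0)
    votes_present = nodes_up + (1 if has_witness and witness_up else 0)
    return votes_present >= votes_needed(total_votes)


# Two nodes plus a witness: three votes in total, two required.
print(votes_needed(3))
print(has_quorum(nodes_up=1, witness_up=True))   # one node can still see the witness
print(has_quorum(nodes_up=1, witness_up=False))  # one node isolated from everything
```

    Under this simplified model, a node that loses both its partner and the witness cannot hold quorum on its own.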

  • lmacdonald

    Hall of Fame

    Points: 3763

    Here is more information I was able to get.

    The two nodes are large physical servers.

    Currently four 10 Gb ports are dedicated to iSCSI, two on each iSCSI network.

    Two 1 Gb ports are bound to a team (resource network).

    One 10 Gb crossover is used for cluster communication.

  • king

    Newbie

    Points: 1

    I have the same problem with my cluster. Please help me fix it.

    Errors:

    Event ID: 1129

    Level: Error

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network 'Cluster' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    ----------------------------------------------------------

    System

    Event ID: 1126

    Level: Warning

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network interface 'bnhbiscl05-01 - Cluster' for cluster node 'bnhbiscl05-01' on network 'Cluster' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    ---------------------------------------------------------

    System

    Event ID: 1130

    Level: Error

    Task: Network Manager

    Source: Microsoft-Windows-FailoverClustering

    Cluster network 'Cluster' is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

     
