Heartbeat Failure

  • Hi Experts,

    We are observing frequent heartbeat failures in the cluster logs, and the RCA found the cause to be a network issue.

    1. When this happens, the storage goes down. Is this normal?

    2. The servers are VMs, and the IT team says the network issue happens because the computing resources are overloaded. Can a network issue be caused by this?

    Thanks

  • Can you please supply more information on the environment itself and the configuration of the cluster nodes (e.g. number of vCPUs, vNICs, etc.)?

    Have you run a cluster validation report?
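
    If you haven't, a minimal way to kick one off is the Test-Cluster cmdlet from the FailoverClusters PowerShell module (the node names below are placeholders):

        # Run the full validation suite against both nodes
        Test-Cluster -Node SQLNODE1, SQLNODE2

        # Or run just the network tests while chasing the heartbeat issue
        Test-Cluster -Node SQLNODE1, SQLNODE2 -Include "Network"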

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (7/24/2016)


    Can you please supply more information on the environment itself and the configuration of the cluster nodes (e.g. number of vCPUs, vNICs, etc.)?

    Have you run a cluster validation report?

    Thanks Perry.

    It's a 2-node cluster in an Active-Active configuration with 3 instances: 16 cores, 32 GB RAM, and 2 NICs. This is a Dev environment, and there is another cluster with the same configuration in an Active/Passive setup.

  • VastSQL (7/25/2016)



    Thanks Perry.

    It's a 2-node cluster in an Active-Active configuration with 3 instances: 16 cores, 32 GB RAM, and 2 NICs. This is a Dev environment, and there is another cluster with the same configuration in an Active/Passive setup.

    Going to need a little more than that.

    What does the cluster validation report show?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (7/25/2016)



    Going to need a little more than that.

    What does the cluster validation report show?

    The validation report didn't throw any errors.

  • VastSQL,

    Can you clarify which storage goes down: the VMware datastore, or the disks assigned to the SQL cluster if you are using in-guest iSCSI?

    If the ESXi host has all of its CPU or RAM in use, this will affect the vSwitches' ability to move network traffic and could cause what you are seeing.

    With ESXi I like to build a resource pool for the VMs that excludes 1 CPU core and 4 GB of RAM, and assign all VMs to that pool so the host always has CPU/RAM resources of its own. This helps when we are running high-CPU applications on the virtual machines.

    Another common cause I encounter for heartbeat failures that look like networking problems is mini-filter drivers delaying traffic coming in on network interfaces such as iSCSI. You can run Fltmc.exe to check which mini-filter drivers are installed. We had issues with Diskeeper for years and worked with Condusiv for months to get the product fixed so it would stop causing cluster failures due to its mini-filter driver.
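
    For example, from an elevated prompt on each cluster node (fltmc behaves the same from cmd or PowerShell):

        # List the loaded mini-filter drivers and the altitude each one sits at
        fltmc filters

        # Show which volumes each filter instance is attached to
        fltmc instances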

    Hope this helps!!

    Jon

  • SeniorITGuy (7/26/2016)



    Thanks a lot Jon.

    Yes, I mean the disks assigned to the SQL cluster. "If the ESXi host has all of its CPU or RAM in use, this will affect the vSwitches' ability to move network traffic and could cause what you are seeing" is the reason the IT team gave, so it's good to understand the cause.

    How can I prevent this from happening?

    I didn't understand this part; can I check this on my SQL Servers?

    "Another common cause I encounter for heartbeat failures that look like networking problems is mini-filter drivers delaying traffic coming in on network interfaces such as iSCSI. You can run Fltmc.exe to check which mini-filter drivers are installed. We had issues with Diskeeper for years and worked with Condusiv for months to get the product fixed so it would stop causing cluster failures due to its mini-filter driver."

    Thanks again.

  • VastSQL,

    For ESXi, it has to use its hardware resources to run operations just like any OS. Think of the virtual switches as virtual appliances that also need CPU and RAM to process network traffic. As I said above, if the host is experiencing high resource usage, this can cause the virtual switches to run slowly too.

    For VMware, the indicator of CPU overuse is "CPU Ready Time": the amount of time that operations are ready to run on a CPU but have to wait for a free CPU to actually run. You can check this with the esxtop utility, where it is displayed as a percentage (%RDY). 10% is the watermark you should stay under, as a CPU is considered saturated at 10% CPU Ready Time.
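
    For example, from the ESXi shell (the sample interval and count below are just illustrative):

        # Interactive: run esxtop, press 'c' for the CPU view, and watch the %RDY column
        esxtop

        # Batch mode: capture a sample every 5 seconds, 60 times, for offline review
        esxtop -b -d 5 -n 60 > cpu-stats.csv

    If you read the vCenter performance charts instead, CPU Ready is reported as a summation in milliseconds, so on the real-time chart (20-second samples) the rough conversion is %RDY = (ready ms / 20,000 ms) x 100; a 2,000 ms summation is about 10% ready.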

    If you have confirmed that your ESXi host is resource saturated (either CPU or RAM), then you need to add more hosts or reduce the resources assigned to existing VMs. I use a product called VMTurbo, which does a better job than DRS at shifting VMs around the cluster to guarantee resources, and it lets me know when I have too much CPU or RAM allocated to a virtual machine because it is not being used. (Of course, we don't let it move our virtual cluster nodes around, because that causes a disconnect between nodes, which can cause an instance failure depending on the setup and which node is moved.)

    Now, if you don't have the ability to reduce resources, add hosts, or purchase VMTurbo, you can use the trick I explained above to reserve CPU and RAM for the ESXi host itself. We do this for our eDiscovery clients that do data processing and analytics, which run every CPU at 100% while working. Basically, you create a resource pool, assign all but 1 CPU core and 4 GB of RAM to it, and then put all your virtual machines into that pool. If you have hyper-threading turned on, assign all but 2 CPU cores.
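
    A rough PowerCLI sketch of that trick follows; the cluster name, pool name, and limit values are hypothetical, and you should verify the parameter names against your PowerCLI version:

        # Cap the pool at everything except roughly 1 core (assumed ~2.6 GHz here) and 4 GB RAM
        $cluster = Get-Cluster -Name "DevCluster"
        $pool = New-ResourcePool -Location $cluster -Name "AllVMs" -CpuLimitMhz 39000 -MemLimitGB 60

        # Put every VM into the capped pool so the host always keeps some headroom for itself
        Get-VM -Location $cluster | Move-VM -Destination $pool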

    OK, the mini-filter driver is another discussion. If your virtualization team comes back and says there are enough resources and no CPU Ready Time, the cluster failure was caused by a network disconnect, and everyone else in IT says their equipment is running and configured correctly, then it is time to start looking at the cluster nodes and the mini-filter drivers.

    A mini-filter driver is a way for vendors to inspect read and write operations. They are file-system filter drivers, but when your disks are connected via iSCSI the read and write operations have to traverse the network layer, so a file-system filter inspecting that traffic can introduce delays that look like network issues, and the cluster logs report them as a network failure, almost always on the heartbeat network.

    You can fix this by uninstalling software that uses these drivers (DoubleTake, Diskeeper, antivirus software) or by working with the vendor to stop it from causing cluster failures. Also keep in mind that software with a mini-filter driver will still handle traffic even if it is turned off. For example, we have DoubleTake installed but aren't protecting anything; it is still one more driver, and when you have several products inspecting files it takes time for packets to pass through each filter. You can check which mini-filter drivers are installed on any Windows server using the built-in Fltmc.exe utility (see the example earlier in the thread), and you can do this on your SQL cluster nodes too.

    When running clusters, make sure everyone in the chain is following vendor best practices. There are SAN/VMware best practices for configuring HBAs and virtual switches, best practices for physical network switch configurations, and OS-level NIC configuration best practices; being a VMXNET3 adapter doesn't exempt a card from that requirement. Every setting and tweak from the storage layer up to the SQL instance either improves or impedes performance. It's also a good idea to check whether the vendor has updated its best practices since you set up the environment. I can speak from experience: we have had many calls with high-level product support and dev engineers at Dell about iSCSI issues in the SC and PS series storage over the years, and a handful of changes to the published best practices came out of those calls.

    Hope this helps!!

    Jon

  • SeniorITGuy (7/27/2016)



    Thanks a ton, Jon, for the detailed help.

    The only option I have is to take it up with higher management and let them decide whether to add more hosts or reduce the resources of the current VMs.

    Thanks again.

  • Also check for excessive co-stop waits (%CSTP in esxtop) on the ESXi host.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (7/29/2016)


    Also check for excessive co-stop waits (%CSTP in esxtop) on the ESXi host.

    Thanks Perry.

  • Experts,

    We have a 3-node cluster; if I shut down 2 nodes of that cluster, will it help in this situation?

    Thanks

  • Then you'll have a cluster with no failover capability!

    I'd still like to know the full spec of each VM (i.e. number of vCPUs, vNICs, vDisks, etc.).
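
    If you have PowerCLI handy, something like this will pull the basics (the VM names are placeholders):

        # CPU and memory per clustered VM
        Get-VM SQLNODE1, SQLNODE2 | Select-Object Name, NumCpu, MemoryGB

        # vNIC and vDisk details
        Get-VM SQLNODE1, SQLNODE2 | Get-NetworkAdapter | Select-Object Parent, Name, Type
        Get-VM SQLNODE1, SQLNODE2 | Get-HardDisk | Select-Object Parent, Name, CapacityGB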

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • If your VMs are saturated CPU-wise and you have hyper-threading enabled, I would consider disabling it.

    This article explains why; it is tagged 2005 but applies to all versions of SQL Server up to 2014.

    TURN OFF HYPERTHREADING ARTICLE

  • Perry Whittle (8/2/2016)


    Then you'll have a cluster with no failover capability!

    I'd still like to know the full spec of each VM (i.e. number of vCPUs, vNICs, vDisks, etc.).

    Thanks Perry.

    As I mentioned earlier, we have 16 vCPUs, 24 GB RAM, 2 NICs, and 6 disks.

    I am planning to reduce the max memory of SQL Server from the allocated 20 GB to 10 GB; let's see how it goes.
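
    For reference, a minimal T-SQL sketch of that change (10 GB expressed in MB; run it on each instance you want to cap):

        -- Cap max server memory at 10 GB (the setting takes a value in MB)
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max server memory (MB)', 10240;
        RECONFIGURE;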
