SQL Cluster with 10 nodes losing network connection.

  • Hi Everyone

    We have a 10-node SQL Server 2016 SP2 CU6 cluster (Windows Server 2016) with Always On availability groups (AOAG) in automatic failover mode.

    Every Saturday night at exactly 11:00 PM, the cluster loses connectivity between nodes and we get random failovers.

    I extended the session timeout to 2 minutes and we still get failovers.

    We checked all the jobs and processes and nothing is running at that time.

    I even applied all the changes proposed in this article: https://www.virtual-dba.com/always-on-changing-cluster-configuration/

    Random failovers still occur.

    Is there anything I can monitor or do to prove to the network team that it's not an AG or SQL issue?

    Alex S
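
One way to collect evidence for the network team is to log raw TCP connectivity between nodes from outside SQL Server, so that any gap at 11:00 PM shows up independently of the AG. Below is a minimal sketch, not a supported tool. The peer names, port, and output file are placeholders you would supply. It only tests TCP connects, so it cannot see UDP heartbeat loss.

```python
import csv
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=2.0):
    """Attempt one TCP connect; return (ok, elapsed_ms, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok, err = True, ""
    except OSError as exc:  # covers refused, unreachable, timeout, DNS failure
        ok, err = False, str(exc)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ok, round(elapsed_ms, 1), err


def monitor(peers, port, out_path, interval=1.0, rounds=60, timeout=2.0):
    """Probe every peer once per round and append one CSV row per probe.

    Columns: UTC timestamp, local host, peer, port, ok, elapsed_ms, error.
    """
    with open(out_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for _ in range(rounds):
            for host in peers:
                ok, ms, err = probe(host, port, timeout)
                stamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
                writer.writerow([stamp, socket.gethostname(), host, port, ok, ms, err])
            fh.flush()  # keep rows on disk even if the node drops mid-run
            time.sleep(interval)
```

Running `monitor(...)` on two or three nodes from about 10:45 PM to 11:15 PM on a Saturday, with the port set to whatever your AG endpoint listens on (5022 is a common default, but that is an assumption about your setup), would give timestamped rows to line up against the cluster log. If connects fail at the same moment on several nodes, that points below the SQL layer.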
  • You may want to start by checking the cluster log as well as the Windows event logs at the times the issue hits.

    Sue
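
Cluster logs get large, so it can help to cut them down to the errors and warnings around the time the issue hits. The sketch below assumes the text format that `Get-ClusterLog` normally produces, where each line looks like `<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message`. If your log differs, the regular expression needs adjusting. Note also that cluster log timestamps are usually in UTC unless the log was generated with `-UseLocalTime`, so 11:00 PM local time may need converting first.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "<pid>.<tid>::2019/06/15-23:00:01.123 ERR   message"
LINE_RE = re.compile(
    r"^[0-9a-fA-F]+\.[0-9a-fA-F]+::"
    r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+(.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    stamp = datetime.strptime(m.group(1), "%Y/%m/%d-%H:%M:%S.%f")
    return stamp, m.group(2), m.group(3)


def entries_near(lines, center, minutes=5, levels=("ERR", "WARN")):
    """Keep entries with a matching level within +/- `minutes` of `center`."""
    lo = center - timedelta(minutes=minutes)
    hi = center + timedelta(minutes=minutes)
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # continuation lines, headers, blank lines
        stamp, level, _ = parsed
        if lo <= stamp <= hi and level in levels:
            hits.append(parsed)
    return hits
```

Passing an open log file as `lines` and the incident time as `center` returns a short list that is easier to compare across nodes than the full log.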

  • Hi Sue

    I did check the cluster logs and Event Viewer. There are tons of messages about the cluster losing network connections:

    The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

    Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

     

    I also get iSCSI connectivity messages:

    Connection to the target was lost. The initiator will attempt to retry the connection.

    Alex S
  • You may need to start checking for issues with the NICs. Make sure the firmware is current and doesn't have known issues with iSCSI (Broadcom, for whatever reason, seems to have a lot of issues with iSCSI). Or the network may be getting saturated at that time, for example if all the backups and copies are scheduled for 11:00 PM Saturday. Hopefully you have a network group you can pawn this issue off to 🙂

    Sue

  • Hi Sue

    We checked all the NICs and even updated the firmware. None are Broadcom.

    We even rescheduled the backup and DFS sync jobs.

    Alex S
  • If you already checked everything, including the network trace, and found nothing, I don't have anything to add, other than that I would question whether it's really random. Your post indicates it happens every Saturday at 11:00 PM. Either it also happens at other times, or it's not random.

    Sue

  • Are these physical servers or VMs? Do you have any VM snapshots/system-state backups/fileset backups running on those servers at the time? What about antivirus scans?

  • I think Beatrix is on the right track. Something scheduled, likely outside of SQL, is affecting the cards. I would also be sure no "phone home" software is looking for NIC or other updates at that time and maybe trying to do some update.

  • They are physical servers. We eliminated or rescheduled backups, antivirus scans, and Windows updates.

    Thank you

    Alex S
  • The cluster log will probably provide the best clue to pinpoint the cause, especially the entries with errors.

    I would also check the power plan for each node. All should be on High performance, not Balanced or Power saver. Otherwise a node could miss some heartbeats and trigger a failover.
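
The power plan check can be scripted rather than clicked through on ten nodes. This is a rough sketch, assuming `powercfg /getactivescheme` is available on each node and that the High performance plan still has its default GUID. It compares the GUID instead of the plan name because the name is localized and can be renamed. `read_local_scheme` only works on Windows.

```python
import re
import subprocess

# Default GUID of the built-in "High performance" plan (assumed unmodified).
HIGH_PERFORMANCE_GUID = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"
GUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def parse_active_scheme(output):
    """Extract (guid, name) from `powercfg /getactivescheme` output."""
    guid_match = GUID_RE.search(output)
    name_match = re.search(r"\(([^)]*)\)", output)
    guid = guid_match.group(0).lower() if guid_match else None
    name = name_match.group(1) if name_match else None
    return guid, name


def is_high_performance(output):
    """True if the active scheme is the default High performance plan."""
    guid, _ = parse_active_scheme(output)
    return guid == HIGH_PERFORMANCE_GUID


def read_local_scheme():
    """Run powercfg on the local node and return its raw output (Windows only)."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a node, `is_high_performance(read_local_scheme())` returning `False` would flag that node for a closer look.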

  • OK, we got all the same errors, but there was no failover.

    I'm contacting Microsoft.

    Alex S
