AlwaysOn sometimes becomes out of sync

  • Hi all,

    We have an AlwaysOn availability group serving a SCOM 2012 installation. We have just moved the VMs hosting the database instances over to 10Gb interfaces, along with 10Gb iSCSI interfaces (overkill, I know), so there should be no bottleneck on access to the disks. Periodically the secondary databases go out of sync and I receive 100+ emails notifying me.

    The only thing I can think of is that the secondary databases are on our secondary SAN on SATA disks (thin provisioned), while the primary databases are on SAS disks. Could this be the reason the databases go out of sync?

    The error I get is the following:

    DATE/TIME:29/01/2014 23:23:53

    DESCRIPTION:AlwaysOn Availability Groups connection with secondary database established for primary database 'OperationsManager' on the availability replica with Replica ID: {b946263e-2d7e-48aa-834b-870524acbac4}. This is an informational message only. No user action is required.

    COMMENT:(None)

    JOB RUN:(None)

    Cheers

  • What are the network connections like between the nodes?

    Synchronous or asynchronous mode?

    Are you using a dedicated network for the mirroring traffic?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • They are VMs, so the NICs are virtual (VMXNET3) and connect to a distributed virtual switch. The physical NICs on the host are 10Gb fibre and are LACP trunked. The mode is synchronous between the nodes, with a file share witness via a CIFS share on the SAN.

    Node1:

    LAN: 10Gb (public VLAN)

    iSCSI1: 10Gb (private VLAN)

    iSCSI2: 10Gb (private VLAN)

    Storage connected to SAN1 (SAS disks)

    Node2:

    LAN: 10Gb (public VLAN)

    iSCSI1: 10Gb (private VLAN)

    iSCSI2: 10Gb (private VLAN)

    Storage connected to SAN2 (SATA disks)

  • If this is in sync mode, then it sounds like the replica's failing to respond to a ping within the configured session timeout, so it's dropping out of sync rather than hanging the primary database.

    Slower disks are unlikely to cause this; they'll just slow down transactions on the primary. As an aside, there's little point in having high-performing disks on the primary if you're going to use sync mode and not replicate that performance on the secondary.

    What's the session timeout configured to?
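    If it helps, you can check both the configured timeout and the current replica state from the primary with something like the following. This is a minimal sketch using the standard AlwaysOn DMVs; nothing here changes any settings.

    -- Configured session timeout (seconds) and commit mode for each replica
    SELECT ag.name AS availability_group,
           ar.replica_server_name,
           ar.availability_mode_desc,
           ar.session_timeout   -- seconds before the partner is considered unresponsive
    FROM sys.availability_replicas AS ar
    JOIN sys.availability_groups AS ag
        ON ag.group_id = ar.group_id;

    -- Current connection and synchronisation health of each replica
    SELECT ar.replica_server_name,
           ars.role_desc,
           ars.connected_state_desc,
           ars.synchronization_health_desc
    FROM sys.dm_hadr_availability_replica_states AS ars
    JOIN sys.availability_replicas AS ar
        ON ar.replica_id = ars.replica_id;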

  • It's set to the default, which is 10 seconds. We're currently setting up some SAS volumes on the second SAN to see if this helps.

    The only other thing I can think of that may not be helping matters is that the second SAN performs SnapMirrors and NDMP backups from the same filer. These use their own fibre paths, although the disks will be in use at that point.

  • Does the SQL error log on the secondary show any I/O errors during the time of the snapshot? Anything else of interest in the error log at the time of the alerts? Windows event logs?
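    If it's useful, xp_readerrorlog (undocumented but widely used) lets you pull just the window around the alert rather than scrolling the whole log. A sketch; the search string and times are placeholders based on the alert time in the first post:

    -- Read the current SQL Server error log (log 0, type 1 = SQL Server),
    -- filtering on a search string between two times.
    EXEC sys.xp_readerrorlog 0, 1, N'error', NULL, '2014-01-29 23:00', '2014-01-30 00:00';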

  • I need to confirm what time the VM snapshots are taken, so I'll check these and report back.

  • 1) Are the machines using name resolution of any kind to know who the other is? I would look there if so.

    2) Do a file I/O stall and wait stats analysis on both machines to see if something jumps out at you (see the sketch after this list).

    3) Triple-check your virtual network setup.

    4) Maybe change to async commits and see if the problem continues to occur? That might help narrow down the potential causes (command below). Note that if you change to async from sync you expose yourself to data loss, but you are already at that point regularly, so it's likely not a concern.
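    For point 2, something along these lines is a reasonable starting point. This is a sketch using the standard DMVs; both report cumulative figures since the last restart, so compare snapshots taken before and after an out-of-sync event.

    -- File I/O stalls per database file
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,  vfs.io_stall_read_ms,
           vfs.num_of_writes, vfs.io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;

    -- Top waits; look for HADR_* and I/O-related wait types
    SELECT TOP (20) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;

    For point 4, the commit mode is a per-replica setting changed on the primary. The availability group and replica names below are placeholders:

    ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON 'Node2' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);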

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • They just use DNS for name resolution (Infoblox, not MS DNS).

    I can check stats the next time they go out of sync (they do eventually sync back up with each other). Do you have any particular useful commands to run?

    We moved the second SAN's volumes over to SAS drives today, so it is now exactly the same as the primary SAN's volumes; there shouldn't be any issues now with the speed of access to the disks or between the nodes.

  • michael.mcloughlin (1/30/2014)


    They are VMs, so the NICs are virtual (VMXNET3) and connect to a distributed virtual switch. The physical NICs on the host are 10Gb fibre and are LACP trunked. The mode is synchronous between the nodes, with a file share witness via a CIFS share on the SAN.

    Node1:

    LAN: 10Gb (public VLAN)

    iSCSI1: 10Gb (private VLAN)

    iSCSI2: 10Gb (private VLAN)

    Storage connected to SAN1 (SAS disks)

    Node2:

    LAN: 10Gb (public VLAN)

    iSCSI1: 10Gb (private VLAN)

    iSCSI2: 10Gb (private VLAN)

    Storage connected to SAN2 (SATA disks)

    This is smelling like a network issue. The distributed virtual switch will have an overhead on the host(s), so be aware of this. So you basically only have one NIC on each VM for the following traffic:

    • Public/client
    • Heartbeat
    • AlwaysOn send network link between the nodes

    AlwaysOn, like database mirroring, sends transactions in real time across the network. This should ideally be a separate network, especially if you expect a lot of transactional activity. Have you tried raising the default timeout period?
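    If you do decide to raise it, it's a per-replica change made on the primary, along these lines (a sketch; the availability group and replica names are placeholders):

    -- Raise the session timeout from the default 10 seconds to 15 seconds for one replica
    ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON 'Node2' WITH (SESSION_TIMEOUT = 15);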

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I have upped the session timeout to 15 seconds. I assume this is just a standard ping?

    Will see how it goes over the weekend.

  • michael.mcloughlin (1/31/2014)


    I have upped the session timeout to 15 seconds. I assume this is just a standard ping?

    Will see how it goes over the weekend.

    Careful with that; it really depends on what your priority is. Would you rather have your secondary get temporarily out of date, or have your primary hang for 15 seconds when this happens?

    Microsoft documentation refers to it as a ping; however, I doubt it's an ICMP packet. It'll be a specific heartbeat communication from the Windows cluster services...

  • HowardW (1/31/2014)


    michael.mcloughlin (1/31/2014)


    I have upped the session timeout to 15 seconds. I assume this is just a standard ping?

    Will see how it goes over the weekend.

    Careful with that; it really depends on what your priority is. Would you rather have your secondary get temporarily out of date, or have your primary hang for 15 seconds when this happens?

    Microsoft documentation refers to it as a ping; however, I doubt it's an ICMP packet. It'll be a specific heartbeat communication from the Windows cluster services...

    Exactamundo, and there should ideally be a separate network for the heartbeat traffic. Currently all the traffic is pushed down the "same pipe".

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Thanks guys. Apologies if this sounds daft, but I'm the perfect example of the accidental DBA. Having come from a job with no exposure to clustering or SQL (my previous job was in schools), I'm still working this out in my head.

    I have three network cards on the VMs, all connected to different distributed switches which are tagged on different VLANs. How do I go about using a different network for the heartbeat? I cannot see anything related to this on the existing virtual or physical cluster; only the public and the two private iSCSI VLANs are listed in Failover Cluster Manager.

    From what I read when looking into this, it is suggested that a dedicated heartbeat network isn't required any more from SQL 2008 onwards? If going without one isn't recommended, would another network card on each server, on the same VLAN and configured with a private IP (192.x.x.x for example), be enough for the heartbeat? Obviously this would be set to cluster use only.

  • michael.mcloughlin (1/31/2014)


    From what I read when looking into this, it is suggested that a dedicated heartbeat network isn't required any more from SQL 2008 onwards?

    Yes, but when running cluster validation the report will still bleat about not having a separate network!

    michael.mcloughlin (1/31/2014)


    If going without one isn't recommended, would another network card on each server, on the same VLAN and configured with a private IP (192.x.x.x for example), be enough for the heartbeat? Obviously this would be set to cluster use only.

    A separate network would be advisable, as you only have one NIC available.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉
