AlwaysOn sometimes becomes out of sync
Posted Thursday, January 30, 2014 3:24 AM
Grasshopper

Hi all,

We have an AlwaysOn availability group serving a SCOM 2012 installation. We have just moved the VMs hosting the database instances over to 10 Gb network interfaces along with 10 Gb iSCSI interfaces (overkill, I know), so there should be no bottleneck on access to the disks. Periodically the secondary databases go out of sync and I receive 100+ emails notifying me.

The only cause I can think of is that the secondary databases are on our secondary SAN on SATA disks (thin provisioned), while the primary databases are on SAS disks. Could this be why the databases go out of sync?

The error I get is the following:
DATE/TIME:	29/01/2014 23:23:53

DESCRIPTION: AlwaysOn Availability Groups connection with secondary database established for primary database 'OperationsManager' on the availability replica with Replica ID: {b946263e-2d7e-48aa-834b-870524acbac4}. This is an informational message only. No user action is required.


COMMENT: (None)

JOB RUN: (None)

Cheers
Post #1536230
Posted Thursday, January 30, 2014 4:53 AM


SSCertifiable

What are the network connections like between the nodes?
Is the availability mode synchronous or asynchronous?
Are you using a dedicated network for the mirroring traffic?


-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs"
Post #1536250
Posted Thursday, January 30, 2014 5:02 AM
Grasshopper

They are VMs, so the NICs are virtual (VMXNET3) and connect to a distributed virtual switch. The physical NICs on the hosts are 10 Gb fibre and are LACP trunked. The availability mode is synchronous between the nodes, with a file share witness on a CIFS share on the SAN.

Node1:
LAN: 10 Gb (public VLAN)
iSCSI1: 10 Gb (private VLAN)
iSCSI2: 10 Gb (private VLAN)
Storage connected on SAN1 (SAS disks)

Node2:
LAN: 10 Gb (public VLAN)
iSCSI1: 10 Gb (private VLAN)
iSCSI2: 10 Gb (private VLAN)
Storage connected on SAN2 (SATA disks)
Post #1536256
Posted Thursday, January 30, 2014 6:23 AM


Ten Centuries

If this is in sync mode, then it sounds like the replica is failing to respond to a ping within the configured session timeout, so it drops out of sync rather than hanging the primary database.

Slower disks are unlikely to cause this; they will just slow down transactions on the primary. As an aside, there's little point in having high-performing disks on the primary if you're going to use sync mode and not match that performance on the secondary.

What's the session timeout configured to?
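
For anyone following along, one way to check this is to read the `session_timeout` column (in seconds) from the `sys.availability_replicas` catalog view. The snippet below is only a rough sketch: the query uses documented columns, but the helper name `replicas_at_or_below`, the sample rows, and the choice to flag values at or below 10 seconds are my own assumptions for illustration, not something from this thread.

```python
# Query to run against the instance (for example in SSMS).
# session_timeout is reported in seconds.
SESSION_TIMEOUT_QUERY = """
SELECT replica_server_name, availability_mode_desc, session_timeout
FROM sys.availability_replicas;
"""


def replicas_at_or_below(rows, threshold_seconds=10):
    """Return names of replicas whose session timeout is <= threshold_seconds.

    rows: iterable of (replica_server_name, availability_mode_desc,
    session_timeout) tuples, i.e. the result set of SESSION_TIMEOUT_QUERY
    fetched with whatever client you prefer.
    """
    return [name for name, _mode, timeout in rows if timeout <= threshold_seconds]


# Hypothetical result set, not taken from the thread.
sample_rows = [
    ("NODE1", "SYNCHRONOUS_COMMIT", 10),
    ("NODE2", "SYNCHRONOUS_COMMIT", 30),
]
print(replicas_at_or_below(sample_rows))
```

Replicas returned by the helper are the ones that would be first to drop a connection if ping responses are delayed.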

Post #1536290
Posted Thursday, January 30, 2014 6:40 AM
Grasshopper

It's set to the default, which is 10 seconds. We're currently setting up some SAS volumes on the second SAN to see if this helps.

The only other thing I can think of that may not be helping matters is that the second SAN is performing snap mirrors and NDMP backups from the same filer. These use their own fibre paths, although the disks will be busy at that point.
Post #1536295
Posted Thursday, January 30, 2014 6:56 AM


Ten Centuries

Does the SQL error log on the secondary show any I/O errors during the time of the snapshot? Anything else of interest in the error log at the time of the alerts? Windows event logs?
Post #1536303
Posted Thursday, January 30, 2014 7:47 AM
Grasshopper

I need to confirm what time the VM snapshots are taken, so I'll check these and report back.
Post #1536354
Posted Thursday, January 30, 2014 9:04 AM


SSCarpal Tunnel

1) Are the machines using name resolution of any kind to find each other? If so, I would look there.

2) Do a file I/O stall and wait stats analysis on both machines to see if something jumps out at you.

3) Triple-check your virtual network setup.

4) Maybe change to asynchronous commit and see whether the problem continues to occur? That might help narrow down the potential causes. Note that changing from sync to async exposes you to data loss, but you are already in that position regularly, so it is likely not a concern.
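
As a rough illustration of the file I/O stall analysis in item 2: `sys.dm_io_virtual_file_stats` returns cumulative counters, so you need two snapshots and the difference between them to see latency for a given window. The sketch below assumes you have fetched both snapshots into dictionaries yourself; the function name `average_latency_ms` and the sample numbers are hypothetical. The same before/after approach can be applied to `sys.dm_os_wait_stats` for the wait stats side.

```python
# Snapshot query; run it twice (for example, before and during an
# out-of-sync event) on each node. Counters are cumulative.
FILE_STATS_QUERY = """
SELECT database_id, file_id,
       num_of_reads, io_stall_read_ms,
       num_of_writes, io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
"""


def average_latency_ms(before, after):
    """Average read/write latency per file between two snapshots.

    before/after: dicts keyed by (database_id, file_id) whose values are
    (num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms).
    Returns {(database_id, file_id): (avg_read_ms, avg_write_ms)}; an
    average is None when no I/O of that kind happened in the interval.
    """
    result = {}
    for key, (r1, rs1, w1, ws1) in after.items():
        if key not in before:
            continue  # file appeared between snapshots, so there is no baseline
        r0, rs0, w0, ws0 = before[key]
        reads, writes = r1 - r0, w1 - w0
        avg_read = (rs1 - rs0) / reads if reads > 0 else None
        avg_write = (ws1 - ws0) / writes if writes > 0 else None
        result[key] = (avg_read, avg_write)
    return result


# Hypothetical counters for one log file, not taken from the thread.
before = {(5, 2): (1000, 4000, 2000, 6000)}
after = {(5, 2): (1100, 4500, 2500, 31000)}
print(average_latency_ms(before, after))
```

Comparing the per-file averages from both nodes over the same window shows whether the secondary's storage is noticeably slower during the periods when the alerts fire.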


Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1536402
Posted Thursday, January 30, 2014 9:36 AM
Grasshopper

They just use DNS for name resolution (Infoblox, not MS DNS).

I can check the stats the next time they go out of sync (they do eventually sync back up with each other). Do you have any particularly useful commands to run?

We moved the second SAN volumes to SAS drives today, so they now exactly match the primary SAN volumes. There shouldn't be any remaining issues with speed of access to the disks or between nodes.

Post #1536423
Posted Thursday, January 30, 2014 10:25 AM


SSCertifiable

michael.mcloughlin (1/30/2014)
They are VMs, so the NICs are virtual (VMXNET3) and connect to a distributed virtual switch. The physical NICs on the hosts are 10 Gb fibre and are LACP trunked. The availability mode is synchronous between the nodes, with a file share witness on a CIFS share on the SAN.

Node1:
LAN: 10 Gb (public VLAN)
iSCSI1: 10 Gb (private VLAN)
iSCSI2: 10 Gb (private VLAN)
Storage connected on SAN1 (SAS disks)

Node2:
LAN: 10 Gb (public VLAN)
iSCSI1: 10 Gb (private VLAN)
iSCSI2: 10 Gb (private VLAN)
Storage connected on SAN2 (SATA disks)

This smells like a network issue. The distributed virtual switch adds overhead on the host(s), so be aware of that. You basically have only one NIC on each VM carrying all of the following traffic:

  • Public\client

  • Heartbeat

  • AlwaysOn send traffic between nodes

AlwaysOn, like database mirroring, sends transactions across the network in real time. Ideally this traffic should be on a separate network, especially if you expect a lot of transactional activity. Have you tried raising the default timeout period?
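
If you want to try a longer timeout, it is set per replica with `ALTER AVAILABILITY GROUP ... MODIFY REPLICA ON ... WITH (SESSION_TIMEOUT = n)`, run on the primary. Below is a small sketch that only builds the statement text so you can review it before running it. The function name, the group name `SCOM_AG`, the replica name `NODE2`, and the value 30 are all placeholders I made up, not values from this thread.

```python
def session_timeout_statement(group_name, replica_name, seconds):
    """Build the T-SQL that sets SESSION_TIMEOUT for one replica.

    The group name is bracket-quoted and the replica name is emitted as an
    N'...' literal, with embedded ] and ' characters doubled.
    """
    if not isinstance(seconds, int) or seconds < 1:
        raise ValueError("seconds must be a positive integer")
    group = "[" + group_name.replace("]", "]]") + "]"
    replica = "N'" + replica_name.replace("'", "''") + "'"
    return (
        f"ALTER AVAILABILITY GROUP {group} "
        f"MODIFY REPLICA ON {replica} "
        f"WITH (SESSION_TIMEOUT = {seconds});"
    )


# Placeholder names and value for illustration only.
print(session_timeout_statement("SCOM_AG", "NODE2", 30))
```

A longer timeout only hides short network stalls; it does not remove them, so it is best treated as a diagnostic step alongside checking the virtual network.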


Post #1536457