Advice for cross-site clustering

  • Good day SSC community, I'm here to humbly request your assistance:

    We're looking to make a pretty significant update to our infrastructure here. I will be deploying a set of servers to one site here in NA, and another to our second data centre in the EU.

    I would like both sites to be capable of running independently of the other; this is a requirement of our DR plan.

    My preferred configuration (if possible) would be 3 servers at Site A and 3 servers at Site B. All databases would be contained in HA groups and accessed through DNS listeners, making it possible to move any application to any server at any site.

    Is it possible to create a Windows cluster that will allow me to break apart the 2 server groups and still keep both sites running individually?

    Would a better option be deploying 3 servers per site and then adding a 4th server to each group in the opposite data centre? (I assume this would allow me to fail over critical applications from the large main servers to the other site.)

    The link speed between the 2 sites is decent, albeit not perfect. I assume I will be able to use synchronous commit for the local servers and asynchronous commit for the remote servers.
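
    For reference, the commit mode is declared per replica when the availability group is created. A minimal sketch, assuming SQL Server 2012+ and placeholder server, database, and endpoint names:

    ```sql
    -- Sketch only: two local synchronous replicas plus one remote
    -- asynchronous replica. All names below are placeholders.
    CREATE AVAILABILITY GROUP [AG_Example]
    FOR DATABASE [ExampleDB]
    REPLICA ON
        N'SITEA-SQL1' WITH (
            ENDPOINT_URL      = N'TCP://sitea-sql1.example.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE     = AUTOMATIC),
        N'SITEA-SQL2' WITH (
            ENDPOINT_URL      = N'TCP://sitea-sql2.example.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE     = AUTOMATIC),
        N'SITEB-SQL1' WITH (
            ENDPOINT_URL      = N'TCP://siteb-sql1.example.local:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE     = MANUAL);
    ```

    Asynchronous-commit replicas only support manual failover, which fits the "remote copy" role here.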

    I appreciate any feedback and advice you guys can provide. If you need any more information please just let me know and I will try and answer as soon as I can.

    Thanks in advance everyone!

  • Do you plan to include any Failover Cluster Instances of SQL Server in your AlwaysOn availability group configuration?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I did not intend to.

    My plan was to install stand-alone instances of SQL Server on top of a Windows failover cluster and use AlwaysOn availability groups for HA. I wasn't aware that you could use both technologies at the same time.

    I am very interested in how this would work in practice. Would this mean I could keep a warm standby of each node as well as the HA groups?

    Thanks again!

  • Yes, you can use both, but there are restrictions, so don't do it lightly.

    Keeping the instances stand-alone gives you a flexible configuration with no shared-storage requirement, which is very handy when your cluster is geographically dispersed.

    Just out of interest, is the remote site a true DR site, or is it a working site doubling as DR?


  • Not a true DR; sorry for the confusion. The term just gets thrown around by management so much that I tend to use it myself.

    Both sites are true working sites, but no database will ever need to be active at both sites at once (except maybe read-only access for reporting).

  • OK, but do you really need 3 nodes at the remote site too?

    Surely a failover to the remote site will only ever be for a short period of time until the primary site is back online?

    Take care to think about the Windows server failover cluster design that will sit under your SQL server configuration.

    An odd number of nodes in your cluster will enable you to use Majority Node Set (node majority) for quorum, without a separate witness.
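
    With an odd node count, the node-majority model can be set (or verified) from the FailoverClusters PowerShell module; a sketch, assuming Windows Server 2012 R2:

    ```powershell
    # Assumes the FailoverClusters module is available on a cluster node.
    Import-Module FailoverClusters

    Set-ClusterQuorum -NodeMajority    # odd node count, no witness
    Get-ClusterQuorum                  # confirm the resulting quorum model
    ```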

    Consider separating your mirror traffic from the client network traffic too.


  • I agree that I may be over-engineering this slightly, but I would like to be able to move everything from one site to the other without a huge performance hit; hence the 3 nodes per site.

    It's not completely outrageous for our main site to get completely knocked out; we're in the Cayman Islands and have been hit by hurricanes before.

    I was thinking I would need to set up a file share witness somewhere to act as the odd vote for quorum.
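
    For anyone following along, a file share witness can be configured with one cmdlet; the share path below is a placeholder:

    ```powershell
    # Sketch: add a file share witness as the tie-breaking quorum vote.
    # The UNC path is a placeholder; it should live outside both SQL sites.
    Import-Module FailoverClusters
    Set-ClusterQuorum -NodeAndFileShareMajority "\\witness-host\ClusterWitness"
    ```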

  • Definitely use Windows Server 2012 R2 if possible. Multi-subnet clustering is much improved thanks to dynamic quorum and the dynamic witness.

  • Ozzmodiar (1/28/2015)

    I agree that I may be over-engineering this slightly, but I would like to be able to move everything from one site to the other without a huge performance hit; hence the 3 nodes per site.

    It's not completely outrageous for our main site to get completely knocked out; we're in the Cayman Islands and have been hit by hurricanes before.

    I was thinking I would need to set up a file share witness somewhere to act as the odd vote for quorum.

    Quick question: how do 3 nodes at the DR site give you performance?

    Even on a failover to DR you'll have one replica as primary and the others as secondaries. You indicated you may want a reporting replica too, but didn't state it as a must.

    A word of warning about over-engineering the solution: with 5 nodes, for instance, you won't require a separate witness; you'll rely on Majority Node Set and use less hardware.

    Have you thought about the impact an AO readable secondary will have on things like tempdb size and reporting workloads?


  • Thanks again Perry, I'm just leaving the office, I'll get back to you a little later.

  • My plan was to have the availability groups split across all 3 nodes at each site. Each server will act as the primary for a portion of the availability groups. I currently have a similar setup with my 2012 HA AGs. I am going to spec each server so it is capable of running all of the groups if required.

    The reporting replica is not required locally. What I was hoping for was a reporting replica running asynchronously at Site B, making report generation that much smoother at each location.

    As far as tempdb sizes go on the readable secondaries, I've already created some scripts that clean up the temporary statistics daily. I have been running this setup for nearly a year with no adverse problems standing out so far.
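
    For anyone searching later: temporary statistics created on a readable secondary can be listed via the `is_temporary` flag in `sys.stats` (SQL Server 2012+). A sketch, run in the user database on the secondary:

    ```sql
    -- List temporary statistics (they carry the _readonly_database_statistic
    -- suffix and are stored in tempdb, so a restart also clears them).
    SELECT OBJECT_NAME(s.object_id) AS table_name,
           s.name                   AS stats_name
    FROM   sys.stats AS s
    WHERE  s.is_temporary = 1;
    ```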

    I'm going to draw up something quick to give you a better idea of what I have envisioned; I'll post it up when it's complete.

  • Keep an eye on the size of the version store too.
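
    A quick way to check that, as a sketch: the version store lives in tempdb and its reserved pages are exposed through `sys.dm_db_file_space_usage`:

    ```sql
    -- Approximate version store footprint in tempdb, in MB
    -- (8 KB pages / 1024 = MB).
    USE tempdb;
    SELECT SUM(version_store_reserved_page_count) * 8 / 1024.0
               AS version_store_mb
    FROM   sys.dm_db_file_space_usage;
    ```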


  • This is a basic overview of what I want to achieve with this upgrade.

  • So for a given AO group and database, all 6 instances will be replicas in that group?

    Which server have you reserved as your primary replica?

    Why so many listeners?

    Which site is your primary site?

    I would be more inclined to remove a node vote from the secondary site, leaving 5 votes in the cluster and no witness. If the site link is broken, Site A knows it has 3 votes out of 5 and gains control; Site B knows it has at most 2 votes out of 5 and will relinquish control.
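
    Stripping a vote is a one-liner per node via the `NodeWeight` property (Windows Server 2012+, or 2008 R2 with the hotfix); the node name below is a placeholder:

    ```powershell
    # Sketch: remove the cluster vote from one secondary-site node,
    # leaving 5 voting nodes and no witness.
    Import-Module FailoverClusters
    (Get-ClusterNode -Name "SITEB-SQL3").NodeWeight = 0

    # Verify the vote assignments across the cluster.
    Get-ClusterNode | Format-Table Name, NodeWeight
    ```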


  • Thanks Perry,

    This is where it starts to get a little complex: my goal is to not have a single designated 'primary' replica. The intention is to have the listeners spread across all nodes, making each node the primary replica for a small group of listeners.

    Where I'm struggling is how to gracefully handle a scenario where the link has been severed.

    Assume that all 6 nodes are acting as the primary for at least one listener. If at any point the link is severed, I would force all connections off one of the sites, set 3 of the nodes to act as the primaries for all listeners (spread across all 3 servers), and temporarily disable the file share witness to ensure quorum at the active site.

    Keep in mind that this would happen gracefully, before the link is actually severed; hurricanes move pretty slowly, and we can anticipate them before they get here. I would be able to move the primaries over to one site (say Site B), then cleanly shut down the nodes at Site A.
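
    A sketch of that graceful move, per availability group, assuming the target Site B replica has been switched to synchronous commit and shows SYNCHRONIZED (the group name is a placeholder):

    ```sql
    -- Run on the target Site B replica while it is SYNCHRONIZED:
    -- a planned manual failover with no data loss.
    ALTER AVAILABILITY GROUP [AG_Example] FAILOVER;

    -- Only if the link is already gone and the replica is NOT
    -- synchronized (accepts data loss):
    -- ALTER AVAILABILITY GROUP [AG_Example] FORCE_FAILOVER_ALLOW_DATA_LOSS;
    ```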

    Does this make sense?
