Always On, as a DR site.

Question

Leonard Rutkowski

June 1, 2016 at 8:46 am

#314157

Hi,
We have set up Always On, on SQL Server 2012, as a proof of concept. What we want to do is use our failover site as a DR site. From what we have discovered, not only is this not a good DR product, it isn't a good HA product either. Now, don't get me wrong. I like SQL Server, and I'm not a Microsoft basher. But I am disappointed they are marketing it as an HA\DR replacement. Perhaps it will get there as it matures. Let me make a few points, ask a few questions, and get some responses.
As I said, we have set up a POC cluster and, after a lot of work, managed to get Always On working. We haven't set up a listener yet. We don't want to use this for HA, only for DR.
1. Currently, we use cluster for HA, and SAN replication for DR. During a DR test, we 'break' replication, and mount the replicated LUNS on our UAT system. That way, we can test multiple applications, without updating production data. Doing this work delays our DR testing and implementation, but, we are assured the current data is available. Once DR testing is complete, we re-establish replication.
2. We want to use Always On, in the same manner. We want to be able to 'break' the Always On, and use the secondary as our DR during testing, then once complete, go back to using the primary, with the secondary being synchronized again.
3. One of the benefits we liked about Always On, was the ability to group databases, so we could fail over an 'application', without affecting other applications, or fail over priority applications first, for SLA.
Issues:
1. Must be on same cluster. I get this, for fail over purposes, but in some ways, defeats the purpose of DR. Stretching a cluster is a pain point, but we finally managed to do it. Using SAN replication, we did not need to be on the same cluster. We use our UAT servers for DR, so they get dual use. But, if we go with Always On, we feel like we would need a separate DR cluster.
2. How to 'break' synchronization, without a fail over, then connect to the secondary, to use in DR testing? Once broken, how do we re-establish and get back to synchronized databases, without restoring hundreds of databases. Not sure that is even possible, as the databases would no longer be in synch.
3. Logins, Jobs, Linked Servers, etc., must be kept in synch manually. By manually I mean that, yes, we could set up something (a job, PowerShell, etc.) to keep them in synch, but if this is true DR you shouldn't have to. Again, I get why. If you are failing over just a subset of databases, how do you know which Logins, Jobs, and Linked Servers are needed? But, for DR purposes, let's say everything goes. This is one of the reasons why I say it isn't ready for primetime. You would have the same issue if using Always On for HA purposes. It is a different instance, so all of the SQL objects would have to be kept in synch. What would be the point, if you could just use a cluster failover and not worry about synching SQL objects? Logins could be replaced by contained database users, but how many have converted to contained databases? We certainly haven't, and aren't looking to in the future. Yes, Always On gives you different LUNs\storage, etc., but with RAID and SAN technology, that issue is mitigated. Yes, the SAN could go down, but that's why we have DR servers.
4. Has anyone else had experience using Always On, in a DR situation? Any suggestions, help, would be appreciated.
Thanks,
Leonard

TheSQLGuru · Answer 1

Leonard, let me start by saying this response is going to be very blunt and direct. It is not a personal condemnation. I have seen multiple companies struggle and even go out of business because they didn't know what they were doing or have a good, PROPERLY TESTED disaster recovery plan.

And that last part is the first thing I will touch on. Your company's very existence depends on your DR plan being flawless and being PROVEN to be flawless. And you are not doing that with any part of your system that you have mentioned thus far. You are playing at it, but just doing some work on snapshots is not the same thing, because at the end of the disaster you (well, every single company I have ever heard of anyway) will be moving back to your primary location/systems. Did you take your entire primary Active Directory/DNS/DHCP/etc. system offline when you did that SAN-replica test? These types of things are why your comments about AGs not being ready are flawed. Microsoft knows that if you fail over to them (for testing or otherwise) they MUST be writable and they MUST be able to fail back to the primary(s).

I also note that your company is doing itself a serious disservice, and again risking its existence, in trying to create a DR system with staff who appear not to have SIGNIFICANT training and real-world experience in constructing, testing and maintaining said DR system and executing the plan when disaster really strikes.

Answering a few of your questions:

A) If you fail over to AGs for a DR test (or for real), then you simply fail back to the primary when it is available again. If it was for testing purposes this is trivial. If you need to rebuild your primary then it will be a normal AG reconfiguration.

B) As you mentioned, SANs and other shared storage can go offline so Failover Clustering is not enough. And I have seen catastrophic SAN loss at clients. Not needing shared storage is one of the "good" features of AGs.

C) Where does it say that "true DR" doesn't need work to keep it prepared to start processing your workload when your primary systems go offline? You create a new SQL Server login - creating the same one with the same SID on the DR machine is trivial and becomes part of your process for creating a new login. Same for SQL Agent jobs, etc.
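To make point C concrete, here is a minimal sketch (in Python, chosen only for illustration) that builds the T-SQL for recreating a SQL-authenticated login on the DR instance with the same SID and password hash. The function name and the sample values are hypothetical; the `sid` and `password_hash` inputs are assumed to have been read from `sys.sql_logins` on the primary. Windows logins take their SID from Active Directory, so they would not need this step.

```python
def create_login_statement(name: str, sid: bytes, password_hash: bytes) -> str:
    """Build a CREATE LOGIN statement that preserves the SID and password hash."""
    # Bracket-quote the login name, doubling any closing bracket.
    login = "[" + name.replace("]", "]]") + "]"
    return (
        f"CREATE LOGIN {login} "
        f"WITH PASSWORD = 0x{password_hash.hex().upper()} HASHED, "
        f"SID = 0x{sid.hex().upper()};"
    )


# Shortened placeholder values (not real ones) standing in for a row
# from sys.sql_logins on the primary.
example_sid = bytes.fromhex("0a1b2c3d")
example_hash = bytes.fromhex("0200abcdef")
print(create_login_statement("app_user", example_sid, example_hash))
```

Running the generated statement on the DR instance as part of the "create a new login" procedure keeps database users mapped to the login after a failover, because the mapping is by SID rather than by name.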

Now for the $64K question: You have a DR scenario now that you think is working. Why are you looking for a different solution?

In closing I recommend you check out Perry Whittle's Stairway to Always On series here on SSC.com. Better yet, get in touch with Allan Hirt (sqlha.com) and have him help you understand the ins and outs of AGs and help you do this right if you decide to continue to pursue it. There is no one more knowledgeable or experienced with them than him IMNSHO.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Leonard Rutkowski · Answer 2

Thanks for the response. I do in fact appreciate it. I wanted to get a conversation going, to see how others are using Always On, and how they use it in a DR scenario.

We do, in fact, have a very comprehensive DR plan. We have a specific group that does nothing but ensure our DR plan is tested and up to date. It gets tested twice a year, with all hands, including Business. I did not go into all of the details for a variety of reasons, but yes, the network gets changed around, VMs get created and destroyed, etc. What we are trying to do is improve our response times. Yes, we have some issues each time we test. That's why we test. If you have a flawless plan, I would like to see it, because even with our testing, and we have been doing it for years, we still run into issues. No plan is flawless; things change: hardware, software, network, etc. We want to do a better job from a SQL standpoint. When we first saw Always On, we were hoping for a better solution than what we are currently using. Something a little more seamless, and a little quicker. It is being marketed as an HA\DR solution, so we said show us. We are doing this as a proof of concept, and it has not entered our DR plan in any way. We are just trying to get questions answered before trying to deploy in a production\DR environment. I was just trying to see if anybody else was using it as a DR solution, and how they were using it.

As for failing over to the secondary, then failing back, in test, that is no good. We do business testing, in DR, that is applications are connected, and changes made to the DR databases. We can't fail those back in a DR test situation. A real situation, absolutely, but not testing. We don't just test to see if a fail over works and then fail it back.

Never said we couldn't have a SAN failure. In fact, our SAN team is looking at ways to mitigate that situation as well. We are currently doing SAN replication, to other sites, so we have that covered.

As for creating the logins, jobs, etc., again, I said we can do that with a manual effort (manual meaning some scripting, jobs, etc.). But it's one more issue that we then have to deal with. Currently, the system databases get replicated through the SAN, so all we have to do is mount the replicated LUNs, change a few things, and we are good to go, as if we were production. But that takes time and effort that we are trying to reduce.

We have talked to Microsoft Engineers, and basically they told us that's how it works. We plan on meeting with them again, to see if we can find answers to our questions.

Yes, we have a DR scenario that works. But why not try to improve on it? If it doesn't work, then if nothing else, we have learned a little bit about Always On, and may be able to use it in other situations. I'm not saying it's a bad product, just saying I don't think it should be marketed as an HA\DR solution.

Leonard

TheSQLGuru · Answer 3

Sounds like you are well ahead of most entities I have encountered over the years Leonard! Yes, "flawless" was a loose word choice there, at least for any entity of any reasonable complexity.

I really do believe that AGs (or simply database mirroring, since you seem to really only need one copy, which removes the need for Enterprise Edition of SQL Server on ALL of your boxes) offer a very good DR scenario (and HA that is good enough for many companies out there) with very quick recovery times. As long as the rest of your critical infrastructure is up quickly too, you could get online rather fast for your most important pieces.

What if your testing had scripts that could clean out any "business testing" that was done as part of a failover test? If you did that you could safely (and easily) fail back to the primary without any actual data changes remaining in the affected databases. Depending on a number of factors this could be relatively painless to construct or a complete nightmare. 🙂 My suspicion is it will be closer to the latter.

Best,
Kevin G. Boles

Leonard Rutkowski · Answer 4

There are lots of solutions, but other than the SAN replication, we haven't really found a better solution.

We use log shipping for a couple of databases for DR purposes, but that isn't practical for all of our databases. I think for those, we only test read only parts of the application, so no roll back.

We also use transactional replication, but only for certain databases, and certain tables. Again, not practical, because we would have to replicate every table and every database, plus, same issue with logins, jobs, linked servers. And, we would have to re-sync at end of test.

We don't use database mirroring, but I think I have read some articles indicating that is going away. Maybe, but they will have to improve Always On. The fact that it has to be on the same cluster, and enterprise edition, could be an issue.

As for cleaning out the data, that's not really an option. Too many things could go wrong, and I don't think our auditors would like that. 🙂

Anyway, we are going to do some additional testing, and see what happens.

Regards,

Leonard

TheSQLGuru · Answer 5

Leonard Rutkowski (6/1/2016)
There are lots of solutions, but other than the SAN replication, we haven't really found a better solution.
We use log shipping for a couple of databases for DR purposes, but that isn't practical for all of our databases. I think for those, we only test read only parts of the application, so no roll back.
We also use transactional replication, but only for certain databases, and certain tables. Again, not practical, because we would have to replicate every table and every database, plus, same issue with logins, jobs, linked servers. And, we would have to re-sync at end of test.
We don't use database mirroring, but I think I have read some articles indicating that is going away. Maybe, but they will have to improve Always On. The fact that it has to be on the same cluster, and enterprise edition, could be an issue.
As for cleaning out the data, that's not really an option. Too many things could go wrong, and I don't think our auditors would like that. 🙂
Anyway, we are going to do some additional testing, and see what happens.
Regards,
Leonard

I LOVE log shipping personally when it fits the need!

I HATE replication with passion!

Database Mirroring is deprecated, but I can GUARANTEE you that I will have clients still using it at least a decade from now. Same for Profiler and tracing. 😎 I still have clients on SQL 2000, and that came out 16 years ago now.

Tell your auditors to quit being cry babies. 😀

Good luck with things. I look forward to a follow-up post with how you decide to play things out. Sounds like a fun environment!!

Best,
Kevin G. Boles

Steve Jones - SSC Editor · Answer 6

A few things. First, I think you have a fairly well set up and thought-through environment. Kudos.

Second, you're confusing DR with testing a bit. If you break your DR system off and run tests on it, you're not really doing DR. What you're doing here is copying production to a test system. Even if you do this with a SAN snapshot, you have to reset this back up with another SAN snapshot in the event of a DR situation.

There's no magic to a SAN snapshot. It doesn't rip 2TB across the wire in a split second. It does what many systems do in that these fake the copy by starting to move data and if you access an unmoved block, the system goes to get it. If you do this fairly often, it's a decent DR plan, but you still could get caught. If the system had copied 1TB of 2TB and the production system disappeared, I hope you have a (very slightly) older snapshot.

When you break an Always On Availability Group (let's just use AG), you have to set it back up again afterwards. I'm not completely sure how that reset would go, or whether you could copy the database files over from a SAN snapshot or a backup, but you'd have to do something like that. Some of the really, really good DR plans don't break their DR to test it. They just move everyone to the DR server for a day or a week and run on it. That shows you can really run on the DR side of things. You learn what your failover time is and then adjust your environment as needed.

Not to say that your system isn't good, but just that mimicking a failover isn't failing over.

In terms of your questions.

1. The stretch doesn't need shared storage. Kevin addressed that, but the overall plan is an FCI + AGs. If you have a cluster, there's a lot of value in that and you might want a SQL cluster to keep your HA side of things going. You also can't use these as UAT.

2. You might be able to do this with a reset and a new SAN snapshot. Not sure here. Allan Hirt is the guy to ask, and if I weren't buried with something, I'd do it.

3. Yes, this is a weakness of AGs. I think that there is work underway to contain these items inside the DB, à la Azure, which will eliminate this. No idea when this work arrives.

4. Plenty have, and it works, but differently for different requirements.

One other note, the FCI + AG stuff is improved quite a bit in 2014 and likely more in 2016. I wouldn't disrupt my HA/DR plan if I were you unless I were moving versions.

Leonard Rutkowski · Answer 7

And the light bulb goes on. Yes, Steve, you are correct: we are testing our production systems. If we were truly doing DR, then Always On is actually a good solution, other than some pesky labor in keeping some things in synch. If it was truly DR, then it should be no different than our HA. When we fail over a cluster node to another node for HA, we aren't 'testing'; it's true production. The same should go for DR.

However, in the real world there are other factors that come into play when failing over for DR. Network changes, for example. Maybe the Always On listener fixes that. Currently for our DR, we use VIPs, CNAMEs, etc., and have to flip DNS entries around for the network. We also have VMs, Oracle, etc. Not everything is set up to do a seamless failover. But even using other methods (mirroring, log shipping, etc.) we would have to do the same thing, so those methods aren't seamless either.

I can see where Always On would be helpful, but not necessarily in our situation, where our testing of the DR system consists of changing parts of the network, fooling around with SAN mounts, etc., business testing, then flipping the network back. As for the SAN replication, there is a snapshot before the test, replication is broken, then after DR, replication is restarted and picks up at the checkpoint. The SAN team is also looking at failover for the SAN. Maybe someday we will get to the point where we can fail over any part of the system, and have it seamless.

Leonard

TheSQLGuru · Answer 8

Maybe someday we will get to the point where we can fail over any part of the system, and have it seamless.

And you thought my "flawless" word was too much!! That pipe dream is a Genius Billionaire Playboy Philanthropist UNICORN!

Best,
Kevin G. Boles

EdVassie · Answer 9

It gets tested twice a year

As an outsider to your organisation, it is really hard to see that a DR that only gets tested twice a year is going to do what you need in a real disaster. It will work fine in a controlled test, but real disasters very seldom have any degree of control.

IMHO you should run your production environment out of multiple sites, and regularly (each week or month) rotate the site that hosts the primary AG role. Your applications should aim to get read-only data from a RO site, and only bother the primary AG site with read/write data, and you should aim to control which sites get used for which purpose by aliases (DNS or otherwise).

A reasonable way to work out how many sites you need is to discuss the risks of site failure with the business, and adopt an N+1 approach to give you the number of sites you should operate from. If the business feel they can cope with the risk of only having two active sites, you need to spread your estate over 3 sites.
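The N+1 rule is simple arithmetic, sketched here in Python. The function and parameter names are hypothetical; the only figure taken from the post is the example of two active sites leading to three.

```python
def sites_to_operate(min_active_sites: int, spare_sites: int = 1) -> int:
    """Return how many sites to run from: the active sites the business
    can live with (N) plus the spare capacity kept on top (the +1)."""
    if min_active_sites < 1 or spare_sites < 0:
        raise ValueError("need at least one active site and no negative spares")
    return min_active_sites + spare_sites


# The business can cope with two active sites, so spread over three.
print(sites_to_operate(2))  # 3
```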

A major aim is to get as much as possible of DR rolled into your normal way of working. This gives you significant resilience to all types of shock, and means the costs of planning and doing DR are much reduced. You will never get all of DR into normal working - there will always be the risk of a rogue update that gets pushed out to all sites and can only be fixed by data repair or DB restore. However, just about everything else can be covered by standard day to day working when you run Production out of multiple sites.

If a disaster does strike, recovering from it to get back to normal operations should not be a big deal. Different DR scenarios will have different effects, but having your Production running in N+1 sites will give significantly more resilience than you currently have. For example, if you need to test how you cope when an entire site is offline, roll this into how you do your monthly patching.

Where I used to work before retirement, they run all of Production in AWS and spread their estate over 3 availability zones. Each AWS AZ is a separate physical site, and they are all treated as peers for day to day workload. Some discretionary work is done at only one AZ, but server images are kept so any server can be booted up in any AZ if the normal location goes down. They even keep an AD/DNS server running in a separate AWS region to facilitate a major build elsewhere if the normal regions go down. The additional costs of spreading the estate in this fashion compared to using only a single site were estimated at 12% - 15% of total server spend, which was considered a small premium to pay to safeguard the business from most DR scenarios.

Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

Leonard Rutkowski · Answer 10

Thanks all, for the responses. I think we are starting to stray from my original questions into general DR. We are trying to understand how we can, or if we can, implement Always On in our plan without disrupting the current plan. While we have some input into DR, it's only from a SQL standpoint. So, we are trying to improve our little corner of it.

Regards,

Leonard

Perry Whittle · Answer 11

Leonard Rutkowski (6/1/2016)
1. Currently, we use cluster for HA, and SAN replication for DR. During a DR test, we 'break' replication, and mount the replicated LUNS on our UAT system. That way, we can test multiple applications, without updating production data. Doing this work delays our DR testing and implementation, but, we are assured the current data is available. Once DR testing is complete, we re-establish replication.

There's a fair amount of hardware, software and people skills required to achieve this, which can be costly.

Leonard Rutkowski (6/1/2016)
2. We want to use Always On, in the same manner. We want to be able to 'break' the Always On, and use the secondary as our DR during testing, then once complete, go back to using the primary, with the secondary being synchronized again.

You can certainly use the AG for DR. You don't need to break the group for failover; it all depends on what initiated the failover in the first place.
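As a rough sketch of what failing over without breaking the group could look like, the snippet below builds the T-SQL for a planned manual failover to the DR replica and the later fail back. It is in Python purely for illustration, and the AG and server names are hypothetical. It assumes the target replica is in synchronous-commit mode and SYNCHRONIZED at the time, which a planned failover without data loss requires; a forced failover (`FORCE_FAILOVER_ALLOW_DATA_LOSS`) is a different operation and is not shown.

```python
from typing import List, Tuple


def quote_name(name: str) -> str:
    """Bracket-quote an identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def failover_round_trip(ag_name: str, primary: str, dr_replica: str) -> List[Tuple[str, str]]:
    """Return (server to connect to, statement) pairs for a planned
    failover to the DR replica followed by a fail back."""
    statement = f"ALTER AVAILABILITY GROUP {quote_name(ag_name)} FAILOVER;"
    return [
        (dr_replica, statement),  # run on the DR replica: it becomes primary
        (primary, statement),     # run later on the original primary: role moves back
    ]


# Hypothetical names, for illustration only.
for server, statement in failover_round_trip("SalesAG", "PRODSQL01", "DRSQL01"):
    print(f"-- connect to {server}\n{statement}")
```

The same statement appears twice because a planned failover is always issued on the replica that is about to take over the primary role.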

Leonard Rutkowski (6/1/2016)
3. One of the benefits we liked about Always On, was the ability to group databases, so we could fail over an 'application', without affecting other applications, or fail over priority applications first, for SLA.

Ideal for doing this

Leonard Rutkowski (6/1/2016)
1. Must be on same cluster. I get this, for fail over purposes, but in some ways, defeats the purpose of DR. Stretching a cluster is a pain point, but we finally managed to do it. Using SAN replication, we did not need to be on the same cluster. We use our UAT servers for DR, so they get dual use. But, if we go with Always On, we feel like we would need a separate DR cluster.

Stretching a cluster is no more taxing than deploying a server to a new site. Once the network links are in place and the firewalls are opened to allow the Directory Services traffic, joining a WSFC should be fairly simple.

Leonard Rutkowski (6/1/2016)
2. How to 'break' synchronization, without a fail over, then connect to the secondary, to use in DR testing? Once broken, how do we re-establish and get back to synchronized databases, without restoring hundreds of databases. Not sure that is even possible, as the databases would no longer be in synch.

As I said, you don't need to break the synchronisation.

Leonard Rutkowski (6/1/2016)
3. Logins, Jobs, Linked Servers, etc., must be kept in synch manually. By manually I mean that, yes, we could set up something (a job, PowerShell, etc.) to keep them in synch, but if this is true DR you shouldn't have to. Again, I get why. If you are failing over just a subset of databases, how do you know which Logins, Jobs, and Linked Servers are needed? But, for DR purposes, let's say everything goes. This is one of the reasons why I say it isn't ready for primetime. You would have the same issue if using Always On for HA purposes. It is a different instance, so all of the SQL objects would have to be kept in synch. What would be the point, if you could just use a cluster failover and not worry about synching SQL objects? Logins could be replaced by contained database users, but how many have converted to contained databases? We certainly haven't, and aren't looking to in the future. Yes, Always On gives you different LUNs\storage, etc., but with RAID and SAN technology, that issue is mitigated. Yes, the SAN could go down, but that's why we have DR servers.

This can be mitigated by applying the change to each replica of the AG.

So, when you create a new login or agent job, the script is run against each replica.

It is also fairly simple to use T-SQL to synch these objects.
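One way to picture "the script is run against each replica" is a small driver like the Python sketch below. Everything in it is an assumption for illustration: the function names, the server names, and the `run` callback, which stands in for whatever actually executes T-SQL in your environment (sqlcmd, PowerShell, a database driver). It is not a description of any built-in AG feature.

```python
from typing import Callable, Dict, Iterable


def apply_to_replicas(
    replicas: Iterable[str],
    script: str,
    run: Callable[[str, str], None],
) -> Dict[str, str]:
    """Run the same script on every replica and report per-replica status."""
    results: Dict[str, str] = {}
    for replica in replicas:
        try:
            run(replica, script)
            results[replica] = "ok"
        except Exception as exc:
            # Keep going so a single unreachable replica is reported, not fatal.
            results[replica] = f"failed: {exc}"
    return results


# Stand-in runner that only records what it would have executed.
executed = []


def fake_run(server: str, script: str) -> None:
    executed.append((server, script))


status = apply_to_replicas(
    ["PRODSQL01", "DRSQL01"],
    "CREATE LOGIN [app_user] WITH PASSWORD = 'placeholder';",
    fake_run,
)
print(status)
```

Collecting a status per replica, rather than stopping at the first error, makes it obvious which replica has drifted and needs the change re-applied.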

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉