November 10, 2025 at 12:00 am
Comments posted to this topic are about the item Do You Really Need HA?
November 10, 2025 at 7:54 am
I agree that the level of HA should be guided by RTO, RPO and the SLAs agreed with the business. There will often be databases that do not need the HA features of SQL Server or other products.
However, speaking as someone who is now retired but had to deal with real DR situations, I think it is easy to be overly optimistic on how long it will take to cope with a real DR issue.
The worst instance was around 2018 (I forget the exact year). I was working for a charity that had just upgraded to the latest hardware, and all primary storage was on the latest and fastest SSDs. Our storage vendor was asked to add another batch of terabytes, and suddenly DR was needed.
Upgrading the storage pool required the servers to take a Windows checkpoint, add the storage, take a new checkpoint, and if all was well continue with the new storage. Windows checkpoints had been around since the millennium and used countless times, but not with servers and storage of that speed. The process broke, and all the storage at our primary site became unusable.
We used AGs for local resilience and a Distributed AG to do resilience to our remote site. It took us a few hours to get all our systems running at the remote site, but the DB side was not the hold-up. Our analysis was that we had zero data loss. However it took almost two weeks to rebuild the original primary site.
Our storage vendor had their own logs showing what had been done, and they were able to replicate the problem. That evidence eventually convinced MS that checkpoint had a timing issue, and they issued a fix.
Other DRs were an incident with a digger and a major network hub that took out all the comms for a large part of south-east London, and (with a different organisation) a fire in the server hall.
The moral is that a real DR will hit you either when you are ready for it, or when you are not ready for it. We were lucky in all the DRs that only kit was affected, and all staff were able to do their normal jobs.
Part of planning for DR is to make things as simple as possible for when a DR situation hits. Having a 48-hour RTO and expecting to meet it by rebuilding servers in the middle of a real DR strikes me as wildly optimistic.
When we started with AGs it did take a while to master it, but after the first few months we had no reliability issues with the HA environment. Clusters and AGs simply worked, and were not a source of downtime. YMMV but AGs and DAGs kept that charity going when the primary data centre died.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
November 10, 2025 at 9:19 am
Absolutely agree with you, Steve.
AGs have become the go-to HA setup lately, but they aren’t a free lunch. Without proper understanding and monitoring, even AGs can backfire—just like FCIs used to for many. Sometimes, a well-tested backup strategy with automation gives more peace of mind than a complex HA layer. Really appreciate the reminder to match HA design with real RTO/RPO needs.
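On the "well-tested backup strategy" point, a minimal sketch of a verified backup might look like the following. The database name and backup path are placeholders, and a real strategy would add differentials, log backups, and scheduled test restores:

```sql
-- Full backup with page checksums; INIT overwrites any prior backup set
-- in this file (a real job would use dated file names instead).
BACKUP DATABASE [Sales]
TO DISK = N'\\backupshare\Sales_Full.bak'
WITH CHECKSUM, COMPRESSION, INIT;

-- Confirm the backup file is readable and internally consistent,
-- without actually restoring it.
RESTORE VERIFYONLY
FROM DISK = N'\\backupshare\Sales_Full.bak'
WITH CHECKSUM;
```

VERIFYONLY is not a substitute for periodic full test restores, but automating both goes a long way toward the peace of mind described above.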
November 10, 2025 at 10:48 am
I think this article is a prompt to review a lot of things we have learned as "received wisdom".
I inherited a process for rolling back a failed data load for a daily pricing model to a production system. The idea was that we could roll back to the previous day's pricing model. I found that it was quicker to run the previous day's load than the recovery process that would result in the same thing. The load would take 45 minutes. The recovery would take nearly 24 hours.
Related to HA is the discussion on what is meant by real-time and when real-time is actually necessary. I've found that in many cases, what the user means by real-time is that when the user presses a button, they get a response within a second or two. That is a very different thing from consuming real-time data from a stream and all the complexity that requires.
I feel that a lot of complexity results from situations where someone presents the solution rather than the problem they are trying to solve. Implementing that solution is the easy path because it is already defined. Later on, we find that the person or department providing the solution had only their own facts, not the broader picture, so the solution itself turns out to be a problem.
November 10, 2025 at 3:05 pm
We started off with Microsoft's Hyper-V replication and FCI, but ran into issues where one server would want to install Windows updates while the other didn't, crashing the entire system... which happened twice in three years.
We have since migrated to AGs for our customer-facing system and, other than one hiccup early on, have been very pleased with the result. But AGs as a default for all our systems? Unnecessary. We have a very large data warehouse; if it crashes, we will rely on backups to restore it.
Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
November 10, 2025 at 3:39 pm
AOG is great, because having all read-only queries routed to one or more secondary nodes will take the load off the primary where the ETL load takes place. However, there WILL be occasional situations where the primary node will fail over to a secondary, or a readable secondary will be unavailable due to transient synchronization issues, and all read and write requests will default to the primary alone.
But so long as applications can continue hitting the AOG listener and getting their results back with acceptable latency, then the AOG should be considered available while the DBA sorts out the synchronization or recovery of the other nodes.
In a lot of cases, the AOG itself is still available, but the application is trying to hit a specific node that's temporarily offline. So making all this work seamlessly is not just on the DBA; it also requires that the application, reporting, and ETL developers build their connection strings to use the AOG listener instead of connecting directly to a specific server name.
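The routing behaviour described above is configured per replica. A sketch, assuming a hypothetical AG named AG1 with replicas SQLNODE1 (primary) and SQLNODE2 (readable secondary):

```sql
-- Allow read-intent connections on the secondary.
ALTER AVAILABILITY GROUP [AG1] MODIFY REPLICA ON N'SQLNODE2' WITH
    (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

-- Publish the endpoint the listener should route read-intent sessions to.
ALTER AVAILABILITY GROUP [AG1] MODIFY REPLICA ON N'SQLNODE2' WITH
    (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://sqlnode2.contoso.com:1433'));

-- Routing list used while SQLNODE1 is primary; listing SQLNODE1 last means
-- reads fall back to the primary if the secondary is unavailable.
ALTER AVAILABILITY GROUP [AG1] MODIFY REPLICA ON N'SQLNODE1' WITH
    (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQLNODE2', N'SQLNODE1')));
```

Applications then connect through the listener with read intent, e.g. `Server=tcp:AG1-Listener,1433;Database=Sales;ApplicationIntent=ReadOnly;MultiSubnetFailover=True`, rather than naming a node directly.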
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
November 10, 2025 at 4:23 pm
We used a system of DNS aliases in the connection strings. The end result was a layer of indirection that allowed us to change the underlying server topology with zero impact to connection strings.
* Each application had a dedicated DNS alias
* Each Distributed AG had a dedicated DNS alias
* The target of a given application alias was the relevant DAG alias
* The target of the DAG alias was the relevant AG at the current primary site
If we did a failover to the secondary site, the DAG alias was updated to point to the new target AG.
If we wanted to move databases to a different server, either for consolidation, workload rebalancing, or upgrade to new SQL version, we could repoint the application alias as needed, with zero impact on connection strings.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
November 10, 2025 at 5:32 pm
I agree that the level of HA should be guided by RTO, RPO and the SLAs agreed with the business. There will often be databases that do not need the HA features of SQL Server or other products.
...
I think AGs are easy for me, but not necessarily for others. The article did get me to think that maybe they're not worth it for me, if I have a large enough RTO.
November 10, 2025 at 5:45 pm
I think this article is a prompt to review a lot of things we have learned as "received wisdom".
...
Related to HA is the discussion on what is meant by real-time and when real-time is actually necessary. I've found that in many cases, what the user means by real-time is that when the user presses a button, they get a response within a second or two. That is a very different thing from consuming real-time data from a stream and all the complexity that requires.
I feel that a lot of complexity results from situations where someone presents the solution rather than the problem they are trying to solve. Implementing that solution is the easy path because it is already defined. Later on, we find that the person or department providing the solution had only their own facts, not the broader picture, so the solution itself turns out to be a problem.
Agreed, people want responsiveness, not necessarily up-to-the-second data. By the time person A sees the data and asks person B to look at it, the real-time picture has already changed, so they can't have a meaningful discussion. Most things we do with data require us to work from a snapshot, and a few seconds of latency usually doesn't hurt that snapshot.
There are cases where real time is needed, but I don't think it's that many. More often we need to think about the problem before deciding we need HA or Real-time BI data.
November 10, 2025 at 5:46 pm
AOG is great, because having all read-only queries routed to one or more secondary nodes will take the load off the primary where the ETL load takes place. However, there WILL be occasional situations where the primary node will fail over to a secondary, or a readable secondary will be unavailable due to transient synchronization issues, and all read and write requests will default to the primary alone.
But so long as applications can continue hitting the AOG listener and getting their results back with acceptable latency, then the AOG should be considered available while the DBA sorts out the synchronization or recovery of the other nodes.
In a lot of cases, the AOG itself is still available, but the application is trying to hit a specific node that's temporarily offline. So making all this work seamlessly is not just on the DBA; it also requires that the application, reporting, and ETL developers build their connection strings to use the AOG listener instead of connecting directly to a specific server name.
Yep, a complexity that some can handle and some can't.
November 11, 2025 at 4:04 pm
So I've kicked around the idea of standing up HA for our SQL Servers, usually FCI but more recently AGs, but...
The additional cost of the secondary server(s), the additional effort to keep it up, and the fact that our systems have an RTO of "up to" 30 DAYS have generally made it a non-starter. The biggest advantage would be at SQL Server patching time: patch the secondary, fail over to it during a slow point / overnight / on a weekend, then patch the former primary. Even that wasn't enough of a driver to make it worthwhile, considering I could patch SQL on a weekend (well, I used to be able to), which minimized the impact on the users.
So, for now, HA is something I will occasionally play with setting up in my home lab so I can see what's involved, where the possible headaches during setup might be, that sort of thing.
November 11, 2025 at 4:14 pm
30 Days for RTO. That's the kind of job I want. Not that I want to take 30 days, but the pressure of minutes is really hard.
November 12, 2025 at 2:29 am
30 Days for RTO. That's the kind of job I want. Not that I want to take 30 days, but the pressure of minutes is really hard.
I mean, the customers WANT faster, but we're not one of the divisions that has an RTO of minutes or less, so...
That's not to say when there's a problem we don't try to get things back up and running ASAP, but if the Azure datacenter our systems are hosted in took a meteor hit, or a tornado ripped through it, we've got 30 days to get their applications back up and running. Basically, everything we support isn't any sort of "mission critical" application, even if the end users think it is.
Knock on wood, in the last year I can count on both hands, with fingers left over, the number of times I've had to restore a database for anything other than testing backups, refreshing the test environment, or because "someone deleted something they didn't intend to".
November 12, 2025 at 4:13 pm
With increasing costs, I assume many companies that used to think SQL HA technologies are a must-have now think the opposite.
I don't remember where I read it, but an SQL MVP recently argued that VM snapshots and SQL backups are most likely all that is needed by the vast majority of SQL Server instances out there. I agree with this person whose name I can't recall.
Separate Note:
I had the "fortune" of inheriting and supporting six SQL Server FCIs built on 3 physical nodes. They used async AGs for cross-site DR. Talk about an over-engineered headache.
The designers and builders of this solution had not considered the impact of a network outage between the two sites (spanning some 500 miles). AGs can't replicate, the log send queue accumulates, t-logs don't truncate, drives run low on space... I don't miss any of it!
To Steve's point earlier, log shipping is great technology, and AGs are great too, but please don't use AGs for DR.
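The log send queue buildup described above can at least be caught early with a monitoring query along these lines (the 100 MB threshold is illustrative; pick one based on your log drive headroom):

```sql
-- Run on the primary: shows send/redo backlog per database per replica.
-- Queue sizes are reported in KB by the DMV.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)        AS database_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size         AS log_send_queue_kb,
       drs.redo_queue_size             AS redo_queue_kb
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON drs.replica_id = ar.replica_id
WHERE drs.log_send_queue_size > 102400;  -- flag replicas more than ~100 MB behind
```

Wiring a check like this into an alert gives you warning that t-logs are not truncating long before the drives actually fill.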
November 12, 2025 at 4:16 pm
AOG is great, because having all read-only queries routed to one or more secondary nodes will take the load off the primary where the ETL load takes place. However, there WILL be occasional situations where the primary node will fail over to a secondary, or a readable secondary will be unavailable due to transient synchronization issues, and all read and write requests will default to the primary alone.
But so long as applications can continue hitting the AOG listener and getting their results back with acceptable latency, then the AOG should be considered available while the DBA sorts out the synchronization or recovery of the other nodes.
In a lot of cases, the AOG itself is still available, but the application is trying to hit a specific node that's temporarily offline. So making all this work seamlessly is not just on the DBA; it also requires that the application, reporting, and ETL developers build their connection strings to use the AOG listener instead of connecting directly to a specific server name.
I've only briefly supported a setup that used Secondary AGs for reads for reporting. I've read and heard about some of the difficulties with doing so. Have you had any issues caused on the primary node by readable secondaries?
I've not used it in a very long time, but log shipping with standby is still a thing. What are your thoughts on using that over readable secondary in AG? Obviously, this doesn't suit your HA need. Just curious how well this works for reporting nowadays.
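For anyone who hasn't touched it in a while, log shipping with standby boils down to restoring each shipped log with the STANDBY option so the database stays readable between restores. Database name and paths here are placeholders:

```sql
-- On the reporting server: apply the next shipped log backup.
-- STANDBY keeps the database read-only accessible by saving uncommitted
-- transactions to an undo file that is reapplied before the next restore.
RESTORE LOG [Sales]
FROM DISK = N'\\logshare\Sales_20251112.trn'
WITH STANDBY = N'D:\SQLData\Sales_undo.bak';
```

The trade-off versus a readable AG secondary is that users must be disconnected for each restore, so reporting availability is gated by how often you apply logs.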