Is 0% Downtime Possible?
Introduction
The short answer is "no".
And there isn't really a long answer. So is this article over? Well, I suppose I should explain why this isn't possible, so let's begin by looking at other systems.
Who has 0% downtime?
No one. At least not over time. My desktop has 0% downtime over the last 5 days, but 5 days isn't very long. The night light in my 3-year-old's room has had 0% downtime for a couple of months (it burns 24x7). However, 60 days isn't very long either.
What about outside my world? Cars? Well, since my Trooper is in the shop, it's got downtime. Plus, I don't drive it 24x7. Boats? They go in the shop regularly. Many can float 24x7 for years, but they aren't being used, and the real test of availability is continuing to work through some disaster. Lots of boats aren't very fault tolerant of storms.
What about critical systems? Like those in a nuclear power plant. Well, I worked in one for almost 3 years and they too have downtime. Partly planned, occasionally unplanned, but it still exists. And nearly every system in the plant has triple redundancy. It's an amazing place, but things still go wrong.
Medical systems. Not those either. There are failures in all types of equipment, which is why a hospital usually has spare systems for cardiac care, etc. They still experience failures, though I'd bet that thanks to the tremendous efforts of our doctors and nurses, most patients survive.
The group most often looked to as providing 0% downtime is the telcos, or phone carriers. People expect that these systems will always be available. Why look at them? Well, they provide a service to a great many people across a large physical distance; not many other systems do this. Do the telcos provide 0% downtime?
Ahhh...............................no.
Not a resounding NO, because they do a pretty good job. Actually an excellent job, considering they use so much analog equipment. In fact, many network equipment companies were shooting for telco levels of high availability. But the fact remains that they cannot eliminate every single point of failure, though they catch most of them, and they can get down to minutes of downtime a year, year after year, for 99+% of their clients.
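To put "minutes of downtime a year" in numbers, here's a quick back-of-the-envelope sketch that converts an availability percentage into downtime per year. The percentages are the commonly quoted "nines" tiers, picked for illustration, not taken from any carrier's actual figures:

```python
# Back-of-the-envelope: convert an availability percentage into downtime
# per year. The tiers below are the standard "nines", not any carrier's
# published numbers.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% uptime -> {downtime:,.1f} minutes of downtime per year")
```

Even at five nines (99.999%), that works out to about five minutes a year. Impressive, but five minutes ain't zero.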
Even the electrical utilities, which feed an area from a grid of wires so that no single company provides all of its electricity, cannot avoid downtime. Who can predict when some drunk will slam into a transformer or a pole carrying wires? Not much different from your junior admin or CTO tripping over a power cord.
Why can't you do it on a database?
There are two types of downtime: planned and unplanned. Of course, sometimes a planned action results in unplanned downtime (like a Service Pack installation), but both types exist in every application.
So if both types exist, then you cannot build a 0% downtime solution, right? Right. You can't. Not "it's hard", not "it takes lots of $$": you can't.
OK, I know lots of you will disagree. However, I've tried and searched for answers and, while I'm sure I haven't seen/read/tried everything, I have seen lots. Let's go through some scenarios where you can try to achieve 0% downtime.
Well, a database must run on a computer, so you first want to ensure that your machine cannot fail. That starts with power. Power supplies fail (both the internal ones in the box and the ones at the other end of the outlet in the wall). You can buy machines with multiple power supplies (my main db server has 3 and only needs 2), even hot-swappable ones. This solves the first issue. You can also use a generator/UPS combination. This will in all likelihood handle all your power issues, with one exception: some knucklehead trips over a power cord.
Don't laugh, I've been that knucklehead before.
OK, let's assume that you seal and bind the power cords so I can't trip over them. Power problem solved.
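As an aside, that "has 3, only needs 2" arrangement is classic N+1 redundancy, and you can sketch the math for it. This is a minimal sketch assuming each supply fails independently and using a made-up 99% per-supply availability, just to show the shape of the improvement:

```python
from math import comb

def k_of_n_availability(a: float, n: int, k: int) -> float:
    """Probability that at least k of n identical, independent components
    (each available a fraction a of the time) are working."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))

a = 0.99  # assumed per-supply availability, purely illustrative

print(f"single supply:         {a:.6f}")
print(f"2-of-3 (N+1) supplies: {k_of_n_availability(a, 3, 2):.6f}")
```

Redundancy pushes the number much closer to 1, but it never reaches 1, and the model conveniently ignores shared failure modes like that tripped power cord.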
What about your disks? After all, the data has to be somewhere. Well, most people use RAID. Works fine. Can be expensive, but it works. However, your disk system may need to be accessed by more than one machine (clustering is coming), so you need multiple connections. The big EMC arrays will handle all this on a network, but then you also have to be aware of the power and network connections between the machines. This is a place of potential failure (after all, as you add more components and complexity, the chance of failure increases), but these can be solved with redundant components.
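That parenthetical is worth making concrete. In a chain of components that all have to work (a series), availabilities multiply, so every component you add drags the total down, and duplicating a component pulls it back up. A minimal sketch, with all the availability figures assumed for illustration:

```python
def series(*availabilities: float) -> float:
    """Availability of a chain where every component must be working."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def mirrored(a: float) -> float:
    """Availability of a redundant pair: down only if both halves fail."""
    return 1 - (1 - a) ** 2

# Assumed, illustrative figures for one storage path.
disk_array, controller, network_link = 0.999, 0.995, 0.99

print(f"single path:         {series(disk_array, controller, network_link):.4f}")
print(f"with mirrored links: {series(disk_array, mirrored(controller), mirrored(network_link)):.4f}")
```

Notice the single path is less available than its worst component; that's the complexity cost the parenthetical is talking about.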
OK, you're mostly covered. Now, what about the Operating System? Well, we here in SQL Server Central land tend to use Windows because, well, mainly because Microsoft hasn't ported SQL Server to any other system (yet). OK, we've all heard about issues with Windows. They exist, but I definitely have Windows server machines that have run for months without a reboot. I can't say years, though, mainly because of planned downtime, with the occasional unplanned outage.
OK, if the OS is unstable, then you can always cluster, right? Build an active/passive cluster (there are quite a few ways to set one up) and let the passive node take over when the active one fails. But a failover isn't instantaneous: connections get dropped, and clients reconnect only once the other node is online, so even a flawless failover still means seconds or minutes of downtime.
Despite the reliability of my machines (meaning Windows NT/2000 and SQL Server), I have had issues. There have been times when, running seemingly stable code (I think all code is "seemingly stable"), the server has flipped out to the point where SQL Server is eating 99% of the CPU and work isn't being done. At least not new work. I assume that SQL Server is still processing some query it was asked to run by someone, but from the point of view of our Operations group, it ain't working.
Which really brings me to the second-to-last place where 0% downtime is compromised: application software. In all my years in this business, after meeting and working with lots of programmers in many different fields, I have NEVER, NEVER, NEVER, NEVER, NEVER, EVER worked with anyone who produced bug-free code.
Even after testing.
It's not most people's fault, but this industry is still unable to build reliable tools that can exhaustively test and verify an application's stability. Individual modules work well, but at some point every complex application becomes too large to completely test, shortcuts are made, and errors occur.
And these errors result in downtime.
So what's the last thing to prevent 0% downtime?
You.
And me, of course. And every other human who works with computers. We make mistakes. No matter how many protocols, procedures, rules, double checks, etc., we will make mistakes. And we will cause downtime. Whether it's bad programming, a mistake in change control, incomplete testing, or, the biggest problem I see, users demanding enhancements.
Even if you built a bullet-proof system, all the hardware protected from failure, people will want changes. Which will require human intervention. Which will require more testing. Which will require you to upgrade the code. And that upgrade, no matter how automated, will result in downtime. Now, the post I read only had to worry about 12 hours of uptime a day. I'd still argue that no application that undergoes development and is upgraded yearly will achieve 100% uptime over more than two years.
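You can put a number on that claim. Here's a quick sketch assuming a single upgrade per year with a fixed maintenance window; the window lengths are made up for illustration, and everything else is assumed to run perfectly:

```python
# Cumulative uptime over two years, given one upgrade per year that takes
# window_minutes each time. Window lengths are illustrative assumptions.
MINUTES_PER_2_YEARS = 2 * 365 * 24 * 60

for window_minutes in (5, 30, 120):
    downtime = 2 * window_minutes  # one upgrade per year, over two years
    uptime = 1 - downtime / MINUTES_PER_2_YEARS
    print(f"{window_minutes:3d}-minute yearly upgrade -> {uptime:.5%} uptime over two years")
```

Even the five-minute window, with flawless hardware and software everywhere else, lands you at roughly 99.999%, not 100%.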
However, two years may be enough.
Conclusions
Now I know this isn't complete. I forgot to mention the lovely Service Packs that MS releases, some of which have not had the smoothest installation, especially in clustered environments. But I hope I'll get a good debate going and some other ideas that will allow me to update this article.
Most of this is based on personal experience, but I also looked around the Internet for this article and found some resources, though it took some time; lots of nonsense turns up in a search for the word "downtime". Anyway, here are a few links:
- META Report: Planned or Unplanned, It's All Downtime - Discussion about how to minimize downtime.
- Case Study from Veritas - Interesting case study from Veritas about Umbro.com. They talk quite a bit about being available every minute of every day, 24x7, 365 days a year, but then mention that the comprehensive suite of products "minimizes both planned and unplanned downtime".
- Beyond.com - From Google, this report on Beyond.com says they have the lowest downtime statistics. And they run at 99.9%!!!!!!
- Utility Reliability Metrics - A look at some things that cause outages at utilities.
- Canadian Lottery Setup - Another interesting note, where the 3rd paragraph says "0% downtime" and the 4th says "nearly instantaneous". It's close to 0%, but nearly instantaneous ain't 0.
- Hidden Cost of Downtime - Interesting. The first paragraph mentions most industrial assets run at 85-95% uptime.
I know there will be someone who disagrees, but there is no way to get 0% downtime. However, I'm sure I forgot some things, so please feel free to comment and let me know.
As always I welcome feedback on this article using the "Your Opinion" button below. Please also rate this article.
Steve Jones
©dkRanch.net February 2002