There are a lot of Mean-Time-To-xxxx acronyms. Many of us have heard of the mean time between failures (MTBF) for disk drives. Some of us use that information when considering which model to buy. In the DevOps world, there are also the mean time to failure (MTTF) and mean time to resolve/repair (MTTR). There is one more that I think is very interesting, and that is the MTTD: the mean time to detect an issue. This is the average amount of time it takes you to detect there is a problem after the problem occurs.
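To make the definition concrete, here is a minimal sketch of how you might compute MTTD from an incident log. The incidents and timestamps below are hypothetical, made up purely for illustration, and the function name `mean_time_to_detect` is my own; your monitoring tool may calculate this differently.

```python
from datetime import datetime

# Hypothetical incident log: (when the problem started, when it was detected).
# These timestamps are invented for illustration only.
incidents = [
    (datetime(2019, 7, 1, 9, 0), datetime(2019, 7, 1, 9, 4)),
    (datetime(2019, 7, 9, 14, 30), datetime(2019, 7, 9, 14, 31)),
    (datetime(2019, 7, 20, 2, 15), datetime(2019, 7, 20, 2, 25)),
]


def mean_time_to_detect(incidents):
    """Average minutes between a problem occurring and its detection."""
    delays = [
        (detected - occurred).total_seconds() / 60
        for occurred, detected in incidents
    ]
    return sum(delays) / len(delays)


mttd_minutes = mean_time_to_detect(incidents)
print(f"MTTD: {mttd_minutes:.1f} minutes")
```

The hard part in practice is knowing when the problem actually started, which often only becomes clear in the post-incident review.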
There was an outage at Monzo recently due to a database upgrade, which was recounted on their blog. In this case, their MTTD, or rather their actual time to detect, was a minute. I think that is amazing. In fact, I'm somewhat skeptical that an alert can be raised, someone can look at it, the customer service desk can call the Ops team (who were upgrading servers), and the Ops person can realize there is an issue, all in the space of a minute or two. It's possible, but I have found that help desk personnel who discover something can take a few minutes to verify the issue and then scramble to find the on-call phone number. Relaying information can take a minute or two, so if this is accurate, huge props to the IT staff at Monzo.
Many of us strive to hit a high availability number for our systems, especially databases. This is one of the drivers for the growing use of availability groups in SQL Server systems: to ensure the database is highly available to clients. In determining availability, we often speak of the percentage of time that a system is available. The holy grail is five 9s, or an uptime of 99.999% of the year. This gives you just over 5 minutes of downtime a year.
In the case of the Monzo outage, which took place in July 2019, the alert was reported at 13:14 and the incident was declared at 13:15, one minute later. The time to diagnose the issue (maybe another MTTxx item) was 63 minutes, just over an hour. At this point, availability for the year is arguably down to 99.988%. The actual fix was completed at 113 minutes, or 99.978%. That's the number if nothing else happens this year.
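The arithmetic behind those percentages can be sketched in a few lines. This assumes a 365-day year and treats the 63 and 113 minute marks as the only downtime for the year; the function name `availability_percent` is just mine for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def availability_percent(downtime_minutes):
    """Yearly availability, as a percentage, for a given amount of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# The two milestones from the outage timeline, in minutes of downtime.
for label, minutes in [("diagnosed", 63), ("fixed", 113)]:
    print(f"{label} at {minutes} min: {availability_percent(minutes):.4f}%")
```

Any further outage in the same year subtracts from these numbers, so a single long incident can use up most of the annual budget.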
If you're attempting to get to 5 9s of reliability, you get less than 6 minutes of downtime a year. Can you figure out what's wrong in 6 minutes? Much less fix it? That's a difficult task. I think 4 9s, giving you 52-ish minutes of downtime, is realistic, but very hard. Most of us can likely handle 3 9s, which allows for roughly 8 hours and 45 minutes of downtime a year. While I've exceeded that amount of downtime before, it's been rare.
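If you want to check the downtime budget for a given number of 9s yourself, here is a rough sketch. It assumes a 365-day year, and `downtime_budget_minutes` is a hypothetical helper name, not anything from a library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_budget_minutes(nines):
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailable_fraction = 10 ** -nines  # 3 nines -> 0.001, 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailable_fraction


for nines in (3, 4, 5):
    budget = downtime_budget_minutes(nines)
    hours, minutes = divmod(budget, 60)
    print(f"{nines} nines: {budget:.2f} minutes ({int(hours)}h {minutes:.1f}m)")
```

Each extra 9 cuts the budget by a factor of ten, which is why the jump from 4 to 5 is so much harder than it sounds.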
We have a lot of HA (high availability) options in SQL Server, and there are many successful implementations that achieve high levels of availability for the database. The network and the application are another story, but I think the quality of those areas has increased over the years as well. Doing HA well is hard, and if you aren't 100% sure of what you're doing, or your system is very valuable, you might engage a consultant, like Allan Hirt, to ensure that you've configured things well. SQL Server runs well in HA configurations, but getting it set up can be more difficult than you expect.