Your MTTR

There will always be problems in our software systems. I know many people feel we might get beyond most issues, but as long as we continue to develop and deploy software, we'll have them. Hardware fails, bugs slip through tests and are triggered by edge cases we never anticipated, or the data is unexpected. That last one might be one of the most common causes, and it is a source of many security issues. Too many developers assume that quality will always be high, but that's not always the case.

When things go wrong, how quickly can you get the system back up and running? Tracked over time, a mean-time-to-recovery (MTTR) can help determine whether you are getting a better handle on your environment or things are getting worse. Both your operations staff and your developers should understand the system better as time passes, and hopefully get broken applications back up and running more quickly.

Do you track your MTTR? Or if you're in operations, maybe you track a mean-time-to-identification (MTTI). This is the time it takes to actually figure out what's wrong. I don't know anyone tracking this metric, but it's an interesting one to note. If we can't identify problems quickly, or the MTTI grows over time, perhaps we have a training or turnover issue. Or perhaps we have a disconnect between developers and operations staff. Even in a DevOps environment where developers are responsible for parts of the production environment, there will be differing levels of ability, and if the number rises, this metric might help you identify who needs more training or practice in troubleshooting.

For most of my career, I've reported on uptime (or downtime) to management. That's not a bad metric, but it doesn't help the dev or ops staff understand where they might have problems. Many of us have ticketing systems where incidents are logged, and we add notes over time. How long it takes to find a problem, and then how long it takes to fix it, are metrics that can help you improve your system reliability over time.
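As a rough illustration of how these two numbers could come out of ticket data, here is a minimal Python sketch. The `Incident` fields (`opened`, `identified`, `resolved`) are hypothetical names, not those of any particular ticketing system, and the two sample incidents are made up. It also assumes MTTR is measured from when the ticket is opened until service is restored; some teams start the clock elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical fields; map these to whatever your ticketing system records.
    opened: datetime      # when the problem was first logged
    identified: datetime  # when the cause was figured out
    resolved: datetime    # when the system was back up and running


def mean_duration(durations):
    """Average a collection of timedeltas and return a timedelta."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


def mtti(incidents):
    """Mean time to identification: opened -> identified."""
    return mean_duration([i.identified - i.opened for i in incidents])


def mttr(incidents):
    """Mean time to recovery: opened -> resolved (an assumed definition)."""
    return mean_duration([i.resolved - i.opened for i in incidents])


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 3, 9, 0),
             datetime(2024, 1, 3, 9, 30),
             datetime(2024, 1, 3, 10, 0)),
    Incident(datetime(2024, 1, 17, 14, 0),
             datetime(2024, 1, 17, 14, 10),
             datetime(2024, 1, 17, 14, 40)),
]

print("MTTI:", mtti(incidents))
print("MTTR:", mttr(incidents))
```

Running the same calculation per month or per quarter, rather than once over all incidents, is what would show whether the numbers are trending up or down.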

That's if you use them to do so. If these are just numbers to try to make your group look good to upper management, then someone will manipulate things by closing tickets early or opening them late. They might even be more willing to close a ticket quickly and open another one to reduce the MTTI and MTTR times.

We can use metrics to improve how we work or just to look good. One of these will help build an effective, efficient, strong department that does a great job building and running applications. The other usually ends up building an environment where quality stagnates, people don't stay longer than necessary, and the traditional IT stereotypes stay alive.

Which one do you work in and which would you prefer?
