SQLServerCentral Editorial

All the Costs of Downtime

I studied economics in university, which isn't that close to database work, though I did have to work through linear regression problems by hand. I always enjoyed mathematics, so this wasn't a hardship, at least until I purchased a PC capable of letting me do graphs and calculations in PASCAL and BASIC. Then I realized that doing the work by hand wasn't that efficient or useful, and that a computer could help me get things done far more quickly.

Many of us work on systems that process tremendous amounts of data, something our organizations couldn't complete without computer hardware, efficiently or not. We just wouldn't be able to get the work done by hand. That's the main reason why downtime is such a problem in the modern world; we can't fall back to manual systems in many cases.

I ran across an article that discusses some of the large-scale failures in recent history (Heathrow, Delta, NYSE, Royal Bank of Scotland) due to computer system failure. Certainly, there are large financial costs and lost revenue for organizations that suffer these outages. However, there are other costs that are borne by the staffers, which don't often make the news.

When it's "all hands on deck" to solve a problem, other work stops progressing. There is certainly the interruption of Operations people, but developers often get asked questions or pulled into meetings to provide input. That can take them away from their existing work. Apart from the "23 minutes to get their head back in the game," as noted in the article, can they even focus anymore? Will they keep thinking through all the possible causes, or wondering whether they actually provided the right information and all the details needed?

During a crisis, or even after, it is very hard for humans to focus on anything else. Apart from the technical details, IT staffers can have a range of emotions and thoughts. They might have sympathy for customers affected. They might worry they're at fault and might be blamed (or terminated). They might be thinking about how they should have coded or configured something differently. Should they have tested more or accounted for other issues? They might feel simple anger at others who didn't do their job, or frustration at the failure of a piece of hardware.

Perhaps even more concerning is the load that management can place on employees to get things fixed. If people work long hours, how do we ease them back into the flow of all the other daily work? As a manager, I know I've struggled to get people to rotate work with rest. As an employee, I struggle to even sleep if I am sent home while others are still working. I've had to work 100+ hour weeks, and very quickly we get into survival mode, not productive mode.

There are lots of costs to downtime apart from the financial impact. If you can't maintain a stable environment that limits the time employees spend firefighting, you likely aren't going to survive as an organization. Startups can sometimes get away with this, but often that's because a few extremely dedicated employees make a difference at a smaller scale. And these employees often pay the price in their personal lives with health, relationship, or other issues.

The article goes on to look at predictive analytics that might help us reduce some of the problems caused by hardware failures. I think this is likely true, as we've seen digital twins that simulate loads on equipment help proactively catch issues.

What do we do with software? If we don't write well-architected software that handles the load, how do we write an analytical system that can predict failures? This seems like a level of static and dynamic code analysis that we aren't mature enough to build.

Heck, even if we could, how hard is it for many of you to get queries tuned in a running system? I find too often there isn't enough effort or enthusiasm from developers, management, and others to follow solid tuning advice and change the SQL. Maybe that's too limited a view.

Perhaps the AI analysts of the future will become the consultants of the past, whose recommendations often mimic the words of the current staff, but somehow carry more weight. Maybe they'll get more things done and changed to help us build more robust systems.
