There was a failure recently at an Azure data center in Australia: a utility power sag caused equipment to trip offline. You can read about it here, but the headline is that only three people were on site when the incident occurred, and they were unable to restart the equipment in time to prevent an outage.
In a little more detail, there weren't enough people on hand to quickly restart the chillers after the power sag. When 13 of the units failed to restart, staff had to access them on a roof. They managed to restart 8, but by the time they reached the last 5, the water temperature had risen past the level that would allow a restart. They then had to power down some computer equipment and go through a lengthier process to get everything running again.
This sounds bad, but in reality this is exactly the type of thing I've seen in private data centers, which almost never have all the staff, or the knowledge, needed to deal with large-scale failures. While I haven't seen chillers fail, I have seen people trip electrical systems and then be unable to restart or reset UPSes or generators for hours until qualified staff could come in. If you read the incident history, there is a good retrospective of what happened, along with actions taken to prevent a recurrence. They increased staffing levels, but also identified places where the previous staffing level would have been fine with some equipment and protocol updates.
I wish more organizations would review incidents with an eye toward not only what happened and where the failures were, but how to prevent issues in the future. Too often I see people going through this exercise in order to blame someone and "prevent this from ever happening again," which usually means firing someone and changing nothing else. We need psychological safety in incident reviews if we want to get better.
As we build more complex systems, or even more complex organizations with lots of teams, people, equipment, procedures, etc., it's easy to build in lots of points of failure without realizing there will be problems down the road. My goal these days is often to assume I'll have inexperienced or less capable staff on hand, and to design processes and systems that survive problems anyway: keep things simple, and don't get too cute with the engineering. I like robust, resilient systems that anyone can operate, not ones that require my senior superstars to always be on call.
Of course, it often takes the senior superstars to design and test these systems and protocols, which is a good use of their time.
Many businesses struggle with staffing in many areas. Technology groups are no different, and we have to learn to work smarter, not assume we'll just get more staff to solve our problems.