A Staffing Disaster

An Azure data center in Australia failed recently when a utility power sag caused equipment to trip offline. You can read about it here, but essentially the headline is that only three people were on site when the incident occurred, and they were unable to restart the equipment before an outage occurred.

In a little more detail, there weren't enough people to quickly restart the chillers after the incident. When 13 of the units didn't restart, the staff had to access the equipment on a roof. They were able to get to 8, but by the time they reached the last 5, the water temperature had risen to a level that wouldn't allow a restart. So they had to power down some computer equipment and go through a lengthier process to get everything running.

This sounds bad, but in reality, this is exactly the type of thing I've seen in private data centers, which almost never have all the staff they need, or the knowledge necessary, to deal with large-scale failures. While I haven't seen this happen with chillers, I have seen people trip electrical systems and be unable to restart or reset UPSes or generators for hours until qualified staff could come in. If you read the incident history, there is a good retrospective of what happened, and then some actions taken to try to prevent this in the future. They increased staff levels but also identified some places where the previous staffing level would have been fine with some equipment and protocol updates.

I wish more organizations would review incidents and examine them with an eye towards not only what happened and where there were failures, but how to prevent issues in the future. Too often I see people going through this exercise in order to blame someone and "prevent this from ever happening again", which usually means we fire someone and don't change anything else. We need psychological safety in reviews of actions to get better.

As we build more complex systems, or even more complex organizations with lots of teams, people, equipment, procedures, etc., it's easy to build in lots of points of failure without realizing there will be problems in the future. My goal often these days is to assume I'll have some inexperienced or less capable staff and design processes and systems to survive issues. I try to keep things simple and not get too cute with engineering. I like robust, resilient systems that anyone can operate, not those that require me to ensure my senior superstars are always on call.

Of course, it often takes the senior superstars to design and test these systems and protocols, which is a good use of their time.

Many businesses struggle with staffing in many areas. Technology groups are no different, and we have to learn to work smarter rather than assume we will just get more staff to solve our problems.
