Mini Disaster - AC Failure

By Andy Warren,

I bet you've all had what I call a mini disaster, something that entirely disrupts production for anywhere from a minute to a day or so. Sometimes it's user error, a bad patch, user error, hardware failure, or user error! These minis are times that will often test your character. As much as I hate to admit the ones that occur, I think it's good to talk about them; maybe it will save someone else some pain someday.

Some percentage of you have a disaster recovery (DR) plan designed to sustain the business if really bad things happen such as the data center being destroyed by a storm. I'd say they are slightly better than nice to have and some are better than others, but do they cover the smaller things that might happen? Or should they?

On to the disaster!

A little background first. Few server manufacturers talk to you about planning for the additional heat load and electrical demand that adding new hardware requires. Adding electrical might be easy if you have the capacity, but it gets expensive if someone has to run a new supply or put in a dedicated panel. Same for AC. Adding a small 1U box probably won't affect the AC, but adding a 30-disk SAN running on 220V will make a huge difference. Adding more AC is not simple, or cheap. Something to think about.

Losing electric can ruin your day. I'll go out on a limb and bet that most servers are on some type of UPS, although it's rare for one to carry the load for more than 20 minutes. Far fewer sites have a backup generator. Let's say the building loses power, the UPS kicks in, and you're still taking web hits or doing whatever your business does. Is the AC on a UPS? Not likely! What happens to the temperature in the server room? It starts climbing, though if you only have 20 minutes of power it usually won't get hot enough to matter. Most servers have some type of thermal protection that will shut them down before they self-destruct. I don't know what the actual setting is internally, but our stuff seems to shut down when the ambient temperature is between 100 and 110 degrees Fahrenheit.

Wondering how I know?

About every six months our building maintenance people shut down the AC in the entire building for part of Saturday to do preventative maintenance. On more than one occasion everything has restarted correctly except our server room units - on a separate circuit or something like that, to make it less likely to fail! On another occasion they turned it off to service the fire sprinklers in the building. Usually we catch it, but at least once I've opened the door to find the temperature over 100 degrees. Typically servers start to shut down, but it's a guess as to which ones will and which ones have better cooling and will keep running. Once we had a server that lost a processor, but once the room cooled we had the BIOS rescan and everything was fine. A tribute to the hardware, but not our finest hour.

We try very hard to make sure we check after any power outage or building maintenance, but just in case we installed a temperature alarm connected to an analog phone line that will notify several people if the temp exceeds 85 degrees (the room is usually cooled to 72). I don't have the brand handy, but it cost us about $250. Cheap insurance. Make sure you plug it into a UPS! Some models also detect water, which could be useful depending on the environment you're in. In larger buildings the maintenance staff usually get paged if a system fails, but it never seems to work for us. In the most recent iteration of this particular disaster they did maintenance without telling us, the AC didn't reset, AND the alarm didn't go off!
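If you ever script your own check on top of a temperature sensor, here is a minimal sketch of the idea in Python. The 85 degree threshold is the one we use; the function name, the five-minute staleness limit, and the idea of tracking the age of the last reading are my assumptions for illustration, not features of the alarm we bought. The point is that an alarm that says nothing should be treated as a problem too.

```python
ALARM_F = 85.0           # alarm threshold from the article
MAX_READING_AGE_S = 300  # assumption: a reading older than 5 minutes is suspect


def check_room(temp_f: float, reading_age_s: float) -> str:
    """Return 'stale', 'alarm', or 'ok' for the latest sensor reading."""
    # A silent sensor is a problem in its own right, so a dead alarm
    # cannot pass for a cool room.
    if reading_age_s > MAX_READING_AGE_S:
        return "stale"
    # The alarm fires only when the temperature exceeds the threshold.
    if temp_f > ALARM_F:
        return "alarm"
    return "ok"


if __name__ == "__main__":
    print(check_room(72.0, 30))    # normal room temperature, fresh reading
    print(check_room(91.5, 30))    # too hot
    print(check_room(72.0, 3600))  # sensor has not reported for an hour
```

Whatever pages people on "alarm" should page them on "stale" as well.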

So the quiz for today: what do you do if you lose AC right now but still have electric? How long can you continue to operate? Who do you call?

Your first priority is to start moving the heat out. Keep a couple of cheap fans in or near the server room, and don't let people appropriate them. Get the server room door open and get the fans on high.

Next get on the phone or get someone else on the phone finding whoever works on your AC. Make sure they understand it's an urgent problem. Even better, prep them ahead of time about the importance of responding quickly should this problem occur.

From there I recommend you take steps to reduce the heat load as much as possible. Consider shutting down the inactive portion of any clusters. Turn off noncritical servers; candidates might be print servers, secondary domain controllers, file servers, switches, etc. It's nice to know ahead of time what is critical and what is not. You may even need a hierarchy where you turn some off immediately and turn off others as the temperature continues to rise.
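One way to write that hierarchy down ahead of time is as a small table of temperature tiers. The sketch below is only an illustration: the tier temperatures and the grouping of machines are made-up assumptions, and the names are placeholders for whatever is noncritical in your shop.

```python
# Hypothetical tiers: (temperature in degrees F, machines to power off).
# Both the thresholds and the groupings are assumptions for illustration.
SHUTDOWN_TIERS = [
    (85.0, ["print server", "passive cluster node"]),
    (90.0, ["secondary domain controller", "file server"]),
]


def machines_to_power_off(temp_f: float) -> list:
    """List every machine whose tier threshold has been reached."""
    names = []
    for threshold_f, machines in SHUTDOWN_TIERS:
        if temp_f >= threshold_f:
            names.extend(machines)
    return names


if __name__ == "__main__":
    for temp in (80.0, 87.0, 95.0):
        print(temp, machines_to_power_off(temp))
```

Even if nobody ever runs it, writing the list forces the conversation about what is critical before the room is hot.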

If you've got a spot cooler (a portable AC unit), get it plugged in and working. If you don't have one, get on the phone and rent one; they start at about $100 a day. Make sure you know what type of electrical is required (these often use 220V), and even check the type of plug they require. Usually they are not big enough to manage all the heat load, which is why you need to do the steps above first or concurrently. Most of these coolers also have a condensation bucket that has to be emptied when full or the unit will shut down. The last one we used had a 2.5-gallon container, estimated to be good for 8 hours in Florida with high humidity, longer if you have partial AC that is reducing the humidity already.
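The bucket arithmetic is worth doing once so someone knows when to go empty it. A rough sketch using the numbers above (2.5 gallons, about 8 hours); the function name is hypothetical, and the fill rate is assumed to be steady, which it won't be if the humidity changes.

```python
BUCKET_GAL = 2.5       # container size from the article
HOURS_TO_FILL = 8.0    # estimate from the article, high humidity

# Average fill rate implied by those two numbers, in gallons per hour.
FILL_RATE_GAL_PER_H = BUCKET_GAL / HOURS_TO_FILL


def hours_until_full(current_gal: float) -> float:
    """Hours left before the bucket is full, assuming a steady fill rate."""
    remaining_gal = max(BUCKET_GAL - current_gal, 0.0)
    return remaining_gal / FILL_RATE_GAL_PER_H


if __name__ == "__main__":
    print(FILL_RATE_GAL_PER_H)     # gallons per hour
    print(hours_until_full(1.25))  # bucket is half full
```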

At this point you've done most of what can be done. Now assess how long you can maintain operations. Possibly you can keep critical items running indefinitely. Possibly you've only slowed the heat buildup, and at some point it will be better to shut things down gracefully rather than risk losing hardware. If you've got a failover site, think about getting it activated. You also have to look at the time of day; maybe you can shut down more stuff after 5pm. If this is happening during the hottest part of the day you may gain a little ground going into the evening hours.
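A crude way to put a number on "how long" is to take two thermometer readings a few minutes apart and extend the line. The sketch below assumes the temperature keeps rising at the same rate, which a real room won't do exactly, and uses 100 degrees as the limit only because that is the low end of where our servers seem to shut down. The function name and the sample readings are made up for illustration.

```python
from typing import Optional, Sequence, Tuple

SHUTDOWN_F = 100.0  # low end of the range where the article's servers shut down


def minutes_until(limit_f: float,
                  readings: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate minutes until limit_f from (minute, temp_f) readings.

    Uses the straight line through the first and last readings. Returns 0.0
    if the room is already at the limit, and None if it is holding steady
    or cooling.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings")
    (t0, f0), (t1, f1) = readings[0], readings[-1]
    if t1 <= t0:
        raise ValueError("readings must be in time order")
    if f1 >= limit_f:
        return 0.0
    rate = (f1 - f0) / (t1 - t0)  # degrees F per minute
    if rate <= 0:
        return None
    return (limit_f - f1) / rate


if __name__ == "__main__":
    # Made-up readings: 80 degrees at minute 0, 85 degrees at minute 10.
    print(minutes_until(SHUTDOWN_F, [(0, 80.0), (10, 85.0)]))
```

Take a fresh reading every so often and rerun the estimate; the rate should drop as you shed load and get fans and coolers running.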

It can happen. Wednesday afternoon the temp alarm went off, and we confirmed both the temperature and that the AC was running. Turns out that at some point an electrical contractor had run quite a bit of wiring over one of our ducts. The weight finally crushed the duct, reducing capacity just enough to let the temp climb. Luckily it was something we could fix and the temperature never got into the danger range.

There are lots of things that can go wrong of course, but maybe this will get you thinking about a little extra preparation if this one ever happens to you.
