Mini Disaster - AC Failure

2004-06-07

I bet you've all had what I call a mini disaster, something that entirely

disrupts production for a minute to a day or so. Sometimes it's user error, a

bad patch, user error, hardware failure, or user error! These minis are times

that will often test your character. As much as I hate to admit the ones that

occur, I think it's good to talk about them; maybe it will save someone else

some pain someday.

Some percentage of you have a disaster recovery (DR) plan designed to sustain

the business if really bad things happen such as the data center being destroyed

by a storm. I'd say they are slightly better than nice to have and some are

better than others, but do they cover the smaller things that might happen? Or

should they?

On to the disaster!

A little background first. Few server manufacturers talk to you about

planning for the additional heat load and electrical demand that adding new

hardware requires. Adding electrical might be easy if you have the capacity,

but it gets expensive if someone has to run a new supply or put in a dedicated panel.

Same for AC. Adding a small 1U box probably won't affect the AC, but adding a 30

disk SAN running on 220V will make a huge difference. Adding more AC is not

simple, or cheap. Something to think about.

Losing electrical power can ruin your day. I'll go out on a limb and bet that most

servers are on some type of UPS, although it's rare for one to carry the load for

more than 20 minutes. Far fewer sites have a backup generator. Let's say the

building loses power, UPS kicks in, you're still taking web hits, whatever your

business does. Is the AC on a UPS? Not likely! What happens to the temperature

in the server room? It starts climbing, though if you only have 20 minutes of

power it usually won't get hot enough to matter. Most servers have some type of

thermal protection that will shut them down before they self destruct. I don't

know what the actual setting is internally, but our stuff seems to shut down

when the ambient temperature is between 100 and 110 degrees Fahrenheit.

Wondering how I know?

About every six months our building maintenance people shut down the AC in

the entire building for part of Saturday to do preventative maintenance. On more

than one occasion everything has restarted correctly except our server room

units - on a separate circuit or something like that, to make it less likely to

fail! On another occasion they turned it off to service the fire sprinklers in

the building. Usually we catch it, but at least once I've opened the door to

find the temperature over 100 degrees. Typically servers start to shut down, but

it's a guess as to which ones will and which ones have better cooling and will

keep running. Once we had a server that lost a processor, but once the room

cooled we had the BIOS rescan and everything was fine. A tribute to the

hardware, but not our finest hour.

We try very hard to make sure we check after any power outage or building

maintenance, but just in case we installed a temperature alarm connected to an

analog phone line that will notify several people if the temp exceeds 85 degrees

(room is usually cooled to 72). I don't have the brand handy, but it cost us

about $250. Cheap insurance. Make sure you plug it into a UPS! Some models also

detect water, which could be useful depending on the environment you're in. In

larger buildings the maintenance staff usually get paged if a system fails, but

it never seems to work for us. In the most recent iteration of this particular

disaster something happened where they did maintenance without telling us, the

AC didn't reset, AND the alarm didn't go off!
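
If you'd rather back up the phone dialer with a script, the check itself is simple. Here is a minimal sketch in Python, assuming you have some sensor you can poll. The 85 degree threshold is the one we use; the Reading type, the evaluate function, and the 10 minute staleness limit are made up for illustration. The "stale" result is there because of the time our alarm never went off: a sensor that has gone quiet should be treated as a problem, not as good news.

```python
from dataclasses import dataclass
from typing import Optional

ALARM_THRESHOLD_F = 85.0  # alarm point mentioned above, degrees Fahrenheit


@dataclass
class Reading:
    minutes_ago: float  # how old the reading is
    temp_f: float       # temperature in degrees Fahrenheit


def evaluate(reading: Optional[Reading], max_age_minutes: float = 10.0) -> str:
    """Return 'ok', 'alarm', or 'stale' for the latest sensor reading.

    max_age_minutes is a hypothetical limit; pick one that matches
    how often your sensor actually reports.
    """
    if reading is None or reading.minutes_ago > max_age_minutes:
        # No recent data: assume the monitor itself has failed.
        return "stale"
    if reading.temp_f > ALARM_THRESHOLD_F:
        return "alarm"
    return "ok"


if __name__ == "__main__":
    print(evaluate(Reading(minutes_ago=1, temp_f=72.0)))
    print(evaluate(Reading(minutes_ago=1, temp_f=91.0)))
    print(evaluate(Reading(minutes_ago=45, temp_f=72.0)))
```

Anything other than "ok" should notify a person, and "stale" deserves the same urgency as "alarm".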

So the quiz for today: what do you do if you lose AC right now but still have

electrical power? How long can you continue to operate? Who do you call?

Your first priority is to start moving the heat out. Keep a couple cheap fans

in or near the server room, and don't let people appropriate them. Get the server

room door open and get the fans on high.

Next get on the phone or get someone else on the phone finding whoever works

on your AC. Make sure they understand it's an urgent problem. Even better, prep

them ahead of time about the importance of responding quickly should this

problem occur.

From there I recommend you take steps to reduce the heat load as much as

possible. Consider shutting down the inactive portion of any clusters. Turn off

noncritical servers; candidates might be print servers, secondary domain

controllers, file servers, switches, etc. It's nice to know ahead of time what is

critical and what is not. You may even need a hierarchy where you turn some off

immediately and turn off others as the temperature continues to rise.
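
One way to write that hierarchy down ahead of time is as a small table of temperatures and the systems to power off at each one. The sketch below is only an illustration: the tier temperatures and the grouping of systems are assumptions, not numbers from our server room, and systems_to_shut_down is a made-up helper name.

```python
# Hypothetical tiers: (temperature in degrees F at which to act, what to power off).
# Replace both the temperatures and the lists with your own.
SHUTDOWN_TIERS = [
    (85.0, ["print servers", "inactive cluster nodes"]),
    (90.0, ["secondary domain controllers", "file servers"]),
    (95.0, ["everything except critical systems"]),
]


def systems_to_shut_down(temp_f, tiers=SHUTDOWN_TIERS):
    """Return every system whose tier threshold has been reached."""
    to_stop = []
    for threshold_f, systems in tiers:
        if temp_f >= threshold_f:
            to_stop.extend(systems)
    return to_stop


if __name__ == "__main__":
    for temp in (80.0, 86.0, 92.0):
        print(temp, systems_to_shut_down(temp))
```

The point is less the code than having the list agreed on before the room gets hot, so nobody has to decide under pressure what counts as critical.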

If you've got a spot cooler (a portable AC unit), get it plugged in and

working. If you don't have one, get on the phone and rent one; they start at

about $100 a day. Make sure you know what type of electrical supply is required (these

often use 220V), and check the type of plug they require. Usually they are

not big enough to manage all the heat load, which is why you need to do the steps

above first or concurrently. Most of these coolers also have a condensation

bucket that has to be emptied when full or the unit will shut down. The last one

we used had a 2.5 gallon container, estimated to be good for 8 hours in Florida

with high humidity, longer if you have partial AC that is reducing the humidity

already.
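
The bucket math is worth doing so someone is scheduled to empty it before the unit shuts itself off. A rough sketch, using the 2.5 gallons in about 8 hours from our last rental as the fill rate; the function name is made up, and your rate will vary with humidity.

```python
def hours_until_full(bucket_gallons, gallons_per_hour, gallons_in_bucket=0.0):
    """Estimate hours left before the condensation bucket is full."""
    if gallons_per_hour <= 0:
        raise ValueError("gallons_per_hour must be positive")
    remaining = max(bucket_gallons - gallons_in_bucket, 0.0)
    return remaining / gallons_per_hour


if __name__ == "__main__":
    # Fill rate implied by the numbers above: 2.5 gallons over 8 hours.
    rate = 2.5 / 8.0
    print(rate)                                    # 0.3125 gallons per hour
    print(hours_until_full(2.5, rate))             # empty bucket
    print(hours_until_full(2.5, rate, 1.25))       # half-full bucket
```

Treat the result as an upper bound and check the bucket earlier than the estimate says.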

At this point you've done most of what can be done. Now assess how long you

can maintain operations. Possibly you can keep critical items running

indefinitely. Possibly you've only slowed the heat load and at some point it

will be better to shut things down gracefully rather than risk losing hardware.

If you've got a failover site, think about getting it activated. You also have

to look at the time of day; maybe you can shut down more stuff after 5pm. If

this is happening during the hottest part of the day you may gain a little

ground going into the evening hours.
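
To put a rough number on "how long", you can take two temperature readings a few minutes apart and extrapolate. This sketch assumes the temperature keeps rising in a straight line, which it won't exactly, so treat the answer as a ballpark. The 100 degree limit is the low end of where our servers seem to shut down; minutes_until_limit is a made-up name.

```python
def minutes_until_limit(limit_f, earlier_f, later_f, minutes_between):
    """Estimate minutes until the room reaches limit_f.

    Uses simple linear extrapolation from two readings taken
    minutes_between apart. Returns 0.0 if the limit is already reached,
    or None if the temperature is flat or falling.
    """
    if minutes_between <= 0:
        raise ValueError("minutes_between must be positive")
    if later_f >= limit_f:
        return 0.0
    rate_per_minute = (later_f - earlier_f) / minutes_between
    if rate_per_minute <= 0:
        return None  # holding steady or cooling; no deadline to estimate
    return (limit_f - later_f) / rate_per_minute


if __name__ == "__main__":
    # Example readings: 80 degrees, then 85 degrees ten minutes later.
    print(minutes_until_limit(100.0, 80.0, 85.0, 10.0))
```

Re-run the estimate after each thing you turn off; if the number stops shrinking, you may be able to keep the critical items up.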

It can happen. Wednesday afternoon the temp alarm went off, and we confirmed the

temperature and that the AC was running. Turns out that at some point an

electrical contractor had run quite a bit of wiring over one of our ducts, and it

finally crushed the duct, reducing capacity just enough to let the temp climb.

Luckily it was something we could fix and the temperature never got into the

danger range.

There are lots of things that can go wrong of course, but maybe this will get

you thinking about a little extra preparation if this one ever happens to you.
