Disaster Recovery planning is something that everyone says you should do. Everyone wants a detailed plan that spells out exactly what you will do, especially those Sarbanes-Oxley auditors. But in reality, a detailed plan won't work, and if you expect a detailed plan to succeed, you'll be in trouble. People need to be able to think on their feet, react, and make up for documents that aren't up to date or don't cover the situation.
In my first article, Incident Response - The Framework, I looked at the outline of things that you need to get the framework in place. This one will look at the actual things that should happen when an incident occurs.
The number of people that can declare an incident should be pretty large. You should have good people in place, doing their jobs, and they should be able to react. When a network technician realizes that there is probably something major going on, they should be able to invoke an incident. Their manager might be at lunch, and you don't want to wait an hour to get things moving. Trust your people to make good decisions, and if they don't, work with them afterward to explain why it wasn't the appropriate time. But don't berate them or revoke their ability to declare an incident in the future. Teach them what needs to be done.
From your framework, you should have a contact list as well as a way for it to be invoked. In a smaller organization it could be as simple as having the person invoking the incident call everyone. In larger companies the task has usually been handed off to some group that always has people on call, like the help desk or the NOC. It's then this group's responsibility to first contact a leader (see below) and determine the meeting place (also below) before starting to call the people on the contact list.
For this list, there are a few things you should have. For primary people, those in various areas that are important to solving problems like the senior network engineer, mail engineer, DBA, etc., you should have multiple methods of contacting them. Home phone, work, cell, friends, etc. Be aware that if this is a larger disaster, cell phones might not work, so prepare for that. You might even have people call into some landline if they suspect a problem. If you cannot contact a primary Subject Matter Expert then a secondary person should be contacted.
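To make the contact list idea concrete, here is a minimal sketch of how it might be modeled: each area has a primary and a secondary expert, and each person has several contact methods that are tried in order. Every name here (Contact, Area, contact_expert, the try_method callback, the phone numbers) is hypothetical and only for illustration; how you actually place the calls is up to your organization.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical callback: (person name, method label, number) -> True if reached.
TryMethod = Callable[[str, str, str], bool]


@dataclass
class Contact:
    name: str
    # Contact methods in the order they should be tried,
    # e.g. [("cell", "555-0100"), ("home", "555-0101")].
    methods: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class Area:
    name: str
    primary: Contact
    secondary: Optional[Contact] = None


def reach(contact: Contact, try_method: TryMethod) -> Optional[str]:
    """Try each method in order; return the label that worked, or None."""
    for label, number in contact.methods:
        if try_method(contact.name, label, number):
            return label
    return None


def contact_expert(area: Area, try_method: TryMethod) -> Optional[tuple[str, str]]:
    """Try the primary, then the secondary; return (name, method) or None."""
    for person in (area.primary, area.secondary):
        if person is None:
            continue
        label = reach(person, try_method)
        if label is not None:
            return person.name, label
    return None


# Illustrative data only.
dba = Area(
    "DBA",
    primary=Contact("Pat", [("cell", "555-0100"), ("home", "555-0101")]),
    secondary=Contact("Sam", [("cell", "555-0102")]),
)

# Simulate a larger disaster in which cell phones are not working.
cells_down = lambda name, label, number: label != "cell"
print(contact_expert(dba, cells_down))
```

In the simulated cell outage, the primary is still reached on the home line, which is the reason to keep more than one method per person. If nobody can be reached, the function returns None and the caller has to decide what to do next.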
I have also seen the "on call" person for each area being contacted. In smaller companies, you might only have one person on call, but in larger organizations, those with more than 1,000 people, you likely have 3, 4 or more people on call at any given time. At Peoplesoft, with 12,000+ employees, we had 10 people on call at any time for the various areas.
As I mentioned in the last article, not everyone is suited to manage a crisis. At Peoplesoft there were 4 designated people who had a little training and were allowed to lead an incident response. It's a skill that's hard, being able to remain calm under pressure, inspire as well as ensure that people get things done, not blame anyone or get too upset when more things go wrong or solutions aren't working. It takes a cool, calm, collected person that can easily communicate with both technical people and senior management. Typically it is someone at a director level, not so high up that everyone is afraid of making a mistake, but not so low as to not have enough respect and power to make things happen or be allowed to make decisions.
You should designate a few people that can assume leadership in a crisis and have the authority to make decisions. Typically this person is in charge of ALL actions and is the only person in charge at any given time. At least two people should be chosen so you can work in shifts if need be.
In a crisis, this person should be called first and they can lead the initial crisis meeting. From there, this person is in charge of assigning actions, receiving status reports, and giving the updates to management and other groups on how the incident response is proceeding.
It's a little thing, but someone should be responsible for taking notes. Either electronically, on a whiteboard, somewhere. In one of my larger employers there was a secretary pool that was on call just as others were in case they were needed for a crisis. In smaller groups the leader designates someone, even one of the techies, but be sure someone documents the problem, the plan, the action items and who they are assigned to. Be sure that you update and post these notes regularly.
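If the notes are kept electronically, the structure can be very simple. The sketch below is one possible shape, not a prescribed format: it records the problem, the plan, and the action items with their owners, and renders a plain-text summary that can be posted. The class and field names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    task: str
    owner: str
    done: bool = False


@dataclass
class IncidentLog:
    problem: str
    plan: str = ""
    action_items: list[ActionItem] = field(default_factory=list)

    def assign(self, task: str, owner: str) -> ActionItem:
        """Record a new action item and who it is assigned to."""
        item = ActionItem(task, owner)
        self.action_items.append(item)
        return item

    def summary(self) -> str:
        """Plain-text summary suitable for posting after each update."""
        lines = [f"Problem: {self.problem}", f"Plan: {self.plan}", "Action items:"]
        for item in self.action_items:
            mark = "x" if item.done else " "
            lines.append(f"  [{mark}] {item.task} ({item.owner})")
        return "\n".join(lines)


# Illustrative data only.
log = IncidentLog(problem="Mail server unreachable", plan="Fail over to standby")
restore = log.assign("Restore mailbox database", "DBA on call")
log.assign("Repoint DNS to standby", "Network engineer")
restore.done = True
print(log.summary())
```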
There should be a designated meeting place and meeting phone for most companies. If you're truly a small company, then you might just meet in someone's office and use regular phone lines and 3 way calling, but if you've gotten IT above 10 people, you probably should designate some predetermined setup.
The meeting room should be a room large enough to hold the group of people who will be working on the problem. Not for the actual work, though I've seen that done, but large enough for meetings of everyone that might have input. There should be equipment pre-staged and always available since one never knows when an incident will occur. You might consider having the following available:
- speaker phones for remote attendees
- preferably whiteboards for diagramming or making notes, but if this is not possible, a large notepad that can be written on and the pages taped or stuck to the walls. Be sure writing utensils are available.
- Computer projector - For displaying something from a laptop for everyone to see.
- Extra power cords, power strips, network cables, etc.
- A clock. This will be the authoritative clock by which everyone will work. Individual computer clocks, watches, etc. can be managed by each person once they know how this clock relates to their own watch or computer clock.
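For the clock in particular, relating your own watch or computer clock to the room clock is just a matter of noting the difference once and applying it afterward. A small sketch, with hypothetical function names and made-up example times:

```python
from datetime import datetime, timedelta


def clock_offset(room_clock: datetime, own_clock: datetime) -> timedelta:
    """How far the room clock is ahead of your own clock (negative if behind)."""
    return room_clock - own_clock


def to_room_time(own_reading: datetime, offset: timedelta) -> datetime:
    """Convert a time read from your own clock into room-clock time."""
    return own_reading + offset


# Example values only: both clocks read at the same moment.
room = datetime(2024, 1, 1, 14, 3, 0)
laptop = datetime(2024, 1, 1, 14, 0, 30)
offset = clock_offset(room, laptop)  # room clock is 2 minutes 30 seconds ahead

# A log entry stamped by the laptop, expressed in room-clock time.
print(to_room_time(datetime(2024, 1, 1, 15, 10, 0), offset))
```

Recording the offset once at the first meeting means times in notes and logs from different machines can be lined up later.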
This can be a normal conference room, which it was in a few companies I've worked for. Larger companies also tend to have extra laptop power supplies, multiple hubs/switches, etc. set up in the room.
You also want to plan for remote attendees. For whatever reason, people will not always be able to be in the crisis room in person, so having conference capabilities is a must. Whether you permanently assign a number or not, there should be some conference facility set up for the initial meetings and updates. If it is not a permanent number, then the temporary number should be assigned by someone and given to the central point of contact for the group (help desk, NOC, etc.). The same number should be used for the duration of the crisis if possible.
At one company, a permanent line was set up and once an incident was declared, the line was permanently left open on speaker phone for the duration of the event. It was nice to be able to just call into the number and ask for some status item, but it could also be distracting to those working.
However you do it, be sure that you allow for more than one person to call in for meetings and updates. Chances are that an incident will occur when you have at least one person on vacation, out sick, etc., so plan for it in advance.
An incident has been declared and you've convened a meeting, now what? As silly or redundant as it might seem, the first thing you want to do is have someone give an update on what the incident is. People will have an idea, but it may only be a partial idea or they may not be aware of the scope or impact of the incident. It helps to have someone, preferably the person who initiated the incident response, give their firsthand impressions of what is occurring to be sure everyone is on the same page. This should be documented so anyone coming in late or joining the group knows what is being focused on.
Others should add information that is relevant and ask questions to determine the scope of the incident, both in assessing what needs to be done and in gauging the impact to the business. It is important to note that no one should be getting blamed for this. Let management deal with issues like this later. The incident response is the time to determine what is wrong and its impact to the business. You are not here to fix things that you don't like or to test someone's competence; you are here to deal with a specific issue: a virus attack, a network outage, etc. Work on that issue and don't expand the scope of things that need to get done. The goal is to respond to an incident, get it fixed and get back to your normal routine.
Once the scope of the problem is determined, the group should come to some consensus on how to proceed. Decide upon an action plan that will start to resolve the incident. Whether that is someone reconfiguring something, an order placed with a vendor for equipment, or something else, invite arguments, discussions, disagreements, etc. You want people to poke holes in the plan, find potential problems. Everything should be laid on the table so that the risks are understood by all. Most things will be a tradeoff and the incident was declared because there isn't an easy answer or quick fix.
Once you feel that the plan is solid enough to proceed, assign tasks to people, give a summary, and decide when the next update will occur before sending people to complete their tasks.
Updates are crucial to ensure that the system runs smoothly and people trust it. Even though you are in the "work as fast as I can to get things working" mode, not everyone knows that, especially management. And people will have questions and will be planning their future activities based on the information you give them. So give them regular updates. Even if you cannot solve the problem, give an update on what is being done.
At the least, updates should be every couple of hours. The only exception would be when there is a very small group, 2-3 people working on something overnight. You might defer an update for 6 hours then, but in general they should come more often. I've seen them as often as every 30 minutes, which I think was overkill. I'd say do them once an hour unless there is some overriding reason to change that.
Be sure that everyone with an action item returns to update the leader between 45-55 minutes after the last update. This ensures the leader is aware of what is going on and can have those updating him go back to work or remain in case there are questions. Even if your task will take 3 hours, give an update on what has been accomplished in the last hour and any revised estimate.
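The timing is simple arithmetic, but it is easy to lose track of in the middle of a crisis. Here is a minimal sketch, assuming hourly updates and the 45-55 minute check-in window described above; the function names and the example time are hypothetical.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(hours=1)      # assumes updates once an hour
CHECKIN_EARLIEST = timedelta(minutes=45)  # check-in window opens
CHECKIN_LATEST = timedelta(minutes=55)    # check-in window closes


def next_update(last_update: datetime) -> datetime:
    """When the next update meeting should start."""
    return last_update + UPDATE_INTERVAL


def checkin_window(last_update: datetime) -> tuple[datetime, datetime]:
    """When action item owners should report back to the leader."""
    return last_update + CHECKIN_EARLIEST, last_update + CHECKIN_LATEST


def checkin_status(last_update: datetime, now: datetime) -> str:
    """Return 'early', 'due', or 'late' relative to the check-in window."""
    opens, closes = checkin_window(last_update)
    if now < opens:
        return "early"
    if now <= closes:
        return "due"
    return "late"


# Example value only.
last = datetime(2024, 1, 1, 14, 0)
opens, closes = checkin_window(last)
print("Report to leader between", opens.time(), "and", closes.time())
print("Next update at", next_update(last).time())
```

If you change the update interval, the check-in window should move with it so the leader always hears from everyone shortly before the update starts.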
When giving the update, summarize the situation quickly for those that may be new, outline the plan and give updates on what has occurred to date and what is occurring in the future. Allow for questions and then schedule the next update. You want this to go quickly, so try and keep it under 10 minutes.
One more thing to think about is that the updates should not be limited to just the update meeting (including remote attendees). People that want or need to know what is happening are still leading their lives. They might be commuting, eating dinner, putting their kids to bed, etc. and not be able to attend or call in for the scheduled update. They still need to be informed, but you certainly don't want 25 people calling the incident response leader for an update.
At one company, a trouble ticket was used to track incidents, along with all other issues, and updates were required on the ticket every 30 minutes. At another company, a voicemail box was available and updated regularly for people to call into and get the latest status. Neither method is necessarily better than the other, but having some sort of non-real time update is important if you don't want to be saying the same thing over and over.
Work the Plan
After the plan is decided, each person should either have something to do, or may have a short downtime. In either case, if any person is leaving the meeting room, they should be sure that the leader knows where they are going, when they will be back, and how they can be contacted. There have been many times where someone leaves for a smoke or some other break and they are needed 5 minutes later. This should remain in effect for the duration of the incident. At any time the leader should be able to contact anyone.
You should also work only the action items you have been assigned. Getting creative, trying to "fix" something else while you're handling an incident is a great way to cause more problems, delay the closure of the event, and make more work for yourself down the road. Deviations from the plan, if necessary, should be approved by the leader.
You also should keep track of what you are doing so you can report back at the update meetings what occurred. These don't need to be detailed step-by-step instructions, but they should cover what was done, whether it was successful, and whether there were any side effects or failures.
There are a few other little things that you want to be aware of, or make your manager aware of when dealing with a crisis. These are fairly obvious, but they mean a lot to keeping people cool and working efficiently.
The first is that you should prepare for things to take longer than expected. I've been in this business for over a decade and absolutely everything takes longer than expected. That being said, there are two things that I always look out for when scheduling people. The first is that when an incident is declared I go into shift work mode. That means that if it's during business hours I send someone home with the expectation that I might call them to work overnight sometime. If it's overnight, the unlucky person that got stuck with it is immediately reprieved from coming in for some or all of the next day. I want people to be sure that their interests are being looked at and since I'm asking for above and beyond normal duty help, I'll reward them in some way. Also that I'm not going to work them on some 48 hour death march. Shift work is hard, but it enables us to keep working and more importantly, keep people rested. You may have to force people to leave at some point and be sure you watch them. I've seen people push for 24 hours then not want to leave because they're sure the answer is an hour away. It isn't and when you need them in 12 hours, they're not rested, focused, etc. You can't usually make huge rewards for people, so be respectful of the fact you're having them work over and above.
The other half of this is that whoever is managing needs to keep resources handy in case they are needed. But if they are not, or not anticipated, cut them loose. I've seen crisis situations run for 24 hours and people kept on for 8 or 10 hours more than they were needed because no one thought to let them go. Once this is over you want to get back to normal business and have people working as soon as you can. That won't happen if you burn everyone out. Keep it on your mind at every status meeting that you want to be looking for people to send home. The group will run smoother and you'll have more rested people if you need them. Lots of people will be afraid to ask to go and others won't want to go. The leader needs to be strong enough to handle both situations and send people home as soon as possible.
Food is the other big item that people worry about. If it's 11:00am and a crisis is declared, lunch plans may be cancelled. If it's 4:00pm, people immediately start thinking of dinner. The leader should do everyone a favor once the scope of the problem is determined and you have an estimate to resolution. If it's anywhere close to a meal, let people know that you'll be providing a meal at xx time. Take a few suggestions and then decide what the meal is and ensure that someone gets it at the proper time. Usually between updates the leader or scribe can take orders, contact some vendor and arrange delivery or go to pick up food. This will go a very long way to keeping people happy. If you have a 4 or 6am update meeting, be sure you bring breakfast of some sort. Spending $100 isn't going to impact the cost of the crisis and you will get a great return on the investment from your people.
Handling a crisis is a huge pain in the rear. It's hard, it stresses people, and it may push some of your people over the edge. Or make them quit if it happens too often. You should be ensuring that your daily procedures and practices are geared towards preventing incidents. Virus protection, patches, etc. should be part of your IT group's routine so that the number of incidents is kept to a minimum.
You also should try to reward people that work on the incident. Maybe not everyone, but for sure the people that have shined under pressure, worked extra hard, whatever. Even a $10 movie pass (get two so they can take their significant other) goes a long way after you've been working all night to solve something. I wouldn't set up a standard, but depending on the situation, reward people. It's an easy and inexpensive way to help retain them.
Lastly, while I hate responding to incidents after doing so for over a decade, I do know these are the times that you grow. You learn the most, you grow your skills, and you bond with your teammates. These times are important to building a winning and effective team, so try to make them go smoothly. There will be enough things that go horribly wrong without worrying about some of the items I've mentioned above.