Incident Response - The Framework
Disaster recovery has been on my mind lately. With Hurricane Charley rolling through Florida recently (it's August 2004 as I write this), and my partner Andy without power for nearly a week in central Florida, we've had some discussions about how to prepare for a disaster and the value of doing so. I'm not big on getting too detailed in a recovery plan, but I do agree that there are some good reasons for setting up a framework under which to work. This article looks at some ideas for such a framework based on my experiences.
One of the neat things that I saw set up at J.D. Edwards when I worked there was an Incident Response Team. This was a group of people in IT who were designated to be called in to triage any type of incident and determine the course of action. There were similar groups on the business side of things, but I didn't really work with them at all. These people all had two-way pagers and cell phones and would get called in for any type of incident.
So what's an incident? Well, a disaster of any sort certainly qualifies. A natural disaster, fire, tornado, etc. could initiate an incident. So could a virus, a major network outage, or pretty much anything that interrupts a large set of users. The team members came from different areas, and each area had two people, a primary and a backup. This was to ensure that someone would always be available; the primary and secondary were not supposed to be on vacation, out of town, or otherwise unavailable at the same time. In practice this wasn't always the case, but with a good-sized group it tended to work well, and any qualified person could always be "drafted" if need be.
The team included people from different disciplines as you might have guessed. A sampling of job functions from which people would be drawn includes:
- Network (NOC, physical network infrastructure people)
- Windows - General AD and Windows domain infrastructure people
- Email - Server side, Exchange people
- Desktop - Both desktop support (help desk) and engineering, antivirus, scripting, etc. people
- Unix - General Unix infrastructure people
- Database - Can't forget the DBAs can we?
- Application Engineering - Those who know how the internal applications work.
- Director level managers - Someone needs to lead the group and have the appropriate level of authority to ensure things get done.
There may be different groupings or areas that you want represented depending on your organization, but it should be fairly easy to identify the major areas for your company. You might include voice specialists if that's needed, or you might need a representative from your change management group. You will want a primary and secondary person from each area and be sure they are equipped with some sort of technology to enable you to get in touch with them. Quickly!
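The primary and backup rule is easy to check mechanically if the roster is kept in a structured form. What follows is only a minimal Python sketch of that idea, not something the team described above actually used. The `Member` record, the `coverage_gaps` helper, the area names, and the people are all hypothetical sample data.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    area: str               # e.g. "Network", "Database"
    role: str               # "primary" or "secondary"
    available: bool = True  # False while on vacation, out of town, etc.


def coverage_gaps(team, areas):
    """Return {area: [problems]} for every area that is not fully covered."""
    gaps = {}
    for area in areas:
        members = [m for m in team if m.area == area]
        problems = []
        # Each area should have both a primary and a secondary assigned.
        for role in ("primary", "secondary"):
            if not any(m.role == role for m in members):
                problems.append(f"no {role} assigned")
        # The primary and secondary should not both be unavailable.
        if members and not any(m.available for m in members):
            problems.append("nobody currently available")
        if problems:
            gaps[area] = problems
    return gaps


# Hypothetical sample roster.
AREAS = ["Network", "Database"]
team = [
    Member("Alice", "Network", "primary"),
    Member("Bob", "Network", "secondary", available=False),
    Member("Carol", "Database", "primary", available=False),
]

print(coverage_gaps(team, AREAS))
# {'Database': ['no secondary assigned', 'nobody currently available']}
```

Running a check like this whenever vacation schedules change would flag an area before an incident does.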
Now that you've assembled a team, there are a few things that should be pre-staged in your environment. The first of which is documentation. Now I'm not talking about how you restore a failed database server, I'm talking about more basic documentation. Here are a few things you should have available in both electronic AND paper form:
- Contact information for all team members
- Management contact information
- Vendor Contact Information
- Other IT Employee Contacts
- Food/Beverage Companies and Menus
- Nearby Accommodations
- Client Contacts
Below I've provided a little more detail on each of these items and what the purpose is or what it should consist of.
Team Contact Information
This is basic stuff, but it includes home, work, cell, pager, etc. contact information for each person on the team. Each person should be designated as primary or secondary, and the list should be logically ordered in some fashion. I use alphabetical by first name since most of us work with each other by first name. Be sure each person's specialty is listed prominently as well. You never know who will be charged with contacting the "primary network guy", and they might not know who that is, so be sure all the information is listed. I've also seen contact information for nearby relatives listed for key employees who may move to a different place in an emergency. For example, Andy's mother and sister live nearby and it would make sense to list their numbers. After Hurricane Charley, he didn't have power, but his mother did, so it made sense to be able to contact him there.
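If the contact list is kept electronically as well as on paper, a small structure like the one below can produce the printed sheet and answer the "who is the primary network person?" question. This is a hedged sketch only: the `Contact` fields, the helper functions, the names, and the 555 phone numbers are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    first_name: str
    last_name: str
    specialty: str   # listed prominently, e.g. "Network"
    role: str        # "primary" or "secondary"
    phones: dict = field(default_factory=dict)            # label -> number
    nearby_relatives: dict = field(default_factory=dict)  # relation -> number


def sorted_roster(contacts):
    """Order the roster alphabetically by first name, then last name."""
    return sorted(contacts, key=lambda c: (c.first_name.lower(), c.last_name.lower()))


def find(contacts, specialty, role="primary"):
    """Look up people by specialty and role, ignoring case in the specialty."""
    return [c for c in contacts
            if c.specialty.lower() == specialty.lower() and c.role == role]


def format_entry(c):
    """One printable line per person, with specialty and role up front."""
    numbers = ", ".join(f"{label}: {num}" for label, num in c.phones.items())
    return f"{c.first_name} {c.last_name} [{c.specialty} / {c.role}] {numbers}"


# Hypothetical sample data.
roster = [
    Contact("Sam", "Lee", "Database", "primary",
            {"cell": "555-0101", "home": "555-0102"}),
    Contact("Pat", "Jones", "Network", "primary",
            {"cell": "555-0103", "pager": "555-0104"},
            nearby_relatives={"mother": "555-0105"}),
    Contact("Alex", "Kim", "Network", "secondary", {"cell": "555-0106"}),
]

for contact in sorted_roster(roster):
    print(format_entry(contact))

print(find(roster, "network")[0].first_name)  # Pat
```

Whatever form the list takes, print it after every change so the paper copy stays current.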
Management Contact Information
The biggest part of successfully managing and getting through an incident is communicating with those affected. Usually this means upper management. More important than fixing the problem is letting people know that you are working on it, what the status is, what major issues or support needs have come up, and when you estimate things will be accomplished. Even if you have to change your estimate, calling your VP every couple of hours to update him goes a long way; much further than telling him 4 hours, being out of contact for 3 hours and 55 minutes, and then calling to say nothing worked and it's another 4 hours.
However you handle it, you definitely want to be able to contact any of the senior management you need to, which means not fumbling for their contact information. This should be nearly as detailed as the Team contact information.
Vendor Contact Information
Face it; things break and you have to fix them. Without a doubt, at some point in your IT career some piece of vendor-supplied hardware or software will break and you'll need to contact the vendor. It is much easier if you have a central list of vendors, your customer numbers, phone numbers, salespeople's names, keys, etc. Having this list handy at 3am helps when you call Cisco and they ask you for it, because you need support for the firewall that's being DoS'd and the CTO has asked you to call them. Figuring out this information is a pain, so take some time and assemble it so it's available. Buy something new? As part of the receiving process, jot down the relevant information in your Incident Response documentation.
Other IT Employee Information
Similar to the contact info for the team. You never know when you might need the one application developer that worked on the one project that just blew up.
Procedures
Again, these aren't the "To Restore The Exchange Server, you need to..." documents. These procedures spell out how to handle the incident itself. They should be fairly short, but make a few decisions up front so there is no grey area. A few of the things that I've seen put in here:
- Who's in charge - List clearly that the directors are in charge and, if they can't be reached, it's person A, then B, then C, etc. You want to be sure that you pick people who can take charge of the group and ensure that procedures are followed. Not many people can do this well.
- Updates - Spell out how updates will be given and received. I've seen a dedicated voice mailbox in some companies, dedicated bridge (conference call) lines, posted at some external web site, etc. for the updates. Communication is critical and you don't want each person on your team being called individually by different people all trying to get updates. Spell out the method and frequency of updates as well as a method for people to get their own updates from the team without bothering them.
- Notes - You want to be sure that notes are taken regarding the assessment and courses of action, including what works and what doesn't. It helps to designate someone to be responsible for this during the incident, and to set up a place to keep these notes ahead of time. Be sure you allow for non-electronic note storage.
- What's not allowed - Have a list of actions that are not allowed, like ordering new hardware, spending above $xx, etc. These are things that you don't want done without some additional approval. Where I've seen this, the spending limit was usually US$1,000, though more could be spent if the VP or CIO was called.
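The "who's in charge" and "what's not allowed" decisions above are simple enough to write down as data. Here is a rough Python sketch under stated assumptions: the chain of command uses placeholder names taken from the bullet above, the US$1,000 limit is the figure mentioned there, and the function names are hypothetical.

```python
# Placeholder chain of command, in the order people should be tried.
CHAIN_OF_COMMAND = ["Director", "Person A", "Person B", "Person C"]

# Spending limit without extra approval (the figure quoted above).
SPEND_LIMIT_USD = 1000


def incident_lead(chain, reachable):
    """Return the first person in the chain who can be reached, or None."""
    for person in chain:
        if person in reachable:
            return person
    return None


def spend_allowed(amount_usd, executive_approved=False, limit=SPEND_LIMIT_USD):
    """Allow spending up to the limit; more only if the VP or CIO approved it."""
    return amount_usd <= limit or executive_approved


# The director and Person A cannot be reached, so Person B takes charge.
print(incident_lead(CHAIN_OF_COMMAND, {"Person B", "Person C"}))  # Person B

print(spend_allowed(250))                            # True
print(spend_allowed(2500))                           # False
print(spend_allowed(2500, executive_approved=True))  # True
```

The point is not the code itself but that these rules are decided, and written down, before anyone has to argue about them at 3am.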
Food/Beverage Companies and Menus
My favorite part. And this is the area of documentation that will get used the most!
Everyone needs to eat. And while they're working on a problem, the last thing you want them worrying about is how they are going to get their next meal. Do they need money from the ATM? Who's open? Do yourself a favor and compile a list of nearby restaurants with their hours, what they serve, phone numbers, directions, and menus, and have it available. It will greatly ease the process of getting your people fed when you need to. And it will generate lots of goodwill when you provide them with dinner.
A few hints: try to pick places that deliver so you don't have to send someone out. Get a variety of food to appeal to different tastes, but don't go overboard. Picking 20 places when 6 will do only increases the confusion, as too many people have differing opinions. Try to find someplace that is open all night. That's not always possible, but it does seem that things like to break at night. You might also stock up on some snacks in the office for emergencies, perhaps replacing them monthly and just giving away the old stuff.
Nearby Accommodations
Hopefully you won't need this section, but if you do, it's handy to have ready. Get a few hotels approved by management ahead of time in case you need them. I've often worked with people who had long commutes, and having them drive home after working 22 hours, or asking them to drive an hour each way and be back in 4 hours, isn't very respectful or smart. If you need them, get a couple of rooms and have people take shifts sleeping. Most people will want to go home, but it isn't always an option, so be prepared.
Client Contacts
This is not applicable to everyone, but I have worked in some companies where I might need to notify clients of a major issue or get in contact with them to change something. Whatever the reason, if you think you might need it, include it just in case. Be sure you have multiple ways to contact people wherever possible.
This is the first part of the Incident Response Framework. In the next article, I'll look at how the group should function in an incident and some of the things I've seen done well and not so well in handling things. Unfortunately I've a bit more experience with this sort of thing than I would like.
It seems like this might be overkill at some companies, but even in my little 30-person company, having some of these ideas in place would have helped us in quite a few situations. It makes things run a little smoother and definitely helps to keep people working together. I'll admit that the team might have been 4 people total out of 30, but it still would have helped.
Give me some feedback and perhaps we can come up with a framework that many people will agree with.
©dkRanch.net August 2004