Over the years I have been fortunate, or unfortunate, enough (depending on your point of view) to have experienced several disasters, along with a couple of very near misses. In this post I am going to explore some of the areas of DR that are not related to the physical kit. Let me start by defining what I mean by a disaster: a disaster is not a cluster fail-over, it is not a drive failure in a RAID array and it is not a UPS failure. Disasters fall into two categories. The first is natural disasters such as floods, fires or earthquakes. The second is man-made disasters, for example infrastructure failure or even terrorism.
Disaster strikes when you least expect it! Disaster is unforgiving! Disaster doesn't care! One day it will come and it will bite you, of that you can be sure!
So let us take a look at the things you might not have considered.
Do you even have a plan? If not, then you had better start praying! If you do, you are in with a fighting chance. Everyone involved should have visibility of the plan, the actions required and the order in which they need to be carried out. It is pointless having the plan only on a file server in the building that is now ablaze! A physical copy is a great idea, and a copy on a laptop, tablet or even a phone would also suffice. The more locations the better, but always remember: everyone MUST have the same plan, so when it is updated, circulate it!
Who is required? Who is available? Not every member of the team will be required to help with the recovery process, and not every member of the team will be available; as I mentioned earlier, disaster is unforgiving, disaster doesn't care. Chances are some of the team members who would have been involved in the recovery process will be on annual leave, incapacitated or otherwise engaged. Personally, I don't think it is overkill to keep a weekly list of available personnel, complete with contact details; it will make a massive difference.
If you're lucky, your office will still be in one piece and you may be able to conduct all your work from there. If your office is still standing, can the required personnel gain access? Do you have keys, security fobs and access codes available? In the event that your office is inaccessible, what do you do? Is there another office or remote site that can be used? Again, can the required personnel gain access? Do you have keys, security fobs and access codes available? Is the VPN even available as an option? Things can deteriorate rapidly if there isn't somewhere from which the required personnel can work.
Depending on the extent of the disaster and/or your environment, you may be looking at a very lengthy recovery process. No one can function at 100% for 24 or 48 hours. What I find works here is to have as many bodies as possible working at the start, then to split into multiple teams working in shifts. This way a lot of the initial work can be started by as many people as possible, and is then continued by one team while the other teams get some much-needed rest in anticipation of taking over later in the recovery process. It is pointless having a team of 10 people working flat out for 24 hours, as mistakes will be made when fatigue sets in; it is much safer to have two teams of five working 6-8 hour shifts.
Who is going to be responsible for communications? You don't want 10 engineers sending out potentially conflicting messages to the business. You also don't want someone standing over your shoulder asking how long is left and where on the plan you are up to. Someone should be responsible for collating and communicating this information. This also helps the other way: the business has a single point of communication, with only one person to contact for updates.
No one can be available 24/7. There may be team members with children who need to be cared for; chances are they will be looked after by friends or family, but they may have to be brought into the office or remote site and "entertained". DR is not much fun for a 5 year old on a sunny Saturday afternoon in mid-summer.
Food & Drink
That is right folks, fuel! We all need watering and feeding; without it, after about 12 hours you will seriously start to flag, concentration levels will fall dramatically and, again, mistakes will be made. Someone should be responsible for keeping everyone fed and watered, with the relevant expenses agreed beforehand.
So you have a plan and know exactly who, where, when and how to recover. Great job, but that is only half the battle. Have you tested the recovery process? If the answer is no, then I would be inclined to say your plan will fail. Why? Well, in my experience, even with the best minds in the world, something will have been missed. Testing the recovery plan will highlight this: some config file, DLL or permission will not have been taken into consideration. The proof of the pudding is in the eating, so get testing!
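Even at a small scale, part of that testing can be automated. As a minimal sketch (the `restore_drill` helper, paths and archive format here are my own illustration, not any particular backup toolchain), a drill might archive a directory, restore it somewhere else entirely and verify that every file survived intact:

```python
# Hypothetical restore drill: back up a directory, restore the archive
# to a scratch location, and verify every file's checksum matches.
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksums(root: Path) -> dict:
    """Map each file's relative path under `root` to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def restore_drill(live: Path, scratch: Path) -> bool:
    """Archive `live`, restore it into `scratch`, and compare checksums."""
    expected = checksums(live)
    archive = shutil.make_archive(str(scratch / "backup"), "gztar", root_dir=live)
    restored = scratch / "restored"
    shutil.unpack_archive(archive, restored)
    return checksums(restored) == expected

# Demo with throwaway data
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    live = tmp / "live"
    (live / "etc").mkdir(parents=True)
    (live / "etc" / "app.conf").write_text("port=8080\n")
    (live / "data.bin").write_bytes(b"\x00\x01\x02")
    print("restore OK:", restore_drill(live, tmp))
```

A real drill would of course restore from your actual backup media onto standby hardware, but even a scheduled check like this catches the forgotten config file or permission before the disaster does.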
If all of the above are covered, then there is a good chance your business and infrastructure are in good hands. There will always be something that is forgotten, always some spanner thrown into the mix; it is up to us as professionals to cater for as much as is conceivably possible, so that when disaster does strike we are armed to the teeth with the tools to help us succeed.