Disaster. The word sends chills down the spine, but what exactly is it? Disasters come in all shapes and sizes but I would suggest that for most DBAs a succinct definition is "the loss of infrastructure or data that disables an IT system".
Because of Murphy's Law, it is impossible to plan for every possible system disaster you could encounter as a DBA. For example, I have never written a recovery plan to be used in the event of a zombie attack on the data center. The odds are long, yet someday it could happen! (They say in the Multiverse, it already has!!) Because we cannot plan for every possible outcome, recovery plans must be, first and foremost, flexible.
During some recent business continuity exercises, I developed some level of acumen around developing technical recovery plans. While I can't provide all the answers for you, below is a list of key concepts that should probably be addressed in any recovery plan worth its salt. You may even be able to survive the zombie plague if you play your cards right.
The first thing to determine is the scope of the recovery plan. Are you writing one plan per application and involving the application's support team (expensive, but accurate), or are you writing a general plan for the loss of a SQL Server instance or server (cheaper, but missing the information needed to recover individual applications)? I always err on the side of caution, and my personal recommendation is to scope your recovery plans to one application per plan and generate several plans.
Define Acceptable Data Loss
One of the primary concerns of database administrators is recovering databases from backup.
You need to confirm with the business exactly how much data loss is acceptable. Come to an agreement and have every stakeholder sign off on the terms. To recover successfully, you need to ensure that your database recovery model supports the terms and that you are doing whatever it takes to meet those terms 100% of the time. This might include adding monitoring for backups and monitoring that backups have gone to tape (if you back up to disk first). It might also include monitoring that log backups are happening every X minutes. The simplest way to think about it is to find out what the business needs, then do whatever it takes to guarantee you will meet or exceed those needs.
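As a rough sketch, a check like the following could be scheduled to flag databases whose log backups have fallen behind the agreed window. The 15-minute threshold is purely illustrative; use whatever number the business signed off on.

```sql
-- Flag FULL-recovery databases whose most recent log backup is older
-- than 15 minutes (illustrative threshold), or that have none at all.
SELECT d.name,
       MAX(b.backup_finish_date) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'L'                 -- 'L' = transaction log backup
WHERE d.recovery_model_desc = 'FULL'
  AND d.name NOT IN ('master', 'model', 'msdb', 'tempdb')
GROUP BY d.name
HAVING MAX(b.backup_finish_date) < DATEADD(MINUTE, -15, GETDATE())
    OR MAX(b.backup_finish_date) IS NULL;
```

Wiring a query like this into your existing alerting is what turns the signed-off terms into something you can actually guarantee.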
Define Acceptable Loss of Availability
This task is very similar to defining acceptable data loss. You need to determine how long an application can be down so that you can build the infrastructure and monitoring to support those requirements. This will probably already be in place, but document the details in your recovery plan and have the business stakeholders sign off on it. The Internet is teeming with articles on designing high availability solutions for SQL Server, but you must still ensure that whatever route you take can (and will) be tested periodically to make sure it really works.
Define the Recovery Plan's Granularity
You'll need to decide how detailed your recovery plan will be. You can simply state in a single step that "the DBA will recover the database", or you can go into greater detail, such as listing all the SQL commands needed to do the restore, where the backup files are located, how to pull them off tape, and so on. My personal advice is to list the simple step, and then create an appendix to the document that lists the more detailed steps. I usually stop short of screenshots, but that's my own personal threshold.
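For illustration, an appendix entry for the detailed version might look something like this; the database name and backup paths are hypothetical placeholders for whatever your environment uses:

```sql
-- Appendix X: restore steps for SalesDB (names and paths illustrative)
-- 1. Restore the most recent full backup, leaving the DB in RESTORING:
RESTORE DATABASE SalesDB
FROM DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
WITH NORECOVERY, REPLACE;

-- 2. Apply each log backup in sequence, oldest first:
RESTORE LOG SalesDB
FROM DISK = N'\\backupshare\SalesDB\SalesDB_log_01.trn'
WITH NORECOVERY;

-- 3. After the last log backup, bring the database online:
RESTORE DATABASE SalesDB WITH RECOVERY;
```

The single-step version in the plan body keeps the document readable; the appendix is what you actually follow at 3 a.m.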
Integrate Application Domain Knowledge into the Plan
Most DBAs know how to recover a database to a point in time by using the RESTORE command, but do you and your application team know how to recover the application? Maybe there are XML files that require updating because you're using a new connection string to a new DB server. Maybe you can't recover the DB backend without restoring the application VM to the same point in time (or close). There are many different dependencies in today's complex and integrated architectures. Much of this is not the domain of the DBA, but do your best to guide your colleagues toward having a complete recovery plan for the rest of the application. At the very least, you should integrate enough domain knowledge about the application's databases that you aren't scrambling to figure out which server, instance, and databases need to be recovered.
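As one concrete example of that kind of dependency, lining the database up with a restored application VM usually means a point-in-time restore that stops log replay just before the incident. A sketch, with a hypothetical database name and timestamp:

```sql
-- Point-in-time recovery sketch: full backup and earlier logs have
-- already been restored WITH NORECOVERY; stop replay of the final log
-- just before the incident so the DB matches the restored app VM.
RESTORE LOG SalesDB
FROM DISK = N'\\backupshare\SalesDB\SalesDB_log_07.trn'
WITH STOPAT = '2024-03-15 02:10:00', RECOVERY;
```

Knowing which timestamp to use is exactly the sort of application domain knowledge the plan should capture.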
Develop a Communication Plan
The foundation of any disaster recovery plan is the communication plan. Who is contacted in the event of a disaster? How are they contacted? When are they contacted? How often are they updated? Each application team along with the DBA and ancillary groups needs to flesh out those communications with enough detail so that communication is smooth and orderly during the recovery process.
Include contact information for the primary application team, the DBAs, the System Administrators, Network Administrators, Storage Administrators, business stakeholders, and any other person that you could think might possibly need to be included.
The communication plan should also include some brief criteria for determining when, and by whom, the technical recovery plan is invoked. Here is an example excerpt of a communication plan:
The On-call technician calls the team supervisor after troubleshooting steps A, B, C and D reveal that there is a system disaster (see Appendix B for details on troubleshooting steps)
The supervisor decides to invoke the recovery plan.
Conference bridge opened up on 1-888-999-9999
Calling tree invoked for application team (see Appendix C: Calling Tree for Application X)
Communication delegate assigned by team supervisor
Enterprise Helpdesk notified (1-866-123-3456; email@example.com)
Business Stakeholders notified (See Appendix D: Business Stakeholder Contact Information)
It might go without saying, but all the names and phone numbers of the individual resources should be updated at least annually along with the rest of the plan.
Cover Your Bases
A recovery plan should be generic enough to fit most possible scenarios, but as you develop the plan, be sure to run a few common scenarios through the paces to see if the plan breaks down.
A thorough recovery plan for a critical application might even have a different communication plan for each scenario. Here are some example scenarios:
Full database recovery
Partial database recovery (recovery to a different DB + data massage)
Storage disaster (restore from tape + several full database recoveries, switch to remote data center/backup site)
Data Center disaster (backup site, rebuild)
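For instance, the partial-recovery scenario above typically means restoring a copy of the database alongside the live one, then massaging the needed data back across. A sketch with hypothetical names, logical file names, and paths:

```sql
-- Restore a side-by-side copy of the database for partial recovery.
-- Logical file names must match those in the backup (illustrative here).
RESTORE DATABASE SalesDB_Recovered
FROM DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
WITH MOVE 'SalesDB'     TO N'D:\Data\SalesDB_Recovered.mdf',
     MOVE 'SalesDB_log' TO N'E:\Logs\SalesDB_Recovered.ldf',
     RECOVERY;
```

Walking a scenario like this through the plan quickly reveals gaps, such as whether the recovery server even has the disk space for a second copy.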
Test the Plan
After you have crafted your recovery plan and believe you have everything in order, you should schedule time with the primary recovery team for a tabletop exercise. The tabletop exercise should simply test the recovery plan's viability against one or two disaster scenarios as a group. Each recovery team member should be represented in person or by a surrogate actor (e.g., the team lead can be the surrogate for the helpdesk or the CEO and act on their behalf for the purpose of the exercise). As you go through the plan, be very critical of the places where it breaks down. Look closely at the assumptions the plan makes. Challenge yourselves to produce a better plan every time you go through the exercise. The tabletop exercise should be scheduled at least once per year for mission-critical applications; for small, internal applications that don't change, it might be enough to do an initial exercise and not schedule another one.
Update the Plan
Depending on the granularity of your recovery plan, keeping it up-to-date can be a daunting task. A periodic tabletop exercise is a great place to find steps that have become out-of-date due to changes in process or the product, but apart from that, there really is no simple way to keep the plan current.
Updating the plan is a matter of due diligence, or of building a review step into some other process that is triggered by changes to the system (such as your PMO process or architectural review process).
Unless you can wrap it into a process or do a periodic review specific for the recovery plan, it will very likely become stale, which is almost as bad as not having a recovery plan at all.
Having a well-established, well-documented, and versatile recovery plan is worth every penny of the cost of developing and maintaining it. Knowing with absolute certainty that you can quickly and efficiently recover your mission-critical applications (or ride out the zombie apocalypse in the data center) is priceless.
Thanks for reading.