Expect the Unexpected with DiRT

Disaster recovery is one of the core tasks that many DBAs think about on a regular basis. Ensuring that we can get our data back online, available, accessible, and intact is important. More than a few DBAs who haven't been able to recover systems have found themselves seeking new employment.

That's not to say that most DBAs perform perfectly under pressure. Plenty make mistakes, and there may be times when they can't recover all data. There does seem to be a correlation between how often DBAs practice recovery skills and how well they perform in an actual emergency. I know that at a few companies, we scheduled regular disaster tests, though often with a simulated recovery of a system that wasn't expected to actually take over a workload. Arguably not a good test, but better than nothing.

Google takes things a step further. They have annual, company-wide, multi-day DiRT (Disaster Recovery Testing) events. These span many departments and can cause substantial disruption to their infrastructure. This is a way for the various individuals responsible for infrastructure to actually evaluate whether they are prepared for potential issues.

If you read the article, you find that Google started small with these tests and progressed to larger, more inclusive ones, like taking down a data center. They also whitelist some servers that they know cannot pass a test, so there is no reason to actually take them down. After all, business still needs to work.

It's good to have tests and to walk through the mechanics of actual events, like call lists and bridges, to be sure that communication and documentation work. This might be especially important when teams expect that all their written procedures will be available. I went through an audit with one company where we failed immediately because all our DR plans were on a network share. In this simulation, we had experienced a network failure and servers had crashed. We were supposed to bring up the systems on spare hardware, but some critical documentation wasn't available without a network. We started printing things out right away so that we could continue with the simulation (as well as have a copy in a binder in our office).

Not everyone can schedule large-scale tests, and certainly many managers don't see the point. They'll often want to gamble that staff will "figure things out" if there is an incident. That doesn't mean that DBAs and sysadmins can afford to wait for a disaster to practice some skills. Be sure that everyone on your team can recover databases, that they know where backups are (or how to determine this), and that multiple people have access to resources. The last thing you want is for a disaster to occur during your vacation and to have managers calling you to cut short your holiday because you're the only one who knows where something is or has the authority to access a resource.

Think about this ahead of time and prepare.
