Triple Check Your Restores

I used to work at a nuclear power plant. Not really in the plant, but as a network administrator in the building next door to the plant. Probably a good thing since I struggled to get to work on time and everyone going into the plant had to go through a metal detector like that at most airports. My tenure might have been shorter if I had been late every day to my desk.

However, there was one thing that got drilled into me by each person I knew who did work closely with the power generation side of the business. Everything was (at least) triple redundant. Not only did they want a backup for a component, or a system, but they wanted a backup for the backup. There was a limit to the paranoia, but in critical places, where radiation was concerned, we had multiple backups. One notable, and impressive, area was the power for the control rooms and pumps. In addition to batteries for short term power loss, there was a large diesel generator for each of our two reactors, plus a third that could take the load if either of the first two failed. Those were impressive engines, each about the size of a very large moving truck and jump started with a few dozen large canisters of compressed air that could spin the crankshaft in a split second.

This week there was a report that the database for the US Air Force Automated Case Tracking System had crashed. Apparently the database became corrupted, which happens. However, the surprising part of this story is that the company managing this system reported they didn't have backups and had lost some data going back to 2004. They are looking to see if there are copies in other places, which I assume might mean exports, old backups or something else, but the reports make this seem like a completely unacceptable situation. I assume this is an RGE event for a few people, perhaps all of the staff working the system.

I was reminded of my time at the nuclear plant because we had a similar situation. We didn't lose any data, but we found a backup system hadn't been working for months. Those days we had a tape drive system that automatically rotated 7 tapes. I think this would last us about 4 or 5 days, so it was a once a week job for one administrator to pull the used tapes and replace them with new ones. We had a system where tapes were used 4 or 5 times before being discarded, and our rotation had a tape being used every 3-4 months. However, the person managing the system rarely restored anything.

One day we decided to run a test. I think this was just my boss giving us some busy work to keep us occupied but in a useful way. When we went to read a tape, it was blank. Assuming this was just a mix-up, we grabbed one of the tapes from the previous day and tried it.

Blank.

At this point, my coworker turned a bit red and started to stress. He was in his 40s, with a family and mortgage. I was in my early 20s and had no responsibility here, but I could appreciate his concern. We frantically loaded tape after tape, even looking at the oldest tapes we'd just received from our off-site provider. None were readable, and most were blank. We nervously reported this to our boss, who had us request a sample of tapes from off-site storage going back over 6 months.

Eventually we realized that we hadn't had any backups for about 4-5 months. The tape drive had stopped working properly, hadn't reported errors, but dutifully kept retrieving files and rotating tapes each week, unable to properly write any data. No databases, no email, no system was being backed up.

A rush order to our computer supplier had been placed the first day to get us two working tape drives that we manually loaded tapes in each day, and checked them the next morning. Eventually we replaced the drive in our tape leader and instituted random weekly restores to be sure we had working backups. I'm not sure if the plant manager or upper IT management was ever told, but I'm glad we never had to deal with a hard drive crash during that period.

Backups are something we all need to perform. I note this as the #1 thing a new DBA or sysadmin should perform on systems. However, backups are only good if you can read them and actually restore data. I've made it a point to regularly practice restores as a DBA, randomly restoring backups with diffs, logs, or to a point in time. Not only do I test the backup, but I test my skills. I've also tried to keep an automated process around that restores all production systems to another server to test both the restore as well as run a DBCC CHECKDB. Corruption can live in databases for a long time. It flows through backups, at least in SQL Server, and this is something to keep in mind.

I'd suggest that you make sure you ensure that your backup plan is actually working by performing a few restores. Build an automated process, but also run some manual restores periodically. You want to be sure that you can really recover data in the event of an emergency.