There was a story this week about a fire in a data center in Maryland. It was actually a police data center, and when the fire shut things down, users could no longer connect to databases. That's not a great situation for police who might be depending on access, and certainly not fun for citizens who might have traffic stops run long. Worse, officers might be unable to determine the status of suspects, which could result in more crimes taking place.
What's interesting is that this wasn't a big fire. In fact, the data center was running on generator power during routine work when a fire started, sprinklers went off, and the generators shut down. This is precisely when people are least prepared for a disaster: they're in the middle of work meant to prepare systems for one, which often means their primary systems can't come back online. While most of us would like to have separate, redundant data centers available, we often don't have the resources, and even if we did, many of us don't have the time to architect and implement systems that can easily switch back and forth between physical hosts.
How likely is this? You never know. I worked at a large software company, with over 10,000 employees, that had its entire data center reboot during routine maintenance when a wrench hit a backup UPS and dropped it offline. The primary UPS was already down for work, and our website and intranet were down for almost an hour. With the CIO in the building. I also worked at a company that had its data center in the office building. A bagel in a toaster caused the fire department to evacuate the building. One of our servers locked up, but we couldn't reboot it for customers until the building was re-opened a couple of hours later.
My point is that you never know what small disasters might cause you issues. It's not possible to prevent every problem, and it's certainly not feasible to have redundant equipment everywhere, but you can prepare in some ways. Perhaps one of the big ones is communication. Let people know, especially customers, when you are performing maintenance. Most of the time nothing will happen, and the communication doesn't matter. However, when something breaks, at least they'll be aware that you were working on systems and can respond quickly. Don't forget to update them regularly, even if it's just a matter of telling them you're still working on things and don't have an estimate.
Above all, be sure you have backups. Make sure these are accessible and stored off site. If something does go wrong, you want to be sure you've minimized the loss of data.