Most of us who work with technology hate downtime. We don't want a system that we're using to go down. We don't want any software that we depend on to fail when we need it. Most of all, we don't want our phones ringing because some system we're responsible for has gone down. We do everything we can to keep our applications online. We avoid patches. We try to test as much as possible before deploying changes. We may also apply generous amounts of hope and prayer.
However, that's not how all companies run their internal systems. Netflix has taken the opposite approach, actually creating downtime for some of their systems using what they call the "Chaos Monkey," and they think it could help you. To be fair, Netflix doesn't take their entire application offline, but they do deliberately cause failures in the hardware and software, specifically to see if their redundant and scaled-out architectures can limit the impact on users.
It's an interesting idea, though one that I've not seen many companies willing to implement. Netflix thinks you could benefit from it, but they also run a series of services that are scaled out across many machines. Many companies I've worked with have services on one machine handling an application, and they accept the risk that a system might fail and users will experience problems. Given the quality of modern hardware, that might be a good bet to place these days.
However, more and more of us are running redundant systems for some applications. If you think the Chaos Monkey could help you, let us know.