Notice I didn't title this "Plan to Fail." We should never plan to fail. That's sabotage. And that ain't right, as we say in the South.
However, when we do our planning, it is unrealistic to expect everything to work every time, especially when it comes to IT. Our systems are getting more complex with each passing year, and there are more and more points where a failure can occur. So it is realistic to plan for failure. After all, that's what recovery, and especially disaster recovery, is all about. Take, for example, the bridge in the picture (Photo: NOAA). This is the Ben Sawyer Bridge, and this is what it looked like after Hurricane Hugo got a hold of it. Needless to say, in this state, it wasn't usable for car traffic. That blocked off the only accessible means of reaching Sullivan's Island by normal land transportation.
When we architect systems and processes, we don't want to be in the same sort of situation. We want to make sure that if there is a failure, our systems will handle it gracefully. The only way they can is if we plan for failures to occur. And as we plan, we should consider the advice of those who have gone before us. For instance, consider the 8 Fallacies of Distributed Computing.
Fallacy #1 is an important one: the network is reliable. What that's basically saying is that it is a fallacy to assume the network is up every time you plan on using it. I have seen cases where someone was working on a server in a rack and the network cabling for a different server was affected. Sometimes this isn't obvious at all. For instance, the cable looks like it's still plugged into the back of the server, but it is just loose enough that good contact isn't being made. So now we have a physical fault and the network, at least as far as that one server is concerned, is not available. Or it could be a case like the one a few years ago, when there were issues with some of the Broadcom NIC drivers and we were affected. In our case, the loss of network connectivity couldn't be predicted. Everything worked okay and then, *blip*, the NIC was offline. What made matters worse was that as far as the OS was concerned, everything looked fine. Now the fix was a simple one: log on locally and disable and re-enable the NIC, or simply reboot the box. However, that did mean someone had to log on locally. Unfortunately, that KVM wasn't network enabled at the time.
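Since the OS can report a healthy NIC while nothing is actually getting through, one way to plan for this is to test the network by trying to use it, not by trusting link status. Here is a minimal sketch of that idea in Python. The function name `can_reach` and the idea of probing a specific host and port are my own assumptions for illustration, not something we had in place at the time.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True only if a TCP connection to host:port actually succeeds.

    Hypothetical helper: it does not trust what the OS says about the NIC,
    it attempts a real connection and treats any network error as "down".
    """
    try:
        # create_connection raises OSError (including timeouts and refusals)
        # when the connection cannot be established within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A check like this won't fix a dead NIC, but a job can call it before starting network-dependent work and fail fast with a clear message instead of dying partway through.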
So planning for failure is a proper part of the architecture design. Another part is looking to minimize the chances of failure. For instance, looking at fallacy #1, if I can get to a point where a network failure doesn't affect me, or where I can minimize the damage done, all the better. This may simply be a case of copying the flat file extract from the mainframe to the system holding SQL Server where the SSIS package is going to run. Sure, a network failure during the copy could prevent me from starting the data import, but that's a lot better than a network failure occurring during the data import because I'm importing said file across the network. The first case is easy to deal with. I restore the network connectivity, get the file copied over, and start the ETL process. In the second case, I've got to restore the network connectivity, but I also quite likely have to do some data clean-up, depending on how far along I was when the failure occurred.
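The copy-first approach can be sketched in a few lines. This is only an illustration in Python, not the SSIS package itself: `stage_locally`, `run_import`, and the `import_func` callback are hypothetical names, and the size comparison is a cheap sanity check I've assumed, not a full integrity check.

```python
import shutil
from pathlib import Path


def stage_locally(remote_file: Path, staging_dir: Path) -> Path:
    """Copy the extract to local disk; this is the only step that uses the network."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    partial = staging_dir / (remote_file.name + ".partial")
    final = staging_dir / remote_file.name

    # If the network drops here, an exception is raised and no import has started.
    shutil.copyfile(remote_file, partial)

    # Assumed sanity check: the staged copy should be the same size as the source.
    if partial.stat().st_size != remote_file.stat().st_size:
        partial.unlink()
        raise IOError(f"staged copy of {remote_file.name} is incomplete")

    # Rename only once the copy is complete, so a half-copied file is never imported.
    partial.replace(final)
    return final


def run_import(remote_file: Path, staging_dir: Path, import_func):
    """Stage the file locally, then hand the local path to the import step."""
    local_file = stage_locally(remote_file, staging_dir)
    # From here on, a network failure cannot interrupt the import.
    return import_func(local_file)
```

If the copy fails, the recovery is just "restore the network and run it again." There is no half-loaded data to clean up, because the import never touched the network.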
So plan for failure. And look to minimize the impact of failure. Two general steps to include in architecture design.