• Perhaps most importantly:

    The first, easy part:

    If a server is somehow rendered unstable, excessively slow, incorrect, invalid, unusable, or outright inoperable, what's the plan... and do you ever actually test it?

    The second, hard part:

    Same as the above... but on many or all servers.

    At many companies, even very large ones, the conversation between the hardware/OS/low-level team and the application/database team goes much like this:

    Hardware: "We've got backups."

    App: "So it'll be just like it was before X happened?"

    Hardware: "Of course not - we only back up the data/your SQL Server .bak files!"

    App: "Oh. So what now?"

    Hardware: "We've installed the operating system on [the old | some new] hardware."

    App: "So now we have to reinstall our application? From scratch? We haven't done that in Y years! And those people aren't in our team anymore!"

    Hardware: "If the OS or hardware has issues, call us."

    App: "What were all the settings we had?"

    ...

    App: "It's up!"

    User: "Z feature is broken!"

    App: "Oh... there was an exception we had to do Q for."

    GOTO User

    And this takes a while, but it's the normal response to one server failing. If a mass update causes multiple servers to fail at once, it becomes a real nightmare... especially if the backup servers are also affected.
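
    One way to shorten the "what were all the settings we had?" scramble in the exchange above is to snapshot instance-level configuration on a schedule and keep it in version control. A minimal sketch, assuming SQL Server (as in the dialogue), the pyodbc driver, and a hypothetical server name and output path; database- and application-level settings would need the same treatment:

        # settings_snapshot.py - sketch: dump sys.configurations to JSON for version control
        import json
        import pyodbc

        CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=prod-sql01;Trusted_Connection=yes")    # hypothetical server

        def snapshot_settings(out_path="prod-sql01-settings.json"):   # hypothetical path
            cur = pyodbc.connect(CONN).cursor()
            # value_in_use is a sql_variant; cast so the driver returns a plain int
            cur.execute("SELECT name, CAST(value_in_use AS int) "
                        "FROM sys.configurations ORDER BY name")
            settings = {name: value for name, value in cur.fetchall()}
            with open(out_path, "w") as fh:
                json.dump(settings, fh, indent=2)
            print(f"Wrote {len(settings)} settings to {out_path}")

        if __name__ == "__main__":
            snapshot_settings()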

    Drive images capable of bare-metal restore are a very good solution to this... but almost no one actually makes them for servers, and they take a lot of space, more still if you have encrypted or otherwise incompressible data.
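
    The space problem is easy to demonstrate: well-encrypted data is statistically indistinguishable from random bytes, and random bytes barely compress. A tiny self-contained illustration using Python's standard gzip module, with made-up sample data:

        # compressibility_demo.py - why encrypted data inflates image sizes
        import gzip
        import os

        structured  = b"config files, logs, and mostly-text data\n" * 100_000  # compresses well
        random_like = os.urandom(len(structured))   # what well-encrypted data looks like

        for label, blob in (("structured", structured), ("encrypted-like", random_like)):
            packed = gzip.compress(blob)
            print(f"{label:15s} {len(blob):>10,} B -> {len(packed):>10,} B "
                  f"({len(packed) / len(blob):.0%} of original)")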

    Whatever your "it caught on fire" plan is, if you don't actually try it, and plan for executing it on many machines at once, there's a lot of room for a nasty surprise.
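
    What "trying it" can look like in miniature: a scheduled drill that restores the most recent full backup onto a scratch instance and runs a smoke check, so the first real restore isn't also the first attempted one. A sketch only, again assuming SQL Server; the server name, backup path, logical file names, and smoke-test table are all hypothetical placeholders:

        # restore_drill.py - sketch: restore the latest full backup onto a scratch instance
        import pyodbc

        SCRATCH = ("DRIVER={ODBC Driver 17 for SQL Server};"
                   "SERVER=scratch-sql;Trusted_Connection=yes")      # hypothetical instance
        BAK = r"\\backupshare\prod\LatestFull.bak"                   # hypothetical path

        RESTORE = f"""
            RESTORE DATABASE DrillCopy
            FROM DISK = N'{BAK}'
            WITH MOVE 'ProdData' TO N'D:\\drill\\DrillCopy.mdf',     -- hypothetical logical names
                 MOVE 'ProdLog'  TO N'D:\\drill\\DrillCopy.ldf',
                 REPLACE;
        """

        def run_drill():
            # RESTORE can't run inside a transaction, so autocommit is required
            cur = pyodbc.connect(SCRATCH, autocommit=True).cursor()
            cur.execute(RESTORE)
            while cur.nextset():          # drain the restore's progress messages
                pass
            # Smoke test: an application-critical table should exist and have rows
            cur.execute("SELECT COUNT(*) FROM DrillCopy.dbo.Orders")  # hypothetical table
            rows = cur.fetchone()[0]
            assert rows > 0, "restore produced an empty Orders table"
            print(f"Drill OK: {rows:,} rows in Orders")

        if __name__ == "__main__":
            run_drill()

    Running something like this regularly, and occasionally against several scratch instances at once, is about the closest you can get to rehearsing the mass-failure case without actually having one.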