I know it makes the DBA look bad, but yeah, we have a lot of unplanned downtime on production systems. Bear in mind that we have over 1600 databases on ~140 different servers and 2 DBAs, so statistically we do OK with uptime.
Number one cause: too many people with admin rights and not enough communication. It is a cultural thing here that I inherited and can't change. As a result, we do a lot of firefighting.
Number two cause: budget constraints. Customers want High Availability for everything but can't pay for it. A good example is SAN failures; I won't name the specific brand, but hey, they were cheap for a reason.
Other causes of unplanned downtime included:
- Network switch failures
- Misconfigured antivirus software killing clusters
- Autopatch turned on accidentally
- AD management failures (helpdesk had power to reset service account passwords and used said power)
- Rarely, a CPU, mainboard, memory, or other physical hardware failure