My worst story as a DBA, with a double dose of trouble, is this:
The day began as usual: check the backups, check the free space on the SAN, check the open issues on the ERP, etc. Since it was not a month-end close, there were only a few extra tasks on my pile. And then it happened: big troubles have the bad habit of showing up looking like minor errors. A user reported that they couldn't access the ERP. I assumed it was just another forgotten password or locked account. I realized things were going badly when I was unable to access the DB myself. First dose: the database was down. Hmm, unusual, but no cause for panic. At that company we had an offsite datacenter that maintained all the hardware. This included upgrades, maintenance, physical security, etc. Since I had received no report of hardware trouble, I assumed it was just an "issue" in the DB, perhaps caused by too many open cursors. Since it had been a year since I last rebooted the DB and APP servers, I followed the standard procedure to restart them.
After the few minutes it took to complete the task, I checked the health of the DB and everything was OK: the APP was working fine, I had regained full control, and I believed all would be well. Bad assumption. The same error appeared one hour later, and then I started to worry. When I checked the logs in Red Hat, there was a spooky error: a hard drive malfunction. After I checked the drive and it did not throw any error, I started to believe that maybe it was caused by a missing upgrade or patch or something like that. Second dose: to make a long story short, I spent the whole day and part of the next just to figure out that there was a problem with the optical fiber from the SAN to the hub. Since the server had a dual fiber channel and a redundant fiber attached, it was necessary to shut down the faulty channel so that the redundant link would come up.
As you mention in the post, there was a chain of mistakes. My datacenter never got an alert from the infrastructure (at least, that is what they claim), and I had failed to test the redundancy of the fiber channel. There were some other mistakes, but I believe these two sum up the big ones.
Lesson learned: Murphy, always Murphy.