Another Disaster (Almost)

By Andy Warren

Back in July 2002 I wrote Disaster In The Real World - #2, my real-life adventure about a server going down. Now, about six months later, I almost got to relive the whole thing again! Read on to hear how things can go from bad to worse to OK and back to bad again.

Thursday afternoon I was sick with a cold and a terrible headache, so I called it a day about 3pm. I'm two minutes from home when Sean calls my cell phone: the data drive on the reporting server is missing. Turn around and go back to the office, check in with our network dude (ND) to see what is going on. The drive dropped for no apparent reason, the server itself is OK, so he checks the status of the SCSI containers. You know the one that usually says 'OK'? Now it says 'DEAD'. Great. Another massive restore that will take most of the night. Knowing that we can do the restore in time to be up the next day, we decide to work the problem for a while. The difference this time is that we can see the container even though it's dead; last time the container was just gone.

Three hours later, still on the phone. After various bits of voodoo, including a time when it stopped even booting to the OS on C, they get the container to change status to 'Scrubbing', decide we need to replace the controller card. I go home to get some sleep while we wait on the new card. 11pm our ND calls me, has the card, will be morning before the drive scrub is complete, he will replace the card then. Sounds better than staying up all night, so we decide to regroup at the office at 5 am.

Next morning we replace the card, and the drive goes back to scrubbing again. Not what we planned. We boot to Windows since we're out of time; performance starts out bad and gets worse. No utility is installed that would let us check the status of the scrub or change its priority. Reporting, time and attendance stuff, and one fairly critical app reside on this server, plus the db where we log both errors and informational messages from our main app. By 9:30 they are screaming: our main app is crawling. The reason is that we log both text and a screenshot to aid in diagnostics, and the inserts are taking 45 seconds or longer. After a couple of minutes' thought, we decide to change the password on the error logging login. The component is built so that if the connection fails, it writes the data out locally and tries to post it the next time the app starts up, so breaking the login takes the slow inserts out of the picture. That gives us a pretty good boost. Now everyone wants to know when the other applications will be usable. We tell them 5pm to err on the worst-case side; we're figuring by 1pm we'll be good.
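To make the fallback behavior concrete, here is a minimal sketch of that kind of component in Python. It is not our actual code: the class name, the `post` callable and the spool file format are all made up for illustration. The idea is only that a failed post gets written to a local file, and that file is replayed the next time the app starts.

```python
import json
import os


class SpoolingErrorLogger:
    """Hypothetical error logger: posts to a server, spools locally on failure."""

    def __init__(self, post, spool_path):
        self.post = post              # callable(record); must raise if the post fails
        self.spool_path = spool_path  # local file holding records not yet sent
        self.flush_spool()            # app startup: retry anything left from last run

    def log(self, record):
        try:
            self.post(record)
        except Exception:
            # Connection or login failed: keep the record locally instead.
            with open(self.spool_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")

    def flush_spool(self):
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        unsent = []
        for i, record in enumerate(pending):
            try:
                self.post(record)
            except Exception:
                unsent = pending[i:]  # server still unavailable, keep the rest
                break
        if unsent:
            with open(self.spool_path, "w", encoding="utf-8") as f:
                for record in unsent:
                    f.write(json.dumps(record) + "\n")
        else:
            os.remove(self.spool_path)
```

With a design like this, changing the login password makes every `post` fail fast, so the app pays for a quick local file write instead of a 45 second insert.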

Drive time settles back down to normal about 8:30 am the next day (Sat). I email that everyone can resume normal operations and turn on some processes I had disabled. Run a DBCC on everything, no problems, set the backup to run later that night, log off.

Monday early everything looks OK initially, then we see drive times way up. Further investigation reveals the backup finished, but took about 5 times as long as usual. Bad. Our agents go on break around 10 am, so we decide to bring the server down then to upgrade the card BIOS and put the management software on the box so we can check the status. Everything goes well, we replace the other SCSI card for the tape drive, reboot, and get an 'operating system not found' error. Double-check that everything is connected, seated, etc., reboot, same thing. Call support again. Break lasts 15 mins, we are way over that already. Company meeting scheduled for 11:30, we miss that too, our CIO brings us lunch (great boss!), we finally get it to boot...by using a floppy! Doesn't fix the problem, but at least we're back in business, and we can finish troubleshooting after 5pm or on the weekend. This is about 1:30 pm. We set the scrub priority to low, disk time is cool, time for a break.

So, a happy ending?

Not quite. SQL won't start, says the master can't be recovered. Not what I want to hear. Thinking wishful thoughts, I get something to drink, give it 10 minutes, try again. Same thing. OK, I have the previous night's backup on disk, so I restore the master. 20 of 200+ databases are corrupt, apparently the result of a bad shutdown with the caching controller. Almost all of them are replicated copies that aren't immediately critical and that I can fix by just dropping the db, putting it back, and doing a new snapshot. The other five I have to check on to see how badly restoring from the previous night will hurt us. Turns out OK, so I restore those five first, then work on the other ones. By 6pm I've got most everything working again, and performance is still good.

Along the way we decide that one db is too important to risk additional downtime, so I detach it and copy it to our main server. Because it's a replicated copy, I debate whether to snapshot it across the 256 link to the publisher, or try to repoint the subscription to the new server. I'm thinking the latter will work since it uses an alias on the publisher, but I'm short on time and it's better to be slow than wrong. Start the snapshot; users continue to hit the db on the reporting server while the snapshot is in progress. A couple of hours later the snapshot finishes, and we do the switchover to point the web app to the new server.

7 pm, everything back to normal, set another DBCC to run while I head home. Check in later, no errors reported. Good!

So what did I learn from all this? After the last incident we had already requested (again) money to support clustering the SQL boxes, and that had been approved for next year, so nothing to do but wait. Hardware is usually pretty stable and we were using good stuff (as far as we knew anyway). The biggest complaint I had is that there should be no level 1 techs working server support!

On the application side, we had built our error logging fairly well, so that if the server was down the apps that used it wouldn't be down as well, but we had not envisioned a scenario where we would want to disable it wholesale. We're considering now how to add that feature and whether to add the option to point it to an alternate server. Beyond that, it got us thinking about what it would take to make our databases more portable, so that we could easily move them from one server to another if we needed to. Changing the application is pretty easy; we could make the change and deploy in 10-15 mins. The problem is that many of our internal apps have cross-database dependencies, so moving one database to another server wouldn't work. We'd have to move them all, or set up a linked server and change some code to make it all work. Not doable in a hurry, and I'm not sure it's doable in a workable/maintainable way at all.
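For what it's worth, the disable switch and the alternate server option could be as small as the sketch below. This is only an illustration of the idea under discussion, not something we have built: `LoggingConfig`, `send_error`, and the `post` and `spool` callables are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class LoggingConfig:
    enabled: bool = True                              # hypothetical wholesale off switch
    servers: List[str] = field(default_factory=list)  # primary first, then alternates


def send_error(record: Any,
               config: LoggingConfig,
               post: Callable[[str, Any], None],
               spool: Callable[[Any], None]) -> str:
    """Return the server that accepted the record, or 'spooled' if none did."""
    if config.enabled:
        for server in config.servers:
            try:
                post(server, record)  # must raise if the server is unavailable
                return server
            except Exception:
                continue              # try the next server in the list
    # Logging is switched off or every server failed: keep the record locally.
    spool(record)
    return "spooled"
```

Flipping `enabled` to off would give the same effect we got by changing the password, without touching the login, and adding a second entry to `servers` covers the alternate server case.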

On the SQL side, I need to turn the process of restoring the master into a canned procedure in case someone else ever needs to do it, and I need to experiment with replication to see if changing the subscriber location can be done without redoing the snapshot.
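As a starting point for canning the master restore, here is a rough sketch that just prints the steps in order. It assumes a default instance (service name MSSQLSERVER), a disk backup of master at a path you supply, Windows authentication, and the sqlcmd command line tool (older installs would use osql instead). It also assumes the instance can still start in single-user mode; if master is too damaged for that, it has to be rebuilt first, which this does not cover. The function name and the example path are made up.

```python
def master_restore_steps(backup_path, service="MSSQLSERVER", server="."):
    """Return the ordered commands for restoring master from a disk backup."""
    # Double any single quotes so the path is safe inside the T-SQL string literal.
    quoted_path = backup_path.replace("'", "''")
    restore_sql = (
        "RESTORE DATABASE master FROM DISK = N'{}' WITH REPLACE".format(quoted_path)
    )
    return [
        # 1. Make sure the service is stopped.
        "net stop {}".format(service),
        # 2. Start it in single-user mode, which a master restore requires.
        "net start {} /m".format(service),
        # 3. Restore master; the instance shuts itself down when this completes.
        'sqlcmd -E -S {} -Q "{}"'.format(server, restore_sql),
        # 4. Start the service normally again.
        "net start {}".format(service),
    ]


if __name__ == "__main__":
    # Hypothetical backup location, for illustration only.
    for step in master_restore_steps(r"D:\backup\master.bak"):
        print(step)
```

Printing the steps rather than running them is deliberate: the point is a checklist someone else can follow under pressure, and each step should be checked by hand before moving on to the next.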

All in all, it could have been much worse! Steve Jones took the lead here in writing about things that go wrong, now I'm ahead 2 incidents to 1. Nothing to brag about of course, just hoping that publishing our troubles might help you avoid problems at your place. Comments? Click the link below!
