Another Disaster (Almost)

Back in July 2002 I wrote Disaster In The Real World - #2, my real-life adventure about a server going down. Now, about six months later, I almost got to relive the whole thing again! Read on to hear how things can go from bad to worse to OK and back to bad again.

Thursday afternoon I'm sick with a cold and a terrible headache, so I call it a day about 3pm. I'm two minutes from home when Sean calls my cell phone: the data drive on the reporting server is missing. I turn around, go back to the office, and check in with our network dude (ND) to see what is going on. The drive dropped for no apparent reason and the server itself is OK, so we check the status of the SCSI containers. You know the one that usually says 'OK'? Now it says 'DEAD'. Great. Another massive restore that will take most of the night. Knowing that we can do the restore in time to be up the next day, we decide to work the problem for a while. The difference this time is that we can see the container even though it's dead; last time the container was just gone.

Three hours later we're still on the phone with support. After various bits of voodoo, including a stretch when the server wouldn't even boot to the OS on C, they get the container to change status to 'Scrubbing' and decide we need to replace the controller card. I go home to get some sleep while we wait on the new card. At 11pm our ND calls me: he has the card, but it will be morning before the drive scrub is complete, so he will replace the card then. Sounds better than staying up all night, so we decide to regroup at the office at 5am.

Next morning we replace the card, and the drive goes back to scrubbing again. Not what we planned. We boot to Windows since we're out of time; performance starts out bad and gets worse. There's no utility installed that would let us check the status of the scrub or change its priority. Reporting, time and attendance stuff, and one fairly critical app reside on that server, plus the db where we log both errors and informational messages from our main app. By 9:30 users are screaming because our main app is crawling. The reason is that we log both text and a screen shot to aid in diagnostics, and inserts are taking 45 seconds or longer. After a couple of minutes' thought, I decide to change the password on the error logging login. The component is built so that if the connection fails, it writes the data out locally and tries to post it the next time the app starts up. That gives us a pretty good boost. Now everyone wants to know when the other applications will be usable. We tell them 5pm to err on the worst-case side; we're figuring we'll be good by 1pm.
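
The fallback behavior described above (write locally when the connection fails, retry at the next startup) can be sketched roughly as follows. This is a minimal illustration in Python, not our actual component: the `SpoolingLogger` class, the `send` callable, and the spool file format are all hypothetical, and it assumes a failed login surfaces as a `ConnectionError`, which a real database driver may not do.

```python
import json
import os


class SpoolingLogger:
    """Post log entries to a server; spool them to a local file on failure."""

    def __init__(self, send, spool_path):
        # `send` is a hypothetical callable that posts one entry and
        # raises ConnectionError when the server cannot be reached.
        self.send = send
        self.spool_path = spool_path

    def log(self, entry):
        """Try to post the entry; on failure append it to the local spool."""
        try:
            self.send(entry)
        except ConnectionError:
            with open(self.spool_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(entry) + "\n")

    def flush_spool(self):
        """Call at app startup: retry spooled entries, keep any that still fail.

        Returns the number of entries posted successfully.
        """
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        remaining = []
        posted = 0
        for entry in pending:
            try:
                self.send(entry)
                posted += 1
            except ConnectionError:
                remaining.append(entry)
        # Rewrite the spool with only the entries that could not be posted.
        with open(self.spool_path, "w", encoding="utf-8") as f:
            for entry in remaining:
                f.write(json.dumps(entry) + "\n")
        return posted
```

With a design like this, changing the login password makes every post fail fast, so the app stops waiting on 45-second inserts and the entries sit on local disk until the server is healthy again.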

Drive time settles back down to normal about 8:30am the next day (Sat). I email that everyone can resume normal operations and turn on some processes I had disabled. I run a DBCC on everything, find no problems, set the backup to run later that night, and log off.

Early Monday everything looks OK initially, then we see drive times way up. Further investigation reveals the backup finished, but took about five times as long as usual. Bad. Our agents go on break around 10am, so we decide to bring the server down then to upgrade the card BIOS and put the management software on the box so we can check the status. Everything goes well, then we replace the other SCSI card (the one for the tape drive), reboot, and get an 'operating system not found' error. We double-check that everything is connected, seated, etc., and reboot. Same thing. Call support again. Break lasts 15 minutes and we are way over that already. A company meeting is scheduled for 11:30 and we miss that too. Our CIO brings us lunch (great boss!), and we finally get it to boot... by using a floppy! That doesn't fix the problem, but at least we're back in business, and we can finish troubleshooting after 5pm or on the weekend. This is about 1:30pm. We set the scrub priority to low, disk time is cool, time for a break.

So, a happy ending?

Not quite. SQL won't start; it says the master can't be recovered. Not what I want to hear. Thinking wishful thoughts, I get something to drink, give it 10 minutes, and try again. Same thing. OK, I have the previous night's backup on disk, so I restore the master. 20 of 200+ dbs are corrupt, apparently the result of a bad shutdown with the caching controller. Almost all of them are replicated copies that aren't immediately critical and that I can fix by just dropping the db, putting it back, and doing a new snapshot. The other five I have to check on to see how badly restoring from the previous night will hurt us. Turns out OK, so I restore those five first, then work on the other ones. By 6pm I've got most everything working again, and performance is still good.

Along the way we decide that one db is too important to risk additional downtime, so I detach and copy it to our main server. Because it's a replicated copy, I debate whether to snapshot it across the 256 link to the publisher or try to repoint the subscription to the new server. I'm thinking the latter will work since it uses an alias on the publisher, but I'm short on time and it's better to be slow than wrong. I start the snapshot, and users continue to hit the db on the reporting server while the snapshot is in progress. A couple of hours later the snapshot finishes, and I do the switchover to point the web app to the new server.

7pm, everything back to normal. I set another DBCC to run while I head home. Check in later, no errors reported. Good!

So what did I learn from all this? After the last incident we had already requested (again) money to support clustering the SQL boxes, and that had been approved for next year, so there's nothing to do but wait. Hardware is usually pretty stable, and we were using good stuff (as far as we knew anyway). The biggest complaint I had is that there should be no level 1 techs working server support!

On the application side, we had built our error logging fairly well, so that if the logging server was down the apps that used it wouldn't go down as well, but we had not envisioned a scenario where we would want to disable it wholesale. We're now considering how to add that feature and whether to add the option to point it to an alternate server. Beyond that, it got us thinking about what it would take to make our databases more portable, so that we could easily move them from one server to another if we needed to. Changing the application is pretty easy; we could make the change and deploy in 10-15 minutes. The problem is that many of our internal apps have cross-database dependencies, so moving one db to another server wouldn't work. We'd have to move them all, or set up a linked server and change some code to make it all work. Not doable in a hurry, and I'm not sure it's doable in a workable, maintainable way at all.
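
One way the "disable it wholesale" switch and the alternate-server option might look is sketched below in Python. Everything here is an assumption for illustration: the `LogTargetConfig` fields, the server names, and the `send_to` callable are hypothetical, and nothing like this exists in our component yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogTargetConfig:
    # All names and defaults are hypothetical placeholders.
    enabled: bool = True                # flip to False to disable logging wholesale
    primary: str = "reporting-server"
    alternate: Optional[str] = None     # optional second server to try


def post_entry(entry, config, send_to):
    """Post one entry according to the config.

    `send_to(server, entry)` is a hypothetical callable that raises
    ConnectionError when the server cannot be reached. Returns the name
    of the server that accepted the entry, or None if it was not sent.
    """
    if not config.enabled:
        return None
    for server in (config.primary, config.alternate):
        if server is None:
            continue
        try:
            send_to(server, entry)
            return server
        except ConnectionError:
            continue
    return None
```

Note that automatic failover like this only helps when the primary is unreachable. In our case the server was up but painfully slow, so the useful knobs would have been setting `enabled` to False or repointing `primary`, rather than changing a password to force failures.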

On the SQL side, I need to write up the process of restoring the master as a canned procedure in case someone else ever needs to do it, and I need to experiment with replication to see if changing the subscriber location can be done without redoing the snapshot.

All in all, it could have been much worse! Steve Jones took the lead here in writing about things that go wrong; now I'm ahead 2 incidents to 1. Nothing to brag about of course, just hoping that publishing our troubles might help you avoid problems at your place. Comments? Click the link below!
