Back in April Steve wrote an interesting article called In The Real World - Disaster! that talked about his recent experience with a disaster at work. Good reading. So naturally when I had a disaster of my own recently, I thought I'd share a condensed version of it with you too!
It started just after 5pm with a call telling me that two of our servers were down. I was working at home that day, but I live reasonably close, so I was in the office by 5:30. Our email server would not boot; the reporting server running SQL2K would boot, but the service was failing because the data drive was missing. Looking bad already! Luckily we wrap up most of our business that uses these servers at 5pm, so we had all night to figure things out.
To start with, no one knew what had happened or why. Nothing in the event log was helpful, no one had noticed any power drops, and the UPS's on both machines were working.
We started with email. We seriously live on email; being without it for any amount of time hurts. We were not able to get the machine to boot. Tried the recovery console, the repair disk, the works. Decided we'd reinstall Win2K and see what happened. Got that started, then looked at the SQL box. No data drive, no RAID-5 container. A call to Dell support indicated that there was no way to recover the container description; they recommended building a new one--which would wipe the drives.
About now you're probably wondering about our backup situation. The SQL machine has a full backup done nightly and transaction log backups during the day; we keep one day on disk. Because we were very low on space, we had made the decision to NOT back up the replicated db's, knowing that we could pull the data from the publisher. Our backup strategy was approved by senior management with a full understanding of the potential risk. The result was that we were looking at losing a day's worth of work on some not very critical databases, stuff that for the most part could be easily recreated.
Here's where it gets interesting. We had just that week received a new Dell PowerVault - fourteen 18G drives. Yeah! Space at last! Having it available, we decided to leave the existing SCSI drives in place and go with the new set. We had to do this work sometime anyway, and this way we left open the option of sending the original drives to a data recovery firm if it turned out we had lost something more important than we thought. The PowerVault was already in the rack, so hooking it up took all of 5 minutes. Set up the container and we were ready to go.
Now comes the part where you hold your breath. Was the tape good? I don't know about you, but tape ALWAYS makes me nervous! Everything that should be there was, and it was all accessible. About 25G of backups to recover. Started the restore, thinking it would go fast. It didn't. We averaged about 130M per minute - at that rate, better than three hours. In hindsight it would have been smarter to restore master first, then do all the rest. We decided to let it run while we ate a quick dinner and looked at the email server some more.
Got the email server to boot. All the Exchange stuff is there, but it won't run because we just reinstalled Windows. Decided to call support to make sure we got it right the first time. I know next to nothing about the administration of Exchange, but it sure seems more complex than it ought to be - I can only hope they really do start using SQL for the message store. Our admin settles down for one of those fun multi-hour calls while they step through the rebuild process.
The tape is still running, so I review what I need to do once it's done and catch up on some forum posts here. Begin entering the zen-like state required for good restores!
The tape is done. Pop the CD in, grab the core mdf/ldf's that RebuildM will need, drop them in a folder, and clear the read-only bit. Run RebuildM. Start SQL in console mode. So far so good. Restore master, crank up the service. A lot...I mean a LOT of suspect databases. But that is expected, so I move on. Next is the restore of model and msdb. Then I start restoring the rest. Most are fairly small, so I do them in one QA window while I run the larger ones in another. Fast and easy.
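For anyone who hasn't done it, the rebuild-and-restore sequence works out to something like this in rough T-SQL (the file paths and database name here are placeholders, not the ones we actually used):

```sql
-- After RebuildM, start SQL Server in single-user mode (sqlservr.exe -c -m)
-- and restore master; the service shuts down on its own when it finishes.
restore database master from disk = 'E:\Backup\master.bak' with replace

-- Restart the service normally, then bring back the other system databases.
restore database model from disk = 'E:\Backup\model.bak' with replace
restore database msdb from disk = 'E:\Backup\msdb.bak' with replace

-- User databases follow; MOVE relocates the files if the drive layout changed.
restore database SomeDB from disk = 'E:\Backup\SomeDB.bak'
    with replace,
    move 'SomeDB_Data' to 'F:\Data\SomeDB.mdf',
    move 'SomeDB_Log' to 'F:\Data\SomeDB.ldf'
```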
Now I need to replicate almost 200 databases to this server. To do that I need a database already there for each. Time to kill the suspect ones. They all start with the same prefix (done for ease of management; they all have the same schema), so I quickly delete and recreate them all by doing this:
Set rs = cn.Execute("select name from sysdatabases where name like 'aa%'")
Do Until rs.EOF
    cn.Execute "drop database " & rs.Fields("Name")
    cn.Execute "create database " & rs.Fields("Name")
    rs.MoveNext
Loop
rs.Close
Set rs = Nothing
Set cn = Nothing
A couple of minutes for that to run. Before I started replicating there was one more little admin task to do. Probably half of these databases were created with SQL 7, back before a concurrent snapshot was an option. If you enable the option on an existing publication, it wants to reinitialize the publication. Possibly there is a way around that, but so far it had been good enough to just create new publications with it enabled and fix old ones if they ever needed a new snapshot done. I already had a loop ready, so I just modified it and ran this:
Set rs = cn.Execute("select name from sysdatabases where name like 'aa%'")
Do Until rs.EOF
    cn.Execute "update " & rs.Fields("Name") & "..syspublications set sync_method=3 where sync_method=0"
    cn.Execute "exec " & rs.Fields("Name") & "..sp_reinitsubscription @publication='" & rs.Fields("Name") & "', @subscriber='xyz'"
    rs.MoveNext
Loop
rs.Close
Set rs = Nothing
Set cn = Nothing
Concurrent snapshots are a little tricky in that the log reader has to run to authorize the snapshot, and the log readers on our databases don't run continuously - read this article to see how and why. Note that the concurrent snapshot was not important at all to what I was trying to do; I wanted it in place so that if I needed to do a snapshot during the day, users would not be blocked - one of the big improvements in SQL2K replication.
The next part: even though I could have run another loop, activating a hundred or more simultaneous snapshots is probably not a good idea when I only have one distribution database. Worth a try anyway. Bad plan. 600 connections, disk queueing (which was expected), plus the occasional deadlock. Still, it was worth a shot. I restart the service and go to plan B. I settled on starting them in groups based on the first real letter of the db name. Put together a VB form (while the first batch ran) with a text box for the first letter and two buttons: one to run the snapshots for db's starting with that letter, the other to run the log reader and distribution agents.
How did I do that? Replication jobs fall into categories 10, 13, and 15. One button ran the snapshot jobs (category 15); the other ran the category 10 & 13 jobs - the distribution agent and log reader. Works OK, but more manual than I would like. I start by running the snapshots on the A's, wait a while, start the log reader/distribution for the A's, then do the snapshots on the B's. Watch the spid count using sp_who; when it drops below 80 or so, run some more. Some process quickly, some take longer. I ended up having to go back and re-run some of the distribution jobs because the snapshot wasn't done for the db when they ran the first time.
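For the curious, the category trick boils down to something like this (the category ids are what I see in msdb..syscategories on SQL2K - verify them on your own server, and the job name here is just a placeholder):

```sql
-- Replication agent jobs live in msdb, grouped by category:
-- 10 = REPL-Distribution, 13 = REPL-LogReader, 15 = REPL-Snapshot
select j.name, c.name as category
from msdb..sysjobs j
join msdb..syscategories c on c.category_id = j.category_id
where c.category_id in (10, 13, 15)
order by c.category_id, j.name

-- Kick off a job once you have its name
exec msdb..sp_start_job @job_name = 'SomeSnapshotJob'
```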
Slow going. Or so it seems, anyway. Once I get halfway through, I open Explorer and check the file sizes; it's easy to spot the ones that are not complete because they are all less than a meg in size. Use EM to just run those jobs manually to fill in the blanks.
Replication from our publisher is done. I have one db that subscribes to a publisher across a 256K link. Reinitialize the publication on their side, run the snapshot, and the distribution agent fails - server not configured for data access? Scratching my head. Nothing has changed in the past day; we restored the database on our side, and the logins haven't changed. Try it again. As if I could somehow do it differently the second time! Same thing. OK. Terminal Services to the publisher, open Query Analyzer, connect to the subscriber using SA. Works fine. Look up the password for the SQL account used for replication, try it. Access denied!
Well now. It says no access allowed to the default db, which should be the db I'm trying to connect to. Open EM, verify, looks good (I swear it did!). Try again, same thing. Think about it a little, open sysxlogins, check the default db - which is really a dbid. Run a select - of course, it's pointing to the wrong db, since the databases were added to sysdatabases in a completely different order than before. Had not planned on that. I usually set the default to master, but because the login/password were being used on a server I didn't fully control, I wanted to minimize the risk if the password was compromised. Run sp_configure to allow catalog updates, modify the dbid in sysxlogins, turn catalog updates back off. Works fine!
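For reference, the fix looked roughly like this (the login and database names are placeholders; sp_defaultdb is the supported way to accomplish the same thing, and editing system tables directly should be a last resort):

```sql
-- Allow direct updates to system tables (SQL2K)
exec sp_configure 'allow updates', 1
reconfigure with override

-- Point the login's default database back at the right dbid
update master..sysxlogins
set dbid = db_id('ReplSubscriberDB')
where name = 'repl_login'

-- Turn catalog updates back off
exec sp_configure 'allow updates', 0
reconfigure with override
```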
Next was the matter of shares. I knew one of them and restored it manually; I had already found one missing during replication. Ah, forgot to mention that. We have a standard script we run on the subscriber after a snapshot to make sure some extra objects are there, set permissions, and adjust the indexes. SQL2K lets you configure a post-snapshot script to run automatically - what a nice feature. The script sat on the data drive and was accessed via a share, so the first snapshot I ran failed because the share was missing. Documenting shares might save a step or two next time.
Double check the jobs, get a new backup started - just in case! Everything looks good as far as I can tell; I'll have to wait for business to start to see if I missed anything. Check on our admin - he's on the phone again, waiting on something else to finish. I leave about 4:15 am. He's still there when I come back at 8:30, but email is running and he is doing cleanup. I missed a couple snapshots; fix those. One of the remote pubs failed; I track it back to a problem I had fixed the day before - easy. Email notifications were failing; run xp_stopmail/xp_startmail and everything is fine (email was down when the service started, so it had to be done manually). Backup ran fine. All in all everything is good - no one unable to work because they are missing something.
Hindsight? We knew the risks in our backup strategy, and it was a legitimate business decision made at the time. We'll go back and make the case for more redundancy, but there was not one complaint from senior management, and our CIO was totally supportive while we worked. Since we now have more drive space, I can back up all the databases on the subscriber. That gives me a redundant backup of the db's on the publisher, plus if I have to do this again I won't have to do the drop/recreate step, which is what threw off the database id's. Whether that would be worth it I don't know. I think it's faster to do it the way I did and just know to run a query at the end to fix things. Or switch all the logins to use master as the default db. Maybe save the shares out to a reg file for easy replacement.
What else? Maybe automate the snapshot process. Another item that I'm not sure is worth doing. Can always be done manually or semi-manually as I did it the first time. Worst case a couple people would run jobs from EM for a couple hours. Don't know...would you build it?
Using the new drive set was both a good thing and a minor risk. It's usually not good to try to change things during a restore, but it gave us some added flexibility, and at most we would have lost a little time if we had problems getting it up and running. We had plenty of time. The service was running by midnight; we could have done the most important snapshots first and then kept working on the rest - and yes, I had a way to determine which ones had been used in the last day as a starting point for that decision. Had it been during the day we would have pushed a little harder, but not much. Better to get it right the first time.
Just remembered something else. We had already received a new super duper tape drive, just waiting on a card for it, so next time the restore will be faster. Every little bit helps!
Still don't know what happened. I suspect a power surge of some sort, but so far can't prove it. Hate to think it could happen again. What makes TWO servers go down at once?? Bad luck! It's our first serious failure in the four years I've been there and we did ok.
I tried to get it all down. Got any questions? Go easy on me, at least I survived it!