Back in April Steve wrote an interesting article called In
The Real World - Disaster! that talked about his recent experience with a
disaster at work. Good reading. So naturally when I had a disaster of my own
recently, I thought I'd share a condensed version of it with you too!
It started just after 5pm with a call telling me that two of our servers were
down. I was working at home that day but I live reasonably close so I was in the
office by 5:30. Our email server would not boot, the reporting server running
SQL2K would boot but the service was failing because the data drive was missing.
Looking bad already! Luckily we wrap most of our business that uses these
servers at 5pm, so we had all night to figure things out.
To start with no one knew what happened or why. Nothing in the event log was
helpful, no one noticed any power drops, UPS's on both machines were available.
We started with email. We seriously live on email. Being without for any
amount of time hurts. We were not able to get the machine to boot. Tried
recovery console, repair disk, the works. Decided we'd reinstall Win2K and see
what happened. Got that started, then looked at the SQL box. No data drive, no
RAID-5 container. A call to Dell support indicated that there was no way to
recover the container description, recommended building a new one--which would
wipe the drives.
About now you're probably wondering about our backup situation. The SQL
machine has a full backup done nightly and transaction logs during the day, we
keep one day on disk. Because we were very low on space we had made the decision
to NOT backup the replicated db's, knowing that we could pull the data from the
publisher. Our backup strategy was approved by senior management understanding
the potential risk. The result was we were looking at losing a days worth of
work on some not very critical databases, stuff for the most part that could be
easily recreated.
Here's where it gets interesting. We had just this week received a new Dell
PowerVault - fourteen 18G drives. Yeah! Space at last! Having it available we
decided to leave the existing SCSI drives in place and go with the new set. We
had to do this work sometime anyway and this way we left open the option to send
the original drives to a data recovery firm if we had lost something more
important than we thought. Powervault was already in the rack, so hooking it up
took all of 5 minutes. Set the container and we were ready to go.
Now comes the part where you hold your breath. Was the tape good? I don't
know about you but tape ALWAYS makes me nervous! Everything that should be there
was and was accessible. About 25g of backups to recover. Started the restore,
thinking it would go fast. It didn't. We averaged about 130M per min. In
hindsight would have been smarter to restore the master first, then do all the
rest. We decided to let it run while we ate a quick dinner and looked at the
email server more.
Got the email server to boot. All the Exchange stuff is there, but it won't
run because we just reinstalled Win. Decide to call support to make sure we get
it right the first time. I know next to nothing about the administration of
Exchange, but sure seems more complex than it ought to be - can only hope they
really do start using SQL for the message store. He settles down for one of
those fun multi-hour calls while they step through the rebuild process.
Tape is still running, so I review what I need to do once it's done, catch up
on some forum posts here. Begin entering the zen like state required for good
restores!
Tape is done. Pop the CD in, grab the core mdf/ldf's that RebuildM will need,
drop in a folder and clear the read only bit. Run RebuildM. Start SQL in console
mode. So far so good. Restore the master, crank up the service. A lot...I mean a
lot of suspect databases. But that is expected so I move on. Next is the restore
of master, model, msdb. Then I start restoring the rest. Most are fairly small,
so I do them in one QA window while I run the larger ones in another window.
Fast and easy.
Now I need to replicate almost 200 databases to this server. To do that I
need a database already there for each. Time to kill the suspect ones. They all
start with the same prefix (this is done for ease of management, they all have
the same schema), so I quickly delete and create them all by doing this:
Set rs = cn.Execute("select *
from sysdatabases where name like 'aa%'")
Do Until rs.EOF
Debug.Print rs.Fields("Name")
cn.Execute "Drop database " &
rs.Fields("Name")
cn.Execute "Create database " &
rs.Fields("Name")
rs.MoveNext
Loop
rs.Close
Set rs = Nothing
cn.Close
Set cn = Nothing |
Couple minutes for that to run. Before I started replicating there was one
more little admin task to do. Probably half of this databases were created with
SQL 7, back before a concurrent snapshot was an option. If you turn enable the
option on an existing publication it wants to reinitialize the publication.
Possibly there is a way around it, but so far it was good enough to just create
the new ones with it enabled and fix old ones if they ever needed to have a new
snapshot done. Already got a loop ready, so I just modify it and run this:
Set rs = cn.Execute("select *
from sysdatabases where name like 'aa%'")
Do Until rs.EOF
cn.Execute "update " &
rs.Fields("Name") & "..syspublications set sync_method=3
where sync_method=0"
cn.Execute rs.Fields("Name") & "..sp_reinitsubscription
@publication='" & rs.Fields("Name") & "',
@subscriber='xyz'"
rs.MoveNext
Loop
rs.Close
Set rs = Nothing
cn.Close
Set cn = Nothing |
Concurrent snapshots are a little tricky in that the log reader has to run to
authorize the snapshot, but the log readers on our databases don't run
continuously - read this
article to see how and why. Note that the concurrent snapshot was not
important at all to what I was trying to do, I wanted it in place so that if I
needed to do the snapshot during the day users would not be blocked, one of the
big improvements in SQL2K replication.
The next part was even though I could run another loop, activating a hundred
or more simultaneous snapshots is probably not a good idea since I only have one
distribution database. Worth a try anyway. Bad plan. 600 connections, disk
queueing which was expected, plus the occasional deadlock. Still, was worth a
shot. I restart the service, go to plan B. I settled on starting them in
groups based on the first real letter of the db name. Put together a VB form
(while the first batch ran) with a text box for the first letter, two buttons.
One button to run the snapshot for db's using that letter, second button to run
the log reader and distribution agent.
How did I do that? Replication jobs fall into categories 10, 13, 15. I ran
the snapshots (type 15) in one button by running the job that always ran it, in
the other button I ran the jobs for the category 13 & 15 jobs - log reader
and distribution agent. Works ok but more manual than I would likeI start by
running the snapshots on the A's, wait a while, start the logreader/distribution
for the A's, do the snapshots on the B's. Watch the spid count using sp_who,
when it drops below 80 or so run some more.. Some process quickly, some take
longer. End up having to go back and re-run some of the distribution jobs
because the snapshot wasn't done for the db when it ran the first time.
Slow going. Or seems that way anyway. Once I get half way through I open
explorer and check the file sizes, easy to spot the ones that are not complete
because they are all less than a meg in size. Use EM to just run those jobs
manually to fill in the blanks.
Replication from our publisher is done. I have one db that subscribes to a
publisher across a 256k link. Reinitialize the publication on their side, run
the snapshot, distribution agent fails - server not configured for data access?
Scratching my head. Nothing has changed in the past day, we restored the
database on our side, login's haven't changed. Try it again. As if I could
somehow do it different the second time! Same thing. Ok. Terminal service to the
publisher, open Query Analyzer, connect to the subscriber using SA. Works fine.
Look up the password for the sql account used for replication, try it. Access
Denied!
Well now. Says no access allowed to the default db, which should be the db
I'm trying to connect to. Open EM, verify, looks good (I swear it did!) Try
again, same thing. Think about it a little, open sysxlogins, check the default
db - which is really a DBID. Run a select - of course, it's pointing to the
totally wrong db since the databases were added to sysdatabases in a totally
different order than before. Had not planned on that. I usually set the default
to master, but because the login/password were being used on server I didn't
fully control I wanted to minimize the risk if the password was compromised. Run
sp_configure to allow catalog updates, modify the dbid in sysxlogins, turn
catalog updates back off. Works fine!
Next was the matter of shares. I knew one of them and restored it manually,
had already found one missing during replication. Ah, forgot to mention that. We
have a standard script we run on the subscriber after a snapshot to make sure
some extra objects are there, set permissions, adjust the indexes. SQL2K lets
you configure a post-snapshot script to run automatically - what a nice feature.
The script sat on the data drive and was accessed via a share, so the first
snapshot I ran failed because the share was missing. Documenting shares might
save a step or two next time.
Double check the jobs, get a new backup started - just in case! Everything
good as far as I can tell, have to wait for business to start to if I missed
anything. Check on our admin, he's on the phone again, waiting on something to
else to finish. I leave about 4:15 am. He's still there when I come back at
8:30, but email is running and he is doing clean up. I missed a couple
snapshots, fix those. One the remote pubs failed, track it back to a problem I
had fixed the day before, easy. Email notifications failing, run xp_stopmail/xp_startmail
and everything is fine (email was down when the service started so had to do it
manually). Backup ran fine. All in all everything is good, no one not working
because they are missing something.
Hindsight? We knew the risks in our back up strategy and it was a legitimate
business decision made at the time. We'll go back and make the case for more
redundancy, but not one complaint from senior management and our CIO was totally
supportive while we worked. Since we have more drive space I can backup all the
databases on the subscriber. Gives me a redundant backup of the db's on the
publisher, plus if I have to do this again I won't have to do the drop/recreate
step which threw off the database id's. Whether that would be worth it I don't
know. I think it's faster to do it the way I did and just know to run a query to
fix at the end. Or switch them all to use the master as the default db. Save the
shares out to reg file for easy replacement maybe.
What else? Maybe automate the snapshot process. Another item that I'm not
sure is worth doing. Can always be done manually or semi-manually as I did it
the first time. Worst case a couple people would run jobs from EM for a couple
hours. Don't know...would you build it?
Using the new drive set was a both a good thing and a minor risk. Usually not
good to try to fix things during a restore but it gave us some added flexibility
and at most we would have lost a little time if we had problems getting it up
and running. We had plenty of time. Service was running by midnight, we could
have done the most important snapshots first and then kept working on the rest -
and yes, I had a way to determine which ones had been used in the last day as a
starting point for that decision. Had it been during the day we would have
pushed a little harder, but not much. Better to get it right the first time.
Just remembered something else. We had already received a new super duper
tape drive, just waiting on a card for it, so next time the restore will be
faster. Every little bit helps!
Still don't know what happened. I suspect a power surge of some sort, but so
far can't prove it. Hate to think it could happen again. What makes TWO servers
go down at once?? Bad luck! It's our first serious failure in the four years
I've been there and we did ok.
I tried to get it all down. Got any questions? Go easy on me, at least I
survived it!