This article was originally published on Apr 22, 2002. After the recent hurricane Sandy, this is being re-run as a reminder that disasters can be small as well as large.
This is a true story and is intended as a case study in showing how I actually recovered from a disaster. As close as I can remember, I am writing this from memory the night of the events, so I think it is fairly accurate. As I've said many times, one should always take notes, but I had some extenuating circumstances in this case. That's not an excuse, just the way things played out.
Wednesday, April 17, 2002
My cell phone rang around 4:55pm.
"I'll be there in 5 minutes." My wife was calling and coming to pick me up. She works in the technology industry as well and was flying to California for a two day training class. I closed my laptop, locked my desktop, and got ready to leave the office. It had been a fairly quiet day and the QA department had signed off on the production release for that night. After letting the change guy know I was leaving and the release was ready to go from the database perspective, I went outside to wait.
On the way to the airport, the senior developer called. "The production site isn't responding. It seems to be whenever we hit a page with a database call". He's a sharp guy, so I trust his judgement. They'd already restarted some services on the web server so I ran through my mental checklist.
5:20pm, Wed, nothing going on. I gave my ok to restart the SQL Server. Didn't think anything of it as we arrived at the airport and got my wife checked in. Then my cell rang again.
"We lost the array controller on the database box". It was my boss. HE proceeded to tell me that the server would not restart and the BIOS could not see the array controller anymore. Not something you expect everyday, but bad things come in threes. Two outages at SQL Server Central in the past week and now this. He was a little worried and was trying to decide what to do.
Now we've had a spare database server sitting idle, getting ready to be re purposed, but since we've been busy with development, we hadn't done anything with it. It was my old database box until we bought the new one. The old one has less space, but it will still work. Plus it's a dual-CPU server. My boss and the network guy had plugged it in and were booting it up. Apparently they were having issues with the KVM switch, which connects all the servers to a single keyboard, mouse and monitor. I told them to Terminal Server in because it would be just as fast. SQL Server 2000 was already on this server, so I figured I'd be off the phone in 15 minutes.
They couldn't get to it. Since we hadn't booted this server since we'd brought it back from our co-location facility, I suspected the IP addressing was wrong. So I told them to get the console setup, log in and re-address the network settings and call me back.
45 minutes later I was leaving the airport and decided to check with them. They were having problems getting the spare database server on line. Now it's technically my server, but the network guy is really responsible for the OS and network segments. Since I now had 3 kids all to myself, I decided to take the safer path. Even if it led to lower performance for a day.
"Let's push QA into production", I told my boss. I knew this would work because I restore the QA environment from the production database server every week. At this point, I was slightly worried.
You see, every 6 hours I get a complete database backup of the two production databases that we depend on. Then, every 15 minutes in between these backups, I grab a transaction log backup. The way my backups are set, I send these backups to the local disk on the server. Of course, with a dead array controller, it wasn't likely I'd be getting today's backups off the server. And the network guy grabs these backups to tape every night. Around 3:00am.
So here I am, it's about 6:30pm, with the production database down since around 5:10 or so. A full days work. The last backup the network guy has is 8:00pm the previous night, plus transaction logs until around 10:00pm. I bail on these from 10:00pm until 4:00am the next morning as we don't really run a 24 hour shop.
Luckily I had implemented my PushFTP DTS package as a scheduled job. My transaction log backups run at 10, 25, 40, and 55 minutes after the hour. My DTS task runs 10 minutes later and pushes all the backups to our FTP server. I know because I check this almost every day. Now I was curious to see what I had.
Over my crying 11 month old, I told my boss where to look for backup files (on the FTP server), and there were log backups through 4:55pm. So sometime between then and 5:10, the next log backup, the server crashed. I had him grab the last backup file, which was the 2:00pm full backup for each of the two production databases. Then I had him grab all transaction log backups after 2:00pm. It took a few minutes, but I was able to explain how the files were named so he could determine what to copy to QA. I name all database backups files with a .BAK. Transaction log backups are .TRN. Each file starts with the name of the database, then an underscore (_), then the year, month, day, and time of the backup. In this case, I have a database called "plm", so the latest backup was "plm_200204171400.BAK".
I was getting closer and still had hope that I wouldn't have to go into the office. The backups were copied and I had my boss log into the QA server console. He didn't need to, but I could hear the stress in his voice and I knew it made him feel better. Once logged in, I had some start Enterprise Manager and navigate to the database.
Now if you've ever restored a backup from another server, you will have learned a few things. Things I needed to try and teach my boss on the fly. He's a sharp guy, he has a few "man" pages for gcc to his credit, but he doesn't deal with this side of SQL Server everyday. That's why he pays me, so I wasn't sure how this would go.
I can picture the screens in my head as I drove. "Right click databases, choose "all tasks", the "restore database". In the dialog, make sure the right database (prod) is selected. Now choose the "device" radio button, then choose "select device", then "add", then navigate to the backup you copied from FTP."
"Now get back to the main dialog. Choose the "options" tab. In the grid in the middle there are two files listed, an MDF and an LDF. You need to change the paths to these to a valid path on this server."
"Huh?". It seemed straightforward to me, but I've done this. You see, when you make a backup in SQL Server, the paths to the database files are stored in the backup file. When you go to restore these, if the paths are not valid, like because it's a different server or a drive has failed, you need to use the "WITH MOVE" option. In the GUI, this is implemented in the grid on the options tab. After a few minutes of explaining this in different ways, he was set. He clicked "ok" and ...
"It says the restore is terminating because exclusive use of the database could not be obtained".
#$%#!#%$% application server. OK, explain to stop the services, then connect with Query Analyzer, then run "sp_who2". "sp_what?", "who, WHO-2". OK, now find the spids with the database selected and kill them. "Like the Unix kill process?". "Yes, just like that".
Still now working for him, but I'm getting off the highway and I'm a half a mile away. As I pull into the parking lot, my 3 yr old throws up from car sickness. This night is getting better all the time.
I get everyone out of the car and go upstairs. After a quick bathroom cleanup, we get to the server room. I unlock with my key card and walk to the console with the baby in my arms. Get back up and go open the door for the crying 3 yr old who got locked out. He's a little slow getting through doors with his toy lawn mower that he's pushing everywhere these days.
OK, I'm in a hurry, so I show my boss how I kill the spids, verify they are gone, then select the database backup and change the path. Do this and click "ok". the database is restored. Cool, we are cooking now. Perform the same procedure, but select transaction log and click ok.
Error. I forgot to restore the database in a non-useable state so I could do the logs. Redo the restore.
Nice, now I've got a database in "loading" state. Only problem is it's the bug tracker database, not the production database. Whoops, restore that guy from his 3:00backup. I'll square this with QA later.
OK, now I restore the database and check the "read-only" box on the options tab. That works and I work my way through the 12 logs I've got with a baby in my arms. She actually cooperates pretty well, only tossing papers on the flow and not smacking the keyboard. Hence the reason I had no notes. It's easier to work my way through it from memory than take notes and teach someone.
Once all the logs are restored, I add the production login to the server. Now I select my "prod" database as the default, but I cannot add database access for this login. Why not? It was already added by the restore. So once I get the login added, I connect via Query Analyzer to the database and run
sp_change_users_login 'auto_fix', 'username'
Now we're set and the developers verify the web app can connect.
I repeat this for the other database and we're back online. Total downtime? About 2 hours, 45 minutes. Time for me to restore the databases once I got to the office? About 45 minutes.
Not great, but the web site is working.
So what did I learn? Well, I learned my backup process works well. I'm not a big fan of log shipping, but I'm sure my boss will want some testing of it now. So I well. I also need to develop a script that will quickly build the restore commands I need for my servers. If I'd had this, I would have been able to point it to my boss and skip the office visit with my kids. Not that they minded, but it could have been worse.
I also learned that no matter how much stress and pressure, I should be sure of my actions. I restored the database incorrectly once and restored to the wrong database once as well. It was only a 10 minutes waste, but in the middle of the day, that would have been bad.
I also should have probably setup differential backups every 2 hours between my fulls. That would have further reduced the restore time and made me look even better.
Don't get me wrong. This was a success for me. My boss and IT team are happy and the site is running. I'll pick up more pieces tomorrow and get a better server in place, but that's not good enough. I see lots of room for improvement in many areas, including documentation, and I need to get on it.
As always I welcome feedback on this article using the "Your Opinion" button below. Please also rate this article. I'd like to hear what others think, get similar stories, and help those who haven't been through this prepare for it happening to them.
©dkRanch.net April 2002
|Advertisement: Disasters can strike at anytime. One way to be prepared is to verify that your backup strategy is copying your backup files to remote machines and you are testing restores. SQL Backup Pro automates this for you to ensure that you can recover in the event of a disaster.|