In The Real World - Disaster!

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/sjones/intherealworlddisaster.asp

  • Looks like you dealt with fairly small databases. The same scenario could have been much worse if the databases were larger, say 10 GB (in terms of restore times). Log shipping would have saved a lot of time in this case!


    HTH,
    Vyas
    SQL Server MVP
    http://vyaskn.tripod.com/

  • So far I have been lucky, in that very few of our databases have a restore commitment of less than 24 hours, and those that do are less than 100MB each and replicated to another location. I have, however, had the loving experience of both the primary and backup sites losing a drive in a RAID 5 array within a month of each other. Fortunately it was only one drive each time, and we got it replaced before losing any more (we did have to wait a week on one drive, and boy was everyone sweating it).

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Log shipping is definitely a worthwhile thing. I created some ultra-basic scripts (since I don't have the Enterprise Edition of SQL Server) that do the equivalent (look on comp.databases.ms-sqlserver). My goal is to be up within 5-10 minutes.

    Some ideas for log shipping -

    1) For all the databases that only require nightly backups (and can easily survive a day's loss of data), set up a script to restore them nightly on the backup server in operational mode. That way they're ready to go and don't need to be touched, shaving off precious minutes.

    2) Scripts, scripts, scripts. I have one to restore the transaction logs, one to run through and fix the users, and so on. I try to make them as generic as possible, using parameters to tell them which database/files to work on.

    3) Jobs - I have 3 jobs set up that will bring everything back up. They run the aforementioned scripts with the necessary parameters. The only things they don't do are change the IP address and server name, and run the setup program so that everything synchronizes.

    4) Documentation. Do a complete run through, documenting everything. Make it so easy your kids can run it!

    5) Assume you won't be there. Assume you'll be hit by a bus. Although, granted, at that point you won't care if the databases aren't brought up quickly. 😉
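
    A minimal sketch of what the log-restore script from point 2 might look like (database names, file paths, and the standby file are placeholders, not from the original post):

    ```sql
    -- Hypothetical sketch: apply a transaction log backup on the standby server,
    -- leaving the database in STANDBY mode so further logs can still be applied.
    RESTORE LOG MyDatabase
        FROM DISK = 'D:\LogShip\MyDatabase_tlog_0800.trn'
        WITH STANDBY = 'D:\LogShip\MyDatabase_undo.dat';

    -- At failover time, recover the database and bring it fully online:
    RESTORE DATABASE MyDatabase WITH RECOVERY;
    ```

    Keeping the standby database in STANDBY (rather than NORECOVERY) has the side benefit that it stays readable between restores, so you can verify the logs are actually being applied.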

    Great article, thanks!

  • I'm not sold on log shipping, mainly because I've seen too many errors with MS tools like this. I prefer to do it myself.

    Total DB size (3 dbs) was about 1GB, though this was initially implemented because we had the servers in another location and FTP'd the data back every 15 minutes. A larger db wouldn't have changed this, though the fulls would probably have been weekly and the differentials daily.

    Good ideas below, and I'd like to implement them, but I don't have a spare server. In this case, we pressed the QA server into production. However, I do practice the restore every Monday to reset the QA environment, so I've got good documentation on that. The only thing we missed was documenting the repointing of the web servers to the new database server. Since this was a temp fix, we did not want to rename the server.

    Steve Jones

    steve@dkranch.net

  • ty, Mr. Warren

    Steve Jones

    steve@dkranch.net

  • Performing a cold backup of master, model, and msdb once in a while on the local disk can also help. I once got a call from a system engineer saying there had been a controller failure, he had rebuilt the box and restored the files, but the SQL Server services wouldn't start. Obviously, the backup software in use had not been backing up the *.mdf and *.ldf files, so those files were never restored. Fortunately I had taken a cold backup of all the system database files and renamed them with different extensions, and those were restored. All I had to do was rename the files back to their original extensions, place them in the data folder, start the services, and restore the rest of the user databases.
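
    A simpler, supported alternative to copying the .mdf/.ldf files is to take regular backups of the system databases to local disk. A sketch (the paths are placeholders):

    ```sql
    -- Hypothetical sketch: back up the system databases to local disk.
    -- WITH INIT overwrites the previous backup in each file.
    BACKUP DATABASE master TO DISK = 'D:\SysBackup\master.bak' WITH INIT;
    BACKUP DATABASE model  TO DISK = 'D:\SysBackup\model.bak'  WITH INIT;
    BACKUP DATABASE msdb   TO DISK = 'D:\SysBackup\msdb.bak'   WITH INIT;
    ```

    One caveat: restoring master requires starting SQL Server in single-user mode (the -m startup option), so the file-copy approach described above can actually be faster in a true rebuild scenario.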

  • Good point, though I'd be sure I had these two on another server as well. Wouldn't have done any good in this case to have them on the local drive.

    Steve Jones

    steve@dkranch.net

  • Hi Steve,

    Is this article the reason you lost your job?

  • Ouch,

    No, this actually occurred two weeks ago on a Wednesday. I successfully moved everything over and ordered a new RAID controller the next day. We were configuring the production machine that Friday when our board meeting ended and the CEO came and told us that the company would be folding. It had nothing to do with the technology; the salesmen just couldn't sell enough over the last two years. My former CTO and I are extremely proud of the software and systems we built. It was the best, most flexible and reliable software I've ever been a part of, and I was sorry to see it go.

    I have some notes from the last two weeks on my site, http://www.dkranch.net, if you are interested. Also, I am consulting with a couple of former customers of IQD that want to continue to use the software and had an escrow agreement for its use.

    If I hadn't gotten things running that night, then I might have deserved to lose my job, but that wasn't the case.

    Steve Jones

    steve@dkranch.net

  • I apologize if I came across as rude, but that was my reflex after seeing your little ad in the sqlcentral newsletter.

    Your experience really made me think again about backup procedures. I also discussed your article with my colleagues at work (IT dept). Some people think 2h 45m under those conditions is a formidable recovery time; some disagree. I think that the hardware failure, plus not having enough time to prepare the backup box, plus not physically being there, made it all very difficult. But I was surprised you didn't have restore scripts ready (taking care of spids, logins/users, etc.).

    Couple of questions:

    1.) How many databases were recovered?

    2.) What was used to move backups to tape?

    3.) You said one box was co-located. Different domain, I guess. I also guess you transferred users/logins through a script?

    Very useful article, thanks!

  • Only a few users/logins, and they were synched with sp_change_users_login (we use SQL authentication). They were added with sp_addlogin because our backup box was "appropriated" a couple of months ago for another task. Had this been a more critical item for the company, we would have had it done quicker.
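
    For readers who haven't done this before, the relink step looks roughly like the following (login and database names are made-up examples, not from the original post):

    ```sql
    -- Hypothetical example: after restoring a database on a different server,
    -- re-create the SQL login, then re-link the orphaned database user to it.
    EXEC sp_addlogin @loginame = 'webuser', @passwd = 'secret';

    USE MyDatabase;
    -- List orphaned users (users whose SIDs match no login on this server):
    EXEC sp_change_users_login @Action = 'Report';
    -- Re-link one user to the newly created login:
    EXEC sp_change_users_login @Action = 'Update_One',
         @UserNamePattern = 'webuser', @LoginName = 'webuser';
    ```

    The orphaning happens because database users reference logins by SID, and a login created fresh on a new server gets a new SID even if the name matches.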

    np, it was just a shock this a.m. seeing the post. I understand you seeing the two events as connected. I'd actually written this about 12 hours before being told the company was failing.

    The backups were not on tape. Actually, they were, but only the previous night's. I FTP backups and logs every 15 minutes to our FTP site, so we recovered the most recent from there.

    A total of 3 databases were recovered that night. I did msdb the next day.

    Steve Jones

    steve@dkranch.net

  • A couple of things:

    1) The easiest way to kill all spids is actually to go through the "Detach Database" form, which lets you kill all active processes. You don't actually have to detach the database; it's just a convenience courtesy of MS (and for some reason is only found in the "Detach DB" winform).

    2) There has to be a better way to make your database server redundant: start with a clustered active-passive two-server configuration and a RAID 10 disk array, and keep spare parts around for those few remaining single points of failure (e.g. the RAID controller). Wouldn't you agree?
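
    Point 1 can also be scripted rather than done through the GUI. A sketch that kills every connection to one database (the database name is a placeholder):

    ```sql
    -- Hypothetical sketch: kill all spids connected to a given database,
    -- skipping our own connection. Uses master..sysprocesses (SQL 2000 style).
    DECLARE @spid int, @sql varchar(20)

    DECLARE spids CURSOR FOR
        SELECT spid FROM master..sysprocesses
        WHERE dbid = DB_ID('MyDatabase') AND spid <> @@SPID

    OPEN spids
    FETCH NEXT FROM spids INTO @spid
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- KILL only accepts a constant, so build it dynamically
        SET @sql = 'KILL ' + CONVERT(varchar(10), @spid)
        EXEC (@sql)
        FETCH NEXT FROM spids INTO @spid
    END
    CLOSE spids
    DEALLOCATE spids
    ```

    A one-line alternative is `ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE`, which rolls back open transactions and disconnects everyone in one statement.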

  • About 80% of businesses fail after a major data disaster. We have already seen this recently in the UK, when major floods caused a major breakdown for various IT and non-IT companies, which lost their data and went out of business.

    Using backup software embedded in the OS is a good idea when you are sure about your hardware. Major data loss occurs due to hard disk failure or controller failure; in that case you won't be able to retrieve anything from your hard disk, and you will basically lose all your data unless you have done an offsite backup.

    To avoid this, any business, whether SMB or enterprise, must have a disaster recovery plan. D2D Bare Metal Recovery is one technology available in the market, but not everyone knows about it. Check it out at http://www.unitrends.co.uk. They are the originator of the Bare Metal term. Using this technology, one can restore OS and data very quickly.
