Backup Testing

  • Xerox Machine

    I've often said that backups aren't really important. But restores are critical.

    It's kind of a funny way of looking at things, but most of my backups, 99.9% of them are never used. They're just useless. It's the rare backup that's actually used to restore a database, even in testing. Now that's not always true as I know people that restore their databases every night to a test server, and in fact, the recommendations that I've heard from Microsoft are that you should restore one of your databases every night.

    I wasn't sure what the point was in this. If you're testing media, I'm not sure most places have the capability to move to tape and back off every night. If you're restoring from disk, then you aren't really testing the media. You're testing that you can get the restore done, but it's a limited test.

    And you're not building skills because no DBA in the world is going to manually store a database every night. They'll automate it, which means that if it ran without an issue for a year, they hadn't kept their skills sharp. So I'm wondering something:

    How often should you restore?

    I think I'm looking here for some guidelines on why you give your answer. What is the purpose of restoring the database.

    I was originally thinking that it didn't make sense to do a restore every day, but now I'm thinking that every night seems like a good idea. If I had the space, I'd restore every database every night after its backup and have it ready in case someone pulled the classic "delete with no WHERE clause" (probably me).

    If I was in a large environment, with dozens or hundreds of databases, I think I'd set up some sort of rotation so that I was restoring some database every night, automatically of course, to both test the backups and have some data ready to go. I'd also add I think a DBA should manually initiate the restore at least every month to be sure they've got the skills ready when they need them.

    And after hearing Paul Randal explain the ins and outs of checkdb, I might even be sure to have that running on all my restored databases.

  • I only go to restore when I need to restock!

  • Looking after development, test and live environments I find I do restores two or three times a week without the need to choose databases - people are always wanting dev databases moved to test, live to test, changing servers and so on and I always do this with backup and restore.

    As an application developer as well I have quite enough to do without making more work for myself!

  • I believe it depends on the importance of the system to the existence of the company.

    Restores for financial and transactional systems should be tested at least once a month in my opinion, since most business cycles are monthly.

    Reporting systems can do with a little less, since they're less critical.

    And I think it's a good habit to test backup and restore after changes have been implemented in the database or in the system as a whole. Maybe it has grown in size and suddenly it doesn't fit on the media you're using anymore.

     

  • The short answer is "as often as possible." What this translates to varies widely based on system, space, etc.. The reason is pretty much as you state. You need to know that the backups are working and workable. Further, most people set up backups and walk away, happy & content in a warm backup blanket. That's until the day that something goes south and they suddenly need to run a restore. Now what was that syntax again. Oh crap, why am I getting this error message. The sweat starts popping out as the gaggle of VP's behind your chair observe & take notes on every keystroke & mouse click... Just practice the restores.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hmmm... testing backups, there's a good idea...

    Steve's question "How Often..." surely deserves It depends...  But I like his point that you have to avoid making routine something that you will called upon to do only in a non-routine situation.  It is one thing to test the quality and usability of backups as part of a routine, but another to test the restore procedures.

    I do enough occasional restores when moving databases to do an ad hoc restore any time.  But I have not tested the entire process:  can our Ops people lay hands on the right tapes, can they restore the required files using Veritas, and can I then restore these to resurrect the required database(s)?  It's not on the same scale as testing disaster recovery, but I suspect the exercise would reveal some communication and procedural gaps.

  • Our testing is when we need the backup... which has been rare.  maybe twice since i've worked at my current company.

    Our test and live db's are pretty out of sync though.  I'm trying to convice the team that we should roll the live backups to the test once a month to keep things in better sync. 

    Adding a bonus of being able to verify the validity of our backups and getting monthly practice at restoring backups might be something to add to the pros list to convice them.

  • In a month I have pretty much restored all our databases from production to dev and test.  Those are all my disk backups of course.

    I do however randomly ask for tapes to be pulled from Iron Mountain to be tested.  In May 2007 I requested a tape from a particular day in April 2006!  Restored just fine (as a side note, that backup was made on an unpatched SQL 2000 server, restored to SQL 2005 SP2).

    Bigger question with tapes is of course the ability to restore the tape if the tape software has changed since it was made.

    I do however not have a fixxed schedule for doing restores.

    About every 6 - 9 months I like to take a brand new server and try to do a complete reinstall of our environment to make sure my documentation is up to date.  The restore of a single database is pretty simple and really ought not to be a problem for a DBA with some experience, rebuilding the environment to get it back to production if your server completely died can be somewhat of a challenge.

  • We restore the main production DB backup every night as a reporting backup or in case we need it for some data-related issue.  But checking backups is a very good idea.  Case in point:

     

    One day a developer ran a query against production (don't get me started -- the developers have read-write access to prod in our environment and I haven't won the battle to lock that down yet) that incorrectly deleted about 10,000 notes.  We wanted to recover them all and we knew the time of day the query was run.  So just restore the backup and the log backups to just before then, right?

     

    Wrong -- the log backup for 6:45 AM that day was corrupt.

     

    There wasn't a thing I could do -- the RESTORE VERIFYONLY said it was a legit backup, externally it looked fine, we had a Microsoft support specialist look at it and could not make it work either.  For whatever reason, that one log backup was corrupt and useless, thereby making it impossible to recover data for the rest of that day.

     

    Since then we have implemented log shipping for DR.  That way we have an assurance that our transaction log backups at least are always good as well as having a resource in case of disaster.

  • We restore all our Dbs to our DR site daily and run a homegrown log shipping solution. This assures us our backups are working and are restorable. we also move them down to dev and test frequently.

  • Maybe another interesting take to throw out there: Seems we have been talking mostly about user db's. What about system db's?

    I know that I have had to restore the system db's once about 3 years ago. I have also moved them from on server to another. That is something that although one doesn't need to do often, it is a pretty important task to be able to accomplish.

  • "delete with no WHERE clause"

    Hehe... sometime in my first week of my first real job, I did an 'update with no WHERE clause' and I was staring at the message saying "953 Row(s) Affected" and wondering how the hell I was going to cover that up, developers all over the office started to stand up and look around, as if someone had unplugged the network or something. I had already realized what I did, and it wasn't too long before people started to notice that I was the only one in the office who looked like I was about to cry. I got to spend the next 2 days going through documentation of our various systems to figure out and manually replace some of the data which wasn't on our backup. I had to search through C++ COM code to find the ID numbers of stored procs, and then try to identify which procs went to which ID numbers, by inferring what the code was trying to do when it called 'proc 34' or whatever. That experience taught the company that we had a really weak single point of failure in our system which was simply data in a table, so it ended up being a good thing, but god I was freaked out when I saw that message...

    953 Row(s) Affected

    I will never forget that.

  • I remember reading of a site where they suddenly needed several year old backups, and they pulled the tapes from the archive, mounted them,... and then realized that no one had kept a record of the password rotation.

     

     

    ... oops ..

    ...

    -- FORTRAN manual for Xerox Computers --

  • Since this seems to have turned into a horror story thread.... (gee, that could be a new forum for the site and I bet it would see a lot of growth! Name it Stupid SQL Tricks or something.)

    Former boss needed to do a delete, he had a barely functional understanding of SQL Server. Consultant in California (friend and former co-worker of the boss) tells him what to type in. He types it in, hits execute, wipes out 85,000 records. It might have been the entire table, I don't recall.

    Fortunately I had good tlog backups and did a point in time recovery to just prior his delete.

    I then showed him:

    --delete xyz

    select * from xyz

    where foo = barr

    so that he could see what records would be affected, then uncomment the delete line and comment out the select.

    I think the initial problem was that the consultant, making more money than I did, created all of the date fields as nvarchar and didn't convert the field to a date type prior to doing the compare. Something like that. The entire table was a lousy design and littered with huge nchars and it was not an international database. Millions of rows, who knows how much wasted space. I'm very glad to be out of there.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • The frequency of the backup / restore process will be dependent on your environment.  My recommendation is to have at minimum a documented / tested backup recovery plan. 

     

    1. Look at the backup / restore process of a database as a small piece of the big picture of disaster / recovery process of an entire server or group of servers. If my database sits on that server, I want to make sure everything my databases uses is backed up by the network staff. (I have a story about networking staff changing the backups and failing to get files - led to his firing)
    2. Mixed Backup Processes:  If you have very large, transactional databases, you may be doing a variation of full, differential, and transactional backups during an actual disaster.  Make sure you have this process well documented.  I would recommend testing this process on a smaller database, if your production databases are to large.
    3. System Databases: Make sure you can restore the server incase a system database became corrupted.  At one company they placed special stored procedures in the master database, so there is an extra step.
    4. Document any special server options.  One company set the ‘no count’ on the connections tab.
    5. Disk / Tape – I recommend the DBA be responsible for backing up databases to disk and then have the network staff backup the database backups.  I hate to rely only on tape backups.  In my experience they have not always been reliable.
    6. A DBA should be responsible for backing up databases.  Don’t let networking backup your databases to tape with a third party tool.  I have had issues with hanging spids in my database and also backups that were supposed to be successful failing on restore.
    7. Replication: I have much experience in Transactional Replication.  Make sure you know how to restore the publisher, distributor, and subscriber.  Any special coding, fields, or permissions you specifically place on the subscriber are overwritten by a snapshot.

     

    I have been through several server upgrades, including ones that did not go as planned and plenty of crashed servers.  Also, I was the DBA during disaster recovery drill and these events have had a profound affect on how I view the backup / restore process.

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply