Current best practice wrt very large databases (100GB) + wrt Backup / Restore / Archive ?

  • Envrionment:

    1. 1 Production server (VMWare, Server 2008, SQL Server 2008 R2 Enterprise)

    2. 1 SAN ( SQL server storage allocated 3.5TB (600GB SAS drives X 12))

    3. 100 databases (avg size 25GB, largest DB 980GB, sum DB sizes 2.6TB)

    Currently:

    SQL Server OS vmdk has image level snapshot copied to archive, to be stored offsite daily.

    Backups made of critical databases: Full daily, 15 minute trans; Full weekly, daily diffs, 15 minute trans; Daily Fulls; Weekly Fulls; Monthly Fulls with Daily diffs, in order of criticality and Acceptable Loss policy.

    Backups stored on seperate server (not on SAN, not virtualised, Unix)

    Differences between backups copied offsite, weekly full sync of all databases.

    Scenario 1: Full server failure and all database destruction.

    Best way to mitigate this? Best practice?

    My intention is to have a secondary server offsite in standby mode ready with latest backups restored in read-only mode (much like log shipping, but bulk imports are recorded too).

    Scenario 2: Individual Database corruption of 1 or more databases

    Best practice?

    My SQL Server can restore a database to a point in time prior to detected corruption, if corruption is within Acceptable Loss policy restrictions.

    If only a table is corrupted, a restored database to a different server can be used as a source to repair the affected table.

    These are the two main scenarios, total server failure, and undetected corruption, and I was wondering if this is as good as it gets?

    The last time we had to do a restore of one of the larger databases it took 8 hours. We used VDI devices, striped across as many stripes as we have CPU's. If I had to restore all my databases to recover from Scenario 1 it would like take more than 24 hours.

    What's the best way to get the users back to work fast?

    If you have had the unfortunate experience of needing to rebuild a whole server, with recovery using restores, or other ways, give us a detailed description of what went wrong (scenario), what you needed to do (number of databases, size of data) and how long it took and what hardware you had it on. Also what you do differently now that you have had this experience.

    So long, and thanks for all the fishpaste 😉

  • Instead of log shipping if you're looking at offsite, check out mirroring. It's much more resilient.

    In general, you're doing things right. There's two issues when it comes to backups, RTO and RPO, or Recovery Time Objective and Recovery Point Objective. Taking RPO first, this is basically how much data loss the company can live with. The answer is always ZERO, but that costs a fortune. You're already largely doing well with this with a full backup and log backups, so except for setting up mirroring or upgrading to 2012 and getting synchronous availability groups in place with offsite storage, you can't do much more. RTO is the hard part.

    For an entire server going away, you have two options. First, you're exploring, a second server, somewhere else, with your data completely, or nearly completely, up to date and ready to go. Done. Second, recovery to a new server. With the second option, there's no magic bullet. It just takes time. So, two things you can do to mitigate this. First, practice the restore process as often as you reasonably can so that you know how to get it done when the time comes. Second, prioritize the restores so you get the most important stuff online first and the rest second. Additional stuff, like you already have, such as differential backups can help but they're going to add overhead as well. You just have to deal with it.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply