Disaster Recovery of 60TB SQL DB

  • Hi,
    we have a large SQL 2012 DB - around 60TB. We are required to provide a DR solution for it. We are thinking of either log shipping or Double Take. We are also required to perform a DR drill once every 6 months. Our preference is log shipping over DT. Want to know how can we do the DR drill.

    If we stop the log shipping during the drill, will we be able to bring up the secondary in a read/write mode ?
    Will the logs be backed up at primary or secondary
    Will log shipping start from where it left off
    Is DT  a better solution ?

    Thanks

    TP

  • When planning for recovery, I favor 'lower tech' solutions - keep things as simple as you can while still hitting your RPO/RTO. I deal with log shipping on 100TB++ databases and a 24/7 SLA, where simplicity is critical to fixing things that have gone wrong in the shortest time.

    Log shipping is as simple as it gets: you should already be taking frequent log backups with a database that size, all you need to add is something that copies those backups somewhere else, and restores them continuously (WITH NORECOVERY) to another database on another instance. I also encourage you to build that yourself, so you understand and have complete control of each step. You may also consider taking log backups more frequently, as the DR copy will only hold transactions that have been backed up from the primary and restored to the DR copy.
    There's a log-shipping Wizard in SSMS, but if you use that, you will need to become deeply knowledgeable on the Wizard, how it works, and what you must do to it in the event of a disaster/failover to keep it from torching one of the databases.

    From there, you must be prepared to do the failover in the event of a catastrophe:
    1. Take a 'tail of the log' backup (WITH NOTRUNCATE, NORECOVERY) from the primary, if possible
    2. Ensure all available log backups are restored (WITH NORECOVERY) to the DR copy
    3. Recover the DR database (RESTORE [...] WITH RECOVERY)
    4. Point all applications to the DR (now primary) database.
    5. Any transactions that committed on the Primary but did not get backed up and restored to the DR copy are lost.

    For DR testing, please understand that once you recover a log-shipped database, you cannot continue to apply log backups from the original database, unless you leave the original database in a state that can resume log shipping. 

    To make it possible to switch back to the original primary after you have recovered the DR database:
    0. Keep the instances hosting the Primary and DR databases patched to the same Service Pack and Cumulative Update. This is not an absolute requirement, but a very good idea anyway.
    1. Pause/disable all log-shipping and backup jobs.
    2. Take the final log backup from the original Primary WITH NORECOVERY, which will place the original Primary in Recovery (and also unavailable to users). This will leave it in a state where you can take log backups from the recovered DR copy and apply them back to the original (reversing the log shipping direction). 
    3. Restore all unrestored log backups (WITH NORECOVERY) to the DR copy, including the backup you just took on the Primary.
    4. Recover the DR copy, update pointers to the new database, perform all tests.
    5. (When it is time to fail back to the original Primary) Reverse the log shipping: take log backups from the live DR database and restore them on the original Primary 
    6. When you're ready to switch the primary from the DR copy back to the original, repeat the steps 2-4, switching the original primary for the DR copy:
         i. Take final log backup from DR copy WITH NORECOVERY
        ii. Apply all log backups taken from the DR copy to the original Primary WITH NORECOVERY
       iii. Recover the original primary, update application pointers to this database, resume operations

    Logins, jobs, Linked Servers, Service Broker Routes, and many other instance-level things are not part of the database, and will not be transferred with the database backups. You must set these up before the disaster.
    After recovering the database in another instance, you may also need to perform additional steps depending on the features in use in the database. For example, if you have a Database Master Key (DMK) encrypted with the Service Master Key, you must also decrypt the DMK with a password (created before the DR failover) and re-encrypt it with the Service Master Key on the new instance. If you're using TDE, then you must set up the TDE Certificate or EKM objects (Provider, Credential, Asymmetric Key) prior to starting log shipping, and so on and so on - this is why you absolutely must test your DR recovery prior to an actual disaster. The metadata you need may not have survived the disaster, so be sure you copy it out beforehand.

    If you wish to simply test recovery and access to the DR copy without taking the Primary offline, you can RESTORE DATABASE ... WITH STANDBY, which will recover and open the database in read-only mode. You can query it, check logins and so on, but won't be able to make any changes to it. You also can put it back in Recovery and continue log-shipping without doing any work.

    Before fiddling with your production systems, you can set up a couple SQL instances on your computer and/or in an Azure/AWS/GCP/etc. cloud demo account, create a tiny database in Full Recovery mode with a just few tables and a few rows to play with backups, recovery, log-shipping and switching between different log-shipped copies. You'll do yourself favors by going somewhere safe and breaking things a few times to see where things can go wrong and how to prevent and solve those things so your personal toolbox is ready to roll when disaster strikes. 

    Always assume disaster will strike. You'll make better recovery plans.
    Good luck 🙂
    -Eddie

    Eddie Wuerch
    MCM: SQL

  • To answer your main questions:
    If we stop the log shipping during the drill, will we be able to bring up the secondary in a read/write mode ? 

    • You can bring the DB up in Standby Mode and read from it, but you can't write to it. 

    Will the logs be backed up at primary or secondary

    • With log shipping, log backups are only possible on the Source Database, 

    Will log shipping start from where it left off

    • Yes and no! If you failover, and recover the DR side without using standby mode, you will need to re-establish log shipping from scratch in the other direction, UNLESS you are very careful. There is a way of ensuring a controlled LS failover can be configured in the other direction without a full backup and restore. Too much to add in this post, but you could probably look it up on the net.

    Is DT a better solution ?

    • Personal opinion only - No. As a DBA I like to be in control of my DR/HA and not pass it off to a System Admin. Also, why pay for DT when MS SQL already has some really excellent DR/HA solutions that are fully integrated and paid for.

    I would be looking at AlwaysOn High Availability if this is at all feasible on your infrastructure.

    • Not needing to modify DNS entries or connection strings to re-point applications to the DR server as this is managed by the Windows Failover Cluster Manager and DNS automatically via the Listener.
    • No need to re-establish the synchronization process (from backup) after a test failover, assuming you do updates on the DR side when testing.
    • No "tail of the log" backup, manual restores, etc. AoHA is ensuring zero data loss (Synchronous) or minimal data loss (async and manual failover).
    • You can take backups against the Secondary
    • The Secondary provides some protection against database corruption
    • Failover can be set to Automatic, providing HA and DR

    The one disadvantage of AoHA is that it doesn't allow for Point in Time recovery, where you need to get back to before some corrupted point, but I'm assuming from your post that you aren't looking to recover from this type of disaster. Obviously you need Enterprise Edition or SQL 2016 SP1, but I'd generally not be running a 60TB database on 2012 Standard because I'd want to use some features that are Enterprise only.

    We as a company have multiple client with large databases, and it's not that uncommon for them to use AoHA.

    Cheers
    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complex.

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply