Sql Server DR Strategy - Advice Reqd

  • Hi SQL Guru’s,

    We plan to implement a DR strategy for our mission critical production server. We would need your advice on choosing the best strategy for implementing this.

    Production Database Server Environment:

    Database Server: Active/Passive Sql Server 2008 (Clustered Instance)

    The production Sql Server instance (Server X) will have all the database and log files (including system databases) in the SAN storage (LUN A). There are approx 10- 15 databases on the production server.

    Staging/DR Database Server Environment:

    Database Server: Active/Active Sql Server 2008 (Clustered Instance)

    The Staging Server will have all the database and logfiles (including system databases) on LUN C.

    DR Strategy Proposed:

    This being a mission-critical database server the RPO and RTO has been defined as 15 min.

    Fig(i):

    We have file replication happening from LUN A to LUN B. Please refer fig (i) for the setup defined. LUN B is not connected to the clustered database instance (Server Y) at normal case. At the time of disaster in production server X, LUN C will be disconnected from server Y, instead LUN B will be connected as the Storage for Server Y.

    The Server version and edition (Service Packs and Hot Fixes) will be the same for X and Y.

    We visualized three scenario’s to ensure availability of the standby server (Y) for application connectivity:

    Scenario 1:

    1. We will re-point the master database path (database and log file) in the start-up parameter via the Sql Server Configuration Manager.

    2. Since this master contains the details of all the user databases the server should automatically be able to recognize all the production databases.

    3. We would have to probably rename the server instance (using sp_addserver and sp_dropserver) to point to the original instance name (i.e Server Y) as the master will be pointing to Server X.

    Advantages:

    1. No need for login creation and remapping.

    2. SSIS and Jobs are already present in the “msdb” database so eliminating the need to transfer them separately.

    3. Resource database will also be mapped correctly.

    Concerns:

    1. We have not tested re-pointing the master databases via the start-up parameters. Will this method work?

    2. This being a clustered instance would any complication rise? If so, kindly elaborate.

    Please advice.

    Scenario 2:

    Before Disaster – one time activity:

    1. Install one more Sql Server instance on the Staging Environment called instance Z on Server Y (keeping the system databases alone on the SAN - not on LUN C).

    2. Keep a backup of “master” and “msdb” databases (from server X) on LUN A so that it gets replicated to LUN B.

    3. As a one time activity, attach LUN B. Restore the backup of master (from server X) on instance Z.

    4. We would have to probably rename the server instance (using sp_addserver and sp_dropserver) to point to the original instance name (i.e Server Z) as the master will be pointing to Server X.

    5. Repoint the resource database.

    6. Restore the msdb database on instance Z.

    7. All the user DBs should be automatically recognized. If not, manually attach the user DBs.

    8. Stop the Instance Z.

    9. Detach the LUN B.

    At the time Disaster

    9. At the time of disaster, we only need to detach LUN C, reattach the LUN B and start the instance Z.

    10. May require to fire the login script (to get the latest logins) and probably restore the “msdb” database to get the latest job and packages.

    11. May require attaching some of the newly created user DBs.

    Advantages:

    1. Very little overhead of making the Standby Server online for applications at the time of disaster.

    Concerns:

    1. A backup of “master” and “msdb” must be taken on LUN A everyday.

    2. The login script must be taken to LUN A everyday.

    3. Manual overhead to check for any newly created databases and attach them if required.

    Please advice.

    Scenario 3:

    1. Re-point the Server Y to store the system databases on the SAN (LUN which are not part of LUN C).

    2. Take the login script and “msdb” backup from Server X and keep on LUN A.

    3. At the time of disaster detach LUN C and attach LUN B on instance Y.

    4. Fire login script and restore msdb database.

    5. Attach all user databases manually on Server Y.

    Advantages:

    1. Minimal overhead of bringing the Server Y online as standby for Server X.

    Concerns:

    1. A backup of msdb must be taken on LUN A everyday.

    2. The login script must be taken to LUN A everyday.

    3. Manual overhead of attaching user databases.

    4. Increased RTO and RPO

    Please advice.

    Scenario 4:

    We may use Image backup. But this will definitely breach the RTO and RPO. Please advice if any other mechanism is possible.

  • pdchandran (8/21/2009)


    We have file replication happening from LUN A to LUN B.

    What kind of file replication are you using and has it been certified to work with SQL Server databases?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hey there,

    Ok where to start. Firstly I think you may be over complicating things for yourself. I'll start by saying what we do and hopefully someone else will be able to help further cause im off on holiday in 10 min.

    We have an a/p (sql 2005). The san is mirrored to our scf san (but we have found that unpresenting, re-presenting lun's is time consuming and tricky). Our scf is the other side of the city, i am assuming you dr cluster is in the same location?

    Basically we mirror all the databases to the scf sql server and have ssis packages to handle logins, jobs and that stuff (time consuming to set up but after it just runs).

    In a dr test you can manually failover the mirrs. In a real dr you just remove the endpoints and restore the mirrored db's with recovery. Job done.

    In your case with a dr cluster you have a failover for you dr as well.

    One thought though, if your in the same network you could have a 3 node cluster with the third node being your dr node (if you loose both prode nodes the dr come into play), obviously here the san LUN becomes a spof.

    Its entirly upto you, but i think our way is simpler, easier to test and maintain.

    Hope this helps, seeya im off.

    🙂

    Adam Zacks-------------------------------------------Be Nice, Or Leave

  • The important which I had missed is - Sql Server has the database and log files locked as long as the Sql Server service is running. It would not be possible to copy/replicate these files when they are in locked state. So all the scenarios which we had proposed earlier would fail. Its a shame that I had not thought of that.

    The only way implement DR is to use in-built Sql Server features namely Log-Shipping or Mirroring. Probably Log-Shipping would be a much better, since this is a remote DR site, nearly 200 miles away from the production Data-Centre. We will install a new instance on this site and implement log-shipping for all the 20 DBs one by one.

    THanks all for your support.

  • pdchandran (8/23/2009)


    The important which I had missed is - Sql Server has the database and log files locked as long as the Sql Server service is running. It would not be possible to copy/replicate these files when they are in locked state. So all the scenarios which we had proposed earlier would fail. Its a shame that I had not thought of that.

    The only way implement DR is to use in-built Sql Server features namely Log-Shipping or Mirroring. Probably Log-Shipping would be a much better, since this is a remote DR site, nearly 200 miles away from the production Data-Centre. We will install a new instance on this site and implement log-shipping for all the 20 DBs one by one.

    THanks all for your support.

    That really depends upon what SAN you are using and whether or not they can perform snapshots of the volumes for SQL Server databases.

    Netapp SAN's have a product call SnapManager that 'freezes' the SQL Server database and snaps the LUN. This snap can then be backed up, restored on another server - used in a SnapMirror configuration and other options. This production uses the Microsoft VDI (Virtual Device Interface) to interact with SQL Server and allow for the 'copying' of database files while online. This process only takes a few seconds on very large databases.

    Other SAN's have other software that allow for the same thing - however, all of these solutions do not have the ability to snap the system databases LUN. In any business continuity scenario - you are going to have to account for either rebuilding the server and restoring the user databases, or have a process in place to restore the system databases.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • The typical solution we use at my company is the following:

    1. Active/Passive Primary instance.

    2. Standalone secondary instance for DR (can certainly be clustered, but we only expect it to be used in a disaster.. so we generally dont get the same exact config due to cost)

    3. Asynchronous mirroring without a witness between the two sites.

    4. Level-4 VIP for SQL re-direction and application configuration (global load balancing).

    We use asynchronous for performance and because our backup site is across the country.

    We don't use a witness because of the primary instance being clustered. If Node A (primary host for the clustered instance) goes down, SQL Server must go offline in order to come back online on Node B. With a witness, it may fail over to the DR partner without staying on the clustered instance.

    The level-4 VIP allows us to create a standard name for all application connection strings to use. It redirects the traffic between the primary clustered instance (also a virtual name), and the standalone DR instance. This is configured for a 90 second timeout where if the global load balancer sees the primary clustered instance offline for 90 seconds, it will then redirect application traffic to the DR site. If this happens, you may need to forcibly bring the mirrored databases online.

    A drawback to this approach is that all logins, scheduled jobs, ssis packages, etc that are on the primary instance need to be implemented on the DR site, disabled, then re-enabled in a DR situation where the DR site becomes live.

    Hope that gives you an additional approach.

    Steve

  • Hi Jeff,

    I will check and come back to you, as to what type of replication the Server team is planning to implement.

    Steve,

    Our primary instance has about 20 DBs. Would be a good idea to mirror all of the? I heard that the official MS recommendation is not to mirror more than 10 DBs on an instance.

  • Sorry about that- I overlooked the 10-15 databases you mentioned up top. Yeah, you are correct that mirroring that many might pose a problem depending on load. However, there is no official number, Microsoft's recommendation I would say suggests a certain amount of load will be present. Mirroring will send over transactions constantly, so that's where the performance hit will come into play.

    Best of luck,

    Steve

  • users are complaining that database are responding very slow?

  • ferozsql (9/19/2012)


    users are complaining that database are responding very slow?

    create a new post for this question...the original post is from 2009, and very very few peopel will follow it.

    also include a lot more information...remember we cannot look over your shoulder...

    is this all of the sudden? what cchanged?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply