DR test - failover with potential data loss on live server.

  • Hi All,

    So I want to test failover to my DR site. My current setup is two synchronous replicas in site A and one asynchronous replica in site B. It's a busy live server. How would you do this?

    Options I see:

    • Tell the client they risk data loss and put them off.
    • Set the site A to site B link to synchronous, wait until it is synchronised, then fail over. Tell the client we risk a performance hit on whichever replica is primary.
    • Fail over with potential data loss, but stop all activity on site A and monitor the DMVs until it looks like we're all up to date. Fail over. Fail back.

    Open to any other ideas?

     

  • Do the clients fully understand the risk of lost data and uptime in testing on the live system? Would a test on duplicate environments suffice? Some variables differ in a duplicated environment (IP addresses, server names, permissions if you aren't careful, maybe performance if the specs aren't identical), so it's not a perfect test of the live system. Otherwise the client needs to be willing to take the risk.

    What are the client's RPO (recovery point objective -- how many minutes of data are they willing to lose?) & RTO (recovery time objective -- how quickly do they require failover to complete?)?  That should determine whether (and how long) you wait to synchronize vs. failing over immediately.
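
    To put rough numbers against RPO and RTO, something like the query below can help. It is only a sketch, assuming availability groups and that it is run on the current primary. The send queue (log not yet shipped to the secondary) is the data you would lose in a forced failover, so it relates to RPO; the redo queue (log received but not yet applied) affects how long the secondary takes to come online, so it relates to RTO.

    ```sql
    -- Run on the current primary: one row per database per replica.
    SELECT ar.replica_server_name,
           DB_NAME(drs.database_id)       AS database_name,
           ar.availability_mode_desc,
           drs.synchronization_state_desc,
           drs.log_send_queue_size        AS log_send_queue_kb,  -- log not yet sent (RPO exposure)
           drs.redo_queue_size            AS redo_queue_kb,      -- log not yet redone (RTO exposure)
           drs.last_commit_time
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
      ON ar.replica_id = drs.replica_id
    ORDER BY database_name, ar.replica_server_name;
    ```

    Sampling this a few times during the busy period gives a feel for how far behind the async replica normally runs.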

    What DR approach are you using? Availability groups? Mirroring? Log shipping? Third-party/custom?

    If client wants to avoid risk to live system:

    • You could duplicate the environment on separate servers in the same sites (if there is sufficient network throughput to handle the extra traffic), or in duplicate sites (if you can't risk any network contention), to minimize impact to the live production system.
    • If that's not possible, and the servers are up to the task, you could duplicate the databases on the same servers and fail over the copies.
  • So I have an AOAG set up.  I'm not keen to go for it.  Duplication isn't really an issue - (don't ask! there are lots of constraints here).

    I'm presuming it's possible to:

    1. cease all activity on site A (perhaps set all the DBs to single-user mode), then watch the DMVs until there are no queues and nothing pending to be sent
    2. force the failover allowing data loss (but knowing there shouldn't be any, given the log LSNs from the DMVs)

    Unless there's a proper prescribed way of testing. I'm not suggesting this is a good idea.  Just that it's presumably logically possible. I'm just prepping for all my questions from the client.
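
    For step 1, the DMV check I have in mind is roughly the following. This is a sketch only, run on the primary once activity has stopped; it puts the primary's and each secondary's hardened LSN side by side with the queue sizes. The LSN columns in this DMV reflect log block IDs rather than exact record LSNs, so I'd treat them as a sanity check alongside a zero send queue, not as proof.

    ```sql
    -- Run on the primary: compare each secondary database against the primary copy.
    SELECT DB_NAME(p.database_id)   AS database_name,
           ar.replica_server_name   AS secondary_replica,
           p.last_hardened_lsn      AS primary_last_hardened_lsn,
           s.last_hardened_lsn      AS secondary_last_hardened_lsn,
           s.log_send_queue_size    AS log_send_queue_kb,  -- want 0 before failing over
           s.redo_queue_size        AS redo_queue_kb,
           s.synchronization_state_desc
    FROM sys.dm_hadr_database_replica_states AS p
    JOIN sys.dm_hadr_database_replica_states AS s
      ON s.group_database_id = p.group_database_id
     AND s.is_primary_replica = 0
    JOIN sys.availability_replicas AS ar
      ON ar.replica_id = s.replica_id
    WHERE p.is_primary_replica = 1
    ORDER BY database_name, secondary_replica;
    ```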

    Totally take on board that there is a risk of data loss & this should be done on a lovely test rig that duplicates all on live.

  • What are you testing here? A nice, safe, planned failover from A to B to check the DR site works? If that is the case then I would switch to synchronous, take a possible performance hit and switch sites with no data loss. You will then be able to switch back, again cleanly with no data loss. This will show the DR site works once any apps are connected to it and tested.
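
    A rough sketch of that planned route in T-SQL. The names [MyAG] and SITEB-SQL01 are made up for illustration, and each step runs on the server named in its comment, so this is not one script to run top to bottom.

    ```sql
    -- 1. On the current primary (site A): make the site B replica synchronous.
    ALTER AVAILABILITY GROUP [MyAG]
        MODIFY REPLICA ON N'SITEB-SQL01'
        WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

    -- 2. On the current primary: repeat until every database on site B
    --    shows SYNCHRONIZED and is_failover_ready = 1.
    SELECT ar.replica_server_name,
           drcs.database_name,
           drs.synchronization_state_desc,
           drcs.is_failover_ready
    FROM sys.dm_hadr_database_replica_cluster_states AS drcs
    JOIN sys.dm_hadr_database_replica_states AS drs
      ON drs.replica_id = drcs.replica_id
     AND drs.group_database_id = drcs.group_database_id
    JOIN sys.availability_replicas AS ar
      ON ar.replica_id = drcs.replica_id
    WHERE ar.replica_server_name = N'SITEB-SQL01';

    -- 3. On the site B replica (the failover target): planned manual failover.
    ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

    -- 4. Once you have failed back to site A, on the primary:
    --    return the site B replica to asynchronous commit.
    ALTER AVAILABILITY GROUP [MyAG]
        MODIFY REPLICA ON N'SITEB-SQL01'
        WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
    ```

    Failing back is the same FAILOVER command, run on the site A replica that is to become primary again.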

    However, if you are testing something more realistic, where you've suddenly lost site A for some reason, then you'd have to do a failover with (potential) data loss. Then work out the plan to get back to site A.
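
    For reference, the forced version looks roughly like this. Again [MyAG] and [MyDatabase] are placeholder names. After a forced failover, data movement to the other replicas is suspended, and resuming it on the old primary discards any log there that never reached site B, so check for missing data before you resume.

    ```sql
    -- On the site B replica: force it to become primary, accepting possible data loss.
    ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;

    -- Later, on each site A replica once it is back online, for each database:
    -- resume data movement from the new primary.
    ALTER DATABASE [MyDatabase] SET HADR RESUME;
    ```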

  • Fair question.  Good point made.  Thanks
