Always On Availability Groups Failover Question

  • Hi,

    I'm curious if anyone knows the answer to this off the top of their head since I'm not able to test this at the moment.

    Let's say I have an AOAG cluster with two servers, a primary (S1) and a secondary (S2) with one database (mydatabase) on SQL Server 2016 Enterprise.

    Then, let's say the following events occur:

    1) S1 goes down and "mydatabase" fails over to S2. Good--that's what's supposed to happen.
    2) S2 goes down, so there's no servers up at this point.
    3) S1 comes back up.
    4) S2 comes back up.

    Am I correct in assuming the following:

    1) Even after 3) there still isn't a usable database because "mydatabase" on S2 (which had become the primary) could've had transactions on it that S1 doesn't know about yet?
    2) After 4), S1 gets all of the transactions that occurred on S2 while S1 was down?
    3) S2 continues to be the primary until I manually fail it back over to S1 even though "mydatabase" is synced on both servers?

    I hope my explanation of the scenario makes sense. Would someone be able to confirm (or deny) my assumptions?

    Thank you,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • 1) Even after 3) there still isn't a usable database because "mydatabase" on S2 (which had become the primary) could've had transactions on it that S1 doesn't know about yet?

    •    Correct, with both down, no usable database replica will be available. 

    2) After 4), S1 gets all of the transactions that occurred on S2 while S1 was down?

    •   S1 will come online, but won't be in-sync until S2 is online either, it does not know about the changes yet, so correct assumption.
    •   Once S2 comes online, Synchronization can occur again.

    3) S2 continues to be the primary until I manually fail it back over to S1 even though "mydatabase" is synced on both servers?

    •     depends, how is your AG set up, Synchronous with Automatic Failover or Asynchronous with Forced Failover?

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    This thing is addressing problems that dont exist. Its solution-ism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle

  • Henrico Bekker - Monday, June 18, 2018 11:02 PM

    1) Even after 3) there still isn't a usable database because "mydatabase" on S2 (which had become the primary) could've had transactions on it that S1 doesn't know about yet?

    •    Correct, with both down, no usable database replica will be available. 

    2) After 4), S1 gets all of the transactions that occurred on S2 while S1 was down?

    •   S1 will come online, but won't be in-sync until S2 is online either, it does not know about the changes yet, so correct assumption.
    •   Once S2 comes online, Synchronization can occur again.

    3) S2 continues to be the primary until I manually fail it back over to S1 even though "mydatabase" is synced on both servers?

    •     depends, how is your AG set up, Synchronous with Automatic Failover or Asynchronous with Forced Failover?

    Thanks, Henrico.

    It's set up as Synchronous with Automatic Failover. Would you be able to tell me what would happen with S2 in either scenario--synchronous or asynchronous with forced failover?

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Tuesday, June 19, 2018 1:13 PM

    In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Thank you, Michael! I appreciate you testing that for me.

    I imagine that most people would prefer to use synchronous with automatic failover, correct? I mean, I can see how there's a bit of a performance advantage to asynchronous, but at the risk of losing data...? I can't imagine many people would think that's a smart trade off.

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Tuesday, June 19, 2018 2:08 PM

    Michael L John - Tuesday, June 19, 2018 1:13 PM

    In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Thank you, Michael! I appreciate you testing that for me.

    I imagine that most people would prefer to use synchronous with automatic failover, correct? I mean, I can see how there's a bit of a performance advantage to asynchronous, but at the risk of losing data...? I can't imagine many people would think that's a smart trade off.

    Mike

    You cannot use anything but synchronous with automatic failover.  And I think I misunderstood when I gave the answer for async.   That'a what happens when your brain and your eyes are out of sync. 

    In async, unless it's manually failed over, regardless of which one came offline or online, and it what order, there will be no primary unless manual intervention occurs,

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Tuesday, June 19, 2018 2:18 PM

    Mike Scalise - Tuesday, June 19, 2018 2:08 PM

    Michael L John - Tuesday, June 19, 2018 1:13 PM

    In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Thank you, Michael! I appreciate you testing that for me.

    I imagine that most people would prefer to use synchronous with automatic failover, correct? I mean, I can see how there's a bit of a performance advantage to asynchronous, but at the risk of losing data...? I can't imagine many people would think that's a smart trade off.

    Mike

    You cannot use anything but synchronous with automatic failover.  And I think I misunderstood when I gave the answer for async.   That'a what happens when your brain and your eyes are out of sync. 

    In async, unless it's manually failed over, regardless of which one came offline or online, and it what order, there will be no primary unless manual intervention occurs,

    Got it.
    "You cannot use anything but synchronous with automatic failover."

    I know. I was asking if most people go with synchronous over going the asynchronous route and opening themselves up for data loss. Do you have a sense if one is way more widely used?

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Tuesday, June 19, 2018 2:35 PM

    Michael L John - Tuesday, June 19, 2018 2:18 PM

    Mike Scalise - Tuesday, June 19, 2018 2:08 PM

    Michael L John - Tuesday, June 19, 2018 1:13 PM

    In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Thank you, Michael! I appreciate you testing that for me.

    I imagine that most people would prefer to use synchronous with automatic failover, correct? I mean, I can see how there's a bit of a performance advantage to asynchronous, but at the risk of losing data...? I can't imagine many people would think that's a smart trade off.

    Mike

    You cannot use anything but synchronous with automatic failover.  And I think I misunderstood when I gave the answer for async.   That'a what happens when your brain and your eyes are out of sync. 

    In async, unless it's manually failed over, regardless of which one came offline or online, and it what order, there will be no primary unless manual intervention occurs,

    Got it.
    "You cannot use anything but synchronous with automatic failover."

    I know. I was asking if most people go with synchronous over going the asynchronous route and opening themselves up for data loss. Do you have a sense if one is way more widely used?

    Thanks,

    Mike

    I guess the decision to pick one over the other rests with your needs. 

    For example, here is the setup of a few of the AG's in my environments (I have 8 total)
    AG#1, Primary and 2 secondaries. Both sync and async
    The primary and one secondary live in the same data center, same subnet, same server specs.  They are set up to be synchronous.  The secondary is read-only, and is used for reporting and any select queries.  Read-only routing has been configured between these 2 servers. 
    The other secondary lives in a different data center in a different state.  This is set to asyc, it doers not have as many resources allocated to the VM, and is used for DR only.  

    This provides little downtime and no data loss for things like patching, and less than 15 minutes (that's the SLA/RPO for us) in a DR situation.  The servers that are synchronous automatically fail over and back, the async server will only come alive if I take manual action. With read-only routing set up, we have also distributed the workload. AG#2, Primary and 1 secondary, syncronous
    Simply, this is HA.  Functions in the same manner as a cluster.  The bonus is that we have DR because the secondary is at the co-lo. It's set to automatic failover. The secondary is not read only, there no need.

    AG#3, Primary and 1 secondary, asyncronous.
    This is DR only, there is no need for HA. It's an internal operation(s) AG, It get's patched once a quarter, and is unavailable for a few hours, it's no big deal.  The RPO is 24 hours, the RTO is 15 minutes.  The longest measured recovery time of the data was 12 minutes, it averages 9 seconds.  

    Confused yet?  That makes 2 of us!

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Tuesday, June 19, 2018 3:05 PM

    Mike Scalise - Tuesday, June 19, 2018 2:35 PM

    Michael L John - Tuesday, June 19, 2018 2:18 PM

    Mike Scalise - Tuesday, June 19, 2018 2:08 PM

    Michael L John - Tuesday, June 19, 2018 1:13 PM

    In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Thank you, Michael! I appreciate you testing that for me.

    I imagine that most people would prefer to use synchronous with automatic failover, correct? I mean, I can see how there's a bit of a performance advantage to asynchronous, but at the risk of losing data...? I can't imagine many people would think that's a smart trade off.

    Mike

    You cannot use anything but synchronous with automatic failover.  And I think I misunderstood when I gave the answer for async.   That'a what happens when your brain and your eyes are out of sync. 

    In async, unless it's manually failed over, regardless of which one came offline or online, and it what order, there will be no primary unless manual intervention occurs,

    Got it.
    "You cannot use anything but synchronous with automatic failover."

    I know. I was asking if most people go with synchronous over going the asynchronous route and opening themselves up for data loss. Do you have a sense if one is way more widely used?

    Thanks,

    Mike

    I guess the decision to pick one over the other rests with your needs. 

    For example, here is the setup of a few of the AG's in my environments (I have 8 total)
    AG#1, Primary and 2 secondaries. Both sync and async
    The primary and one secondary live in the same data center, same subnet, same server specs.  They are set up to be synchronous.  The secondary is read-only, and is used for reporting and any select queries.  Read-only routing has been configured between these 2 servers. 
    The other secondary lives in a different data center in a different state.  This is set to asyc, it doers not have as many resources allocated to the VM, and is used for DR only.  

    This provides little downtime and no data loss for things like patching, and less than 15 minutes (that's the SLA/RPO for us) in a DR situation.  The servers that are synchronous automatically fail over and back, the async server will only come alive if I take manual action. With read-only routing set up, we have also distributed the workload. AG#2, Primary and 1 secondary, syncronous
    Simply, this is HA.  Functions in the same manner as a cluster.  The bonus is that we have DR because the secondary is at the co-lo. It's set to automatic failover. The secondary is not read only, there no need.

    AG#3, Primary and 1 secondary, asyncronous.
    This is DR only, there is no need for HA. It's an internal operation(s) AG, It get's patched once a quarter, and is unavailable for a few hours, it's no big deal.  The RPO is 24 hours, the RTO is 15 minutes.  The longest measured recovery time of the data was 12 minutes, it averages 9 seconds.  

    Confused yet?  That makes 2 of us!

    Michael,

    Thanks for all of this info--it's really great to see what others are doing. Believe it or not, I actually understand everything you wrote. I do have one questions, though--In both of your DR scenarios (AG#1 and AG#3), I noticed you're using asynchronous mode. Yes, I know that's better for performance (especially if your secondary lives in another state), but you are opening yourself up for data loss, correct? So, is it that you just have an acceptable %o of potential data loss as part of your DR policy? If so, then I see why asynchronous makes sense, but if execs are saying we must have 0% data loss, well then, regardless of performance, you'd have to use synchronous, right?

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Monday, June 25, 2018 6:44 AM

    Michael L John - Tuesday, June 19, 2018 3:05 PM

    Mike Scalise - Tuesday, June 19, 2018 2:35 PM

    Michael L John - Tuesday, June 19, 2018 2:18 PM

    Mike Scalise - Tuesday, June 19, 2018 2:08 PM

    Michael L John - Tuesday, June 19, 2018 1:13 PM

    In synchronous mode, after #3, there will be no active server.  S1 will not be an a state that will allow for no data loss, so automatic failover will not occur.  You can force this to failover by running the following command:
    ALTER AVAILABILITY GROUP AGName FORCE_FAILOVER_ALLOW_DATA_LOSS;

    In async mode, after #3, there WILL be an active server.  S1 will fail itself over as the primary. When S2 comes back online, it will be the secondary. 
    That being said, I did have an AG set up in the same manner that did not come back online.  S1 was shut down for days, and S2 was shut down. The AG didn't come back until S2 was re-started. 
    I have the luxury of having a set of servers to play around with all the different scenarios.  It worked as you expected, except in the (unplanned) scenario where S1 was down for about 10 days.

    Thank you, Michael! I appreciate you testing that for me.

    I imagine that most people would prefer to use synchronous with automatic failover, correct? I mean, I can see how there's a bit of a performance advantage to asynchronous, but at the risk of losing data...? I can't imagine many people would think that's a smart trade off.

    Mike

    You cannot use anything but synchronous with automatic failover.  And I think I misunderstood when I gave the answer for async.   That'a what happens when your brain and your eyes are out of sync. 

    In async, unless it's manually failed over, regardless of which one came offline or online, and it what order, there will be no primary unless manual intervention occurs,

    Got it.
    "You cannot use anything but synchronous with automatic failover."

    I know. I was asking if most people go with synchronous over going the asynchronous route and opening themselves up for data loss. Do you have a sense if one is way more widely used?

    Thanks,

    Mike

    I guess the decision to pick one over the other rests with your needs. 

    For example, here is the setup of a few of the AG's in my environments (I have 8 total)
    AG#1, Primary and 2 secondaries. Both sync and async
    The primary and one secondary live in the same data center, same subnet, same server specs.  They are set up to be synchronous.  The secondary is read-only, and is used for reporting and any select queries.  Read-only routing has been configured between these 2 servers. 
    The other secondary lives in a different data center in a different state.  This is set to asyc, it doers not have as many resources allocated to the VM, and is used for DR only.  

    This provides little downtime and no data loss for things like patching, and less than 15 minutes (that's the SLA/RPO for us) in a DR situation.  The servers that are synchronous automatically fail over and back, the async server will only come alive if I take manual action. With read-only routing set up, we have also distributed the workload. AG#2, Primary and 1 secondary, syncronous
    Simply, this is HA.  Functions in the same manner as a cluster.  The bonus is that we have DR because the secondary is at the co-lo. It's set to automatic failover. The secondary is not read only, there no need.

    AG#3, Primary and 1 secondary, asyncronous.
    This is DR only, there is no need for HA. It's an internal operation(s) AG, It get's patched once a quarter, and is unavailable for a few hours, it's no big deal.  The RPO is 24 hours, the RTO is 15 minutes.  The longest measured recovery time of the data was 12 minutes, it averages 9 seconds.  

    Confused yet?  That makes 2 of us!

    Michael,

    Thanks for all of this info--it's really great to see what others are doing. Believe it or not, I actually understand everything you wrote. I do have one questions, though--In both of your DR scenarios (AG#1 and AG#3), I noticed you're using asynchronous mode. Yes, I know that's better for performance (especially if your secondary lives in another state), but you are opening yourself up for data loss, correct? So, is it that you just have an acceptable %o of potential data loss as part of your DR policy? If so, then I see why asynchronous makes sense, but if execs are saying we must have 0% data loss, well then, regardless of performance, you'd have to use synchronous, right?

    Mike

    Yes, we are aware of the potential data loss.  This will only failover to this node in the event of a DR scenario.  There is a 15 minute RPO, we have never come close to this.  The maximum data loss was less than 3 minutes.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Monday, June 25, 2018 11:10 AM

    Yes, we are aware of the potential data loss.  This will only failover to this node in the event of a DR scenario.  There is a 15 minute RPO, we have never come close to this.  The maximum data loss was less than 3 minutes.

    You also have to consider that in synchronous mode - data isn't committed on the primary until it has been committed on the secondary.  If your DR site is in another state (or further) - there is a high likelihood of latency impacting the transactions.  This could cause unacceptable performance on the primary system - and accepting some data loss to insure acceptable performance during normal business would probably make sense.

    I have my AG setup in asynchronous - because the secondary is *only* utilized for reporting and only has the necessary databases to support reporting.  There is no concern about losing data because this secondary will never be utilized as a primary replica.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Jeffrey Williams 3188 - Monday, June 25, 2018 2:07 PM

    You also have to consider that in synchronous mode - data isn't committed on the primary until it has been committed on the secondary. 

    only once synchronised, otherwise the log on the primary replica will just build up unsent log records possibly forcing the log to grow to excessive sizes

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Mike Scalise - Monday, June 18, 2018 7:49 PM

    1) Even after 3) there still isn't a usable database because "mydatabase" on S2 (which had become the primary) could've had transactions on it that S1 doesn't know about yet?

    Correct, in this situation if you wanted to bring this database online you would have to force service allow data loss.

    Mike Scalise - Monday, June 18, 2018 7:49 PM


    2) After 4), S1 gets all of the transactions that occurred on S2 while S1 was down?

    If you did not failover manually, S2 coming back up would start as primary replica and unsent log would start to be sent to S1 and marked synchronising

    Mike Scalise - Monday, June 18, 2018 7:49 PM


    3) S2 continues to be the primary until I manually fail it back over to S1 even though "mydatabase" is synced on both servers?

    Yes

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Thank you, everyone, for the clarification! It's extremely helpful to me.

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply