Connections during AG Failover

  • Hello everyone,

    I am playing with Availability Groups for the first time (a bit late to the party) and there is something I don't understand about connections during a Failover.

    I currently have a 2 nodes AG (SQL 2017 CU15) in Synchronous commit mode but without readable secondary (licensing reason).

    I have 2 very basic powershell test scripts, the first one inserting data and the second one reading data, both every seconds. I'm using simple Invoke-Sqlcmd with a -QueryTimeout of 30 seconds. Both scripts are connected using a listener.

    When I trigger a manual failover, I was expecting the test queries to sort of "hang" while the failover occurs and then resume, assuming it takes less than my 30s timeout. Or perhaps for the currently executing queries to be killed and then the following queries to wait until the failover finishes.

    But what I'm currently seeing is multiple queries failing for 15s before resuming normally. It starts with the error: "Unable to access availability database 'Test_DB_1' because the database replica is not in the PRIMARY or SECONDARY role" and after a few seconds it becomes "The target database, 'Test_DB_1', is participating in an availability group and is currently not accessible for queries".

    I also get an error if I try to run a brand new query during the failover (in order to eliminate connection caching as an issue).

    Is this the normal behavior? Is it only powershell that handles the failover that way?

    This article says:

    If the availability group comes back online during a client application's connection attempt but before the connect timeout period, the client driver may successfully connect during one of its internal retry attempts and no error will be surfaced to the application in this case.

    Unless I'm misunderstanding it that seems to indicate that new connections should wait their timeout setting before returning an error?

    Thanks!

  • I can't answer your question about new connections but any existing connections, and therefore any queries using them, will be dropped during a manual or automatic failover.

  • I figured as much. It's unfortunate, I was hoping for something akin to X locks. Hold new connections, wait n seconds for transactions to finish, kill everything that didn't finish in time, failover and resume the connections on hold. This wouldn't help for long running transaction of course but for certain workloads it would be almost seamless. Better a little lag then needing to implement retry logic in all applications (most of which are not under our control anyway).

  • If you have access to the connection strings then, if you haven't already, it might be worth having a look at the multisubnetfailover parameter. It's supposed to improve the reconnection speed after failover even if you only have one subnet. I should be clear that I've never used it, I've only seen reference to it in the documentation, so I can't provide a recommendation or any criticism of it. The driver I was using, in our applications, didn't support it and the potential benefits weren't worth the risk and time required for changing the driver in our case.

    • This reply was modified 4 years, 4 months ago by  as_1234. Reason: Grammar correction
  • Availability Groups are part of High Availability disaster recovery.  Which promises high, not perfect, sql server availability.

  • I wrote a quick and dirty .net application this morning to test the MultiSubnetFailover parameter and I wanted to share my findings. The application simply writes a record every second and count the number of success and failure.

    Without specifying MultiSubnetFailover the results were pretty consistent, every Failover would cause the application to miss 15 inserts, so a total of 15 seconds of downtime.

    With MultiSubnetFailover set to true, I saw 3 different scenarios. Sometime it was the same as without it (15 rows lost). Sometime it would "lock" at the beginning and then it would miss 5 inserts and sometime it would "lock" for the whole 15 seconds and no rows would be lost.

    I'm not yet sure what cause the different behavior, I'm assuming it might depend on the current state of the connection (inserting, connecting, sleeping, etc.) while the Failover occurs but I haven't tested. Still the results are interesting and  it seems generally better with the flag on. However you do need to include a connection recovery mechanism because the connection is force closed sometime.

    Thanks for the suggestion!

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply