This problem was first flagged by a daily operational validation test I wrote, using everything in my recent webinar, to health check all of my AG components. My test wrapped some of the logic of the Test-Sql* functions built into the SqlServer module:
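
In other words Test-SqlAvailabilityGroup, Test-SqlAvailabilityReplica, and Test-SqlDatabaseReplicaState. A rough sketch of calling them directly (this is not my actual wrapper; it assumes a default instance and reuses the node and AG names from later in this post as placeholders):

Import-Module SqlServer

# Placeholder path - point this at your own instance and availability group
$agPath = "SQLSERVER:\SQL\C1N1\DEFAULT\AvailabilityGroups\AG1"

# Each cmdlet evaluates the built-in health policies for its scope and returns a result
Test-SqlAvailabilityGroup -Path $agPath
Get-ChildItem "$agPath\AvailabilityReplicas" | Test-SqlAvailabilityReplica
Get-ChildItem "$agPath\DatabaseReplicaStates" | Test-SqlDatabaseReplicaState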

These actually trigger corresponding classes of Policy-Based Management (PBM) policies which are installed with SQL Server 2012+. They cover off quite a few different things and so are worth running rather than writing your own from scratch.

I could see in the Jenkins test history that validation had been working correctly forever, so something out of the ordinary was going on - exactly the kind of thing validation is meant to pick up! Here is what the Jojoba output recorded in Jenkins looked like:

UserName  : CODYKONIOR\Jenkins
Suite     : Ops
Timestamp : 13/03/2017 1:19:56 PM
Time      : 100.2030612
ClassName : Test-OpsAvailabilityGroupHealth
Name      : C1N1
Result    : Fail
Message   : 
Data      : Group [AG1] primary C1N1 is not healthy
	    Replica [C1N1] is not healthy
	    Database C1N1/Database1 is not healthy
	    Database C1N1/Database2 is not healthy
	    Database C1N1/Database3 is not healthy

I logged onto that node and the AG Dashboard looked okay at first glance. But the test was still failing when I re-ran it manually, so I looked deeper.

I logged onto the second node and noticed the AG was completely gone. All the databases were in recovery but there was no sign of the AG at all. Nothing. Nada. Zip. (I don't have any other words). It's like it was never there.

At first I thought someone must have done something awful. I quickly poured a coffee while checking the default trace, which usually records system-level configuration changes like dropping an entire replica, but in this case nothing relevant showed up. This is the query I used to check:

-- Derive the base rollover name of the default trace: strip the numbered file
-- name from its path and append log.trc so all rollover files are read
Declare @tracefile Nvarchar(Max);
Select  @tracefile = Reverse(Substring(Reverse([path]),
        Charindex('\', Reverse([path])), 260)) + N'log.trc'
From    sys.traces
Where   is_default = 1;

-- Read the default trace (and its rollovers) and keep rows that carry any text
Select  *
From    sys.fn_trace_gettable(@tracefile, default)
Where   TextData Is Not Null;
Next stop was the SQL ERRORLOG. And of course… buried deep within…

Error: 41035, Severity: 16, State: 2.
Failed to enumerate Windows Server Failover Clustering (WSFC) registry value (Error code 1726).  The WSFC service may not be running or may not be accessible in its current state or the specified arguments are invalid.  For information about this error code, see "System Error Codes" in the Windows Development documentation.
Error: 19435, Severity: 16, State: 1.
AlwaysOn: WSFC AG integrity check failed for AG 'AG1' with error 41035, severity 16, state 2.
AlwaysOn: The local replica of availability group 'AG1' is being removed. The instance of SQL Server failed to validate the integrity of the availability group configuration in the Windows Server Failover Clustering (WSFC) store.  This is expected if the availability group has been removed from another instance of SQL Server. This is an informational message only. No user action is required.
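
If you'd rather not scroll through the ERRORLOG by hand, something like this will fish those entries out (a sketch only, assuming the SqlServer module's Get-SqlErrorLog cmdlet is available; the instance name is a placeholder for the affected secondary):

Import-Module SqlServer

# Pull the ERRORLOG entries from the secondary and keep the AG removal errors
Get-SqlErrorLog -ServerInstance "C1N2" |
    Where-Object { $_.Text -match "41035|19435|being removed" } |
    Select-Object Date, Text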

So what happened? Well, the WSFC had a panic attack and in a fraction of a second it summarily removed all traces of the AG from a secondary replica with no further appeals or discussion. Although the primary replica stayed online, there were failed logins being recorded against the databases in recovery on the now non-existent secondary replica. Yay. (In retrospect I should have looked into this further at the time to confirm whether they were new connections, cached connection pools, agent jobs, or something else - because apparently this is not meant to happen).

The fix was to re-add the replica to the AG and everything resumed as normal…
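
The re-join itself looked roughly like this (a sketch only, run from the secondary; it assumes the primary still listed the replica so only the local side needed re-joining, and the secondary node name is a placeholder):

Import-Module SqlServer

# Placeholder path for the secondary instance that lost its local copy of the AG
$secondaryPath = "SQLSERVER:\SQL\C1N2\DEFAULT"

# Re-join the replica to the availability group...
Join-SqlAvailabilityGroup -Path $secondaryPath -Name "AG1"

# ...then re-join each recovering database so it starts synchronizing again
"Database1", "Database2", "Database3" | ForEach-Object {
    Add-SqlAvailabilityDatabase -Path "$secondaryPath\AvailabilityGroups\AG1" -Database $_
}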

Afterwards I went into the cluster logs to gather further information, but they had already cycled out past the retention period. I also Googled around and this kind of scenario seems extremely rare and undocumented - naturally - because all of the horrible things that can possibly happen to SQL Server only ever seem to happen to me.
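
For next time: you can dump the cluster log on demand and bump its size so it takes longer to cycle out (a sketch, assuming the FailoverClusters module on a cluster node; the values are just examples):

Import-Module FailoverClusters

# Generate cluster logs for the last 24 hours (TimeSpan is in minutes) into C:\Temp
Get-ClusterLog -TimeSpan 1440 -Destination C:\Temp

# Increase the cluster log size (in MB) to extend how far back it reaches
(Get-Cluster).ClusterLogSize = 400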

Now that you know WSFC can summarily murder your AG replicas in cold blood, it's a good time to ask yourself whether you test your AGs daily (and whether the rest of your staff can, through a web UI and with historical reporting). And if you don't, maybe you should start.

Action items:

  1. Upvote my Trello item to fix these tests in the SqlServer module. Why? Oh, because they don't work properly at scale and can throw spurious errors. See point 3 below.
  2. Watch my video about operational validation of SQL Server at scale if you haven't already, set up your Jenkins and PowerShell environment, and start living the good life.
  3. Build your own test to find the same problem.

    You can do this by wrapping those faulty functions in Use-DbRetry {} blocks. Use-DbRetry is a function that comes with my DbData module (which I mention in the presentation) and it will retry SQL timeouts, deadlocks, and PBM-based exceptions. A rough sketch of this follows below.
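
    (This assumes Use-DbRetry simply takes a script block to retry; the provider path reuses the placeholder names from earlier.)

    Import-Module DbData
    Import-Module SqlServer

    # Retry the flaky health check on timeouts, deadlocks, and spurious PBM errors
    Use-DbRetry {
        Test-SqlAvailabilityGroup -Path "SQLSERVER:\SQL\C1N1\DEFAULT\AvailabilityGroups\AG1"
    }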