Always On Availability Groups (AGs/AG) provides a fairly convenient and reliable High-Availability Disaster Recovery ("HADR") solution. Each database is configured to provide a fail-over solution using multiple servers running Windows Clustering replicas. The two replicas, primary (supporting reads and writes) and secondary (a.k.a. read replica), are synchronized via your choice of a synchronous or an asynchronous process. Once an AG Listener identifies an unhealthy primary, it will bring read replica as a primary on a different server.
For both the primary and secondary servers to host a primary database at any point in time, both servers would need to have identical database jobs configured. For example: any SSIS job that runs against the ERP database on what is currently your primary server could suddenly swap to a new server. Unfortunately, Microsoft doesn't yet support the synchronizing of the msdb database (the database that stores all jobs related information) in an AG.
The problem is, how do you support identical jobs on both servers without some of the jobs failing? The same job that writes data into a primary database will fail trying to write into a secondary replica (read-only).
The easiest option, of course, is to not have the jobs on both servers and copy the jobs from the primary to the secondary when a fail-over occurs. This is a great idea on paper, but it creates a maintenance nightmare for the DBA to copy all the jobs within an allowable SLA timeframe (read: highly impractical).
Another option is to keep all jobs enabled on the primary instance and disabled on the secondary instance, and then run a script to disable/enable jobs once the fail-over occurs. While this is a better option, it still needs a human to trigger the "enable/disable all jobs" job and requires a very, very delicate knowledge in your current environment's jobs to know which jobs are part of the AG, which jobs are local, and even which jobs are disabled by the user for some other reason (read: babysitting needed) .
A comprehensive solution should take care of all existing jobs, support disabled jobs by user, and also support jobs that are designed to run on only the instance acting as the primary.
How are we going to achieve this? We are going to rely on a job description (HADR.AA.PRIMARY - AG-aware . availability group name . replica role), and dynamically inject code into the first job step to stop the the job gracefully based on the job description. As a bonus, the job will remove the conditional step first step from jobs that are no longer AG-aware based on the job description.
The following code will create a stored procedure (that can be scheduled) that will:
- Select all the available jobs with HADR keyword in a job description (exclude jobs that already have "HADR check" first step)
- Iterate one-by-one through all those jobs (job description example: HADR.AA.PRIMARY)
- Add a new "HADR check" first step into every HADR job:
- Check on the state of Always On Availability Group (Online, Healthy, Connected)
- Compare availability group name (AA) and replica role (PRIMARY) to a job description and add the following logic:
- If matched, stop the job (using job tokens)
- If not matched, continue to a next step
- Select all the available jobs with "HADR check" step but without the HADR keyword in a job description
- Iterate one-by-one through all those jobs
- Remove the "HADR check" first step from all non HADR jobs
-- ============================================= -- Author: Steve Rezhener -- Create date: 10/01/2020 -- Description: Create/Remove HADR check for HADR jobs -- ============================================= -- EXEC [dbo].[usp_maintain_HADRcheck4HARDjobs] CREATE PROCEDURE [dbo].[usp_maintain_HADRcheck4HARDjobs] AS BEGIN SET NOCOUNT ON; DECLARE @step_command AS NVARCHAR(MAX) DECLARE @job_id AS UNIQUEIDENTIFIER DECLARE @job_name AS VARCHAR(50) -- add hadr check to all hadr-aware jobs DECLARE agwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR SELECT jobs.job_id, jobs.name FROM msdb.dbo.sysjobs as jobs inner join msdb.dbo.sysjobsteps as steps on steps.job_id = jobs.job_id WHERE 1=1 and jobs.description like '%HADR%' -- include only hadr aware jobs and steps.step_name != 'HADR Check' -- exlcude jobs that have AGSteps1 step already and steps.step_id = 1 OPEN agwarejobs FETCH NEXT FROM agwarejobs INTO @job_id, @job_name WHILE @@FETCH_STATUS = 0 BEGIN -- HADR.AG1.PRIMARY SET @step_command = N'' + ' DECLARE @ag_name AS VARCHAR(50), @replica_role AS VARCHAR(50)' + ' SELECT @ag_name = PARSENAME(description, 2), @replica_role = PARSENAME(description, 1)' + ' FROM msdb.dbo.sysjobs' + ' WHERE job_id = $(ESCAPE_NONE(JOBID))' + ' IF EXISTS (' + ' SELECT 1' + ' FROM sys.dm_hadr_availability_replica_states as hadr_avrest' + ' INNER JOIN sys.availability_groups as avgr ON avgr.group_id = hadr_avrest.group_id' + ' INNER JOIN sys.dm_hadr_availability_group_states as hadr_avgrst ON hadr_avgrst.group_id = hadr_avrest.group_id' + ' WHERE 1=1' + ' AND avgr.name = @ag_name ' + ' AND hadr_avrest.role_desc = @replica_role ' + ' AND avgr.name = @ag_name ' + ' AND hadr_avrest.role_desc = @replica_role ' + ' AND hadr_avrest.operational_state_desc = ''ONLINE''' + ' AND hadr_avrest.connected_state_desc = ''CONNECTED''' + ' AND hadr_avrest.synchronization_health_desc = ''HEALTHY''' + ')' + ' BEGIN' + '' + ' PRINT ''stop the job''' + ' EXEC msdb.dbo.sp_stop_job @job_id = $(ESCAPE_NONE(JOBID));' + '' + ' END' + ' ELSE PRINT ''continue with other steps''' PRINT @step_command IF EXISTS(SELECT name from msdb.dbo.sysjobs WHERE job_id = @job_id) -- just in case BEGIN PRINT 'add hadr check step to ' + @job_name + ' job' EXEC msdb.dbo.sp_add_jobstep @job_id = @job_id , @step_id = 1 , @step_name = N'HADR Check' , @subsystem = N'TSQL' , @on_success_action = 3 -- Go to next step , @command = @step_command END FETCH NEXT FROM agwarejobs INTO @job_id, @job_name END CLOSE agwarejobs DEALLOCATE agwarejobs --HADR.AA.PRIMARY -- remove hadr check from all not hadr-aware jobs = left over due to job description change DECLARE notagwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR SELECT jobs.job_id, jobs.name FROM msdb.dbo.sysjobsteps as steps LEFT OUTER JOIN msdb.dbo.sysjobs as jobs on steps.job_id = jobs.job_id WHERE 1=1 and steps.step_name = 'HADR Check' -- include jobs that have AGSteps1 step = left over and steps.step_id = 1 and ISNULL(jobs.description, '') NOT like '%HADR%' -- exclude jobs that are not hadr aware OPEN notagwarejobs FETCH NEXT FROM notagwarejobs INTO @job_id, @job_name WHILE @@FETCH_STATUS = 0 BEGIN PRINT 'remove hard check step from ' + @job_name + ' job' EXEC msdb.dbo.sp_delete_jobstep @job_id = @job_id , @step_id = 1 FETCH NEXT FROM notagwarejobs INTO @job_id, @job_name END CLOSE notagwarejobs DEALLOCATE notagwarejobs END
The code above will allow you to control the "runnability" of jobs on servers that are part of Always On Availability group. The same job that exists on both servers (primary and secondary replicas' servers) would only run on one server based on a job description (HADR.AA.PRIMARY = AG-aware. availability_group_name.replica_role). The added step will instruct the SQL Server Agent if this job is going to stopped or alternatively run as usual.
This blog post is partially based on the following articles: