Make SQL Server Agent Jobs HADR Aware

,

Introduction

Always On Availability Groups (AGs/AG) provides a fairly convenient and reliable High-Availability Disaster Recovery ("HADR") solution. Each database is configured to provide a fail-over solution using multiple servers running Windows Clustering replicas. The two replicas, primary (supporting reads and writes) and secondary (a.k.a. read replica), are synchronized via your choice of a synchronous or an asynchronous process. Once an AG Listener identifies an unhealthy primary, it will bring read replica as a primary on a different server.

This blog post is going to cover the process to make SQL Server Agent ("Agent") AG-aware. Some knowledge of cursors, sp_executesql and job tokens is needed.

The Problem

For both the primary and secondary servers to host a primary database at any point in time, both servers would need to have identical database jobs configured. For example: any SSIS job that runs against the ERP database on what is currently your primary server could suddenly swap to a new server. Unfortunately, Microsoft doesn't yet support the synchronizing of the msdb database (the database that stores all jobs related information) in an AG.

The problem is, how do you support identical jobs on both servers without some of the jobs failing? The same job that writes data into a primary database will fail trying to write into a secondary replica (read-only).

The easiest option, of course, is to not have the jobs on both servers and copy the jobs from the primary to the secondary when a fail-over occurs. This is a great idea on paper, but it creates a maintenance nightmare for the DBA to copy all the jobs within an allowable SLA timeframe (read: highly impractical).

Another option is to keep all jobs enabled on the primary instance and disabled on the secondary instance, and then run a script to disable/enable jobs once the fail-over occurs. While this is a better option, it still needs a human to trigger the "enable/disable all jobs" job and requires a very, very delicate knowledge in your current environment's jobs to know which jobs are part of the AG, which jobs are local, and even which jobs are disabled by the user for some other reason (read: babysitting needed) .

Solution

A comprehensive solution should take care of all existing jobs, support disabled jobs by user, and also support jobs that are designed to run on only the instance acting as the primary.

How are we going to achieve this? We are going to rely on a job description (HADR.AA.PRIMARY - AG-aware . availability group name . replica role), and dynamically inject code into the first job step to stop the the job gracefully based on the job description. As a bonus, the job will remove the conditional step first step from jobs that are no longer AG-aware based on the job description.

The following code will create a stored procedure (that can be scheduled) that will:

  1. Select all the available jobs with HADR keyword in a job description (exclude jobs that already have "HADR check" first step)
  2. Iterate one-by-one through all those jobs (job description example: HADR.AA.PRIMARY)
    1. Add a new "HADR check" first step into every HADR job:
    2. Check on the state of Always On Availability Group (Online, Healthy, Connected)
    3. Compare availability group name (AA) and replica role (PRIMARY) to a job description and add the following logic:
      1. If matched, stop the job (using job tokens)
      2. If not matched, continue to a next step
  3. Select all the available jobs with "HADR check" step but without the HADR keyword in a job description
  4. Iterate one-by-one through all those jobs
    1. Remove the "HADR check" first step from all non HADR jobs
-- =============================================
-- Author:	Steve Rezhener
-- Create date: 10/01/2020
-- Description:	Create/Remove HADR check for HADR
 jobs
-- =============================================
-- EXEC [dbo].[usp_maintain_HADRcheck4HARDjobs]
CREATE PROCEDURE [dbo].[usp_maintain_HADRcheck4HARDjobs] 
AS
BEGIN
SET NOCOUNT ON;
	
DECLARE @step_command	AS NVARCHAR(MAX)
DECLARE @job_id			AS UNIQUEIDENTIFIER
DECLARE @job_name		AS VARCHAR(50)
	
-- add hadr check to all hadr-aware jobs
DECLARE agwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR
SELECT	jobs.job_id, jobs.name
FROM	msdb.dbo.sysjobs as jobs
	inner join msdb.dbo.sysjobsteps as steps on steps.job_id = jobs.job_id 
WHERE	1=1
	and jobs.description like '%HADR%'				-- include only hadr aware jobs
	and steps.step_name != 'HADR Check'					-- exlcude jobs that have AGSteps1 step already
	and steps.step_id = 1 
	OPEN	agwarejobs
	FETCH NEXT  FROM agwarejobs  INTO @job_id, @job_name		
	WHILE @@FETCH_STATUS = 0
	BEGIN
	-- HADR.AG1.PRIMARY
	SET @step_command = N'' + 
	' DECLARE @ag_name AS VARCHAR(50),  @replica_role AS VARCHAR(50)' +
	' SELECT @ag_name = PARSENAME(description, 2), @replica_role = PARSENAME(description, 1)' +
	' FROM msdb.dbo.sysjobs' + 
	' WHERE job_id = $(ESCAPE_NONE(JOBID))' +
		' IF EXISTS (' +
		' SELECT 1' +
		' FROM 	sys.dm_hadr_availability_replica_states as hadr_avrest' +
	'		INNER JOIN sys.availability_groups as avgr						ON avgr.group_id = hadr_avrest.group_id' +
	'		INNER JOIN sys.dm_hadr_availability_group_states as hadr_avgrst	ON hadr_avgrst.group_id = hadr_avrest.group_id' +
	' WHERE	1=1' +
	' AND avgr.name = @ag_name ' +
	' AND hadr_avrest.role_desc = @replica_role ' +
	' AND avgr.name = @ag_name ' +
	' AND hadr_avrest.role_desc = @replica_role ' +
	' AND hadr_avrest.operational_state_desc = ''ONLINE''' +
	' AND hadr_avrest.connected_state_desc = ''CONNECTED''' +
	' AND hadr_avrest.synchronization_health_desc = ''HEALTHY''' +
		')' +
		' BEGIN' +
		'' +
		' PRINT ''stop the job''' +
		' EXEC msdb.dbo.sp_stop_job @job_id = $(ESCAPE_NONE(JOBID));' +
		'' +
		' END' +
		' ELSE PRINT ''continue with other steps'''
		PRINT @step_command
		IF EXISTS(SELECT name from msdb.dbo.sysjobs WHERE job_id = @job_id) -- just in case
		BEGIN
			PRINT 'add hadr check step to ' + @job_name + ' job'
			EXEC msdb.dbo.sp_add_jobstep @job_id = @job_id
			, @step_id = 1
			, @step_name = N'HADR Check'
			, @subsystem = N'TSQL'
			, @on_success_action = 3 -- Go to next step
			, @command = @step_command
		
		END
		FETCH NEXT  FROM agwarejobs  INTO @job_id, @job_name	
	END
 
	CLOSE agwarejobs
	DEALLOCATE agwarejobs
	
	--HADR.AA.PRIMARY
	-- remove hadr check from all not hadr-aware jobs = left over due to job description change
	DECLARE notagwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR
	SELECT	jobs.job_id, jobs.name
	FROM	msdb.dbo.sysjobsteps as steps 
		LEFT OUTER JOIN msdb.dbo.sysjobs as jobs on steps.job_id = jobs.job_id 
	WHERE	1=1
		and steps.step_name = 'HADR Check'							-- include jobs that have AGSteps1 step = left over
		and steps.step_id = 1 
		and ISNULL(jobs.description, '') NOT like '%HADR%'			-- exclude jobs that are not hadr aware
	OPEN	notagwarejobs
	FETCH NEXT  FROM notagwarejobs  INTO @job_id, @job_name		
	WHILE @@FETCH_STATUS = 0
	BEGIN
	PRINT 'remove hard check step from ' + @job_name + ' job'
	EXEC msdb.dbo.sp_delete_jobstep @job_id = @job_id
						, @step_id = 1
		
	FETCH NEXT  FROM notagwarejobs  INTO @job_id, @job_name	
	END
 
	CLOSE notagwarejobs
	DEALLOCATE notagwarejobs
END

Summary

The code above will allow you to control the "runnability" of jobs on servers that are part of Always On Availability group. The same job that exists on both servers (primary and secondary replicas'  servers) would only run on one server based on a job description (HADR.AA.PRIMARY = AG-aware. availability_group_name.replica_role). The added step will instruct the SQL Server Agent if this job is going to stopped or alternatively run as usual.

Disclaimer

This blog post is partially based on the following articles:

Rate

4 (2)

Share

Share

Rate

4 (2)