The first part of this series explained why we need a job controlling mechanism in an AG environment, which system views will be used for the control mechanism, and how to determine the role of the replica in AG. This article will provide the actual implementation of the control mechanism.
Let me remind you that we already defined function ags.udf_AGHadrGroupIsPrimary which returning the role of replica in AG. And result Y mean replica is primary, N – replica is secondary, F – failure. Not every job that is on the server has to be disabled/enabled in primary/secondary replica scenario. However, we do need a control table to be able identify those jobs.
CREATE TABLE [ags].[OBJT_Job]( [JobID] [int] NOT NULL PRIMARY KEY, [JobName] [varchar](255) NOT NULL, [AGRequireDisableFlag] [char](1) NOT NULL);
In addition, we are going to create two history tables to manage job status automatically during automatic failover scenario. This will also minimize the time jobs on the former primary replica stay active and jobs on the current primary replica (former secondary replica) stay inactive.
We should run tests often to verify which replica is the primary and which is the secondary. In our case, I set this to two minutes. It is up to the DBA to make such a decision based on the frequency of the jobs.
CREATE TABLE [ags].[HIST_ServerAGRole]( ServerAGRoleID int identity(1,1) NOT NULL primary key, AGName [varchar](50) NOT NULL, IsPrimaryFlag char(1) not NULL, ActionBit tinyint not null, ActionStatus char(1) not NULL , DateCreated datetime default (getdate()) ; CREATE TABLE [ags].[HIST_EnableDisableJob]( EnableDisableJobID [int] identity(1,1) NOT NULL primary key, ServerAGRoleID int not NULL, [JobName] [varchar](255) NOT NULL, CommandText [varchar](4000) NOT NULL, ActionBit tinyint not null , DateCreated datetime default (getdate()) ;
To reduce overhead, we do not need to run the deactivation/activation scripts each time a check is run. If the server was an active replica in AG and is still active, then all jobs already activated and vice versa with secondary replica. One of the reasons is overhead. Another reason is more in line with the practical thoughts. As DBA, sometimes, I need to disable job for a certain period. In this case, it should not be accidently activated by the automated control mechanism.
The table, HIST_ServerAGRole, keeps the previous replica status and allows our detecting mechanism to determine if any action is required at this time. IsPrimaryFlag keeps the replica role as Primary or Secondary. ActionBit is 1 if jobs were activated and 0 if deactivated. ActionStatus provides me with knowledge if activation/deactivation was completed successfully or if process ran into some issues and end up with errors.
The table, HIST_EnableDisableJob, keeps the actions that were completed for each individual job to activate/deactivate. The command text is stored there to allow user to see the script that was executed and run it manually if needed. ActionBit has the same meaning as in the previous table and allows correlations between command text and the action. They should match. The stored procedure below is the simplistic version, which allows automatic enabling/disabling of jobs based on the replica role.
CREATE procedure [ags].uspAG_DisableEnableJobsandAlerts @AGName sysname as begin SET NOCOUNT ON declare @jobtbl table (jobname sysname, tid int identity(1,1) primary key); declare @cmd varchar(1000), @minid int = 0, @maxid int = -1, @jobnm varchar(255), @IsPrimary char(1) = 'N', @IsPrimaryInt tinyint, @IsPrimaryFlagPrev char(1), @ServerAGRoleID int; --========= BEGIN TRY --========= -- define if server is primary or secondary select @IsPrimary = master.ags.udf_AGHadrGroupIsPrimary (@AGName) ; set @IsPrimaryInt = CASE WHEN @IsPrimary = 'Y' THEN 1 ELSE 0 END; ;with histtbl (ServerAGRoleID) as ( select max([ServerAGRoleID]) from [ags].[HIST_ServerAGRole] where [ActionStatus] = 'S' and AGName = @AGName) select @IsPrimaryFlagPrev = [IsPrimaryFlag] from [ags].[HIST_ServerAGRole] hsr join histtbl h on h.ServerAGRoleID = hsr.ServerAGRoleID; INSERT INTO [ags].[HIST_ServerAGRole] ([AGName] ,[IsPrimaryFlag] ,[ActionBit] ,[ActionStatus]) select @AGName, @IsPrimary, @IsPrimaryInt, 'I'; -- initial status set @ServerAGRoleID = SCOPE_IDENTITY(); -- no action required IF (ISNULL(@IsPrimaryFlagPrev,@IsPrimary) = @IsPrimary ) begin Update [ags].[HIST_ServerAGRole] set [ActionStatus] = 'S', [Date_Modified] = getdate(), [Source_Modified] = @_Source_NM where [ServerAGRoleID] = @ServerAGRoleID; return; end; -- activate/deactivate jobs insert into @jobtbl (jobname) select [JobName] from [ags].[OBJT_Job] j join msdb.dbo.sysjobs sj -- exclude jobs that may be already not exist in the server but still is in control table. on j.jobname = sj.name where [AGRequireDisableFlag] = 'Y' order by jobname; select @minid = min(tid), @maxid = max(tid) from @jobtbl; while @minid <= @maxid begin select @jobnm = jobname from @jobtbl where tid = @minid; set @cmd = 'EXEC msdb.dbo.sp_update_job @job_name=''' + @jobnm + ''',@enabled = ' + Cast(@IsPrimaryInt as varchar(1)) + '; '; exec (@cmd); insert into [ags].[HIST_EnableDisableJobAlert] (ServerAGRoleID , [CommandText],[ActionBit]) values (@ServerAGRoleID, @cmd, @IsPrimaryInt); set @minid = @minid + 1; end; Update [ags].[HIST_ServerAGRole] set [ActionStatus] = 'S' where [ServerAGRoleID] = @ServerAGRoleID; --======== END TRY BEGIN CATCH --=========== Update [ags].[HIST_ServerAGRole] set [ActionStatus] = 'F' where [ServerAGRoleID] = @ServerAGRoleID; --=========== END CATCH --=========== SET NOCOUNT OFF return; end;
In real life, this procedure has more features than presented here. It has the ability to override action and force activation/deactivation. It can generate scripts without inserting records to the history tables. Those features help debug stored procedure and allow manual disabling/enabling of jobs if necessary. It also has the actual error handler. You can check these posts to find good information about organizing an error handler in the stored procedures.
When I was working on the task of how to control job’s status on the primary or secondary AG replica, I asked myself, “What happens if somebody accidently starts a job on secondary replica and this job does harm?” It took me a while to find a solution that would not be very intrusive, but at the same time protects against such the incidents. The idea is very simple. What if the very first step in the job is checking the role of the AG replica, then, if it runs in the secondary replica, fail the step and job?
This idea was implemented on all jobs that should not run in secondary AG replica. First, I added special error message to the servers.
USE master; GO EXEC sp_addmessage @msgnum = 99999, @severity = 16, @msgtext = N'SQL Server is not the primary replica in AG. Job should not be running.', @lang = 'us_english', @with_log = 'TRUE', @replace = 'replace' ; go
Then I test the code that will be running as a step #1 in each job, which has to be enabled or disabled on the primary or secondary servers.
DECLARE @AGName sysname = 'testAG'; IF (select master.ags.udf_AGHadrGroupIsPrimary (@AGName) ) = 'N' raiserror (99999,16,1);
Then I tested the code which allows me to add the step to the job to prevent accidental job firing and minimize the headaches in the future.
declare @jobnm varchar(255); set @jobnm = 'TestJob'; EXEC sp_add_jobstep @job_name = @jobnm , @step_id = 1, @step_name = N'Test If Server is the Primary replica in AG1', @subsystem = N'TSQL', @command = N'declare @AGCode sysname = ''AG01''; IF (select master.ags.udf_AGHadrGroupIsPrimary (@AGCode) ) = ''N'' raiserror (99999,16,1);', @retry_attempts = 0, @retry_interval = 0 , @on_success_action = 3, @on_fail_action = 2; GO
The parameter, @on_success_action, = 3 means that on success of the step, the job will execute the next job step. The parameter, @on_fail_action, = 2 means quit the job reporting failure.
Finally, the code for all necessary jobs is to be modified by adding step one.
declare @tbljobs table (jobnm sysname, tid int identity(1,1) primary key); declare @minid int = 0, @maxid int = -1, @jobnm sysname, @cmd varchar(4000), @cmdpart varchar(4000); declare @doexec char(1) = 'N', @showcmd char(1) = 'Y'; declare @stepname varchar(1000); set @stepname = 'Test If Server is the Primary replica in AG'; set @cmdpart = ' , @step_id = 1, @step_name = N''' + @stepname + ''', @subsystem = N''TSQL'', @command = N''declare @AGCode sysname = ''''AG01''''; IF (select master.ags.udf_AGHadrGroupIsPrimary (@AGCode) ) = ''''N'''' raiserror (99999,16,1);'', @retry_attempts = 0, @retry_interval = 0 , @on_success_action = 3, @on_fail_action = 2; '; insert into @tbljobs(jobnm) select name from [ags].[OBJT_Job] order by name; select @minid = min(tid), @maxid = max(tid) from @tbljobs; while @minid <= @maxid begin select @jobnm = jobnm from @tbljobs where tid = @minid; set @cmd = 'EXEC msdb..sp_add_jobstep @job_name = ''' + @jobnm + '''' + @cmdpart; IF @showcmd = 'Y' print @cmd; IF @doexec = 'Y' exec (@cmd); set @minid = @minid + 1; end;
In this post, we are looking at how to dynamically detect the role of the SQL Server replica in the AG and how to deal with internal SQL Server jobs in this context. This shows that it is more problematic than you might think.
The given solution is one of the few available. The described methodology can be expanded to make it more comprehensive. In my case, there are jobs that can be fired by alerts. The solution was also targeting it by disabling such custom alerts and disassociating alerts and jobs on the secondary replica or enabling alerts and associating them with their corresponding job based on another control table. By now, this solution is working for a few months. I did multiple tests and it works without any issues.