SQLServerCentral Article

AlwaysON Availability Group and Jobs Part 2

,

The first part of this series explained why we need a job controlling mechanism in an AG environment, which system views will be used for the control mechanism, and how to determine the role of the replica in AG. This article will provide the actual implementation of the control mechanism.

Let me remind you that we already defined function ags.udf_AGHadrGroupIsPrimary which returning the role of replica in AG. And result Y mean replica is primary, N – replica is secondary, F – failure. Not every job that is on the server has to be disabled/enabled in primary/secondary replica scenario. However, we do need a control table to be able identify those jobs.

CREATE TABLE [ags].[OBJT_Job](
             [JobID] [int]  NOT NULL  PRIMARY KEY,
             [JobName] [varchar](255) NOT NULL,
             [AGRequireDisableFlag] [char](1) NOT NULL);      

In addition, we are going to create two history tables to manage job status automatically during automatic failover scenario. This will also minimize the time jobs on the former primary replica stay active and jobs on the current primary replica (former secondary replica) stay inactive.

We should run tests often to verify which replica is the primary and which is the secondary. In our case, I set this to two minutes. It is up to the DBA to make such a decision based on the frequency of the jobs.

CREATE TABLE [ags].[HIST_ServerAGRole](
              ServerAGRoleID int identity(1,1)  NOT NULL primary key,
              AGName [varchar](50) NOT NULL,
              IsPrimaryFlag char(1) not NULL,
              ActionBit tinyint not null,
              ActionStatus char(1) not NULL ,
              DateCreated datetime default  (getdate()) ;
  
CREATE TABLE [ags].[HIST_EnableDisableJob](
              EnableDisableJobID [int]  identity(1,1) NOT NULL primary key,
              ServerAGRoleID int not NULL,
              [JobName] [varchar](255) NOT NULL,
              CommandText [varchar](4000) NOT NULL,
              ActionBit tinyint not null        ,
              DateCreated datetime default  (getdate()) ;

To reduce overhead, we do not need to run the deactivation/activation scripts each time a check is run. If the server was an active replica in AG and is still active, then all jobs already activated and vice versa with secondary replica. One of the reasons is overhead. Another reason is more in line with the practical thoughts. As DBA, sometimes, I need to disable job for a certain period. In this case, it should not be accidently activated by the automated control mechanism.

The table, HIST_ServerAGRole, keeps the previous replica status and allows our detecting mechanism to determine if any action is required at this time. IsPrimaryFlag keeps the replica role as Primary or Secondary. ActionBit is 1 if jobs were activated and 0 if deactivated. ActionStatus provides me with knowledge if activation/deactivation was completed successfully or if process ran into some issues and end up with errors.

The table, HIST_EnableDisableJob, keeps the actions that were completed for each individual job to activate/deactivate. The command text is stored there to allow user to see the script that was executed and run it manually if needed. ActionBit has the same meaning as in the previous table and allows correlations between command text and the action. They should match. The stored procedure below is the simplistic version, which allows automatic enabling/disabling of jobs based on the replica role.

  CREATE procedure [ags].uspAG_DisableEnableJobsandAlerts
              @AGName sysname
  as
  begin
  
  SET NOCOUNT ON
  
  declare @jobtbl table (jobname sysname,  tid int identity(1,1) primary key);
  declare @cmd varchar(1000), @minid int = 0, @maxid int = -1, @jobnm varchar(255), @IsPrimary char(1) = 'N',
                          @IsPrimaryInt tinyint, @IsPrimaryFlagPrev char(1), @ServerAGRoleID int;
  
  --=========
  BEGIN TRY
  --=========
  -- define if server is primary or secondary
  select @IsPrimary = master.ags.udf_AGHadrGroupIsPrimary (@AGName) ;
  
  set @IsPrimaryInt = CASE WHEN @IsPrimary = 'Y' THEN 1 ELSE 0 END;
  
  ;with histtbl (ServerAGRoleID) as
               (           select max([ServerAGRoleID])
                          from [ags].[HIST_ServerAGRole]
                          where [ActionStatus] = 'S' and AGName = @AGName)
  select @IsPrimaryFlagPrev = [IsPrimaryFlag]
  from [ags].[HIST_ServerAGRole] hsr
              join histtbl h
                          on h.ServerAGRoleID = hsr.ServerAGRoleID;
  
  INSERT INTO [ags].[HIST_ServerAGRole]
                                                      ([AGName]
                                                     ,[IsPrimaryFlag]
                                                     ,[ActionBit]
                                                     ,[ActionStatus])
  select @AGName, @IsPrimary, @IsPrimaryInt, 'I';  -- initial status
  
  set @ServerAGRoleID =  SCOPE_IDENTITY();
  
  -- no action required
  IF   (ISNULL(@IsPrimaryFlagPrev,@IsPrimary) = @IsPrimary )
              begin
                          Update [ags].[HIST_ServerAGRole]
                                      set [ActionStatus] = 'S',
                                                  [Date_Modified] = getdate(),
                                                  [Source_Modified] = @_Source_NM
                          where [ServerAGRoleID]  = @ServerAGRoleID;
                         
                          return;
              end;
  
  -- activate/deactivate jobs
  insert into @jobtbl  (jobname)
  select [JobName]
  from [ags].[OBJT_Job] j
              join msdb.dbo.sysjobs sj         -- exclude jobs that may be already not exist in the server but still is in control table.
                          on j.jobname = sj.name
  where [AGRequireDisableFlag] =  'Y'
  order by jobname;
  
  select @minid = min(tid), @maxid = max(tid) from @jobtbl;
  
  while @minid <= @maxid
              begin
                          select @jobnm = jobname
                          from @jobtbl
                          where tid = @minid;
  
                          set @cmd = 'EXEC msdb.dbo.sp_update_job @job_name=''' + @jobnm + ''',@enabled = ' + Cast(@IsPrimaryInt as varchar(1)) + '; ';
  
                          exec (@cmd);
  
                          insert into [ags].[HIST_EnableDisableJobAlert] (ServerAGRoleID  , [CommandText],[ActionBit])
                          values (@ServerAGRoleID, @cmd, @IsPrimaryInt);
                         
                          set @minid = @minid + 1;
              end;
  
                          Update [ags].[HIST_ServerAGRole]
                                      set [ActionStatus] = 'S'
                          where [ServerAGRoleID]  = @ServerAGRoleID;
  --========
  END TRY
  BEGIN CATCH
  --===========
                          Update [ags].[HIST_ServerAGRole]
                                      set [ActionStatus] = 'F'
                          where [ServerAGRoleID]  = @ServerAGRoleID;
  --===========
  END CATCH
  --===========
  
     SET NOCOUNT OFF
     return;
  
  end;

In real life, this procedure has more features than presented here. It has the ability to override action and force activation/deactivation. It can generate scripts without inserting records to the history tables. Those features help debug stored procedure and allow manual disabling/enabling of jobs if necessary. It also has the actual error handler. You can check these posts to find good information about organizing an error handler in the stored procedures.

When I was working on the task of how to control job’s status on the primary or secondary AG replica, I asked myself, “What happens if somebody accidently starts a job on secondary replica and this job does harm?” It took me a while to find a solution that would not be very intrusive, but at the same time protects against such the incidents. The idea is very simple. What if the very first step in the job is checking the role of the AG replica, then, if it runs in the secondary replica, fail the step and job?

This idea was implemented on all jobs that should not run in secondary AG replica. First, I added special error message to the servers.

  USE master;
  GO
  EXEC sp_addmessage  @msgnum = 99999, @severity = 16,
     @msgtext = N'SQL Server is not the primary replica in AG. Job should not be running.',
     @lang = 'us_english',
              @with_log = 'TRUE',
              @replace =  'replace' ;
  go

Then I test the code that will be running as a step #1 in each job, which has to be enabled or disabled on the primary or secondary servers.

DECLARE @AGName sysname = 'testAG';
IF   (select master.ags.udf_AGHadrGroupIsPrimary (@AGName) ) = 'N'
            raiserror (99999,16,1);

Then I tested the code which allows me to add the step to the job to prevent accidental job firing and minimize the headaches in the future.

declare @jobnm varchar(255);
  set @jobnm = 'TestJob';
 
  EXEC sp_add_jobstep 
       @job_name = @jobnm ,
              @step_id = 1,
      @step_name = N'Test If Server is the Primary replica in AG1', 
      @subsystem = N'TSQL', 
      @command = N'declare @AGCode sysname = ''AG01'';
 
  IF (select master.ags.udf_AGHadrGroupIsPrimary (@AGCode) ) = ''N''
              raiserror (99999,16,1);',  
      @retry_attempts = 0, 
      @retry_interval = 0 ,
              @on_success_action = 3,
              @on_fail_action = 2; 
  GO 

The parameter, @on_success_action, = 3 means that on success of the step, the job will execute the next job step. The parameter, @on_fail_action, = 2 means quit the job reporting failure.

Finally, the code for all necessary jobs is to be modified by adding step one.

declare @tbljobs table (jobnm sysname, tid int identity(1,1) primary key);
  declare @minid int = 0, @maxid int = -1, @jobnm sysname, @cmd varchar(4000), @cmdpart varchar(4000);
  declare @doexec char(1) = 'N', @showcmd char(1) = 'Y';
  declare @stepname varchar(1000);
  
  set @stepname = 'Test If Server is the Primary replica in AG';
  
  set @cmdpart = ' ,
      @step_id = 1,
        @step_name = N''' + @stepname + ''', 
        @subsystem = N''TSQL'', 
      @command = N''declare @AGCode sysname = ''''AG01'''';
  IF (select master.ags.udf_AGHadrGroupIsPrimary (@AGCode) ) = ''''N''''
              raiserror (99999,16,1);'',  
      @retry_attempts = 0, 
      @retry_interval = 0 ,
      @on_success_action = 3,
      @on_fail_action = 2;  ';
 
  insert into @tbljobs(jobnm)
  select name
  from [ags].[OBJT_Job]
  order by name;
  
  select @minid = min(tid), @maxid = max(tid) from @tbljobs;
  
  while @minid <= @maxid
              begin
                          select @jobnm = jobnm
                          from @tbljobs
                          where tid = @minid;
  
  set @cmd = 'EXEC msdb..sp_add_jobstep  @job_name = ''' + @jobnm + '''' + @cmdpart;
                          IF @showcmd = 'Y'
                                      print @cmd;
  
                          IF @doexec = 'Y'
                                      exec  (@cmd);
                          set @minid = @minid + 1;
              end;

Conclusion

In this post, we are looking at how to dynamically detect the role of the SQL Server replica in the AG and how to deal with internal SQL Server jobs in this context. This shows that it is more problematic than you might think.

The given solution is one of the few available. The described methodology can be expanded to make it more comprehensive. In my case, there are jobs that can be fired by alerts. The solution was also targeting it by disabling such custom alerts and disassociating alerts and jobs on the secondary replica or enabling alerts and associating them with their corresponding job based on another control table. By now, this solution is working for a few months. I did multiple tests and it works without any issues.

Rate

5 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (3)

You rated this post out of 5. Change rating