SQLServerCentral Article

Controlling Unusually Long Running Jobs

,

Again and again as a DBA I am trying to say that automation is the way for a DBA to make the job easier. In many cases I have seen that DBAs are using a manual labor to verify if scheduled jobs are running, backup files are created or deleted, and so on. But it is a time for the automation of any ideas that may popup in your head. Especially in the cases when DBA is maintaining multiple servers and many databases. Let’s see an automated approach for the next situation with scheduled job. The job that is running on a server has its duration. Usually, the duration for a job step in a server is a permanent value. It means that if today job step is running within 20 minutes you may expect that tomorrow it will have the same duration e.g. 20 plus/minus few minutes. There are some situations when a job step is succeeding (not failed) but was running an unusually long time. It may happen for many reasons – network issues, high CPU utilization, some excessive blocking situations etc. And as a DBA you probably would like to know when it is occurs. To get an idea of such suddenly unusual situations I did setup the automated process. There are few ways to organize the process. It depended from the scheduled jobs and job’s settings. Let’s see how we can use the SQL Server system tables to setup an automated control and when job step duration was unusually long email a warning.

Logical idea of the process is to analyze the duration for the job step and find the difference between average running time and the duration the job step had. To make a clear shut, we should exclude unusually long running instances of the job step from previous executions. In many cases it is more practical to analyze each step separately from the job itself. In addition, unusually long running behavior should be defined as well. Based on my practical experience, I do consider job duration time unusually long if it is about 20 - 30% over the average time. But if job step duration is short (I do identify short as less than 1 minute) then check threshold should be bumped to 100%.

For example, if job step runs 30 seconds then warning flag should be raised only if duration more than 60 seconds. In case of job step duration is 200 seconds a warning should be raised when duration is over 260 seconds. For the long running job steps (over 30-45 minutes) level of awareness may be lowered to 20% (sometime even 10-15 percent).

Let’s see the code for the logic. First, create a table to keep all running job steps historical information in convenient format with some additional flags.

Create table DBA_JOB_STEPS_HISTORY  (
            history_id int identity (1,1) Primary key,
            Insert_DT            datetime default (getdate()),
            instance_id  int,                         -- system job step history unique identifier
            job_id uniqueidentifier,             -- Unique ID of the job
            job_step_id int,                        -- step number in the job
            job_nm            sysname,             --  Name of the job
            step_nm varchar(128),               -- name of the step
            last_run_date datetime,             -- Date and time the job last started execution
            duration          int,                      -- time job run in seconds
            long_running_flag char(1)           -- flag to identify if long running this time 
)           

Before inserting any information into the table, verification process will be conducted and based on the result outcome an email will be sent to the recipients.

CREATE procedure dbo.P_CHECK_JOBS_RUN_TIME 
                @recipients varchar(100)                 
as  
BEGIN  
SET NOCOUNT ON  
declare @message varchar(4000), @instance_id INT, @flag_exists int, @subject varchar(80), 
                @long_running_flag char(1), @minid int, @maxid int, @job_id uniqueidentifier, 
                @job_nm varchar(128),@step_id int, @avgduration int, @duration int,  @step_nm varchar(128), @last_run_date datetime 
  
DECLARE @allactivejobs table  (instance_id int, job_id uniqueidentifier, job_step_id int, 
                job_nm varchar(128), step_nm varchar(128), 
                last_run_date datetime,duration int, long_running_flag char(1), tid int identity(1,1) )  
SELECT @instance_id = max(instance_id) from DBA_JOB_STEPS_HISTORY  
set   @instance_id = IsNull(@instance_id,0) 
set @message = 'Unusually Long Run Steps:  ' 
set @flag_exists = 0 
set @subject = 'Unusually Long Run Steps on server  ' + @@servername 
set @long_running_flag = 'N'
  
insert into @allactivejobs 
  ( instance_id, job_id , job_step_id, job_nm, step_nm,  last_run_date, duration, long_running_flag) 
select      soh.instance_id,sj.job_id, sos.step_id,    sj.name ,sos.step_name,   
                CAST 
( Left(cast(run_date as varchar(8)),4) + '/' +       substring(cast(run_date as varchar(8)), 5,2) + '/' + 
                Right(cast(run_date as varchar(8)), 2)    + ' ' +    
                cast( ((run_time/10000) %100) as varchar ) + ':' +  cast( ((run_time/100) %100) as varchar ) + ':' + 
                cast(  (run_time %100) as varchar )         as datetime), 
                ( run_duration % 100 )+                     -- seconds 
                (((run_duration/100) % 100 ) * 60) +                 -- minutes in seconds 
                (((run_duration/10000) % 100 ) * 3600),                -- hours in seconds 
                'N' 
from msdb..sysjobs sj 
                inner join msdb..sysjobsteps sos on sos.job_id = sj.job_id 
                inner join msdb..sysjobhistory soh on soh.job_id = sos.job_id and soh.step_id = sos.step_id 
where soh.instance_id > @instance_id    and soh.run_status = 1       -- successful jobs 
order by sj.name,sos.step_name
 select @minid = min(tid),@maxid = max(tid) from @allactivejobs
 while (@minid <= @maxid) 
  begin  
    set @long_running_flag = 'N' 
    select  @job_id = job_id,        @job_nm = job_nm,          @step_id = job_step_id, 
            @step_nm = step_nm,     @duration = duration,  @last_run_date = last_run_date 
    from @allactivejobs
    where tid = @minid 
-- exclude checks the first 2 times 
IF ( (select count(*) from DBA_JOB_STEPS_HISTORY where job_id = @job_id and job_step_id = @step_id) >= 2 ) 
 BEGIN 
    -- define average run time excluding unusually long running instances 
    select @avgduration = avg(duration)      from DBA_JOB_STEPS_HISTORY 
    where job_id = @job_id and job_step_id = @step_id    and @long_running_flag = 'N' 
                                         
    set @avgduration = IsNull(@avgduration,1) 
    IF (@avgduration < 60) 
     begin 
      IF (@duration > (@avgduration * 2)  ) 
        begin 
          select @message = @message + ' 
           ' + @job_nm + ': ' + @step_nm 
          set @flag_exists = 1 
          set @long_running_flag = 'Y' 
        end
     end 
    IF (@avgduration > 60 and @avgduration < 1800) 
     begin 
      IF (@duration > (@avgduration * 1.3)  ) 
        begin 
         select @message = @message + ' 
          ' + @job_nm + ': ' + @step_nm 
         set @flag_exists = 1 
         set @long_running_flag = 'Y' 
        end 
     end 
  
    IF (@avgduration > 1800) 
     begin 
      IF (@duration > (@avgduration * 1.2)  ) 
       begin 
        select @message = @message + ' 
        ' + @job_nm + ': ' + @step_nm 
        set @flag_exists = 1 
        set @long_running_flag = 'Y' 
       end 
     end 
   IF (@long_running_flag = 'Y') 
    begin 
     update @allactivejobs 
      set long_running_flag = 'Y' 
      where tid = @minid 
    end 
  end                  
  select @minid = @minid + 1 
end 
IF (@flag_exists = 1) 
 begin 
  exec master..xp_sendmail 
       @recipients = @recipients, 
       @subject = @subject,  
       @message = @message                
 end 
insert into DBA_JOB_STEPS_HISTORY 
  ( instance_id, job_id , job_step_id, job_nm, step_nm,  last_run_date, duration, long_running_flag) 
 select instance_id, job_id , job_step_id, job_nm, step_nm,  last_run_date, duration, long_running_flag 
  from @allactivejobs 
 SET NOCOUNT OFF 
END 

I did setup clean up of the table DBA_JOB_STEPS_HISTORY as one of the steps in my weekly job based on 30 days of historical information.

 Delete from DBA_JOB_STEPS_HISTORY  Where datediff(dd, Insert_DT, getdate()) > 30

Additional check may need to be established, especially if job is running unusually long time during 2 first runs. Usually, I am manually controlling any new job created on a SQL Server within the first week or first 2-3 times the job runs (whatever happened first). In many other cases table DBA_JOB_STEPS_HISTORY may help with statistical information about jobs scheduled on the SQL Server. Some other issues may arise when job steps are rearranged or some steps are deleted because the column job_step_id in table msdb..sysjobsteps will change the its value. It is not very often jobs are changed in production server but a change may create some confusion. The stored procedure above can be modified to utilize a step name instead of a step id, but there is a possibility to get the same problem if a step name is changed. The approach of using a step name, probably safer because we less likely changing job step names but more often rearranging the steps.

In addition, changes can be done to help analyze those job steps which are still running but duration of the step is already unusually long. It can be achieved by adding additional criteria to the select statement for the field run_status.

Simply exchange line

and soh.run_status = 1 -- successful jobs

with line

and ( soh.run_status = 1 or soh.run_status = 4 ) -- successful and in progress jobs

In this case you may have an issues with average duration, and may need to exclude the rows for the steps with run_status = 4 (in progress) from insertion to the table DBA_JOB_STEPS_HISTORY

Conclusion

This article shows you the way to identify and automate a control for an unusually long running job steps. You can summarize all the steps and create a procedure for a whole job or add some additional columns to the table DBA_JOB_STEPS_HISTORY and make the code more sophisticated. Same process can be extended to the degree where control for the long running processes will be provided from one centralized location. But it will require dynamic SQL or use of VBS code.

Rate

5 (2)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (2)

You rated this post out of 5. Change rating