Job History – Back to Basics

,

How necessary is that SQL Server Agent job that you have running on the server? I ask that question of my clients on a routine basis.

Sometimes, I will ask that question as a routine part of a health check for the server. Others, it might be due to a migration or upgrade that is required for the server. Generally, the answer to the question will be one of two things: a) “Yes it is necessary.” or b) “What is that job doing?”.

Believe it or not, both answers will typically spawn more questions. You see, I don’t usually ask that question unless the job is failing on a regular basis. You wouldn’t believe how many jobs exist out there that are scheduled and just fail every time as well.

When I encounter a situation like this, it means it is time to have a discussion. In order to have that discussion, one needs to have done a bit of homework in order to better understand the situation. For me, part of this homework involves running the numbers and figuring out the frequency of the job’s failure or success.

Data Gathering

For me, I like to understand how often a job has executed and what is the frequency of failure for that quantity of executions. If I see a job that has not succeeded successfully in 60 consecutive executions, it is probably a safe bet that the job is not needed. Why? Well, if nobody has noticed the job hasn’t been working for that long, the likelihood of the job providing any use to the business is slim to none. In this case, I would present a case to the business as to why it should be removed.

But, how do I get to that point? Well, you could go through the job history for each job one by one and run some manual analytics. Or, you could take advantage of a script. I prefer the script route because it is faster, more reliable and a lot less mundane.

WITH jobhist AS
	(SELECT
			   jh.job_id
			 , jh.run_date
			 , jh.run_status
			 , jh.step_id
			 , ROW_NUMBER() OVER (PARTITION BY jh.job_id, jh.run_status ORDER BY jh.run_date DESC) AS rownum
		 FROM  dbo.sysjobhistory jh
		 WHERE jh.step_id = 0)
   , agglastsixty AS
	(SELECT
						sjh.job_id
					  , sjh.run_status
					  , COUNT(*) AS RunCount
		 FROM			dbo.sysjobhistory sjh
			 INNER JOIN jobhist			  jh
				 ON jh.job_id = sjh.job_id
					AND jh.run_status = sjh.run_status
					AND jh.step_id = sjh.step_id
		 WHERE			CONVERT(DATE, CONVERT(VARCHAR, sjh.run_date)) > DATEADD(
																				   d
																				 , -60
																				 , CONVERT(DATE, CONVERT(VARCHAR, jh.run_date))
																			   )
						AND jh.rownum = 1
		 GROUP BY		sjh.job_id
					  , sjh.run_status)
   , aggtotal AS
	(SELECT
						sjh.job_id
					  , sjh.run_status
					  , COUNT(*) AS RunCount
		 FROM			dbo.sysjobhistory sjh
			 INNER JOIN jobhist			  jh
				 ON jh.job_id = sjh.job_id
					AND jh.run_status = sjh.run_status
					AND jh.step_id = sjh.step_id
		 WHERE			jh.rownum = 1
		 GROUP BY		sjh.job_id
					  , sjh.run_status)
	SELECT
							j.name										 AS JobName
						  , sc.name										 AS CategoryName
						  , sp.name										 AS OwnerName
						  , j.owner_sid
						  , j.date_created
						  , j.enabled
						  , CONVERT(DATE, CONVERT(VARCHAR, oa.run_date)) AS RunDate
						  , CASE oa.run_status
								WHEN 0
									 THEN 'Failed'
								WHEN 1
									 THEN 'Succeeded'
								WHEN 2
									 THEN 'Retry'
								WHEN 3
									 THEN 'Canceled'
								WHEN 4
									 THEN 'In Progress'
							END											 AS run_status
						  , als.RunCount								 AS Last60StatusCount
						  , agt.RunCount								 AS TotalStatusCount
		FROM				dbo.sysjobs			  j
			LEFT OUTER JOIN		jobhist				  oa
				ON oa.job_id = j.job_id
			LEFT OUTER JOIN agglastsixty		  als
				ON als.job_id = oa.job_id
				   AND als.run_status = oa.run_status
			LEFT OUTER JOIN aggtotal			  agt
				ON agt.job_id = oa.job_id
				   AND agt.run_status = oa.run_status
			INNER JOIN		sys.server_principals sp
				ON j.owner_sid = sp.sid
			INNER JOIN		dbo.syscategories	  sc
				ON j.category_id = sc.category_id
		WHERE				oa.rownum = 1
		ORDER BY			RunDate DESC;

Running that script against my sandbox, I may see something such as the following.

Here you will note that the “wtf” job has two entries. One entry for “Succeeded” (in green) and one entry for “Failed” (in red). Each row receiving counts for number of executions.

This is the type of information I can use in a meeting to discuss with the business. This is no longer a discussion of opinion, but rather one that is now based on facts and data. It now becomes very easy to demonstrate to the business that a job has failed 60/60 times and nobody noticed it or cared enough about the consistent failures to do anything about it. Imagine if the failing job happens to be the database backups. I wonder what the action items for that job failure might include.

The Wrap

An important part of any DBAs job is to ensure database related jobs are running prim and proper. Sometimes that just doesn’t happen. When jobs are being overlooked, it is useful to be able to gather data related to consistency of job success or failure. This script will help you in your investigation efforts. In addition, I also recommend this article in your agent job audit efforts.

This has been another post in the back to basics series. Other topics in the series include (but are not limited to): Backups, backup history and user logins.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

Share

Share

Rate