SQL jobs are great as the SQL agent job engine allows for very granular control over jobs that are created by allowing great flexibility with the scheduling, execution, flow actions and notifications. Invariably, when jobs are setup, they just run and run without many problems, but the other day I was called in to look at a problem where a SQL job seemingly wasn’t doing everything that it should be doing, namely the job steps.
After a bit of digging around, I discovered that the flow actions had gone a little bit off and as such some job steps were being missed. When I talk about the flow actions, I’m referring to the “On success action” and “On failure action” settings of the individual job steps.
Now if I was to use SSMS UI to add/update the job steps and configure the On success and On failure actions in such a way that job steps would be missed, I would see this error:
Which is a really useful warning that you really should do something about (unless you intend that behaviour of course!) before saving the job.
However, most of the jobs and job step logic that I implement are scripted and if you modify the job using plain old t-sql, you don’t get a helpful warning if you inadvertently create these unreachable job steps. For example, you could add a new job step to an existing job, but if you don’t modify the previous job step’s success/fail action from “Quit job reporting success” or “Quit job reporting Failure”, the new job step may never be executed. And in the case of my problem, I think this is what happened. A new job step was added at the bottom of the existing job steps, but the previous last job step flow on success was not modified to “Go to the next step”. It basically looked a bit like this:
So, after fixing the immediate problem I set about creating a script that would check all the jobs within a SQL instance and attempt to detect any jobs that had any orphaned or reachable job steps and came up with the following:
SELECT sj.name, sjs1.step_id, sjs1.step_name
FROM sysjobsteps sjs1
INNER JOIN sysjobs sj on sj.job_id = sjs1.job_id
LEFT JOIN sysjobsteps sjs2 ON sjs2.job_id = sjs1.job_id AND sjs2.step_id <> sjs1.step_id AND
(sjs1.step_id = sjs2.on_success_step_id AND sjs2.on_success_action = 4) OR
(sjs1.step_id = sjs2.on_fail_step_id AND sjs2.on_fail_action = 4) OR
(sjs1.step_id = sjs2.step_id + 1 AND sjs2.on_success_action = 3) OR
(sjs1.step_id = sjs2.step_id + 1 AND sjs2.on_fail_action = 3) OR
(sjs1.step_id = sj.start_step_id)
WHERE sjs2.step_id IS NULL
ORDER BY sj.name, sjs1.step_id
What this does is that it checks all job steps are reachable by either:
What the script doesn’t do is to recursively list all unreachable steps. So for example, if you had a job with three steps, 1, 2 and 3 where step 3 is only reachable by step 2, but step 2 itself is unreachable, then this code will only detect step 2 as unreachable. It won’t detect that step 3 is “unreachable” because of step 2. For sure you could create a query that would give you this information but it’ll be a lot more complicated. The very fact that this script is detecting a job with an unreachable step is good enough for me to investigate further and I don’t need to know which downstream job steps are unreachable.
So to prove the script, I created a job with some crazy logic flows as shown here (the script to create this job and detect the unreachable steps can be downloaded here):
The job is set to start at step 2 and as you can see there are some unreachable steps in there because of the flows. Running the script returns the following:
Steps 1 and steps 6 are directly unreachable because of the configuration of the job and the job step flows. It doesn’t detect step 5 as being indirectly unreachable because of the reasons listed above.