I was looking into a failed job on a SQL Server 2008R2 instance. My first step was to check the history of the job, to see why it failed:
Okay, this is interesting. The job did fail, but the only step to run succeeded. Hmm, let me look at what the job is doing:
The job starts with step 1, and continues through all of the steps. In this example, each step just does a “SELECT 1″.
So, why is this job only running the first step, and why is it failing? Well, after a bit of head scratching, I decided to script out the job so that I could drop and recreate it. When I examined the script, I was able to find out what the problem was. The first step has this code:
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Step 1', @step_id=1, @cmdexec_success_code=0, @on_success_action=4, @on_success_step_id=5, @on_fail_action=2, @on_fail_step_id=0, @retry_attempts=0, @retry_interval=0, @os_run_priority=0, @subsystem=N'TSQL', @command=N'SELECT 1', @database_name=N'master', @flags=0
Do you see what the issue is? The on success action is set to 4. If we look this up in BOL, we see that 4 is used to run a specified step. And the step to run is step 5 – a step doesn’t exist.
At this point, just saving the job from the GUI will fix the issue, since the step is showing in the GUI to run the next step. After doing this, scripting out the job again will show that the success action is 3 (which is to run the next step) and the step id is 0.
So thinking that I had found a bug, I decided to see if this still happens in newer versions of SQL. In SQL Server 2012, we have a difference:
Here we have a much better message as for why the job failed. It’s much better because it actually is the right reason for why the job did fail.
However, in SQL 2012, SSMS still shows step 1 as set to “Go to the next step”:
What about SQL 2014? In SQL 2014 CTP2, it’s the same as in SQL 2012.
I feel that the GUI should not show as “Go to the next step” if that is not what it is set as. Whether it says to just “Go to step (5)”, or “Go to <invalid step>” doesn’t matter, what does matter is that the incorrect action for the step is being shown. Without scripting out the job, you would never see this. So I have opened up a connect item for fixing the GUI. Please go out and vote for it. (If you want to try this out yourself, a script is attached to the connect item that will create a job with this issue.)
Note that to set up a job in this way, it needs to be done through a script. And when you do create the job, it does give you a warning about a non-existent job step. However, you also get that warning if you set up step 1 to go to step 3, because at the time that step 1 is created, the specified step doesn’t exist. Yes, the person running the script should investigate further – but the tool for doing so makes it look like everything is okay, and in fact it is showing wrong information. If you are being brought in to troubleshoot the failed job, your job just became a bit harder since the GUI isn’t accurate. If you were to just click OK to save the job, it would update this job to go to the next step – but what if the job is actually missing a step that it should be going to instead? So, if you run into this situation, slow down and ensure that you are configuring the job correctly.