SQL agent failed job report last 24 hours

  • Hi All,

    Posting it on development group, if anyone can provide query.

    Basically, I am seeing 100+ failed jobs daily and it is taking more time to check in agent history and 80% of auto success in 20% are failed. Is there any way can only get report of failed non success jobs.

    -- Failed Jobs executed since the Last run hours, where the last status is failed. 
    --This skips jobs that failed once,and subsequently executed successfully on the following run(s).

    DECLARE @Layer VARCHAR(30); SET @Layer = '[CMSServerLayer]'

    DECLARE @dt CHAR(8);
    SET @dt=CONVERT(CHAR(8),DATEADD(HH,-24,GETDATE()),112)
    SELECT
    CONVERT(varchar(128),@@SERVERNAME) As Servername,
    SUBSTRING(T2.name,1,140) AS [SQL Job Name],
    --msdb.dbo.agent_datetime(run_date, run_time) as 'RunDateTime',
    COUNT(*) AS TotalFailures,
    CAST(MIN(CONVERT(DATETIME,CAST(run_date AS CHAR(8)),101)) AS CHAR(11)) AS [MinFailure Date],
    MAX(msdb.dbo.agent_datetime(T1.run_date, T1.run_time)) AS 'RunDateTime',
    MIN(T1.run_duration) StepDuration,
    CASE MIN(T1.run_status)
    WHEN 0 THEN 'Failed'
    WHEN 1 THEN 'Succeeded'
    WHEN 2 THEN 'Retry'
    WHEN 3 THEN 'Cancelled'
    WHEN 4 THEN 'In Progress'
    END AS ExecutionStatus

    FROM
    msdb..sysjobhistory T1 INNER JOIN msdb..sysjobs T2 ON T1.job_id = T2.job_id
    WHERE
    T1.run_status NOT IN (1,2,4)
    AND T1.step_id != 0
    AND run_date >= @dt
    GROUP BY
    T1.step_name,
    T2.name

    https://www.sqlservercentral.com/forums/topic/sql-agent-failed-job-report-last-24-hours#post-3893685

    • This topic was modified 2 weeks ago by  sqltester1.
  • Personally, I would look at getting a reporting tool like RedGate SQL Monitor (it is what I use, I do not work for RedGate).

    Now, in my mind, a failed job is a failed job.  Doesn't matter if it succeeded on the next run or not, I want to know why it failed so I can correct it and have it not fail in the future.

    With 100+ jobs, that may be a tricky task, but tackle it 1 job at a time.  Set the jobs up to email you (and likely the DBA team) on failure and you can use your inbox to track the failures.  Find the ones that are business critical and must be corrected ASAP and the ones that are OK to have a failure as long as it succeeds within X hours, and the ones where they are known to fail periodically and set up rules in your inbox for them.

    Now, you have a priority system in place so you can tackle the critical, "drop everything and fix it" ones, the important ones, and the "when you have free time" ones.

    Tackle one job at a time and determine what is causing it to fail, correct that issue or capture it in a less disruptive manner and move onto the next job.  What I mean by capture it in a less disruptive manner is if your ETL load is failing to populate a table due to someone putting bad data in at the source, have the job succeed but throw the bad rows into an "error" table.  That way the reports will still have data for end users, and you can fix the problem later.  Backups on the other hand should be disruptive to the DBA's if they fail and would fall into my "critical" bucket.  A failed backup MAY mean that I can no longer meet my RPO if something fails.  If my RPO is hourly and my TLOG backups are happening hourly and a TLOG backup fails, I may miss the RPO if the system fails during that window.  Mind you, if my RPO was an hour, I would probably be doing 15 minute TLOG backups so I could have up to 3 fail before I start sweating and swearing.

    Now, if you are NEEDING to do this via TSQL like you are asking, I think you are going to need to do it with a NOT EXISTS and may need to get a bit fancy.

    What I would start by doing is grab a list of all jobs and their most recent status and toss that into a CTE or temp table for easy validation later.  Next, grab a list of all failed jobs in the past hour and store that in a CTE or temp table for easy validation later.  Finally, grab a list of all succeeded jobs in the past hour and store that in a CTE or temp table for easy validation later.  Now, if you want a list of all jobs that failed in the past hour that their most recent status is failure, that is just joining the 2nd CTE/temp table to the 1st on the JOB ID WHERE the run status from the 1st CTE/TT = 1 (succeeded).

    Might not be the most efficient approach, but it gives a nice way to validate the data as you go and reuse it as you need to.  And since it is an Admin/DBA type script, I like those to have chunks that are reusable.

  • Thanks for your detailed response. I will try to write some code, I do only basic code writing and will try CET.

  • I have tried could not get expected results. Can anyone have script.

  • ok, so step 1 from what I had was to get all of the job status's within the timeframe.  The query you have handles that, but I'd pull out some columns and tweak the WHERE so you end up with this:

    DECLARE @dt CHAR(8);
    SET @dt=CONVERT(CHAR(8),DATEADD(HH,-24,GETDATE()),112)
    SELECT DISTINCT
    CONVERT(varchar(128),@@SERVERNAME) As Servername,
    T2.name AS [SQL Job Name],
    msdb.dbo.agent_datetime(T1.run_date, T1.run_time) AS 'RunDateTime',
    MAX(msdb.dbo.agent_datetime(T1.run_date, T1.run_time)) OVER (PARTITION BY [T2].name) AS 'MostRecentRunDateTime',
    CASE T1.run_status
    WHEN 0 THEN 'Failed'
    WHEN 1 THEN 'Succeeded'
    WHEN 2 THEN 'Retry'
    WHEN 3 THEN 'Cancelled'
    WHEN 4 THEN 'In Progress'
    END AS ExecutionStatus
    FROM
    msdb..sysjobhistory T1 INNER JOIN msdb..sysjobs T2 ON T1.job_id = T2.job_id
    WHERE
    run_date >= @dt

    The above will tell you the information on all jobs run within the past 24 hours and that status.  Capture that in a table variable or temp table or permanent table; depending on how you plan to use it.

    Next you want to capture all of the failed jobs, so use the above query but add to the WHERE clause "AND T1.run_status = 0".  Capture that in a table variable or temp table or permanent table.

    Now, join those 2 tables on the Job Name on RunDateTime from the FAILED table is equal to MostRecentRunDateTime from ALL JOBs.  So using CTE's instead of temporary objects, you get something like:

    DECLARE @dt CHAR(8);
    SET @dt = CONVERT( CHAR(8)
    , DATEADD( HH
    , -24
    , GETDATE()
    )
    , 112
    );

    WITH [AllJobs]
    AS
    (
    SELECT DISTINCT
    CONVERT( VARCHAR(128)
    , @@SERVERNAME
    ) AS [Servername]
    , [T2].[name] AS [SQL Job Name]
    , [msdb].[dbo].[agent_datetime]( [T1].[run_date]
    , [T1].[run_time]
    ) AS [RunDateTime]
    , MAX([msdb].[dbo].[agent_datetime]( [T1].[run_date]
    , [T1].[run_time]
    )
    ) OVER (PARTITION BY
    [T2].[name]
    ) AS [MostRecentRunDateTime]
    , CASE [T1].[run_status]
    WHEN 0
    THEN 'Failed'
    WHEN 1
    THEN 'Succeeded'
    WHEN 2
    THEN 'Retry'
    WHEN 3
    THEN 'Cancelled'
    WHEN 4
    THEN 'In Progress'
    END AS [ExecutionStatus]
    FROM [msdb]..[sysjobhistory] AS [T1]
    INNER JOIN [msdb]..[sysjobs] AS [T2]
    ON [T1].[job_id] = [T2].[job_id]
    WHERE [T1].[run_date] >= @dt
    )
    , [FailedJobs]
    AS
    (
    SELECT DISTINCT
    CONVERT( VARCHAR(128)
    , @@SERVERNAME
    ) AS [Servername]
    , [T2].[name] AS [SQL Job Name]
    , [msdb].[dbo].[agent_datetime]( [T1].[run_date]
    , [T1].[run_time]
    ) AS [RunDateTime]
    , MAX([msdb].[dbo].[agent_datetime]( [T1].[run_date]
    , [T1].[run_time]
    )
    ) OVER (PARTITION BY
    [T2].[name]
    ) AS [MostRecentRunDateTime]
    , CASE [T1].[run_status]
    WHEN 0
    THEN 'Failed'
    WHEN 1
    THEN 'Succeeded'
    WHEN 2
    THEN 'Retry'
    WHEN 3
    THEN 'Cancelled'
    WHEN 4
    THEN 'In Progress'
    END AS [ExecutionStatus]
    FROM [msdb]..[sysjobhistory] AS [T1]
    INNER JOIN [msdb]..[sysjobs] AS [T2]
    ON [T1].[job_id] = [T2].[job_id]
    WHERE [T1].[run_date] >= @dt
    AND [T1].[run_status] = 0
    )
    SELECT
    [AllJobs].[SQL Job Name]
    FROM [AllJobs]
    JOIN [FailedJobs]
    ON [FailedJobs].[SQL Job Name] = [AllJobs].[SQL Job Name]
    AND [FailedJobs].[RunDateTime] = [AllJobs].[MostRecentRunDateTime]
    AND [AllJobs].[ExecutionStatus] = [FailedJobs].[ExecutionStatus];

    And that should tell you all jobs that have not succeeded since the last time they were run for jobs that have been run in the past 24 hours... I think.  I don't have any test cases to try that on as all of my jobs are currently succeeding.

    If that doesn't work, it should at least give you a good starting point to work off of, right?

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply