Job failure count mismatch?

  • Hi All,

    I need some help in the below query. There are 2 queries.

    1st query which will tell me the monthly wise job report which includes columns such as TotalJobExecutions, Success Count, Failure count, Retry count, Cancel count.

    2nd query, is completly concentrated on Failure count. Here i wanted to find out the reasons for the job failures.

    These 2 queries is showing correct output on my local instance. bascially, I installed a sql instance and created 3 jobs and made them fail for few runs. On my local instance, the queries and counts are matching perfectly. But, when I run the set of queries, the counts are mismatching. i.e. The sum(failure count) column in the 1st output doesnt match up with the Sum(totalCount) in the 2nd output.

    Note: also I am considering only last 3 months worth data.i.e. i am interested in last 3 months job failures. Please check with WHERE clause conditions.

    Now I wanted to understand when I ran the same 2 queries in PROD server, why there is count mistmatch. I am completely not able to understand. How can I fix this? Is there anything am I missing in the 2nd query. Also, in the 2nd query I am considering step id =0 (i.e. job outcome) like I did in 1st query.Instead, to get the exact error message, I am considering the failed step to know the exact error message and I am joining the messageid to sys.sysmessages dmv to know the actual reasons for the job failures. (stepid<>0)

    Is there any special scenario or extra condition I am missing in the 2nd query. I am literally confused here. does running rerunning any failed step from point of failure will that a make a difference in the counts? I have also tested that case as well still counts show me correct on local instant but on prod where we have around 100 jobs which is showing wrong results.

    or else if a job is has multiple schedules does it make any difference????

    All I am looking for is, if I sum up the values of "FailureCount" column values, that should match if I sum TotalCount column values in the 2 query output.

    Can anybody help ?

    Query 1

    ==============

    use msdb

    go

    ;

    with MyCte

    as

    (

    SELECT j.name JobName,

    h.run_status,

    msdb.dbo.agent_datetime(h.run_date, h.run_time) rundatetime,

    case h.run_status when 0 then 'failed'

    when 1 then 'Succeded'

    when 2 then 'Retry'

    when 3 then 'Cancelled'

    when 4 then 'In Progress'

    end as ExecutionStatus

    --h.message MessageGenerated

    FROM sysjobhistory h inner join sysjobs j

    ON j.job_id = h.job_id

    where h.step_id = 0

    and msdb.dbo.agent_datetime(h.run_date, h.run_time) between '2016-06-01 00:00:00.000' and '2016-09-21 23:59:59.000'

    )

    select

    @@servername AS Servername, --//CHANGE THE SERVERNAME

    [Month]=cast(month(rundatetime) as varchar(2))+'/'+cast(year(rundatetime) as varchar(4)) ,

    [TotalExecutions] = count(*)

    ,SuccessCount = sum(case when ExecutionStatus = 'Succeded' then 1 else 0 end)

    ,FailureCount = sum(case when ExecutionStatus = 'failed' then 1 else 0 end)

    ,RetryCount = sum(case when ExecutionStatus = 'Retry' then 1 else 0 end)

    ,CancelledCount = sum(case when ExecutionStatus = 'Cancelled' then 1 else 0 end)

    from MyCte

    group by cast(month(rundatetime) as varchar(2))+'/'+cast(year(rundatetime) as varchar(4))

    go

    Query 2

    ==============

    ;

    with FailureReasons

    AS

    (

    SELECT [JobName] = JOB.name,

    msdb.dbo.agent_datetime(HIST.run_date, HIST.run_time) as RunDateTime,

    [Step id] = HIST.step_id,

    [StepName] = HIST.step_name,

    HIST.run_status,

    [Status] = CASE WHEN HIST.run_status = 0 THEN 'Failed'

    WHEN HIST.run_status = 1 THEN 'Succeeded'

    WHEN HIST.run_status = 2 THEN 'Retry'

    WHEN HIST.run_status = 3 THEN 'Canceled'

    END,

    HIST.sql_severity,

    HIST.sql_message_id,

    HIST.message

    FROM sysjobs JOB

    INNER JOIN sysjobhistory HIST ON HIST.job_id = JOB.job_id

    where HIST.run_status = 0 -- means only failed

    and msdb.dbo.agent_datetime(HIST.run_date, HIST.run_time) between '2016-06-01 00:00:00.000' and '2016-09-21 23:59:59.000'

    and step_id <> 0

    )

    SELECT t1.sql_message_id "Error #",t2.[Description], [TotalCount] = COUNT(1)

    from FailureReasons t1

    inner join sys.sysmessages t2 on (t1.sql_message_id = t2.error)

    where t2.msglangid = 1033

    group by t1.sql_message_id,t2.[Description]

    ORDER BY [TotalCount] DESC

    Attached the sample mismatching counts.

    Thanks,

    Sam

  • As you note, the first restricts results to step_id=0, while the second restricts results to step_id<>0.

    That's the most likely cause of a discrepancy I can see. A single job failure (step_id=0) could include many step failures if it has multiple steps.

    Other possibilities include having a step's On Success action being to quit the job reporting failure or its On Failure action being to quit the job reporting success. Those would also result in discrepancies between the number of step failures and the number of job failures.

    Ultimately, the number of job failures (query 1) will not necessarily match the number of failed steps (query 2), for a variety of reasons. I would just accept that you're counting two different things and not worry if the counts don't match.

    Cheers!

  • Thanks Jacob for the help.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply