• Rudy,

    Great article. It talked more about your monitoring framework which I feel is the hardest part for a DBA to decide on and setup. I too perform automated monitoring using linked servers. I use sqlcmd instead of isql or osql, but still the same concept. I love this framework for monitoring b/c it's easy and it's allowed my team and I to monitor everything (jobs, space, error log, blocking, automated killing of blocking, replication failures and latency, etc) across ~70 installations around the globe from a central location.

    I saw your solution for job monitoring where you join on sysjobservers to see if a job failed or not. Another join you could add is sysjobhistory. By joining on sysjobhistory you can also report/alert on the actual error message from the failed job. Here's a snippet of my logic to trap failed jobs. I run it every 5mins (for some servers it may be overkill) so hence my long date calculation/check at the end. The filtering isn't perfect, but seems to trap ~99% of the failed jobs for us. In SQL2005, some job failures do not have error codes, some will fail a job step but still mark the job successful. Other jobs will show a warning for a job step but still mark the job successful so I've learned to not trust job outcomes 100% of the time. That's why I also trap on certain key words as well. So far it's kept my team and I honest, although in the earlier days there were some painful learnings as we fine-tuned the logic =). I hope this is helpful:

    select @@servername, jobs.[name], his.sql_message_id, his.sql_severity, his.[message], his.run_status, his.retries_attempted

    ,LastRunTime = convert (varchar(100), (ltrim(str(run_date))+' '+stuff(stuff(right('000000'+ltrim(str(run_time)), 6) , 3, 0, ':'), 6, 0, ':') ) )

    from msdb..sysjobhistory his with(nolock), msdb..sysjobs jobs with(nolock), msdb..syscategories cat with(nolock)

    where his.job_id = jobs.job_id

    and jobs.category_id = cat.category_id

    and his.run_status in (0,4)

    and SUBSTRING(CONVERT(char(8), his.run_date),7,2) = DATEPART(dd, getdate())

    and SUBSTRING(CONVERT(char(8), his.run_date),5,2) = DATEPART(mm, getdate())

    and SUBSTRING(CONVERT(char(8), his.run_date),1,4) = DATEPART(yy, getdate())

    --Filter certain job categories from alerting

    and cat.[name] not in ('Test')

    and (his.sql_message_id <> 0 or his.[message] like '%fail%' or his.[message] like '%Error%')

    --the date conversion below is so we only trap jobs that failed in the last 5mins

    and (convert (varchar(100), (ltrim(str(run_date))+' '+stuff(stuff(right('000000'+ltrim(str(run_time)), 6) , 3, 0, ':'), 6, 0, ':') ) )) >= (getdate() - 0.0034722222222222222222222222222222)

    lc