A Failed Jobs Monitoring System

  • Scott, do you really want an email every time a job fails on every SQL instance? Must have a lot of free time on your hands and space in your in-box. Avg. 20 jobs per instance * 80 instances * running N times per day = WHOA!

    - Tim Ford, SQL Server MVPhttp://www.sqlcruise.comhttp://www.thesqlagentman.com http://www.linkedin.com/in/timothyford

  • I agree that writing it yourself vs buying software is a compelling argument, and people are on both sides of the camp. In my case, the buy it option was not an option. ever. Unfortunately.

    So, instead of doing it in a lot more manual process, this is the solution that allows me to sit back and do other things, while i feel safe that the jobs are being monitored. Even if it cost a lot more for me to develop. And thats part of the reason i am sharing it. Why should you reinvent the wheel i just reinvented. Unless its for the self education on the process. Which was another selfish reason for writing it. Loads of baggage went into the system, and i just wanna share it.

    I love that there have been so many different ideas shared here on this topic. its one close to my heart, and i love that so many options are out there. hopefully people will be albe to pick the best one for their shop.

  • Timothy Ford (2/6/2008)


    Scott, do you really want an email every time a job fails on every SQL instance? Must have a lot of free time on your hands and space in your in-box. Avg. 20 jobs per instance * 80 instances * running N times per day = WHOA!

    That's why I monitor by exception. 😀

    I have very few job failures. What generally causes something to fail? Something changing. The systems are locked down and we have a rigorous (and improving) change process. For things like disk space, that's being monitored (DB growth rates as well) and flagged before backups & the like fail.

    Granted, our systems are relatively straightforward as well, no complex replication scenarios (we do have replication), no flaky network links.

    This may also change once we get a better centralised monitoring tool in place, where alerts can be sent to a console. Our current monitoring software doesn't handle that so well.



    Scott Duncan

    MARCUS. Why dost thou laugh? It fits not with this hour.
    TITUS. Why, I have not another tear to shed;
    --Titus Andronicus, William Shakespeare


  • I agree. I only monitor of the exceptions/failures. Don't need to know that a job ran successfully. But I have other reports that should all jobs successful or not. In the morning I just review the failed ones. Have an email alert don't really help me as we are not a 24x7 shop so the web page report (with SSRS) works great.

    Rudy

    Rudy

  • Nice solution! However I got a message that sp_SQLSMTPmail does not exist. Downloaded it somewhere. Next error about '@vc parameter that does not match with this procedure'. What am I doing wrong? Thanks in advance. Peter

  • This is a great article! How can I take it to the next level - I have a multistep job that is executing some but not all of the steps. The SQL logs/tables report job success, but I suspect now it is become at least SOME part of the job succeeded. Without parsing each step out into seperate jobs, is there way to get detailed information about the success or failure of each step within the job?

  • Interesting solutions. For large environments, enterprise monitoring software is very useful. Our environment has 200+ SQL instances and we use NetIQ. We finished up a project that monitors all types of problems and will create a service ticket based on the monitoring events. The best part is the set-up for new servers. Install the client, add the server to the proper monitoring group in NetIQ, and we can start getting monitoring events from SQL (including failed and long-running jobs).

    Thanks,

    Eric

  • that is the next step. its difficult to gather specific info fom the job steps, cause they dont key into the job, just seems they are ordered by time, and linked to a job... so the organization is a bit lacking. step 0 is the job report. other steps could fail. But finding the specific message that failed, and specific step, may be difficult, if the job failes a lot of times in a row...

    Im struggling with piecing this together now.

  • @Strommy: In my last job we used NetIQ. A lot easier than Tivoli. You just have to watch out for adjusting child jobs - any changes to the parent job will overwrite it. A gotcha for people who don't watch what view they are in on the console.

    @stew_b: A possible solution is to have a step between each to handle failures. Like so:

    Step 1: Step1Tasks onsuccess goto step 3, onfail goto next

    Step 2: Report failure of step 1 onsuccess goto next, onfail goto next

    Step 3: Step3Tasks onsuccess quit with success, onfail goto next

    Step 4: Report failure of step 3 onsuccess quit with success, onfail quit with fail

    It depends what your steps are doing and whether you can incorporate error-reporting within them (e.g. RAISERROR). Failing that, use the above structure with e.g. RAISERROR('Step x failed', 16, 1) WITH LOG and check the ERRORLOG (manually or automate it) or get it to send an email (you may get spammed, watch out for that!).



    Scott Duncan

    MARCUS. Why dost thou laugh? It fits not with this hour.
    TITUS. Why, I have not another tear to shed;
    --Titus Andronicus, William Shakespeare


  • Good points about the job failure logic. In general, if we ever have a job that fails and then sends a failure notification, we have that job step exit failed, regardless of success or failure of that step. As an aside, if we have a ticketing system monitoring job failures, we don't need the notification step. That was part of our goals for integrating with NetIQ.

    Thanks,

    Eric

  • Great idea Scott - Thank you!

  • I came across this site, so i thought let me share with forum.

    SQL job manager

    View and manage SQL Server Jobs

    Free for a limited time

    http://www.idera.com/Products/SQLjobmanager/

  • If you're already setting up a separate server, why not use MOM?

    I've looked into that and cannot find a rule to monitor failed jobs. If I have to build this rule myself, then why not go for a tailormade solution anyway?

    Personally, this seems like a lot of heavy lifting and manual labor, and the effort involved seems to outweigh the costs of commercial tools already available.

    This is correct, however, have you even looked into the extra load some of these tools add to a server? Dunno about Sentry's Eventmanager, but others where a firm no-no after a day or so of testing. Idera JM was 1 of those.

    For job monitoring, I just get each server to send an email (per job) when a job fails.

    Yes, nice 1, we have more then a 1000 jobs running all over the place, and alas, not only the DBA can go there and add/delete jobs. I prefer a central monitoring system.

    However, I miss the job to run this setup. I can figure it out, but do I miss something here? It's not part of the download?

    Greetz,
    Hans Brouwer

  • Using MOM. Yes this is a good tool but not everyone has the budgets to get MOM. This is the situation at my location, although there would be money available now, but I think this is working much better.

    We use this tool/setup to also collect additional information from not just SQL servers but also from Oracle servers and MySQL servers as well. Not certain if MOM does Oracle/MySQL but so far this setup is working well.

    Just go to show that there are many ways to get things done and you can bet a DBA/Develop will figure something out.

    Thanks for the comments,

    Rudy

    Rudy

  • We use a similar system to this to monitor all our servers. Had a look at Idera SQL Job Manager but it ran very slowly on my machine and we needed more information about other aspects of the servers than just the jobs. At the moment lots of details on disk space, db size, job outcomes and a number of logging tables are pulled back to a central server for reporting against. SQL Reporting service is then used to report against the data and send out email reports as needed.

Viewing 15 posts - 16 through 30 (of 38 total)

You must be logged in to reply to this topic. Login to reply