A Failed Jobs Monitoring System

  • Comments posted to this topic are about the item A Failed Jobs Monitoring System

  • I too use an underscore when naming a job. I started this practice after upgrading my workstation to SQL Server 2005. SQL Management Studio sorts jobs differently than Enterprise Manager.

    I got tired of scrolling to the bottom of the list to see my jobs, so I started adding an underscore to make life easier.

  • We employ a very similar method of managing our SQL Servers. We have about 30 or so SQL Servers, and most of those are replicated through multiple development, test and hotfix (virtual) environments.

    We have a large number of SQL servers to maintain and products like SQL Stripes become very expensive.

    Only just finishing off the setup; having a few issues with distributed transactions, but it's looking good so far.

    Graham

  • If you're already setting up a separate server, why not use MOM? You then get MOM's other benefits, like broader SQL Server monitoring and monitoring of the servers themselves, beyond failed jobs. We monitor databases for close to 200 SQL Server applications and MOM works great. MOM automatically installs its client on a server when it detects a new SQL Server installation and starts monitoring. It works well for both SQL 2000 and 2005.

  • I came up with a similar solution, except I have a proc that creates linked servers under a single name (LS_Q) before executing the queries. This makes the subsequent coding easier (no dynamic SQL to deal with) and the processing easier to expand: so far I have extended my system to include SQL file statistics (size/used) for all MDF, NDF and LDF files, some basic I/O stats, failed jobs, current backups, etc.

    The code could have been written better; it's not easy to read or understand. (SQL Refactor is good for this if you are an untidy coder, like most of us.)
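    A minimal sketch of the single-name idea, assuming a repository table dbo.MonitoredServers and a log table dbo.FailedJobs (both names are my inventions; LS_Q is the fixed name from the post, created once at setup and only repointed afterwards). Because the linked server always carries the same name, the collection proc compiles against static four-part names and needs no dynamic SQL:

    ```sql
    -- Repoint the fixed-name linked server at the next target instance.
    CREATE PROCEDURE dbo.RepointLS_Q
        @target sysname
    AS
    BEGIN
        IF EXISTS (SELECT 1 FROM sys.servers WHERE name = N'LS_Q')
            EXEC sp_dropserver @server = N'LS_Q', @droplogins = 'droplogins';

        EXEC sp_addlinkedserver
            @server     = N'LS_Q',
            @srvproduct = N'',
            @provider   = N'SQLNCLI',
            @datasrc    = @target;
    END
    GO

    -- Collection proc: plain four-part names, no dynamic SQL.
    CREATE PROCEDURE dbo.CollectFailedJobs
        @target sysname
    AS
    BEGIN
        EXEC dbo.RepointLS_Q @target;

        INSERT INTO dbo.FailedJobs (ServerName, JobName, RunDate, RunTime)
        SELECT @target, j.name, h.run_date, h.run_time
        FROM LS_Q.msdb.dbo.sysjobhistory AS h
        JOIN LS_Q.msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
        WHERE h.step_id = 0        -- step 0 = the job outcome row
          AND h.run_status = 0;    -- 0 = failed
    END
    GO
    ```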

  • I am not sure if this method has already been shared, because I have not figured out how to go to the beginning of the thread (new user)!

    Anyway, I am not a true DBA but a manager of system operations. I developed a proactive method for alerting me when a job fails by simply creating triggers on the sysjobhistory table and then emailing me the log output generated by the job (provided I enabled the job to produce a log). A sketch of the trigger idea follows below.

    It saved me time and loads of money messing around with 3rd party products or figuring out how to get other components of MS SQL 2005 to work the way I wanted. Obviously, Database Mail (sp_send_dbmail) needs to be enabled on each server you plan to deploy the solution to (or, if you use some kind of linked server setup, just enable it on one server). I advise anyone considering this solution to check their company's security practices when deciding for or against enabling mail on the database server.
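    A minimal sketch of that trigger, assuming SQL 2005 Database Mail with a profile named DBA_Mail (the profile, recipient, and trigger name are my inventions; reading the job's output log file is left out for brevity):

    ```sql
    USE msdb;
    GO
    CREATE TRIGGER dbo.trg_JobFailureMail
    ON dbo.sysjobhistory
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @subject NVARCHAR(255), @body NVARCHAR(MAX);

        -- step_id = 0 is the job outcome row; run_status = 0 means failed
        SELECT @subject = N'Job failed: ' + j.name,
               @body    = i.message
        FROM inserted AS i
        JOIN dbo.sysjobs AS j ON j.job_id = i.job_id
        WHERE i.step_id = 0 AND i.run_status = 0;

        IF @subject IS NOT NULL
            EXEC dbo.sp_send_dbmail
                @profile_name = N'DBA_Mail',           -- assumed profile
                @recipients   = N'dba@example.com',    -- assumed recipient
                @subject      = @subject,
                @body         = @body;
    END
    GO
    ```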

  • I've been using a job failure email notification method for some years now on SQL 2000 with the following differences:

    1) I don't use a monitor server to check the other servers, but rather just include a final email failure step on each job, only executing this step if a previous step fails (a sketch follows after point 2). That way I avoid the linked server problems mentioned in the article and I keep it simple. (The one disadvantage is that if the server goes down, no email gets sent, whereas a central monitoring server can catch this. But our operations unit knows within seconds if a server goes down, so it's not a problem for me.)

    2) On my SQL 2000 servers, I don't use SQLMail because a) it requires a MAPI client and b) it's flaky (it often started failing after months of no problems). I replaced it with an SMTP solution developed by Gert Drapers (formerly a software architect at Microsoft) called xp_smtp_sendmail. I've used it for years and it is very robust; it has never failed once.
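    A minimal sketch combining both points, assuming an existing job named "Nightly Backup" (the job, proc, addresses, and SMTP host are my inventions; the xp_smtp_sendmail parameter names follow Gert Drapers' documentation):

    ```sql
    USE msdb;
    GO
    -- Step 1 does the real work; on failure it jumps to the email step.
    EXEC dbo.sp_add_jobstep
        @job_name          = N'Nightly Backup',
        @step_id           = 1,
        @step_name         = N'Run backup',
        @subsystem         = N'TSQL',
        @command           = N'EXEC dbo.usp_BackupDatabases;',  -- assumed proc
        @on_success_action = 1,    -- quit with success (skip the email step)
        @on_fail_action    = 4,    -- go to a specific step...
        @on_fail_step_id   = 2;    -- ...the email step below

    -- Step 2 is only reached when a previous step fails.
    EXEC dbo.sp_add_jobstep
        @job_name          = N'Nightly Backup',
        @step_id           = 2,
        @step_name         = N'Email on failure',
        @subsystem         = N'TSQL',
        @command           = N'EXEC master.dbo.xp_smtp_sendmail
                                   @FROM    = N''sqlagent@example.com'',
                                   @TO      = N''dba@example.com'',
                                   @server  = N''smtp.example.com'',
                                   @subject = N''Job failed: Nightly Backup'',
                                   @message = N''Check job history for details.'';',
        @on_success_action = 2,    -- quit reporting failure so the job shows red
        @on_fail_action    = 2;
    ```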

  • First of all: nice job, TJay.

    I too built a similar system using linked servers to monitor failed jobs, backup history/backups out-of-date tolerance, databases::applications::servers, space issues at the file level, etc. What I've moved on to in SQL 2005 is a similar system, with the same Reporting Services reports as I was using before (and practically the same schema), using SSIS. You populate a table with your server/instance names and, so long as you keep that up to date as you bring new instances online, you're good to go. It was based in part on an article on SSIS in SQL Server Magazine in May or June of 2007. It runs seamlessly against my 80+ instances and 800+ databases with much less overhead than my old linked server solution. I've only had it fail once, and that was due to an instance being down for maintenance; with a little more work on the customized logging, even that would have been apparent.
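    A minimal sketch of the inventory table that could drive such an SSIS package (all names are my inventions; the post doesn't show its schema):

    ```sql
    -- One row per monitored instance; keep this current as servers come online.
    CREATE TABLE dbo.ServerInventory
    (
        ServerID     INT IDENTITY(1,1) PRIMARY KEY,
        InstanceName sysname  NOT NULL UNIQUE,   -- e.g. N'PRODSQL01\SALES'
        IsActive     BIT      NOT NULL DEFAULT 1
    );

    -- The SSIS package would read this list into a Foreach Loop enumerator
    -- and connect to each active instance in turn:
    SELECT InstanceName
    FROM dbo.ServerInventory
    WHERE IsActive = 1;
    ```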

    - Tim Ford, SQL Server MVP
    http://www.sqlcruise.com
    http://www.thesqlagentman.com
    http://www.linkedin.com/in/timothyford

  • Good article. I too have created a job failure management system that is easy to use, as it uses linked servers along with reports created in SSRS. It's funny how no vendor has created an application for DBAs to manage multiple servers easily and effectively. My company has over 100 instances on 80 servers that I manage... alone.

    Just my 2 cents worth,

    Rudy


  • Personally, this seems like a lot of heavy lifting and manual labor, and the effort involved seems to outweigh the costs of commercial tools already available. For example, SQL Sentry's Event Manager is a very affordable option, and it does everything your solution does and more... including elaborate event notifications not only for failed jobs but also job runtime thresholds, event chaining and a very nifty Outlook-style calendar view that gives you a graphical view of your job schedules across many instances.

    I'm usually a total advocate for reinventing the wheel if you are going to add something very powerful that isn't already available in packaged solutions (or if you really want to learn the API, catalog views, DMVs, Agent subsystem, etc.). In this case I suggest your readers take trial versions of the available packages for a spin before going too far down the "roll-your-own" road. At the very least you will see what you might want (and might not want) when you build your own solution, but more often than not you will realize how complex it can be to go that route, and that you will actually save money by spending money on a ready-made tool.

    This is a classic argument that I have been having for ages. Back when ASP was a popular web language, I used to argue until I was blue in the face with people who wanted to write their own mail or upload component to save the $99 or $150 cost of the premier component already available to buy on the spot. For most of us, if it takes more than an hour to build such a component, we're already behind. Then there is testing, debugging, performance testing, etc., all of which you get for free when you have a reputable vendor behind the product.

  • I agree that there are tools out there that can do most of the work. I say most because there is always something a tool does not cover. For example, my reporting system also analyzes the SQL error logs (my reports show the errors I need to look into or can use to troubleshoot), but the apps I've seen don't. If all the apps did everything we need, then most of this site would not be needed. Just look at all the great scripts, process improvements and articles!
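    A minimal sketch of the error-log side, assuming SQL 2005's xp_readerrorlog output (the temp table and the filter list are my inventions):

    ```sql
    -- Capture the current error log (first arg 0 = current log,
    -- second arg 1 = the SQL Server log, as opposed to the Agent log).
    CREATE TABLE #ErrLog
    (
        LogDate     DATETIME,
        ProcessInfo NVARCHAR(64),
        LogText     NVARCHAR(MAX)
    );

    INSERT INTO #ErrLog (LogDate, ProcessInfo, LogText)
    EXEC master.dbo.xp_readerrorlog 0, 1;

    -- Surface only the lines worth investigating.
    SELECT LogDate, LogText
    FROM #ErrLog
    WHERE LogText LIKE '%error%'
       OR LogText LIKE '%failed%'
       OR LogText LIKE '%Severity: 2_%'   -- severities 20-29
    ORDER BY LogDate DESC;
    ```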

    The other aspect is dollars. Not every DBA can spend $2,000 per instance or per server to buy the ABC program to help out. Generally they have a very limited budget and have to spend the money wisely.

    I use several purchased tools to help with my work (not going to mention them as I'm not in sales), but there are always things that are custom to your company, and you have to figure out a way to automate them or at least change the process to make it easier on the DBA.

    Rudy


  • Yes, it's true, not every tool will do everything you want. That is true of pretty much all software out there, whether you bought it, wrote it, bought the source code and adapted it, etc.

    However I use Event Manager on all of my instances and there is nothing that I have needed to date that it doesn't do.

    Also, analyzing SQL Server error logs seems to be a rather separate process from being told that a job failed, and I don't know that it is something you should expect from a tool dedicated to monitoring SQL Server Agent. It sounds more like a job for a SQL Server error monitoring tool.

    As for the $2,000 figure, well, you can certainly get tools like Event Manager for less than $2,000 per instance. And what I hate about these debates is that managers and even some IT folks and DBAs think that their time spent developing software is free. Surely the amount of time you spent writing your code was worth well more than $2,000 in opportunity cost that you could have devoted to other tasks, and so you have not saved any money really, just shifted the line item on the income statement from "3rd party software" to "employee salary." Especially if you end up charging overtime or off-hours compensation because you couldn't get your normal work done while you were re-inventing the wheel. 🙂

  • I feel your pain ... we have over 200 SQL Server 2000 instances that need to be monitored, and not just for SQL job failures.

    The solution I went with works somewhat in the opposite direction from yours. Instead of setting up a repository server with linked servers to all the instances, I set up all the instances with a linked server back to the repository server. I then configured all the instances as target servers of a master MSX server. The MSX server pushes out a SQL job to all the target servers; this job collects data (including job history) and feeds it into the repository server. The repository server then processes the collected data and sends notifications as necessary. To make sure that all the target servers are actually sending their data to the repository server, they each update a time stamp as part of their collected data. I can then query the time stamps to confirm that all the target servers are sending over their data.
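    A minimal sketch of that heartbeat, assuming a linked server named REPOSITORY, a database DBAdata, and a table dbo.CollectionHeartbeat (all names and the one-hour tolerance are my inventions):

    ```sql
    -- Target side: the pushed MSX job stamps its row on every run.
    UPDATE REPOSITORY.DBAdata.dbo.CollectionHeartbeat
    SET LastReported = GETDATE()
    WHERE ServerName = CAST(SERVERPROPERTY('ServerName') AS sysname);

    -- Repository side: flag any target that has gone quiet.
    SELECT ServerName, LastReported
    FROM dbo.CollectionHeartbeat
    WHERE LastReported < DATEADD(HOUR, -1, GETDATE());
    ```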


    Kindest Regards,

    DrewTheEngineer

  • The reason I pull the information instead of pushing it is basically that new servers can be added to the network without my having to install or set up anything on those servers. By updating the table on my repository server, the monitoring then automatically connects to them.
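    A minimal sketch of what onboarding looks like in the pull model, assuming a repository table like the dbo.MonitoredServers from the earlier sketch (the name is my invention):

    ```sql
    -- One INSERT on the repository; nothing is installed on the new server.
    INSERT INTO dbo.MonitoredServers (ServerName, IsActive)
    VALUES (N'NEWSQL01', 1);   -- picked up on the next collection run
    ```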

    Nice to see all the different ideas. Different is good as long as it makes your job easier and better.

    Rudy


  • For job monitoring, I just get each server to send an email (per job) when a job fails: Database Mail on SQL 2005, an SMTP stored proc for 2000.

    As far as auditing goes, if you have a Service Desk, log an incident for each job failure (or add an entry to an existing incident if it is a work-in-progress sort of thing) and put the resolution in there. You should be able to search for any historical incidents relating to a particular server easily enough. Having some sort of known-problem repository helps as well (especially for those failures that only happen every six months or so).
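    A minimal sketch of the built-in per-job failure email on SQL 2005, using an Agent operator (the operator, address, and job name are my inventions; Database Mail must be enabled for SQL Server Agent):

    ```sql
    USE msdb;
    GO
    EXEC dbo.sp_add_operator
        @name          = N'DBA Team',
        @email_address = N'dba@example.com';

    EXEC dbo.sp_update_job
        @job_name                   = N'Nightly Backup',
        @notify_level_email         = 2,           -- 2 = notify on failure only
        @notify_email_operator_name = N'DBA Team';
    ```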



    Scott Duncan

    MARCUS. Why dost thou laugh? It fits not with this hour.
    TITUS. Why, I have not another tear to shed;
    --Titus Andronicus, William Shakespeare

