A Failed Jobs Monitoring System

This is one DBA’s tale of monitoring failed jobs. There will be many different ways that you can approach this task, and reach a solution. This solution may be similar to or different from mine, but I will explain how I reached my solution and give you some good ideas of how to accomplish a similar task for your own systems.

One could easily say, “Let’s install mail on the db server, and let the system notify us of failures”. That is a fine solution. This is easy on SQL 2005, a little less easy on SQL 2000. Most of my systems are still on 2000, so I shied away from this solution.

Maybe you are excited about implementing Reporting Services, and envision a solution that utilizes this technology to feed you data about failed jobs. At one point I was excited about RS and did implement it to help me report on job failures. But I chose to not use RS this time, because of the quantity of servers I will be monitoring.

There are some fine 3^rd party solutions out on the market that can help you organize your jobs and even let you know of failures. These are all fine ways to reach the goal, but in the end I implemented my own system.

General Overview of the Failed Jobs System

My shop is under the watchful eye of Sarbanes-Oxley. Not only do I, as a DBA, need to know that a job failed, I also need to remediate it and provide an auditable trail of my actions. Sure, in a given 3 day period I can have a job fail 10 times. Maybe it failed for connectivity issues a couple times, maybe it deadlocked another time, and maybe it broke because of newly rolled code and an altered schema (see my article on Monitoring Database Changes). And if an Auditor asked me today why this job failed, I will gladly repeat the above reasons. However, next week, I may forget that a deadlock occurred. A week later, I may forget all the reasons, but vaguely recall that the job failed. In a month, I will not even remember the job name. Being able to document my steps to remediate the failure is vital to our organization and to our SOX monitoring effort. It not only proves that we DBA’s are on top of these failures, are aware of them, fix them, and even leave a trail of breadcrumbs of what we did to fix them. Granted, the notes are only as good as I put the time in to make them. So it’s vital that I buy into this vision, and provide enough detail to be useful.

The way I decided to accomplish this task was as follows.

I create a central Monitor Server with SQL Server installed on it.
For each server I wish to monitor, a linked server will be created and tested.
The Monitor Server will have a job that fires off, and reaches over to another server (via linked server), and pulls data down about the jobs.
This data will be stored locally, on the Monitor Server
An email will be sent to all interested parties detailing each failed job.
The DBA will fill in the remediation data associated with each failure.
Reports can be generated for the Auditors of interested parties, detailing job failures and their remediation.

Justifications

Email Reports

The method for reporting is a single email for each failed job. This provides the DBA a timely notification of the failure. As stated previously, I had used Reporting Services to generate these reports, and have a snapshot emailed to the DBA. With many multiple servers to be monitored, this would prove to be a rather large email report. A one by one email was chosen as the notification medium.

Manual Editing of Remediation fields

This data is stored in a history table, and extra fields are associated with the job failure to be used for future remediation. The DBA will fill these fields in for one, or multiple jobs failures, and call the failure researched, fixed and remediated. This information allows me, the DBA, to research failed jobs, and allows the Auditor enough information to prove we are on top of the system. A UI could very well be written for this, and I attempted this in the past. But what I ended up with was a clunky application that bothered me. So I resorted to SQL statements to edit this remediation data.

Linked Server Justification

There are always issues with linked servers. I chose this easy method, because we can ensure security on the Monitor Server, disallowing any one from accessing it, and being able to gain access to the monitored servers it monitors. We control the password changes on the monitored servers, and can alter them at will. We have a mix of SQL Server 2000 and 2005, and did not want to go the mail route on each server. Centrally gathering the data is easily accomplished with a central Monitor Server, and linked servers is an easy way to accomplish this task.

System Overview

On the Monitor Server, there is an Audit database. Within this database, there are 5 tables, a couple jobs, and a few stored procedures.

Tables

[AuditFailedJobs]

This is the staging table for gathering the failed job information from other servers. The data will sit here, until it has been processed and reported to the DBA. The fields in the [AuditFailedJobs] table are as follows.

ServerName

Job Name

Enabled

Status

Run_Date

Run_Time

Run_Duration

Message

Instance_ID

This information is the core data we retrieve from the server that experienced the job failure. It will provide ample information for you to investigate the job that failed, and even give you reasons in the [Message] field for why it failed. I have added the ServerName to all job failures, because I monitor multiple servers, and need to know where the job is located. I have no idea why I included the Enabled or Status fields; these fields haven’t provided me with much information over the years, but I choose to not alter the system to remove them. The [Run_Date], [Run_Time], and [Run_Duration] fields are essential fields for determining when the job failed. The [Instance_ID] field allows me to keep track of which jobs I have already gathered, so I do not duplicate my efforts.

[AuditFailedJobsHistory]

The data gathered in the previous table, once it has been reported as a failure, is then sent to the History table, where other fields are added for ease in remediation. Since the term ‘remediate’ was awkward as a field name, I chose to use the term ‘resolved’. The fields in the [AuditFailedJobsHistory] table include the above, plus the following:

Resolved

ResolvedBy

ResolvedDate

Resolution

These 4 fields allow the DBA to add information for the auditable trail. I reiterate the importance of filling these fields with valid and useful information, or the system will not serve to be useful for you or the Auditor.

[AuditFailedJobsConfig]

This contains a ServerName field and an Enabled flag. These two simple fields allow the system to know if a Server is enabled to be processed by this job system. Simple. Select the Enabled records, and viola! You now have the result set of servers to process.

[AuditFailedJobsClientConfig]

It was an addition to the system that allowed me to group servers by a Client and Type of Server. At one point, I had trouble with the job failing that processes the servers part way thru, because of a linked server failure. This would cause all subsequent servers to not be monitored. This table was introduced to group servers together. The fields of this table are as follows:

ClientName

ServerName

ServerType

Order

This table allows me to group a client’s servers together, and further subgroup them by ServerType. For example, I have a client with two types of servers, Workflow servers and Dataentry servers. I can call each of these separately to be processed, by Client and ServerType. Or I can call them to be processed all at once, by simply choosing the Client parameter. Multiple clients can be called separately. Or I can call all servers to be processed by leaving off both parameters. These features allowed me to setup multiple steps in the processing job (described later) and continue to the next step on failure, allowing better processing of the server’s failed jobs.

[AuditFailedJobsNotifications]

This table contains the following fields:

ServerName

Enabled

PositiveConfirmation

This table allows me to store the individuals who want to be notified, per server. There is an enabled flag on this record as well. And the PositiveConfirmation flag is intended to allow emails to be sent when a job fails or succeeds. If you want positive confirmation that the job executed, regardless of outcome, this flag enables this ability. I, typically, do not use this, as we only care about failures.

Stored Procedures

[sp_ArchiveFailedJobs]

This stored procedure inserts all the records from the [AuditFailedJobs] table into the [AuditFailedJobsHistory] table. Then it removes the records from [AuditFailedJobs] table.

[sp_EmailFailedJobs]

This stored procedure cycles thru the enabled records in the [AuditFailedJobsNotifications] table, and looks for matching records in the [AuditFailedJobs] table limited by ServerName. If records are found, an email is created and sent with the particular job information, for each job. The email is sent to the individuals in the [AuditFailedJobsNotifications] table.

[sp_ProcessFailedJobs]

This stored procedure takes two parameters. These nullable params are Client and ServerType. These allow you to limit the scope of servers that will be processed. It’s a good idea to itemize your servers by Client and ServerType, so that individual job steps can call each subgroup, and be allowed to fail on its own, and not fail the entire process job.

Based on the server we are about to process, a dynamic SQL query is created, with the ServerName being used as the linked server name, in the object naming. We select from [sysjobs] and [sysjobhistory] in the MSDB database of the monitored server. We limit the jobs to those that are Step_ID = 0, run_status = 0 and greater than the Max(Instance_ID) we have stored on the Monitor Server. We only want to get the latest jobs on the monitored server, the ones we have not processed yet on the Monitor Server.

These records are inserted into our [AuditFailedJobs] table to be stored and processed later. We tack on the ServerName, and do some case work on a couple fields to get better English terms.

Jobs

[_Audit – Failed Jobs]

I feel impressed to justify my naming convention. I use the underscore to keep these jobs at the top of the list of jobs, sorted alphabetically. This is a strange habit I picked up years ago. Since the job is and Audit job, I have labeled it as such with the term ‘Audit’. ‘Failed jobs’ processes the failed jobs. The first incarnation of this job had 3 steps. Get Job Failures, Email and Push to Archive.

Step 1: Get Job Failures

When I realized that the linked servers caused issues, and caused the Job Failures step to fail, before processing all servers, I broke that step up into as many Client/ServerType groupings I had to monitor. For each Client and ServerType subgroup, I have a single step now. Each of these steps is identical, except for the parameters, so I will only bore you with the explanation of a single one.

This step calls the stored procedure [sp_ProcessFailedJobs] with the appropriate parameters for Client and ServerType. As explained above, this simply pulls the latest failed job information from the servers, and stores it for processing. An important note in this step is to make sure that the advanced options of this step are altered. We want this step to ‘go to the next step’ on failure. This will allow all the steps to continue, even if this one step fails. The most widely used reason for failure at this step is the linked server not being able to be reached. And since I cannot find a successful way to poll a linked server, without actually connecting, and receiving an unforgivable error, this was and is the work around I have chosen. In other words, a step will attempt to process the failed jobs, and if it cannot, it will simply go on.

Step 2: Email

This step simply calls the stored procedure [sp_EmailFailedJobs]. This has been described previously. Simply put, this will send email to those individuals indicated in the job system config for the servers they care to be informed about.

Step 3: Push to Archive

This step simply calls the stored procedure [sp_ArchiveFailedJobs]. This has been described previously. This will move the data from the staging table, to the history table.

The Rest of the Process

The rest is up to you. All other steps that you take from this point are not dictated by system jobs or timetables, other than the ones you indicate in your job requirements as a DBA. You will now receive an email that a job failed. You can act upon it or not. It’s your choice. As I wrote this article, I received 10 emails indicating various failures. I reviewed them briefly, and will deal with them later. You can get to them on your system when you need too. But the important fact is that they are now sitting, waiting for you to do something. You ‘must’ remediate them. Regardless of the reality of your solution, you need to document it in this Job System. Who did what, when it was done, and what was done. Mark this item as resolved and give it good notes. You will find yourself referring to this information in the future, when you need too. Why did job X fail? You can now research it, and even mine data on its past failures. Now, when the Auditor’s come knocking on your door, you have a solution in place that allows you too easily provide them a report that shows all job failures, and all remediation.

A Failed Jobs Monitoring System

Resources

Rate

Share

Categories

Share

Rate

A Failed Jobs Monitoring System

Resources

Rate

Share

Categories

Share

Rate

Related content

Discuss content posted by Tom Thomson

Discuss content posted by Kristian Ask

Discuss content posted by Kaifi

Enhancing the readability of your code: Table aliasing in sql

Default trace - A Beginner's Guide