RE: A Failed Jobs Monitoring System

February 6, 2008 at 3:25 pm

Timothy Ford (2/6/2008)
Scott, do you really want an email every time a job fails on every SQL instance? Must have a lot of free time on your hands and space in your in-box. Avg. 20 jobs per instance * 80 instances * running N times per day = WHOA!

That's why I monitor by exception. 😀
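To make "monitor by exception" concrete, here is a rough sketch in Python. It is only an illustration: the `JobRun` record, the function names, and the sample data are all hypothetical, not taken from any actual monitoring setup. It assumes you already have a list of job outcomes from somewhere and shows that only the failures turn into notifications.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One job execution on one SQL instance (hypothetical record shape)."""
    instance: str
    job_name: str
    succeeded: bool


def daily_run_count(jobs_per_instance: int, instances: int, runs_per_day: int) -> int:
    """Total job executions per day across all instances."""
    return jobs_per_instance * instances * runs_per_day


def runs_needing_alert(runs: list[JobRun]) -> list[JobRun]:
    """Monitor by exception: keep only failed runs, so successes send nothing."""
    return [run for run in runs if not run.succeeded]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        JobRun("SQL01", "Nightly backup", True),
        JobRun("SQL01", "Index rebuild", True),
        JobRun("SQL02", "Nightly backup", False),
    ]
    print("Runs per day at 20 jobs x 80 instances x 1 run:",
          daily_run_count(20, 80, 1))
    for run in runs_needing_alert(sample):
        print(f"ALERT: {run.job_name} failed on {run.instance}")
```

With 20 jobs on each of 80 instances running once a day, that is 1,600 executions, but the only messages sent are for the runs that actually failed.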

I have very few job failures. What generally causes something to fail? Something changing. The systems are locked down and we have a rigorous (and improving) change process. Things like disk space are monitored (DB growth rates as well) and flagged before backups and the like fail.

Granted, our systems are relatively straightforward as well: no complex replication scenarios (though we do have replication) and no flaky network links.

This may also change once we get a better centralised monitoring tool in place, where alerts can be sent to a console. Our current monitoring software doesn't handle that so well.

Scott Duncan

MARCUS. Why dost thou laugh? It fits not with this hour.
TITUS. Why, I have not another tear to shed;
--Titus Andronicus, William Shakespeare