Monitoring for Non Existent Events

  • Comments posted to this topic are about the item Monitoring for Non Existent Events

  • Seeing as so many of us will have studied state machines in one way or another, it remains astounding that so many of us, so many times, do not set up monitoring to cater for all states. This is rather lax of us and very remiss, though hardly surprising when not enough consideration is paid to instrumentation in general.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • It takes a little more time, but changing the exception handling process for a job from "wait for users to notify the right person" to "have SQL Server notify the right person" is definitely worthwhile.

    I have checks for jobs that haven't run or haven't finished in a reasonable period of time but we can always use more. It's the irregular processes that run once a month that are hard to pin down.
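
    A minimal sketch of both ideas above, assuming SQL Agent jobs recorded in msdb and a roughly daily schedule; the job name, operator name, and 24-hour threshold are illustrative, not from the post:

        -- 1) Have SQL Server notify the right person on failure.
        EXEC msdb.dbo.sp_update_job
             @job_name = N'Nightly ETL',                 -- hypothetical job name
             @notify_level_email = 2,                    -- 2 = notify on failure
             @notify_email_operator_name = N'DBA Team';  -- hypothetical operator, assumed to exist

        -- 2) Flag enabled jobs with no successful completion in the last 24 hours.
        SELECT  j.name,
                MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) AS last_success
        FROM    msdb.dbo.sysjobs AS j
        LEFT JOIN msdb.dbo.sysjobhistory AS h
                ON  h.job_id = j.job_id
                AND h.step_id = 0        -- step 0 = job outcome row
                AND h.run_status = 1     -- 1 = succeeded
        WHERE   j.enabled = 1
        GROUP BY j.name
        HAVING  MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) IS NULL
             OR MAX(msdb.dbo.agent_datetime(h.run_date, h.run_time)) < DATEADD(HOUR, -24, GETDATE());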

  • That's one of the points in my DB corruption presentation. Don't assume everything's OK. Don't assume that the backups are succeeding just because you're not getting backup failure messages.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
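
    As an illustration of the point above, one way to look for the silence rather than the failure message is to check msdb's backup history directly; the 7-day window here is an assumption, not a recommendation:

        -- Databases with no full backup recorded in the last 7 days (adjust to your own RPO).
        SELECT  d.name,
                MAX(b.backup_finish_date) AS last_full_backup
        FROM    sys.databases AS d
        LEFT JOIN msdb.dbo.backupset AS b
                ON  b.database_name = d.name
                AND b.type = 'D'            -- D = full database backup
        WHERE   d.name <> N'tempdb'
        GROUP BY d.name
        HAVING  MAX(b.backup_finish_date) IS NULL
             OR MAX(b.backup_finish_date) < DATEADD(DAY, -7, GETDATE());

    Even then, an entry in backupset only proves a backup completed, not that it restores; periodic test restores remain the real verification.
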
  • Thanks to SQL Sentry we have built an alert onto our job steps and can now check for jobs that run too long. But I do see one more hole: jobs calling code that is not well built, which completes without error but does not do what you expected. I have had this come up from time to time; the latest issue was bad data flowing into the database that was not picked up until the weekly maintenance schedule found that the data did not match the data type. Another time we had a trigger that got turned off and not turned back on.

    David
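
    The disabled-trigger case in the post above is easy to watch for with a small scheduled check; running it per database as its own alerting job is the assumption here:

        -- DML triggers that are currently disabled in this database.
        SELECT  OBJECT_SCHEMA_NAME(t.parent_id) AS schema_name,
                OBJECT_NAME(t.parent_id)        AS table_name,
                t.name                          AS trigger_name
        FROM    sys.triggers AS t
        WHERE   t.is_disabled = 1
          AND   t.parent_class_desc = 'OBJECT_OR_COLUMN';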

  • "A job that runs long or doesn't run at all can sting just as bad as one that fails."

    What's the difference?

    There is no difference.

  • Developers have in the past cursed bean counters for various reasons. However there are reasons for counting and tracking. Maybe we need to regularly check and see if we have the appropriate beans.

    Not all gray hairs are Dinosaurs!

  • GoofyGuy (7/31/2014)


    "A job that runs long or doesn't run at all can sting just as bad as one that fails."

    What's the difference?

    There is no difference.

    Sure there is. A long running job might be stuck, but it's done some work. If you clear the issue, it may run quicker. Depending on the job, ETL or some check, it might not affect your day to day operations.

    One that doesn't run is bad because you might not realize the event hasn't occurred. If there is no issue, like a corruption check, then it might not affect you, but certainly it could in the future. A failure of the same job would be indicative of a problem, at least it's likely.

    These all can cause problems, but there certainly is a difference in many cases. Not all, but many.

  • Steve Jones - SSC Editor (7/31/2014)


    GoofyGuy (7/31/2014)


    "A job that runs long or doesn't run at all can sting just as bad as one that fails."

    What's the difference?

    There is no difference.

    Sure there is. A long running job might be stuck, but it's done some work. If you clear the issue, it may run quicker. Depending on the job, ETL or some check, it might not affect your day to day operations.

    One that doesn't run is bad because you might not realize the event hasn't occurred. If there is no issue, like a corruption check, then it might not affect you, but certainly it could in the future. A failure of the same job would be indicative of a problem, at least it's likely.

    These all can cause problems, but there certainly is a difference in many cases. Not all, but many.

    All three cases represent failure to design and test properly. There is no difference, in my mind, from that perspective.

  • Miles Neale (7/31/2014)


    Developers have in the past cursed bean counters for various reasons. However there are reasons for counting and tracking. Maybe we need to regularly check and see if we have the appropriate beans.

    Cool beans.

    If the bean counter is our check and balance, maybe that is an indicator that we haven't been doing enough testing and verification on our part. :-D

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • dwilliscp (7/31/2014)


    Thanks to SQL Sentry we have built an alert onto our job steps and can now check for jobs that run too long. But I do see one more hole: jobs calling code that is not well built, which completes without error but does not do what you expected. I have had this come up from time to time; the latest issue was bad data flowing into the database that was not picked up until the weekly maintenance schedule found that the data did not match the data type. Another time we had a trigger that got turned off and not turned back on.

    David

    That is such a pain. When that happens it usually involves implementing an additional process to verify success and alert if there is a smell of failure.

    I'd rather add the extra checks and code to ensure less headache down the road. :cool:

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
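
    A minimal sketch of the sort of extra check described above: an additional job step that fails loudly, and therefore alerts, when the load "succeeded" but produced nothing. The table and column names are hypothetical:

        -- Post-load sanity check: raise an error (failing this job step and firing its
        -- failure notification) if today's load wrote no rows into the staging table.
        DECLARE @rows int;

        SELECT @rows = COUNT(*)
        FROM   dbo.SalesStaging              -- hypothetical target of the load
        WHERE  LoadDate >= CAST(GETDATE() AS date);

        IF @rows = 0
            RAISERROR('Sanity check failed: no rows loaded into dbo.SalesStaging today.', 16, 1);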

  • GoofyGuy (7/31/2014)


    Steve Jones - SSC Editor (7/31/2014)


    GoofyGuy (7/31/2014)


    "A job that runs long or doesn't run at all can sting just as bad as one that fails."

    What's the difference?

    There is no difference.

    Sure there is. A long running job might be stuck, but it's done some work. If you clear the issue, it may run quicker. Depending on the job, ETL or some check, it might not affect your day to day operations.

    One that doesn't run is bad because you might not realize the event hasn't occurred. If there is no issue, like a corruption check, then it might not affect you, but certainly it could in the future. A failure of the same job would be indicative of a problem, at least it's likely.

    These all can cause problems, but there certainly is a difference in many cases. Not all, but many.

    All three cases represent failure to design and test properly. There is no difference, in my mind, from that perspective.

    Perhaps, but the fallout of a partially completed job can be substantially harder to recover from than, say, a job that didn't run because someone disabled the scheduler.

    Also, depending on the type of process you're dealing with, it may not be physically possible to test every single permutation, so, yes, in some cases you might not be able to completely dummy-proof or fail-proof some jobs.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • The difference is between a proactive and a reactive state.

    Or perhaps: "Isn't it the user's job to tell the DBA when a job did not finish?"

    The more you are prepared, the less you need it.

  • Andrew..Peterson (8/1/2014)


    The more you are prepared, the less you need it.

    Andrew, the last line in your post reminded me of something from years ago. For jobs that failed we used to use checkpoint/restart features, such that we kept a record of state, and when the job was restarted it resumed execution from that state/checkpoint. Using this technique, we were able to save a tremendous amount of time in those old Big Iron days.

    Also, in those jobs that failed to complete and just ran in loops, the last checkpoint may have been the one that caused the loop, wait, or other anomaly. If I remember rightly, and it has been a number of years, we could determine the state of the last correctly completed function or record, and then, once the data error or other logic that caused the problem was fixed, fool the checkpoint into restarting just after the last successful process.

    I wonder, after reading this series of posts, whether the old technique of checkpoint/restart has been lost or forgotten.

    Not all gray hairs are Dinosaurs!
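
    For anyone who hasn't seen the technique, a rough T-SQL sketch of checkpoint/restart as described above; the dbo.JobCheckpoint table and the step procedures are hypothetical:

        -- Record the last completed step; a rerun after a failure skips straight past it.
        SET XACT_ABORT ON;   -- a failed step aborts the batch before its checkpoint is written

        IF OBJECT_ID(N'dbo.JobCheckpoint') IS NULL
            CREATE TABLE dbo.JobCheckpoint
            (
                job_name       sysname NOT NULL PRIMARY KEY,
                last_completed int     NOT NULL
            );

        IF NOT EXISTS (SELECT 1 FROM dbo.JobCheckpoint WHERE job_name = N'MonthlyLoad')
            INSERT dbo.JobCheckpoint (job_name, last_completed) VALUES (N'MonthlyLoad', 0);

        DECLARE @last int;
        SELECT @last = last_completed FROM dbo.JobCheckpoint WHERE job_name = N'MonthlyLoad';

        IF @last < 1
        BEGIN
            EXEC dbo.MonthlyLoad_Step1;   -- hypothetical step procedure
            UPDATE dbo.JobCheckpoint SET last_completed = 1 WHERE job_name = N'MonthlyLoad';
        END

        IF @last < 2
        BEGIN
            EXEC dbo.MonthlyLoad_Step2;   -- hypothetical step procedure
            UPDATE dbo.JobCheckpoint SET last_completed = 2 WHERE job_name = N'MonthlyLoad';
        END

        -- Resetting last_completed to 0 at the start of a new cycle is omitted for brevity.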

  • Matt Miller wrote:

    ... depending on the type of process you're dealing with, it may not be physically possible to test every single permutation, so, yes, in some cases you might not be able to completely dummy-proof or fail-proof some jobs.

    Maybe not, but it's no excuse to bypass developing the appropriate test cases, either.

    The time spent actually writing software should be almost vanishingly small compared to the time expended on design up front and testing in back.
