Controlling Alerts

  • Comments posted to this topic are about the item Controlling Alerts

  • We had this arrangement in a 24/7 environment. We had started life using a mainframe, and so had operators on site overnight. Any new job had to include 'OpIns' - operator instructions. These listed possible actions that might remedy the situation, plus a section on what to do if those failed. That might be 'leave as is' to be dealt with in the morning, along with an indication of who should be called if it was necessary for this to be fixed asap.

    It definitely enforces some discipline when creating new overnight processing.

  • Routing late-night operational alerts is one thing, but from a DevOps perspective, we also have the power to auto-fix certain issues without even raising an alert. For example, if there is a job or stored procedure that frequently deadlocks or becomes blocked, you can set a shorter timeout and then wrap retry logic around it. Perhaps you can even make the call asynchronous so the user doesn't notice the longer delay.
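
    A minimal sketch of that retry wrapper, in Python. The names here (TransientDbError, run_with_retry) are hypothetical, and the exception stands in for whatever deadlock or timeout error your database driver actually raises; the short timeout itself would be set on the call being wrapped.

```python
import random
import time


class TransientDbError(Exception):
    """Hypothetical stand-in for a driver's deadlock or timeout error."""


def run_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient error, back off and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientDbError:
            if attempt == attempts:
                raise  # out of retries: now it is worth raising an alert
            # Exponential backoff with jitter so retries do not collide again.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

    Only the final failure propagates, so the on-call alert fires only when the retries have genuinely been exhausted.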

    During maintenance or deployment windows, set the database mode to restricted_user (seriously). Have the application check for restricted mode and then display a message informing users that the system is temporarily unavailable, and direct inquiries to a dedicated email address (not your on-call address) that you can catch up on and respond to the following morning. It's ultimately in everyone's best interest that the database deployment complete as quickly and reliably as possible. The small complaint you receive the following morning from a late-night user who didn't get the memo is nothing compared to the fallout you'll receive from management should the deployment fail or run overtime due to blocking.
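
    The application-side check might look roughly like this. It is only a sketch: it assumes the application has already read the database's access mode as a string (in SQL Server, for example, the user_access_desc column of sys.databases reports it), and the function name and mailbox address are made up for illustration.

```python
MAINTENANCE_CONTACT = "support@example.com"  # hypothetical non-on-call mailbox


def maintenance_message(user_access):
    """Return a notice for the user, or None when the database is open to all."""
    if user_access.upper() == "RESTRICTED_USER":
        return (
            "The system is temporarily unavailable for maintenance. "
            f"Please send inquiries to {MAINTENANCE_CONTACT}."
        )
    return None
```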

    Back in the '90s I supported a client-server application, installed at multiple client sites in a dozen different time zones. For those who have experience with the FoxPro database engine running on Novell Netware, you can imagine I'd get frequent late-night calls regarding the infamous index corruption issue. So I had the error handler trap this specific event and run an index rebuild operation prior to closing the application. When the user reloaded the application, all was in proper working order, and I'd learn about it by reviewing the system logs the following morning.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • A recent client had a system where known potential faults, e.g. missing data that cause overnight job failures, had a documented resolution. These resolutions were not aimed at fixing the issue but parking it.

    The key aim was two-fold:

    • Allow for all other data to be processed.
    • Allow for the "broken" data to be fixed (by code or data) and successfully processed at a later time.
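
    As a rough illustration of those two aims, here is a small Python sketch. The function names and the row layout are invented for the example; the client's actual system is not described in that detail.

```python
def process_batch(rows, process_row):
    """Process every row; park failures instead of aborting the whole batch."""
    processed, parked = [], []
    for row in rows:
        try:
            processed.append(process_row(row))
        except (KeyError, ValueError) as exc:
            # Park the row with its reason so it can be fixed and replayed later.
            parked.append({"row": row, "reason": repr(exc)})
    return processed, parked


def load_amount(row):
    # Hypothetical per-row step: fails on missing or non-numeric data.
    return {"id": row["id"], "amount": float(row["amount"])}
```

    Once the parked rows have been corrected (by code or by data), they can be passed back through process_batch as an ordinary batch.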

    EDIT: Spelling mistake. Oops!!!

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Gary Varga (5/21/2015)


    A recent client had a system where known potential faults, e.g. missing data that cause overnight job failures, had a documented resolution. These resolutions were not aimed at fixing the issue but parking it.

    The key aim was two-fold:

    • Allow for all other data to be processed.
    • Allow for the "broken" data to be fixed (by code or data) and successfully processed at a later time.

    Yes, an all-or-nothing ETL process, especially if it's ingesting data from external clients and even more so if it's ingesting healthcare data, will frequently end up loading nothing due to data quality issues.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I agree with you in principle, but sometimes, due to things out of your control, you can't do anything about it. I'm in a situation now where I have identified the root cause of a problem with an SSIS package. However, the author of the application, as well as the database and everything within it, is in a different state. She told me over the phone that the problem is that our software is too out of date. (And perhaps it's even been changed in ways that were never in any previous version of her software.) She wishes to analyze it to determine how out of date it is, so I've gotten her a copy of our SSIS package. But in the meantime our system makes these occasional errors as it tries to report out our data. There's nothing I can do about it, because I haven't enough information to resolve the issue myself. And I don't blame the out-of-state person, because she's got her own job to attend to. She'll get to our problem when she can.

    So sometimes there's nothing you can do but wait, hoping that these occasional issues don't come up too often.

    Rod
