The Aware DBA

  • Comments posted to this topic are about the item The Aware DBA

  • Yes, setting up an event monitoring solution is phase 1. The next phase is setting up a solution for alerting, collaboration, triage, and incident workflow. Historically, we had error and threshold alerts emailed to the DBA group. That's more or less still in place, but I don't keep a close eye on it now that we have alerts going to a designated Slack channel, which provides many benefits over email, like centralization and less cluttered conversation threads (see the sketch after this post). We also use a third-party service called PagerDuty, which monitors the channel and sends alerts to the cell phone of whoever is on call that week; the on-call DBA can then log in to PagerDuty to get more detail on the incident and manage the resolution workflow.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Great topic, Steve. I'm with you; I'd hate to have to develop a monitoring system, because I know the system being monitored would likely change. Who wants to be constantly upgrading a monitoring system to accommodate changes in the system it watches?

    I think that knowing what's normal is good for people in other disciplines, like developers. It's good to know how a system should behave, especially before you go and change it.

    But I also see a lot of systems that aren't monitored. One of my colleagues works in a small town several hundred miles south of the rest of us. Just a few minutes ago he texted me to say that the network he's on was down, again. This seems to happen at that location at least once a week! According to my colleague, no one knows why, but you can just about count on them being unavailable for at least a few hours each week. It really frustrates my coworker, who loses so much productivity when the network goes down that often. And it's appalling how the network people seem quite OK with this level of interrupted service. I guess this is some sort of "normal" to them, one they've become accustomed to. If "normal" means poor or unreliable service, then I think we need to come up with a new definition of normal.

    Kindest Regards, Rod. Connect with me on LinkedIn.

  • I know a few people who like sending alerts to a Slack channel, which we even added to SQL Monitor at Redgate. I agree that a central spot for alerts is preferable, especially if you can mark them as read/working/cleared/etc. The downside of chaining to Slack and then PagerDuty is: if Slack is down, what happens? Slack doesn't go down often, but with my luck, it would go down right when something failed.

    PagerDuty is a great system, and something I've appreciated, and longed for, at different jobs.

  • @rod

    Normal has some odd definitions for people. When I worked in a mainframe environment, we knew things were going down when everyone stood up in their cubes. This happened multiple times a week, but there wasn't an easy fix, so we lived with it. A coffee/smoke break for many.

    Networks failing: not to condone it, but I've had flaky equipment on site or at vendors' locations. Sometimes you live with it until there is a chance or option to fix it. We used to have a flaky DB2 linked server driver that leaked memory. We knew we'd have to reboot the SQL Server anywhere from every week to every month, depending on load. No fix, just take an outage when you need it.

    Those are good points, Steve. And perhaps I'm being too harsh on the network folks. With networking, a major portion of it is out of our control. We use CenturyLink for networking. I know that, being in Colorado, you know about CenturyLink.

    Kindest Regards, Rod. Connect with me on LinkedIn.

  • I sure do. I used to lose Internet out here. We were at the max distance from the CO (around 19k feet), and there would be intermittent issues at times, either from water in a box (it took about a year to replace the box with new seals) or from someone making wiring changes and bumping other wires. There was no other good choice out here at the time, so my office had "network" issues regularly. Fortunately, I could usually drive to town and be annoyed with a cup of coffee.

    Now I have microwave Internet. Different issues, but mostly stable over the last 5 years.

     

  • Steve Jones - SSC Editor wrote:

    .. The downside of chaining to Slack and then PagerDuty is: if Slack is down, what happens? Slack doesn't go down often, but with my luck, it would go down right when something failed ..

    It is amazing (and unnerving) how a medium-sized company like Slack has become part of the bedrock for so many enterprise IT departments.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • It is. Fascinating how the workday changes when it's down.

    I think it's a great tool; I just wouldn't put it in the middle of the chain. I'd have it as one target for issues, but I'd want PagerDuty getting alerts directly from the monitor. A rough sketch of what I mean is below.
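    Something along these lines, where the monitor fans the alert out to PagerDuty and to Slack independently, so a Slack outage can't block the page. The routing key, webhook URL, server names, and alert text are placeholders, not any real configuration:

      # Fan-out alerting sketch: send the alert to PagerDuty (Events API v2)
      # and to Slack independently, so neither path depends on the other.
      import json
      import urllib.request

      PAGERDUTY_ROUTING_KEY = "YOUR-INTEGRATION-KEY"                      # placeholder
      SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/X"  # placeholder

      def _post_json(url: str, body: dict) -> None:
          req = urllib.request.Request(
              url,
              data=json.dumps(body).encode("utf-8"),
              headers={"Content-Type": "application/json"},
              method="POST",
          )
          with urllib.request.urlopen(req, timeout=10):
              pass

      def raise_alert(summary: str, source: str) -> None:
          # 1) Page the on-call DBA directly via the PagerDuty Events API v2.
          try:
              _post_json("https://events.pagerduty.com/v2/enqueue", {
                  "routing_key": PAGERDUTY_ROUTING_KEY,
                  "event_action": "trigger",
                  "payload": {"summary": summary, "source": source, "severity": "critical"},
              })
          except Exception as err:
              print(f"PagerDuty send failed: {err}")

          # 2) Post the same alert to Slack for team visibility -- useful,
          #    but deliberately not in the paging path.
          try:
              _post_json(SLACK_WEBHOOK_URL, {"text": f":warning: {summary} ({source})"})
          except Exception as err:
              print(f"Slack send failed: {err}")

      raise_alert("SQL Agent job 'NightlyLoad' failed (placeholder)", "SQL01")

    If Slack is down you only lose the channel post; the page to the on-call phone still goes out.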

