An Alert Philosophy

Many of you reading this will be responsible in some way for managing a system. This might be a test/development system or a production one, but often you want to know how well the system is working. Or maybe you want to know if the system is working at all. Even developers care if their server is up.

There are plenty of ways to get information about a server. Some of us monitor in an automated fashion, some of us check when we think something is wrong, but no matter what you do, you are often looking for some data about the state of the system. When the system lets you know automatically, this is what we call an alert, though finding out that the system is down because you can't connect to it is not the best type of alert.

I've managed lots of production systems, and I have usually implemented some sort of process to let me know when things happen. These could be good or bad things, or just things, but they are alerts that I care about. They provide me with information that I will use in some manner to make a decision. In other words, some sort of human decision and response is needed here.

An alert should be something that calls for action, or at least, that's what Google thinks. This short piece contains information from Google's SRE work. Their definition of an alert is something a system (not a user) generates and something that requires human action, not an automated response. The article talks about good alerts, hierarchies of alerts, and more. Everyone has their own method of picking and configuring alerts, but you should think about what interruptions you need, and what sort of timeliness is required from a human.

Personally, I decide if an alert requires immediate action, and if so, the alert needs to hit my device (phone, pager, fax machine, wife's mobile, whatever). That way I can make a decision to actually deal with the situation or pawn it off on Kendra or Grant. Those are real-time decisions, and there ought to be few of those in any system.

If it's not something I need to fix now, then it can be filed in email or as a lower priority item in my monitoring software. Those items need to be alerts and not logs because I only look at logs when I'm really confused and can't fix something. The low-level alerts are things like a system that is running low on disk space and will run out in 30 days. That's not something I break away from a date with my wife for, but it is something I want to start thinking about if there's a lead time to make changes.
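
As a rough illustration of that split between "hit my device" and "file it in email", here is a minimal Python sketch. Everything in it is an assumption made for the example: the names DiskSample, days_until_full, and route_disk_alert are hypothetical, the projection is a simple linear one based on average daily growth, and the two-day paging threshold is arbitrary. Your own monitoring software will have its own way of expressing this.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Where an alert goes (hypothetical categories for this sketch)."""
    PAGE = "page"    # interrupt a human right now
    EMAIL = "email"  # low-priority alert, reviewed when there is time


@dataclass
class DiskSample:
    free_gb: float          # free space remaining
    daily_growth_gb: float  # average space consumed per day (assumed known)


def days_until_full(sample: DiskSample) -> float:
    """Linear projection of how many days of free space remain."""
    if sample.daily_growth_gb <= 0:
        return float("inf")  # not growing, so no projected fill date
    return sample.free_gb / sample.daily_growth_gb


def route_disk_alert(sample: DiskSample, page_within_days: float = 2.0) -> Route:
    """Page only when a human must act soon; otherwise send email.

    The 2-day default is an arbitrary assumption, not a recommendation.
    """
    if days_until_full(sample) <= page_within_days:
        return Route.PAGE
    return Route.EMAIL


if __name__ == "__main__":
    roomy = DiskSample(free_gb=300.0, daily_growth_gb=10.0)
    print(days_until_full(roomy), route_disk_alert(roomy))  # 30.0 Route.EMAIL

    tight = DiskSample(free_gb=15.0, daily_growth_gb=10.0)
    print(days_until_full(tight), route_disk_alert(tight))  # 1.5 Route.PAGE
```

The point of the sketch is only that the routing decision is made once, up front, so the 30-day warning lands in email while the disk that fills tomorrow interrupts someone.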

No matter how you view alerts, it does pay to think about them and try to reduce the number and frequency of alerts that hit your administrators. That might mean configuring your monitoring differently, it might mean adding resources, or it might mean fixing broken software. We can burn out people as well as customers with constant breakage, so fix those things that are worth alerts.