The Complexity of Metrics

Steve Jones, 2023-02-06 (first published: 2023-01-30)

Monitoring your SQL Server instances is important to ensure you can meet your SLAs. Whether you care about availability, performance, reliability, or quality, it's important that whoever is responsible is looking at how the database is performing. At Redgate, we have multiple teams working on SQL Monitor to enhance and grow it to meet your needs.

A short while ago there was an internal conversation about page life expectancy (PLE). We've had some customers ask about this and about setting alerts to watch this value. Our developers and sales engineers asked for a few thoughts from Grant and others on how to respond. There are a variety of opinions, some saying monitor it, some saying don't bother.

I think both pieces of advice have merit, which is to say that this isn't a metric that you can look at in isolation. There is no value of PLE that is good or bad, or that says x is wrong or y is right. There is both a subtlety and a complexity to understanding what PLE is telling you about your system. If PLE is growing, you have to look deeper. If it's falling, same thing. If it suddenly drops, there are multiple possible causes, and you need to examine other things. However, in many cases, this isn't an actionable metric, but one that provides context about what might be happening in the database when combined with other values you monitor.

This certainly isn't a metric that you want to set an alert on, because it can rise or fall, and many times the change isn't indicative of an acute problem.

This is just one metric of many that are available in SQL Server, and knowing which ones to monitor is something good administrators learn. They know that very few values they instrument have a good or bad value, and often the rate of change needs to be combined with the actual reading to determine if there is a problem. We also often want to know if a high (or low) reading appears for an extended period of time. Having 100% CPU being used for 3 minutes likely isn't an issue. If it lasts for 3 hours, I might feel differently.
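
As a rough illustration of combining a reading with its duration and rate of change, here is a minimal Python sketch. It is not how SQL Monitor or any other product evaluates alerts; the function names, the sample format (time-ordered pairs of timestamp and value), and the thresholds are all assumptions made up for this example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical sample format: (timestamp, reading), sorted by timestamp.
Sample = Tuple[datetime, float]


def sustained_above(samples: List[Sample], threshold: float,
                    min_duration: timedelta) -> bool:
    """Return True if any unbroken run of readings at or above
    `threshold` spans at least `min_duration`."""
    run_start = None
    for ts, value in samples:
        if value >= threshold:
            if run_start is None:
                run_start = ts  # a new high run begins here
            if ts - run_start >= min_duration:
                return True
        else:
            run_start = None  # the run is broken by a lower reading
    return False


def rate_per_minute(samples: List[Sample]) -> float:
    """Average change per minute between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    elapsed = (t1 - t0).total_seconds() / 60
    return (v1 - v0) / elapsed if elapsed > 0 else 0.0


# Example: CPU pinned at 100% for 3 minutes, sampled once a minute.
start = datetime(2023, 1, 30, 9, 0)
short_spike = [(start + timedelta(minutes=m), 100.0) for m in range(4)]
three_hours = timedelta(hours=3)
print(sustained_above(short_spike, 100.0, three_hours))
```

With these assumptions, the three-minute spike does not qualify as sustained against a three-hour window, while the same reading held across three hours of samples would. Neither result says anything is wrong on its own; it is only one more piece of context to weigh.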

Metrics have more complexity than just having a range in which we ignore them and a limit at which we alert people. They are intended to be combined with each other, with observations by clients, and with the experience of looking at past observations over time. Our systems often develop patterns, and we don't get too concerned about any values when the pattern repeats. It's when something new happens and someone complains that we dig in to determine if there is a problem or the start of a new pattern.

We definitely need monitoring of our database metrics, but we also need to understand why values move and the implications of them doing so. That's something which isn't as simple as setting an alert for each one based on some value we think should never be exceeded.