DBA Hypochondria


SQL Server, like other server systems, has become such a complex machine that it is beginning to take on some of the attributes of an android. When DBAs discuss ailing machines, something of a medical air affects the conversation. There is talk of symptoms, diagnostics, cures, viruses, leakage, dumps, and other concepts borrowed from human health.

The same errors that sometimes afflict the medical profession affect DBAs too. It is tempting to collect server symptoms randomly, in the forlorn hope that the more they collect, the clearer a picture they'll assemble of what ails the server. They should, of course, be looking at the whole patient (hang on; I think I meant 'server').

"Doctor, come quickly! Buffer Cache Hit Ratio is way down; Disk Queue Length is through the roof, Page Life Expectancy has dipped below the magic '300'. On with the white coat, alerts are going off all over the place…the patient is in terrible shape; should we call the relatives?"

In fact, of course, there is a good chance the 'patient' is just fine. Examining metrics such as these is the equivalent of the eighteenth century doctor examining only the patient's bowel movements, regardless of the reported symptoms. The chances of a correct diagnosis or, indeed, of discerning whether anything at all is wrong with the patient, beyond their diet, are slim.

Some metrics are more-or-less meaningless (BCHR). Others seem inextricably linked to 'magic thresholds' that ceased to be accurate many years ago (PLE). Many more lose meaning when applied with a uniform threshold value, across all environments. As Adam Machanic noted is his 10 commandments of Monitoring, the only good universally applicable threshold value is "server power is on". All others have a significance that is frustratingly ambiguous. Not only won't such metrics help you find the real problem, they will also distract and annoy you every time you receive an alert that a threshold is breached.

What's required in SQL Server diagnosis is a more formal approach based on good medical diagnosis where the whole patient is assessed; lifestyle, stress, baseline activity . Rather than randomly inserting thermometers and extracting blood, a doctor will sit the patient down, get a feel for their overall health, check their medical history, formulate a possible diagnosis and perform specific tests to prove or disprove it. They look for symptoms that stray significantly from "normal", for the patient as well as humanity in general.

Likewise, for each of your SQL Server environments, establishing what is "normal" is the key, and the cure for DBA hypochondria. In other words, you need means baselines, and ideally some statistical algorithms for analyzing the data over time and spotting significant trends amongst the "noise".

Who owns up to being a DBA hypochondriac? Or wants to share their "baseline formula" cure?