• jeff.mason (5/3/2016)


    This is the mentality that drives me crazy as a DBA. At my last job they were building a dashboard-type report and wanted me to give them criteria for what counted as a problematic wait time -- even though we had 500 SQL Server instances ranging from 4 GB to 256 GB of RAM and from 2 to 80 CPU cores -- and from that be able to say that once a certain wait type went over a certain number, it was always a problem. It. Just. Does. Not. Work. Like. That. The conversation ended something like this -- "Can't you just give me a straightforward number so I can make my dashboard for the business?" "If I could, you wouldn't need to have hired me. The whole reason I have a job is that one size does not fit all."

    Thanks for your reply. I hear what you're saying, but my goal is not to have one exact set of bounds for all stat cases. That is why I asked, "are there published ranges for what values are healthy vs problematic, or how to determine those ranges for my server if there is no one absolute answer?"

    However, in my experience I have often found myself troubleshooting an issue without any consistent set of measurements from before the incident to compare against. To me it is not enough to say "one size does not fit all." I am not looking for exact answers, just a framework that explains why a particular counter or stat is worth capturing in the first place.

    To use an example from medicine, I understand why a doctor takes someone's temperature. A temperature above or below 98.6 degrees F may be a problem or may just be the person's own natural temperature. But generally speaking a person with a temperature of 104 degrees is not likely to be in good shape. Probably something is wrong with them, even though it takes more research to find out exactly what the cause is.

    As Gail suggested to me (I picked her answer), it seems like the best thing to do is to find what is "normal" for my server and what is "not normal" for my server.
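    For anyone else reading this later, here is one way I am thinking of starting that baseline. This is only a sketch: the table name `dbo.WaitStatsBaseline` and the filtered wait types are my own choices, not anything prescribed, though `sys.dm_os_wait_stats` itself is a standard SQL Server DMV.

    ```sql
    -- Hypothetical baseline table; capture a snapshot on a schedule (e.g., a SQL Agent job).
    CREATE TABLE dbo.WaitStatsBaseline
    (
        capture_time        DATETIME2    NOT NULL DEFAULT SYSDATETIME(),
        wait_type           NVARCHAR(60) NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL
    );

    -- Snapshot the cumulative wait stats, skipping a few waits that are usually benign.
    INSERT INTO dbo.WaitStatsBaseline
            (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
    SELECT  wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM    sys.dm_os_wait_stats
    WHERE   wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                              N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TASK_STOP');
    ```

    One thing worth noting: the DMV values are cumulative since the last instance restart, so "normal" is really the delta between two snapshots over a known interval, not the raw numbers themselves.
    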

    I have made a list of counters and stats that I have read are good to monitor when creating a baseline and that might correlate with a database issue. I was hoping to find some way of organizing those counters and stats so that they are annotated with why certain values might deserve further investigation while others probably do not.

    I understand that there is a good deal of nuance on these points but at some point (now, for me) the rubber has hit the road and I need to institute a practical performance monitoring process to help me keep track of my environment.

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html