Making performance monitoring practical

  • Hello experts,

    I am working on creating a DBA process to collect and review performance data. Although I am well along in coding and testing some ways of collecting the stats I want, I don't yet have a way to turn that data into actionable tasks.

    I have a general idea that the stats need to be reviewed after setting up a baseline, but for each stat (say, a CXPACKET wait): are there published ranges for what values are healthy vs. problematic, or a way to determine those ranges for my server if there is no single absolute answer?

    Similarly, I guess part of my process should also include a kind of flow chart: is the stat value within the acceptable bounds? If YES, move on to the next stat. If NO, then [steps to investigate and resolve].
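
    In rough T-SQL terms, the per-stat check I have in mind would look something like this. (dbo.CounterSample and dbo.CounterBaseline are placeholder tables I would populate from my own collection process and baseline review; the names are just for illustration.)

        -- Placeholder sketch: flag the latest sample of each counter as
        -- OK or needing investigation, based on per-server baseline bounds.
        -- dbo.CounterSample and dbo.CounterBaseline are assumed tables.
        SELECT  s.CounterName,
                s.SampleValue,
                b.LowBound,
                b.HighBound,
                CASE WHEN s.SampleValue BETWEEN b.LowBound AND b.HighBound
                     THEN 'OK - move on to the next stat'
                     ELSE 'Investigate and resolve'
                END AS NextStep
        FROM    dbo.CounterSample   AS s
        JOIN    dbo.CounterBaseline AS b
                ON b.CounterName = s.CounterName
        WHERE   s.SampleTime = (SELECT MAX(SampleTime) FROM dbo.CounterSample);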

    Am I correct in this approach? Just trying to sanity check it with everyone out there.

    Thanks for any help.

    - webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • There are very, very, very few stats, perfmon, SQL or other, that have a single defined threshold value, whether it be universal or just for your server.

    There is 'normal for your server and workload' and 'not normal for your server and workload'.

    I suggest at least the first chapter of https://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/3/2016)


    There are very, very, very few stats, perfmon, SQL or other, that have a single defined threshold value, whether it be universal or just for your server.

    There is 'normal for your server and workload' and 'not normal for your server and workload'.

    I suggest at least the first chapter of https://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/

    Thank you, Gail, for the link.

    - webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • Just a bit of a recommendation based solely on personal experience...

    I've found that a whole lot of people spend an eternity on collecting "stats". I've also found that a lot of people who do "baselines" don't realize that their baseline currently sucks. 😛 Even when a stat supposedly goes "out of spec", they only occasionally identify the problem correctly.

    Except for the occasional hard-disk problem, performance problems usually exist (there are always exceptions to every rule) in the code. Oddly enough, that's also where the solutions exist, but people seem to have a hard time justifying things like 60-1000X performance improvements that can be realized by fixing the right code. Instead, they'd rather study stats and perhaps make a server tweak or add some hardware that might give them a 0.5X improvement.

    Performance is in the code. Find the 10 worst code problems and fix them. Repeat until the users perceive a real difference. Chances are, the first 10 will provide very high ROI in the areas of performance improvement and reduction in resources.
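
    If you need a starting point for that "worst 10" list, the plan cache is one rough way to get it. A minimal sketch (it only sees statements whose plans are still cached, so treat it as a first cut, not a complete inventory):

        -- Top 10 statements by total CPU from the plan cache.
        -- total_worker_time / total_elapsed_time are in microseconds.
        SELECT TOP (10)
                qs.total_worker_time / 1000  AS total_cpu_ms,
                qs.total_elapsed_time / 1000 AS total_duration_ms,
                qs.total_logical_reads,
                qs.execution_count,
                SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                    ((CASE qs.statement_end_offset
                          WHEN -1 THEN DATALENGTH(st.text)
                          ELSE qs.statement_end_offset
                      END - qs.statement_start_offset) / 2) + 1) AS statement_text
        FROM    sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_worker_time DESC;

    Swap the ORDER BY to total_logical_reads or total_elapsed_time depending on which resource hurts most.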

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This is the mentality that drives me crazy as a DBA. At my last job they were building a dashboard-type report and wanted me to give them criteria for what counted as a problematic wait time -- even though we had 500 instances of SQL Server ranging from 4 GB of RAM to 256 GB and from 2 CPU cores to 80 -- and, from that, to say that once a certain wait type went over a certain number it was always a problem. It. Just. Does. Not. Work. Like. That. The conversation ended something like this -- "Can't you just give me a straightforward number so I can make my dashboard for the business?" "If I could, you wouldn't need to have hired me. The whole reason I have a job is that one size does not fit all."

  • jeff.mason (5/3/2016)


    This is the mentality that drives me crazy as a DBA. At my last job they were building a dashboard-type report and wanted me to give them criteria for what counted as a problematic wait time -- even though we had 500 instances of SQL Server ranging from 4 GB of RAM to 256 GB and from 2 CPU cores to 80 -- and, from that, to say that once a certain wait type went over a certain number it was always a problem. It. Just. Does. Not. Work. Like. That. The conversation ended something like this -- "Can't you just give me a straightforward number so I can make my dashboard for the business?" "If I could, you wouldn't need to have hired me. The whole reason I have a job is that one size does not fit all."

    Thanks for your reply. I hear what you're saying, but my goal is not to have one exact set of bounds for all stat cases. That is why I asked whether there are published ranges for what values are healthy vs. problematic, or a way to determine those ranges for my server if there is no single absolute answer.

    However, in my experience I have often found myself having to troubleshoot an issue without any consistent set of measurements from before the incident to compare against. To me it is not enough to say "one size does not fit all." I am not looking for exact answers, just a framework that explains why each counter or stat is captured in the first place.

    To use an example from medicine, I understand why a doctor takes someone's temperature. A temperature above or below 98.6 degrees F may be a problem or may just be the person's own natural temperature. But generally speaking a person with a temperature of 104 degrees is not likely to be in good shape. Probably something is wrong with them, even though it takes more research to find out exactly what the cause is.

    As Gail suggested (I picked her answer), it seems the best thing to do is find what is "normal" for my server and what is "not normal" for my server.

    I have made a list of counters and stats that I have read are good to monitor when creating a baseline and that might correlate with a database issue. I was hoping to find some way of organizing those counters and stats so that each is annotated with why certain values might deserve further investigation while others probably do not.

    I understand that there is a good deal of nuance on these points but at some point (now, for me) the rubber has hit the road and I need to institute a practical performance monitoring process to help me keep track of my environment.
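
    To make that concrete, the shape I have in mind for the annotated list is something like this (the table and column names are placeholders of my own, not anything standard):

        -- Placeholder: one row per monitored counter, carrying the reason
        -- it is collected, the per-server bounds worked out from the
        -- baseline, and first steps to take when it goes out of bounds.
        CREATE TABLE dbo.MonitoredCounter
        (
            CounterName     sysname        NOT NULL PRIMARY KEY,
            WhyCollected    nvarchar(400)  NOT NULL, -- e.g. 'sustained CXPACKET growth can point at parallelism problems'
            BaselineLow     decimal(18, 2) NULL,     -- derived from this server's history, not a published number
            BaselineHigh    decimal(18, 2) NULL,
            InvestigateNote nvarchar(400)  NULL      -- what to check first when out of bounds
        );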

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • Jeff Moden (5/3/2016)


    Just a bit of a recommendation based solely on personal experience...

    I've found that a whole lot of people spend an eternity on collecting "stats". I've also found that a lot of people who do "baselines" don't realize that their baseline currently sucks. 😛 Even when a stat supposedly goes "out of spec", they only occasionally identify the problem correctly.

    Except for the occasional hard-disk problem, performance problems usually exist (there are always exceptions to every rule) in the code. Oddly enough, that's also where the solutions exist, but people seem to have a hard time justifying things like 60-1000X performance improvements that can be realized by fixing the right code. Instead, they'd rather study stats and perhaps make a server tweak or add some hardware that might give them a 0.5X improvement.

    Performance is in the code. Find the 10 worst code problems and fix them. Repeat until the users perceive a real difference. Chances are, the first 10 will provide very high ROI in the areas of performance improvement and reduction in resources.

    Thank you, Jeff - I will incorporate some kind of top-10-worst-queries check into my performance monitoring process. It makes great sense because, as you said, it could deliver one of those 80/20-style benefit ratios.

    Thanks again.

    - webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • You are on the right track with a baseline and deviation from the baseline. Really, that is all you have. Numbers that are bad on one server might be normal on another. But when the numbers on the same server change suddenly, that's when you know there is a problem.

    The point I make, though, is that I cannot tell you a percentage of waits that is normal for a given system. Is it OLTP? You may have more WRITELOG. Is it OLAP? You might have more PAGELATCH types. CPU is always a given. But if the server activity is light, those waits will still be there and at certain percentages.

    Without a baseline, I don't know that there is a quick and easy way to tell you if anything is wrong without lots of time and research. With a baseline, though, a seasoned DBA can look at charts and graphs and see a problem quickly. So get the baseline and start monitoring for large changes in values as signs of problems.
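
    As a rough sketch of what "monitoring for large changes" can look like for waits: snapshot sys.dm_os_wait_stats on a schedule, then diff the two most recent snapshots. (dbo.WaitStatsSnapshot is an assumed name; note also that the DMV's counters reset on instance restart or a manual clear, so deltas across a restart can go negative.)

        -- Assumed history table for wait-stats snapshots.
        CREATE TABLE dbo.WaitStatsSnapshot
        (
            SnapshotTime        datetime2(0) NOT NULL,
            wait_type           nvarchar(60) NOT NULL,
            wait_time_ms        bigint       NOT NULL,
            waiting_tasks_count bigint       NOT NULL
        );

        -- Taken on a schedule (Agent job, etc.):
        INSERT INTO dbo.WaitStatsSnapshot (SnapshotTime, wait_type, wait_time_ms, waiting_tasks_count)
        SELECT SYSDATETIME(), wait_type, wait_time_ms, waiting_tasks_count
        FROM   sys.dm_os_wait_stats;

        -- Growth between the two most recent snapshots, worst first:
        WITH ranked AS
        (
            SELECT  *,
                    DENSE_RANK() OVER (ORDER BY SnapshotTime DESC) AS rn
            FROM    dbo.WaitStatsSnapshot
        )
        SELECT  cur.wait_type,
                cur.wait_time_ms        - prev.wait_time_ms        AS wait_ms_delta,
                cur.waiting_tasks_count - prev.waiting_tasks_count AS task_delta
        FROM    ranked AS cur
        JOIN    ranked AS prev
                ON prev.wait_type = cur.wait_type
        WHERE   cur.rn = 1 AND prev.rn = 2
        ORDER BY wait_ms_delta DESC;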

  • Plus Jeff's suggestion on hitting the worst queries too.

    Honestly, what I do as a DBA is make sure I have a monitoring tool on all of my instances. My weapon of choice is Solarwinds DPA. With that in place, I quickly know my worst queries and quickly see what has changed against my baseline. I can point it at the instance and walk away; if a sudden issue crops up, 99 times out of 100 I will see something I can start on if it is truly a database issue. And if I see no change at all, I can report that the DB is healthy and the application servers can be looked at.

    Thanks, Jeff (Mason), for your additional feedback. Much appreciated.

    - webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • Jeff Moden (5/3/2016)


    Except for the occasional hard-disk problem, performance problems usually exist (there are always exceptions to every rule) in the code.

    Strongly disagree! 🙂

    Unless your meaning of "code" includes DDL code too.

    On badly designed data storage, SQL code improvements may be no more effective than hardware tweaking.

    On such a design, speeding up queries only leads to an increased number of deadlocks.

    The most effective code for fixing performance issues usually starts with "CREATE TABLE".
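
    A contrived sketch of the difference. No amount of DML tuning saves the first design; the fix for it starts with CREATE TABLE:

        -- Everything crammed into untyped name/value pairs: hard to index,
        -- and every read has to cast strings back into real types.
        CREATE TABLE dbo.OrderData
        (
            OrderId   int          NOT NULL,
            AttrName  varchar(50)  NOT NULL,
            AttrValue varchar(200) NULL
        );

        -- Properly typed and keyed: the optimizer has something to work with.
        CREATE TABLE dbo.Orders
        (
            OrderId    int           NOT NULL PRIMARY KEY,
            CustomerId int           NOT NULL,
            OrderDate  datetime2(0)  NOT NULL,
            TotalDue   decimal(18, 2) NOT NULL
        );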

    _____________
    Code for TallyGenerator

  • Sergiy (5/3/2016)


    Jeff Moden (5/3/2016)


    Except for the occasional hard-disk problem, performance problems usually exist (there are always exceptions to every rule) in the code.

    Strongly disagree! 🙂

    Unless your meaning of "code" includes DDL code too.

    On badly designed data storage, SQL code improvements may be no more effective than hardware tweaking.

    On such a design, speeding up queries only leads to an increased number of deadlocks.

    The most effective code for fixing performance issues usually starts with "CREATE TABLE".

    There's no doubt that correct design of tables and indexes is the right place to start. Unfortunately, correct design is normally not what you get. Normally, you have to put up with one bad design or another. The databases that I currently have to work with are just that... really bad designs.

    Even with that, though, I've been able to greatly reduce overall CPU time, run durations, and I/O (logical reads and writes to log files) by huge amounts. And, that action has also drastically reduced the number of deadlocks because there's a whole lot less contention than there was when I first started at the company.

    Most of the improvements have included more than a 60X reduction in CPU, duration, and I/O, with about half of those exceeding 250X reductions and some reductions as high as 10,000X.

    Yeah... the code was that bad. And, except for the occasional addition of an index and one very heavily hit audit table that I fought to have changed, it was all DML and no DDL.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
