How to create a baseline using counters to analyze server performance?

  • I am currently working on creating a baseline for our server, using various counters to analyze performance. The server exclusively hosts SQL Server.

    Could anyone here share their experiences or methodologies on how to create such a baseline and use it for performance analysis?

  • Generally, just measuring performance of the hardware under no load is a bit of a waste of time. You're mainly checking that the people who sold it to you didn't lie about how fast the chips are, how fast the disk controller is, stuff like that. However, if you throw some load at the system using something like the RML Utilities, you can confirm that you've got the system configured correctly. That's about the main thing I'd use a baseline for. Otherwise, I'm generally focused more on the databases and what's inside them from a performance standpoint.

     

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • To add to what Grant says - "generating" a baseline from a single point in time is often useless. My recommendation is to use a monitoring tool and gather metrics over time. The longer the period you capture data for, the more accurate the baseline will be. But with my DBA hat on (one of many hats), I don't really need an overall baseline, as it isn't that helpful. I care more about a baseline for the queries, and Query Store does a good job of giving me that. Query Store will tell me when something suddenly went slow so I can investigate, and I can see the duration and use that to decide whether the query is worth tuning or not (a rough sketch of that kind of Query Store query is below).
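
    For example, something like this - just a sketch against the standard Query Store catalog views (avg_duration is stored in microseconds), pulling average duration per query per time interval for the last week:

    -- Rough sketch: average duration per query over time from Query Store
    SELECT qt.query_sql_text,
           rsi.start_time,
           rs.count_executions,
           rs.avg_duration / 1000.0 AS avg_duration_ms
    FROM sys.query_store_runtime_stats AS rs
    JOIN sys.query_store_runtime_stats_interval AS rsi
         ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
    JOIN sys.query_store_plan AS p
         ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query AS q
         ON q.query_id = p.query_id
    JOIN sys.query_store_query_text AS qt
         ON qt.query_text_id = q.query_text_id
    WHERE rsi.start_time >= DATEADD(DAY, -7, SYSUTCDATETIME())
    ORDER BY rs.avg_duration DESC;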

    And like Grant said, it depends on what sort of baseline you are looking for. Do you want the average across the whole day or the average across the business day? For my company, those numbers are vastly different (around 90% average CPU utilization during the business day versus closer to 30% across a 24 hour period). But CPU utilization is not the greatest metric either - I want high CPU, but not for it to be a bottleneck. If it is under 100%, it is not likely the bottleneck; it still could be if I need more cores, but in that case some tweaks to MAXDOP (max degree of parallelism) and CTFP (cost threshold for parallelism) can help bring that under control (see the sp_configure sketch below). Memory I want at high utilization, so the higher that number the happier I am, as long as it isn't at 100%, because then something is going to page to disk and I'll be a grumpy guy. Disk I/O I also want high but not at 100%. High means I have proper disk performance for my system and not too much - having too fast a disk means I wasted company money. BUT if it is frequently hitting 100% utilization and it is impacting my queries, then I need to get faster disk.
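
    To check those two settings, a minimal sketch using sp_configure - the right values depend entirely on your core count and workload, so treat the number in the comment as a placeholder:

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Current values (run_value is what the server is actually using)
    EXEC sp_configure 'max degree of parallelism';
    EXEC sp_configure 'cost threshold for parallelism';
    -- To change one, pass a value and RECONFIGURE, e.g.:
    -- EXEC sp_configure 'cost threshold for parallelism', 50;
    -- RECONFIGURE;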

    ALSO, are you looking to baseline query performance OR hardware performance OR overall server performance? They are vastly different metrics. Query performance can be impacted by overall server performance - chkdsk or defrag or antivirus scans can use up disk I/O, CPU, and memory, which can cause query performance to degrade if the software and hardware configuration is improperly tuned and set up. From a strictly DBA perspective, baselining (to me) means getting the average execution time of a query. One thing that makes it challenging is that as data changes, query performance can change. So if you are looking to get a baseline for query performance, you are going to need to capture query performance metrics over a period of time. SELECT from an empty table is fast. SELECT from a table with 100 GB of data may be slow, as you now have multiple bottlenecks coming into play - disk I/O, network I/O, free memory, and to a lesser extent CPU (a quick disk-latency check is sketched below).

    AND to make it even more fun - those things can all be impacted on the server side AND the client side. Pulling 100 GB of data into SSMS (for example) is going to put a lot of load on your system, as stuff will be trying to go to memory and very likely paging to disk. Plus SSMS doesn't do well with such large data volumes in the "results to grid" option. Results to file will be much faster to run the query, but you still need some tool to open that file AND the disk space to store the 100 GB of data.
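
    For the disk I/O piece specifically, per-file latency from sys.dm_io_virtual_file_stats is a quick way to see whether disk is actually the bottleneck - a rough sketch, and note the numbers are cumulative since the last restart:

    -- Rough sketch: per-file read/write latency since the last restart
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           vfs.num_of_writes,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
         ON mf.database_id = vfs.database_id
        AND mf.file_id = vfs.file_id
    ORDER BY avg_read_ms DESC;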

    My question would be more: why do you need this data? What problem is it trying to solve? For me, as long as the server is online and my queries are performing well enough for the end users, things are good. IF it is slow for the end users, I tune the slow queries and make a best effort not to impact other queries.

    My "baseline" for query tuning is current performance. I generally copy a database from live to test, do any required data masking, then run the query with no code changes 3-5 times with statistics io and time on to get a "baseline" for how that query performs across multiple runs (the skeleton for that is below). Then I tune it, run it 3-5 times again, and see if I've made it better or worse. If performance improves, then I work to migrate it to live. If I made it worse, I continue tuning. IF I have made a change that I think MAY impact other queries (adding/removing indexes, for example), then my testing scope increases, as I need to evaluate the full impact of my changes: it MAY make my query faster but could make a LOT of other queries slower, in which case my change should be denied by the code reviewer unless my test plan indicates I tested other queries AND the performance impact was not measurable OR was a small enough difference that go-live is still recommended.
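
    The test-run skeleton looks something like this - dbo.YourBigTable is just a hypothetical stand-in for whatever query you are tuning:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run the candidate query here, unchanged, 3-5 times.
    -- Hypothetical example:
    -- SELECT ... FROM dbo.YourBigTable WHERE ...;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    -- Compare logical reads and CPU/elapsed time across the runs before and
    -- after tuning; discard the first run if you want warm-cache numbers.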

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • My recommendation would be to buy a tool.  This site exists thanks to the generosity of RedGate.  Perhaps their monitoring tool would save you some time and aggravation (now and in the future), do a better job than something homegrown, and save you some money in the long run.

    If not, then Brent Ozar also has his "Blitz" series of system procs, and those are awesome because they also include links to documentation on his site explaining what everything means and what to do about it.  And it's all FREE.  I strongly recommend having a look at his site.

    https://www.brentozar.com/blitz/

    Last but certainly not least is Adam Machanic's "sp_WhoIsActive".  It's free and does a good job.
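
    Once they are installed, the basic calls are just stored procedure executions - no parameters needed to get started:

    EXEC sp_Blitz;          -- overall health check, with links explaining each finding
    EXEC sp_BlitzFirst;     -- "what is slow right now" snapshot
    EXEC sp_WhoIsActive;    -- currently running requests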

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
