Monitoring SQL Server performance: SQL Server and disk baselining

  • Hey everyone, hope you all are doing great.

    I have a couple of questions about capturing OS/SQL Server information, specifically about how you capture data and baseline everything for SQL Server/OS performance.

    Here it goes. This might not make perfect sense at first, so I hope you can cut me some slack, hahah. It's about how you do it: I'm going to explain the way I'm doing it, and hopefully you can suggest a different approach, because I know my approach is not 100% accurate, heck, not even 80%, and I know I'm missing stuff.

    I've seen different videos/posts about performance, but they assume you already have the numbers, the baseline numbers. What if you join a company and nobody has that data, because everyone is in their comfort zone and they don't know or care that much if something is wrong underneath? Maybe performance is degrading a little and nobody on the inside notices until the users are already having issues and raise the red flag.

    When talking about SQL Server performance I mean logical/physical reads, CPU consumption, waits, the most expensive queries, and all that.

    When talking about OS performance I really mean the LUN/disk performance counters, since that's about all I care about on a SQL box, plus not putting pressure on the C: drive, because we want the OS disk running smoothly so we don't have issues when the OS and SQL are talking.

    I baseline using a handful of counters on all the drives, mostly these:

    • Avg. Disk sec/Read - Average disk seconds per read - Latency for Reads
    • Avg. Disk sec/Write - Average disk seconds per write - Latency for writes
    • Disk Read Bytes/sec - Disk read bytes per second - throughput for reads
    • Disk Reads/sec - Disk reads per second - IOPS for reads
    • Disk Write Bytes/sec - Disk write bytes per second - throughput for writes
    • Disk Writes/sec - Disk writes per second - IOPS for writes

    I believe capturing the data from these drives and averaging it should do the job; do you think that's enough?
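
    To illustrate what I mean by averaging, here is a rough sketch of the kind of query I run against the capture table. The table name dbo.DiskCounterLog and its columns are just placeholders for wherever your perfmon samples end up; I also pull a 95th percentile, because the average alone hides the spikes.

    -- Hypothetical capture table: dbo.DiskCounterLog
    --   (SampleTime datetime2, DriveLetter varchar(10), CounterName varchar(128), CounterValue float)
    -- populated from perfmon/logman samples of the counters listed above.
    SELECT  DriveLetter,
            CounterName,
            AVG(CounterValue)                                AS AvgValue,
            MAX(CounterValue)                                AS MaxValue,
            -- approximate 95th percentile of the samples
            MAX(CASE WHEN pct <= 0.95 THEN CounterValue END) AS Pct95Value
    FROM (
            SELECT  DriveLetter,
                    CounterName,
                    CounterValue,
                    PERCENT_RANK() OVER (PARTITION BY DriveLetter, CounterName
                                         ORDER BY CounterValue) AS pct
            FROM    dbo.DiskCounterLog
            WHERE   SampleTime >= DATEADD(DAY, -30, SYSDATETIME())
         ) AS samples
    GROUP BY DriveLetter, CounterName
    ORDER BY DriveLetter, CounterName;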

    For SQL Server I use these:

    • Query Store
    • Adam Machanic's sp_WhoIsActive

    I don't know if using sp_WhoIsActive to capture the data is that great; I mean, I only get what's running at that moment. I don't mean to say sp_WhoIsActive is useless, it's the procedure I use the most, it's 10/10. I capture its output to a table on a schedule, roughly like the sketch below.
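
    For reference, this is the logging-to-a-table pattern from Adam's documentation, as far as I understand it; the table name dbo.WhoIsActiveSnapshots is just the example name I'm using here, and the procedure generates the matching table definition for you.

    -- One-time setup: have sp_WhoIsActive generate the table definition that matches
    -- the options you plan to collect with (here, plans included).
    DECLARE @schema varchar(max);
    EXEC dbo.sp_WhoIsActive
         @get_plans     = 1,
         @return_schema = 1,
         @schema        = @schema OUTPUT;

    SET @schema = REPLACE(@schema, '<table_name>', 'dbo.WhoIsActiveSnapshots');
    EXEC (@schema);

    -- Scheduled collection (e.g. an Agent job every 30-60 seconds), same options
    -- plus the destination table; each row is stamped with a collection_time you
    -- can match against the disk spike window later.
    EXEC dbo.sp_WhoIsActive
         @get_plans         = 1,
         @destination_table = 'dbo.WhoIsActiveSnapshots';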

    After I get the information from the drives and start watching graphs, I see spikes in the Avg. Disk sec/Read and Avg. Disk sec/Write averages at certain times of day, even really early in the morning (1:00-5:00 AM), when no heavy loads like index maintenance or DBCCs are running. I then go to Query Store and to the table I use to capture the sp_WhoIsActive queries that were running when the spike happened. I find queries, but nothing that is eating the data file/log/tempdb disks; according to the captured information the running queries don't consume that much tempdb or that many reads/writes, yet I still sometimes see the spikes.
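
    The Query Store side of that check looks roughly like this for me. It's just a sketch, assuming Query Store is enabled on the database in question; the window below is only an example, and the totals are re-derived from the per-interval averages Query Store keeps.

    -- Top I/O consumers Query Store recorded during a spike window
    -- (example window only -- adjust the dates/offset to the spike you saw).
    DECLARE @from datetimeoffset = '2024-03-15 01:00:00 -06:00',
            @to   datetimeoffset = '2024-03-15 05:00:00 -06:00';

    SELECT TOP (20)
           q.query_id,
           qt.query_sql_text,
           SUM(rs.count_executions)                            AS executions,
           SUM(rs.count_executions * rs.avg_physical_io_reads) AS total_physical_reads,
           SUM(rs.count_executions * rs.avg_logical_io_writes) AS total_writes,
           SUM(rs.count_executions * rs.avg_duration) / 1000.0 AS total_duration_ms
    FROM   sys.query_store_runtime_stats          AS rs
    JOIN   sys.query_store_runtime_stats_interval AS i  ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
    JOIN   sys.query_store_plan                   AS p  ON p.plan_id  = rs.plan_id
    JOIN   sys.query_store_query                  AS q  ON q.query_id = p.query_id
    JOIN   sys.query_store_query_text             AS qt ON qt.query_text_id = q.query_text_id
    WHERE  i.start_time < @to
      AND  i.end_time   > @from   -- intervals overlapping the spike window
    GROUP BY q.query_id, qt.query_sql_text
    ORDER BY total_physical_reads DESC;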

    How would you go about these kinds of cases? I think I should find a way to capture the OS processes, and the files those processes are hitting at those times, to see whether my disk spikes are coming not from SQL Server but from OS processes using the disks.

    Regarding baselining disk latency, it looks like everyone has their numbers: 0.1 ms great, 0.5 good, etc. How do you determine those numbers for your system? I don't have the support of the SAN guys to get them, or to find out how fast the disks are. Is there a way I could work the numbers out by capturing data with performance counters? Maybe the average I get from Avg. Disk sec/Read and sec/Write could be my baseline; is that right? Is it the best? How can I tell? How would you tell?

    Thank you so much for taking the time to read this post.

  • My approach - get a monitoring tool to grab the data for you rather than watching things yourself, and check the tool over a long timeline to get a baseline.

    Spikes that are brief may be safe to ignore, or may need more investigation; it really depends on your setup. Spikes in disk I/O, for example, may happen when the VM is being backed up or when your SAN does some work on the back end. It may not be SQL doing anything at all.

    My advice is to monitor the metrics that are useful to you, that you can act on, or that you may need as proof of where a problem is. Nobody will ever say "I captured too much data for this problem", but you may get hit with someone asking "what is the network latency?" when you missed capturing that. In a 100% perfect world, if you were the DBA, you would capture database metrics, the networking team would capture network metrics, and the server admin would capture server metrics. The problem with capturing metrics outside your scope of responsibility is that nobody likes being told "if you had captured these metrics, you would see the problem is the SAN".

    My approach is to capture the defaults that the tool offers (I use Redgate SQL Monitor, but other tools are out there), let that run for a month or two to build my baseline, and have weekly reports sent to me so I can check whether performance is degrading. Then once per quarter I compare the values at the start of the quarter with the current values, and I repeat that yearly as well. This gives me a rough idea of whether my systems are degrading and the ability to tweak things or investigate. Slow degradation is usually not noticed by end users, but if I spot a drastic change in my weekly reports, I can fix it before anyone complains.

    The problem with capturing things yourself is that it is EASY to miss things you should have captured without realizing it. A 3rd party tool will capture most of the things that should be helpful, and anything it misses you can add in. For example, I have Service Broker set up on my servers and I added it to be monitored by Redgate. If any Service Broker queue goes down (a poison message is the usual culprit), I want to be notified immediately so I can correct it and get it back online before anyone notices.
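
    If you end up rolling your own checks instead, a quick poll along these lines (just a sketch; the alerting part is whatever notification mechanism you already use) will surface a queue that has been disabled:

    -- Queues that have stopped receiving or enqueuing, which is what happens
    -- after the repeated rollbacks caused by a poison message.
    SELECT  name,
            is_receive_enabled,
            is_enqueue_enabled
    FROM    sys.service_queues
    WHERE   is_receive_enabled = 0
       OR   is_enqueue_enabled = 0;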

    TL;DR - I would invest in a SQL monitoring tool. It will save you the headache of trying to monitor things with multiple tools and hoping you captured everything, without causing a performance hit with your own queries and investigation.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Monitoring tools are a great purchase - I agree with what Brian wrote. But if that's not a possibility: are you responsible for the entire server, or just SQL Server? If your responsibilities are limited to SQL Server, I wouldn't worry about the disk stats in perfmon. I usually don't anyway, and I primarily use the DMV sys.dm_io_virtual_file_stats, since I mostly care about SQL Server performance. That DMV shows you what the stalls are and on which files, so you can see the read or write latency per file. Some of the latency numbers you see quoted for SQL Server have nothing to do with what the SAN folks would tell you or what disks you're on. That doesn't mean the disks you use don't matter; rather, it's one of those cases where the reads and writes in SQL Server should, on average, be below a certain number of milliseconds no matter what disks you are using. A lot of the numbers you see come from a Microsoft paper, "analyzing characterizing and io size considerations". It's available to download somewhere - I'm just lazy right now and didn't look it up. You'll find it if you don't already have it 🙂 But this article posts some of those numbers if you don't want to find that paper (since it is Friday), and it also discusses using the DMV:

    Monitoring Read/Write Latency

    Read the section on "What Values are good or bad". It lists some of the numbers you've probably seen; those are just what you should be shooting for with any disk subsystem. The article also has a link to a script on SQLskills that you can use to capture the disk numbers using the DMV.
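
    To give you a rough idea of what the DMV returns, here is a minimal sketch. The numbers are cumulative since the instance last started, so for a baseline you would capture this on a schedule and difference consecutive snapshots rather than rely on one run.

    -- Per-file read/write latency derived from cumulative stall times and I/O counts.
    SELECT  DB_NAME(vfs.database_id)                                AS database_name,
            mf.name                                                 AS logical_file_name,
            mf.physical_name,
            vfs.num_of_reads,
            vfs.num_of_writes,
            CASE WHEN vfs.num_of_reads  = 0 THEN 0
                 ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
            CASE WHEN vfs.num_of_writes = 0 THEN 0
                 ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;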

    Sue

  • Hey guys,

    thanks for the replies.

    Yeah, I wish we had a monitoring tool, but we don't have one specifically for SQL Server. The ones we have only see the requests going to SQL Server and the response times; they raise a flag when SQL's reply takes longer or when the end user gets timeouts, which are never SQL Server's fault. But hey, SQL Server is the layer where, if something beneath the SQL service is having issues or isn't configured properly (OS, network, SAN, antivirus, etc.), the victim will always be SQL Server, because the service is the one being affected. So we always need to find a way to show them that SQL is not the issue, that we might be having a network problem, or that some other service is eating all the CPU/memory on the OS. Most of the time that's hard, because it's not something you can catch with the eye. How can you tell you're having network issues? Man, we've had to read a ton about networking and packets, packets that weren't built correctly, and we had to prove to them that those might have been the issue.

    Yes, sadly in our case we have to go into the OS and check metrics that aren't being captured, hence the questions; we have to worry about problems at both the OS level and the SQL Server level. Thanks for the information regarding the white paper and the SentryOne URL, they seem very informative. I also picked up some Microsoft white papers on waits, queues, and the most common issues, plus the free Redgate books.

    I'll start getting the hang of the I/O stats DMVs as well, start capturing them, and build some sort of baseline.

    Thanks again!

    Best regards.

     

    Edit: found the document you were talking about earlier; it's a docx.

    analyzing characterizing and io size considerations

