
I/O Baseline Comparison – Idera Diagnostic Manager


Which servers need the most attention when it comes to I/O?  When I increase the memory on a server, what effect does it have on I/O?  What was it before the change, and what was it afterwards?

These are tough questions, and I have a script that I hope will help you out.

Most monitoring tools, including Idera, are great at showing what a single server is doing right now, but not as good at comparing performance and load across larger windows of time to see which servers are doing the most work.  So I wrote a script that shows how much I/O load each server is doing and compares it to the same time window in previous weeks.

Your servers will vary, a lot.  They’ll vary from my servers, and they’ll vary from each other as well.  Depending on the workload, specifically the difference between OLTP and OLAP, higher I/O may be expected and additional memory will have less of an impact.  For OLAP, if you have 32 GB of memory and read a 64 GB table followed by a different 92 GB table, then it’s all physical reads.  If you up your memory to 48 GB, they’re still all physical reads.  The cost to up this server (or VM) is the same as the cost to upgrade an OLTP server from 32 GB to 48 GB, but the story will end quite a bit differently there.
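If you want to see how that plays out on a live server, one quick check is how much of each database actually fits in the buffer cache right now.  Here’s a minimal sketch using sys.dm_os_buffer_descriptors rather than the Idera repository; it returns one row per cached page, so it can take a while on servers with a lot of memory, making it an occasional spot check rather than something to schedule.

--How much of each database is in the buffer cache right now
--One row per cached 8 KB page, so this can be slow on large-memory servers
SELECT DatabaseName = DB_NAME(database_id)
, Cache_MB = COUNT(*) / 128 --8 KB pages: /128 = MB
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY Cache_MB DESC

If the OLAP tables dwarf the cache both before and after the memory bump, that’s your sign the extra memory won’t turn those physical reads into logical ones.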

The catch is that this change, like any other change, should be measured before and measured afterwards, then have the cost of the change analyzed and justified.  The results you’ll see below, coupled with my Wait Stats Baseline Comparison using Idera post, will help you figure out where to make a change like that and how to measure it.

There are other reasons you’d want to look at this data as well.  Even a new index that dramatically dropped the physical reads for one query could have the opposite effect on your server as a whole.  The point is, even something very straightforward needs to be verified to confirm it’s doing what you expect.  Looking at the details you see in the canned reports is great, but so is the ability to stand back and look at the big picture, which I’ll get into here.
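For the single-query side of that verification, SET STATISTICS IO is the quickest before-and-after check.  A minimal sketch; dbo.Orders and its columns are hypothetical stand-ins for whatever query the new index was built for:

--Capture per-table logical and physical reads for one query
SET STATISTICS IO ON;

SELECT OrderID, OrderDate --hypothetical table and columns; swap in your own query
FROM dbo.Orders
WHERE CustomerID = 42;

SET STATISTICS IO OFF;
--Compare the reads shown in the Messages tab before and after the index,
--then confirm the server-wide picture with the script below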

The solutions I’ve put into place using this type of query have ranged from query tuning, changing the timing of larger processes, adding memory like I mentioned above, and buying faster disks to many other options, including acknowledging we were doing a ton of reads and leaving it be.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it:

The biggest issue is the word “before”.  If you weren’t watching your servers before, then you have nothing to compare the current state of your server to.  And if you’re watching them consistently, you’ll be able to make an apples-to-apples comparison.

Your production servers should have some kind of monitoring software in place, and that software is capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software running so you can get alerted when things go wrong, but pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.
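If you do go that route, here’s a minimal sketch of what collecting the raw counters yourself might look like.  The table name is one I made up, you’d schedule the INSERT to run regularly, and keep in mind the /sec counters in sys.dm_os_performance_counters are cumulative since the instance started, so you diff consecutive snapshots to get the activity for a window.

--A do-it-yourself collection sketch; dbo.BufferManagerSnapshots is a hypothetical table
IF OBJECT_ID('dbo.BufferManagerSnapshots') IS NULL
CREATE TABLE dbo.BufferManagerSnapshots
( CollectionTimeUTC DateTime NOT NULL
, CounterName nvarchar(128) NOT NULL
, Value bigint NOT NULL
)

--Schedule this to run every few minutes; the /sec counters are cumulative,
--so activity for a window is the difference between two snapshots
INSERT INTO dbo.BufferManagerSnapshots (CollectionTimeUTC, CounterName, Value)
SELECT GETUTCDATE()
, RTRIM(counter_name)
, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
AND counter_name IN ('Page reads/sec', 'Page writes/sec', 'Page lookups/sec', 'Readahead pages/sec', 'Page life expectancy')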

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll tack them on here and give you credit for writing them.  I can only write against what I can test against, it’s nothing personal.

I/O Differences Script

What I look for are true apples-to-apples comparisons, so I’m typically looking at a longer time range (I like 24 hours) during peak hours (12:00 to 21:00 UTC on weekdays for me; yours will differ, even server-to-server), comparing this week to the last couple of weeks.

DECLARE @StartTime DateTime
, @EndTime DateTime
, @InstanceName sysname
, @Weekdays bit
, @BusinessHours bit

SET @EndTime = GetUTCDate()
SET @StartTime = DateAdd(Hour, -24, @EndTime) --24-hour window; adjust to taste
SET @InstanceName = NULL --Use 'ServerInstance' for an individual server
SET @Weekdays = 1 --1 = Monday through Friday only (assumes the default DATEFIRST of 7)
SET @BusinessHours = 1 --1 = 12:00 to 21:00 UTC only

--Most recent 24 hours
SELECT S.InstanceName
, StartTime = @StartTime 
, EndTime = @EndTime
, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) --8 KB pages: /128 = MB, /1024.0 = GB
, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / NULLIF(CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)), 0) * 100 AS DEC(20,1)) --NULLIF avoids a divide-by-zero on idle servers
, AvgPLE = AVG(St.PageLifeExpectancy)
, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE St.UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
AND (DATEPART(WEEKDAY, St.UTCCollectionDateTime) BETWEEN 2 AND 6 OR @Weekdays = 0)
AND (DATEPART(HOUR, St.UTCCollectionDateTime) BETWEEN 12 AND 21 OR @BusinessHours = 0)
AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC --Biggest readers (Reads_GB) first

--Step the window back one week; subtracting an int from a datetime subtracts days
SELECT @StartTime = @StartTime - 7 
, @EndTime = @EndTime - 7

--Same window, one week ago
SELECT S.InstanceName
, StartTime = @StartTime 
, EndTime = @EndTime
, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / NULLIF(CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)), 0) * 100 AS DEC(20,1))
, AvgPLE = AVG(St.PageLifeExpectancy)
, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE St.UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
AND (DATEPART(WEEKDAY, St.UTCCollectionDateTime) BETWEEN 2 AND 6 OR @Weekdays = 0)
AND (DATEPART(HOUR, St.UTCCollectionDateTime) BETWEEN 12 AND 21 OR @BusinessHours = 0)
AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

--Step back another week
SELECT @StartTime = @StartTime - 7 
, @EndTime = @EndTime - 7

--Same window, two weeks ago
SELECT S.InstanceName
, StartTime = @StartTime 
, EndTime = @EndTime
, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / NULLIF(CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)), 0) * 100 AS DEC(20,1))
, AvgPLE = AVG(St.PageLifeExpectancy)
, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE St.UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
AND (DATEPART(WEEKDAY, St.UTCCollectionDateTime) BETWEEN 2 AND 6 OR @Weekdays = 0)
AND (DATEPART(HOUR, St.UTCCollectionDateTime) BETWEEN 12 AND 21 OR @BusinessHours = 0)
AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

--Step back another week
SELECT @StartTime = @StartTime - 7 
, @EndTime = @EndTime - 7

--Same window, three weeks ago
SELECT S.InstanceName
, StartTime = @StartTime 
, EndTime = @EndTime
, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / NULLIF(CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)), 0) * 100 AS DEC(20,1))
, AvgPLE = AVG(St.PageLifeExpectancy)
, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE St.UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
AND (DATEPART(WEEKDAY, St.UTCCollectionDateTime) BETWEEN 2 AND 6 OR @Weekdays = 0)
AND (DATEPART(HOUR, St.UTCCollectionDateTime) BETWEEN 12 AND 21 OR @BusinessHours = 0)
AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

The Results

This is, by default, going to show you the biggest I/O users among the servers being monitored.  It’s great to know, but it’s not the whole picture.  Some servers are going to be almost pure OLTP, others are going to be almost pure OLAP, and most will be some kind of mix.  The more OLAP you do, the more physical reads (Reads_GB and ReadAhead_GB) should be expected.  It’s really not a bad thing; it’s how your databases are used.

Then there’s the realistic version of the users’ expectations (not always the version they tell you) to consider.  Are the queries coming back as fast as they should?  Large annual reports can take a bit, but a website that takes more than a couple of seconds isn’t going to be used.

This report shows the load you’re throwing at the disks, but user expectations are measured in seconds, not gigabytes.  Yeah, more gigabytes means more seconds, but it’s not like there’s a solid ratio.

That being said, the true value of this information comes when you combine it with the type of server it is, the reasonable expectations of the users, the amount of time the server spends waiting on I/O, and so much more.  To start you out, look at the Wait Stats Baseline Comparison using Idera post, too.
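If you want a quick, raw look at that waiting time without the repository, sys.dm_os_wait_stats will show it, with the caveat that it’s cumulative since the last restart (or since the stats were cleared), so it’s a rough gut check rather than a baseline:

--Cumulative time spent waiting on physical page reads since the instance started
SELECT wait_type
, WaitTime_Sec = wait_time_ms / 1000.0
, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC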

If there are OLTP servers that stand out as doing a ton of physical reads and the wait stats show that it’s an issue as well, look into more memory and more time for tuning.  Take it a step further and use all of the trending info (not just this report) to split out I/O load and plan for faster disks.  There’s a lot of potential to use this info to do a lot of good.

My Idera Scripts

I keep all the scripts that query Idera Diagnostic Manager in one place, on my Idera Diagnostic Manager Scripts page.  Before this script had its own post, it was posted there.  As I make posts for each script, that page will live on as a directory of the scripts I have to get more use out of this product.

This is the first time I’m doing something like this, but I’ll eventually make other pages that pull together topics like this as well.  It feels good to find a great way to take blogging to the next level like this.

Speaking of taking it to the next level, let me know your thoughts on this script, others you’d like to see, and anything else in the comments below.
