In my last article, I mentioned that I had a system to monitor my servers and that I used it to gauge the current health of my production environment and to help me to predict if things were going to go wrong in the near future. Let’s start the discussion of how to do this for yourself by addressing the high points. We’ll work our way down from here.
This article is part of a series. The complete list of articles is:
- How Often Do You Give Your Server a Physical?
- Let's Start with Architecture (this article)
Step 1: Basic Setup
You want data collected from each instance to wind up in one central repository. Therefore, you’ll want to setup a database on each of your local servers and choose one server to act as your monitoring repository. On your central server, set up a repository database as well. I do it this way for performance when I report on it. You can call the local and hub databases anything you want, but it should be the same name on each server.
For our purposes, let’s call the local server databases LocalDBMonitoring. Every production server should have this database on it. On the central server, you’ll have that database plus another one to server as the monitoring repository. We’ll call it LocalDBMonitoringHub for now, but name it whatever you like.
Step 2: What Information Do You Want to Collect?
I gave a list last time of the data that I collect:
- Brent Ozar’s Blitz results and whether or not they were fixed (done once monthly and recorded to a table on each server, then sent to the hub for archiving)
- CPU utilization – both by SQL Server and non-SQL Server processes. I find breaking these out is helpful
- IOs by drive and server (again, breaking them out is helpful)
- PLE – here I’m looking for sustained low page life expectancy, rather than quick dips. Are the dips getting longer? What times of the day? Can I correlate them to application processes reliably?
- How long is it taking for Ola’s indexing and CHECKDBs to complete on each server?
- How many times have we had to revert either to a backup or a database snapshot?
- How many times has a server been rebooted?
- What are the top five waits on each server? Have they increased or changed?
- What are my top ten databases across production in terms of size, and how much are they growing by month/quarter/year?
- How much space is each server taking up on the SAN?
You may or may not want all of this information, or you may have other data points that you would rather have. That’s okay. Just come up with a plan, and how you can get your hands on it. Each data point will be its own table in the LocalDBMonitoring database, which can be populated however you choose (SQL Server Agent jobs, PowerShell, whatever looks good to you). I’ll be going into that in a little more detail in the coming weeks, along with sample table architecture.
Step 3: How Often Do You Need to Collect the Data Points?
This is likely to vary. You could choose to gather metrics such as CPU, IO, PLE a little more frequently (say, every 20 minutes for example), but other metrics like Blitz fixes may only be gathered once a month. However frequently you decide to collect the data, the next step is to have the information shipped off to LocalDBMonitoringHub as often as you deem necessary.
The local tables can be filled, data shipped off, then truncated before the next fill to keep the local tables small and stable in terms of size. This is all managed through SQL Agent jobs, the workings of which will be explained in a later article. The LocalDBMonitoringHub will have an archive mechanism (of which more later) to manage the size of its data as well.
Step 4: How Long Do You Want to Retain the Information?
We have a strategy to manage the size of our local tables, but now we need to think about the LocalDBMonitoringHub tables. If you are wanting to trend baseline data over time to track changes while still seeing what the server is doing right now, you are going to want to archive some of the data. This will allow your “what is the server doing right now” report to stay more performant, while providing the data for a separate forecast report.
So, at this point, we conceptually have a LocalDBMonitoring database on each instance that collects the data points of your choosing and frequency. Data is shipped off to a central repository (LocalDBMonitoringHub). This database brings all the data together for reporting purposes. It retains the local server name and data points in similarly named tables for clarity.
For instance, if LocalDBMonitoring has a PLE table, then LocalDBMonitoringHub has a PLE table to store the data from all of the servers. There would also be a PLE_HistoricalData table to hold archived data. Maybe the PLE table holds data for a month, and then data is archived for six months before it is deleted – it is up to you and the needs of your organization.
I find that retaining the data over a ninety-day period works nicely for my purposes. In the end, we will architect this in such a way that your final “What is the Server Doing Right Now” reporting sprocs consist largely of SELECT * FROM dbo.DataPointTableName statements. The forecast report sprocs will be PIVOT tables, but the logic will be pretty streamlined, allowing for a quickly executable report.
Next up: How to start planning and coding for your process. We’ll look at sample table architecture and the specifics for how to collect data for your local servers, as well as discussing the mechanics of using SQL Agent jobs to bring your data to your repository.