It's Monday morning, and your calendar reminder has just popped up to let you know that you are on call. Since it's Monday morning, and you got in before anyone else to have some alone time with your systems, this is the perfect time to start in on your on-call duties. Where to begin? Which task do you perform first? What tasks did the previous on-call person perform last? I hope to stimulate some thought on this process and help you make better use of the time you have to monitor your systems and ensure their availability and uptime.
I would first like to suggest that you keep a diary or record of tasks and their results. This record will allow others to see what has and has not been done previously. It will also allow you to establish a baseline, gather metrics, and see trends and patterns. We have chosen a simple spreadsheet that keeps track of tasks, results, and other gathered data. A template tab exists to copy into a new tab. Each day a new tab is created and populated. As you complete tasks, the results are filled in. Tasks that do not get completed simply have no results associated with them. This way others can look back on specific days and see what the results were, or which items were not done. Whether every task must be filled in is something your own company and team will need to decide. Each server that we are responsible for has a column in this spreadsheet, where data can be collected on a per-server basis.
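As a rough illustration of the record described above, here is a minimal Python sketch of a daily sheet copied from a template, with one column per server. The task names, the server names (`SQL01`, `SQL02`), and the helper functions are all made up for the example; our actual record is a plain spreadsheet, and yours may look quite different.

```python
import csv
import io

# Hypothetical task template and server names, for illustration only.
TEMPLATE_TASKS = ["Scan OS error events", "Check on backups", "Free space on drives"]
SERVERS = ["SQL01", "SQL02"]

def new_day_sheet(tasks, servers):
    """Copy the template: one row per task, one empty cell per server."""
    return {task: {server: "" for server in servers} for task in tasks}

def record_result(sheet, task, server, result):
    """Fill in a result as a task is completed."""
    sheet[task][server] = result

def incomplete(sheet):
    """List (task, server) pairs that still have no result."""
    return [(t, s) for t, row in sheet.items() for s, v in row.items() if v == ""]

def to_csv(sheet, servers):
    """Render the sheet as CSV text, one column per server."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Task"] + servers)
    for task, row in sheet.items():
        writer.writerow([task] + [row[s] for s in servers])
    return buf.getvalue()

sheet = new_day_sheet(TEMPLATE_TASKS, SERVERS)
record_result(sheet, "Check on backups", "SQL01", "OK")
print(to_csv(sheet, SERVERS))
```

Leaving a cell empty, rather than writing "not done", keeps the sketch in line with the spreadsheet: a task with no result is simply one that was not completed that day.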
Some of the items that we have on our task list are as follows.
- Scan OS error events
In Computer Management, in the Event Viewer, we review the errors that appear in the Application, Security, and System sections. Filter each of these sections by errors and look for anything that has occurred since the last time this process was completed. Detail anything that appears there in your record keeping; you may even have to dig in, find out the reasons for it, and remediate it. This task can be quick, or it can stretch over a few days, depending on the events that you encounter.
- Check on Backups
We all have maintenance plans, third-party solutions, or something else to ensure we have backups of our systems. Whatever solution you have, make it a habit to check it as often as you can while on call. Ensure that backups are being processed properly. If you do this daily while on call, the odds of going a couple of days without a backup diminish greatly. Unfortunately, most shops that implement this task do so only after finding that no backups existed for some period. Don't let this happen to you.
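One way to make the daily backup check concrete is to compare each database's most recent backup time against a threshold. The sketch below assumes you have already collected those finish times into a dictionary (for example from the backup history in `msdb`, or from your third-party tool's reports); the database names, the dates, and the 24-hour threshold are illustrative assumptions, not recommendations.

```python
from datetime import datetime, timedelta

def stale_backups(last_backups, now, max_age=timedelta(hours=24)):
    """Return databases whose latest backup is missing or older than max_age."""
    stale = []
    for database, finished in sorted(last_backups.items()):
        if finished is None or now - finished > max_age:
            stale.append(database)
    return stale

# Hypothetical sample data: database name -> finish time of the last backup.
now = datetime(2024, 1, 8, 7, 0)
last_backups = {
    "Sales": datetime(2024, 1, 8, 1, 0),   # backed up six hours ago
    "HR": datetime(2024, 1, 5, 1, 0),      # three days old
    "Staging": None,                       # no backup on record
}
print(stale_backups(last_backups, now))  # ['HR', 'Staging']
```

Anything this returns belongs in the day's record, along with what you did about it.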
- System specific output files
You may have reports, text files, dumps, snapshots, etc. that are output from your system. These will be for a variety of reasons. Identify them, document them, and then monitor them. Ensure that they are occurring on a regular basis, and that you have the means to prove so.
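If your output files land in known locations, a small script can provide the proof that they are arriving on schedule. This is only a sketch: the function name and the idea of using the file's modification time as evidence of a fresh run are assumptions on my part, and the paths and age limit would come from your own documentation of these files.

```python
import os
import time

def stale_outputs(paths, max_age_seconds, now=None):
    """Map each problem file to 'missing' or 'stale'; healthy files are omitted."""
    now = time.time() if now is None else now
    problems = {}
    for path in paths:
        if not os.path.exists(path):
            problems[path] = "missing"
        elif now - os.path.getmtime(path) > max_age_seconds:
            # Last modified longer ago than the allowed age.
            problems[path] = "stale"
    return problems
```

Calling `stale_outputs(list_of_paths, 24 * 3600)` each morning and pasting the result into your record gives you a dated trail showing which files were present and current.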
- Log and data file sizes
To keep a handle on the growth of your systems, you should devise a way to monitor and keep tabs on the sizes of your database files. A simple solution is to run a query that gathers all this info, and paste it into a spreadsheet. More complex solutions could be implemented. The end result needs to be that you know the sizes of these files and have metrics over time to help you plan for and monitor those systems. Doing this task daily while on call will help you keep tabs on growth and compare it with what you expect.
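Once the sizes are captured each day, comparing two snapshots is simple arithmetic. The sketch below assumes the sizes arrive as counts of 8 KB pages, which is how SQL Server catalog views such as `sys.master_files` report them; the file names and page counts are invented for the example.

```python
PAGE_SIZE_KB = 8  # SQL Server data pages are 8 KB

def pages_to_mb(pages):
    """Convert a size in 8 KB pages to megabytes."""
    return pages * PAGE_SIZE_KB / 1024

def growth_report(previous, current):
    """Compare two {file_name: size_mb} snapshots; return growth in MB per file.

    A file that is new in `current` is reported as having grown by its full size.
    """
    return {name: size_mb - previous.get(name, 0.0)
            for name, size_mb in current.items()}

# Hypothetical snapshots taken on consecutive days.
yesterday = {"Sales.mdf": pages_to_mb(128000), "Sales_log.ldf": pages_to_mb(25600)}
today = {"Sales.mdf": pages_to_mb(131072), "Sales_log.ldf": pages_to_mb(25600)}
print(growth_report(yesterday, today))  # {'Sales.mdf': 24.0, 'Sales_log.ldf': 0.0}
```

Recording the per-file growth each day, rather than only the raw size, makes an unusual jump stand out when someone looks back over the record.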
- Space available / Free space on drives
Other processes may take up space on our servers, and their files may reside on the same drives as your data files. If this is the case, you need to monitor the free space to ensure that your databases don't run up against a wall. This has happened to me on simple database servers, and the results are often wild and unpredictable. This task may not be relevant on all of your systems or database servers. However, I think it's worth noting and thinking about, if only to rule it out as a necessary task. If it is necessary, add it to your on-call duties.
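For the free space check, Python's standard library can report drive usage directly through `shutil.disk_usage`. The sketch below is one possible shape for the check; the 15 percent threshold and the report layout are assumptions chosen for illustration, and it has to run on the server whose drives you care about (or against paths mounted from it).

```python
import shutil

def free_space_report(paths, min_free_fraction=0.15):
    """Report free space for each path and flag drives below the threshold."""
    report = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        free_fraction = usage.free / usage.total if usage.total else 0.0
        report.append({
            "path": path,
            "free_gb": round(usage.free / 1024**3, 1),
            "free_pct": round(free_fraction * 100, 1),
            "low": free_fraction < min_free_fraction,
        })
    return report

# Example: check the current drive; on a database server you might pass
# the drive letters or mount points that hold your data and log files.
for row in free_space_report(["."]):
    print(row)
```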
- Replication health check
If you have replication executing on your systems, how do you monitor it? How do you know that it is functioning properly? What about latency? Can you tell what latency is during peak times, compared to non-peak times? You may have third party replication or native replication. Determine the best way to monitor it, and document it. Gathering the data associated with it and creating a baseline will help solve future issues as well.
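However you measure latency, the peak versus non-peak comparison comes down to grouping samples by time of day. The sketch below assumes you have already collected `(hour_of_day, latency_seconds)` pairs from whatever monitoring your replication solution offers; the sample values and the 9-to-5 definition of peak hours are made up for the example.

```python
from statistics import mean

def latency_baseline(samples, peak_hours=range(9, 17)):
    """Average latency for peak and off-peak hours.

    samples: iterable of (hour_of_day, latency_seconds) pairs.
    """
    peak = [lat for hour, lat in samples if hour in peak_hours]
    off_peak = [lat for hour, lat in samples if hour not in peak_hours]
    return {
        "peak_avg": mean(peak) if peak else None,
        "off_peak_avg": mean(off_peak) if off_peak else None,
    }

# Hypothetical measurements: two overnight, two during business hours.
samples = [(2, 1.0), (3, 3.0), (10, 8.0), (14, 12.0)]
print(latency_baseline(samples))  # {'peak_avg': 10.0, 'off_peak_avg': 2.0}
```

Keeping these two averages in the daily record gives you the baseline to compare against when latency starts to look wrong.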
- Scan SQL Server Logs
Something that is often missed is simply looking into the SQL Server logs. Make it a habit to scan these logs on a regular basis and document what you see; you will soon become more knowledgeable about the logs and what they can teach you.
- Other Notes of Interest
During your on-call rotation, you may encounter odd things that need to be noted. Make sure you comment on these and document them. If you resolve them, document this as well. Other individuals will greatly appreciate your notes and observations of these odd occurrences. If an oddity persists, you may want to add a check for it to the task list above.
- Specific Needs
Since your shop will have specific needs, you will need to come up with more of these tasks. You may have items that do not appear on this list that you need to add to your on-call duties. Share these with the rest of us, as well as with your fellow DBAs at your shop.
If we can take the time, when we are not headlong into the problems of the day, we can better gather our wits about us and devise solutions to make our jobs easier, more automated, and more successful. This is an important hump to get over, so that you are not fire-fighting all the time, but have a plan of action to solve issues as they arise. Keeping a record of these tasks and their results is a sure-fire way to see patterns, and a way to fix those pesky problems that always seem to get placed on the back burner. By creating metrics to measure yourself by, you can spend more time on the important tasks, and not just fix things as they appear. There's nothing like 'knowing' that your systems are healthy, and being able to prove it.