At first, this was going to be a technical article about an index maintenance horror story. I received an off-hours page because the index maintenance job was running for a long time on a big table. I have an agent job setup to use Ola's scripts, and I made sure to set the thresholds for fragmentation per the examples on the Ola website. All was seeming very good in terms of the agent job setup, but I found a few other problems with standards.
One big problem was there was no data retention policy for this large table. The same goes for almost all our tables, big or small; they have no data retention policy. Data collects in there forever or until we run out of disk space. It was this table's lucky night because it finally had enough fragmentation that it was getting that well-deserved reorg. As it turns out, that reorg taxed the log file, which made it grow and grow.
All the while, the hourly log backup couldn't complete, so the disk was filling and filling. Then came the disk alerts. The warning only comes via email, so I didn't see it immediately since it was during off-hours. The disk blew past the warning, which happens at 89%. The disk was at 96% full one hour later, which then triggered the OpsGenie alert. For us, a 95% full alert triggers a page, day or night.
I chose to kill the log backup to avoid more off-hours disk space alerts. My coworker shrunk the log file a bit. Another coworker (even worse, it was the middle of the night for him), added some disk space. It turns out that he added the additional disk space incorrectly. He created an extended disk that didn't follow our Windows disk standard configuration.
The next day we had to move the log files to a new, properly configured, much larger, 1TB sized disk. That was a lot of fun because we also needed the new disk to be the same letter as the old disk. We wanted and needed to follow our own standards. Anyway, it's a fairly simple script to get all that magic done, but there is downtime involved. We could have avoided downtime if we used our disk standards.
This is when I realized all these problems (and their solutions) were about setting and following standards:
- Data retention - We wouldn't have so many issues with index maintenance if we didn't have such large tables. While it's not always possible to keep tables small, this table contains raw data. The raw data is processed into a more permanent table. This raw table definitely needs a data retention policy that ensures far fewer rows of data are stored.
- Adding disk space - A simple document about how to add disk space depending on the OS would be very helpful. Of course, people need to know it exists and where to find it.
- Specific lettered drives - We needed to use the standard lettering we always use for log files. Even if we had to move them to a new drive and re-letter that drive after the move. Even if this caused more work for us at the time. We value our standards.
- Ola Agent jobs - We do proper maintenance on every server with these jobs. This includes index maintenance and integrity checks. Ola Agent jobs also handle backups. Again, following our standards.
- Database monitoring - Using Idera, we have standards and thresholds for alerting. If an alert is critical, it sends an alert to OpsGenie, which pages the on-call DBA.
Do you have standards? Do you follow them? We can all fall prey to "oh it doesn't matter because I'm tired today and don't feel like it. Does it really matter if everything is aligned and done right?" It does matter and your future self will thank you.
Another Standards Example
At a previous company, I created a dashboard to show if it a database was online or not. There were far too many times that I would get a middle-of-the-night call asking me if the database was online. The caller was having some issues, and they weren't sure if the database was causing it. This is why I created a dashboard that would be available to consult for these simple kinds of questions.
When someone new started on the overnight shift, I would get calls again asking if the database is online or not. The conversation would go something like this: Did you check the dashboard? No. Well, check it.
We finally got the database dashboard on a monitor in the network operations center where everyone could see the information at a glance. That helped to stop the random calls during the night. The standard became this dashboard is on the monitor and everyone uses it.
You Need Standards
It's all an issue of standards. You need to make sure that:
- you have standards
- everyone understands why
- standards documentation is easy to find
- following standards is mandatory
It saves a lot of grief, heartache, and misery, especially when it prevents off-hours pages.
If nothing else, can you convince people to value their off-hour time? Because having standards in place will help ensure that off-hours time is more protected from work intrusions. We can all agree that we don't want to be disturbed while we are sleeping. This is why standards matter.
Sort of a Solution
If you are wondering, we still haven't solved the problem of the index maintenance job filling up the log drive. For now, the solution is making the log drive much bigger, 1 TB in size. I know that's not a good long-term solution, but it can be difficult to get the right decision-makers to agree on a data retention policy. There's only so much we as IT operational people can do.
Keep this in mind as you create and follow your standards: treat yourself with compassion and know you are doing the best you can.
P.S. here is the script to move the log files:
/* alter database log drive location to O drive */ USE master ALTER DATABASE dbname MODIFY FILE ( NAME = dbname_log, FILENAME = 'O:\MSSQL_LOG\dbname_log.ldf'); /* make sure no active connections to db with sp_whoisactive */ sp_whoisactive /* set the db offline */ ALTER DATABASE dbname SET OFFLINE WITH ROLLBACK IMMEDIATE; /* move the files to O drive */ /* set the db online */ ALTER DATABASE dbname SET ONLINE; /* alter database log drive location to G drive */ USE master ALTER DATABASE dbname MODIFY FILE ( NAME = dbname_log, FILENAME = 'G:\MSSQL_LOG\dbname_log.ldf'); /* make sure no active connections to db with sp_whoisactive */ sp_whoisactive /* set the db offline */ ALTER DATABASE dbname SET OFFLINE WITH ROLLBACK IMMEDIATE; /* rename G drive to another unused letter rename O drive to G */ /* set the db online */ ALTER DATABASE dbname SET ONLINE;