The Enterprise DBA Mindset as a Practical Problem-solving Approach

To keep the demands of the job under control, any DBA needs to automate as many of the suitable tasks their role demands as possible. What makes a task suitable? How do you judge whether automation is worthwhile? Can we make a 'managed', consistent decision? Joshua Feierman explains the practicalities with a real example.

In a previous article that I wrote here at Simple-Talk, I argued that automation was a crucial part of adopting what I termed the “Enterprise DBA Mindset”. As the series was well received, I thought it might be useful to give a fully thought-out example of the process of automation. We won’t go into great technical depth here; instead, I want to explain how I apply the procedural framework to evaluate whether automation will be successful and worthwhile, and to describe the journey from the initial idea through to a complete, fully automated solution.

First, let me recap the criteria that indicate whether a process is a suitable target for automation.

  1. How much time do we spend doing the tasks?
  2. How complex would it be to automate the process?
  3. How complex is the logic behind the task?
  4. Are existing interfaces available?

Now we’ll look at the detail of the process we will be automating to see whether it fits these criteria.

Defining the Problem

In our principal customer-facing server environment at my work, we use AlwaysOn Availability Groups to ensure that the databases give reliable, continuous service. Groups are organized by geographic region; for example, we have one group for U.S. East coast customers, another for those on the West coast, and a third for those in Europe. We configured the groups in this way so that planned failovers can be done during off-peak hours for each particular group. For example, planned failovers for the U.S. West customers might occur sometime in the very early morning hours of Pacific Standard Time, whereas operations for the Europe customers might occur at 1 A.M. Greenwich Mean Time. This ensures that planned operations have as slight an impact on customers as possible. It’s a method that serves us well, with one notable problem: someone has to be awake in order to actually complete the failover. I’m not sure about you, but for me, being out of bed at 3 A.M. is not an appealing prospect!

As well as completing our regular monthly planned system-update activities, we must carefully coordinate ad-hoc group failovers over the course of around a week, usually for applying patches. All of these must be completed during narrow time windows and in a specific order, based on which servers are patched on which nights. As a result, the manual coordination required is rather intense.

After going through this exercise several times, I said to myself, “Self! This is foolish. Surely there is a way to use automation to make this process less painful. After all, you can only drink so many cups of coffee before you start arguing aggressively with hat-stands.” So, I sat down to consider how I might make this process as manageable as possible.

First, I asked myself ‘how does this process stack up against the criteria?’

  • How much time do we spend doing the targeted tasks?
    The actual failovers themselves were quite fast; however, the fact that they often had to be done outside working hours certainly made them more inconvenient. In fact, pondering this question made me realize that an update to my original list of criteria is required. Rather than considering just the amount of time a particular task takes, we should consider the overall inconvenience entailed in completing the work. It’s one thing if you have to run a couple of commands in the middle of the day once a month, but if the work must be done at the proverbial zero-dark-thirty, that’s another thing entirely.
  • How complex would it be to automate the process?
    The actual commands to cause a failover to occur are fairly simple and well documented. The hardest part would probably be laying out all the rules to handle all foreseeable eventualities, as well as the actual coding of the process. But still, nothing too burdensome.
  • How complex is the logic behind the task?
    This is by far the most difficult piece of this process, given the complexity and the problems we might encounter. For example, what do we do if the replica to which we want to fail over is an asynchronous one? What if the replica isn’t fully synchronized? Do we abort immediately, or do we wait? I’ll document these fully in the next section, but for now let’s say that this took a good few hours of hard thinking.
  • Are existing interfaces available?
    Fortunately the interfaces for working with availability groups are numerous and well documented. We could utilize either straight Transact-SQL (see the ALTER AVAILABILITY GROUP command), or some of the many classes exposed in the SQL Server Management Objects library (see here for an example).

After going through these points, it was clear to me that, despite some challenges, this automation project was certainly worth pursuing. The next step was to document and standardize the rules behind these operations, since these are pre-requisites to any proper automation effort.

Defining the Rules of the Game

First I spelled out the existing manual process. Here are the discrete steps I came up with.

  1. At the appointed time (again, the window is determined by defined maintenance periods), log on to the server to which the availability group will be failed over.
  2. If the replica was in asynchronous mode, connect to the current primary and set that replica to synchronous mode.
  3. If the replica was not synchronized (either due to some transient condition or because it was previously in asynchronous mode), wait a defined period to see if it synchronizes before proceeding. If it does not synchronize, abort the operation.
  4. Perform the failover to the new primary server.
  5. If in step 2 the new primary was found in asynchronous mode, we need to ensure that the new secondary (the former primary) is set to asynchronous mode as well. After all, there is probably a good reason why this is the case, such as the servers being geographically separated, or workload characteristics requiring a permanent asynchronous configuration.
  6. If everything processes through to completion without problems, no notification is required. However, if something goes wrong, a mechanism must exist to ensure that relevant personnel are alerted to the failure. We don’t want our availability groups sitting in an odd state for hours on end after all!

So far, so good. In addition to these rules, I wanted some additional contingencies in place given that computers tend to do only what they are told, and I wanted to err on the side of caution (that is, I’d much rather a planned failover not take place than accidentally process one, or worse, process one that results in data loss).

  • No matter what happens, we must be absolutely sure that the state of the replicas remains functionally identical to their state when the process started, even if errors occur. For example, let’s say we set the availability group to synchronous mode, but for some reason the replica does not synchronize properly and our wait period elapses. We must be sure to set the secondary replica back to asynchronous mode, so that it is returned to its original configuration.
  • We need to allow for user defined tasks to take place after a failover occurs. For example, in our environment we needed to mark a row in a table on the new primary server, so that our hosting provider’s automated alert mechanism was silenced.

Finally, I needed to decide if my target was to fully automate this process, or if partial automation was sufficient. As discussed in my previous article, there are often cases where full automation simply isn’t possible or worth the effort. However, in this case I saw no reason why the process couldn’t be handled entirely in the context of automation. The rules were clear, there were numerous ways in which to interact with the involved systems, and frankly, there was little value in automation if I still had to be up in the wee hours of the morning completing failovers. Full automation was therefore the goal.

With all the relevant data collected, it was time to decide what tool (or tools) to use to automate this process. Let’s look at a few possibilities.

Picking Our Tools

In my previous article, I gave a number of examples of tools that could be used for automating our work in SQL Server. Here I’m going to go through several of them and walk through my evaluation process when selecting the most appropriate technology.

Transact-SQL

As noted in the previous article, Transact-SQL has a number of distinct advantages. For example, DBAs are likely to be extremely familiar with its operation. In addition, with T-SQL being the native language of SQL Server, there is very little that cannot be done with it. For example, all of the steps I listed in the previous section can be accomplished with pure T-SQL, mostly using the ALTER AVAILABILITY GROUP command. For checking on the state of the availability groups, we can use several of the many dynamic management views available. I could certainly accomplish my “post” user-defined step requirement as well, by keeping a list of stored procedures (accepting a standard set of input parameters) to execute in a table, and calling a “controller” procedure to loop over and run everything contained there at the appropriate time. Overall, this seemed like a pretty good fit.
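
To give a flavour of the core commands, here is a minimal sketch with placeholder group and server names. I’ve wrapped the statements in Invoke-Sqlcmd only to keep this article’s examples in a single language; notice that each statement has to be run against a particular instance, which is exactly the wrinkle discussed next.

    # Placeholder names throughout; requires the SqlServer (or SQLPS) module for Invoke-Sqlcmd
    $ag = 'CustomerAG-USWest'

    # On the current primary: flip the target replica to synchronous commit
    Invoke-Sqlcmd -ServerInstance 'PRIMARY-SQL01' -Query "
        ALTER AVAILABILITY GROUP [$ag] MODIFY REPLICA ON N'SECONDARY-SQL02'
        WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);"

    # Check database synchronization state through the DMVs
    Invoke-Sqlcmd -ServerInstance 'PRIMARY-SQL01' -Query "
        SELECT database_id, synchronization_state_desc
        FROM sys.dm_hadr_database_replica_states;"

    # On the target secondary: perform the failover itself
    Invoke-Sqlcmd -ServerInstance 'SECONDARY-SQL02' -Query "ALTER AVAILABILITY GROUP [$ag] FAILOVER;"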

As I look down my list of steps, however, a problem emerges. In several of the steps, the phrase “connect to” appears. This poses a problem, because in pure T-SQL there is no way to execute a query on some other arbitrary server. There are some hacky workarounds for this; however, all would require some kind of setup on each of the servers in the availability group topology. For example, if I wanted to use linked servers, I would need to ensure that one was set up to point to the other server in the primary / secondary pairing (this assumes a simplistic one-to-one setup in the topology, where there are no more than two replicas per availability group). Sure, my automation routine could be smart enough to create the linked server if it didn’t already exist, but still, that’s a lot of extra work. So for the moment, T-SQL doesn’t look to be a fit, at least by itself.

Still, there was one place where leveraging SQL Server itself was a clear fit: the storage and maintenance of the metadata around the failover tasks themselves. As previously noted, we needed to define when the failovers would occur, as well as which replica each group would fail over to. This information was easily persisted in a table in our utility database, which the rest of the automation framework can access.
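
As an illustration of how that might be consumed, the framework could poll for due tasks with something like the sketch below; the server, database, table, and column names here are hypothetical rather than our actual schema.

    # Hypothetical utility database and table; adjust the names to suit
    $pending = Invoke-Sqlcmd -ServerInstance 'UTILITY-SQL01' -Database 'DBAUtility' -Query "
        SELECT AvailabilityGroupName, TargetServer, ScheduledTimeUtc
        FROM dbo.FailoverTask
        WHERE Processed = 0
          AND ScheduledTimeUtc <= SYSUTCDATETIME();"

    foreach ($task in $pending) {
        # each due task would be handed off to the failover routine here
        Write-Output "Due: fail $($task.AvailabilityGroupName) over to $($task.TargetServer)"
    }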

PowerShell

PowerShell is a very versatile tool for sure. As I wrote previously, “You can interact with the operating system through WMI or other interfaces, execute T-SQL or use built in .NET assemblies like SMO to perform work, and you can blend them all together seamlessly.” Speaking of SMO, it has a number of classes that allow you to manage availability groups, such as the AvailabilityGroup and AvailabilityReplica classes. In addition, you can easily connect to remote SQL instances, so it meets the aforementioned requirement of establishing connections with other servers in the availability group. For example, to connect to a primary server and set the secondary as synchronous, I could simply use the following snippet.
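
(Something along these lines; the server, group, and replica names are placeholders, and I’m assuming the SqlServer module is available to supply the SMO types.)

    # Load SMO via the SqlServer module
    Import-Module SqlServer

    # Connect to the current primary and locate the target replica (placeholder names)
    $primary = New-Object Microsoft.SqlServer.Management.Smo.Server 'PRIMARY-SQL01'
    $replica = $primary.AvailabilityGroups['CustomerAG-USWest'].AvailabilityReplicas['SECONDARY-SQL02']

    # Switch the replica to synchronous commit and push the change to the server
    $replica.AvailabilityMode = [Microsoft.SqlServer.Management.Smo.AvailabilityReplicaAvailabilityMode]::SynchronousCommit
    $replica.Alter()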

In addition, PowerShell supports dynamic loading and executing of script files (through something like the Invoke-Command cmdlet, or by using the special “call” operator, “&”). This means it would be easy for me to execute an arbitrary list of script files stored somewhere on the filesystem, passing in some outputs from the main automation framework. For example, this snippet will loop over all the files in a subdirectory and execute them, while passing in a hash object of parameters (which could be generated from the pipeline output of the main framework).
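
(Along these lines; the folder path and parameter names are placeholders, and each user-defined script is assumed to declare matching param() names.)

    # Parameters the main framework would pass along (placeholder values)
    $scriptArgs = @{
        AvailabilityGroup = 'CustomerAG-USWest'
        NewPrimary        = 'SECONDARY-SQL02'
    }

    # Run every user-defined post-failover script in the folder, splatting the parameters in
    Get-ChildItem -Path 'C:\Automation\PostFailover' -Filter '*.ps1' |
        Sort-Object Name |
        ForEach-Object { & $_.FullName @scriptArgs }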

For these reasons, PowerShell seems like the ideal place to house the majority of the automation. However, it does require some kind of trigger mechanism in order to run. For example, I wanted my automation to check at pre-defined intervals (say every 15 minutes) to see if any failover tasks are ready to be processed. Fortunately there is a logical choice for this task, namely SQL Server Agent.

SQL Server Agent

For this automation effort, I decided to use SQL Agent for two main purposes: first, as a job execution mechanism, and second, as a means to generate alerts should something go wrong during the planned failover operation. For scheduling purposes, it’s easy to set up a schedule that executes every fifteen minutes. For alerts, we can utilize the built-in Operator notification feature.
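
As a rough sketch (the job, operator, and schedule names below are placeholders, and both the job and the operator are assumed to already exist), the schedule and the failure notification can be wired up through the msdb procedures:

    # Run the job every 15 minutes, all day, and e-mail the DBA operator on failure
    Invoke-Sqlcmd -ServerInstance 'UTILITY-SQL01' -Database 'msdb' -Query "
        EXEC dbo.sp_add_jobschedule
            @job_name = N'Planned AG Failover',
            @name = N'Every 15 minutes',
            @freq_type = 4,              -- daily
            @freq_interval = 1,
            @freq_subday_type = 4,       -- units of minutes
            @freq_subday_interval = 15;

        EXEC dbo.sp_update_job
            @job_name = N'Planned AG Failover',
            @notify_level_email = 2,     -- on failure
            @notify_email_operator_name = N'DBA Team';"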

In terms of the actual execution, I initially wanted to use SQL Agent’s built-in PowerShell job-step type. However, on closer inspection, I found that there are numerous limitations to this approach, including the fact that the step runs in a slimmed-down “mini-shell” version of PowerShell 2.0. After getting frustrated with this for some time, I went with a cleaner approach, namely calling the “powershell.exe” process from within a CmdExec step (which executes a command-line process from within SQL Agent). This also allowed me to configure a proxy account under which the job executes. This is important since, in order to execute the necessary steps to generate a failover, the job must run as an account that has sysadmin-level access to the remote instances, and granting this access to the SQL Agent service account directly would likely not be a good idea.
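
The command text for the CmdExec step then looks something like the line below (the script path is a placeholder for wherever the framework lives):

    powershell.exe -NoProfile -ExecutionPolicy Bypass -File "C:\Automation\Invoke-PlannedFailover.ps1"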

By knitting these various tools together, I was able to create a robust and fully automated solution to the problem at hand. No more waking up in the dark hours of the morning, no more fumbling through a checklist. Automation, for the win!

Conclusion

At the beginning of this article, I laid out a problem to illustrate in detail how we can make a ‘managed’ decision, using the framework from my previous article, to determine whether an aspect of our work as SQL Server DBAs is worth automating. We went through the process step by step, starting with carefully defining our requirements and documenting existing processes. We then ensured that this effort was worth our precious time, and quickly determined that although the time spent on the work was minimal, the inconvenience factor trumped all other concerns. Finally, we evaluated a number of possible tools to use in creating our automation framework. While we found that no one tool could meet all our needs, we were able to craft a solution by connecting several different pieces together. In doing so, we’ve not only ensured that this process will be done in a repeatable and predictable fashion, we’ve also greatly increased our chances of a good night’s sleep. Those two outcomes alone are well worth the time spent.

If the framework described in this article sounds like it would be of use to you, I’ve made it freely available here. It’s an evolving thing, as most automation is, but it has helped me tremendously, and I hope it might do the same for you. If you have any comments or suggestions to make it better, there is a link on the page for any submissions. Enjoy!