Today’s SQL University post will highlight the need to have a methodology to address issues that we as IT professionals may encounter in the course of our day. We will start off by looking at why we need to have a plan for every situation, and then we will dig into a methodology that I have developed by stealing bits and pieces of other people’s approaches over my career.
Most IT professionals lean on knowledge, instinct or some mix of the two to solve the problems that they encounter. Knowledge and instinct are powerful tools that develop with experience. In the case of knowledge, we can rely on the experience of those who are gracious enough to share it via books, blogs, etc. to make us all stronger. Instinct is a much harder tool to develop. There is no way to do a Bing search to see if you have experienced a similar situation before and how you reacted to it.
What happens if we encounter a completely new problem that nobody has ever experienced before? Instinct and knowledge both require experience to move forward but there is none to draw from. At this point there is a real chance of getting caught up in what I like to call the Reaction Cycle. The Reaction Cycle is the technological quicksand that waits for us outside of our knowledge or when we are misled by our instincts. We apply an opposite force without thought, iteratively making things worse.
Think of the Reaction Cycle in terms of a brand new junior DBA. One day, while the rest of the team is off at lunch, they start getting alerts from the production sales database. From the alerts it appears that the server briefly lost its connection to the SAN and now there is some database corruption. The junior DBA, seeing all of the disk errors, decides the box needs a reboot while most of the company is still at lunch and proceeds to kick it over. When the server comes back up there are multiple inaccessible databases. By this time help has arrived: the rest of the team has returned from lunch. They quickly decide that the best option they have is to restore the databases. They will lose some data, but since this happened over lunch it is not the end of the world. They all agree that this is much faster than trying to work through the database corruption; after all, they need to react quickly before too much money is lost. As they are restoring the sales database they discover they cannot apply any transaction logs taken after 4 AM. A little further digging reveals that the junior DBA reacted to an alert about the log filling up last night by truncating the database log, breaking the log chain. At this point the production database is gone and they have to go with what they have. Eight hours of sales are now gone due to the Reaction Cycle.
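Part of what made this scenario unrecoverable was that nobody verified the log backup chain before restoring. A quick check against the backup history in msdb would have surfaced the break; a minimal sketch (the Sales database name is a made-up stand-in):

```sql
-- Check the log backup chain for a database before relying on
-- point-in-time recovery. Consecutive log backups should have a
-- matching last_lsn/first_lsn; a gap means the chain is broken
-- (for example, by a log truncation) and later logs cannot be applied.
SELECT  b.database_name,
        b.type,               -- D = full, L = log
        b.backup_start_date,
        b.first_lsn,
        b.last_lsn
FROM    msdb.dbo.backupset AS b
WHERE   b.database_name = N'Sales'   -- hypothetical database name
ORDER BY b.backup_start_date;
```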
Shrinking databases might be said to kill kittens but reacting can definitely kill a career. When something goes wrong, your first step is to collect information. What is wrong? What do the logs say? If you have an error message then what does Bing say about it?
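For the "what do the logs say?" step, the SQL Server error log can be searched directly from a query window; a minimal sketch:

```sql
-- Read the current SQL Server error log, filtered to lines containing
-- the word "error". Parameters: log file number (0 = current),
-- log type (1 = SQL Server, 2 = SQL Agent), then up to two search strings.
EXEC sys.xp_readerrorlog 0, 1, N'error', NULL;
```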
The next step is to process the information that was gathered. Based on the facts and data collected you can begin to formulate a response to what is going on. Sometimes this processing sends you back to collect more data; for example, if you decide you need to restore, you may want to try the restore out on a development server first to make sure you have all the steps down.
Finally, after all the collecting and processing it is time to respond. What we are doing here is making a carefully planned move based upon the collection and processing we have done in the previous steps. We almost always know at least our next move, if not the next couple of moves, and we have thoughts on what may go wrong and how we would respond to that.
After responding we start over with the collection phase and move into the processing phase. Did we get the results we wanted? Are things better or worse? We continue to cycle through this process until we find no response is necessary.
With that, it is time to unveil the Collect, Process, Respond methodology for troubleshooting. In simple terms, we want to gather all necessary data to develop a plan then execute it. Feel free to print the image below and hang it on your cube wall to remind you to use the methodology.
Now that we have covered why it is important to have a methodical approach to problem solving and taken a high level look at the Collect, Process, Respond methodology it is time to start digging into the individual phases.
The first phase we are going to dig into is the Collect phase. The most important thing about the Collect phase is that only one person does the data collection per system. With multiple people collecting data from the same system the risk of reacting to monitoring-induced symptoms goes up exponentially. A good example of this is someone firing up a Profiler trace on a server to collect data about an issue while someone else is looking at active user sessions. The person monitoring user sessions may see sessions start to pile up but never trace the waits back to their source before the trace is stopped. This could lead to the piled-up sessions being misattributed as a symptom of the problem rather than a byproduct of the troubleshooting itself. If it is truly necessary to have more than one person collect data from a system then one person has to call the shots and everyone has to communicate well.
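For the person watching active user sessions, a DMV query like the following (a sketch, not a prescription) shows who is connected and what they are waiting on:

```sql
-- Active user sessions and their current waits. Joining to
-- sys.dm_exec_requests also exposes any blocking session.
SELECT  s.session_id,
        s.login_name,
        s.status,
        r.wait_type,
        r.wait_time,
        r.blocking_session_id
FROM    sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
        ON r.session_id = s.session_id
WHERE   s.is_user_process = 1
ORDER BY r.wait_time DESC;
```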
The Collect phase is where we start scoping the issue or figuring out how wide to cast the net. I like to refer to it as ruling things in. We want to err on the side of ruling things in because it is easy for knowledgeable people to rule them out later. Remember our goal here is to figure out what is wrong.
So how do we go about actually collecting data? I like to work through a series of questions.
The starting point for the Collect phase is to look at what the symptoms are. This is a key question because it helps us figure out what to collect. The symptoms may even tell us exactly what is wrong. If users are reporting a SQL Server error message about an account being locked out, then it becomes easy to know what is going on and we can jump straight to the Process phase.
Once we know the symptoms we want to look at what locations are involved. This is very important because it helps define the scope of further collection activities. If all users in a particular location are having issues then we would want to focus on what is unique to that location, but if all users in the company are having issues then we would want to look at what is common to all.
Based on the answers to the previous questions it is now time to start looking at what systems might be involved. The goal of this step is to find all of the moving parts that make up whatever activity it is that is failing. This step often includes web servers, network load balancers and SANs so it is important to start bringing in other teams at this point.
The last of the generalist tasks to get through before really digging in with the tools that we are comfortable with is to review the change control history. Many organizations have a calendar on a wiki or on SharePoint; sometimes there is even a log at the NOC in organizations large enough to have one. Worst case, talk to the primary on-calls or the managers of the systems that may be involved to find out what has changed lately. Spend some time here; almost everything that goes wrong is the result of a change that someone implemented.
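When no formal change log exists, the default trace can stand in as one for schema changes; a hedged sketch:

```sql
-- Pull recent object create/alter/drop events from the default trace,
-- a lightweight way to see "what changed lately" on the server itself.
DECLARE @path nvarchar(260);
SELECT  @path = path
FROM    sys.traces
WHERE   is_default = 1;

SELECT  t.StartTime,
        t.DatabaseName,
        t.ObjectName,
        t.LoginName,
        te.name AS event_name
FROM    sys.fn_trace_gettable(@path, DEFAULT) AS t
JOIN    sys.trace_events AS te
        ON te.trace_event_id = t.EventClass
WHERE   te.name IN (N'Object:Created', N'Object:Altered', N'Object:Deleted')
ORDER BY t.StartTime DESC;
```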
Finally, we get down to the part that DBAs love: we get to bust out the magical tools that we are so used to using. I always save this for last because it is easy to get lost in the data, especially if it feels like it is leading us somewhere. This is where you go trawling on any involved servers, looking for data that supports what you have seen earlier or anything that does not match your established baselines; in the absence of formal baselines, look for anything that does not look the way you are used to seeing it.
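A common starting point for this kind of trawl is the cumulative wait statistics, compared against your baseline; a minimal example (the excluded benign wait types are illustrative, not an exhaustive filter list):

```sql
-- Top waits since the last service restart. Compare against a baseline
-- to spot anything that does not look the way you are used to seeing it.
SELECT TOP (10)
        wait_type,
        wait_time_ms,
        waiting_tasks_count,
        signal_wait_time_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                          N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```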
There are a number of third-party tools to automate information gathering, and Microsoft ships its own in the Management Data Warehouse and the Performance Dashboard reports. The great thing about these tools is that they tell you at a glance what the important metrics are and usually have some sort of indicator when things are bad.
Remember to be careful not to react while collecting data, the idea here is to gather as much useful information as possible.
At this point we have collected all of the information we think we need and it is time to move into the Process phase. This is most people’s favorite phase because it feels the most like solving the problem, but it is important to remember that we are not pushing any buttons or pulling any levers yet. This phase is all about making a plan to address the issue. You want to come away from this phase with an action plan, an expected result and a plan to roll back whatever action you take in case it makes things worse.
As we move from the Collect phase to the Process phase we need to ask a few questions.
There is a reason that I have listed these questions first: they are the things that let us short-circuit out of the Collect phase. It pays to keep them in mind when working through the Collect phase to avoid prolonging outages with unnecessary analysis. A good example of a short-circuit out of the Collect phase would be a symptom consisting of an error message stating that a SQL login was locked out. The problem is clear and the solution is simple and low risk. The follow-up monitoring is simply to make sure that the account does not lock out again. The trick here is to make sure that your intentions are in the right place. Be especially aware of decisions made to make you look good or to avoid looking bad.
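For the locked-login example, the fix itself is tiny; both forms below are sketches with a hypothetical login name and password:

```sql
-- Unlock a locked-out SQL login. The supported route requires
-- re-supplying the password:
ALTER LOGIN [AppUser] WITH PASSWORD = N'SamePassword!1' UNLOCK;

-- A common workaround when the password is not at hand is to briefly
-- disable the lockout policy check, then turn it back on:
ALTER LOGIN [AppUser] WITH CHECK_POLICY = OFF;
ALTER LOGIN [AppUser] WITH CHECK_POLICY = ON;
```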
Next we have to look at the data we collected to see if any patterns can be identified. As patterns emerge, theories will develop. It is important to create at least one test for each theory. I say at least one because we may have multiple moving parts, and each moving part should get a test. Say that a Reporting Services report hosted in a client application is slow: we would want to first run the report by calling it directly in a web browser. If it is slow in the browser then we would want to start looking at the parts that make it up, eventually pulling out individual queries and running them in a query window. We will eventually get to what the problem is.
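Pulling out an individual query and measuring it in isolation might look like this (the table, columns and ten-year filter are hypothetical stand-ins for whatever the report actually runs):

```sql
-- Measure one suspect query from the report on its own.
-- STATISTICS IO/TIME report reads and elapsed time to the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT  o.OrderID, o.OrderDate, o.TotalDue
FROM    Sales.Orders AS o               -- hypothetical table
WHERE   o.CustomerID = 42
  AND   o.OrderDate >= DATEADD(YEAR, -10, SYSDATETIME());

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```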
A clearly defined problem almost always indicates what corrective action is necessary. To keep with the above example we may say the report is slow because the query to get customer orders for the last 10 years is missing an index. There are times when no matter how well defined the problem is the answer is not clear. In those cases having a clearly defined problem is invaluable in enlisting external help whether they are from another team or from a vendor. It is the only way to make sure you are asking for the right kind of help.
Once a list of possible actions is developed it is important to stack-rank the possible solutions by likelihood of success. The goal is to try the action that is most likely to resolve the issue while exposing you to the smallest amount of risk. I always recommend trying things out in another environment first. It helps get the steps and timing down, and it exposes weaknesses in the plan so that the plan can be properly ranked. Think about this in terms of a situation where you have unrecoverable database corruption and short-circuit all the way to this step: would you immediately start a point-in-time restore to production, or would you try it out on another server first to make sure that your backup is good and that you have a bullet-proof script?
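A rehearsal of that point-in-time restore on a test server might look like the following; every name, path and time here is hypothetical:

```sql
-- Restore the full backup without recovering, so log backups can follow.
RESTORE DATABASE SalesTest
    FROM DISK = N'\\backups\Sales_full.bak'
    WITH NORECOVERY, REPLACE,
         MOVE N'Sales'     TO N'D:\Data\SalesTest.mdf',
         MOVE N'Sales_log' TO N'L:\Log\SalesTest.ldf';

-- Roll the log forward to just before the failure, then recover.
RESTORE LOG SalesTest
    FROM DISK = N'\\backups\Sales_log.trn'
    WITH STOPAT = N'2011-04-14T11:45:00', RECOVERY;
```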
The last thing to do before moving on to the Respond phase is to define how to measure whether the change helped or made things worse. I like to define a measure for each benefit and each risk that I identified while ranking the possible actions. I might say that adding this index will reduce reads from 10,000 to 6 or that adding this index may cause inserts into the table to take longer. I may also say that if the index does make inserts slower and page splits are noticeably higher then I may alter the fill factor of the index. It really pays to define success and failure here to make it clear when to stay, when to rollback and when to tweak the implementation.
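Continuing the index example, the corrective action and its tweak might be sketched like this (object names and the fill factor value are illustrative):

```sql
-- The corrective action: an index to cover the slow customer-orders query.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
    ON Sales.Orders (CustomerID, OrderDate)
    INCLUDE (TotalDue);

-- If the follow-up measure shows inserts slowing and page splits climbing,
-- rebuild with a lower fill factor to leave free space on each page.
ALTER INDEX IX_Orders_CustomerID_OrderDate
    ON Sales.Orders
    REBUILD WITH (FILLFACTOR = 90);
```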
More than anything, you really need to make sure you have thought things out and are doing what is right for the situation you are facing and not doing something like rebooting because that is what you always do first.
The first step in the Respond phase is to communicate your intentions. Depending on the type of change you might just tell the rest of your team, or you may have to go through change control to get approval to do something. The more involved or risky the action you are going to take, the more documentation you should have and the more people you should involve to make sure you are not missing anything. Think of it like pool, where the shot doesn’t count unless you call it.
Next we make the change. To make the change we follow a written plan that we have hopefully rehearsed. Granted, unlocking a user’s account is something you have done 100 times, so you can say that it is well rehearsed, but how many times have you rebuilt the passive node of a production database cluster? Use your best judgment here, erring on the side of being conservative.
A single person should make the change so that the plan can be followed step by step. If something is missing from the plan then it should be added to the plan in case these steps need to be followed again or reversed to roll back the change.
After all that, it is time to go back and start collecting data again. The issue is closed when there are no more symptoms to be addressed and no more fixes to be deployed.
So there you have it, a flexible, scalable methodology for solving just about any problem that any of us might face in the IT world. Use it well.