The Troubleshooting List

,

Most DBA's grow into their jobs and part of that process is the baptism by

fire that occurs when things go wrong, especially on the performance side of

things. I started to make this performance centric, but so many of the steps and

thoughts are common with other problems that I tried to make it a little more

generic. In this article I've listed some tips and thoughts about what to do -

and why - when the sudden performance dip strikes or other serious problem

occurs. It's probably not a perfect list, but the key is to build a process that

keeps you on track.

#1 - It's Usually Blocking

When I get a call that 'everyone is locked up', 'everyone is slow', etc, the

first thing I look for is blocking. Various ways to check, but the tried and

true technique is to execute 'select * from master..sysprocesses where blocked >

0'. Works on SQL 7 forward and takes about 15 seconds to type in and run, then

you can decide what action to take. This usually occurs because someone is doing

something bad like bulk loading records during the day. You can find and kill

the blocking spid, but it's worth looking to see what the spid is doing first.

If it's been running for three hours you might be better off to let it finish

rather than wait for it to roll back.

#2 - It's Not Blocking, Is It Limited to One Application/Database/Physical

Location?

I'm usually asking this question while I'm headed to a machine to work on

step #1. If it's NONE of those things I can probably rule out recent

application deployments, recent proc changes, network problems to a WAN site. If

it is one physical location, I immediately get an email out to the network team

to have them check all their hardware. Sometimes a router will blip, or maybe

they just rebooted it and forgot to tell everyone. In practice it's usually

not the network, but sometimes it is! If it's just one application and/or

database, I'm immediately thinking 'what has changed'? If you know how to do

wait state analysis this is a good technique to think about applying here but it

takes time and you can often shortcut with the right questions.

#3 - If it's One Application, Can They Reproduce the Issue?

This is key. If they can show you it's always slow when they add a customer,

look up an order, etc, you can quickly zoom in by profiling and filtering to a

single host name. That will let you see exactly what is being executed and look

for something with a long duration - long being subjective, but anything close

to 30 seconds is a good candidate for any immediate look and anything on down to

5 seconds might need to be checked. Ideally here you can refer back to a

baseline so you know what those procedures usually cost. Usually application

problems can be traced back to either a change (app code or database), or SQL

has generated a new plan for a query that isn't performing well, or both. In the

worst case get a developer to step through the code against the production

database because it's not always a SQL problem. Check whatever error logging

system the application uses and the event log of the machine they are using it

on.

#4 - It's One Application, But They Cannot Reproduce on Demand

These can be miserable to solve. I usually start with running separate

profiler sessions on a couple users I can trust to call or email when the

problem happens and to be able to tell me a lot about what they were doing at

the time. I've seen some of these solved by the low tech technique of using a

camcorder to capture everything that happened, more recently I've used

Morae to let me go back in time

to see what happened. Both of those are more often needed for code issues rather

than SQL issues, but sometimes you're just looking for clues. If you can get the

users to log when it happens you can try to match up in Profiler, the hard part

is the query plans may be fine but they are being blocked or bogged down by

something happening elsewhere on the server. So you have to have a deep focus on

the user AND be looking at the server holistically.

#5 - Sometimes it's the data

This might happen because a lot of data has been added and the statistics

don't match the current state of the data, or just because in some cases there

is a lot more data than the developer expected. I dealt with a problem like this

once where IT users had not flipped a bit in a table that allowed users to cache

the results of an expensive query on the client, as a result they kept running

that query over and over and over, killing performance. Flipped the bit and

magically all was well!

#6 - Rarely Is it the Hardware

Perhaps most common in this category is a drive in a RAID set failing and no

one noticing, resulting in a performance decrease because parity is being used

to restore the missing data on the fly. I've also seen more than once incidents

where 'the server is down' but someone doing work in the server room

unplugged/knocked loose the wrong power/network cable. For the  most part

the hardware will either work or it won't. Doesn't hurt to have some check, but

unless the server is actually unavailable this should be further down the list.

#7 - It's Usually Because Someone Changed Something!

I bet we all have had a few of these happen. Having a good change control

process that is logged can really help when you're under stress. Something as

minor as deploying a proc change where they added a column or order by can have

a huge performance impact if the query is no longer covered. Sometimes the rules

change after a hotfix (less often) or a service pack (more often) and while

hopefully the changes are good, sometimes it isn't. Larger companies may be

applying patches on a regular basis and after a while you start to ignore them,

until something goes wrong. I'll also add a plug here for rebooting one extra

time after applying a service pack because I want to be absolutely sure there

are no pending changes waiting to be made live on the next reboot which might be

months away - who would think of the last service pack as being an issue when it

was applied three months ago?

#8 - It's Incredibly Important to Communicate

As soon as you know an incident is going to last more than few minutes

someone needs to get the word to operations. Work may need to be redirected to

other sites, hourly employees put on break or sent for an hour of required but

not all that important training, or in worse cases sent home for the day. Even

smaller companies may have hundreds of employees/customers affected and the cost

per hour of down time can be significant. Tell operations what you know via your

boss, let them decide. It's also important to immediately communicate problems

to everyone in IT. A lot of times you'll get an immediate reply back from

someone who was bulk loading data, rebooting a switch, etc, that will

immediately explain the problem. You also will need people to help you chase

down leads so that you can stay focused, and someone to make sure you get

something to eat and drink while you work.

#9 - Know when to ask for Help

I think the more experienced you are the easier this is to do. If you're a

brand new DBA you hate to get the credit card from your boss to call PSS, or ask

for a consultant to come, because you don't want to look inept. Absolutely there

is a risk to that, but on the other hand, anything beyond a few hours and the

cost of either of those pales in comparison. Try setting the stage before the

incident by having one call to PSS pre-approved and consider establishing a

relationship with a SQL consultant just in case. And it doesn't have to just be

PSS, lot's of good challenging problems get posted in the forums here and on

MSDN every day and the people that answer them like a good challenge and the

learning it brings. Lean on your peers too, whether within the company or

someone you met at the local user group. With experience you get more

comfortable with the idea that you know what you need to know, you know what you

don't know, and that it is not a sign of weakness to ask for help. Sometimes

easier to say than to do!

#10 - Learn Your Lesson

I bet easily 8 out of 10 times when I'm finally done resolving a performance

problem that I can look back and see that I didn't ask the right question,

didn't really listen to an answer, took off following the first lead rather than

being methodical, or any of a bunch of other things I should have done - in

hindsight. This isn't about beating yourself up or holding yourself to some

incredibly high standard, but rather stepping past our pride enough to really

learn a lesson.

#Extra Credit - The Challenge

It's been a while since one was posted, but Steve and I had a lot of fun

writing about many of the near disasters we've been involved in, often of our

making. So my challenge to you is post your own performance issue checklist, or

tell us about the problem you should have solved sooner and why. Yeah, we may

smile a bit when we read them, but I bet more of us smile because we've done the

same thing, or because we know that's one trap we won't fall in to. As always I

hope you've learned something or sparked an idea or two, and look forward to an

interesting discussion!

Rate

4.57 (21)

Share

Share

Rate

4.57 (21)