Troubleshooting

  • The chicken approach would not be tolerated where I work.

  • Sounds like you work at a great place. I have not been doing technology very long. I have worked with SQL Server for 10 years, but I have yet to find a place where folks don't do that type of troubleshooting. Anyway, thanks for the comment; it gives me hope.

    __________________________________________________

    Mike Walsh
    SQL Server DBA
    Blog - www.straightpathsql.com/blog |Twitter

  • Beautiful article, and well written.

    It seems, as with so many things in life, that the key element in the article can be summed up with a quote from The Hitchhiker's Guide: "Don't panic."

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

  • Interesting analogy, but I'm afraid that it simply isn't true.

    Recent studies have shown that well-trained personnel (I think it was Incident Controller, aka Lead Fireman, specifically) simply do not run through all the possibilities in an emergency situation; they just look and act. The training allows them to adjust procedures as the reality of the situation emerges. The key thing is that they can never have any significant understanding of the problem domain in advance, only of the tools at their disposal. In tech support, the reverse is more often true: people know how to configure devices, but only use the diagnostic tools when they need to, and therefore only ever learn the bits that they have used so far.

    Throw away your pocket calculators; visit www.calcResult.com

  • Nice job, Mike.

    I fixed the one typo on "seen".

  • mike brockington (4/2/2009)


    Interesting analogy, but I'm afraid that it simply isn't true.

    Recent studies have shown that well-trained personnel (I think it was Incident Controller, aka Lead Fireman, specifically) simply do not run through all the possibilities in an emergency situation; they just look and act. The training allows them to adjust procedures as the reality of the situation emerges. The key thing is that they can never have any significant understanding of the problem domain in advance, only of the tools at their disposal. In tech support, the reverse is more often true: people know how to configure devices, but only use the diagnostic tools when they need to, and therefore only ever learn the bits that they have used so far.

    Hey Mike - That is an interesting point, and I guess I am slightly confused. You are right, not all possibilities get run through at a structure fire or car accident. I agree about the lack of significant domain knowledge, though. In the firefighting/ambulance world, every call is different; we have been trained for a general domain knowledge of the potential problems, but each one is a bit different. We apply the same pattern to the problems, though. At least I do. I think through all of the potential scenarios when responding to a house fire, a medical call or even a more common fire alarm activation call (bells and smells). I think through what could happen; as I approach, I look for signs that point towards a scenario; and I plan for the worst case (throw my air pack on my back even if it's the 17th alarm call there this week, have my gear situated properly).

    You are right that it comes down a lot to training on both the fire and EMS side. On the ambulance call, though, we are thinking of the big picture and proactively using a rule-out methodology. I said differential diagnosis, and perhaps that is a bit arrogant (we are certainly not doctors or nurses), but it is close to what we do. We go to our training, look at all of the information, decide on the general problem area and treat that problem while transporting. If signs and symptoms change, we go back to our training and knowledge to redirect treatment. That isn't looking and acting, though. Show up to a "working code" (a patient unresponsive, not breathing and pulseless), on the other hand, and it is a look-and-act situation.

    In technology we have occasions where we may have to do a look-and-act that relies on training (server is frozen, can't access any tools to gather more information, can't connect easily... reboot). There are a lot more situations where we want to analyze, use a methodology and form a "differential diagnosis" of the issue at hand.

    As for only using what you know when you troubleshoot technology: sure, that makes a lot of sense, but that is why training and familiarization, disaster drills, etc. come in handy. They develop some of that "muscle memory" and teach you the tools you may need to use under stress. The focus on training in technology is far lower than in the emergency services, which I guess makes sense.

    I like your viewpoint, and it makes me think about a lot 🙂 Should we do more training/drilling in technology? I think so, but how much more? How much will it cost (time, resources, lack of attention towards reactive tasks and user requests), and what's the likelihood we'll ever need it? In Fire/EMS it's an easy decision: if you don't train, lives could be lost; there is a LOT of downtime between calls; and you are getting hit with such a variety.

    I still like my analogy on the medical side, as it brings out the steps that work both on an ambulance call (at least they have for me, on both trauma and medical emergencies) and in most other areas of life. There are a lot of places (like the fire example) where that methodology is not used, and I never really thought of it like that 🙂

    Thanks!

    __________________________________________________

    Mike Walsh
    SQL Server DBA
    Blog - www.straightpathsql.com/blog |Twitter

  • Great article, Mike. One thing I would add, and emphasize for the IT world, is that documentation should be occurring throughout the troubleshooting effort. Record what you're doing. Record what the results were. That way, if you have to roll back, you know exactly what you did and in what order you did it. In the scenario you've described, I'm sure there is documentation continuously going on. When you check vitals, etc., you're probably recording some info, just like you probably did with the initial information. What frustrates me a lot of the time in troubleshooting is when I get brought in to help find out what is wrong, I ask the folks who have already been working the problem, "What have you done to this point?" and they can't give me a good breakdown of what has already been tried and what they saw when those things were tried.

    K. Brian Kelley
    @kbriankelley
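
The record-as-you-go habit described above could be sketched in code. What follows is a minimal Python sketch, not anything taken from the article or from this thread: `Step`, `TroubleshootingLog`, and the sample incident are hypothetical names and made-up data, and a real log would more likely live in a ticket or a shared file than in memory. It only shows the idea that each action is stored with its result, in order, so that the steps taken and a reverse-order rollback list can be produced on request.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Step:
    """One troubleshooting action, what was observed, and how to undo it."""
    action: str
    result: str
    undo: Optional[str] = None  # None means the step changed nothing
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class TroubleshootingLog:
    """Hypothetical in-memory log; a real one would live in a ticket or file."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def record(self, action: str, result: str, undo: Optional[str] = None) -> Step:
        """Append a step so the order of actions is preserved."""
        step = Step(action, result, undo)
        self.steps.append(step)
        return step

    def summary(self) -> List[str]:
        """List what has been done so far, in the order it happened."""
        return [f"{n}. {s.action} -> {s.result}" for n, s in enumerate(self.steps, 1)]

    def rollback_plan(self) -> List[str]:
        """List undo actions newest-first, skipping read-only steps."""
        return [s.undo for s in reversed(self.steps) if s.undo is not None]


# Made-up incident, purely for illustration.
log = TroubleshootingLog()
log.record("Read the SQL Server error log", "Timeouts began after the last deploy")
log.record("Disabled the nightly job", "No change", undo="Re-enable the nightly job")
log.record("Raised the query timeout", "Errors stopped", undo="Restore the query timeout")

print("\n".join(log.summary()))
print("\n".join(log.rollback_plan()))
```

Keeping the undo note next to the action at the moment it is recorded is the design choice here: the rollback order then falls out of the log itself instead of depending on anyone's memory.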

  • Great point about documentation! I mentioned it, but as an afterthought. The post-call documentation on an ambulance call is definitely after the fact, but there is a lot of writing that should go on during the call. I will often write on the back of a glove or a scratch pad, or just take the wider medical tape, tape it to my thigh and write on that: vitals, treatments, things I checked.

    Wish I had thought of that and added a blurb about it. It is quite frustrating to find nothing documented, nothing to prevent a recurrence or at least make it easier to fix next time.

    __________________________________________________

    Mike Walsh
    SQL Server DBA
    Blog - www.straightpathsql.com/blog |Twitter

  • that is why training and familiarization, disaster drills, etc. come in handy

    Thinking about it further, I think you have stumbled on the biggest difference - both Fire and Ambulance services are primarily there for emergency situations, prevention is largely a different department and/or an activity carried out during quiet periods.

    For most IT people, the reverse is true: our performance is generally assessed on how _little_ time we spend on emergency care, as prevention is preferred.

    This leads into the classic dilemma over training: should we put time and effort into learning diagnostic techniques, and purchasing diagnostic tools, or should we concentrate on ensuring that disasters never happen? While many large organisations will be prepared to have a dedicated disaster team, is Google's approach of massive redundancy not an even better idea?

    Throw away your pocket calculators; visit www.calcResult.com

  • mike brockington (4/2/2009)


    that is why training and familiarization, disaster drills, etc. come in handy

    Thinking about it further, I think you have stumbled on the biggest difference - both Fire and Ambulance services are primarily there for emergency situations, prevention is largely a different department and/or an activity carried out during quiet periods.

    For most IT people, the reverse is true: our performance is generally assessed on how _little_ time we spend on emergency care, as prevention is preferred.

    This leads into the classic dilemma over training: should we put time and effort into learning diagnostic techniques, and purchasing diagnostic tools, or should we concentrate on ensuring that disasters never happen? While many large organisations will be prepared to have a dedicated disaster team, is Google's approach of massive redundancy not an even better idea?

    Exactly, that's where I ended up in that long-winded stream-of-consciousness reply to you 🙂 I think Google has it right in some ways, and that covers hardware errors or even a majority of software errors, but there are still troubleshooting scenarios that come up. I think in IT we do need more training (not to the extent of the fire service; that would be a wasteful use of IT budgets) and more tool familiarity, and we should look at redundancies more. Google still has its failures, though: the massive redundancy is great until you introduce a glitch across a massively redundant array of machines (yes, I am very much oversimplifying that).

    Anyway, I think your response and this conversation bring up some interesting points.

    __________________________________________________

    Mike Walsh
    SQL Server DBA
    Blog - www.straightpathsql.com/blog |Twitter

  • In that case, I suppose I must retract my earlier criticism of your analogy - I still think that the comparison wasn't very close, but the point of an analogy is to stimulate thought, and in that respect it has worked well!

    Throw away your pocket calculators; visit www.calcResult.com

  • In an emergency you must remember to panic!!

    Well, not really, but people who *are* panicking often get really angry at your failure to panic along with them. They see your calmness not as competence in the face of an emergency, but as a "failure" on your part to "understand the seriousness" of the situation. If you're not careful (and sometimes even if you are), you can suddenly find yourself defending your actions and your attitude from a frustrated and angry person instead of solving the problem.

    In my experience, the best bet is to remove the panicky people from the area where you are trying to problem-solve. This can help break the escalation of their panic (and give them a chance to calm down) and remove a fairly serious distraction while you're in an emergency situation.

  • If anyone thinks SQL Server is less complex than Oracle or DB2, then I would assume they have extensive knowledge of these products to make such a judgment. And if anyone says that SQL Server is not complex, take a minute and delve into the SQL Server architecture... let's say the memory architecture. You will be surprised that there are many complex components involved in the SQL Server memory architecture, and then on top of that try mapping those components back to the Windows memory architecture... wow, seems simple enough... :hehe:

    But let's not focus on comments about complexity and instead focus on the article that was written and the way of thinking that must sometimes change to fix a problem in 10 minutes or spend 1 hour....

  • CoetzeeW (4/2/2009)


    If anyone thinks SQL Server is less complex than Oracle or DB2, then I would assume they have extensive knowledge of these products to make such a judgment. And if anyone says that SQL Server is not complex, take a minute and delve into the SQL Server architecture... let's say the memory architecture. You will be surprised that there are many complex components involved in the SQL Server memory architecture, and then on top of that try mapping those components back to the Windows memory architecture... wow, seems simple enough... :hehe:

    But let's not focus on comments about complexity and instead focus on the article that was written and the way of thinking that must sometimes change to fix a problem in 10 minutes or spend 1 hour....

    I apologize for my reply about the complexity comments. I didn't convey what I was thinking properly. I completely agree SQL Server is complex inside. It has a simpler user interface (from my experience with the other platforms), but it is by no means a simple product 🙂 Teams of great developers have spent countless hours developing the various versions, and its architecture is complex. Thanks for the reply about that; you said it much better than I did 🙂

    As for the second part of your comment, are you saying a methodology would reduce the time to 10 minutes or drag it out to 1 hour? I am hoping that it saves time; most of those steps and principles aren't lengthy, and they only increase the time by a little. In the ambulance call example, if we had rushed through and missed a couple of steps, we might have saved a few minutes, but the outcome wouldn't have been better and could have been worse.

    __________________________________________________

    Mike Walsh
    SQL Server DBA
    Blog - www.straightpathsql.com/blog |Twitter

  • aureolin (4/2/2009)


    In an emergency you must remember to panic!!

    Well, not really, but people who *are* panicking often get really angry at your failure to panic along with them. They see your calmness not as competence in the face of an emergency, but as a "failure" on your part to "understand the seriousness" of the situation. If you're not careful (and sometimes even if you are), you can suddenly find yourself defending your actions and your attitude from a frustrated and angry person instead of solving the problem.

    In my experience, the best bet is to remove the panicky people from the area where you are trying to problem-solve. This can help break the escalation of their panic (and give them a chance to calm down) and remove a fairly serious distraction while you're in an emergency situation.

    It's sad to say, but I know what you mean and have seen that. I have asked management to give me some time and space before. It wasn't rude, but I said, "It will be easier for me to get you an update and fix things if I can focus on it." If they are good, they'll understand and give you the space. A coworker once tried it in a ruder way... He was being asked every 5-10 minutes for an update and he replied, "If I didn't have to stop thinking to give you an update every 5 minutes, I could have this solved a lot quicker!" It was effective, but not an approach I suggest.

    mike brockington (4/2/2009)


    In that case, I suppose I must retract my earlier criticism of your analogy - I still think that the comparison wasn't very close, but the point of an analogy is to stimulate thought, and in that respect it has worked well!

    Fair enough 🙂 In my experience, it works for troubleshooting both technology issues and people issues. I agree it doesn't translate to a fire scene, but so far it has worked for me in SQL and on the ambulance.

    __________________________________________________

    Mike Walsh
    SQL Server DBA
    Blog - www.straightpathsql.com/blog |Twitter
