Learning From Breakage

  • Comments posted to this topic are about the item Learning From Breakage

  • I broke a reporting system at month-end.  I had 18 hours of intense learning that day.

    With a production breakage, you can't step away from it, especially if you are a senior in a department. You are accountable, even if you weren't responsible.

    No one wants to be the cause of a breakage, and it is painful to own up.  As a principal engineer, I try to make an environment where juniors feel safe to own up and be honest about their involvement so we can work together.  It is a nightmare trying to solve a puzzle where half the pieces are missing.

    A breakage doesn't just teach us about a system.  It teaches us the value of a methodical diagnosis process.

    • What broke? (Make sure you identify the cause, not the symptom.)
    • How did it break? (Can this be prevented in future?)
    • When did it break? (This can be some time before the results become apparent.)

    I anonymise who broke it in any written document.  My only interest in "who" is who can provide the most useful detail.  I'm not a fan of blame games.

    At my company, we put together a post-mortem document for any incident, whether we've caused the breakage or the cause is an external factor. This will contain the timeline of actions to diagnose, fix and confirm that the fix is satisfactory.

    It will also contain a list of actions to reduce the risk of it happening again.
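
    As a rough illustration only, the contents described above could be captured in a small record like the one below. This is a minimal Python sketch under my own assumptions: every name in it (`PostMortem`, `TimelineEntry`, `PHASES`, `log`, `detection_lag`) is hypothetical and not taken from any real tool or from an actual company template. There is deliberately no "who" field, in keeping with anonymising the written document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical phase names mirroring "diagnose, fix and confirm".
PHASES = ("diagnose", "fix", "confirm")


@dataclass
class TimelineEntry:
    at: datetime   # when the action was taken
    phase: str     # one of PHASES
    action: str    # what was done, with no names attached


@dataclass
class PostMortem:
    what_broke: str        # the cause, not the symptom
    how_it_broke: str      # mechanism of the failure
    broke_at: datetime     # can be well before anyone noticed
    noticed_at: datetime   # when the results became apparent
    external_cause: bool = False
    timeline: list[TimelineEntry] = field(default_factory=list)
    prevention_actions: list[str] = field(default_factory=list)

    def log(self, at: datetime, phase: str, action: str) -> None:
        """Add a timeline entry and keep the timeline in time order."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase!r}")
        self.timeline.append(TimelineEntry(at, phase, action))
        self.timeline.sort(key=lambda entry: entry.at)

    def detection_lag(self) -> timedelta:
        """Time between the breakage and its results becoming apparent."""
        return self.noticed_at - self.broke_at
```

    Keeping "broke at" and "noticed at" as separate fields is one way to make the gap between the breakage and its visible results explicit in the write-up.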

    We go through any post-mortem as part of team retrospectives, so we all learn from it.  It is painful but necessary.

    • This reply was modified 1 week, 5 days ago by David.Poole.
  • David.Poole wrote:

    We go through any post-mortem as part of team retrospectives, so we all learn from it.  It is painful but necessary.

    I'm a big fan of post-mortems, especially when a written post-mortem is reviewed by a senior engineer who was not involved in the incident. Sometimes people aren't quite sure exactly why the issue occurred and end up convincing themselves of some narrative.

    Also, worst schema I've come across so far: ticket numbers from the company's ticketing system. Oh God, why?!

  • Coffee_&_SQL wrote:

    David.Poole wrote:

    We go through any post-mortem as part of team retrospectives, so we all learn from it.  It is painful but necessary.

    I'm a big fan of post-mortems, especially when a written post-mortem is reviewed by a senior engineer who was not involved in the incident. Sometimes people aren't quite sure exactly why the issue occurred and end up convincing themselves of some narrative.

    Also, worst schema I've come across so far: ticket numbers from the company's ticketing system. Oh God, why?!

    In my previous job we occasionally did post-mortems. We didn't tend to have major catastrophes.

    In my current job, I suspect that post-mortems are done, but only upper management is involved. We don't tend to have major catastrophes, either. In my 10 years there, there's only been one that I can recall. An upper-level manager asked me some questions about the incident, then took that information with her to the post-mortem. Anyway, I'd like to be in one, even if it involves me, just so I can learn from it.

    I laughed out loud when I read your comment about ticket numbers from the company's ticketing system!

    Rod
