Multiple Failures

  • Comments posted to this topic are about the item Multiple Failures

  • Are you kidding? I'm sure I'm exaggerating, but it sure feels like any non-trivial recovery scenario I've had to deal with involved multiple failures, either as a root cause, or cascading while trying to recover from the initial incident!

    (Not just hardware/infrastructure failures, but even things like making sure all the bills for relevant vendors are paid-up!)

  • We had a 'good' one back in 2007...

    We had two servers, failover and primary, hooked up to a fiber-attached SAN. (Please note this was NOT a wintel/mssql environment and I didn't configure it!)

    The primary kernel panicked on a hardware-level error that no one recognised and the OS vendor couldn't pinpoint. This occurred mid-write on the database, corrupting our accounts and orders files.

    Our suppliers switched us to the failover and did the best they could to repair the data.

    Half a day of downtime so far - back online after 5pm (the call centre works until 8).

    The following morning a technician came in at the behest of our suppliers to analyse the server and try to find the root cause. He stupidly booted it into multi-user mode rather than single-user mode while all the networking was still attached at the back. The server then grabbed the SAN back off the failover, knocking all the users off again and, yes, corrupting the DB mid-write.

    Another half day was lost; then we switched back to the failover, having had to fsck and repair each disk area and, yes, repair the DB again.

    The engineer came in again the next day and disconnected the failed server before giving it an overhaul - he was completely unable to find any issues, but said it looked like a blip and should be fine...

    We switched back to live (why did we do that? I don't know).

    Three hours later it kernel panicked again, totally corrupting the DB, and this time we had to restore from tape.

    At this point we discovered that the hardware IDs associated with the kernel panic related to the fiber controller. Further research highlighted a compatibility issue between the server's OS and the fiber controller! Doh!

    Each server, live and failover, now has its own internal disk array, and the system is configured to synchronise overnight.

    In total this issue resulted in limited or no service for a period of three days and a total of six hours of data lost permanently.

    Lessons learnt:

    Never allow a single point of failure (in this instance our SAN was a SPOF - corruption caused by the main server crashing led to issues when switching to the failover).

    Always, ALWAYS ensure that any hardware/software purchased for a business machine is compatible with (and supported by) every other piece of hardware/software in the solution.

    Ben

    ^ That's me!

    ----------------------------------------
    01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
    ----------------------------------------

  • It seems to me that 'multiple failure' is a basic problem in all walks of life, not just in databases.

    I missed the meeting because (perm any three)

    I got up late

    Lost the car keys

    Car wouldn't start

    Accident on the motorway

    Went to the wrong office

    Had the wrong time for the meeting

    Child was ill and had to be taken to doctor

    etc etc.

    So, in real life and in the database world, there are many scenarios that we can plan for, and some combinations of them. Often we'll do this 'seat of the pants', but depending on the price of failure we'll enumerate the possibilities and mitigations more carefully. Yet there could always be some set of circumstances we haven't thought of, or which is just too expensive to mitigate given the price/probability, and that will lead to failure.

    As database professionals we need to have enumerated the likely failures, mitigations and costs (of both mitigation and failure), put appropriate measures in place (and tested at least some of them), and have a coherent plan that looks reasonable and defensible before, and hopefully after, a disaster.

  • Yeah, in my experience, a lot of failures are due to multiple causes. I guess this is to be expected as testing multiple factors adequately is always orders of magnitude more complicated than testing one thing at a time.

    On a slightly related topic, I seem to attract system-related coincidences on some sort of superhuman level. Things like:

    - data warehouse import fails due to corrupt data entering the system

    - switch off the offending dimension's import (as it wasn't being used by the business at that time)

    - spend weeks/months clearing out the data, risk-reviewing the changes to live data (with senior management sign-off etc)

    - test the import of the offending dimension with copies of live data

    - finally switch the import of the dimension back on some months later (see the sketch below)

    - that very night, a completely unrelated issue appears in the table data and the import fails again

    So we're thinking it's the old issue again, start down a path of investigation, and take a very long time to work out it's a whole new issue. This seems to happen time and time again to me. I must be jinxed.
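    For what it's worth, if the dimension load runs as its own SQL Server Agent job (an assumption on my part - the job name below is made up), the switch-off and switch-back-on steps can be as small as this sketch:

        -- Sketch only, assuming the dimension import is a separate SQL Agent job.
        -- The job name is hypothetical.

        -- Disable the offending dimension's import while the data is cleaned up...
        EXEC msdb.dbo.sp_update_job @job_name = N'DW Dimension Import', @enabled = 0;

        -- ...then re-enable it months later, once the fixes have been signed off.
        EXEC msdb.dbo.sp_update_job @job_name = N'DW Dimension Import', @enabled = 1;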

  • Believe it or not, there's a verse in the Bible that mentions this... (yes, the Bible is applicable to databases).

    Ecclesiastes 9:11 states: "time and unforeseen occurrence befall us all".

    Don't think anyone will argue with that, lol.

    Major issues nearly always involve more than one problem at a time. If problems occur one at a time, we, as DBAs, take them in our stride and just deal with them as a matter of course - we call it BAU support. Perhaps an import job fails overnight and at 9am the business can't get report X. We re-run the import manually (see the sketch below) and at 9.15am the report is there. Happy days.
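    A minimal sketch of that manual re-run, assuming the overnight import is a SQL Server Agent job (the job name 'Nightly Import' is made up for illustration):

        -- Sketch only: check what went wrong overnight, then re-run the import
        -- by hand so report X can still be produced. Job name is hypothetical.
        USE msdb;
        GO

        -- Most recent history for the job, to see which step failed and why.
        EXEC dbo.sp_help_jobhistory
            @job_name = N'Nightly Import',
            @mode     = N'FULL';

        -- Kick the import off again manually.
        EXEC dbo.sp_start_job @job_name = N'Nightly Import';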

    If this happens at the same time as a disk failure, however, the server is slowed down by rebuilding the RAID array onto a new drive and the report finishes at 9.45 instead of 9.15 - too late for the manager to take it to their bimonthly board review. And that's assuming the report doesn't slow the system down so much that the end users are nagging us to kill the report because it is affecting their productivity.

    Either one of those issues is BAU on its own; when they occur at the same time, however, they result in many people being cross - ourselves included.

    Ben

    ^ That's me!

    ----------------------------------------
    01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
    ----------------------------------------

  • Our storage engineer recently noticed a bad drive on a server. It is the OS drive, and is RAID 1. It is used in a major clinical system.

    As he pulled the drive to replace it, the mirrored drive failed.

    He was somewhat displeased.

    He had a deadline for another project that day. What should have been a 15-minute swap at most took the entire day, and he was still unable to recover what he needed.

    Oh, the other project, deadline missed.

    Dave

  • Sometimes one thing has to break before we discover that something else has been broken for a long time. For example, it takes a hurricane to reveal that the levees were improperly built or maintained.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I think that multiple failures may actually be more common than single ones. As others have pointed out, there are a number of reasons why multiple failures may not be as uncommon as we might think.

    I think that one of those reasons is that we're doing everything we can to eliminate single points of failure. Whatever effect that has on overall system stability and rates of failure, it does mean that multiple failures are nearly the only thing left that can keep life interesting.

  • What you are really talking about is called "cascading failures". During my days in the Air Force a great deal of time was spent on this phenomenon. In modern jet aircraft (as happens in computers, servers, etc.), when one system fails it can put a load on another, or cause failures in other components that are dependent upon the initial component.

    The strangest aspect of cascading failures is that they are extremely common, not only in machines but in the human body as well, and yet we tend to think that one thing fails most of the time - and we create plans around that scenario - only to find that in actuality, when one thing breaks, others follow.

    Any good recovery scenario should plan for cascading failures, NOT single instance failures. In my experience in this business, companies that plan for one thing to break usually wind up lost when a batch of things go down. On the other hand, companies that plan for cascading failures are much better positioned to recover and maintain productivity.

    The lesson? Plan for the more common instance - not the rarer instance.

    There's no such thing as dumb questions, only poorly thought-out answers...

  • Are the multiple problems really problems, or are they symptoms of the real problem? Most times it's not the problem we find, but the symptoms, that get our attention. Most of the time I find it was just one problem with multiple symptoms. The trick is not to get caught chasing the symptoms.

  • Cascading is probably a good term for these types of scenarios. Our most recent experience included a corrupt domain controller that also happened to host Exchange. These are the experiences that we work so hard to avoid...

    Mark

  • Single failures don't even show up on the board anymore. Fault tolerance is built into so many of the systems in my current environment, and we monitor for those failures, so fixing them (replacing bad hardware, etc.) is routine. We don't use cutting-edge technology, and the systems we do use are very mature - which promotes stability and prevents corruption. (It's one of the reasons I love MS SQL, and tend to consistently be behind in upgrading versions.)
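    As a small illustration of the kind of routine check that keeps single failures off the board, here is a minimal sketch (assuming SQL Server) that reads the suspect_pages table the engine populates when it encounters damaged pages, so quiet corruption is spotted before it can join a cascade:

        -- Sketch only: list any pages SQL Server has flagged as suspect.
        SELECT  DB_NAME(sp.database_id) AS database_name,
                sp.file_id,
                sp.page_id,
                sp.event_type,        -- 1-3 = error events; 4, 5, 7 = restored/repaired/deallocated
                sp.error_count,
                sp.last_update_date
        FROM    msdb.dbo.suspect_pages AS sp
        ORDER BY sp.last_update_date DESC;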

    In previous environments, (1) I didn't have enough monitoring to necessarily spot the failed components, so one failure would REVEAL a previous failure which would look like it was all part of a cascading issue. (2) Just-In-Time-Training / On-The-Job-Training is never a happy situation during a failure event -- and can lead to mistakes which can easily result in a cascading failure event... often known as a RGE (Resume Generating Event). (3) Undersized power and cooling infrastructure (or in small environments: the classic 'locked closet server room') can easily generate a whole series of hardware failures all on their own.

    But no, (knocking.on.wood.now) haven't had cascading failures in a decade now.

  • Rich Weissler (6/17/2011)

    I love MS SQL, and tend to consistently be behind in upgrading versions.

    Yep, we're just gearing up towards upgrading to 2008 R2. Even then, it's only been triggered by an upgrade of our core application.

  • Multiple simultaneous drive failures in a RAID5 array. 2 disks went bad at the same time. I think that's getting more at what Steve's original article was about.

    But the cascading failure is probably more common.
