
Multiple Failures
Posted Thursday, June 16, 2011 9:15 PM


SSC-Dedicated


Group: Administrators
Last Login: Yesterday @ 8:20 PM
Points: 32,764, Visits: 14,928
Comments posted to this topic are about the item Multiple Failures






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1127062
Posted Thursday, June 16, 2011 11:10 PM


Forum Newbie


Group: General Forum Members
Last Login: 2 days ago @ 6:24 PM
Points: 4, Visits: 33
Are you kidding? I'm sure I'm exaggerating, but it sure feels like every non-trivial recovery scenario I've had to deal with has involved multiple failures, either as a root cause or as cascades while trying to recover from the initial incident!

(Not just hardware/infrastructure failures, but even things like making sure all the bills for the relevant vendors are paid up!)
Post #1127081
Posted Friday, June 17, 2011 2:45 AM


Old Hand


Group: General Forum Members
Last Login: Today @ 7:41 AM
Points: 363, Visits: 624
We had a 'good' one back in 2007...

We had two servers, a primary and a failover, hooked up to a fibre-attached SAN. (Please note this was NOT a Wintel/MSSQL environment, and I didn't configure it!)

The primary kernel-panicked on a hardware-level error that no one recognised and that the OS vendor couldn't pinpoint. This occurred mid-write on the database, corrupting our accounts and orders files.

Our suppliers switched us to the failover and did the best they could to repair the data.

Half a day of downtime so far; we were back online after 5 pm (the call centre works until 8).

The following morning a technician came in at the behest of our suppliers to analyse the server and try to find the root cause. He unwisely booted it into multi-user mode rather than single-user mode while all the networking was still attached at the back. The server then grabbed the SAN back off the failover, knocking all the users off again and, yes, corrupting the DB mid-write.

Another half day gone before we could switch back to the failover, having had to fsck and repair each disk area and, yes, repair the DB again.

The engineer came in again the next day and disconnected the failed server before giving it an overhaul. He was completely unable to find any issues, but said it looked like a blip and should be fine...

We switched back to live (why did we do that? I don't know).

Three hours later it kernel-panicked again, totally corrupting the DB, and this time we had to restore from tape.

At this point we discovered that the hardware IDs associated with the kernel panic related to the fibre controller. Further research highlighted a compatibility issue between the server's OS and the fibre controller! Doh!

Each server, live and failover, now has its own internal disk array, and the system is configured to synchronise overnight.

In total, this issue resulted in limited or no service for a period of three days and six hours of data lost permanently.

Lessons learnt:
Never allow a single point of failure (in this instance our SAN was a SPOF: corruption caused by the main server crashing created issues when switching to the failover).

Always, ALWAYS ensure that any hardware/software purchased for any business machine is compatible with (and supported by) every other piece of hardware/software in the solution.


Ben

^ That's me!


----------------------------------------
01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
----------------------------------------
Post #1127144
Posted Friday, June 17, 2011 3:54 AM
Grasshopper


Group: General Forum Members
Last Login: Monday, January 23, 2012 1:33 AM
Points: 11, Visits: 154
It seems to me that 'multiple failure' is a basic problem in all walks of life, not just database issues.

I missed the meeting because (pick any three):
I got up late
Lost the car keys
Car wouldn't start
Accident on the motorway
Went to the wrong office
Had the wrong time for the meeting
Child was ill and had to be taken to doctor
etc etc.

So, in real life as in the database world, there are many scenarios we can plan for, and some combinations of them. Often we'll do this by the seat of the pants, but depending on the price of failure, we'll enumerate the possibilities and mitigations more carefully. There could always be some set of circumstances we haven't thought of, though, or which is just too expensive to mitigate for its price/probability, and that is what will lead to failure.

As database professionals, we need to have enumerated the likely failures, mitigations, and costs (of both mitigation and failure), put appropriate measures in place (and tested at least some of them), and have a coherent plan that looks reasonable and defensible before, and hopefully after, a disaster.
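
A minimal sketch of that price/probability trade-off, in Python, with every figure invented purely for illustration: mitigate a failure mode only when the expected annual loss it removes exceeds what the mitigation costs.

# Toy expected-loss model for deciding which failure modes to mitigate.
# All probabilities and costs below are made up for illustration.
failure_modes = [
    # (name, annual probability, cost per incident, annual mitigation cost)
    ("single disk failure",        0.30,   20_000,  2_000),
    ("SAN corruption on failover", 0.02,  150_000, 25_000),
    ("site-wide outage",           0.005, 500_000, 60_000),
]

for name, p, incident_cost, mitigation_cost in failure_modes:
    expected_loss = p * incident_cost  # average loss per year if unmitigated
    decision = "mitigate" if expected_loss > mitigation_cost else "accept the risk"
    print(f"{name}: expected annual loss {expected_loss:,.0f} "
          f"vs mitigation {mitigation_cost:,.0f} -> {decision}")

With these made-up numbers, the disk mitigation pays for itself while the other two don't, which is exactly the "too expensive to mitigate for the price/probability" case described above.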
Post #1127176
Posted Friday, June 17, 2011 3:59 AM
Grasshopper


Group: General Forum Members
Last Login: Tuesday, March 11, 2014 6:17 AM
Points: 23, Visits: 284
Yeah, in my experience a lot of failures are due to multiple causes. I guess this is to be expected, as testing multiple factors adequately is always orders of magnitude more complicated than testing one thing at a time.

On a slightly related topic, I seem to attract system-related coincidences on some sort of superhuman level. Things like:
- data warehouse import fails due to corrupt data entering the system
- switch off the offending dimension's import (as it wasn't being used by the business at the time)
- spend weeks/months clearing out the data and risk-reviewing the changes to live data (with senior management sign-off, etc.)
- test the import of the offending dimension with copies of live data
- finally switch the dimension's import back on some months later
- that very night, a completely unrelated issue appears in the table data and the import fails again

So we think it's the old issue again, start down that path of investigation, and take a very long time to work out that it's a whole new issue. This seems to happen time and time again to me. I must be jinxed.
Post #1127178
Posted Friday, June 17, 2011 5:37 AM


Old Hand


Group: General Forum Members
Last Login: Today @ 7:41 AM
Points: 363, Visits: 624
Believe it or not, there's a verse in the Bible that mentions this... (yes, the Bible is applicable to databases).

Ecclesiastes 9:11 states: "time and unforeseen occurrence befall us all".

Don't think anyone will argue with that, lol.



Major issues nearly always involve more than one problem at a time. If problems occur one at a time, we, as DBAs, take them in our stride and just deal with them as a matter of course - we call it BAU support. Perhaps an import job fails overnight and at 9 am the business can't get report X. We re-run the import manually, and at 9:15 am the report is there. Happy days.
If this happens at the same time as a disk failure, though, the server is slowed down by rebuilding the RAID array onto a new drive and the report finishes at 9:45 instead of 9:15 - too late for the manager to take it to their bimonthly board review. And that's assuming the report doesn't slow the system down so much that the end users are nagging us to kill it because it is affecting their productivity.

Either one of those issues is BAU on its own; when they occur at the same time, however, they result in many people being cross - ourselves included.


Ben

^ That's me!


----------------------------------------
01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
----------------------------------------
Post #1127246
Posted Friday, June 17, 2011 7:12 AM
SSC-Addicted


Group: General Forum Members
Last Login: Yesterday @ 12:13 PM
Points: 442, Visits: 714
Our storage engineer recently noticed a bad drive on a server. It was the OS drive, in a RAID 1 mirror, used in a major clinical system.

As he pulled the drive to replace it, the mirrored drive failed.

He was somewhat displeased.

He had a deadline for another project that day. What should have been a 15-minute swap at most took the entire day, and he was still unable to recover what he needed.

Oh, and the other project? Deadline missed.


Dave
Post #1127303
Posted Friday, June 17, 2011 7:48 AM


UDP Broadcaster


Group: General Forum Members
Last Login: Today @ 7:14 AM
Points: 1,465, Visits: 4,262
Sometimes one thing has to break before we discover that something else has been broken for a long time. For example, it takes a hurricane to reveal that the levees were improperly built or maintained.


"Winter Is Coming" - April 6, 2014
Post #1127356
Posted Friday, June 17, 2011 8:04 AM
Valued Member


Group: General Forum Members
Last Login: Thursday, July 28, 2011 8:03 AM
Points: 70, Visits: 316
I think that multiple failures may actually be more common than single ones. As others have pointed out, there are a number of reasons why they may not be as uncommon as we might think.

I think that one of those reasons is that we're doing everything we can to eliminate single points of failure. Whatever effect that has on overall system stability and rates of failure, it does mean that multiple failures are nearly the only thing left that can keep life interesting.
Post #1127372
Posted Friday, June 17, 2011 9:04 AM


Old Hand


Group: General Forum Members
Last Login: Monday, May 07, 2012 9:23 AM
Points: 304, Visits: 716
What you are really talking about is called "cascading failures". During my days in the Air Force, a great deal of time was spent on this phenomenon. In modern jet aircraft (as happens in computers, servers, etc.), when one system fails it can put a load on, or trigger failures in, other components that are dependent upon the initial component.

The strangest aspect of cascading failures is that they are extremely common, not only in machines but in the human body as well, and yet we tend to assume that only one thing fails at a time - and we create plans around that scenario - only to find that in actuality, when one thing breaks, others follow.

Any good recovery scenario should plan for cascading failures, NOT single-instance failures. In my experience in this business, companies that plan for one thing to break usually wind up lost when a batch of things goes down. On the other hand, companies that plan for cascading failures are much better positioned to recover and maintain productivity.

The lesson? Plan for the more common instance - not the rarer one.
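
A minimal Monte Carlo sketch of that point, in Python, with all probabilities invented for illustration: once failures are allowed to trigger one another, incidents involving two or more failures stop being the rare case.

import random

# Toy Monte Carlo: independent failures vs. cascading failures.
# P_FAIL and P_CASCADE are made-up figures for illustration only.
TRIALS = 100_000
COMPONENTS = 5
P_FAIL = 0.05     # chance any one component fails on its own
P_CASCADE = 0.40  # chance a failed component takes out a dependent one

multi_independent = 0
multi_cascading = 0
for _ in range(TRIALS):
    base = [random.random() < P_FAIL for _ in range(COMPONENTS)]
    if sum(base) >= 2:
        multi_independent += 1
    # Cascading model: each failure may knock out the next component in the chain.
    cascaded = list(base)
    for i in range(COMPONENTS - 1):
        if cascaded[i] and random.random() < P_CASCADE:
            cascaded[i + 1] = True
    if sum(cascaded) >= 2:
        multi_cascading += 1

print(f"P(2+ failures), independent model: {multi_independent / TRIALS:.3f}")
print(f"P(2+ failures), cascading model:   {multi_cascading / TRIALS:.3f}")

With these made-up numbers, the cascading model produces several times as many multi-failure incidents as the independent one - which is exactly the gap a single-failure recovery plan never covers.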


There's no such thing as dumb questions, only poorly thought-out answers...
Post #1127439