Multiple Failures


Steve Jones
Comments posted to this topic are about the item Multiple Failures

TheBlueAdept
Are you kidding? I'm sure I'm exaggerating, but it sure feels like any non-trivial recovery scenario I've had to deal with involved multiple failures, either as a root cause, or cascading while trying to recover from the initial incident!

(Not just hardware/infrastructure failures, but even things like making sure all the bills for relevant vendors are paid-up!)
BenWard
We had a 'good' one back in 2007...

We had two servers, primary and failover, hooked up to a fiber-attached SAN. (Please note this was NOT a Wintel/MSSQL environment, and I didn't configure it!)

The primary kernel panicked on a hardware-level error that no one recognised and the OS vendor couldn't pinpoint. This occurred mid-write on the database, corrupting our accounts and orders files.

Our suppliers switched us to the failover and did the best they could to repair the data.

Half a day of downtime so far - we were back online after 5pm (the call centre works until 8).

The following morning a technician came in at the behest of our suppliers to analyse the server and try to find the root cause. He stupidly booted it into multi-user mode rather than single-user mode while all the networking was still attached at the back. The server then grabbed the SAN back from the failover, knocking all the users off again and, yes, corrupting the DB mid-write.

Another half day passed before we were switched back to the failover, having had to fsck and repair each disk area and, yes, repair the DB again.

The engineer came in again the next day and disconnected the failed server before giving it an overhaul. He was completely unable to find any issues, but said it looked like a blip and should be fine...

We switched back to live (why did we do that? I don't know).

Three hours later it kernel panicked again, totally screwing the DB, and this time we had to restore from tape.

At this point we discovered that the hardware IDs associated with the kernel panic related to the fiber controller. Further research highlighted a compatibility issue between the server's OS and the fiber controller! Doh!

Each server, live and failover, now has its own internal disk array, and the system is configured to synchronise overnight.

In total this issue resulted in limited to no service for a period of 3 days and a total of 6 hours of data lost permanently.

Lessons learnt:
Never allow a single point of failure. (In this instance our SAN was a SPOF: corruption caused by the main server crashing caused issues when switching to the failover.)

Always, ALWAYS ensure that any hardware/software purchased for any business machine is compatible with (and supported by) every other piece of hardware/software in the solution.

Ben

SQL-939141
It seems to me that 'multiple failure' is a basic problem in all walks of life, not just database issues.

I missed the meeting because (pick any three):
I got up late
Lost the car keys
Car wouldn't start
Accident on the motorway
Went to the wrong office
Had the wrong time for the meeting
Child was ill and had to be taken to doctor
etc etc.

So, in real life, and in the database world, there are many scenarios that we can plan for, and some combinations. Often we'll do this 'seat of the pants', but depending on the price of failure, we will enumerate the possibilities and mitigations more carefully. But there could always be some set of circumstances we haven't thought of, or which are just too expensive to mitigate given the price and probability, which will lead to failure.

As database professionals we need to have enumerated the likely failures, mitigations and costs (of mitigation and failure) and put appropriate things in place (and tested at least some of them) and have a coherent plan that looks reasonable and defensible before, and hopefully after, a disaster.
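
To make the "enumerate failures, mitigations and costs" idea concrete, here is a minimal sketch in Python. The scenarios, probabilities and costs are entirely made-up illustrative numbers, and `FailureScenario` and `rank_scenarios` are hypothetical names, not from any real tool. It simply compares the estimated yearly cost of each mitigation with the expected yearly loss (probability times cost of failure).

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    name: str
    annual_probability: float  # estimated chance of this happening in a year
    cost_of_failure: float     # estimated loss if it happens unmitigated
    cost_of_mitigation: float  # estimated yearly cost of the mitigation

    @property
    def expected_loss(self) -> float:
        # Expected yearly loss = probability x cost of failure.
        return self.annual_probability * self.cost_of_failure

    @property
    def worth_mitigating(self) -> bool:
        # Crude rule of thumb: mitigate when it costs less than the expected loss.
        return self.cost_of_mitigation < self.expected_loss


def rank_scenarios(scenarios):
    """Sort scenarios by expected yearly loss, largest first."""
    return sorted(scenarios, key=lambda s: s.expected_loss, reverse=True)


# Hypothetical figures, for illustration only.
scenarios = [
    FailureScenario("single disk failure", 0.25, 4_000, 500),
    FailureScenario("second disk fails during RAID rebuild", 0.02, 100_000, 1_500),
    FailureScenario("site loss plus unreadable backup tape", 0.001, 500_000, 20_000),
]

for s in rank_scenarios(scenarios):
    verdict = "mitigate" if s.worth_mitigating else "accept the risk (or find a cheaper mitigation)"
    print(f"{s.name}: expected loss {s.expected_loss:,.0f}/year -> {verdict}")
```

With these invented numbers the rare combined failure comes out as too expensive to mitigate for its price and probability, which is exactly the kind of case that should be a documented, defensible decision rather than an oversight.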
aphillippe
Yeah, in my experience, a lot of failures are due to multiple causes. I guess this is to be expected as testing multiple factors adequately is always orders of magnitude more complicated than testing one thing at a time.
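
As a rough illustration of why testing combinations gets out of hand, this small Python sketch counts how many distinct sets of factors would need testing if up to k of n things can go wrong together. The figure of 20 factors is an arbitrary assumption, and `combinations_to_test` is a hypothetical helper name.

```python
from math import comb


def combinations_to_test(n_factors: int, max_simultaneous: int) -> int:
    """Count distinct sets of 1..max_simultaneous factors failing together."""
    return sum(comb(n_factors, k) for k in range(1, max_simultaneous + 1))


# Assume 20 independent things that could go wrong (an arbitrary figure).
for k in (1, 2, 3):
    print(f"up to {k} at once: {combinations_to_test(20, k)} cases")
# up to 1 at once: 20 cases
# up to 2 at once: 210 cases
# up to 3 at once: 1350 cases
```

This only counts which factors fail together; it ignores ordering and timing, so real test effort would grow faster still.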

On a slightly related topic, I seem to attract system-related coincidences on some sort of superhuman level. Things like:
- data warehouse import fails due to corrupt data entering the system
- switch off offending dimension's import (as it wasn't being used by the business at that time)
- spend weeks/months clearing out the data, risk reviewing the changes to live data (with senior management signoff etc)
- test import of offending dimension with copies of live data
- finally switch import of dimension back on some months later
- that very night, a completely unrelated issue appears in the table data and the import fails again

So we think it's the old issue again, start down that path investigating, and take a very long time to work out it's a whole new issue. This seems to happen time and time again to me. I must be jinxed.
BenWard
Believe it or not, there's a verse in the Bible that mentions this... (yes, the Bible is applicable to databases)

Ecclesiastes 9:11 states: "time and unforeseen occurrence befall us all"

Don't think anyone will argue with that, lol.

Major issues nearly always involve more than one problem at a time. If problems occur one at a time, we, as DBAs, take them in our stride and just deal with them as a matter of course - we call it BAU (business as usual) support. Perhaps an import job fails overnight and at 9 am the business can't get report X. We rerun the import manually and at 9.15 am the report is there. Happy days.
If this happens at the same time as a disk failure, the server is slowed down by rebuilding the RAID array onto a new drive and the report finishes at 9.45 instead of 9.15 - too late for the manager to take it to their bimonthly board review. And that's assuming the report doesn't slow the system down so much that the end users are nagging us to can the report because it is affecting their productivity.

Either one of those issues is BAU on its own; when they occur at the same time, however, they result in many people being cross - ourselves included.

Ben

djackson 22568
Our storage engineer recently noticed a bad drive on a server. It is the OS drive, and is RAID 1. It is used in a major clinical system.

As he pulled the drive to replace it, the mirrored drive failed.

He was somewhat displeased.

He had a deadline for another project that day. What should have been a 15-minute swap at most took the entire day, and he was still unable to recover what he needed.

Oh, the other project, deadline missed.

Dave
Eric M Russell
Sometimes one thing has to break before we discover that something else has been broken for a long time. For example, it takes a hurricane to reveal the fact that the levees were improperly built or maintained.


"The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
Ron Porter
I think that multiple failures may actually be more common than single ones. As others have pointed out, there are a number of reasons why multiple failures may not be as uncommon as we might think.

I think that one of those reasons is that we're doing everything we can to eliminate single points of failure. Whatever effect that has on overall system stability and rates of failure, it does mean that multiple failures are nearly the only thing left that can keep life interesting.
blandry
What you are really talking about is called "cascading failures". During my days in the Air Force a great deal of time was spent on this phenomenon. In modern jet aircraft (as in computers, servers, etc.), when one system fails it can put extra load on another, or cause failures in other components that are dependent upon the initial component.

The strangest aspect of cascading failures is that they are extremely common, not only in machines but in the human body as well, and yet we tend to think that one thing fails most of the time - and we create plans around that scenario - only to find that in actuality, when one thing breaks, others follow.

Any good recovery scenario should plan for cascading failures, NOT single instance failures. In my experience in this business, companies that plan for one thing to break usually wind up lost when a batch of things go down. On the other hand, companies that plan for cascading failures are much better positioned to recover and maintain productivity.

The lesson? Plan for the more common instance - not the rarer instance.
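
One way to start planning for cascades is to write down what depends on what and walk the graph from each possible first failure. The Python sketch below does that with a breadth-first search. The dependency map is hypothetical, and it makes the simplifying assumption that a component fails as soon as anything it depends on fails, which ignores redundancy and load effects.

```python
from collections import deque


def cascade(dependents, initial_failure):
    """Return every component that fails once initial_failure fails.

    dependents maps a component to the components that rely on it.
    Simplifying assumption: a component fails if anything it relies on fails.
    """
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        component = queue.popleft()
        for dependent in dependents.get(component, ()):
            if dependent not in failed:
                failed.add(dependent)
                queue.append(dependent)
    return failed


# Hypothetical dependency map, for illustration only.
dependents = {
    "fiber controller": ["SAN"],
    "SAN": ["primary server", "failover server"],
    "primary server": ["database"],
    "failover server": ["database"],
    "database": ["order entry", "reports"],
}

for first in ("reports", "SAN"):
    print(f"{first} fails -> {sorted(cascade(dependents, first))}")
```

Running it for every component gives a quick list of which single failures stay single and which ones take half the system with them, so the recovery plan can be written for the latter.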

There's no such thing as dumb questions, only poorly thought-out answers...