The Chance of Failure

Steve Jones - SSC Editor

SSC Guru

Points: 734564
March 30, 2011 at 10:19 pm

#148795

Comments posted to this topic are about the item The Chance of Failure

Martin Catherall Right there with Babe Points: 746 · Answer 1

Good article Steve.

I live in Christchurch, New Zealand, and I have to admit that I thought the possibility of losing an entire center here was extremely remote. However, recent events (two major earthquakes in 6 months) have unfortunately seen this risk realized.

I wrote two blog posts about this recently:

http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-one.aspx

http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-two.aspx

Raju Lalvani Say Hey Kid Points: 695 · Answer 2

Large disasters are one thing, but programmers are an optimistic lot who rarely check for error conditions, even ones as trivial as the following:

If Flag = '1'
    do
    .....
else /* assume flag is '2' */
    ....
end if

Rather than

If Flag = '1'
    do
    .....
elseif Flag = '2'
    ....
else
    Error in Program
    ...
end if
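As a rough illustration of the second pattern, here is a minimal Python sketch. The function name and the return strings are made up for the example; the only point is that an unexpected flag value fails loudly instead of being silently treated as '2'.

```python
def handle_flag(flag: str) -> str:
    """Dispatch on a flag value and reject anything unexpected.

    The name and return values are hypothetical, for illustration only.
    """
    if flag == "1":
        return "handled case 1"
    elif flag == "2":
        return "handled case 2"
    else:
        # Do not assume the flag must be '2'; report the bad value.
        raise ValueError(f"unexpected flag value: {flag!r}")


if __name__ == "__main__":
    print(handle_flag("1"))
    print(handle_flag("2"))
    try:
        handle_flag("3")
    except ValueError as exc:
        print(f"caught: {exc}")
```

The extra branch costs three lines and turns a silent wrong answer into an error you can see in a log.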

SALIM ALI Right there with Babe Points: 799 · Answer 3

I think Dinosaur comics summed up failure best with

Failure: it's just success rounded down

which is a healthy way to look at it.

blandry SSCarpal Tunnel Points: 4821 · Answer 4

"Hope for the best, plan for the worst..."

Generally good advice if you are talking about sizable companies, data centers and the like, but in one's personal life (if you are a techie) this can get out of hand.

I have always had a network in my home going as far back as the early 80's, and I have always gone to great lengths backing up lots of data. Now, some 30 years later, I have multiple external drives and two very old machines that I use solely for backing things up. Problem is, I am still (to this day) backing up some stuff that must be close to 30 years old and though I constantly promise myself that someday I will review what I am backing up, I never seem to get the energy to go through that task - and instead, I just buy yet another big external drive and keep backing up.

Why do I still save program code from languages now dead? Why do I continue to hold onto Windows fonts that date back (I think) to Windows 3.11? I have close to 20,000 digital images and I still back them up, seemingly sure that one day I will go through them and actually throw away the pictures I inadvertently took of my own feet.

Sure, hope for the best, plan for the worst... But sometimes the worst is self-inflicted, and while I have spent decades ensuring that my "main machines" are clean and up to date, I now have more space occupied in backups than I do in actively used data.

And why didn't I get rid of stuff as the years passed? Well, though I am fairly sure that DOS-based Ryan/McFarland COBOL is not going to make a comeback these days - if it does (!!!), I will be ready with the programs I wrote for it back when I didn't have grey hair and wasn't a member in good standing of AARP...

There's no such thing as dumb questions, only poorly thought-out answers...

jay-h SSCoach Points: 18816 · Answer 5

Google has the right idea. Failure WILL happen, and the issue is to design around it. Their approach is resilience.

Unfortunately this reality has not fully found its way into the IT world. Failure should be considered an unfortunate byproduct of system operation and handled accordingly.

...

-- FORTRAN manual for Xerox Computers --

webrunner SSC-Dedicated Points: 31731 · Answer 6

Thanks for the good reminder, Steve, and the links.

This may be covered in the links (haven't read them yet), but I recall reading a similar topic where it mentioned that large companies such as Google invest in a huge amount of what they call commodity hardware -- decent but not top of the line equipment that is designed to be replaced when it fails, rather than a system where all the eggs are in one basket of a single high-end setup. Still requires a large budget, but it does make sense from a disaster mitigation point of view.

Also -- and I am by no means a math expert, so please excuse any errors -- my understanding of the failure rates of systems is that the chance of a failure on any given day for modern equipment may be quite small, but the chance of failure over a longer time is almost certain. So it's only a matter of time before parts fail. I guess it's only a short step from that realization to the realization that if you don't really know when the part will fail, you always need to be ready for it to fail and have a backup plan to handle it.
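To put rough numbers on that intuition: if a part fails on any given day with a small probability p, and the days are treated as independent (a simplifying assumption, since real hardware failure rates change with age), the chance of at least one failure in n days is 1 - (1 - p)^n. The sketch below uses a made-up daily probability of 0.1% purely for illustration; it is not a measured figure.

```python
def prob_at_least_one_failure(daily_failure_prob: float, days: int) -> float:
    """Chance of one or more failures over `days`, assuming independent days."""
    return 1.0 - (1.0 - daily_failure_prob) ** days


if __name__ == "__main__":
    p = 0.001  # hypothetical 0.1% chance of failure per day
    for days in (1, 30, 365, 3650):
        chance = prob_at_least_one_failure(p, days)
        print(f"{days:>5} days: {chance:.1%}")
```

With that assumed p, the chance is about 30% over one year and well above 95% over ten years, which matches the "only a matter of time" conclusion.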

Just my two cents....

- webrunner

-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

Nadrek SSC-Insane Points: 20039 · Answer 7

For those with RAID; how many actually run consistency checks on a regular basis, to detect single drive corruption?

For single drives, how often is an online disk check (much less an offline disk check) done?

For data, how many people actually generate checksum data (if you use Zip, raise your hand) to detect some corruption*, much less ECC data (par2 or ICE or DVDisaster, etc.) to detect and correct some corruption*?

*I almost never see real corruption; the typical "I can't explain this issue [optional: but a reinstall fixed it]; some file must have gone corrupt" case that I actually check out shows no corruption at all; more commonly the cause is a configuration change of some kind.
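For the checksum part, a minimal sketch using Python's standard hashlib module might look like the following. The helper names are made up for the example. Note that a checksum like this only detects a change; it cannot repair anything, which is what ECC tools such as par2 add.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: str, recorded_checksum: str) -> bool:
    """Compare the file's current digest with one recorded earlier."""
    return sha256_of_file(path) == recorded_checksum
```

The idea is to record the digest when the backup is written, then recompute and compare it later; a mismatch means the file has changed since the digest was taken.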

I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.

Eric M Russell SSC Guru Points: 125524 · Answer 8

Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would be lost forever with no backup copy.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

GAF Mr or Mrs. 500 Points: 546 · Answer 9

Left out the 3rd rule...

Hope for the best

Plan/Prepare for the worst

Always, always, always have a plan B!!!

jay-h SSCoach Points: 18816 · Answer 10

Nadrek (3/31/2011)
For those with RAID; how many actually run consistency checks on a regular basis, to detect single drive corruption?
...
I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.

0% downtime is extremely expensive. One needs to be realistic. Most apps can tolerate occasional downtimes of varying degrees, and it's a lot less expensive to evaluate those needs realistically.
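One way to make "realistic" concrete is to convert an availability target into the downtime it allows per year. This is plain arithmetic, sketched below under the assumption of a 365-day year; the function name is made up for the example.

```python
def allowed_downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per 365-day year permitted by an availability target."""
    hours_per_year = 365 * 24
    return hours_per_year * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours_per_year(target)
        print(f"{target}% availability allows {hours:.2f} hours down per year")
```

Each extra "nine" cuts the allowance by a factor of ten, which gives a sense of why the last fraction of a percent gets so expensive.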

Interestingly, we used to have a clustered RAID SQL2000 installation. Every system failure we had was in the RAID control system, which meant that the clustering did us no good whatsoever in the downtime area. (Fortunately, the RAIDs did not lose data during those failures.) We have a mirrored system now on completely separate hardware.

...

-- FORTRAN manual for Xerox Computers --

TravisDBA SSCoach Points: 15780 · Answer 11

Eric M Russell (3/31/2011)
Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would be lost forever with no backup copy.

It might be easier, as you say, but you would be surprised at how many shops do not plan for it and do not even back up their databases on a regular basis. It would astound you. I have gone into shops in the past that had not backed up their system databases in over a year! When I asked why, the response was, "Well, we didn't need to. Everything works just fine, and if it isn't broke we don't fix it..." Absolutely incredible. It is hard to believe there are bozos like this out there in IT, but there are. Just because DR has gotten easier doesn't mean people are doing it. 😀

"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

Eric M Russell SSC Guru Points: 125524 · Answer 12

If Joe's Bike Shop loses all their data without a backup, then that's a personal tragedy for their business, but it's not really a community-wide or region-wide disaster. A government office losing all your tax records, or a corporation permanently losing all your mortgage paperwork, is practically unheard of.

Well... the mortgage company may sit on your escrow account refund for weeks or months claiming they "misplaced the paperwork", but they can find the records at any point if they really want to. That's not an information technology issue.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Tom Bakerman Hall of Fame Points: 3367 · Answer 13

This seems like rehashing old news. I have vague memories from a database class I took in the 80s where we talked about an airline's data center (AA or United or somebody big like that). Like I said, my memory is vague on this, but I believe they talked about a Mean Time Between Failures for the disk farm on the order of 5 minutes (translation: there will be a disk failure about every 5 minutes).

The technology has changed, but the problems are still there and have to be dealt with.
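A figure like that is plausible from simple arithmetic: with many drives failing independently at a constant rate (a simplifying assumption), the farm as a whole sees a failure roughly every (single-drive MTBF / number of drives). The numbers below are hypothetical, chosen only to show the scale involved; they are not the figures from the class.

```python
def fleet_mtbf_minutes(drive_mtbf_hours: float, drive_count: int) -> float:
    """Approximate minutes between failures across a whole farm of drives.

    Assumes independent drives with a constant failure rate.
    """
    return drive_mtbf_hours * 60 / drive_count


if __name__ == "__main__":
    # Hypothetical inputs: 10,000-hour drives, 120,000 of them.
    print(fleet_mtbf_minutes(drive_mtbf_hours=10_000, drive_count=120_000))
```

Under those assumed inputs the farm-wide interval works out to 5 minutes, even though any single drive would be expected to run for more than a year.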

Evil Kraig F SSC Guru Points: 100851 · Answer 14

Tom Bakerman (3/31/2011)
This seems like rehashing old news. I have vague memories from a database class I took in the 80s where we talked about an airline's data center (AA or United or somebody big like that). Like I said, my memory is vague on this, but I believe they talked about a Mean Time Between Failures for the disk farm on the order of 5 minutes (translation: there will be a disk failure about every 5 minutes).
The technology has changed, but the problems are still there and have to be dealt with.

I'm now picturing some poor kid wandering the halls of the data center with a giant shopping cart of new drives in front of him, just slowly meandering down the aisles looking for red lights with this zombie-fied look on his face at 3 in the morning.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

Twitter: @AnyWayDBA