The Chance of Failure


Steve Jones
SSC-Dedicated (36K reputation)

Group: Administrators
Points: 36039 Visits: 18734
Comments posted to this topic are about the item The Chance of Failure

Follow me on Twitter: @way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com
martin catherall
Old Hand (397 reputation)

Group: General Forum Members
Points: 397 Visits: 342
Good article Steve.

I live in Christchurch, New Zealand, and I have to admit that I thought the possibility of losing an entire data center here was extremely remote. However, recent events (two major earthquakes in 6 months) have unfortunately seen this risk realized.
I wrote two blog posts about this recently:

http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-one.aspx

http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-two.aspx



Raju Lalvani
Grasshopper (23 reputation)

Group: General Forum Members
Points: 23 Visits: 154
Large disasters are one thing, but programmers are an optimistic lot who rarely check for error conditions, even ones as trivial as the following:

If Flag = '1'
    do
    .....
Else  /* assume Flag is '2' */
    ....
End If

Rather than


If Flag = '1'
    do
    .....
ElseIf Flag = '2'
    ....
Else
    Error in Program
    ...
End If
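
As a rough illustration of the second pattern, here is a minimal Python sketch. The function name `handle_flag` and the returned strings are hypothetical, chosen only for this example; the point is that an unexpected flag value raises an error instead of silently falling into the '2' branch.

```python
def handle_flag(flag: str) -> str:
    """Dispatch on a flag value and reject anything unexpected.

    The return values are placeholders standing in for the real work
    each branch would do.
    """
    if flag == "1":
        return "handled flag 1"
    elif flag == "2":
        return "handled flag 2"
    else:
        # Do not assume the flag "must" be '2'; fail loudly instead.
        raise ValueError(f"unexpected flag value: {flag!r}")
```

With this shape, a bad value such as '3' or an empty string surfaces immediately as a `ValueError` at the point of the mistake rather than as wrong data later on.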
SALIM ALI
Old Hand (370 reputation)

Group: General Forum Members
Points: 370 Visits: 246
I think Dinosaur Comics summed up failure best with
"Failure: it's just success rounded down",
which is a healthy way to look at it.



blandry
Old Hand (353 reputation)

Group: General Forum Members
Points: 353 Visits: 723
"Hope for the best, plan for the worst..."

Generally good advice if you are talking about sizable companies, data centers and the like, but in one's personal life (if you are a techie) this can get out of hand.

I have always had a network in my home going as far back as the early 80's, and I have always gone to great lengths backing up lots of data. Now, some 30 years later, I have multiple external drives and two very old machines that I use solely for backing things up. Problem is, I am still (to this day) backing up some stuff that must be close to 30 years old and though I constantly promise myself that someday I will review what I am backing up, I never seem to get the energy to go through that task - and instead, I just buy yet another big external drive and keep backing up.

Why do I still save program code from languages now dead? Why do I continue to hold onto Windows fonts that date back (I think) to Windows 3.11? I have close to 20,000 digital images and I still back them up, seemingly sure that one day I will go through them and actually throw away the pictures I inadvertently took of my own feet.

Sure, hope for the best, plan for the worst... But sometimes the worst is self-inflicted, and while I have spent decades ensuring that my "main machines" are clean and up to date, I now have more space occupied in backups than I do in actively used data.

And why didn't I get rid of stuff as the years passed? Well, though I am fairly sure that DOS-based Ryan/McFarland COBOL is not going to make a comeback these days - if it does (!!!), I will be ready with the programs I wrote for it back when I didn't have grey hair and wasn't a member in good standing of AARP...

There's no such thing as dumb questions, only poorly thought-out answers...
jay-h
SSC Eights! (919 reputation)

Group: General Forum Members
Points: 919 Visits: 2220
Google has the right idea. Failure WILL happen, and the issue is to design around it. Their approach is resilience.

Unfortunately this reality has not fully found its way into the IT world. Failure should be considered an unfortunate byproduct of system operation and handled accordingly.

...

-- FORTRAN manual for Xerox Computers --
webrunner
Hall of Fame (3K reputation)

Group: General Forum Members
Points: 3027 Visits: 3747
Thanks for the good reminder, Steve, and the links.

This may be covered in the links (haven't read them yet), but I recall reading about a similar topic, where it was mentioned that large companies such as Google invest in a huge amount of what they call commodity hardware -- decent but not top-of-the-line equipment that is designed to be replaced when it fails, rather than putting all the eggs in one basket with a single high-end setup. It still requires a large budget, but it does make sense from a disaster mitigation point of view.

Also -- and I am by no means a math expert, so please excuse any errors -- my understanding of the failure rates of systems is that the chance of a failure on any given day for modern equipment may be quite small, but the chance of failure over a longer time is almost certain. So it's only a matter of time before parts fail. I guess it's only a short step from that realization to the realization that if you don't really know when the part will fail, you always need to be ready for it to fail and have a backup plan to handle it.
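
To put rough numbers on that intuition, here is a small Python sketch. It assumes, purely for illustration, that each day carries the same independent chance of failure, which real hardware does not strictly obey; the function name and the 0.1% daily figure are made up for the example. Under that assumption the chance of at least one failure in n days is 1 - (1 - p)^n.

```python
def chance_of_failure(daily_probability: float, days: int) -> float:
    """Probability of at least one failure over `days` days.

    Assumes every day is independent and has the same failure
    probability, which is a simplification of real failure rates.
    """
    if not 0.0 <= daily_probability <= 1.0:
        raise ValueError("daily_probability must be between 0 and 1")
    if days < 0:
        raise ValueError("days must not be negative")
    # Chance of surviving every single day, subtracted from certainty.
    return 1.0 - (1.0 - daily_probability) ** days


one_year = chance_of_failure(0.001, 365)
ten_years = chance_of_failure(0.001, 3650)
```

With a hypothetical 0.1% daily chance, this works out to roughly 31% over one year and above 97% over ten years, which matches the "only a matter of time" conclusion.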


Just my two cents....

- webrunner

-------------------
"I love spending twice as long and working twice as hard to get half as much done!" – Nobody ever.
Ref.: http://www.adminarsenal.com/admin-arsenal-blog/powershell-how-to-write-your-first-powershell-script

"Operator! Give me the number for 911!" - Homer Simpson

"A SQL query walks into a bar and sees two tables. He walks up to them and says 'Can I join you?'"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
Nadrek
Ten Centuries (1K reputation)

Group: General Forum Members
Points: 1029 Visits: 2673
For those with RAID, how many actually run consistency checks on a regular basis to detect single-drive corruption?

For single drives, how often is an online disk check (much less an offline disk check) done?

For data, how many people actually generate checksum data (if you use Zip, raise your hand) to detect some corruption*, much less ECC data (par2 or ICE or DVDisaster, etc.) to detect and correct some corruption*?

*I almost never see real corruption; when I actually check out the typical "I can't explain this issue [optional: but a reinstall fixed it]; some file must have gone corrupt", it shows no corruption at all. More commonly the cause is a configuration change of some sort.
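
For the checksum part of the question above, a minimal Python sketch of generating and later verifying a checksum might look like the following. The function names, the file name and the choice of SHA-256 are assumptions made for this example; it only detects corruption and does not correct it the way par2-style ECC data can.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str) -> bool:
    """Compare the file's current checksum with a recorded one."""
    return sha256_of(path) == expected


def demo() -> tuple[bool, bool]:
    """Record a checksum, then simulate corruption and re-check."""
    with tempfile.TemporaryDirectory() as folder:
        target = Path(folder) / "backup.bin"  # hypothetical backup file
        target.write_bytes(b"important data")
        recorded = sha256_of(target)
        before = is_intact(target, recorded)
        target.write_bytes(b"important dat\x00")  # one byte changed
        after = is_intact(target, recorded)
    return before, after
```

In practice the recorded digest would be stored alongside (or separately from) the backup and compared on a schedule, so that a silently changed file is noticed before it is needed.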

I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.
Eric M Russell
SSCarpal Tunnel (4.6K reputation)

Group: General Forum Members
Points: 4575 Visits: 9500
Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would be lost forever with no backup copy.


"The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
GAF
SSC Rookie (36 reputation)

Group: General Forum Members
Points: 36 Visits: 337
You left out the 3rd rule...

Hope for the best

Plan/Prepare for the worst

Always, always, always have a plan B!!!