The Chance of Failure
Posted Wednesday, March 30, 2011 10:19 PM


SSC-Dedicated

Comments posted to this topic are about the item The Chance of Failure






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1086606
Posted Wednesday, March 30, 2011 10:35 PM


Old Hand

Good article, Steve.

I live in Christchurch, New Zealand, and I have to admit that I thought the possibility of losing an entire center here was extremely remote. However, recent events (two major earthquakes in six months) have unfortunately seen this risk realized.
I wrote two blog posts about this recently:

http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-one.aspx

http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-two.aspx



Post #1086608
Posted Thursday, March 31, 2011 1:31 AM
Grasshopper

Large disasters are one thing, but programmers are an optimistic lot who rarely check for error conditions, even ones as trivial as the following:

If Flag = '1'
    do
    .....
else /* assume flag is '2' */
    ....
end if

Rather than

If Flag = '1'
    do
    .....
elseif Flag = '2'
    ....
else
    Error in Program
    ...
end if
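For anyone who wants the second form in a real language, here is a minimal Python sketch. The function name and the return strings are made up for illustration; the point is only that the final branch fails loudly instead of assuming the flag must be '2'.

```python
def handle_flag(flag: str) -> str:
    """Dispatch on a flag, treating any unexpected value as an error."""
    if flag == "1":
        return "handled case 1"
    elif flag == "2":
        return "handled case 2"
    else:
        # Anything else is bad data or a bug upstream; report it
        # rather than silently falling through to the '2' logic.
        raise ValueError(f"unexpected flag value: {flag!r}")
```

With this shape, a flag of '3' (or an empty string, or None) surfaces immediately as a ValueError instead of being processed as if it were '2'.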
Post #1086664
Posted Thursday, March 31, 2011 5:36 AM


Old Hand

I think Dinosaur Comics summed up failure best with
"Failure: it's just success rounded down"
which is a healthy way to look at it.



Post #1086767
Posted Thursday, March 31, 2011 6:15 AM


Old Hand

"Hope for the best, plan for the worst..."

Generally good advice if you are talking about sizable companies, data centers and the like, but in one's personal life (if you are a techie) this can get out of hand.

I have always had a network in my home, going as far back as the early '80s, and I have always gone to great lengths backing up lots of data. Now, some 30 years later, I have multiple external drives and two very old machines that I use solely for backing things up. Problem is, I am still (to this day) backing up some stuff that must be close to 30 years old. Though I constantly promise myself that someday I will review what I am backing up, I never seem to get the energy to go through that task. Instead, I just buy yet another big external drive and keep backing up.

Why do I still save program code from languages now dead? Why do I continue to hold onto Windows fonts that date back (I think) to Windows 3.11? I have close to 20,000 digital images and I still back them up, seemingly sure that one day I will go through them and actually throw away the pictures I inadvertently took of my own feet.

Sure, hope for the best, plan for the worst... But sometimes the worst is self-inflicted, and while I have spent decades ensuring that my "main machines" are clean and up to date, I now have more space occupied in backups than I do in actively used data.

And why didn't I get rid of stuff as the years passed? Well, though I am fairly sure that DOS-based Ryan/McFarland COBOL is not going to make any comeback these days - if it does (!!!), I will be ready with the programs I wrote for it back when I didn't have grey hair and wasn't a member in good standing of AARP...


There's no such thing as dumb questions, only poorly thought-out answers...
Post #1086778
Posted Thursday, March 31, 2011 6:55 AM
Right there with Babe

Google has the right idea. Failure WILL happen, and the issue is to design around it. Their approach is resilience.

Unfortunately this reality has not fully found its way into the IT world. Failure should be considered an unfortunate byproduct of system operation and handled accordingly.


...

-- FORTRAN manual for Xerox Computers --
Post #1086801
Posted Thursday, March 31, 2011 7:40 AM


SSCrazy

Thanks for the good reminder, Steve, and the links.

This may be covered in the links (haven't read them yet), but I recall reading a similar topic where it mentioned that large companies such as Google invest in a huge amount of what they call commodity hardware -- decent but not top-of-the-line equipment that is designed to be replaced when it fails, rather than a system where all the eggs are in one basket of a single high-end setup. It still requires a large budget, but it does make sense from a disaster mitigation point of view.

Also -- and I am by no means a math expert, so please excuse any errors -- my understanding of the failure rates of systems is that the chance of a failure on any given day for modern equipment may be quite small, but the chance of failure over a longer time is almost certain. So it's only a matter of time before parts fail. I guess it's only a short step from that realization to the realization that if you don't really know when the part will fail, you always need to be ready for it to fail and have a backup plan to handle it.
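To put rough numbers on that idea, here is a small Python sketch. It assumes (as a simplification) that each day's chance of failure is independent and constant, in which case the chance of at least one failure over n days is 1 - (1 - p)^n. The function name and the 0.1% daily figure are hypothetical, chosen only for illustration.

```python
def chance_of_any_failure(daily_failure_probability: float, days: int) -> float:
    """Probability of at least one failure over `days` days.

    Assumes independent days with a constant failure probability,
    which real hardware only approximates.
    """
    chance_of_surviving_every_day = (1.0 - daily_failure_probability) ** days
    return 1.0 - chance_of_surviving_every_day


# Hypothetical part with a 0.1% chance of failing on any given day.
one_year = chance_of_any_failure(0.001, 365)
five_years = chance_of_any_failure(0.001, 5 * 365)
print(f"one year: {one_year:.1%}, five years: {five_years:.1%}")
```

Under those assumptions, a part that looks very safe on any single day has about a 31% chance of failing within a year and about an 84% chance within five years.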


Just my two cents....

- webrunner


-------------------
"Operator! Give me the number for 911!" - Homer Simpson

"A SQL query walks into a bar and sees two tables. He walks up to them and says 'Can I join you?'"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
Post #1086841
Posted Thursday, March 31, 2011 7:59 AM
SSC Eights!

For those with RAID: how many of you actually run consistency checks on a regular basis to detect single-drive corruption?

For single drives, how often is an online disk check (much less an offline disk check) done?

For data, how many people actually generate checksum data (if you use Zip, raise your hand) to detect some corruption*, much less ECC data (par2 or ICE or DVDisaster, etc.) to detect and correct some corruption*?

*I almost never see real corruption; the typical "I can't explain this issue [optional: but a reinstall fixed it]; some file must have gone corrupt" cases that I actually check out show no corruption at all; more commonly the cause is a configuration change of some sort.
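As a rough illustration of the checksum idea, here is a minimal Python sketch using the standard hashlib module. The function names are made up for this example, and SHA-256 is just one reasonable choice of hash. Note that this only detects corruption; correcting it needs ECC data from something like the par2-style tools mentioned above.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large backups need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_digest: str) -> bool:
    """True if the file still matches the digest recorded at backup time."""
    return sha256_of_file(path) == expected_digest
```

The idea would be to record the digest when the backup is written, store it separately from the data, and call is_intact on a schedule or before a restore.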

I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.
Post #1086860
Posted Thursday, March 31, 2011 8:07 AM


SSCommitted

Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would have been lost forever with no backup copy.
Post #1086865
Posted Thursday, March 31, 2011 8:17 AM
SSC Rookie

Left out the 3rd rule...

Hope for the best

Plan/Prepare for the worst

Always, always, always have a plan B!!!
Post #1086873