|
|
Good article Steve.
I live in Christchurch, New Zealand, and I have to admit that I thought the possibility of losing an entire center here was extremely remote. However, recent events (two major earthquakes in 6 months) have unfortunately seen this risk realized. I wrote two blog posts about this recently:
http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-one.aspx
http://www.sqlservercentral.com/blogs/martin_catherall/archive/2011/03/16/disaster-recovery-exposure-part-two.aspx
|
|
|
|
|
|
|
Large disasters are one thing, but programmers are an optimistic lot who rarely check for error conditions as trivial as the following:
If Flag = '1' do ..... else /* assume flag is '2' */ .... end if
Rather than
If Flag = '1' do ..... elseif Flag = '2' .... Else Error in Program ... end if
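The same point as a minimal Python sketch. The function name `handle_flag` and the return strings are made up for illustration; the only thing that matters is that the final branch fails loudly instead of assuming the flag must be '2'.

```python
def handle_flag(flag: str) -> str:
    """Dispatch on a flag value and reject anything unexpected."""
    if flag == "1":
        return "handled case 1"
    elif flag == "2":
        return "handled case 2"
    else:
        # Do not assume "not '1'" means "'2'": surface the bad value instead.
        raise ValueError(f"unexpected flag value: {flag!r}")
```

With this shape, a corrupted or newly introduced flag value shows up as an error at the point where it is first seen, rather than silently taking the '2' path.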
|
|
|
|
|
|
|
I think Dinosaur Comics summed up failure best with "Failure: it's just success rounded down," which is a healthy way to look at it.
|
|
|
|
|
|
|
"Hope for the best, plan for the worst..."
Generally good advice if you are talking about sizable companies, data centers and the like, but in one's personal life (if you are a techie) this can get out of hand.
I have always had a network in my home going as far back as the early 80's, and I have always gone to great lengths backing up lots of data. Now, some 30 years later, I have multiple external drives and two very old machines that I use solely for backing things up. Problem is, I am still (to this day) backing up some stuff that must be close to 30 years old and though I constantly promise myself that someday I will review what I am backing up, I never seem to get the energy to go through that task - and instead, I just buy yet another big external drive and keep backing up.
Why do I still save program code from languages now dead? Why do I continue to hold onto Windows fonts that date back (I think) to Windows 3.11? I have close to 20,000 digital images and I still back them up, seemingly sure that one day I will go through them and actually throw away the pictures I inadvertently took of my own feet.
Sure, hope for the best, plan for the worst... But sometimes the worst is self-inflicted, and while I have spent decades ensuring that my "main machines" are clean and up to date, I now have more space occupied in backups than I do in actively used data.
And why didn't I get rid of stuff as the years passed? Well, though I am fairly sure that DOS-based Ryan/McFarland COBOL is not going to make any comeback in these days - if it does (!!!), I will be ready with the programs I wrote for it back when I didn't have grey hair and wasn't a member in good standing of AARP...
There's no such thing as dumb questions, only poorly thought-out answers...
|
|
|
|
|
|
|
Google has the right idea. Failure WILL happen and the issue is to design around it. Their approach is resilience.
Unfortunately this reality has not fully found its way into the IT world. Failure should be considered an unfortunate byproduct of system operation and handled accordingly.
...
-- FORTRAN manual for Xerox Computers --
|
|
|
|
|
|
|
Thanks for the good reminder, Steve, and the links.
This may be covered in the links (haven't read them yet), but I recall reading a similar topic where it mentioned that large companies such as Google invest in a huge amount of what they call commodity hardware -- decent but not top of the line equipment that is designed to be replaced when it fails, rather than a system where all the eggs are in one basket of a single high-end setup. Still requires a large budget, but it does make sense from a disaster mitigation point of view.
Also -- and I am by no means a math expert, so please excuse any errors -- my understanding of the failure rates of systems is that the chance of a failure on any given day for modern equipment may be quite small, but the chance of failure over a longer time is almost certain. So it's only a matter of time before parts fail. I guess it's only a short step from that realization to the realization that if you don't really know when the part will fail, you always need to be ready for it to fail and have a backup plan to handle it.
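To put rough numbers on that, here is a small sketch. It assumes each day is independent and the daily failure probability stays constant, which real hardware does not strictly obey; the function name and the 0.1% daily figure are illustrative, not measured values.

```python
def prob_at_least_one_failure(daily_failure_prob: float, days: int) -> float:
    """Chance of one or more failures over `days`, assuming independent days.

    The part survives all `days` with probability (1 - p) ** days,
    so the chance of at least one failure is the complement of that.
    """
    return 1.0 - (1.0 - daily_failure_prob) ** days


if __name__ == "__main__":
    p = 0.001  # hypothetical 0.1% chance of failure on any given day
    for years in (1, 5, 10):
        chance = prob_at_least_one_failure(p, 365 * years)
        print(f"{years:>2} year(s): {chance:.1%}")
```

Under those assumptions, a 0.1% daily chance works out to about 31% over one year and about 97% over ten years, which matches the "small on any given day, almost certain over time" intuition.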
Just my two cents....
- webrunner
------------------- "The chemistry must be respected." - Walter White
"A SQL query walks into a bar and sees two tables. He walks up to them and says 'Can I join you?'" Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
|
|
|
|
|
|
|
For those with RAID, how many actually run consistency checks on a regular basis, to detect single drive corruption?
For single drives, how often is an online disk check (much less an offline disk check) done?
For data, how many people actually generate checksum data (if you use Zip, raise your hand) to detect some corruption*, much less ECC data (par2 or ICE or DVDisaster, etc.) to detect and correct some corruption*?
*I almost never see real corruption. The typical "I can't explain this issue [optional: but a reinstall fixed it]; some file must have gone corrupt" case that I actually check out shows no corruption at all; more commonly the cause is a configuration change of some stripe.
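As a rough illustration of the checksum idea above, here is a minimal sketch using Python's standard `hashlib`. The helper names (`sha256_of`, `build_manifest`, `changed_files`) are made up for this example. Note that this only detects changes; it cannot repair anything the way ECC tools such as par2 can.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files from `manifest` that are now missing or hash differently."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

The idea would be to save the manifest alongside a backup and rerun `changed_files` later. Any name it returns has either changed or gone missing since the manifest was built; telling a legitimate edit from silent corruption is still up to you.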
I would also note that planning for regular failure is both very expensive and very limiting; most products don't support truly transparent high availability with 0% downtime at all. Big mainframe hardware (and perhaps midrange systems) does; commodity x86 based hardware and software typically doesn't, with a few exceptions.
|
|
|
|
|
|
|
Disaster recovery is actually easier in the modern IT world than it was in times past. One hundred years ago, if the county courthouse burnt to the ground, many of the archived documents would be lost forever with no backup copy.
"Winter Is Coming"
|
|
|
|
|
|
|
Left out the 3rd rule...
Hope for the best
Plan/Prepare for the worst
Always, always, always have a plan B!!!
|
|
|
|