13 Disasters

  • This is an interesting list of disasters that can befall a production system. It covers more than database servers, but all of these can certainly happen to them as well.

    The list is pretty good, and I've had more than a few of these happen to me. There were a few that I've never seen, and it makes me worry a bit that this guy doesn't have a good staff or set of vendors working with him. But one of these is very interesting.

    The second item on the list, a controller going bad and corrupting disks, is an interesting one in today's world. What would you do if this happened on your SAN? Actually I know what you'd do. First you'd be in denial, and I don't mean the river in Africa.

    Then you'd tell the SAN guys. They wouldn't believe you. You'd argue and they'd check things, then they'd wonder how it could happen and why. You'd scream at them to fix it, probably using four-letter, not three-letter, words. They'd start looking for ways to recover data and your hair would be slowly thinning as upper managers started calling down looking for answers.

    It shouldn't happen, but it could. Some marketing VP is showing off that nice piece of SAN equipment with its high-speed switches, dozens of drives, and lots of colored wires. He sloshes some coffee onto the system, freezes, but when the flashing red lights and siren from the Enterprise don't go off (Red Alert!), he continues his tour while the controller scribbles on your disks.

    Restoring from backup might be easy. Of course if you have a system like some I've seen that shares physical disks among multiple LUNs, it might not.

    Disaster Recovery is rarely the Hurricane Katrina type of issue. Usually it's something like a disk drive, RAID controller, cut wire, etc. that you have to deal with. So think about all of the minor disasters that you want to be sure you can handle and get some practice in on those. Make sure you have spare parts, you can rebuild a server (QA machines are handy for this), and you know how to perform a restore.

    And most of all, be sure that you know where the backup files are stored. Preferably on different disks than the production data.
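    As a rough illustration of that last point, here is a minimal Python sketch that checks whether a backup directory sits on the same filesystem device as the data directory. The function names and the paths you pass in are hypothetical, not from any real tool. Note that a different device ID does not prove different physical disks: LUNs carved from the same spindles on a SAN will still look separate to the operating system, so treat this as a first sanity check only.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True when both paths report the same filesystem device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def check_backup_location(data_dir: str, backup_dir: str) -> str:
    """Describe whether backups appear to share a device with the data.

    Both arguments are hypothetical example paths supplied by the caller.
    """
    if same_device(data_dir, backup_dir):
        return "WARNING: backups share a device with the production data"
    return "OK: backups are on a different device than the production data"
```

    Calling `check_backup_location()` with your own data and backup directories returns a one-line verdict that can be dropped into a scheduled check.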

  • They left off one very important disaster... and that's what happens when you have normal, healthy database growth and the code written against the database reaches a "tipping point".  Lots harder to fix than most anything listed in the article...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • We've just decided that the office lottery syndicate would be a major risk!! A lady in Scotland has just won £35 million and the comment was that even a smaller win shared between us would be enough for most people to walk out!!

    That leaves about two people who aren't in the syndicate to run the place.

  • Back in the mid-90s when SANs were new and mainly used just for mainframes, a major bank had a very little problem with one of their SANs. A microcode upgrade caused it to write binary zeroes over track 1 of all their LUNs. Very little user data was lost, just the MBR and partition table...

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Number 14 is where the Network technician drops the server on the floor. The effects of that gravity test may not show up right away, but I'm pretty sure it voids the warranty.

    ------------
    Buy the ticket, take the ride. -- Hunter S. Thompson

  • LOL, dropping the server is good.

    We've tended to use multiple people to move servers now that most of us are past the 20-something stage. Felt embarrassed to use a cart to move in the new server a couple years ago for this site, but discretion being the better part of impressing

  • RE: Disaster planning for hardware -- most small to medium-sized shops I've been in lately blow their whole wad on production hardware and the other environments pale in comparison to the production hardware capacity (if the alleged development, integration, QA and/or staging environments even exist). The "shadow" environments couldn't shoulder the load even if they had a documented procedure for "failing back" to one of them from production. I have a current client with major SAN issues that they originally attributed to SQL Server without ever checking to see what their queue depth or disk i/o latency was on the pre-production staging environment. [But, of course, the hardware isn't representative of the planned production environment, so we'll be doing the same goat rodeo all over again in six months.]

    RE: Code Tipping - I saw one recently that was misidentified as a performance issue because they were reduced to mere records per second instead of the 10-15k records/second that they should've been getting. When I drew the Magic X with my Sharpie across the statement inside one stored procedure that was killing the ETL throughput, the developer noticed a data corruption bug which was causing a ~600M row OUTER JOIN because he was missing part of the ON clause filter. Too many rows were being inserted into a fact table for each inbound XML tax record (all of them, in fact). And had been since last May (over six months) in production. A data quality fire drill immediately ensued for this large government taxing authority. Doh!
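    To show the kind of fan-out a partial ON clause can cause, here is a small sketch using Python's built-in `sqlite3` module. The tables, columns, and data (`inbound`, `rate`, `tax_year`, `region`) are made up for illustration and are not the client's schema. The join should match on both `tax_year` and `region`; dropping the `region` predicate makes every inbound row match every rate row for that year.

```python
import sqlite3

# Hypothetical schema and data, invented purely for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inbound (record_id INTEGER, tax_year INTEGER, region TEXT);
    CREATE TABLE rate (tax_year INTEGER, region TEXT, pct REAL);
""")
conn.executemany(
    "INSERT INTO inbound VALUES (?, ?, ?)",
    [(1, 2023, "A"), (2, 2023, "B"), (3, 2023, "C")],
)
conn.executemany(
    "INSERT INTO rate VALUES (?, ?, ?)",
    [(2023, "A", 1.0), (2023, "B", 2.0), (2023, "C", 3.0)],
)


def count_joined_rows(connection, on_clause):
    """Count the rows produced by an outer join using the given ON clause."""
    sql = (
        "SELECT COUNT(*) FROM inbound AS i "
        f"LEFT OUTER JOIN rate AS r ON {on_clause}"
    )
    return connection.execute(sql).fetchone()[0]


# Missing the region predicate: each inbound row matches all three rates.
buggy_rows = count_joined_rows(conn, "i.tax_year = r.tax_year")

# Full ON clause: each inbound row matches exactly one rate.
correct_rows = count_joined_rows(
    conn, "i.tax_year = r.tax_year AND i.region = r.region"
)

print(buggy_rows, correct_rows)  # prints "9 3"
```

    Comparing the joined row count against the inbound row count, as the last lines do, is a cheap check that would flag this sort of bug long before it reaches hundreds of millions of rows.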

    RE: Gravity - I've seen an unwise IT monkey showing off one of his shiny new 10K SATA drives for the homebrew RAID (a couple years back when they were $1,000 each) by tossing it from hand to hand while discussing its many and varied benefits. Before he dropped it. Fortunately for the DBAs, the drive was DOA at that point... I shudder to think what might have happened if it were a sturdier FC drive that just got a little "bent" in the dropping but didn't fail outright... "Hey, y'all! Watch this!! Whoops!!!"

  • P Jones touches on one thing I think is overlooked a lot in DR planning. What would happen if key personnel were suddenly gone? And I am not talking an orderly two-week notice situation. The lottery causing X number of people to just quit. Or the sole DBA for the company being run over by the bus on the way home from work.

    What if the management team dies in the disaster? Are there plans for who can make decisions in lieu of the normal decision makers?

  • Anders brings up a good point, and one I'm addressing now. What happens if Steve leaves? (I'm not planning on it.)

    Everyone should have a backup, but I'm not sure you should worry about more than that. Unlikely that more than 1-2 people would leave at once, and it's too hard to plan for that. If you have critical people, you might institute things like not having them travel together.

  • Yes, Anders, you're absolutely right about Bus Number... but since most places don't have plans for normal disasters (or business continuity), the likelihood of them planning for "Acts of God" (such as winning the Lotto) is nil.

  • I quit a job for just this reason. Nobody would be my backup. If I went on vacation everything piled up until I got back. "Oh we haven't been able to get e-mail for a week now. We were waiting for you to come back. How long will it take you to fix it?"

    I told the VP and president that things would not change as long as I was there.

    ATB, Charles Kincaid

  • I remember some IBM drives on 360/370 machines.  If the drive was switched off the head would retract.  Unfortunately the write amplifiers were not disabled first.  This would result in a deadly "spiral write".

    I do mean deadly.  There was one pack that had such strong inter-track spikes that it would not properly format after ten tries.

    ATB, Charles Kincaid

  • Obligatory "me, too." Well, it wasn't just that nobody would be my backup... It was the work hours and stress that went along with being that focal point. It's nice to be loved, but it got utterly ridiculous. The director of product development should not have operational responsibilities - it's a violation of church and state or something. What are all the IT monkeys in Ops for if I have to come in off the golf course because the system is down?

    Funny thing is that it didn't change after I left, either. They just found a new single point of failure... "The Human Element" must be kryptonite.

  • Hmmm... is it just coincidence that the number "13" and the word "Disasters" are so close together like this?

  • Egad. You're not one of those triskaidekaphobiacs are you? I'm still mortified and embarrassed that we're skipping Office 13 "just because" (not because we're superstitious, honest!). From Office 12 straight to Office 14. Geez.

    Maybe I'm biased because I was born on Friday the 13th... and I turned out just fine!
