Helpful Advice

  • Free Advice

    I have to be honest with you. It wasn't me.

    This advice on DR wasn't given to anyone by me. It comes from a blog entry posted last year, before Tech Ed. I made a note of it at the time, but it wasn't until recently that I got around to actually writing about it.

    I'm kind of surprised to see some of this advice being given to people. Some of these I don't think are too bad, but would you do any of these on your production database?

    1. "Just run REPAIR_ALLOW_DATA_LOSS and you'll be fine..."
    2. "Just rebuild your transaction log using these steps..."
    3. "Just restore your database and carry on..."
    4. "Run CHECKALLOC, then CHECKDB, then CHECKTABLE on all your tables, then..."
    5. "Just flick the power switch on and off a few times on one of the drives..."

    Actually, I've done #4, and #3 is something I've had to do before as well. I can appreciate Paul's caution about not finding the root cause, but I've had more problems than I'd like to think about where we couldn't find a root cause in a reasonable time and decided to move on. And we never had the issue again. In the interests of getting the business going, there was a time when I explained everything, guesstimated the data loss, and we decided to just restore, lose the data, and have people re-enter it as quickly as possible.
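
    For what it's worth, the restore I'm talking about is just the standard full-plus-log sequence with a STOPAT. A rough sketch is below - the database name, file paths, and timestamp are all made up, and you'd repeat the log step for each log backup in the chain:

        -- restore the last full backup and leave the database restoring
        RESTORE DATABASE Sales
            FROM DISK = 'D:\Backups\Sales_full.bak'
            WITH NORECOVERY;

        -- apply each log backup in sequence; on the last one, stop just
        -- before the failure and bring the database online
        RESTORE LOG Sales
            FROM DISK = 'D:\Backups\Sales_log.trn'
            WITH STOPAT = '2006-06-01 14:30', RECOVERY;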

    As for the other items, I don't think I'd run REPAIR_ALLOW_DATA_LOSS without someone like Bob Dorr from PSS on the spot. And I don't think I've ever even heard of anyone "rebuilding" a transaction log. That sounds like one of those urban myths where someone heard that someone said that they had a way to rebuild a log.
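
    And just to be clear about what #1 actually asks of you: the repair option only runs in single-user mode and is allowed to throw data away to get the allocation structures consistent, which is exactly why I'd want PSS involved first. Roughly this, with a made-up database name:

        -- absolute last resort: this can silently discard data
        ALTER DATABASE Sales SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
        DBCC CHECKDB ('Sales', REPAIR_ALLOW_DATA_LOSS);
        ALTER DATABASE Sales SET MULTI_USER;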

    And flicking the power on your drives? I'm not sure what I'd even say to someone who suggested it.

  • Interesting....

    I realise the humor in it... but you can be certain there are people out there just itching to give these a go!!!

    I bet the people that actually go ahead and do these things are certified DBAs too! Who, of course, quickly blame "Microsoft support". After all, they read it on a forum somewhere - so it must be true - right?

    Don't get me wrong, I frequent forums a lot for answers I don't know - but I ALWAYS test something out on a development box first!

    shuffles off muttering.... It wasn't me sir. It fell apart in my hands - honest!


    Gavin Baumanis

    Smith and Wesson. The original point and click device.

  • Hi Steve, our environment is very OLTP-intensive: several clustered SQL boxes, etc. There is no window during the week away from production for optimizations and so on, plus loads of project work, including bringing SQL Server up in our Polish plant. Downtime is unacceptable, and our main customer, Dell, scores us on our uptime. We are only starting to look at SQL 2005. Point being: extremely busy all the time.

    However, the message I drive back to my boss, as I am the only site DBA, is that we can do all this project work, and yes, it must be done, but basic DBA attention is an absolute must: test point-in-time recovery, actually look at the log files from transaction log backups and optimizations, review fragmentation, review the SQL Server logs every day, verify the integrity (checksum) of replication, check the SQL Server Agent logs, etc., etc. My rule is that when I use a DBCC command, I understand exactly its function and research the pitfalls, if any, as much as possible. Hitting a switch or attempting to rebuild a transaction log seems to me to be an "I really don't know what's going on here" scenario. Why rebuild a log? Restore to a point in time, surely, and if you can't, you're goosed anyway.

    Rabbiting on a bit now, but I get the gist and agree with you. By the way, in the past I sat the 72-228 exam and failed it - not by much, but failed it. The exam crashed while I was doing it and I got zero marks for section 1. Hence my question last week about whether I should do it again with SQL 2005 now on stream. I consider myself a good and attentive DBA. I was disheartened after failing the exam, but that's no excuse for a lack of attention to detail; cert or no cert, my approach to my work is that it's a hobby, and flippant comments like the ones you've come across really annoy me. People who flip switches really don't have their finger on the pulse!!! Sorry for the long verse.

    Derek
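
    P.S. For the fragmentation review I mentioned, on SQL 2000 I lean on DBCC SHOWCONTIG; the SQL 2005 equivalent would be the new physical-stats DMV. The table name below is just a placeholder:

        -- SQL Server 2000: quick fragmentation report for one table
        DBCC SHOWCONTIG ('dbo.Orders') WITH FAST;

        -- SQL Server 2005: fragmentation for every index in the current database
        SELECT OBJECT_NAME(s.object_id) AS table_name,
               i.name                   AS index_name,
               s.avg_fragmentation_in_percent
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
        JOIN sys.indexes AS i
          ON i.object_id = s.object_id
         AND i.index_id  = s.index_id
        ORDER BY s.avg_fragmentation_in_percent DESC;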

  • Very intensive OLTP here too, running with three clusters; one of them is 64-bit on SQL2K5. As for this kind of suggestion or script, the other DBAs and I always test them on a development server first. I can't imagine doing that stuff directly, without keeping a lot of factors in mind, and so on...


  • "And flicking the power on your drives? I'm not sure what I'd even say to someone who suggested it."

    Depends on context... our AS/400 has 3 RAID arrays of 15 disks each.  A few months back, we lost THREE disks in a 24-hour period.  The last went out midday.  Needless to say, the system went down HARD.  We had a choice: restore to the previous night's backup, or... pull the dead drives and give them a bit of a shake... We chose to pull the drives, give them a shake, get the array hobbling along in safe mode, remove the one of the two that didn't spin back up, replace it with a new drive, and hold our breath while the array rebuilt.

    We got lucky: we didn't need to shut down our warehouse for a couple of days to do a complete cycle count and rebuild POs/orders/etc. from the primary business systems... The worst that happened was that a couple of locking objects were damaged and a bunch of indexes needed to be rebuilt... it could have been FAR worse.

    And we were saved by the equivalent of flicking the power switch on one of the drives a few times...

    Sometimes the best-laid plans for disaster recovery just aren't good enough for the worst-case scenario.  They get you close... but... you never really expect 3 drives to fail in a 24-hour period when they normally fail one every 3-6 months.

    /HorrorStory

    I rant a lot about disaster preparedness.  Mostly because I've had so many customers have actual disasters.  With so many years of working with dental practices, I've been involved in some good ones.  In Florida, hurricane Andrew took the wall off the side of the building.  A deer (four legs plus antlers) leapt in through the window, wreaked havoc in the office, and back-kicked the server, collapsing the machine by four inches (10 cm).  Imagine coming in to work and finding all your hardware gone.  Remember seeing that famous shot of the double-decker highway in California that collapsed in the earthquake?  That guy in the blue lab coat going in to treat the injured is a dentist, and one of my clients.

    Disaster recovery is needed only after the disaster.  Disaster preparedness is needed every hour.  "Practice, drill, rehearse!"  Don't start your praying as you pull the pin on the fire extinguisher.  Right this second, do you know where the exits are?

    Test your recovery plans.  If management balks at extra facilities, then put forth a cost projection of dollars per hour of downtime if the recovery fails.

    ATB,
    Charles Kincaid

  • There's a lot of bad programming advice out there, in any language. I can't imagine why, but I have some ideas.

    Thing is, if this had been a hurricane, it wouldn't have been in the middle of the work day, and our recovery plan would have worked.  It's the loss of the data between the morning backup and the time of failure that was a problem.  If it had been after the business day, we could have rebuilt based on the load into our primary business systems.  Having the RAID array is our safety net for the time between 7am and 6pm.  We have a decent number of drives... normally we see one go down here and there, but rarely 2 in a day, let alone 2 in the same array on the same day.  That day, luck was against us the first half and for us the second half.  In the first half, we lost a total of 5 drives in 4 arrays on 3 machines... really, I have a feeling our backup would have been corrupt with those odds.

    You can prepare for the worst... it just seems there's always something worse waiting in the wings.  And unluckily we have budgets... if we didn't, we'd all have at least one real-time mirror running offsite... But... a second AS/400 is actually far more expensive even to maintain than the losses of rebuilding a day's worth of lost transactions would ever be.

    Besides, in most disaster situations that would affect our AS/400 (warehousing system), chances are there would be damage to the goods in the warehouse as well, which would mean the same type of recovery work as we would have done for the data loss anyway.

    I'm prepared... I know exactly what would need to be done to get everything running again.  I know about how much it would cost, and I know about how long it would take.  Are there better ways? Yes.  Are they worth the cost? Not really.

    But... the point of my post was that sometimes the things that sound like a really bad idea are the best idea in the situation... without context, you can't really be sure... In this case, a bad idea saved us probably $25-$45k (what recovery probably would have cost, including upgraded shipping), but that's still less than the $150k+ we would have paid for better protection.

    Redundant power grid feeds, dual balanced UPSs, redundant generators, SAN and NAS storage, etc., and still two total blackouts this weekend. No matter how prepared you are, you cannot expect the 'unexpected'. Multiple minor failures cascaded twice, almost exactly 24 hours apart, to produce our test. Oh, and we also had dual power grid failures as well. Yet we experienced no data loss and, luckily, less than a handful of equipment component replacements.

    There really is no practice; it's all in the planning, the execution, and the caliber of the staff.

    As for the database recovery information: well, the internet is a double-edged sword. The good edge is the abundance of information available; the bad edge is the abundance of information. By this I mean you have to vet and judge it. With that in mind, I know I have a secure future for a long time to come.

    Regards,
    Rudy Komacsar
    Senior Database Administrator
    "Ave Caesar! - Morituri te salutamus."

    The internet is not the only place you can get questionable advice.  A while back we had some issues with the primary database server and an MS-certified DBA was brought in.  Top of his list of advice was, "You don't need to run backups; they take resources away from the system."

    The above was a case of not enough I/O coupled with bad hardware crashing the server, but that advice was scary for a mission-critical system.  Of course the bean counters were at fault and have since seen the light, but it takes a lot to get them to part with money... actually, we had to part with said bean counters first; then funds were available and there were no further issues.

    michel

    At a company I worked at years ago, the IT/Ops group at one point fell under the Accounting group... bad idea.  One day, the person on ops went to the head of accounting and said, "Why do we need to put these tapes in? We just reuse them anyway, so it's not writing anything important, and it takes hours to run that we could be using to get work done." The accountant promptly discontinued the backup plan, and it stayed down, from what I understand, for a couple of years.  When I got to the company, backups were back, but only daily cumes; there hadn't been a full save for years.  When we went to upgrade to a new box, we had to stop and do a full save before we could do the upgrade.  Once I realized that was the case, I pushed to get a weekly full save and daily cumes.  Sure, it cost us, I think, an extra 30 minutes every Sunday, but... I can think of far worse things to deal with... (like an added database that never got picked up on a cume because it never got backed up to begin with)
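
    For the SQL Server folks, that weekly-full-plus-daily-cumes schedule maps roughly to a weekly full backup plus daily differentials. A sketch, with made-up database and file names:

        -- weekly full (e.g. Sunday)
        BACKUP DATABASE Sales
            TO DISK = 'D:\Backups\Sales_full.bak'
            WITH INIT;

        -- daily "cume": everything changed since the last full
        BACKUP DATABASE Sales
            TO DISK = 'D:\Backups\Sales_diff.bak'
            WITH DIFFERENTIAL, INIT;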

    We do some crazy stuff when we over simplify...

    Kevin, you signed off by saying "We do some crazy stuff when we over simplify...".  So true.

    One of the points that I get across in one of my classes is "Traditions are things that we do to reinforce lessons.  Any tradition divorced from the lesson is a ritual".  Your former company fell victim to ritualization.  The "lesson" did not go along with the "action".

    ATB,
    Charles Kincaid

    Charles, your quote is a keeper!

    Regards,
    Rudy Komacsar
    Senior Database Administrator
    "Ave Caesar! - Morituri te salutamus."

    One of the problems I've seen is that ritual is the norm at most companies.  I don't know how many times I've seen someone doing a job without realizing what they were actually doing, or why.  (This includes some more complicated technical tasks.)  More than once I've been forced to ask myself how a person can even do the job without understanding what it's doing.

    But... I've also discovered that you can train almost anyone to do almost anything... it's finding someone who can learn the whys and hows of the job, and understand it completely, that's hard to come by.

    It's way too hard to convince people that if I wanted a mindless zombie doing the job, I would have automated it and let the computer do it for me.  The reason I have a person doing it instead of a machine is that I need a thinking being who can handle the unexpected.  If the person can't understand the entire process, then I'd be better off with a machine: it won't make mistakes and it'll be a lot less expensive... although it will be blind.
