Mission Critical

  • I used this blog post a little while ago as a starter to relate a story from my career. However, I didn't really address the story in the post, so I'll do that now. In the post, management doesn't want to upgrade the email servers, which I'm assuming was cost-related. They tell him they can tolerate "some" downtime in the system if it saves money. So he turns off the servers, tells them he's working on it and that it will be fixed within the SLA they gave him, and waits to see what happens.

    They get upset, realize email is mission critical and then give him money. While I kind of smile and slightly applaud his willingness to take them at their word, I do agree this isn't necessarily the way to handle this problem.

    I've worked at a number of companies, some small, some large, and I've had people tell me that many systems were mission critical. However, I've also seen those systems go down unexpectedly, sometimes for more than a day, and none of the businesses failed. It didn't even substantially affect revenue or disrupt the business.

    People learned to work around issues, make phone calls, get faxes, and use paper to record transactions. And they had to work hard to catch up later, but the business wasn't crushed. I've heard before that being down for a day or two would cause many businesses to fail. I just don't believe that is the case for most businesses. Maybe a few, but the people would keep most businesses going.

    I'm not sure what mission critical means. I know Amazon takes a huge hit if their site is down, but does Dell? They still make lots of phone sales, POs could stack up, etc. Microsoft? Not sure it would matter.

    So while I don't think people like email going down, and it is probably mission critical, I think most businesses could survive it being down for a day. Databases? It depends on the system. I've seen all types of systems go down, and the only ones that really impacted the numbers, the only ones for which a public company might have to report a charge, were e-commerce systems.

    Most everything else you can work around for hours, or even a day.

  • When it comes to medical systems, downtime could be critical, possibly fatal. Out-of-date or missing medical notes could result in the wrong decisions being taken - or do you always assume that the computer record isn't the complete record and send for the paper notes, if there are any, and delay decisions until these are received?

    R C Evans

  • True - I have a friend who works for the ambulance service - if their db is down they can't route calls to ambulances and people die.

    However, on my manufacturing systems things aren't quite so exciting (although you wouldn't believe it sometimes). I remember needing to remove a Novell server from my network. I'd got rid of all the applications/file shares/printer shares until we were down to the "samples database". This (allegedly) controlled the creation of new products within the company and I was informed it was absolutely mission critical, so after talking to the users I came to the conclusion I needed to rewrite it in VB/SQL.

    Come the first day, I do a workflow analysis. Here's what was happening. The user fills in a Word doc with the relevant details of the new specification and passes it to administration, who type the details into the samples database. The administrator writes the sample number onto the Word doc and passes it back to the user. The user photocopies the form 7 times and physically mails it around the company. Everybody files it.

    Hmmm... So, let's get this straight.

    Q:"What happens when you need to look up some sample information"

    A:"We look in the file"

    Q:"So why do you need the samples database?"

    A:"We need a sample number".

    Q:"So who uses the samples database?"

    A:"The administrator"

    Q:"What for?"

    A:"To get a sample number"

    Needless to say, I managed to replace the samples database with a little book in which the administrator wrote down numbers, which saved her 15 minutes of typing per sample... Another successful implementation from yours truly, then!

    I think the simple lesson to be learned here is that users really aren't very well equipped to judge what is and isn't mission critical. A more complex critique might conclude that users are, in fact, often pretty bloody stupid.

  • Good article,

    I agree with the post as far as internal systems are concerned; far too many people place far too much importance on their own 'babies', be that software, databases, or business processes, to the point where I would recommend therapy.

    However, when it comes to externally facing systems, I do believe these should be treated as if any downtime affects bottom-line profitability. It's all about customer perception. Microsoft may not be well loved around the world, but can you imagine how people would feel if lines of communication with them were lost? If the admin guys at Microsoft held the view that the website wasn't mission critical, one downtime would lead to another, and so on.

    I know my immediate impressions of a company are set by its website and its external communications.

     

    Chris

  • If your business isn't e-commerce and you have any system that cannot tolerate downtime, you had better have some really good emergency procedures in place.

    As for the example of the ambulance company, what is their database doing that they can't dispatch an ambulance without a database? My guess would be mapping, in which case, get out a map and figure it out. Just imagine what ambulance companies did before computers. There is a solution.

    JimFive

  • As a person with a vague memory of what an office was like before computers... I find it impossible to believe that anything we do can bring the business to its knees and make it fail outright. If you are that dependent on your computer systems, then you have a dangerous situation and have made a bad business decision. I pointed this out to somebody once on the phone... they basically couldn't (or wouldn't) fulfill my request because "the computers are down" - and I was kind of like "then why even answer the phone? Write down my request and do it later. Who's running things here anyway? The people or the computers, cuz if the computers are in charge here, we're all in a lot of trouble." I had to call back later and I was really PO'd by that time. I remember now... it was Xcel Energy. I'm in the middle of some problem with them right now... I would really like to sever that connection, but I need the electricity... and they know it.

  • Just-in-time manufacturing plants can run without their scheduling system, but for a particular auto manufacturer I worked at in the past, being without the scheduling system would have resulted in a loss of about $10,000 in profits per minute.  For one day, which is 20 hours of production time, this would have been a loss of $12 million in profits.

    Without the scheduling system running, the switch to manual processes would have meant one truck produced every 15 minutes compared to one truck produced every 45 to 50 seconds. I'm not sure if that meets the definition of mission critical, but with the way big companies worry about stock prices, the plant floor production applications at this manufacturer were only allowed 20 minutes of outage each year. Good thing we had the four-hour maintenance window each day. (The arithmetic behind those loss figures is sketched below.)
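    For anyone who wants to sanity-check those figures, here is a quick back-of-the-envelope sketch. The rates ($10,000 of profit per minute, 20 production hours per day, one truck every 15 minutes manually versus every 45 to 50 seconds with the scheduler) are the ones quoted above; the little Python script itself is purely illustrative.

    ```python
    # Back-of-the-envelope downtime cost using the figures quoted above.
    PROFIT_PER_MINUTE = 10_000      # dollars of profit lost per minute of stopped production
    PRODUCTION_HOURS = 20           # production hours in one day

    minutes_down = PRODUCTION_HOURS * 60
    loss = minutes_down * PROFIT_PER_MINUTE
    print(f"Loss for a full production day: ${loss:,}")   # $12,000,000

    # Throughput comparison: manual fallback vs. the scheduling system
    manual_per_hour = 60 / 15          # one truck every 15 minutes
    scheduled_per_hour = 3600 / 47.5   # one truck every ~45-50 seconds
    print(f"Manual: {manual_per_hour:.0f}/hr, scheduled: {scheduled_per_hour:.0f}/hr")
    ```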

  • One of my old jobs was at a call centre constantly taking stock trades over the phone; their fallback was, of course, pen and paper. However, before I left there were rumours that this would no longer be possible, as the humble pen and paper was more open to internal abuse than an IT system... how we laughed.

     

    The point being that some systems are built with compliance issues in mind; therefore you can't do anything without the software, as you won't be compliant.

     

    I hope this doesn't include the ambulance service though.....

  • Back in the old days, one was taught to have a good manual system in place before converting to a computer-based system.

    Still applies today.

    Recently, we went through an e-mail meltdown. Our Exchange system had been running since 1999 without a problem. A partial restore took 24 hours, and a full restore of all the junk, er, important stuff that users store took about a week. The only real problem was not being able to look up contacts and notes during that time. Phone and fax worked fine, same as they did before e-mail.

    The delay was due to lack of proper training on my part. I knew most of the drill but without a test environment to practice on, it was slow going.

    We also found out that the backup program was lying. I had tested other restores with no problem, but never the Exchange system. Guess what? Turns out it wasn't needed. I gained a new respect for Exchange and some of its built-in resiliency.

    A few complaints were heard. We have a Barracuda installed, and it stored all the received e-mail. I could read it and pass on important information if needed. Turns out, little was missed. We use an online account for certain business needs. If somebody had to send an e-mail, they could use that. While not part of the contingency plan, it worked as such.

    About a year and a half ago, our main server took a bullet to the disk controller. Being older equipment, it took two days to locate and ship a replacement in. While the drives were physically okay, the data was hosed because of the controller card. Complete backups were available, but they could not be used to restore the entire system, a known problem. It took a couple of days to rebuild and reconfigure everything. We were down for a week. It happened to be in our slow time. The end result was that I finally got new servers with redundancy and imaging software.

    A previous loss of the SQL server (hardware) also resulted in improvements.

    Bottom line, the business survived and some improvements were gained. I tell users to make sure they print a customer and inventory list once in a while, as even a prolonged power outage can have the same effect as a system crash (a small sketch of scripting that kind of export follows at the end of this post). You must have a manual system in place to cover any system failure.

    Because it will happen.
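    Since I mentioned printing customer and inventory lists, here is a minimal sketch of one way to script that kind of snapshot. The connection string and the table names (Customers, Inventory) are hypothetical placeholders, not from any real system; adjust them for your own schema.

    ```python
    # Minimal sketch: dump a couple of tables to CSV so there is an offline/paper
    # fallback if the main system goes down. Table names and the connection string
    # are placeholders.
    import csv
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=Sales;Trusted_Connection=yes;"
    )

    for table in ("Customers", "Inventory"):
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM {table}")
        with open(f"{table.lower()}_snapshot.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cursor.description])  # header row
            writer.writerows(cursor.fetchall())

    conn.close()
    ```

    Run on a schedule (and actually printed or copied off the box), even something this small gives people a list to work from when the system is down.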

  • A bullet to the disk drive? You hunting moose up there in Alaska?

  • We recently lost our ERP system, and in the most stupid fashion. It's running on two IBM eServers (one app, one SQL) plus a DS400 SAN unit for my SQL Server portion. Turns out that we lost a drive in the RAID array in the app server back in February (IIRC), and that about a month ago another drive finally died and the whole thing fell down and went *BOOM*. Naturally it was at the end of the month.

    My database was just fine, but it could only be accessed by the app server (except for Crystal Reports), so it idled a lot of people: utility bills and new accounts, business licenses, etc.

    It's also possible that a controller card failed, I don't recall.

    So IBM brings out replacement drives, but there's a firmware mismatch, and for some reason it was hard doing the firmware update.

    Eventually everything is back up and running.

    The problem? For reasons unknown, the management module that would have notified us of the first drive's failure was never initialized, and this little oversight was never spotted.

    It definitely had a huge effect on our productivity, as lots of citizens were unable to do whatever they wanted to do with the City.

    Now here's the lovely thing. Once they got the server's array back up, everything died at the ERP level four days later because when they updated one of the security packages, they used a temporary license with a four-day life! And it failed on the following Saturday, just a few hours after people came back in to try and catch up on what they hadn't been able to do earlier in the week.

    This particular system is definitely mission-critical, and there is very limited fallback to continue functioning if it's dead. I'd argue for a duplicate set of servers, but we're city gov't and I'm sure there's no money for it.

    Our other systems (GIS, help desk, risk management, parks & recreation scheduling, optical documents management) can either survive extended downtime or could survive until I could get them moved to another box and restored. I've got three SQL Servers right now, but my P3 box is fixing to be retired. I hope they have a spare lined up for me just in case one of the existing boxes bursts into flames.

    -----
    Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson

  • Yep. We use that server every fall hunt. When the external rubber belt that turns the disk on the hard drive begins to fail, it squeals just like a cow moose in heat.

    Draws in those bulls every time. 

    But the newbie misfired and hit the server instead.

    And missed the moose, too.

     

  • Bob - ROTFLOL.

    Steve, here's a non-economic mission-critical server for you: the little boxes they use in every tower at every national airport. It's not about money when you lose that one. It's about the lives of all those poor people stuck up in the air, taking off and landing. Money is a concern that comes afterwards, when all the families decide to sue.

    Just as necessary as the medical servers, IMHO.

    Brandie Tarvin, MCITP Database Administrator
    LiveJournal Blog: http://brandietarvin.livejournal.com/
    On LinkedIn!, Google+, and Twitter.
    Freelance Writer: Shadowrun
    Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.
