What's an Outage?

  • http://www.gamerandy.com/archives/Earthlights-%20Power%20Outage.gif

    I caught this quote from Microsoft Watch, and thought it was very interesting. It definitely addresses an issue I've wondered about for some time. It talks about the WGA server issues and had this great quote from the WGA Product Manager:

    "It's important to clarify that this event was not an outage. Our system is designed to default to genuine if the service is disrupted or unavailable. In other words, we designed WGA to give the benefit of the doubt to our customers. If our servers are down, your system will pass validation every time. This event was not the same as an outage because in this case the trusted source of validations itself responded incorrectly"
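    The default-to-genuine design in the quote can be sketched as below. This is only an illustration, not Microsoft's code: `Verdict`, `ServiceUnavailable`, `validate`, and the two fake servers are all hypothetical names. It shows why an unreachable server and a reachable server giving a wrong answer lead to different results for the customer.

```python
from enum import Enum


class Verdict(Enum):
    GENUINE = "genuine"
    NOT_GENUINE = "not genuine"


class ServiceUnavailable(Exception):
    """Raised by the hypothetical server call when it cannot be reached."""


def validate(ask_server):
    """Fail open: an unreachable server yields GENUINE.

    `ask_server` is a hypothetical callable that returns a Verdict or
    raises ServiceUnavailable. A server that answers, even wrongly, is
    trusted as-is.
    """
    try:
        return ask_server()
    except ServiceUnavailable:
        return Verdict.GENUINE


def server_down():
    # Simulates a disrupted or unavailable service.
    raise ServiceUnavailable("no response")


def server_wrong():
    # Simulates a reachable server that gives a bad answer for a legitimate copy.
    return Verdict.NOT_GENUINE


print(validate(server_down))   # benefit of the doubt
print(validate(server_wrong))  # the wrong answer passes straight through
```

    The fallback only covers the failure the designers anticipated (no answer), so a wrong answer from the trusted source is never caught by it.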

    The individual writing the article questioned how this could not be an outage. After all, there was nearly a day where people could not validate their copy of Windows. So I'm asking this question today, with a few thoughts afterwards.

    What's an Outage?

    To me it's a client-side issue. If the clients can't use the service/app/server, then it's down. If my SQL Server is running and I can connect, but no one outside the data center can, it's an outage.

    The question does get tricky. If I can get to SQLServerCentral.com, but people in Canada can't, then is it an outage? In some sense it is, and people have a legitimate gripe about the service not working. It might not be the fault of the service vendor, as it could be a network cut, an ISP issue, etc. However for the client it's down.

    Measuring uptime is a hard process these days. Most people would still consider their services up when a server fails if there are redundant servers, like a load-balanced farm of web servers, that continue to work. For all I know there are always a few servers down at Google.com or Microsoft.com.

    So maybe it's a question of measuring your uptime at the edge of your network, outside your firewall. As long as things are working from there, you can claim you are up. Just be sure you have a couple of providers in case your one line goes down.

  • My definition would be: one or more users are unable to use any part of the system because of a problem with the system itself.

    [If they're able to only use certain parts that'd be a "bug", whilst if they can't use the system because of an external issue - e.g. their internet connection fails - then that's not a "system outage"].

    PH

    Paul

  • I agree.  If the client can't connect, for them it is an outage. 

    However, from a measurement perspective, I think an outage should be anything within the scope of the individual you are talking about. Take a DBA, for example: if the database is running but the server crashes, and there is also a system admin, then "there was not a database outage." For the system admin there was... or was there?

    Let's assume that the SYSTEM failed over to another node in the cluster but the Database didn't start because of a bad DBA config.  In that case, there was not a system outage, but there was a Database Outage. 

    Perhaps I am splitting hairs here, and I have never been a fan of passing the buck, but I am going somewhere with this.

    I don't know about Microsoft (never worked there), but I do know that in many of the companies I have worked for there is often an incentive for higher uptime numbers. For example, if you can pull off a true 99.9% uptime, you might get a 7% bonus, but if you pull off 99.99% it might go up to 15%. That often gave people the "incentive" to pass the buck in order to get the bucks, which may be why the explanation was given the way it was.
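    To put those two targets in perspective, here is a quick bit of arithmetic, assuming a 365-day year and uptime measured over the whole year (both assumptions of mine, as is the function name). It works out to roughly 8.8 hours of allowed downtime per year at 99.9%, against about 53 minutes at 99.99%.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_percent):
    """Hours of downtime per year permitted by an uptime target."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


for target in (99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours "
          f"({hours * 60:.0f} minutes) per year")
```

    With a budget that small, how a single incident gets classified can easily decide which side of the target you land on.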

     

  • The last couple of places I have worked, we kept multiple uptime counters: one as seen from the client side, and one that is internal. I.e. if a switch failed, that was counted against the uptime for the clients and the network, but not against the SQL uptime.

    Internet applications make this a bit harder, but generally it came down to whether the IIS servers were receiving requests from the internet or not.
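    A rough sketch of what I mean by separate counters follows. The incident list, the component names, and the mapping of which failure is charged to which counter are made up for illustration, not taken from any real system.

```python
from collections import defaultdict

# Hypothetical incident log: (failed component, minutes of downtime).
incidents = [("switch", 30), ("sql", 10), ("iis", 5)]

# Assumed mapping: which uptime counters each component failure is charged to.
# Everything hits the client-side counter; only SQL failures hit the SQL one.
CHARGED_TO = {
    "switch": {"client", "network"},
    "sql": {"client", "sql"},
    "iis": {"client", "web"},
}


def downtime_by_counter(incidents):
    """Total the downtime minutes charged against each counter."""
    totals = defaultdict(int)
    for component, minutes in incidents:
        for counter in CHARGED_TO[component]:
            totals[counter] += minutes
    return dict(totals)


totals = downtime_by_counter(incidents)
print(totals)
```

    The client-side counter collects every incident, while each internal counter only collects the failures inside its own scope.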

  • Our office has a centralized data center/IT server area with over 20 remote offices, some with only 1 or 2 people, some with 80 or more. All of these offices are connected to us via AT&T direct lines and use Citrix to get to their apps. If the servers on our end die, we think of it as an outage we are responsible for from the perspective of servicing our client (the people in the remote offices). If the AT&T lines go out (too frequent for my taste) we don't. To us an outage refers mainly to the IT infrastructure we can control. Obviously the AT&T outages are crucial and we treat them as if they were outages on our part, but when you can't directly control the technology, how can you be held responsible for it?

  • I think MS was dancing around on this one, fooling themselves. Yes, it was not an 'outage' in the way that MS anticipated, but that does not mean it's not an outage (politicians often use similar logic). What happened here is that an unanticipated mode of failure occurred and the system failed to treat it as an outage.

    Proper handling of critical systems means that the component that determines the 'outage' is completely different in principle and operation from the system itself, so that a malfunction of the system cannot affect the supervising system. Many industrial fail-safes have this kind of interlock, with a mechanical safety on top of an electronic control system. Traffic lights (to prevent green in two directions at once) require a completely redundant interlock that will shut the signal down before it will allow an illegal light combination.

    System designers need to think this way too.

    ...

    -- FORTRAN manual for Xerox Computers --

  • I agree here. But I would say bad data doesn't even count as a bug.

    Unfortunately, even though those folks couldn't get authenticated, it is not even a bug, because the system performed exactly as designed.

    The root issue here is what we as SQL developers and DBAs strive to prevent: the introduction of bad data.

    In fact, I am loading 30k+ records tonight that represent organizational structure changes of people and their facility assignments into 2 different systems. Even though I sanity-checked the data and cleaned out dozens of readily found issues, there are some discrepancies between the two that the end users will have to correct (it relates to their job; I am saving them time on 90+%), but I am knowingly introducing minor amounts of bad data. If a later report fails and it is found to be the data, then this will be their issue, not mine. In the MS case it is still MS's issue because they control and introduced the bad data themselves; in mine the customer has accepted responsibility for the bad data, as they want it loaded ASAP so they can go through it anyway.

  • It is an outage, period. If you can't do your job it must be counted as an outage.

    Where I work people (including me) are crazy about all sorts of "metrics". No one likes a failure to land in their lap, but reality is different. The definition of a "failure" (read: outage) is that someone cannot perform his/her job and it is not because of him/herself.


    * Noel

  • I think an outage is any time a customer cannot use the system, for any reason not due to their own inability or lack of training.

    To me, the only question is what kind of outage is it, and who can fix that part?

    Take a web site, for example.

    1. A power outage (such as a regional blackout) means the customer can't use the web site. But there is nothing you can do about it directly. One can do something indirectly by trying to contact the people who may help expedite the repairs, but ultimately, customers will not blame the web site for that, because the same cause affected everything else in their home too.

    2. A problem on the customer's computer can cause an outage. For example, some problem with their hardware or operating system may cause their browser not to work or may make it unable to get to the site.

    3. There also can be an outage somewhere along the line between the customer's computer and the web site. For example, either the customer's ISP or the web site's ISP (or some place in between) can have a problem.

    4. Then there is the usual outage, which is something like: due to some problem at the specific web site (web server, database, virus/attack), one or more customers can't use some part of it or they can't use any of it. The customer can reach other web sites and can use other programs on their computer, but they can't use that one particular site.

    5. A bug is also an outage, but to me it is a subtype of outage. It is a case where one or more customers (depending on the bug) can't get to the site, but they can do their other work on the computer and may even be able to use other parts of the site. But they hit the bug that keeps them from doing what they want to. Again, the site will get the blame since their outage alone stands out right at the worst possible moment.

    I'm sure I am missing something, but I think this is a good start.

    In my experience, I don't think customers care about the differences between 3, 4, and 5 above. (Except if they are told that it is their ISP, in which case their attention switches to that company for resolution.) If they find out that the problem was somewhere between their ISP and the web site, they will expect that the web site is doing all it can to restore the service. Telling the customer that the problem is due to an intermediate ISP may be the truth, but most customers won't care. All they will remember is that "your web site was down." Part of that is bad luck, I'm sure, but part of it may involve knowing what the connections are and trying to have some kind of redundancy or plan B so all of the eggs are not in one basket. I'm sure that is incredibly difficult to do, but I bet the top sites like Amazon.com have it. I can recall only one time when I tried to get to Amazon.com and it was down for more than a minute. I don't think that is just my lucky result - they must go to great lengths to make sure that when someone goes to that URL, the site will work.

    Just my two cents.

    webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • I think a lot of folks need to separate themselves from the customer's point of view and consider only the system itself. If the system is operating as designed and actively running, it is online, and that is never an outage of the application. You don't say you had an outage due to outside circumstances.

    The example given in the link was that he installed XP on his mother's computer and, due to file corruption, it got flagged as invalid. The problem stemmed from the computer's hard drive, not the software providing validation, which did exactly what it was designed to do. This is not an outage of the validation software, but it could be classified as an outage of the hard drive or the media he installed from. In fact, I had this same thing happen a few years ago on a machine with file corruption. I worked with MS and fixed it, but within 2 weeks that machine's hard drive started making the clicks of death. So the hard drive was the issue, and it never had anything to do with MS.

    Banks will not say they are having an outage if their ATMs aren't working and the power is out in the area (many have battery backups, but they only last so long). If an ATM is stolen it isn't an outage, but the customers who would otherwise have used it cannot. There are always circumstances beyond your control, and if you spend your time apologizing for those things, then the day the customer themselves is the issue (ID10T error: interface issue between entry and keyboard devices) they will still blame you.

    The key is that you can only be responsible for your system. If a router goes down at an internet provider and a major company calls saying they can't use your system and claims 10 million in losses because of it, are you going to pay them, or tell them they need to speak with their internet provider because there are no outages with your system? The customer may not be happy, but there is no issue with your system, so it is not your issue.

    In the article he stated:

    "If the power grid goes down, even by human error, that's an outage because electrical devices have no juice. If MSN Messenger service goes down because of human error and there is no instant messaging, that's an outage. The validation system failure is an outage."

    The difference here is that the power went out and MSN Messenger went down. Validation never went down; people were just getting an incorrect response.

    I previously worked for a long distance carrier, and one year in New York a telephone company's cable went out. The next day, people who had tried to make phone calls of course called us complaining about service. Everything about our network was 100% the whole time, so we immediately directed them to the LEC (Local Exchange Carrier, or telephone company if you want). Of course some refused and decided to cancel, and some just complained. As a friend said so eloquently while listening to a rant (and of course she was on mute), "Would you like a little cheese with your WHINE?"

    You have to set the parameters of what counts as your responsibility yourself, but if they are set beyond your own control you will be unhappy with it. Remember, you cannot please everyone all of the time.

    Another thought: suppose you write a database for someone and they place it on a server, and their application fails. Whose fault is it, the DB or the app? If they call you up and say their app, which had worked for 3 years, won't work, and they just introduced a new version, are you going to research it? If you do and find they misspelled a column name in their app, are you going to fix your database? If you do, then you are asking for more trouble down the road, as all issues will be yours from then on and they will say "hey, remember that time...". I know from experience. The outage is with the app; the database did not suffer an outage, thus in the scope of your responsibility there was no outage. Can everyone say scope creep? Projects suffer from it and so do responsibilities, and until you stop letting it happen you are always to blame.

    All in all, though, MS has not said it was not their fault, only that it was not an outage, just bad data leading to incorrect responses. The system was active, online, and doing exactly what it was designed to do.

  • Sorry, this really eats me up because of personal experience with customers who placed blame on me for things ultimately found to be their issue.

    Some folks here have stated they consider the power being out to be an outage of their app because the customer is unable to use it.

    OK, what if the customer's machine blows up (hard drive dies, power supply dies, motherboard issue)? By the logic above, this is an outage of the application. Now, for the person who just said they can go to a different machine: keep in mind that power being out at the house doesn't mean no power at the Starbucks with WiFi down the road. The fact is they can use the application just by changing their position, even if it means driving 50, 100, or 200 miles out of the area affected by the external issue.

    Now another one. Suppose a user gets into an accident today after they leave work, and when they next come in both arms have had to be amputated. So now, through no issue of your own, and even though the software is working as designed and actively running, by the same principle as the power outage the customer cannot use it. So by your statement this too is an outage?

    Draw a line, stand on it and don't move it. Now call this your responsibility. Anything beyond it is not yours.

  • Banks will not say they are having an outage if their ATM's aren't working and the power is out in the area (which many have battery backups but they only last so long). If an ATM is stolen it isn't an outage, but the customers who would otherwise have used it cannot. ...

    I disagree. A stolen ATM, a local power failure, an in-building electrical fault, or the bank building catching fire IS an outage. It may affect the process the bank needs to follow to correct the outage, or it might not really be the bank's fault, but to claim it is not an outage is just irrelevant finger pointing. The system either works or it does not.

    There is another problem that MS is washing their hands of, and they should not be. The system failed to respond as it should (it is an outage), but they were unaware of it because their metric narrowly measured certain operations of the system and considered it a success. "Oh, gee, we didn't expect the wrong code to be inserted... yada yada". Except wrong code does get inserted. A hacker may compromise a system. A memory failure may occur in a webserver. Metrics that check for 'outage' must NOT be expecting any particular KIND of failure; they must measure whether failure has occurred (can the user get the appropriate connection?).

    When designing systems with monitoring failsafes, it is essential that the monitor be agnostic as to the type of failure that occurs, including the unexpected.
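    One way to picture a failure-agnostic monitor is a known-answer probe: exercise the service the way a user would, with an input whose correct result is known in advance, and count anything other than that result as down. The sketch below uses invented names (`probe` and the three fake checks) and is only meant to show the shape of the idea.

```python
def probe(check, expected):
    """Return True only if the end-to-end check yields the expected answer.

    `check` is a hypothetical callable that uses the service as a real user
    would. An exception or a wrong answer both count as a failure; the probe
    does not care which kind of failure it was.
    """
    try:
        return check() == expected
    except Exception:
        return False


def healthy():
    return "genuine"


def unreachable():
    raise ConnectionError("no route to service")


def wrong_answer():
    # The service is up and responding, but the response is incorrect.
    return "not genuine"


for check in (healthy, unreachable, wrong_answer):
    status = "UP" if probe(check, expected="genuine") else "OUTAGE"
    print(f"{check.__name__}: {status}")
```

    A monitor that only asks "did the server respond?" would report the third case as up, which is exactly the blind spot described above.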

     

    ...

    -- FORTRAN manual for Xerox Computers --

  • When the cable TV goes off, that is an outage. Nothing I can do about it and no credit to the account for service not rendered either! That's called a robbery.

    When one of my branch locations calls and says they are down, then they are experiencing an outage. This happens when contractors dig into some fiber and break it. All services are then cut. Is this an outage? Yes, but only locally. The rest of the system is up and running and doesn't even really care that the branch is down.

    When one of my servers goes down and everybody is affected, then that is a system outage. Can they still work? Yes, that's why you have manual backup procedures. The difference is people try and compare a complete outage (no power, lights, phone, computer, coffee machine) and the inability to work with a localized outage that only partially affects their ability to work. Printer down? Then print someplace else or fax it to yourself if you have to. Can't use your spreadsheet, whip out a calculator. Can't access your inventory control program? Then get up and check it manually? Can't look up cost or sell prices? Where is your manual back up copy as covered in your disaster plan? An outage should mean complete loss of access/control, not an inconvenience.

    What M$ experienced was a system outage. Did it prevent anybody from being validated? No. Will there be a day of reckoning later for those using pirated software that received a "free pass"? Hopefully.

    So yes, M$ had an outage but it was more along the line of early TV when you used to see the test pattern come on with a message that said "Technical difficulties. Please stand by"

    I must have stood there for an hour before my parents found me...

  • I'm a bit more severe. Even bad training is an outage for that user. We take support calls from users who simply did not know that a feature was available. For the time it took the user to report to a supervisor and get the number to call, dial us, get to a person who can answer the question, receive and understand the answer, and finally implement the answer, it's an outage to that user. It could even be considered an outage to the supervisor, though that might be a stretch.

    ATB
    Charles Kincaid

  • Lots of good information so far. It seems to me there are two different categories of businesses when it comes to classifying 'outages'. Non-MS companies have standards for doing this most of the time. MS, however, does not have a standard for these types of occurrences, in my opinion. Their approach is simply to declare a 'new' standard!

    Regards
    Rudy Komacsar
    Senior Database Administrator
    "Ave Caesar! - Morituri te salutamus."
