Is 0% Downtime Possible?

  • Comments posted to this topic are about the content posted at

  • The problem is that management believes in 0% downtime. Fortunately, they have 50% downtime themselves, so it's not too hard to sync downtime to when they won't notice.

    The worst case of downtime was when some muggins switched off a server that had been happily running for 3 years solid. Once it had cooled, the RAID array had seized solid. Oh, and when they came to do the restore, they found it had never been included in the backup schedule for the network.

  • The only thing you should have added is the period over which you measure your downtime: 1 day, 1 month, Mon-Fri 8 to 5. We set our service levels based on time frames for each database and its related applications, not on a general figure for downtime. Other than that, a great, simple article.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Good article. I agree, 0% downtime is impossible. I liked David.Poole's comments on planning downtime when no one will notice. We do this all the time. Many of our rollouts of new features happen between 11 PM and 1 AM. This is when we have almost no one on our site, so no one notices our planned downtime for changes.

    Robert Marda

    Robert W. Marda
    Billing and OSS Specialist - SQL Programmer
    MCL Systems

  • Thanks guys, I have long felt that it was impossible overall and have adopted the items mentioned. There is always some maintenance window, sometimes on the spur of the moment.

    I've tried and haven't figured it out. If the telcos can't do it over years, I figure it can't be done.

    I agree with Antares that you have to build some downtime into your SLAs to handle this. BTW, I've worked with 7 long haul carriers and 5 co-location/managed service companies. Not one of them has 100% uptime for the network over a year.

    Oh well...

    Steve Jones

  • I previously worked with a telco and was surprised by how often the networks went down without anyone's knowledge but the techs'. A lot of tracking goes on to keep in-progress calls from dropping: when a fiber cut occurs, the network almost instantly reroutes to another available circuit while carrying the call.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • They have fantastic rerouting and fault tolerance. BUT, there is still downtime, and not just on my local circuit. Mostly I deal with data, but I have seen minutes to hours of downtime for links without rerouting, mainly in two situations:

    1. A single route for an area. A good example was about five years ago, when ATT had a tremendous amount of frame traffic running from CA to the East through Las Vegas. The two fiber lines in the ring were physically close together and both were cut. Hours of downtime for frame customers.

    2. Upgrades. I've seen numerous cases with colos where a major router upgrade, software or hardware, causes a loop and the routers flap between themselves, sometimes cutting off large blocks of IPs. The 4-8 carriers supplying connectivity don't help here.

    Steve Jones

  • Every kid knows that the absolute is unachievable. The question is how long downtime can be for a 24x7 system. My record: 1 hour of downtime in six months.
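
    To put a figure like that in percentage terms, here is a minimal arithmetic sketch. The function name `availability_percent` and the choice to count "six months" as 182.5 days of 24 hours are assumptions made for illustration, not something stated in the post.

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Return the share of the period the system was up, as a percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (1.0 - downtime_hours / period_hours)


# Assumption: six months of 24x7 operation is taken as 182.5 days.
SIX_MONTHS_HOURS = 182.5 * 24  # 4380 hours

result = availability_percent(1.0, SIX_MONTHS_HOURS)
print(f"{result:.3f}%")  # 99.977%
```

    Under those assumptions, one hour of downtime in six months works out to roughly 99.977% availability.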

  • For SQL? I've run over 5-6 months on v6.5 with no downtime. Haven't really pushed the envelope on SS2K, though I'd have to check my current box.

    I used to run a Novell network (> 1400 nodes) and we had a server used by the Operations department for logging information. Best run I know of, > 500 days.

    Steve Jones

  • OK, so we're all pretty much in agreement that true 0% downtime is not achievable. Some have suggested that maybe we need to change the way we measure downtime. This is not just a fancy cop-out, but a recognition that 'downtime' at midnight may not really be downtime because the system is idle. I would like to add to that the idea that planned downtime may not really be downtime. I'm familiar with a few industries where full plant shutdowns are implemented for maintenance and upgrades and nobody counts this as downtime! Presumably, the expense and lost business are more than recovered through more stable, more efficient, and/or higher capacity systems.

    What we really need is a way to 'profile' our downtime that takes into account user requirements, business requirements, presumed advantages associated with maintenance and upgrades, workload balancing, etc. Any ideas?

  • In our area we mostly measure downtime as the times when business processes are impacted by unavailability. For instance, a database about inventory only needs to be available between 8 AM and 5 PM Mon-Fri, whereas a database that processes bank transactions needs to run 24/7. But if you have a rollover server with the same data and you roll over during scheduled outages, you technically have 0% downtime as long as the business functions are still running.
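
    As a rough sketch of that way of measuring, the function below counts only the part of an outage that overlaps a service window. The name `business_downtime`, its parameters, and the sample dates are all assumptions for illustration; the defaults mirror the 8 AM to 5 PM, Mon-Fri inventory example.

```python
from datetime import datetime, time, timedelta


def business_downtime(outage_start, outage_end,
                      open_at=time(8), close_at=time(17),
                      workdays=range(0, 5)):
    """Return the part of an outage that falls inside the service window.

    workdays uses datetime.weekday() numbering: 0 is Monday, 6 is Sunday.
    """
    total = timedelta(0)
    day = outage_start.date()
    # Walk each calendar day the outage touches and add the overlap
    # between the outage and that day's service window.
    while day <= outage_end.date():
        if day.weekday() in workdays:
            window_start = datetime.combine(day, open_at)
            window_end = datetime.combine(day, close_at)
            overlap_start = max(outage_start, window_start)
            overlap_end = min(outage_end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total


# Hypothetical outage: Friday 16:00 until the following Monday 09:30.
weekend = business_downtime(datetime(2024, 1, 5, 16, 0),
                            datetime(2024, 1, 8, 9, 30))
print(weekend)  # 2:30:00

# Hypothetical late-night maintenance: Wednesday 23:00 to Thursday 01:00.
overnight = business_downtime(datetime(2024, 1, 3, 23, 0),
                              datetime(2024, 1, 4, 1, 0))
print(overnight)  # 0:00:00
```

    In this sketch the weekend outage lasts 65.5 hours on the wall clock but counts as only 2.5 hours against the service window, and the late-night maintenance counts as none.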

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • quote:

    For SQL?

    Yes. MSSQL 7.0 (now MSSQL 2000). Currently 98% of our downtime is problems with power supply. 1% - OS problems. 1% - database problems.

  • And what about log-based replication?

    Lots of banks have chosen Sybase or DB2 for this.

    This is an external process which scans the log buffer, then queues and sends the modifications to the standby server (like log shipping in MS SQL, but in memory).

    The standby server is actually available all the time (for reporting).

    It's independent of the OS (you can upgrade Windows) and of the database version.

    That seems to be very robust and very flexible.

    Has anyone had any experience with this?
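
    The scan, queue and send steps described above can be pictured with a toy model. This is only an in-memory sketch under stated assumptions: the classes `Primary` and `LogReplicator` and the `acct:` keys are hypothetical, and nothing here reflects the actual Sybase or DB2 replication products.

```python
from collections import deque


class Primary:
    """Toy primary server: applies writes and records them in a log."""

    def __init__(self):
        self.data = {}
        self.log = []  # append-only list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class LogReplicator:
    """Toy external process: scans the log, queues changes, sends them on."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby  # a plain dict stands in for the standby server
        self.position = 0       # how far into the log we have scanned
        self.queue = deque()

    def scan(self):
        """Queue every log entry written since the last scan."""
        new_entries = self.primary.log[self.position:]
        self.queue.extend(new_entries)
        self.position += len(new_entries)
        return len(new_entries)

    def deliver(self):
        """Apply queued changes to the standby in log order."""
        applied = 0
        while self.queue:
            key, value = self.queue.popleft()
            self.standby[key] = value
            applied += 1
        return applied


primary = Primary()
standby = {}
replicator = LogReplicator(primary, standby)

primary.write("acct:1", 100)
primary.write("acct:2", 250)
replicator.scan()
replicator.deliver()
in_sync = standby == primary.data  # standby has caught up

primary.write("acct:1", 75)
lagging_value = standby["acct:1"]  # still 100 until the next scan and deliver
```

    The last two lines show the trade-off in this model: the standby stays readable the whole time, but it trails the primary by whatever has not yet been scanned and delivered.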

  • quote:

    And what about log-based replication?

    I don't think it is an effective approach. We now use a clustered server to reduce downtime.

  • Log replication is not better than any other method, but I use replication here on our SQL Server for data where I need an almost up-to-the-minute backup in case of a server failure. Basically, by copying the data to another location, if you lose a major component you have another site you can bring online to minimize the downtime. Still, until the rollover occurs you are down. Ideally you should use all methods available to you to minimize downtime on mission-critical apps, but non-critical systems should not be viewed as a necessity, even by management (if an outage doesn't impact productivity, the system can be down for periods of time without concern until it can be brought back up), which is the reason for SLAs (Service Level Agreements).

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)
