How Important is Zero Downtime?

  • Comments posted to this topic are about the item How Important is Zero Downtime?

  • I think web users have lower expectations than users of Ye Olde Worlde desktop apps did.  They are used to pressing F5 to refresh if something doesn't seem to have worked.

    If you get your cloud architecture right then recoverability is more important than 100% uptime.  For example, the user clicks buy.

    The event message goes onto a distributed queue with at-least-once delivery.  If the DB is down then the data is in the resilient distributed queue until the DB is back online and can drain the queue.
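
A minimal sketch of that queue-then-drain flow, in Python. Everything here is a stand-in I made up for illustration: `OrderStore` plays the database, a `deque` plays the distributed queue, and `drain` is a hypothetical consumer. The two ideas that matter are that an event is only removed from the queue after the write succeeds, and that the write is keyed on the event id so an at-least-once redelivery does no harm.

```python
from collections import deque


class DatabaseDown(Exception):
    """Raised by the stand-in store while the database is offline."""


class OrderStore:
    """Hypothetical database stand-in; writes are idempotent on event id."""

    def __init__(self):
        self.online = True
        self.orders = {}

    def save(self, event):
        if not self.online:
            raise DatabaseDown()
        # Keying on the event id makes a redelivered event harmless.
        self.orders.setdefault(event["id"], event)


def drain(queue, store):
    """Write queued events in order; stop at the first failure."""
    processed = 0
    while queue:
        event = queue[0]      # peek only, the event stays queued
        try:
            store.save(event)
        except DatabaseDown:
            break             # leave it queued for the next attempt
        queue.popleft()       # acknowledge only after a successful write
        processed += 1
    return processed


queue = deque()
store = OrderStore()

# The user clicks buy while the database is down.
store.online = False
queue.append({"id": "order-1", "item": "widget"})
queue.append({"id": "order-2", "item": "gadget"})
drain(queue, store)  # nothing is written, nothing is lost

# The database comes back and the queue drains.
store.online = True
drain(queue, store)
```

A real distributed queue would handle acknowledgement and redelivery for you; the sketch only shows why the consumer side needs to tolerate seeing the same event twice.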

    There is also much heavier use of caching components that take the pressure off the database.  Whereas an old app might query the DB directly for the product catalogue, a web app might get the images from a CDN and the product data from a cache, only falling back to the DB when the cache entry has expired.  From the customer's perspective they get the product info whether it comes from the cache or the DB.  There will be a difference in performance, but maybe not one the customer would notice.
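
The cache-then-DB fallback can be sketched roughly like this in Python. `TTLCache`, `get_product` and `load_from_db` are hypothetical names for illustration, not any particular caching product's API, and the 60-second expiry is an arbitrary assumption.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired, treat as a miss
            return None
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)


def get_product(product_id, cache, load_from_db):
    """Serve from the cache; fall back to the DB only on a miss or expiry."""
    product = cache.get(product_id)
    if product is None:
        product = load_from_db(product_id)
        cache.put(product_id, product)
    return product


db_calls = []


def load_from_db(product_id):
    """Stand-in for the real catalogue query; records each call."""
    db_calls.append(product_id)
    return {"id": product_id, "name": "Widget"}


cache = TTLCache(ttl_seconds=60)
first = get_product(42, cache, load_from_db)   # miss: goes to the DB
second = get_product(42, cache, load_from_db)  # hit: the DB is not touched
```

The caller gets the same product either way, which is the point: only the latency differs.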

    In retrospect I think in many cases the 100% uptime requirement was as much a badge of honour and a matter of personal pride as it was a pure business need.  The cloud presents a thing you consume rather than control.  As soon as you are not in control, you need a new badge of honour.

  • I haven't seen pressure to reach 0% downtime where I work, but we still struggle with some basic issues, so it may be that attention is on those rather than on overall uptime.  The approaches we are taking to help reduce potential downtime are:

    • Using feature switches (similar to how trace flags turn certain features on and off in SQL Server)
    • Having a rollback plan and scripts for each rollout we do, so if a problem is found, it is quick and easy to go back
    • Doing rollouts outside business hours, on days that are expected to be quiet
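
To illustrate the feature-switch idea from the list above, here is a bare-bones sketch in Python. The switch name `new_checkout` and the `checkout` function are made up for the example; in practice the switches would more likely live in a config table or service than in a module-level dict.

```python
# Hypothetical switch table; in a real system this would be loaded from
# configuration so it can be changed without a deployment.
FEATURE_SWITCHES = {"new_checkout": False}


def is_enabled(name, switches=FEATURE_SWITCHES):
    """Unknown switches default to off, so a typo cannot enable new code."""
    return switches.get(name, False)


def checkout(cart_total):
    """Route to the new code path only when its switch is on."""
    if is_enabled("new_checkout"):
        return f"new checkout: {cart_total:.2f}"
    return f"legacy checkout: {cart_total:.2f}"


# Roll out: flip the switch on.
FEATURE_SWITCHES["new_checkout"] = True
after_rollout = checkout(10)

# Problem found: flip it back off, no redeploy needed.
FEATURE_SWITCHES["new_checkout"] = False
after_rollback = checkout(10)
```

The new code ships dark and the switch becomes the rollback plan for that feature.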

    Outages are few, but when we do have one it is typically a problem with lower-level software or infrastructure rather than with the deployments themselves.  Examples include a file corruption issue we had with a third-party encryption tool, and network connectivity problems between our various datacenters (local datacenter, parent company datacenter, and AWS-hosted items).

  • I think folks are learning that 99.999% uptime is expensive, and even if you pay for it, there are operational reasons why the database will be offline on occasion anyhow.
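
For a sense of scale, here is the arithmetic behind "five nines" as a short Python sketch. It assumes a 365-day year and treats the figure as a percentage of total time; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Downtime allowed per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime allows {minutes:.2f} minutes of downtime a year")
```

At 99.999% the whole year's budget is a little over five minutes, which a single patching window would use up.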

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
