Downtime

  • Comments posted to this topic are about the item Downtime

  • I once had a role commissioning a set of servers and associated hardware in what was, at the time, Europe's biggest data centre. The servers housed various internet and intranet offerings that were critical to the company. Given my background as a software developer, it felt odd to spend days standing in Matrix-esque aisles pulling out cables to simulate failing hardware and the like.

    At the time I thought it was a little ridiculous; looking back, it was perhaps not as ridiculous as not doing it at all would have been.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • I went to Amazon's first big AWS conference with a client. We attended a session that included a discussion of Netflix's Chaos Monkey. My client was so impressed that when we got back to Canada he announced that at least once a week the "Chaos Bear" would walk into the server room and unplug any undocumented hardware. This was terrifying, but at the same time it motivated the staff to finish their documentation and properly label all the racks and racks of hardware.

    An interesting consequence was that sometimes servers would be shut down and nobody would notice. Those servers were NOT turned back on, and did not survive the move to another data center.
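
    For anyone who hasn't seen the software version, the idea is simple: pick a random target, take it down, and see whether anything (or anyone) notices. Below is a minimal, purely illustrative sketch in Python; the hosts.txt inventory file, the passwordless-SSH shutdown command, and the DRY_RUN flag are all assumptions for the example, not anything from Netflix's actual tool.

        # Hypothetical "chaos monkey"-style sketch: pick one host at random from an
        # inventory file and, when not in dry-run mode, ask it to shut down over SSH.
        import random
        import subprocess
        import sys

        DRY_RUN = True  # keep True until failover and backups are trusted

        def load_inventory(path="hosts.txt"):
            """Read one hostname per line; skip blanks and # comments."""
            with open(path) as f:
                return [line.strip() for line in f
                        if line.strip() and not line.startswith("#")]

        def main():
            hosts = load_inventory()
            if not hosts:
                sys.exit("No hosts in inventory; nothing to break today.")
            victim = random.choice(hosts)
            print(f"Chaos target for this run: {victim}")
            if DRY_RUN:
                print("Dry run only -- no shutdown issued.")
                return
            # Assumes passwordless SSH and sudo rights on the target host.
            subprocess.run(["ssh", victim, "sudo", "shutdown", "-h", "now"], check=True)

        if __name__ == "__main__":
            main()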

  • Datagod-309892 (3/3/2014)


    I went to Amazon's first big AWS conference with a client. We attended a session that included a discussion of Netflix's Chaos Monkey. My client was so impressed that when we got back to Canada he announced that at least once a week the "Chaos Bear" would walk into the server room and unplug any undocumented hardware. This was terrifying, but at the same time it motivated the staff to finish their documentation and properly label all the racks and racks of hardware.

    An interesting consequence was that sometimes servers would be shut down and nobody would notice. Those servers were NOT turned back on, and did not survive the move to another data center.

    I wonder if the same technique can be applied to staff? :Whistling:

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • It is remarkable when organizations change from risk avoidance to risk preparedness. Many companies build systems (computer or otherwise) and hope the big issue never happens. Eventually, it does.

    I was called out as crazy the first time I walked into our data center to perform “pull the plug” testing in the middle of the day. We found issues and needed an MS patch to be 100% ready to go. We had a few minutes of downtime during the test, but it would have been much longer had the failure occurred in the middle of the night with nobody ready to work the issue.

  • We do everything we can to keep our applications online. We avoid patches...

    I would much rather take a small amount of periodic planned downtime to patch than have a large amount of unplanned downtime when my unpatched systems get infected, compromised or crash due to a bug that should've been patched. Plus there are often optimizations that come from patching as well.

  • Tritoch (3/3/2014)


    We do everything we can to keep our applications online. We avoid patches...

    I would much rather take a small amount of periodic planned downtime to patch than have a large amount of unplanned downtime when my unpatched systems get infected, compromised or crash due to a bug that should've been patched. Plus there are often optimizations that come from patching as well.

    I think there is a balance to be struck. For in-house applications I would hope that the development team wouldn't get the impression that they could release on an ad hoc basis. I know, and benefit from, the agile principle of releasing little and often, but updates to systems that interact with others should not be taken lightly.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Tritoch (3/3/2014)


    We do everything we can to keep our applications online. We avoid patches...

    I would much rather take a small amount of periodic planned downtime to patch than have a large amount of unplanned downtime when my unpatched systems get infected, compromised or crash due to a bug that should've been patched. Plus there are often optimizations that come from patching as well.

    Agreed. One of the data centers I worked in did not keep patches up to date; then Slammer was released. Building systems with the near-monthly patch reboot in mind goes hand in hand with other HA requirements.

  • Datagod-309892 (3/3/2014)


    I went to Amazon's first big AWS conference with a client. We attended a session that included a discussion of Netflix's Chaos Monkey. My client was so impressed that when we got back to Canada he announced that at least once a week the "Chaos Bear" would walk into the server room and unplug any undocumented hardware. This was terrifying, but at the same time it motivated the staff to finish their documentation and properly label all the racks and racks of hardware.

    An interesting consequence was that sometimes servers would be shut down and nobody would notice. Those servers were NOT turned back on, and did not survive the move to another data center.

    Nice. I've always assumed lots of servers weren't necessarily being used, or at least not regularly, in many data centers. It would be worth shutting them down, or these days doing a p->v conversion, and bringing them back up when needed.

  • Steve's p->v would give me the security that I would require for such a strategy. Maybe I am just a little more risk averse.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • If you think the Chaos Monkey could help you, let us know.

    I think it would drive me bananas.

  • We do the Chaos Monkey in software development all the time; I am kind of surprised no one has recognized this. Testers are supposed to find ways to break software and are generally rewarded for finding issues.

    Someone also mentioned the old trick of turning off a server and seeing if anyone complains. We do the same with software and reports: when we upgrade, if we're not sure a report is used, it isn't upgraded. Now if we could just learn to document everything, maybe we wouldn't need to do this. 😀

    Chaos rarely reigns when we do the above; it's just painful for a little while.

  • I like the idea of the chaos monkey. Outages need to be tested and processes practiced. Without the practice, people forget what to do or they act slowly trying to figure it out. With a planned outage, you get the chance to practice and make sure the process works.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Well, my last company didn't have a Chaos Monkey, but at least once a year we had to fire up all the systems at the DR site. That included getting a temp code for some systems if needed. We explained to the vendors that if they couldn't handle us doing a DR test, then we would have to find a new vendor. Not a single one let us down.

    We also had some functions at the DR site that we would test back at the primary location. (The GroupWise e-mail setup, drawn out as a diagram, looked like a cloverleaf with a small highway to nowhere. 😎 )

    At the same time, the need for a Chaos Monkey was limited by the simple fact that basically every single production server was touched every day.

    There was a stack of test servers that weren't used every day, but the upgrade cycle of over twenty-five software packages usually touched any given group of them within two to three weeks.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.
