What's Your Downtime?

  • Comments posted to this topic are about the item What's Your Downtime?

  • Well, concerning downtimes of Azure, here's some recent experience of mine:

    I have attended 2 SQL events so far this year, and at both of them speakers' demos broke because Azure was unreachable for one reason or another. At the second event, the presenters were at least well prepared: they had prerecorded what they wanted to demo, so they played the videos instead 🙂

    But what impression does that make on people interested in cloud solutions...?

  • Sometimes the downtime profile varies with the uptime requirements. For example, a recent system that I worked on had an 08:00-18:00 uptime requirement, so it was expected that there would be zero planned downtime. Any downtime would be due to the lack of a robust solution.

    Of course, sometimes downtime is preferable to spending more money. That is someone else's decision. I just have to explain the whys and wherefores of the solutions on the table.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • My company runs a 24/7 business. As a DBA I take down our main DB servers for maintenance every 6th week for about 1 to 1 1/2 hours each time. Total downtime adds up to 10-11 hours over a year. We haven't had any downtime except these planned service windows for the last 3-4 years.
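
    As a rough check on those figures: one window every 6th week is about 52/6 ≈ 8.7 windows a year, so 1 to 1.5 hours each gives somewhere between 8.7 and 13 hours, which brackets the 10-11 hours quoted. The sketch below turns that into an availability percentage. The function name and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def availability_pct(downtime_hours: float) -> float:
    """Percentage of the year the system is up, given total downtime in hours."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


windows_per_year = 52 / 6        # one maintenance window every 6th week
low = windows_per_year * 1.0     # hours per year at 1 hour per window
high = windows_per_year * 1.5    # hours per year at 1.5 hours per window

print(f"planned downtime: {low:.1f} to {high:.1f} hours/year")
print(f"availability at 10.5 h/year: {availability_pct(10.5):.2f}%")
```

    At 10.5 hours a year that works out to about 99.88%, a little short of "three nines" (99.9%, roughly 8.8 hours a year).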

  • I support three different key systems. I developed three Windows services (the first was used as a shell for the other two) that run at periodic intervals to do diagnostic checks for each system; the services check for a valid web login page and for database connectivity (open, close). After hours and on weekends and holidays, they poll the systems less frequently.

    Depending on the error, it will notify the server/networking group and/or the DBAs. If it gets a network error connecting to SQL Server, it adds a notification to server/networking.
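
    A minimal sketch of that schedule and routing logic might look like the following. Everything specific here is an assumption for illustration: the interval values, the 08:00-18:00 business hours, the failure categories, and the exact mapping of errors to groups are not taken from the actual services.

```python
from datetime import datetime, time
from enum import Enum


class Failure(Enum):
    WEB_LOGIN = "web login page invalid"
    DB_CONNECT = "database open/close failed"
    NETWORK = "network error reaching SQL Server"


# Hypothetical intervals; the post only says off-hours polling is less frequent.
BUSINESS_INTERVAL_S = 300
OFF_HOURS_INTERVAL_S = 1800


def poll_interval(now: datetime, holidays: frozenset = frozenset()) -> int:
    """Return seconds until the next check, polling less often off-hours."""
    is_weekend = now.weekday() >= 5  # Saturday=5, Sunday=6
    in_hours = time(8, 0) <= now.time() < time(18, 0)  # assumed business hours
    if is_weekend or now.date() in holidays or not in_hours:
        return OFF_HOURS_INTERVAL_S
    return BUSINESS_INTERVAL_S


def recipients(failure: Failure) -> set:
    """Map a failure type to the groups to notify (assumed routing)."""
    if failure is Failure.NETWORK:
        # A network error adds server/networking on top of the DBAs.
        return {"dba", "server/networking"}
    if failure is Failure.DB_CONNECT:
        return {"dba"}
    return {"server/networking"}
```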

    By examining the event log, I can see approximately when the system failed and when it went back online.
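
    One way to reconstruct those windows from logged check results is sketched below. It assumes the log can be reduced to time-ordered (timestamp, ok) pairs, which is a hypothetical format. The times are only approximate because they are no finer than the polling interval.

```python
from datetime import datetime


def outage_windows(checks):
    """Return (first_failed_check, first_good_check_after) pairs.

    `checks` is a time-ordered iterable of (timestamp, ok) tuples.
    An outage still open at the end of the log gets None as its end.
    """
    windows = []
    down_since = None
    for stamp, ok in checks:
        if not ok and down_since is None:
            down_since = stamp                   # first failed check of an outage
        elif ok and down_since is not None:
            windows.append((down_since, stamp))  # first good check ends it
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))       # still down at end of log
    return windows


sample = [
    (datetime(2016, 4, 15, 9, 0), True),
    (datetime(2016, 4, 15, 9, 5), False),
    (datetime(2016, 4, 15, 9, 10), False),
    (datetime(2016, 4, 15, 9, 15), True),
    (datetime(2016, 4, 15, 9, 20), False),
]
for start, end in outage_windows(sample):
    print(start, "->", end if end is not None else "still down")
```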

    One of the systems can fail if a non-numeric character gets into a key field, so that service does a SELECT DISTINCT and runs through the rows. That system broke shortly after it "was installed". I did a UNIX-type strings dump on the proprietary DLL looking for interesting SQL code and found the problem: "MAX(CONVERT(int, AdHocId))".

    I am not the DBA. Well, I am for the Postgres system, since we don't have that expertise in-house.
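
    The row-by-row check can be sketched like this, assuming the distinct key values have already been fetched as strings. The function name is hypothetical, and the strict digits-only rule is an assumption: SQL Server's own conversion rules may accept or reject some edge cases differently (signs, surrounding spaces, empty strings), so treat this as a conservative filter rather than an exact model of CONVERT.

```python
import re

_DIGITS_ONLY = re.compile(r"[0-9]+")


def non_numeric_keys(values):
    """Return the key values that are not plain ASCII digit strings.

    None (SQL NULL) is skipped on the assumption that it does not
    break the conversion.
    """
    return [
        v for v in values
        if v is not None and not _DIGITS_ONLY.fullmatch(v)
    ]


# Stand-in for the rows returned by the SELECT DISTINCT on the key field.
distinct_ids = ["101", "102", "10A", None, "7", ""]
print(non_numeric_keys(distinct_ids))
```

    Any value this returns is a candidate for breaking a query like the one found in the DLL.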

  • Thomas Hütter (4/15/2016)


    Well, concerning downtimes of Azure, here's some recent experience of mine:

    I have attended 2 SQL events so far this year, and at both of them speakers' demos broke because Azure was unreachable for one reason or another. At the second event, the presenters were at least well prepared: they had prerecorded what they wanted to demo, so they played the videos instead 🙂

    But what impression does that make on people interested in cloud solutions...?

    Maybe I'm wrong about how it works, but my feeling is that the reliability of any given cloud database probably depends on the level of service purchased. For example, I wouldn't be surprised if the free or demo account that comes bundled with an MSDN subscription were less reliable than the service provided to an enterprise customer paying $$,$$$ per month. I also suspect that the wifi internet connectivity at these conferences is spotty.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Aggregate downtime per year is a common baseline measurement, but it may not be the right question. In many cases it's really a matter of how much contiguous downtime can be tolerated. For example, if the service were unavailable for 12 hours straight, that would be a disaster for an eCommerce website, even if the service were rock solid for the remainder of the year. However, if the service were unavailable sporadically throughout the day for intervals of 15 - 60 seconds, then middleware frameworks could use data caching to mitigate that, and it might not even impact the customer experience.
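
    To make the distinction concrete, the sketch below reports both the aggregate and the longest contiguous downtime for a set of outages. It assumes outages are given as non-overlapping (start, end) pairs; the function name and the sample data are invented for illustration.

```python
from datetime import datetime, timedelta


def downtime_profile(outages):
    """Return (total, longest) downtime for a list of (start, end) pairs."""
    durations = [end - start for start, end in outages]
    total = sum(durations, timedelta())
    longest = max(durations, default=timedelta())
    return total, longest


day = datetime(2016, 4, 15)

# 720 one-minute blips spread across the day, one every two minutes.
blips = [
    (day + timedelta(minutes=2 * i), day + timedelta(minutes=2 * i, seconds=60))
    for i in range(720)
]
# A single 12-hour outage.
single = [(day, day + timedelta(hours=12))]

for name, outages in (("blips", blips), ("single", single)):
    total, longest = downtime_profile(outages)
    print(f"{name}: total={total}, longest contiguous={longest}")
```

    Both cases add up to the same 12 hours of aggregate downtime, but the longest contiguous stretch is one minute in the first and the full 12 hours in the second.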

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • We had an interesting outage earlier this month. My boss and Microsoft were working on a certificate issue on my hosted server and needed to reboot it. After the reboot, the server wasn't accessible. I tried to get in to the vendor's control panel to see if maybe they'd done a shutdown instead of a restart, and the control panel wasn't available.

    The entire data center had crashed. They were down 30-40 minutes or so. I knew about the outage and its restoration long before I got emails from them as I get a high memory usage alert whenever my server starts up from a power off. Fortunately my server is on internal testing at the moment and not in full production.

    For me, the worst thing about a cloud provider (and I hate that term -- it's just a VM on someone else's hardware, and that's all it is! We've been doing things like this for years!) is that you have no way of knowing what's happened or of guessing how long the server will be unavailable. At least if it's in your data center and a drive in your RAID fails, you can provide an estimate of how long it will take to replace. If the entire server goes up in smoke, you can provide a SWAG as to how long it will take to restore to another box. But you probably will never get that level of information from a service provider. You just have to wait until it's up again, and you'll never know the details of what happened.

    -----
    Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson

  • Thomas Hütter (4/15/2016)


    Well, concerning downtimes of Azure, here's some recent experience of mine:

    I have attended 2 SQL events so far this year, and at both of them speakers' demos broke because Azure was unreachable for one reason or another. At the second event, the presenters were at least well prepared: they had prerecorded what they wanted to demo, so they played the videos instead 🙂

    But what impression does that make on people interested in cloud solutions...?

    I don't worry about this, unless I'm presenting. Bandwidth and connections from presentation venues are usually flaky, though they are getting better.

  • Terje Hermanseter (4/15/2016)


    My company runs a 24/7 business. As a DBA I take down our main DB servers for maintenance every 6th week for about 1 to 1 1/2 hours each time. Total downtime adds up to 10-11 hours over a year. We haven't had any downtime except these planned service windows for the last 3-4 years.

    That's pretty good. I used to get Sat midnight - Sun 4 am once a quarter for maintenance. We stuck to that pretty well.

  • Eric M Russell (4/15/2016)


    Aggregate downtime per year is a common baseline measurement, but it may not be the right question. In many cases it's really a matter of how much contiguous downtime can be tolerated. For example, if the service were unavailable for 12 hours straight, that would be a disaster for an eCommerce website, even if the service were rock solid for the remainder of the year. However, if the service were unavailable sporadically throughout the day for intervals of 15 - 60 seconds, then middleware frameworks could use data caching to mitigate that, and it might not even impact the customer experience.

    Great points. 5 hours a year might not be great for some businesses if it's all on Black Friday.

  • Wayne West (4/15/2016)


    ...

    For me, the worst thing about a cloud provider (and I hate that term -- it's just a VM on someone else's hardware, and that's all it is! We've been doing things like this for years!) is that you have no way of knowing what's happened or of guessing how long the server will be unavailable. At least if it's in your data center and a drive in your RAID fails, you can provide an estimate of how long it will take to replace. If the entire server goes up in smoke, you can provide a SWAG as to how long it will take to restore to another box. But you probably will never get that level of information from a service provider. You just have to wait until it's up again, and you'll never know the details of what happened.

    Maybe. I worked in a large Fortune 500 company (12k+ employees), and someone dropped a tool during UPS maintenance one afternoon. Most of the data center, including critical sales and accounting systems, went down. This was a data center we owned, in our building, with thousands of machines in it. The CTO was in the building and went inside to check. A number of techs couldn't tell him how long it would be, since the UPSes were down and they weren't sure when they could bring systems up with power. Expertise inside a company is sometimes really limited.

    I'd hope a cloud provider wouldn't have this happen and would have plans for multiple redundancies and the ability to move systems elsewhere. However, it could go either way.

    However, I would note that it's true, you have no idea what's going on. Some providers are better than others at updating clients, but there's probably always some withholding of information.

  • Steve Jones - SSC Editor (4/15/2016)


    Maybe. I worked in a large Fortune 500 company (12k+ employees), and someone dropped a tool during UPS maintenance one afternoon. Most of the data center, including critical sales and accounting systems, went down. This was a data center we owned, in our building, with thousands of machines in it. The CTO was in the building and went inside to check. A number of techs couldn't tell him how long it would be, since the UPSes were down and they weren't sure when they could bring systems up with power. Expertise inside a company is sometimes really limited.

    I'd hope a cloud provider wouldn't have this happen and would have plans for multiple redundancies and the ability to move systems elsewhere. However, it could go either way.

    However, I would note that it's true, you have no idea what's going on. Some providers are better than others at updating clients, but there's probably always some withholding of information.

    Ouch! We had something similar happen when I was working at a police department in the '90s. Someone had taken the generator offline to service it and didn't switch it back into the circuit. There was an area power outage, and apparently that switchout also disconnected the first-line battery backups. They weren't UPSes; it was a room full of 48 VDC lead acid batteries, each about 3x the volume of a car battery. The mainframe dropped, the minis dropped, I think the LAN servers stayed up as they had their own UPSes, but I think the LAN switches and routers crashed.

    Fortunately dispatch and 911 were in another building with their own minis, so those operations were unaffected.

    It was not pretty. And it's not uncommon.

    -----
    Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson

  • Steve Jones - SSC Editor (4/15/2016)


    I'd hope a cloud provider wouldn't have this happen and would have plans for multiple redundancies and the ability to move systems elsewhere. However, it could go either way.

    And, yet, they have "downtime".

    And wasn't it just a little over a year ago when some major provider had something like more than a week of downtime?

    If the data is local, you at least have a chance of surviving something like that even if you have to stand up something temporary. If it's under someone else's umbrella, you're pretty much toast in such a situation. Providers on the cloud need to make things a whole lot more reliable and secure before I succumb to the draw.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • I also suspect that the wifi internet connectivity at these conferences is spotty.

    For the second one, that might have been the case. The other one was in February, and at that time the reason was confirmed to be an outage in the Azure credentials verification mechanism (sorry I can't describe this any better - I'm not into Azure at all... 😛 )

    [And that was even a Microsoftie presenting... FWIW]

