An Azure Outage

Steve Jones - SSC Editor

SSC Guru

Points: 734552
March 13, 2012 at 9:40 pm

#146706

Comments posted to this topic are about the item An Azure Outage

Orlando Colamatteo SSC Guru Points: 182276 · Answer 1

In my experience the weakest managers are the ones repeatedly leading a witch-hunt for this or that...terrible for morale and very anti-progress.

All moves have advantages and disadvantages. The cloud is not going to be right for all businesses, but I suspect it will be good enough for enough of them to prove it is here to stay.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

addieleman@outlook.com SSC Veteran Points: 246 · Answer 2

The last sentence: However I think lots of management might prefer in-house infrastructure for a simple reason: it gives them a specific neck to choke, and possibly replace, when things go wrong.

That may be true for some companies, but I've also seen the opposite: it's easier to put the blame on a third party because it looks like it frees managers from the duty to solve the problems. If SLAs are defined it's also easier to explain why you bash a service supplier or not.

paul.knibbs SSCoach Points: 15320 · Answer 3

I think the difference is, if one of your internal systems goes down for any reason, you're in control of getting it back up and running. If the "cloud" goes down, you're entirely in the hands of the company providing that service to restore your access, and this leaves you feeling a bit helpless. Plus, you kind of expect a company the size of Microsoft to have enough redundancy in place that you really shouldn't be getting 8-hour outages!

Phil Factor SSC-Insane Points: 20244 · Answer 4

Yes, it was a technical problem and we all have sympathy for these because we experience them, and are sometimes responsible for them. Azure has, in general, performed very well and this incident is uncharacteristic. For me, the problem was that Microsoft's marketing department had previously over-egged the pudding by talking up the resilience of Azure: 'Always up, Always on'. If they'd been more circumspect, and said that, on balance, there would be outages in any cloud service but these would probably be fewer than you'd expect from your own in-house IT infrastructure (the Azure SLA quotes 99.95% uptime), then it wouldn't have caused so much of a story. With marketing material, any IT manager needs to know by how much to dilute the claims, and they're likely to add plenty more water after this incident. After all, the occurrence of a leap year is rather more predictable than an earthquake.

Best wishes,
Phil Factor

Gary Varga SSC Guru Points: 82166 · Answer 5

This reminds me of when I was a passenger in a car recently. The driver was distracted by something on the other side of the road for a moment and noticed late that the road had started to bend. I confess I have done exactly the same. As a driver you have an "Oops!!!" moment whilst adjusting direction. As a passenger it's more like "Aaagggghhhh...we're all gonna die!!!". Basically, the driver notices the error and works on correcting it, safe in the knowledge that all is under control, whereas the passenger doesn't have any confidence until the adjustment is complete.

Anyone gone to pump the brakes whilst a passenger?

Gaz

-- Stop your grinnin' and drop your linen...they're everywhere!!!

paul s-306273 SSChampion Points: 10727 · Answer 6

addieleman (3/14/2012)
The last sentence: However I think lots of management might prefer in-house infrastructure for a simple reason: it gives them a specific neck to choke, and possibly replace, when things go wrong.
That may be true for some companies, but I've also seen the opposite: it's easier to put the blame on a third party because it looks like it frees managers from the duty to solve the problems. If SLAs are defined it's also easier to explain why you bash a service supplier or not.

Quite right - where I work any incident has the phrase 'we are working with our 3rd party suppliers...'.

It's never OUR fault.

phegedusich SSCommitted Points: 1552 · Answer 7

Phil Factor (3/14/2012): (the Azure SLA quotes 99.95% uptime).

So we shouldn't expect another outage for, oh, three years or so. Sounds good to me.

Redundant failover architecture should include the management tools, folks. I'm spouting because I don't know the nature of the problem or the technical solution, but hey, if the system were in-house, I'd know, wouldn't I?
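
For anyone who wants to check the downtime arithmetic behind the 99.95% figure quoted above, here is a minimal sketch. It assumes the percentage is measured over a fixed window (a 365-day year or a 30-day month, both assumptions on my part, since the thread doesn't say how the SLA is actually measured) and it uses the roughly 8-hour outage length mentioned earlier in the thread. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day billing month


def allowed_downtime_hours(uptime_percent, period_hours):
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return (1 - uptime_percent / 100) * period_hours


def periods_of_budget_used(outage_hours, uptime_percent, period_hours):
    """How many periods' worth of downtime budget a single outage consumes."""
    return outage_hours / allowed_downtime_hours(uptime_percent, period_hours)


if __name__ == "__main__":
    sla = 99.95       # uptime figure quoted in the thread
    outage = 8.0      # approximate outage length mentioned in the thread

    per_year = allowed_downtime_hours(sla, HOURS_PER_YEAR)
    per_month_minutes = allowed_downtime_hours(sla, HOURS_PER_MONTH) * 60
    years_used = periods_of_budget_used(outage, sla, HOURS_PER_YEAR)

    print(f"Allowed downtime per year:  {per_year:.2f} hours")
    print(f"Allowed downtime per month: {per_month_minutes:.1f} minutes")
    print(f"An {outage:.0f}-hour outage uses about {years_used:.2f} years of budget")
```

Under those assumptions the budget comes out at roughly 4.4 hours a year (about 22 minutes a month), so an 8-hour outage uses a little under two years' worth. If the SLA is measured per month, as many are, a single long outage simply breaches that month rather than "borrowing" from future ones.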

Gary Varga SSC Guru Points: 82166 · Answer 8

phegedusich (3/14/2012)
Phil Factor (3/14/2012): (the Azure SLA quotes 99.95% uptime).
So we shouldn't expect another outage for, oh, three years or so. Sounds good to me.
Redundant failover architecture should include the management tools, folks. I'm spouting because I don't know the nature of the problem or the technical solution, but hey, if the system were in-house, I'd know, wouldn't I?

Surely you would have to investigate before you knew anything beyond what was reported. Wouldn't you?

Gaz

-- Stop your grinnin' and drop your linen...they're everywhere!!!

Steve Jones - SSC Editor SSC Guru Points: 734552 · Answer 9

If you read the update and root cause analysis, this wasn't a redundancy issue. It was caused by a software bug, one that couldn't be fixed by more hardware. Developers had to build a fix, test it, and deploy it. This resulted in substantial delays, as many of us should be able to understand.

However, it also appears that MS wasn't as forthcoming initially, at least according to Gartner: http://blogs.gartner.com/kyle-hilgendorf/2012/03/09/azure-outage-customer-insights-a-week-later/

Apparently MS is offering credit for the day, which is something: http://www.zdnet.com/blog/microsoft/microsoft-to-provide-azure-users-with-33-percent-credit-for-february-outage/12154

Steve Jones - SSC Editor SSC Guru Points: 734552 · Answer 10

addieleman (3/14/2012)
The last sentence: However I think lots of management might prefer in-house infrastructure for a simple reason: it gives them a specific neck to choke, and possibly replace, when things go wrong.
That may be true for some companies, but I've also seen the opposite: it's easier to put the blame on a third party because it looks like it frees managers from the duty to solve the problems. If SLAs are defined it's also easier to explain why you bash a service supplier or not.

I've seen the opposite too, but it seems management above management doesn't want to hear this too often. If you pick the service, you're responsible as well. If the third party fails too often, the manager that's in charge of the third party gets choked.

At least that's been my experience.

TravisDBA SSCoach Points: 15780 · Answer 11

Too early for the complete post-mortem analysis of this IMHO. Wait at least another month or so until more facts have been gathered and then make a judgement based on ALL the gathered facts at that time. To use the cloud, or not use the cloud, is a big decision, and a company should not rush to judgement based on what is being reported now. 😀

"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

djackson 22568 SSChampion Points: 11733 · Answer 12

paul.knibbs (3/14/2012)
I think the difference is, if one of your internal systems goes down for any reason, you're in control of getting it back up and running. If the "cloud" goes down, you're entirely in the hands of the company providing that service to restore your access, and this leaves you feeling a bit helpless. Plus, you kind of expect a company the size of Microsoft to have enough redundancy in place that you really shouldn't be getting 8-hour outages!

You captured my thoughts very well. Those of us who have been around for a while remember Apple almost going under. How many other tech companies have failed over the years? Yet businesses are putting their data in the control of other companies, with no recourse when those companies fail - and some of them will fail.

Going to court to sue for data recovery is useless. Look at how many companies failed after 9/11 due to not having access to data.

There is a reason we study history.

Dave

IceDread SSCertifiable Points: 5050 · Answer 13

I don't understand how people can be so eager to give up control and security.

I do not trust that secret business data remains secret if I put it in a random cloud.

When clouds become something each large IT department can install, set up and maintain themselves, keeping it local and not under another company's control, then it's really worth looking into.

What happens if your one supplier closes shop, or makes changes you do not agree with to the services and the agreement you signed?

TravisDBA SSCoach Points: 15780 · Answer 14

There is a reason we study history.
Dave

True, Dave, but most people don't study it two weeks after it happened. Not enough time has gone by yet for the whole story to come out. :-D

"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"