Microsoft's Azure Outage; Three Reasons Why Such Things Happen and Three Steps to Avoid Them

This article is more than 10 years old.

Many years ago, over lunch with a friend, I theorized that there was a business to be created out of maintaining limited duration virtual property (that is, stuff that needs renewing) such as domain names and SSL certificates held by organizations. My friend disagreed arguing that the demand for such a service would be low as corporations usually had that sort of thing under control.

I forgot about the idea and then, a month or so later, someone came out with exactly that service for domain names. Meh.

I think my original intuition about organizations and periodic virtual properties (PVPs) was more relevant than I thought and, as if more proof were needed, it appears even Microsoft can't keep on top of such things.

Consider the company's recent snafu of allowing an SSL certificate to expire which, in turn, effectively disabled secure access to their Azure cloud platform from 12:44 PM PST on Friday, February 22, for over 12 hours.

The problem for organizations is that PVPs, unless they are a routine concern of a department or business unit, become side issues; things that exist on someone's calendar rather than a group's calendar.

Just consider how often you forget things that aren't apparently critical; you meant to call your mother to see how her cat is doing after the surgery or you forgot to get your car serviced.

At an organizational level something as technically-oriented and infrequent as renewing a digital security certificate can be much like your mother's cat or your car service ... they all seem unimportant until you discover that your mother is really upset because Fluffy didn't make it and you obviously just don't care, your engine seizes because it ran out of oil and it turns out your warning light wasn't working, or your multi-million dollar Web service stops working because someone just forgot to renew a certificate.

There are three reasons why PVPs get overlooked:

  1. Inadequate Initial Recognition of Importance: If the importance of a PVP isn't correctly framed when it is first acquired or created by the organization, and no one is given or takes responsibility for it, then it's pretty certain that its importance will only be discovered the first time the PVP isn't renewed or restarted or whatever is required.
  2. Organizational Incompetence: Many PVPs seem too "narrow" to the organization (too techie, understood by only a handful of people, not that expensive up front, etc.) to be considered important, because no one understands the implications ... even when the PVPs are actually make-or-break issues.
  3. Failure of Organizational Memory: Organizations have memories about the things they do but these are animated by people; if those people leave the organization or get assigned conflicting responsibilities that minimize the importance of a PVP, the PVP will effectively be forgotten ... until not servicing the PVP causes chaos, then it gets remembered.

In every case the problem is a failure to plan strategically which, in turn, requires asking what could go wrong and when it does,

There are three steps in avoiding PVP failures:

  1. Identifying: PVPs live in plans ... when you're going to build something, locate the failure points and those periodic virtual properties that require servicing.
  2. Quantifying: If a PVP fails - in other words, if it isn't serviced - what are the consequences? Will molten fire rain from the sky or will the Web service stop working? What will the costs be including lost revenue and repair?
  3. Assigning Responsibility: Identify which group should take ownership of a PVP and ensure that there's a recurring review process that's tied to line of business and individual performance measurements.
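
For the technically inclined, the three steps above can be sketched as a tiny registry: each PVP is listed (identifying), carries a rough cost of failure (quantifying), and names an owning group (assigning responsibility). This is only an illustration, not a description of how any particular company tracks renewals; the `PVP` class, the `due_for_renewal` function, the entries, dates, owners, and dollar figures are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PVP:
    """One periodic virtual property (hypothetical record layout)."""
    name: str         # what needs renewing
    expires: date     # when it lapses if nobody acts
    owner: str        # the group that owns it, not one person's calendar
    outage_cost: int  # rough cost of a lapse in dollars (invented figures)


def due_for_renewal(pvps, today, warn_days=45):
    """Return PVPs that expire within warn_days (or already have), costliest first."""
    cutoff = today + timedelta(days=warn_days)
    due = [p for p in pvps if p.expires <= cutoff]
    return sorted(due, key=lambda p: p.outage_cost, reverse=True)


# Hypothetical registry entries, for illustration only.
registry = [
    PVP("example.com domain name", date(2030, 9, 1), "web-ops", 50_000),
    PVP("cloud platform SSL certificate", date(2030, 2, 22), "platform-security", 2_000_000),
    PVP("partner API SSL certificate", date(2030, 3, 10), "integrations", 200_000),
]

today = date(2030, 2, 1)
for p in due_for_renewal(registry, today):
    days_left = (p.expires - today).days
    print(f"{p.name}: {days_left} days left, owner: {p.owner}, "
          f"estimated cost of lapse: ${p.outage_cost:,}")
```

Sorting by cost rather than by date is a deliberate choice in this sketch: it puts the make-or-break items at the top of the review, which is the point of the quantifying step. A real process would still need the recurring review described in step 3; a script that nobody owns is just another PVP.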

This is really part of disaster planning which is both an art and a discipline. It's about examining a plan and finding the PVPs then identifying risks, consequences, and maintenance requirements ... essentially it's about bringing relevancy and responsibility to seemingly tiny details. Tiny details that can ruin a career.

So, who's getting fired at Microsoft?