What's Downtime?

There was a time when I worked for a company that sold products online. Since our wares could be purchased at any time of the day or night, we wanted to ensure that our systems were running all the time. This led us to build some sort of monitoring, which we tried. That led us to buy some monitoring software, which we did. This led us to build more tools, and for a while it felt like we were in an endless loop.

Eventually we stepped back and tried to answer the question that many Operations people have asked themselves and others: what is downtime?

It's a tough question, and I want to give you a few examples of how I've viewed things, and of debates I've had. For example, we had a database server and a web server. We used a simple script to ensure that the services (IIS and SQL) were running on both machines. If they weren't, we received a page. Is that sufficient to detect whether our system is working?

We also had a process that would ping our web server from outside the data center, using a public machine. If that works, is the system working?
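To make the gap between these checks concrete, here is a minimal sketch in Python. It is not the script we used. The check names and the stand-in lambdas are hypothetical, and a real version would query the service manager, ping the server from outside, and request an actual page. The point is only that checks at different layers can disagree, so a single pass or fail answer hides information.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as a failure."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    return results


def summarize(results: List[CheckResult]) -> str:
    """Collapse the individual results into one coarse status."""
    if not results:
        return "unknown"
    if all(r.passed for r in results):
        return "up"
    if not any(r.passed for r in results):
        return "down"
    return "degraded"


if __name__ == "__main__":
    # Hypothetical stand-ins for real probes (assumptions, not real checks).
    checks = {
        "services_running": lambda: True,    # IIS and SQL are started
        "external_ping": lambda: True,       # server answers from outside
        "order_page_renders": lambda: False, # a page broke after a deploy
    }
    for result in run_checks(checks):
        print(result.name, "ok" if result.passed else "FAILED")
    print(summarize(run_checks(checks)))
```

With these stand-in values the services are running and the ping succeeds, yet the overall status is "degraded" because a page is broken. That is exactly the situation the simple checks miss.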

In this job, we deployed new code every week, in a DevOps-style process that existed before anyone had ever uttered the term. These updates sometimes included schema changes, and almost always included application changes. If a page on our website broke after a deployment, was our system up or down?

We integrated with some third-party software to perform various tasks. There were times when we couldn't communicate with the third party, or received broken communications. In those cases, were we up or down?

We built our application to work with multiple browsers, but at times a new piece of functionality wouldn't render or work correctly in either a new browser (Firefox) or an old one (IE6). Did that mean the application was down?

Uptime isn't a single measurement. Even when you provide mechanisms that ensure all parts of your application are working, are they working for everyone? Many of us have seen this on online calls, where a system like GoToMeeting or Skype works for some of the audience and not for others. I see this at times with Microsoft sites, where some of us can use one of their online systems but others can't, sometimes because of the end user's browser.

I was thinking about this while researching zero-downtime deployments, which can be hard for database changes. Some people have success, but many others don't. At Redgate Software, we are trying to build tools to make this easier for everyone, but there seem to be plenty of edge cases that cause issues. There are also many different processes and flows that groups use for database development, and these often affect the final deployments. It is hard to build a general solution that has to work in every specific environment.

I tend to lean towards measuring uptime of the systems I'm responsible for and letting others worry about intermediate infrastructure. I'll caveat that with the note that I sometimes worry only about particular sections of the system and whether those are broken. It's good to be clear about this when discussing the topic with others. For example, we might be able to take orders, but not report on them or add new customers. That's downtime for some sections of our application, but it's less stressful than if we couldn't take orders.
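One way to report downtime by section rather than as a single number is sketched below in Python. This is an illustration under stated assumptions, not a description of any tool we used: the section names, the outage log format (section, start minute, end minute), and the function name are all hypothetical. Overlapping outages for the same section are counted once.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (section, start_minute, end_minute), measured from the start of the window
Outage = Tuple[str, int, int]


def availability_by_section(
    outages: Iterable[Outage],
    window_minutes: int,
    sections: List[str],
) -> Dict[str, float]:
    """Return the percentage of the window each section was available."""
    by_section = defaultdict(list)
    for section, start, end in outages:
        by_section[section].append((start, end))

    report = {}
    for section in sections:
        down_minutes = 0
        covered_until = 0  # end of the downtime already counted
        for start, end in sorted(by_section[section]):
            start = max(start, covered_until)  # skip overlap already counted
            end = min(end, window_minutes)     # clip to the reporting window
            if end > start:
                down_minutes += end - start
                covered_until = end
        report[section] = 100.0 * (window_minutes - down_minutes) / window_minutes
    return report


if __name__ == "__main__":
    # Hypothetical outage log for a 1,000 minute window.
    log = [
        ("reporting", 0, 100),
        ("reporting", 50, 150),  # overlaps the previous outage
        ("orders", 10, 20),
    ]
    sections = ["orders", "reporting", "customers"]
    for name, pct in availability_by_section(log, 1000, sections).items():
        print(f"{name}: {pct:.1f}% available")
```

A report like this lets the group agree in advance on which sections count toward the uptime number that matters most, such as order taking, and which are tracked separately.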

Let us know today. How do you measure downtime or uptime, and where is your responsibility?

