Site Reliability Engineering vs. DevOps

I’m honestly not a fan of pitting concepts against one another. Our goal should be extremely simple: ensure that we’re helping our organization deliver functionality in a fast, safe manner. The precise, moment-by-moment details of this really don’t matter much. There are two major thoughts and approaches, both of which are trying to achieve the same thing: Site Reliability Engineering (SRE) and DevOps. While they have a lot in common, each one has some unique features.

Rather than try to sell you on either, let’s just discuss both and then discuss how we could decide which approach is a better fit for our organization.

Site Reliability Engineering

SRE originated within Google to deal with the challenges of software development and ongoing operations. The focus is very much on using tools to support automation of processes. However, it’s also looked on as an actual role within an organization. There must be a Site Reliability Engineer in order to successfully implement Site Reliability Engineering.

The overall purpose of SRE is in the middle word of its name: Reliability. When implementing an SRE approach, your first focus is doing whatever is necessary to keep your systems online and available. Most of the automation and testing in SRE serves this goal.

SRE has several core principles that are worth noting:

Automation: SRE treats operations as a software problem and automates every aspect of operations, from deployment to monitoring. Automation comes first, so it is applied everywhere, always with an eye toward finding more to automate.
Service Level Objectives: The engineers on an SRE team will work with other teams to define Service Level Agreements (SLAs) and Service Level Indicators (SLIs) that define what is needed in terms of reliability and how it can best be measured. The two are combined to produce Service Level Objectives (SLOs), which use the SLIs and SLAs to define what must be built.
Monitoring: Because SRE embraces automation, distributed systems, and cutting-edge tooling, monitoring, especially of distributed systems, becomes a vital part of a successful SRE implementation. Further, you can only know you’re meeting your SLOs and SLAs through your SLIs, which you obtain through monitoring.
Preparation: Embracing development and bringing developers into the operations team, so that their skills help you prepare for outages, is fundamental to SRE. This is all in support of ensuring the reliability of the systems being developed.
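
To make the relationship between monitoring, SLIs and SLOs a little more concrete, here is a minimal sketch in Python. It assumes the SLI is availability, measured as the fraction of successful requests in a window, and that the SLO is simply a target that measurement is compared against. The names AvailabilityWindow and meets_slo, the request counts, and the targets are all made up for illustration; they don’t come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityWindow:
    """Request counts collected by monitoring over one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int

    def sli(self) -> float:
        """Availability SLI: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            # Assumption: a window with no traffic counts as fully available.
            return 1.0
        return (self.total_requests - self.failed_requests) / self.total_requests


def meets_slo(window: AvailabilityWindow, slo_target: float) -> bool:
    """Compare the measured SLI against the agreed SLO target (e.g. 0.999)."""
    return window.sli() >= slo_target


if __name__ == "__main__":
    # Illustrative numbers only: 400 failures out of one million requests.
    window = AvailabilityWindow(total_requests=1_000_000, failed_requests=400)
    print("SLI:", window.sli())
    print("Meets 99.9% SLO:", meets_slo(window, 0.999))
    print("Meets 99.99% SLO:", meets_slo(window, 0.9999))
```

With these sample numbers, the window satisfies a 99.9% target but misses a 99.99% one. In practice, the counts would come from whatever monitoring you already have in place, and the target from the conversations with other teams described above.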

If you wanted to sum up SRE in a nutshell, you could say it’s a focus on reliability that uses automation and brings in the development team to help.

DevOps

DevOps has a much more organic history, coming from multiple organizations and disciplines. The focus is absolutely on development, but it intentionally brings in every IT team, and, when done well, management and the business, all as part of a fundamental shift in how functionality is defined and deployed. The result then is a process that is much broader, but far less well defined, than SRE.

DevOps was summed up rather nicely by Donovan Brown:

“DevOps is the union of people, process and products to enable continuous delivery of value to our end users.”

The principles of DevOps, then, are as follows:

Communication: Tearing down metaphorical walls and eliminating silos is the most fundamental aspect of implementing DevOps. The integration of multiple disciplines to enhance communication between those disciplines is at the heart of the union of people, process and products.
Automation: Automate everything, but especially, automate testing. The focus on automation is also foundational to ensuring that you’re eliminating silos between teams.
Continuous Delivery: Through the application of automation and communication, the ability to deliver software and services becomes faster as well as safer.
Fail Early and Often: DevOps encourages failure early in the process, in development and other environments, so that problems are caught before they can reach production. The goal is better protection for production environments.

Summing up DevOps, we could say it’s a focus on fast development and deployment using automation to help bring in all the other teams.

Choosing one over the other

If I were forced to pick either SRE or DevOps, which would I choose? Well, I would argue that it’s a false choice.

There is nothing within SRE that precludes continuous delivery even though it’s not one of the core tenets. There’s also nothing within SRE that would prevent you from bringing in more business involvement to help achieve reliability. The rest of SRE is very much in support of the goals of DevOps. Automation and preparation would fit very neatly within a DevOps paradigm.

Conversely, there are no dictates within DevOps that suggest you shouldn’t have SLAs, SLIs and SLOs. Just the opposite, in fact. You should also implement monitoring in order to meet the requirements of automated testing and continuous delivery. While nothing in DevOps defines specific roles, nothing in it rules out having a Site Reliability Engineer in support of your DevOps process either.

In short, I don’t think you are forced to pick either of these approaches. I believe that you can adopt either, or both, with equal success. It really depends on where your problems principally lie and how best to go about addressing them within your organization.

Conclusion

If you have no challenges with either your development or operations processes at work, you can safely ignore both SRE and DevOps. However, most of us are struggling in one way or another. Since either approach can be embraced without eliminating the other, I’d suggest focusing on the things you need to fix in your environment. Then, use better communication, collaboration, testing, automation and tooling to help you address those issues.
