Site reliability engineering (SRE) is a discipline for building highly scalable and reliable software systems by applying software engineering practices to infrastructure and operations problems. Expert site reliability engineers can craft solutions that strike a balance between the needs of development and operations teams. Google pioneered this role; for an in-depth explanation, read the Google e-book, "Site Reliability Engineering."
Site reliability engineers (SREs) work between development and operations, but not necessarily within DevOps proper. The concept of SRE has been around since 2003, which means that it’s older than DevOps. The term was made popular by Ben Treynor, who created Google’s Site Reliability Team. According to Treynor, SRE is “what happens when a software engineer is tasked with what used to be called operations.”
Both disciplines, DevOps and SRE, aim to enhance the release cycle by helping dev and ops see each other’s side of the process throughout the application lifecycle. They also advocate automation and monitoring, reducing the time from when a developer commits a change to when it’s deployed to production. SREs and DevOps aim for this result without compromising the quality of the code or the product itself.
SRE and DevOps ask two different but equally valuable questions:
DevOps asks what needs to be done.
SRE asks how that can be done.
Site reliability engineers track service level indicators (SLIs) against service level objectives (SLOs), while DevOps teams measure failure and success rates over time. SREs share responsibility for the following DevOps pillars of infrastructural improvement: reducing organizational silos, accepting failure as normal, implementing gradual change, leveraging tooling and automation, and measuring everything.
On organizational silos, SREs don't focus on counting how many silos exist in a company; instead, they encourage everyone to discuss the issue. They do this by using the same tools and techniques across the company, which helps spread ownership across all employees.
On failure, SREs need to make sure that errors and failures stay within acceptable limits. To do so, they rely on SLIs and SLOs. An SLI is a measurement of service behavior, such as request latency, throughput in requests per second, or failures per request over time. An SLO combines a threshold with a target percentage, and states how often the SLI must meet that threshold over a given period.
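The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration rather than a standard implementation: the `Request` record, the function names `latency_sli` and `meets_slo`, the 300 ms threshold, and the 50% target are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    latency_ms: float
    ok: bool


def latency_sli(requests, threshold_ms):
    """SLI: fraction of requests that succeeded within the latency threshold."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


def meets_slo(sli, target):
    """SLO check: True when the measured SLI reaches the target fraction."""
    return sli >= target


# Requests observed during one measurement window (made-up values).
window = [
    Request(120.0, True),
    Request(250.0, True),
    Request(310.0, True),   # succeeded, but slower than the threshold
    Request(90.0, False),   # fast, but failed
]

sli = latency_sli(window, threshold_ms=300.0)
print(sli, meets_slo(sli, target=0.5))  # prints: 0.5 True
```

In practice the window would cover days or weeks of traffic and the target would be far stricter, but the shape stays the same: a threshold defines which requests count as good, and a percentage defines how many of them must be good.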
On change, SREs are in favor of it, provided that it happens gradually and methodically. Companies that want to move faster demand frequent releases and continual product updates, so DevOps teams and SREs must respond quickly while maintaining a steady, controlled pace.
On automation, the guideline is to automate wherever doing so provides value to developers and operations by removing manual tasks.
On measurement, SRE teams need to know that everything is moving in the right direction. They can do this with some combination of alerts for various scenarios, peer code review, and unit tests.
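As one illustration of alerting on a specific scenario, the sketch below fires when the error rate over the most recent requests rises above a limit. The class name `ErrorRateAlert`, the window size of 4, and the 25% limit are hypothetical choices for this example; real monitoring systems express such rules in their own configuration languages.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the failure ratio over the last `window` results exceeds `max_error_rate`."""

    def __init__(self, window, max_error_rate):
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque drops the oldest result automatically once it is full.
        self.results = deque(maxlen=window)
        self.max_error_rate = max_error_rate

    def record(self, ok):
        """Record one request outcome and return True if the alert should fire."""
        self.results.append(ok)
        failures = sum(1 for r in self.results if not r)
        return failures / len(self.results) > self.max_error_rate


alert = ErrorRateAlert(window=4, max_error_rate=0.25)
outcomes = [True, True, False, False, True, True, True]
fired = [alert.record(ok) for ok in outcomes]
print(fired)  # prints: [False, False, True, True, True, True, False]
```

The alert clears on the last request because only one failure remains among the four most recent results, which puts the error rate at the limit instead of above it.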
SRE and DevOps are two widely adopted disciplines with quite a bit of overlap; their essential goals are understanding how to measure success or failure and how to achieve continuous reliability across every application. Reliability is not just about the infrastructure; it’s relevant at every step, from application quality through performance and on up to security. SREs care about every process from source code to deployment, which is how they earn their reputation as a true bridge from development to operations.