SRE – Principles

After a quick introduction to SRE in the previous blogpost, let's step into the principles as shared by Google in their book. Wikipedia defines reliability as the probability that a system will produce correct outputs up to some given time “t”. Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include corrupted data. Instead, it detects and, if possible, corrects the corruption. Reliability can be characterized in terms of mean time between failures (MTBF): assuming a constant failure rate, reliability = exp(-t/MTBF).
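
To make the formula concrete, here is a minimal Python sketch that evaluates reliability = exp(-t/MTBF). The function name and the sample MTBF of 1,000 hours are illustrative assumptions, not figures from the book.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running failure-free for t_hours, given a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if t_hours < 0:
        raise ValueError("t_hours must be non-negative")
    return math.exp(-t_hours / mtbf_hours)


# Illustrative numbers only: a component with an assumed MTBF of 1,000 hours.
for t in (10, 100, 1000):
    print(f"t = {t:>4} h -> reliability = {reliability(t, 1000):.3f}")
```

Note that at t equal to the MTBF the reliability is exp(-1), roughly 37%, so MTBF should not be read as a guaranteed failure-free period.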

While getting reliability to 100% appears to be ideal, there is a cost involved. SRE outlines the following principles that can help achieve the desired reliability level by balancing resiliency with cost. This blogpost will briefly cover each principle and help us appreciate the SRE practices that will be covered next.

  1. Embracing Risk
  2. Service Level Objectives
  3. Eliminating Toil
  4. Monitoring Systems
  5. Release Engineering
  6. Simplicity

Embracing Risk:
SRE seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness – with features, service, and performance – is optimized. Beyond a certain point, further increases in reliability drive recurring costs up sharply, which makes things economically worse for a service and its users. The cost of improving reliability falls into two buckets. Both are invisible to end users, and both go toward avoiding disruptions rather than building new features:

  1. The cost of redundant machine / compute resources.
  2. The opportunity cost when engineers are allocated to improve reliability.

In SRE, service reliability is managed by managing risk. The goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear, and to strive to make a service reliable enough, but no more reliable than it needs to be. To achieve this, a set of Service Level Objectives needs to be defined; this is covered in the next principle.
Before that, another key concept is the error budget. Embracing risk this way creates tension between Product Development and SRE teams, as they are usually evaluated on different metrics. An error budget, the amount of unreliability a service is allowed over a period (in effect, 1 minus its availability target), aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to defuse discussions about outages with stakeholders, and they allow multiple teams to reach the same conclusion about production risk without rancor.
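
As a rough illustration of how an error budget can turn a reliability target into a release decision, the sketch below treats the budget as the share of requests allowed to fail (1 minus the target). The function names, the 99.9% target, and the request counts are assumptions made up for this example.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the service may incur while still meeting the target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all: any failure overspends it.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


# Hypothetical quarter: 1,000,000 requests, 250 failures, 99.9% target.
remaining = budget_remaining(0.999, 1_000_000, 250)
if remaining > 0:
    print(f"{remaining:.0%} of the error budget is left: releases can continue")
else:
    print("Error budget exhausted: slow down releases and focus on reliability")
```

Because both teams read the same number, the decision to ship or to pause no longer depends on whose metrics carry more weight.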

Service Level Objectives:
To manage a service, we first need to express its important behaviors quantitatively and then define the level of service that will be delivered. Three important terms that help achieve this are:

  1. Service Level Indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided. Examples – request latency, error rate, system throughput, availability, durability.
  2. Service Level Objective (SLO): a target value or range of values for a service level that is measured by an SLI. Example – 99% of Get RPC calls will complete in less than 100 ms.
  3. Service Level Agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains. SLAs usually carry financial implications for violating an SLO.
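
The relationship between an SLI and an SLO can be sketched in a few lines of Python, using the “99% of Get RPC calls in less than 100 ms” example from the list above. The function names and the latency samples are hypothetical; a real service would read them from its monitoring system.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: fraction of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no measurements to evaluate")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, objective: float) -> bool:
    """SLO: the measured SLI must be at or above the objective."""
    return sli >= objective


# Hypothetical sample: 200 calls, 3 of them slower than 100 ms.
samples = [20.0] * 197 + [150.0, 240.0, 900.0]
sli = latency_sli(samples, threshold_ms=100.0)
print(f"SLI = {sli:.3f}, SLO met: {meets_slo(sli, objective=0.99)}")
```

The SLA is deliberately absent from the code: it is the contractual consequence attached to the result, not something that is measured.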

Eliminating Toil:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. The goal of SRE is to eliminate toil so that engineers can spend their time on long-term engineering project work. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.
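
One simple way to check the 50% guideline mentioned above is to classify logged hours and compute the engineering share. This is only a sketch: the activity names, the hours, and the choice of which activities count as toil are assumptions for illustration.

```python
def engineering_share(hours_by_activity: dict[str, float], toil_activities: set[str]) -> float:
    """Fraction of logged time spent on engineering project work rather than toil."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(hours for name, hours in hours_by_activity.items() if name in toil_activities)
    return (total - toil) / total


# Hypothetical week for one SRE.
week = {"tickets": 8, "on-call": 6, "manual deploys": 4, "automation project": 22}
share = engineering_share(week, toil_activities={"tickets", "on-call", "manual deploys"})
print(f"Engineering share: {share:.0%} (guideline: at least 50%)")
```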

Monitoring Systems:
Monitoring includes collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. Effective monitoring helps proactively avoid failures and involves alerting, building dashboards, analyzing long-term trends, and root cause analysis. Monitoring can be either:
· White-box: based on metrics exposed by the internals of the system, including logs, interfaces like the JVM Profiling Interface, or an HTTP handler that emits internal statistics.
· Black-box: testing externally visible behavior as a user would see it.
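
The difference between the two approaches can be sketched in Python. The class and function names are hypothetical, and the network is left out on purpose: `render` stands in for what an internal statistics handler might emit, and the probe accepts any callable that performs one user-style request.

```python
import time
from collections import Counter
from typing import Callable


class ServiceStats:
    """White-box view: counters the service keeps about its own internals."""

    def __init__(self) -> None:
        self.counters = Counter()

    def record(self, outcome: str) -> None:
        self.counters[outcome] += 1

    def render(self) -> str:
        """Plain-text dump, one 'name count' pair per line, sorted by name."""
        return "\n".join(f"{name} {count}" for name, count in sorted(self.counters.items()))


def black_box_probe(call: Callable[[], object], timeout_s: float) -> bool:
    """Black-box view: issue one request as a user would and judge only what is visible."""
    start = time.monotonic()
    try:
        call()
    except Exception:
        return False  # From the outside, any error simply means "the service is down".
    return time.monotonic() - start <= timeout_s


stats = ServiceStats()
for outcome in ("ok", "ok", "error"):
    stats.record(outcome)
print(stats.render())
print("probe healthy:", black_box_probe(lambda: "pong", timeout_s=1.0))
```

The white-box counters can explain why something is failing, while the probe only reports that users are affected; the two views are complementary.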

Release Engineering:
When equipped with the right tools, proper automation, and well-defined policies, developers and SREs shouldn’t have to worry about releasing software. Releases can be as painless as pressing a button. Release engineers help achieve this with a DevOps pipeline that includes a source code repository, build rules for compilation, configuration management, test integration, packaging, and deployment.
Release engineering is guided by an engineering and service philosophy that’s expressed through four major principles:

  1. Self-Service Model: Tools and processes that allow product development teams to control and run their own release processes and achieve high release velocity.
  2. High velocity: Frequent releases that result in fewer changes between versions.
  3. Hermetic Builds: Self-contained builds that must not rely on services that are external to the build environment.
  4. Enforcement of Policies and Procedures

Simplicity:
Software simplicity is a prerequisite to reliability. With an eye towards minimizing accidental complexity, SRE teams should:
· Push back when accidental complexity is introduced into the systems for which they are responsible.
· Constantly strive to eliminate complexity in the systems they onboard and for which they assume operational responsibility.

SRE – Introduction

SRE is what happens when you ask a software engineer to design an operations team
– Ben Treynor Sloss, Google

Site Reliability Engineering (SRE) has been among the most popular technology topics of the last few years, with the IT industry viewing it as a better way to run production systems by applying a software engineering mindset to work that would otherwise be performed, often manually, by sysadmins. The definition of SRE by the originator of the term (Ben Treynor Sloss at Google) gives an insight into the vision with which the concept was created: “SRE is what happens when you ask a software engineer to design an operations team”. As usually happens with any topic that becomes popular, numerous SRE experts in the industry have interpreted the concept in whatever way is most convenient for their needs. To avoid a biased understanding, I started learning about SRE by reading the book written by the creators of the concept at Google, Site Reliability Engineering: How Google Runs Production Systems.

Most misinterpretations about what an SRE team should do and who should be part of it will go away if one understands this statement from the book: SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

Google’s Approach to Service Management
  • Hire software engineers to run products and to create systems to accomplish the work that would otherwise be performed manually
  • Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload
  • 50% cap on the aggregate “ops” work for all SREs—tickets, on-call, manual tasks, etc.
  • When an SRE team consistently spends less than 50% of its time on engineering work, shift some of the operations burden back to the development team, or add staff to the team without assigning that team additional operational responsibilities
  • Want systems that are automatic, not just automated

    SRE vs. DevOps:

    Before going further into SRE, let me compare SRE with DevOps, a similar concept that addresses friction between development and operations. SRE and DevOps are alike in bridging the gap between development and operations and in their massive focus on automation. In Google’s view, SRE is a specific implementation of DevOps with some idiosyncratic extensions. There are significant differences too: DevOps is a mindset focused on product development and delivery, while SRE is a set of practices focused on post-production reliability.

    SRE                                  | DevOps
    Production                           | Removing silos, “big picture”, delivering applications
    Set of practices and metrics         | Mindset and culture of collaboration
    System availability and reliability  | Product development and delivery
    Systems engineers who write code     | Everyone involved
    How it should be done                | What needs to be done

    SRE Responsibilities:

    An SRE team is typically responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services it supports. The core tenets of Google SRE are:

    • Ensuring a Durable Focus on Engineering
    • Pursuing Maximum Change Velocity Without Violating a Service’s SLO
    • Monitoring using automated software
    • Emergency Response designed to reduce Mean Time To Repair (MTTR)
    • Change Management that is automated to accomplish progressive rollouts, quickly detecting any problems and rolling back changes safely when problems arise
    • Demand Forecasting and Capacity Planning to ensure that the required capacity is in place by the time it is needed
    • Provisioning conducted quickly and only when necessary
    • Efficiency and Performance by predicting demand and provisioning capacity

    Many organizations embark on building an SRE team in addition to a dedicated multi-tiered Operations team to support a service. Adding an SRE team as just another layer on top of the existing ones will only make the Operations process more inefficient. Being on-call is an integral function of an SRE team, and transforming the existing L2 Support team to the SRE model will yield the best results. Instead of adopting a “my environment is unique and SRE won’t work” attitude, it is important to revisit the entire Operations process holistically in light of SRE principles and practices. In the next two blogposts, I will cover key points on the principles and practices followed by Google as described in the book.