SRE – Introduction

SRE is what happens when you ask a software engineer to design an operations team
– Ben Treynor Sloss, Google

Site Reliability Engineering (SRE) is among the most popular technology topics during the last few years, with the IT industry viewing it as a better way to run production systems by applying a software engineering mindset to accomplish the work that would otherwise be performed, often manually, by sysadmins. The definition of SRE by the originator of this term (Ben Treynor Sloss at Google) gives an insight into the vision with which this concept was originally created – “SRE is what happens when you ask a software engineer to design an operations team”. As it usually happens with any topic that becomes popular, there are numerous SRE experts in the industry who have interpreted the concept as it is most convenient for their needs. To avoid a biased understanding, I started learning about SRE by reading the book written by creators of this concept at Google – Site Reliability Engineering: How Google Runs Production Systems.

Most misinterpretations on what SRE team should do and who should be part of this team will go away if one understands this statement from the book: SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

Google’s Approach to Service Management
  • Hire software engineers to run products and to create systems to accomplish the work that would otherwise be performed manually
  • Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload
  • 50% cap on the aggregate “ops” work for all SREs—tickets, on-call, manual tasks, etc.
  • When a SRE team consistently spends less than 50% of time on engineering work, shift some of the operations burden back to the development team or add staff to the team without assigning that team additional operational responsibilities
  • Want systems that are automatic, not just automated
  • SRE vs. Devops
    Before going further into SRE, let me compare SRE with Devops, which is a similar concept that addresses friction between development and operations. SRE and Devops are similar when it comes to bridging the gap between development and operations in addition to massive focus on automation. In Google’s view, SRE is a specific implementation of DevOps with some idiosyncratic extensions. There are significant differences too with Devops being a mindset focused on product development and delivery while SRE is a set of practices focused on post production reliability.

    SREDevops
    ProductionRemoving silos, “big picture”, delivering applications
    Set of practices and metricsMindset and culture of collaboration
    System availability and reliabilityProduct development and delivery
    Systems engineers who write codeEveryone involved
    How it should be doneWhat needs to be done

    SRE Responsibilities:

    SRE team is typically responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services they support. The core tenets of Google SRE are:

    • Ensuring a Durable Focus on Engineering
    • Pursuing Maximum Change Velocity Without Violating a Service’s SLO
    • Monitoring using automated software
    • Emergency Response designed to reduce Mean Time To Repair (MTTR)
    • Change Management that is automated to accomplish progressive rollouts, quickly detecting any problems and rolling back changes safely when problems arise
    • Demand Forecasting and Capacity Planning to ensure that the required capacity is in place by the time it is needed
    • Provisioning conducted quickly and only when necessary
    • Efficiency and Performance by predicting demand and provisioning capacity

    Many organizations embark on building a SRE team in addition to a dedicated multi-tiered Operations team to support a service. Adding a SRE team as just another layer to existing ones supporting a service will only make the Operations process more inefficient. Being on-call is one of the integral functions of a SRE team and transforming existing L2 Support team to SRE model will yield the best results. Instead of “my environment is unique and SRE won’t work” attitude, it is important to revisit the entire Operations process holistically considering SRE principles and practices. In the next two blogposts, I will cover key points on principles and practices followed by Google as mentioned in the book.