SRE – Introduction – Santhanam Govindaraj

SRE is what happens when you ask a software engineer to design an operations team
– Ben Treynor Sloss, Google

Site Reliability Engineering (SRE) has been among the most popular technology topics of the last few years. The IT industry views it as a better way to run production systems: apply a software engineering mindset to the work that would otherwise be performed, often manually, by sysadmins. The definition given by the originator of the term, Ben Treynor Sloss at Google, gives an insight into the vision behind the concept: “SRE is what happens when you ask a software engineer to design an operations team”. As usually happens with any topic that becomes popular, numerous SRE experts in the industry have interpreted the concept in whatever way is most convenient for their needs. To avoid a biased understanding, I started learning about SRE by reading the book written by the creators of the concept at Google, Site Reliability Engineering: How Google Runs Production Systems.

Most misinterpretations of what an SRE team should do, and who should be part of it, go away once one understands this statement from the book: “SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.”

Google’s Approach to Service Management
Hire software engineers to run products and to create systems to accomplish the work that would otherwise be performed manually
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload
A 50% cap on the aggregate “ops” work for all SREs: tickets, on-call, manual tasks, etc.
When an SRE team consistently spends less than 50% of its time on engineering work, shift some of the operations burden back to the development team, or add staff to the team without assigning it additional operational responsibilities
Want systems that are automatic, not just automated
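
As a rough illustration of the 50% cap in the list above, the check can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the `TeamWeek` record, the function names, and the idea of tracking the split in logged hours are all hypothetical and not taken from the book.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% cap on aggregate "ops" work


@dataclass
class TeamWeek:
    """Hours logged by a hypothetical SRE team in one week."""
    ops_hours: float          # tickets, on-call, manual tasks
    engineering_hours: float  # project and automation work


def ops_fraction(weeks):
    """Return the share of all logged time that went to ops work."""
    ops = sum(w.ops_hours for w in weeks)
    total = ops + sum(w.engineering_hours for w in weeks)
    if total == 0:
        raise ValueError("no hours logged")
    return ops / total


def over_cap(weeks, cap=OPS_CAP):
    """True when ops work exceeds the cap over the given period."""
    return ops_fraction(weeks) > cap
```

For example, `over_cap([TeamWeek(120, 80), TeamWeek(110, 90)])` is true, because ops work comes to 57.5% of logged time. That is the signal to push some operational burden back to the development team or to add staff.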

SRE vs. DevOps
Before going further into SRE, let me compare it with DevOps, a similar concept that addresses the friction between development and operations. Both aim to bridge the gap between development and operations, and both place a heavy focus on automation. In Google’s view, SRE is a specific implementation of DevOps with some idiosyncratic extensions. There are significant differences too: DevOps is a mindset focused on product development and delivery, while SRE is a set of practices focused on post-production reliability.

SRE	DevOps
Production	Removing silos, “big picture”, delivering applications
Set of practices and metrics	Mindset and culture of collaboration
System availability and reliability	Product development and delivery
Systems engineers who write code	Everyone involved
How it should be done	What needs to be done

SRE Responsibilities:

An SRE team is typically responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services it supports. The core tenets of Google SRE are:

Ensuring a Durable Focus on Engineering
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
Monitoring using automated software
Emergency Response designed to reduce Mean Time To Repair (MTTR)
Change Management that is automated to accomplish progressive rollouts, quickly detecting any problems and rolling back changes safely when problems arise
Demand Forecasting and Capacity Planning to ensure that the required capacity is in place by the time it is needed
Provisioning conducted quickly and only when necessary
Efficiency and Performance by predicting demand and provisioning capacity
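
The tenet of pursuing change velocity without violating a service’s SLO is commonly reasoned about with an error budget, which is the approach the Google book describes: an SLO of 99.9% leaves 0.1% of requests that may fail before the objective is breached. The sketch below is illustrative only. The function names and the request-count model are assumptions made for this example, not an API from the book.

```python
def allowed_failures(slo, total_requests):
    """Number of requests that may fail in the period without breaching the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Failures still available in the budget (negative once the SLO is breached)."""
    return allowed_failures(slo, total_requests) - failed_requests


def can_release(slo, total_requests, failed_requests):
    """Hypothetical policy: keep shipping changes only while budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0
```

With a 99.9% SLO over 1,000,000 requests, about 1,000 failures are allowed. After 400 failures, roughly 600 remain and releases can continue. After 1,500 failures the budget is exhausted, and under this policy further changes would wait.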

Many organizations set out to build an SRE team in addition to a dedicated multi-tiered Operations team supporting a service. Adding an SRE team as just another layer on top of the existing ones will only make the Operations process more inefficient. Being on-call is an integral function of an SRE team, and transforming the existing L2 Support team into the SRE model will yield the best results. Instead of adopting a “my environment is unique and SRE won’t work” attitude, it is important to revisit the entire Operations process holistically in light of SRE principles and practices. In the next two blog posts, I will cover the key principles and practices followed by Google, as described in the book.
