Agile

SRE – Management

We covered the motivation behind SRE in the first blogpost of this series, followed by Principles and Practices. Let’s complete the foundation with Google’s guidance on how to get SREs working together in a team and working as teams. To ensure the SRE approach sticks without the team slipping back to old ways, the new ways of working covered in this blogpost should be incorporated in a structured manner, with both the team and the management committing to adhere to them at all costs.

Accelerating SREs to On-Call and Beyond: Educating new SREs on concepts and practices up front will shape them into better engineers and make their skills more robust.

  • Initial Learning Experiences – The Case for Structure Over Chaos: SRE must handle a mix of proactive (engineering) and reactive (on-call) work while traditional Operations teams are predominantly reactive. To position the team for success with proactive work, structured knowledge build-up of the system is essential. Some techniques for getting there:
    • Learning Paths That Are Cumulative and Orderly – Show the new SRE team an orderly path that will infuse confidence that there is a plan to mastery of the system through a combination of education, exposure and experience.
    • Targeted Project Work, Not Menial Work – Make the initial weeks effective by giving the engineers project work that can reinforce their learning.
  • Creating Stellar Reverse Engineers and Improvisational Thinkers: SREs will continue to encounter systems with design patterns that they have not seen before. They need strong reverse-engineering skills, along with the ability to think statistically and to improvise, so that they can untangle such systems without getting stuck.
  • Best Practices for Aspiring On-Callers: For engineers who typically prefer creating new tech solutions, being on-call to troubleshoot production issues can be made interesting with the following practices:
    1. A Hunger for Failure: Reading and Sharing Postmortems
    2. Disaster Role Playing (regular team exercises for new joiners to enact responding to pages)
    3. Break Real Things, Fix Real Things (by simulating volumes or issues in non-critical lower environments)
    4. Documentation as Apprenticeship (by overhauling outdated knowledge base)
    5. Shadow On-Call Early and Often
  • On-Call and Beyond – Rites of Passage and Practicing Continuing Education: Once the engineer has demonstrated the ability to handle issues independently, it is time to formally add them to the on-call rota and celebrate this milestone as a team. It is important to set up a regular learning series that helps the entire team stay in touch with changes.

Dealing with interrupts: Once the SRE team is in charge of handling operations, “Managing Operational Load” is the next topic to focus on. Operational Load is the work that must be done to maintain the system in a functional state, and this will interrupt the SRE team working on any other planned project work. So, the objective is to handle such interruptions without pulling the engineers out of their cognitive flow state. The interrupts fall into three general categories:

  • Pages concern production alerts and are triggered in response to production emergencies. They are commonly handled by a primary on-call engineer, who is focused solely on on-call work. A person should never be expected to be on-call and also make progress on projects or anything else with a high context switching cost. A secondary on-call engineer provides back-up in case of contingencies.
  • Tickets concern customer requests that require the team to take an action. The primary or secondary on-call engineer can work on tickets when there are no pages to handle. Depending on the nature and priority of tickets, a dedicated person might also be assigned to work on tickets.
  • Ongoing operational responsibilities include activities like team-owned code or flag rollouts, or responses to ad-hoc, time-sensitive questions from customers. An approach similar to handling tickets can be adopted.

Embedding an SRE to Recover from Operational Overload: A burdensome amount of ops work over a prolonged period is dangerous because the SRE team might burn out or be unable to make progress on project work. One way to relieve this burden is to temporarily transfer an SRE into the overloaded team. Google’s guidance to the SRE who will be embedded on a team:

  • Phase 1: Learn the Service and Get Context – Remind the team that more tickets should not require more SREs and emphasize healthy work habits that reduce the time spent on tickets. Some of the healthy habits are focusing on non-linear scaling of services, identifying sources of inordinate amounts of stress, and identifying emergencies waiting to happen.
  • Phase 2: Sharing Context – After identifying pain points, suggest improvements and demonstrate better ways to work. Some examples are writing a good postmortem for the team or identifying root cause for frequent issues and suggesting solutions.
  • Phase 3: Driving Change – Nudge the team with ideas based on SRE principles and help them self-regulate. This can be done by helping the team fix any basic issues (like defining SLO), coaching team members to address issues in a permanent way or asking leading questions.

Communication and Collaboration in SRE: There is tremendous diversity in SRE teams as it includes people with various skills such as systems engineering, software engineering, project management, etc. Also, given the nature of responsibilities handled by SRE, team members tend to be more distributed across geographical regions and time zones when compared to product development. Considering these aspects, communication and collaboration among SRE teams and across other teams should be designed to address the joint concerns of production and the product in an atmosphere of mutual respect. There should be forums (like weekly Production Meetings) for the SRE team to articulate the state of the system they support and highlight improvement opportunities to Product Development.

The Evolving SRE Engagement Model: The focus so far has been on onboarding SRE support for a product or service that is already in production. While this “classic” engagement model is commonly a good starting point, there are two other models that are better at embedding SRE principles and practices earlier in the development lifecycle. Let’s look at all three models, starting with the classic one.

  • Simple PRR (Classic) Model: When SRE receives a request for taking over production management, SRE gauges both the importance of the product and the availability of SRE teams. The SRE and development teams then agree on staffing levels to facilitate this support, followed by a Production Readiness Review (PRR). Once the gaps and improvements identified in the review are addressed, the SRE team assumes its production responsibilities.
  • Early Engagement Model: SRE participates in Design and later phases, eventually taking over the service any time during or after the build phase.
  • Evolving Services Development – Frameworks and SRE Platform: As the industry moves towards microservices architecture, the number of requests for SRE support and the cardinality of services to support will increase. To effectively address the increased demand, all microservices should adopt structured frameworks for production services. These frameworks include codified SRE best practices that are “production ready” by design and reusable solutions to mitigate scalability and reliability issues. A production platform built on top of such frameworks with stronger conventions reduces operational overhead.

These five ways to work should help establish and reinforce SRE teams in an organization. And with this, we come to the end of the SRE overview series. I strongly recommend reading Google’s book to get a comprehensive understanding of SRE. As the industry moves further towards microservices and cloud, the traditional support model that is predominantly based on manual operations will be neither scalable nor sustainable. The sooner organizations pivot towards an engineering-oriented support model with the necessary investments in technology and people, the better for the products and services they provide.

SRE – Practices

After covering the motivation behind SRE along with the responsibilities and principles in previous blogposts, this one will focus on “how” to get there by leveraging SRE practices used by Google. The book explains 18 practices and I strongly recommend reading the book to thoroughly understand them. I have provided a brief summary of the most common and relevant practices here.

The book characterizes the health of a service in a way similar to Maslow’s hierarchy of human needs, with basic needs at the bottom (starting with Monitoring) and going all the way up to taking proactive control of the product’s future rather than reactively fighting fires. All the practices fall under one of these categories.

Monitoring: No software service can be sustained in the long term if customers usually come to know of problems before the service provider does. To avoid this situation of flying blind, monitoring has always been an essential part of supporting a service. Many organizations have L1 Service Desk teams that either manually perform runbook-based checks or visually monitor dashboards (ITRS, App Dynamics, etc.) looking for any service turning “red”. Both these approaches involve manual activity, which makes monitoring less effective and less efficient. Google, being a tech-savvy organization, always had automated monitoring through custom scripts that check responses and alert.

  • Practical Alerting from Time-Series Data: As Google’s monitoring systems evolved using SRE, they moved to a new paradigm that made the collection of time-series a first-class role of the monitoring system, and replaced those check scripts with a rich language for manipulating time-series into charts and alerts. Open source tools like Prometheus, Riemann, Heka and Bosun allow any organization to adopt this approach. For organizations still relying heavily on L1 Service Desks, a good starting point is to use a combination of white-box and black-box monitoring along with a production health dashboard and well-tuned alerting to eliminate the need for manual operations that only scale linearly.
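
To make the idea of alerting from time-series a little more concrete, here is a minimal sketch in plain Python. It is not the rule language of Prometheus or any other tool mentioned above; the sample format, the function names and the "must stay above the threshold for several evaluations" rule are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample format: (total_requests, failed_requests) per scrape interval.
Sample = Tuple[int, int]


def error_ratios(samples: Sequence[Sample], window: int) -> List[float]:
    """Error ratio over a sliding window covering the last `window` samples."""
    ratios = []
    for end in range(window, len(samples) + 1):
        chunk = samples[end - window:end]
        total = sum(t for t, _ in chunk)
        failed = sum(f for _, f in chunk)
        ratios.append(failed / total if total else 0.0)
    return ratios


def should_alert(ratios: Sequence[float], threshold: float, for_points: int) -> bool:
    """Fire only if the ratio exceeded `threshold` in each of the last `for_points` evaluations."""
    if len(ratios) < for_points:
        return False
    return all(r > threshold for r in ratios[-for_points:])


# Made-up data: a sustained rise in errors versus a single short spike.
sustained = [(100, 0), (100, 1), (100, 6), (100, 8), (100, 9)]
spike = [(100, 0), (100, 0), (100, 20), (100, 0), (100, 0)]

print(should_alert(error_ratios(sustained, window=2), threshold=0.05, for_points=2))
print(should_alert(error_ratios(spike, window=2), threshold=0.05, for_points=2))
```

Requiring the condition to hold over several evaluations is one simple way to avoid paging a human for a momentary blip, which is the kind of judgement a manual dashboard watcher would otherwise have to make.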

Incident Response: Incidents that disrupt a software service dependent on numerous interconnected components are inevitable. SRE approaches these incidents as an opportunity to learn and remain in touch with how distributed computing systems actually work. While Incident Response and Incident Management are used interchangeably in some places, I consider Incident Response, which includes technical analysis and recovery, to be the primary responsibility of the SRE team, whereas Incident Management deals with communicating with stakeholders and pulling the whole response together. Google has also called out Managing Incidents as one of the four practices under Incident Response:

  • Being On-Call is a critical duty for the SRE team to keep their services reliable and available. At the same time, balanced on-call is essential to foster a sustainable and manageable work environment for the SRE team. The balance should ensure there is no operational overload or underload. Operational overload will make it difficult for the SRE team to spend at least 50% of their time on engineering activities, leading to technology debt and inefficient manual workarounds creeping into the support process. Operational underload can result in SREs going out of touch with production, creating knowledge gaps that can be disastrous when an incident occurs. The on-call approach should enable engineering work as the primary means to scale production responsibilities and maintain high reliability and availability despite the increasing complexity and number of systems and services for which SREs are responsible.
  • Effective Troubleshooting: Troubleshooting is a skill similar to riding a bike or driving a stick-shift car, something that becomes easy once you internalize the process and program your memory to subconsciously take the necessary action. In addition to acquiring generic troubleshooting skills, solid knowledge of the system is essential for an SRE to be effective during incidents. Building observability into each component from the ground up and designing systems with well-understood interfaces between components will make troubleshooting easier. Adopting a systematic approach to troubleshooting (like a Triage -> Examine -> Diagnose -> Test / Treat cycle) instead of relying on luck or experience will yield good results and a better experience for all stakeholders.
  • Emergency Response: “Don’t panic” is the mantra to remember during system failures to be able to recover effectively. And to be able to act without panic, training to handle such situations is absolutely essential. Test-induced emergencies help SREs proactively prepare for such eventualities, make changes to fix the underlying problems and also identify other weaknesses before they become outages. In real life, emergencies are usually change-induced or process-induced, and SREs learn from all outages. They also document the failure modes for other teams to learn how to better troubleshoot and fortify their systems against similar outages.
  • Managing Incidents: Most organizations already have an ITIL-based incident management process in place. The SRE team strengthens this process by focusing on reducing mean time to recovery and providing staff a less stressful way to work on emergent problems. The features that can help achieve this are recursive separation of responsibilities, a recognized command post, a live incident state document and clear handoffs.

Postmortem and Root Cause Analysis: SRE philosophy aims to manually solve only new and exciting problems in production unlike some of the traditional operations-focused environments that end up fixing the same issue over and over.

  • Postmortem Culture of Learning from Failure has the primary goals of ensuring that the incident is documented, all contributing root causes are well understood and effective preventive actions are put in place to reduce the likelihood and impact of recurrence. As the postmortem process involves inherent cost in terms of time and effort, well-defined triggers like incident severity are used to ensure root cause analysis is done for appropriate events. Blameless postmortems are a tenet of SRE culture.

Testing: The previous practices help handle problems when they arise but preventing such problems from occurring in the first place should be the norm.

  • Testing for Reliability is the practice of adapting classical software testing techniques to systems at scale to improve reliability. Traditional tests during the software development stage, like unit testing, integration testing and system testing (smoke, performance, regression, etc.), help ensure correct behavior of the system before it is deployed into production. Production tests like stress / canary / configuration tests are similar to black-box monitoring in that they help proactively identify problems before users encounter them, and they also support staggered rollouts that limit any impact in production.

Capacity Planning: Modern distributed systems built using component architecture are designed to scale on demand and rely heavily on diligent capacity planning to achieve it. The following four practices are key:

  • Load balancing at the Frontend: DNS is still the simplest and most effective way to balance load before the user’s connection even starts but has limitations. So, the initial level of DNS load balancing should be followed by a level that takes advantage of virtual IP addresses.
  • Load balancing in the data center: Once the request arrives at the data center, the next step is to identify the right algorithms for distributing work within a given datacenter for a stream of queries. Load balancing policies can be very simple and not take into account any information about the state of the backends (e.g., Round Robin) or can act with more information about the backends (e.g., Least-Loaded Round Robin or Weighted Round Robin).
  • Handling Overload: Load balancing policies are expected to prevent overload, but there are times when the best plans fail. In addition to data center load balancing, per-customer limits and client-side throttling will help spread load over tasks in a datacenter relatively evenly. Despite all precautions, when a backend is overloaded, it need not shut down and stop accepting all traffic. Instead, it can continue accepting as much traffic as it can handle and take on further load only as capacity frees up.
  • Addressing cascading failures: A cascading failure is one that grows over time as a result of positive feedback. It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail. Increasing resources, restarting servers, dropping traffic, eliminating non-critical load, eliminating bad traffic are some of the immediate steps that can address cascading failures.
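
The load balancing policies mentioned above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Google’s implementation: the backend names, the in-flight request counts and the integer weights are all made up, and the weighted variant below simply repeats each backend in proportion to its weight rather than using any feedback from the backends.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Cycle through backends in order, ignoring their state."""
    return itertools.cycle(backends)


def least_loaded(active_requests: Dict[str, int]) -> str:
    """Pick the backend with the fewest in-flight requests (ties go to the first listed)."""
    return min(active_requests, key=active_requests.get)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Cycle through backends, repeating each one `weight` times per round."""
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(schedule)


# Hypothetical backends and load figures, for illustration only.
rr = round_robin(["a", "b", "c"])
print([next(rr) for _ in range(4)])

print(least_loaded({"a": 3, "b": 1, "c": 2}))

wrr = weighted_round_robin({"a": 2, "b": 1})
print([next(wrr) for _ in range(6)])
```

The first policy needs no knowledge of the backends at all, while the other two rely on extra information (current load or a capacity weight), which is the trade-off between simplicity and evenness of load described in the list above.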

Development: All the practices covered so far deal with handling reliability after software development is complete. Google recommends significant large-scale system design and software engineering work within the organization to enable SRE through the following practices:

  • Managing Critical State – Distributed Consensus for Reliability: CAP Theorem provides the guiding principle to determine the properties that are most critical. When dealing with distributed software systems, we are interested in asynchronous distributed consensus, which applies to environments with potentially unbounded delays in message passing. Distributed consensus algorithms allow a set of nodes to agree on a value once, but on their own they don’t map well to real design tasks. Higher-level system components such as datastores, configuration stores, queues, locking, and leader election services are therefore built on top of distributed consensus to provide the practical system functionality that the algorithms themselves don’t address. Using higher-level components reduces complexity for system designers. It also allows underlying distributed consensus algorithms to be changed if necessary in response to changes in the environment in which the system runs or changes in nonfunctional requirements.
  • Distributed Periodic Scheduling with Cron, Data Processing Pipelines and ensuring Data Integrity: What You Read Is What You Wrote are other practices during Development.

Product is at the top of the pyramid for any organization. Organizations will benefit by practicing Reliable Product Launches at Scale, using the Launch Coordination Engineering role to set up a solid launch process with a launch checklist.

These practices shared by Google provide a comprehensive framework to adopt across software development lifecycle to improve reliability, resilience and stability of systems.

SRE – Principles

After a quick introduction to SRE in the previous blogpost, let’s step into the principles as shared by Google in their book. Wikipedia defines Reliability as the probability that a system will produce correct outputs up to some given time “t”. Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include corrupted data. Instead, it detects and, if possible, corrects the corruption. Reliability can be characterized in terms of mean time between failures (MTBF), with reliability = exp(-t/MTBF).
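
As a quick worked example of the formula reliability = exp(-t/MTBF), the sketch below evaluates it for a hypothetical service. The MTBF of 1,000 hours and the time horizons are made-up numbers, and the function name is my own.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running without failure for t_hours, per reliability = exp(-t/MTBF)."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-t_hours / mtbf_hours)


# Hypothetical service with an MTBF of 1,000 hours.
for t in (24, 168, 1000):
    print(f"t = {t:>4} h  reliability = {reliability(t, 1000):.3f}")
```

One thing the formula makes visible: at t equal to the MTBF, reliability is exp(-1), or roughly 37%, so MTBF should not be read as a period of guaranteed failure-free operation.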

While getting reliability to 100% appears to be ideal, there is cost involved. SRE outlines the following principles that can help achieve desired reliability level by balancing resiliency with cost. This blogpost will briefly cover each principle and help us appreciate SRE practices that will be covered next.

  1. Embracing Risk
  2. Service Level Objectives
  3. Eliminating Toil
  4. Monitoring Systems
  5. Release Engineering
  6. Simplicity

Embracing Risk:
SRE seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness – with features, service, and performance – is optimized. Efforts to increase reliability beyond a certain point will exponentially increase recurring costs, making it economically worse for a service and its users. The cost of improving reliability falls into two buckets. Both are invisible to end users, because they go into avoiding disruptions rather than building new features, but both are essential:

  1. The cost of redundant machine / compute resources.
  2. The opportunity cost when engineers are allocated to improve reliability.

In SRE, service reliability is managed by managing risk. The goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear and strive to make a service reliable enough, but no more reliable than it needs to be. To achieve this, a set of Service Level Objectives need to be defined and this will be covered in the next principle.
Before that, another key concept is Error Budgets. As we embrace risk this way, tensions will arise between Product Development and SRE teams as they are usually evaluated on different metrics. An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.
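
To show how an error budget can turn the release discussion into arithmetic, here is a minimal sketch. It assumes a common formulation in which the budget is one minus the reliability target, applied to the request volume in a measurement window. The target, the request counts and the function names are all hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget


def can_release(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Assumed policy: keep releasing only while some budget is left."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


# Made-up window: 1,000,000 requests against a 99.9% target, 400 failures so far.
print(f"budget: about {error_budget(0.999, 1_000_000):.0f} failed requests")
print(f"remaining: {budget_remaining(0.999, 1_000_000, 400):.0%}")
print(f"release allowed: {can_release(0.999, 1_000_000, 400)}")
```

Because both teams read the same number, the decision to slow down or continue releasing no longer depends on who argues more persuasively.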

Service Level Objectives:
To manage a service, we first need to express its important behaviors quantitatively and then define the level of service that will be delivered. Three important terminologies that help achieve this are:

  1. Service Level Indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided. Examples – request latency, error rate, system throughput, availability, durability.
  2. Service Level Objective (SLO): a target value or range of values for a service level that is measured by an SLI. Example – 99% of Get RPC calls will complete in less than 100 ms.
  3. Service Level Agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. SLAs usually have financial implications for violating an SLO.
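
The relationship between an SLI and an SLO can be illustrated with the example objective above (99% of Get RPC calls complete in less than 100 ms). The sketch below is only an illustration: the latency samples are invented and the function names are my own.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of calls that completed in less than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, objective: float) -> bool:
    """SLO check: is the measured SLI at or above the objective?"""
    return sli >= objective


# Invented measurements: 985 fast calls (40 ms) and 15 slow calls (250 ms).
samples = [40.0] * 985 + [250.0] * 15
sli = latency_sli(samples, threshold_ms=100.0)
print(f"SLI = {sli:.3f}, SLO met: {meets_slo(sli, objective=0.99)}")
```

The SLI is the measurement, the SLO is the target the measurement is compared against, and an SLA would attach consequences to the result of that comparison.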

Eliminating Toil:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. SRE’s goal is to eliminate toil so that they can spend time on long-term engineering project work. Typically 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.

Monitoring Systems:
Monitoring includes collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes. Effective monitoring helps proactively avoid failures and involves alerting, building dashboards, analyzing long term trends and root cause analysis. Monitoring can either be:
  • White-box that is based on metrics exposed by the internals of the system, including logs, interfaces like JVM Profiling Interface or an HTTP handler that emits internal statistics.
  • Black-box that involves testing externally visible behavior as a user would see it.

Release Engineering:
When equipped with the right tools, proper automation, and well-defined policies, developers and SREs shouldn’t have to worry about releasing software. Releases can be as painless as simply pressing a button and Release Engineers help achieve this using devops pipeline that includes source code repository, build rules for compilation, configuration management, test integration, packaging and deployment.
Release engineering is guided by an engineering and service philosophy that’s expressed through four major principles:

  1. Self-Service Model: Tools and processes that allow product development teams to control and run their own release processes and achieve high release velocity.
  2. High velocity: Frequent releases that result in fewer changes between versions.
  3. Hermetic Builds: Self-contained builds that must not rely on services that are external to the build environment.
  4. Enforcement of Policies and Procedures

Simplicity:
Software simplicity is a prerequisite to reliability. With an eye towards minimizing accidental complexity, SRE teams should:
  • Push back when accidental complexity is introduced into the systems for which they are responsible.
  • Constantly strive to eliminate complexity in systems they onboard and for which they assume operational responsibility.

SRE – Introduction

SRE is what happens when you ask a software engineer to design an operations team
– Ben Treynor Sloss, Google

Site Reliability Engineering (SRE) is among the most popular technology topics during the last few years, with the IT industry viewing it as a better way to run production systems by applying a software engineering mindset to accomplish the work that would otherwise be performed, often manually, by sysadmins. The definition of SRE by the originator of this term (Ben Treynor Sloss at Google) gives an insight into the vision with which this concept was originally created – “SRE is what happens when you ask a software engineer to design an operations team”. As it usually happens with any topic that becomes popular, there are numerous SRE experts in the industry who have interpreted the concept as it is most convenient for their needs. To avoid a biased understanding, I started learning about SRE by reading the book written by creators of this concept at Google – Site Reliability Engineering: How Google Runs Production Systems.

Most misinterpretations of what an SRE team should do and who should be part of this team will go away if one understands this statement from the book: SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

Google’s Approach to Service Management
  • Hire software engineers to run products and to create systems to accomplish the work that would otherwise be performed manually
  • Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload
  • 50% cap on the aggregate “ops” work for all SREs—tickets, on-call, manual tasks, etc.
  • When an SRE team consistently spends less than 50% of its time on engineering work, shift some of the operations burden back to the development team or add staff to the team without assigning that team additional operational responsibilities
  • Want systems that are automatic, not just automated

    SRE vs. Devops

    Before going further into SRE, let me compare SRE with Devops, which is a similar concept that addresses friction between development and operations. SRE and Devops are similar when it comes to bridging the gap between development and operations, in addition to a massive focus on automation. In Google’s view, SRE is a specific implementation of DevOps with some idiosyncratic extensions. There are significant differences too, with Devops being a mindset focused on product development and delivery while SRE is a set of practices focused on post-production reliability.

    SRE                                 | Devops
    Production                          | Removing silos, “big picture”, delivering applications
    Set of practices and metrics        | Mindset and culture of collaboration
    System availability and reliability | Product development and delivery
    Systems engineers who write code    | Everyone involved
    How it should be done               | What needs to be done

    SRE Responsibilities:

    An SRE team is typically responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services they support. The core tenets of Google SRE are:

    • Ensuring a Durable Focus on Engineering
    • Pursuing Maximum Change Velocity Without Violating a Service’s SLO
    • Monitoring using automated software
    • Emergency Response designed to reduce Mean Time To Repair (MTTR)
    • Change Management that is automated to accomplish progressive rollouts, quickly detecting any problems and rolling back changes safely when problems arise
    • Demand Forecasting and Capacity Planning to ensure that the required capacity is in place by the time it is needed
    • Provisioning conducted quickly and only when necessary
    • Efficiency and Performance by predicting demand and provisioning capacity

    Many organizations embark on building an SRE team in addition to a dedicated multi-tiered Operations team to support a service. Adding an SRE team as just another layer on top of the existing ones supporting a service will only make the Operations process more inefficient. Being on-call is one of the integral functions of an SRE team, and transforming the existing L2 Support team to the SRE model will yield the best results. Instead of adopting a “my environment is unique and SRE won’t work” attitude, it is important to revisit the entire Operations process holistically, considering SRE principles and practices. In the next two blogposts, I will cover key points on principles and practices followed by Google as mentioned in the book.

    The 4 Disciplines of Execution

    I had referred to 4DX in my previous blog and the book that introduced this concept was my next read. Executing projects to completion successfully has long been my strength and, after reading the book, I was delighted to know that I already follow most of the rules given in it. So, the book helped build my vocabulary on execution focus and also articulate how one can succeed with excellent execution. In this blogpost, I have summarized the key rules and principles behind 4DX.

    People working at large organizations will be familiar with the struggle to prioritize execution of important strategic goals as they will invariably end up spending most of their time on urgent day-to-day operational tasks. So, the real enemy of execution is our day job, which the book calls the whirlwind. 4DX acknowledges the importance of whirlwind and provides a set of rules for executing our most critical strategy in the midst of our whirlwind.

    1. Discipline #1: Focus on the Wildly Important – The big idea here is to focus our finest effort on a few highly important goals that can be achieved in the midst of the whirlwind of the day job, rather than giving mediocre effort to dozens of goals.
      1. Rule #1: No team focuses on more than two Wildly Important Goals (WIGs) at the same time.
      2. Rule #2: The battle you choose must win the war.
      3. Rule #3: Senior leaders can veto, but not dictate.
      4. Rule #4: All WIGs must have a finish line in the form of “from X to Y by when”.
    2. Discipline #2: Act on the Lead Measures: Discipline 1 takes the wildly important goal for an organization and breaks it into a set of specific, measurable targets until every team has a WIG that it can own. Discipline 2 then defines the leveraged actions that can enable the team to achieve that goal. Tracking a goal is done through two types of measures:
      • Lag Measures: Measurements of the result we are trying to achieve. They are called lag measures because by the time we get the data, the result has already happened, so they are always lagging. Examples – sprint velocity, lead time, revenue, profits, etc.
      • Lead Measures: These foretell the result and are virtually within our control. Example – while a sprint goal (say velocity, a lag measure) can be jeopardized by external dependencies that are out of the team’s control, the team can certainly adhere strictly to acceptance criteria (lead measures like definition of ready and definition of done). The more the team acts on the lead measures, the more likely it is that sprint goals will be accomplished. Lead measures should have two primary characteristics:
        • Predictive: If the lead measure changes, team can predict that the lag measure will also change.
        • Influenceable: It can be directly influenced by the team without a significant dependence on another team.
    3. Discipline #3: Keep a Compelling Scoreboard – The third discipline is to make sure everyone in the team knows the score at all times, so that they can tell whether they are winning. A Sprint Burndown Chart that tracks the team’s progress towards the sprint goal is one example. The following four questions determine whether the scoreboard is likely to be compelling to the team:
      • Is it simple?
      • Can I see it easily?
      • Does it show lead and lag measures?
      • Can I tell at a glance if my team is winning?
    4. Discipline #4: Create a Cadence of Accountability – The fourth discipline is to create a frequently recurring cycle of accountability (WIG sessions) for reviewing past performance and planning to move the score forward. In a Scrum team, the Sprint Retrospective is a routine that strives to achieve this by expecting the team to discuss what went well and what went wrong, and to identify improvement opportunities for future sprints. A WIG session has the following three-part agenda:
      1. Account: Report on commitments
      2. Review the scoreboard: Learn from successes and failures
      3. Plan: Clear the path and make new commitments
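
    To make the “from X to Y by when” finish line and the “am I winning at a glance?” question concrete, here is a minimal Python sketch. It is my own illustration, not something from the book: the class name Wig, its fields, and above all the straight-line path from X to Y are assumptions chosen for simplicity, and the lead-time numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Wig:
    """A Wildly Important Goal phrased as 'from X to Y by when' (hypothetical model)."""
    name: str
    start_value: float   # X
    target_value: float  # Y
    start_date: date
    deadline: date       # "by when"

    def expected_value(self, today: date) -> float:
        """Value we should have reached today, assuming a straight-line path from X to Y."""
        total_days = (self.deadline - self.start_date).days
        if total_days <= 0:
            raise ValueError("deadline must be after start_date")
        elapsed = (today - self.start_date).days
        # Clamp so dates before the start or after the deadline stay on the path ends.
        fraction = min(max(elapsed / total_days, 0.0), 1.0)
        return self.start_value + fraction * (self.target_value - self.start_value)

    def is_winning(self, actual_value: float, today: date) -> bool:
        """True if the lag measure is at or ahead of the straight-line path."""
        expected = self.expected_value(today)
        if self.target_value >= self.start_value:
            return actual_value >= expected  # goal is to increase the measure
        return actual_value <= expected      # goal is to decrease the measure


# Hypothetical WIG: cut lead time from 20 days to 10 days within 100 calendar days.
wig = Wig("Reduce lead time", 20.0, 10.0, date(2024, 1, 1), date(2024, 4, 10))
today = date(2024, 2, 20)  # halfway to the deadline
print(wig.expected_value(today))    # where the straight line says we should be
print(wig.is_winning(14.0, today))  # actual lead time is 14 days
```

    A real scoreboard would also show the lead measures next to this lag measure; the sketch covers only the “winning or losing” signal.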

    Over the years, I have seen numerous strategic organizational initiatives being launched with the right intent and much fanfare. However, only a few of them achieved the real goals and many went down quietly over time, slowly suffocated by the whirlwind. The book summarizes this situation beautifully and kindles hope at the end: Once people give up on a goal that looks unachievable – no matter how strategic it might be – there is only one place to go: back to the whirlwind. After all, it’s what they know and it feels safe. When this happens, your team is now officially playing not to lose instead of playing to win and there is a big difference. Simply put, 4DX gets an organization playing to win!

    Agile Engineering Practices

    Agile software development helps reduce “time to market” by placing value on “responding to change” over “following a plan”. Experience shows that a “project plan” often provides only an illusion of progress towards the product goal, given the number of projects across the industry that failed despite solid plans and several person-years of effort. Instead, Agile seeks to “fail fast” and “pivot” to more valuable goals. This is possible only when the team operates with strong discipline and solid engineering practices.

    Using the word “engineering” is anathema to some practitioners, who consider software development to be a “craft” rather than an “engineering” discipline. While creativity is essential for software development, engineering discipline enables creativity. Remember Nikola Tesla, who proved Thomas Edison wrong on alternating current – how many can claim to be more creative than him? Nikola Tesla was an electrical and mechanical engineer who combined his engineering discipline with creativity to become a genius! Engineering discipline helps address the variability and unpredictability of software development. The engineering practices I cover below act as the scaffolding that keeps the Agile team safe as they embark on building a tall tower!

    Test Driven Development (TDD): This invariably appears on any list of engineering practices, and it has variants in Behaviour Driven Development (BDD) and Acceptance Test Driven Development (ATDD). TDD is best described by Robert C. Martin with three rules:

    1. Write no production code except to pass a failing test
    2. Write only enough of a test to demonstrate failure
    3. Write only enough production code to pass a failing test

    These three rules are logical and sound simple. But I can vouch that following them will be painful at first. In this competitive world, there is no way to create something outstanding and unique without going through pain. Automated unit tests form the basis for other engineering practices that come later in the SDLC.

    There are numerous tools for TDD; some of the popular ones I have used are JUnit, Robot Framework, FitNesse and Lettuce (BDD).
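
    To make the three rules concrete, here is a small sketch using Python’s built-in unittest module rather than the tools listed above. The leap_year function is a hypothetical example of my own. The idea is that each test was written first and seen to fail, and only then was just enough production code added to make it pass.

```python
import unittest


def leap_year(year: int) -> bool:
    """Production code, grown one failing test at a time."""
    if year % 400 == 0:   # added for test_year_divisible_by_400_is_leap
        return True
    if year % 100 == 0:   # added for test_century_is_not_leap
        return False
    return year % 4 == 0  # added for the first two tests


class LeapYearTest(unittest.TestCase):
    # The tests are listed in the order they were written.
    def test_year_divisible_by_4_is_leap(self):
        self.assertTrue(leap_year(2024))

    def test_ordinary_year_is_not_leap(self):
        self.assertFalse(leap_year(2023))

    def test_century_is_not_leap(self):
        self.assertFalse(leap_year(1900))

    def test_year_divisible_by_400_is_leap(self):
        self.assertTrue(leap_year(2000))
```

    Saved in a file, this can be run with python -m unittest followed by the file name. The final code is the least interesting part; the discipline lies in the red-green rhythm that produced it.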

    Continuous Integration (CI): This is the practice of merging all developer working copies into a shared mainline several times a day. It helps avoid the “integration hell” that developers encounter when they try to merge their changes just before release packaging. The reason for “integration hell” is obvious – a developer accumulates technical debt by hanging on to changes in a local environment without checking them into the mainline. It is prudent to keep repaying debt in small increments rather than letting it grow into a monster! CI has the following prerequisites:

    1. Code Repository – Git, SVN, TFS, etc.
    2. Automated build – Gradle, Maven, Ant, Make, etc.
    3. Build self-test – refer to TDD

    Jenkins is one of the most popular CI servers, with thousands of plugins for setting up a robust CI environment. Once you have CI set up, the next level of engineering is Continuous Deployment (CD), which enables software to be deployed directly into production.

    Refactoring: Martin Fowler’s book is the authority on this topic. His preamble is insightful – Refactoring is a controlled technique for improving the design of an existing code base. Its essence is applying a series of small behavior-preserving transformations, each of which “too small to be worth doing”. However the cumulative effect of each of these transformations is quite significant. By doing them in small steps you reduce the risk of introducing errors. You also avoid having the system broken while you are carrying out the restructuring – which allows you to gradually refactor a system over an extended period of time.

    Technologists often talk about challenges with legacy code. Refactoring regularly will ensure software does not become “legacy”!
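
    As an illustration of one small behavior-preserving transformation, here is a minimal Python sketch of an “extract function” step. The order-total example, the function names and the 10% bulk discount are hypothetical and not taken from Fowler’s book. Amounts are kept in integer cents so that the before and after versions can be compared exactly.

```python
def order_total_cents_before(items):
    """Before: the discount rule is buried inside the loop, with magic numbers."""
    total = 0
    for price_cents, quantity in items:
        if quantity >= 10:
            total += price_cents * quantity * 90 // 100
        else:
            total += price_cents * quantity
    return total


# After: the same behavior, with the rule extracted and the numbers named.
BULK_QUANTITY = 10       # quantity from which the bulk price applies
BULK_PRICE_PERCENT = 90  # bulk lines are charged at 90% of list price


def line_total_cents(price_cents, quantity):
    """Total for one order line, applying the bulk price where it qualifies."""
    line = price_cents * quantity
    if quantity >= BULK_QUANTITY:
        line = line * BULK_PRICE_PERCENT // 100
    return line


def order_total_cents_after(items):
    """Sum of all line totals; behavior is identical to the 'before' version."""
    return sum(line_total_cents(price, quantity) for price, quantity in items)


# The safety net: the refactored code must agree with the original.
items = [(250, 4), (199, 10), (1000, 1)]
assert order_total_cents_before(items) == order_total_cents_after(items)
```

    The final assertion is the important line: each step is only safe because a test confirms that behavior has not changed, which is why the automated tests from TDD come first.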

    Other major engineering practices are:

    • Pair Programming
    • Collective Ownership
    • Emergent Design

    To summarize, engineering practices help a team become agile and stay that way. It is important to understand that adopting engineering practices is a cultural aspect and not just a matter of mandating a bunch of popular tools for the team to use. Agile teams will immensely benefit by embracing engineering discipline with conviction.

    Leading Agile Teams

    Welcome to the third part of my Agile series. Having covered the foundational elements of Agile and basics about the most widely used Agile framework, I will share my knowledge on how to motivate an Agile team towards the product goal. An Agile team is self-organizing and cross-functional. The term “self-organizing” is key, indicating that the traditional management approach of direction and control will not work.

    Let me start with the origins of the traditional rationale for direction and control. “The Human Side of Enterprise”, a management classic written by Douglas McGregor almost 60 years back, insightfully covers the assumptions on which the traditional view is based:

    • The average human being has an inherent dislike of work and will avoid it if possible
    • Hence most people must be coerced, controlled, directed and threatened with punishment to get them to put forth adequate efforts towards achievement of organizational goals
    • The average human being prefers to be directed, wishes to avoid responsibility, has relatively limited motivation and wants security above all

    I can bet that no one reading this blog will associate themselves with this average human being! This characterization is demeaning, and Douglas McGregor concludes by saying “under the conditions of modern industrial life, the intellectual potentialities of the average human being are only partially utilized”. He made this case for factory workers sixty years ago. Software development in a modern technology environment requires even more intellectual stimulation than routine work in factories.

    I will now switch to a classic Harvard Business Review article by Frederick Herzberg, first published in 1968 and reprinted in 1987, titled “One More Time: How Do You Motivate Employees?”. It starts with an interesting preamble: “Forget praise. Forget punishment. Forget cash. You need to make their jobs more interesting”. In short, we can enrich jobs by applying the following principles:

    • Increase individuals’ accountability for their work by removing some controls
    • Give people responsibility for a complete process or unit of work
    • Make information available directly to employees rather than sending it through their managers first
    • Enable people to take new, more difficult tasks they have not handled before
    • Assign individuals specialized tasks that allow them to become experts

    A relatively modern book “Drive: The surprising truth about what motivates us” by Daniel Pink provides the most powerful insights that are applicable for software development. He says the predominant motivating factors have changed as humans evolved over the last 50,000 years. While the motivation 50,000 years back was just trying to survive, the labor workforce during early stages of industrial revolution was motivated to seek rewards and avoid punishments. He delves deep into what motivates the modern technology workforce required for software development.

    He makes a compelling case on why rewards don’t work. The deadly flaws with rewards are that they can extinguish intrinsic motivation, diminish high performance, crush creativity, crowd out good behavior, encourage unethical behavior, become addictive and foster short-term thinking. Rewards are often equated with compensation, so does this mean compensation does not matter? Compensation does matter and is vital to attract good talent. Instead of taking a carrot-and-stick approach to compensation, pay the team well in line with their market value and take money out of the equation, so that the team is driven by intrinsic motivation.

    The question then is how to achieve intrinsic motivation. Daniel Pink has an answer that I have seen work effectively – create a Results Only Work Environment and provide autonomy over the 4 “T”s:

    • Task: People are hired for specific business needs and they need to perform activities required to satisfy them. At the same time, several companies have benefited immensely by encouraging their people to spend about 20% of their time on tasks that they want to do on their own.
    • Time: Stop tracking time! Several studies have shown that creative work like software development cannot be measured by time – there are situations when an outcome that an expert programmer can produce in 2 hours cannot be achieved even after hundreds of hours spent by several mediocre programmers.
    • Technique: Business priorities determine what needs to be done but avoid telling the team how to do it. The suggestion is simple – hire people you can trust, tell them what needs to be done and trust them to figure out how to do it.
    • Team: Let the Team interview and select new members for their own team.

    I will conclude by referring to Mihaly Csikszentmihalyi’s theory that people are happiest when they are in a state of flow – a state of concentration or complete absorption with the activity at hand and the situation. It is a state in which people are so involved in an activity that nothing else seems to matter. Some people call it being in the zone or getting in the groove. This is the state that people in an Agile team aspire to reach. So, create an environment where the team is fueled by intrinsic motivation and let the results flow in!

    Scrum: What is it all about?

    After articulating my views on agile in my previous blog, the next step is to cover the most famous agile framework in practice across the industry – Scrum. If you want to get a quick insight into Scrum, you should read The Scrum Guide authored by the creators themselves. There are numerous books and online material available to cater to your specific interests. This blog is only my mental model of Scrum.

    Where did the term scrum come from? Rugby – scrum (short for scrummage) is a method of restarting play in rugby that involves players packing closely together with their heads down and attempting to gain possession of the ball. It was first used in a software development context by Hirotaka Takeuchi and Ikujiro Nonaka in their 1986 HBR paper “The New New Product Development Game”. Rugby is a team sport, and success can be achieved only when all the players perform in unison. Teamwork is essential for software development to succeed too.

    Who developed Scrum for software development? Ken Schwaber and Jeff Sutherland. They were among the 17 original signatories of the Agile Manifesto in Feb 2001.

    Definition of Scrum: A framework within which people can address complex adaptive problems, while productively and creatively delivering products of the highest possible value. Scrum is lightweight and simple to understand but difficult to master.

    Scrum is founded on empirical process control theory, or empiricism. Empiricism asserts that knowledge comes from experience and making decisions based on what is known. Scrum employs an iterative, incremental approach to optimize predictability and control risk. Three pillars uphold every implementation of empirical process control: transparency, inspection and adaptation.

    One needs to go through a 2-day Certified Scrum Master (CSM) training to get a good understanding of Scrum. Having gone through the training twice and practiced it for several years, I would say Scrum is all about understanding the roles, events and artifacts, and bringing them together to succeed in developing complex software.

    Roles in a Scrum Team: The Scrum Guide has captured this foundational element insightfully. To retain the impact, I have just pasted the excerpt below:

    The Scrum Team consists of a Product Owner, the Development Team, and a Scrum Master. Scrum Teams are self-organizing and cross-functional. Self-organizing teams choose how best to accomplish their work, rather than being directed by others outside the team. Cross-functional teams have all competencies needed to accomplish the work without depending on others not part of the team. The team model in Scrum is designed to optimize flexibility, creativity, and productivity. The Scrum Team has proven itself to be increasingly effective for all the earlier stated uses, and any complex work.

    Scrum Teams deliver products iteratively and incrementally, maximizing opportunities for feedback. Incremental deliveries of “Done” product ensure a potentially useful version of working product is always available.

    Every word stated above is important and really leaves no scope for misinterpretation. However, many practitioners and so-called experts continue to alter the roles for their convenience. I have seen instances where a Manager from the legacy process becomes Scrum Master in the new environment and attempts to continue managing the team. As per my Scrum Coach, any violation of these definitions is fake scrum!

    A quick summary of the only three roles recognized in Scrum:

    • The Product Owner is the only person responsible for managing and prioritizing the book of work (Product Backlog).
    • The Development Team in Scrum typically includes seven plus / minus two members. It is self-organizing and cross-functional, and the accountability for delivering committed items belongs to the Development Team as a whole.
    • The Scrum Master is a servant-leader for the Scrum Team, responsible for promoting and supporting Scrum by helping everyone understand Scrum theory, practices, rules and values.

    Scrum Events: Some people call them ceremonies or routines. I feel the former unnecessarily glorifies them while the latter sounds mundane. I like to stick to events, as the word reflects simplicity and necessity. All events are time-boxed with an agreed maximum duration. The super event is The Sprint, which is a container for all the other events, each designed to facilitate the three pillars of Scrum – transparency, inspection and adaptation.

    An overview of the events:

    • The Sprint is the heart of Scrum, a timebox of one month or less during which a Potentially Shippable Product Increment (PSPI) is created. Sprints have consistent durations throughout the development effort, and a series of Sprints would typically result in a Minimum Viable Product (MVP). While the sprint duration should not exceed a month, the most preferred duration is a fortnight. As a rule of thumb, the higher the ambiguity in requirements, the shorter the sprint. This might be counter-intuitive for some, but it is easy to understand when you consider it from an inspection and adaptation perspective. Shorter sprints allow for failing faster and pivoting quickly without being carried away by an illusion of control.
    • Sprint Planning is the first event during a Sprint. The primary input for this event is the prioritized Product Backlog that the Product Owner maintains. Sprint Planning covers what can be done in this sprint and how the work will be done. The outcome is a Sprint Backlog and a Sprint Goal that the entire team commits to. It is time-boxed to not more than 5% of a Sprint.
    • Daily Scrum is a 15 minute event for the development team where every team member answers the following three questions:
      • What did I do yesterday to meet the Sprint Goal?
      • What do I plan to do today?
      • What are the impediments that need to be addressed?
    • Sprint Review is held at the end of the Sprint for the development team to demo the PSPI to the Product Owner. It can occupy up to 5% of the Sprint depending on the level of detail that needs to be covered. At the end of the review, the Product Owner updates the Product Backlog based on learnings from the Sprint in the spirit of inspection and adaptation.
    • Sprint Retrospective is an opportunity for the team to introspect. All team members articulate what went well during the sprint, what could have been done better and collectively come up with a plan for improvements. The Scrum Master plays a key role during this event, helping the team to stay positive and productive.
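
    The percentages above are easier to judge once converted into hours. Here is a rough Python sketch that does the arithmetic using the figures quoted in this post (5% for Sprint Planning, up to 5% for Sprint Review, 15 minutes for each Daily Scrum). The 8-hour working day, the 5-day working week and the function name are my assumptions.

```python
HOURS_PER_DAY = 8  # assumption: an 8-hour working day
DAYS_PER_WEEK = 5  # assumption: a 5-day working week

# Upper bounds as a percentage of the Sprint, as quoted in this post.
TIMEBOX_PERCENT = {
    "Sprint Planning": 5,
    "Sprint Review": 5,
}
DAILY_SCRUM_MINUTES = 15


def sprint_timeboxes(sprint_weeks):
    """Return the maximum hours per event for a sprint of the given length."""
    working_days = sprint_weeks * DAYS_PER_WEEK
    sprint_hours = working_days * HOURS_PER_DAY
    boxes = {
        event: percent * sprint_hours / 100
        for event, percent in TIMEBOX_PERCENT.items()
    }
    # One Daily Scrum per working day, added up over the whole sprint.
    boxes["Daily Scrum (total)"] = working_days * DAILY_SCRUM_MINUTES / 60
    return boxes


for event, hours in sprint_timeboxes(2).items():
    print(f"{event}: at most {hours} hours")
```

    For a fortnightly sprint this works out to a handful of hours per event, which is a useful sanity check when a planning session starts to sprawl.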

    Scrum Artifacts: Scrum keeps this part simple and focuses on enabling the three pillars of Scrum. The artifacts are:

    • Product Backlog is a list of everything that is known to be needed in the product and ordered by their value as determined by the Product Owner. Product Backlog is always evolving and the highest ordered items are more detailed than lower order ones. The details include estimates and the Product Owner collaborates with the development team to flesh out the details. This process is called Product Backlog Refinement.
    • Sprint Backlog is the list of all items to be completed to achieve a Sprint Goal.

    These are the Scrum basics in a thousand words. Scrum is quite simple, and sometimes simple things are the most difficult ones to follow. A team will realize this as they encounter issues during the initial sprints after an agile transformation. However, the good news is that the Scrum framework provides the means to deal with all the challenges that will inevitably come up. Just stick to the basics and persevere with the framework, and success will follow! Happy scrumming!!!

    Agile Software Development: Revisited

    It has been six years since I was formally initiated into Agile Software Development, and I find myself at a logical juncture to reflect on the experience. I started my Agile journey in Jan 2013 as a skeptic, having seen another team decimated during the previous year after a global Agile transformation. There was no choice, as my team was next in line and the transformation was scheduled to officially start with a week-long Certified ScrumMaster course in Chicago. The course started on a cold winter morning with senior leaders from all locations in attendance, and it soon became clear that it had to be an all-in transformation, with any half-measures doomed to fail. Over the next three months, having understood the merits of succeeding with Agile and the risks of not doing so, I became a believer and an earnest adopter. I was a proud practitioner during the next two years, coaching seven scrum teams across more than fifty sprints. I am not going to tell the story here, but will share some of the learnings from the experience.

    What is Agile Software Development and how is it different from the other methods used? There are many Agile frameworks / methodologies – Scrum, Extreme Programming (XP), Lean, Adaptive Software Development and many more. The common elements across all these methods are captured insightfully in the Agile Manifesto signed in Feb 2001. If a team truly embraces all the values listed below from the manifesto even without religiously following a specific methodology, it is still an agile team. As a corollary, if any of these values is not followed in letter and spirit, then it is fake agile!

    Individuals and interactions over processes and tools
    Working software over comprehensive documentation
    Customer collaboration over contract negotiation
    Responding to change over following a plan

    That is, while there is value in the items on the right, we value the items on the left more.

    These values are achieved by following the 12 principles that complete the Agile Manifesto.

    1. Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
    2. Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.
    3. Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.
    4. Business people and developers must work together daily throughout the project.
    5. Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.
    6. The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
    7. Working software is the primary measure of progress.
    8. Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.
    9. Continuous attention to technical excellence and good design enhances agility.
    10. Simplicity–the art of maximizing the amount of work not done–is essential.
    11. The best architectures, requirements, and designs emerge from self-organizing teams.
    12. At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.

    Just stick to these values and principles without violating any and one will be Agile! It is that simple!

    The challenge is not about learning to be agile; the difficult part is to unlearn the old ways that people have grown comfortable with. Some of the elements to follow will go against what is perceived as common sense. So, we need to believe in the quote attributed to Albert Einstein: “Common sense is the collection of prejudices acquired by age eighteen”.

    There is an ongoing debate about purist / theoretical agile vs. being agile in spirit. As one can see from the manifesto, a team is either agile or not. So, where does a purist angle come into play? It does when a specific framework or methodology is used. I experienced it when the teams had to adopt Scrum framework as part of an “all-in” transformation. All-in transformation is one where an entire group decides to make fundamental changes to ways of working by following a framework. It is hard but effective as it reduces ambiguity and resistance, avoids problems created by having scrum and traditional teams work together and will be over more quickly. More importantly, when a team is forced to go all-in by abandoning comfortable traditional practices and mandating hard new practices, it becomes difficult to pretend to adopt the change. It will essentially leave the team with only two options – embrace and survive OR pretend and perish. And Scrum has a number of routines that are difficult to religiously follow. It takes teams to the brink but once the transformation is complete, they will find their sweet spot and settle down while retaining the new found effectiveness.

    As Mike Cohn says in “Succeeding with Agile”, becoming Agile is hard but worth it. It is hard because successful change is neither top-down nor bottom-up, the end state is unpredictable, and the change is pervasive and dramatically different. But it is worth the effort, as successful change will result in higher productivity, faster time to market, higher quality, improved employee engagement and job satisfaction, among other benefits. However, not everyone will willingly and whole-heartedly support the change. One of the significant reasons for resistance from certain groups is explained by Larman’s Laws of Organizational Behavior. It might not be possible to eliminate all the complexity of org structures in a large organization. But it is important to sponsor and empower the agile team. Free them from traditional monitor and control processes. Trust them to get the job done.

    There is a lot more for me to share – on Scrum, Kanban, tools, techniques, books, etc. In the spirit of keeping my blog posts to a thousand words, here is a summary of my journey during the last six years:

    • During the first couple of years, I converted from being a skeptic to a proud agile practitioner coaching co-located, cross-functional, long-lived feature teams to success. It was a great experience to see agile engineering practices like test driven development, peer reviews, continuous integration and continuous deployment in action.
    • I took up a different role during the next four years, where most teams were made up of 6 to 9 members and were expected to release software every month. So, they had to follow most of the agile principles.
    • During this time, I have seen attempts to centrally administer Agile and plan / project manage agile transformation with fancy launch ceremonies. Such approaches that go against agile values and principles have consistently failed to produce desired results.

    I will pause here and will continue this as a series soon.