SRE as the cornerstone of your Production

Site Reliability Engineering (SRE) has become an essential practice for ensuring the reliability and performance of complex software systems. In this article, we explore key insights and best practices shared in a discussion on SRE, covering various aspects of this critical discipline.

As with Agile and DevOps, there are diverse interpretations of what SRE entails. For some, the definitive interpretation is the one offered by Google, the originator of the acronym. Others view SRE as a trendy new job title for the operations teams handling distributed systems. Nevertheless, when we speak of SRE practices, we mean all the practices aimed at enhancing reliability, which are typically categorized into the following five dimensions:

Business SLAs & Error Budget: Empowering the business to make decisions regarding the IT system based on business outcomes.
Observability: The capacity to foresee potential disruptions to business processes and end-user experiences.
Post-Incident Improvement: Furnishing a clear and predictable path for swiftly implementing fixes in response to ongoing or potential business disruptions.
Automation: Safeguarding against high-risk actions and dedicating more time to value-added tasks for improving existing tools and processes.
Fast & Robust Deployment: Recognizing that reliability also means the capability to securely deliver hotfixes in production at any given time.
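
To make the first dimension concrete, here is a minimal sketch of the error budget arithmetic: the budget is the downtime an availability target still allows over a given window. The `ErrorBudget` class, the 99.9% target, and the 30-day window are illustrative assumptions, not figures from the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: error budget implied by an availability target."""

    slo_target: float    # assumed target, e.g. 0.999 for 99.9% availability
    window_minutes: int  # assumed window, e.g. 30 days expressed in minutes

    @property
    def budget_minutes(self) -> float:
        # Allowed downtime = (1 - target) * window length.
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Budget left after the downtime already consumed in the window.
        return self.budget_minutes - downtime_minutes

    def can_take_risk(self, downtime_minutes: float) -> bool:
        # The business may accept risky changes only while budget remains.
        return self.remaining_minutes(downtime_minutes) > 0


budget = ErrorBudget(slo_target=0.999, window_minutes=30 * 24 * 60)
print(round(budget.budget_minutes, 1))         # about 43.2 minutes per 30 days
print(round(budget.remaining_minutes(30), 1))  # about 13.2 minutes left
print(budget.can_take_risk(50))                # False: budget is exhausted
```

With numbers like these in hand, the decision to ship a risky change or to freeze releases becomes a business decision rather than a purely technical one.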

However, there exists a sixth dimension that is often overlooked—a dimension pivotal to the successes of companies such as Google, Facebook, and Amazon: Design. Organizations that maintain highly reliable services while frequently implementing changes ensure that SRE has a direct influence on service design. This influence extends from the codebase to involvement in software architecture and daily interactions with development teams.

A common pitfall in any transformation is to push these dimensions vigorously and expect that doing so will automatically enhance reliability. The recommended approach is to start by considering reliability, then design, and then to employ the other dimensions as levers.

Now let's get a bit more hands-on with a real outage and what an SRE actually does. When a significant incident occurs, there are six critical moments for an SRE to analyze and act upon (detection and notification are grouped together below):

Time to Detect & Notify: Implement automatic detection, alerting, and comprehensive traceability for subsequent analysis.
Time to Assign: Ensure that the incident management process has a single designated owner, without spreading responsibility across too many people.
Time to Diagnose: Minimize the time required for diagnosis by establishing observability to prevent guesswork during incidents. Ensure that the investigative team is not disrupted.
Time to Restore: Grant operators the necessary credentials to act swiftly, identify all high-risk actions taken, and focus on restoring the service rather than identifying root causes.
Time to Design: Determine the actions needed to prevent a recurrence of the same issues or minimize their impact. This may encompass enhancements in monitoring, processes, on-call procedures, as well as functional, technical, or software design.
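
As a rough illustration of how the measurable moments above can be analyzed after an incident, the sketch below derives each phase duration from a timeline of timestamps. The timeline values and the `phase_durations` helper are invented for the example. Time to Design is left out because it happens after the service is restored.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Hypothetical incident timeline; every timestamp is made up for illustration.
timeline = {
    "started": datetime(2024, 1, 10, 9, 0),
    "detected": datetime(2024, 1, 10, 9, 12),
    "assigned": datetime(2024, 1, 10, 9, 20),
    "diagnosed": datetime(2024, 1, 10, 10, 5),
    "restored": datetime(2024, 1, 10, 10, 25),
}


def phase_durations(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Return the duration of each phase between consecutive incident events."""
    order = ["started", "detected", "assigned", "diagnosed", "restored"]
    names = [
        "time_to_detect",
        "time_to_assign",
        "time_to_diagnose",
        "time_to_restore",
    ]
    return {
        name: events[end] - events[start]
        for name, start, end in zip(names, order, order[1:])
    }


durations = phase_durations(timeline)
for name, duration in durations.items():
    print(f"{name}: {duration}")

# The longest phase is usually the first candidate for improvement.
slowest = max(durations, key=durations.get)
print(f"slowest phase: {slowest}")
```

In this invented timeline, diagnosis dominates the outage, which would point toward investing in observability before anything else.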

In real-life scenarios, enhancing reliability is invariably a trade-off involving at least three key stakeholders: development teams, operations teams, and the business. The following observations reflect potential responses from each team after an outage:

Development Team: Propose enhancing monitoring for more proactive responses, as automating everything can be time-intensive and isn't necessary for less likely incidents.
Operations Team: Request that the development team automate failover procedures. However, an increase in monitoring places additional workload on the operations team, which is already occupied.
Dev & Ops Team (aka You Build It, You Run It): Assert that such incidents are rare and challenging to improve upon, expressing reluctance to alter upcoming sprints.
Business Team: Express dissatisfaction with downtime, citing financial losses or missed opportunities.

Complex information systems, where SRE is particularly valuable, often involve more than three stakeholders, including users, application production engineers, managers, and control towers. This is why SRE is really the cornerstone: it makes sure all the actors talk to each other, at the same time and about the same service, so that together they take the actions that enhance reliability at the right cost, not just the actions that are easiest for each team.

The real challenge lies in initiating your SRE journey. There is no one-size-fits-all approach, but based on our experience as consultants and SRE professionals in large tech companies, here are the critical points that warrant serious consideration for a successful transformation:

Define Objectives: Before commencing, clarify your objectives. Why do you want to implement SRE? What is important to your organization, and what are your expectations?
Define SRE Role: Clearly outline the SRE role and responsibilities. For complex information systems, this step is essential to thinking reliability first and then using business SLAs, observability, post-incident improvement, automation, fast and robust deployment, and design as levers.
Identify SRE Expertise: Identify or recruit your initial SRE, someone with strong interpersonal skills and proficiency in observability and design.
Determine Operating Model: The most suitable operating model is the one that empowers your SREs. This choice depends on your organization's IT legacy and corporate culture, whether it's a Center of Excellence (NatWest), embedded within development squads (Amazon, SG Markets), or situated in operations (Google, BNP Paribas, Bank of America).

In conclusion, the successful implementation of Site Reliability Engineering (SRE) is a multifaceted endeavor that requires a thoughtful approach. It's imperative to recognize that SRE isn't solely about reactive measures in the face of incidents but also about proactive strategies that span multiple dimensions. The dynamic nature of SRE, influenced by the ever-evolving technology landscape, necessitates continuous collaboration and communication among diverse stakeholders. The ability of SRE to bring these parties together around a common goal, improving reliability while managing costs, is the cornerstone of its success.

Vincent Gérard

Principal Manager