Site Reliability Engineering 101
Some initial thoughts on thinking about establishing a Site Reliability Culture and Team
SRE Site Reliability Engineering org design organizational modelsPhoto by Goh Rhy Yan / Unsplash
Ideologies are great ways to add structure to a group of people, one I’ve gotten enamoured with is the framework of Site Reliability Engineering or ‘SRE’. That said - I’ve also been asked if the SRE culture is less culture and more cult. In my experience the execution really matters, and the fine details like working agreements can greatly shape how the culture forms around this idea. The SRE concept isn’t all that well understood, and I think that is where some of the mythos around it makes it murky and hard to approach.
What is SRE exactly? Why isn’t DevOps the same thing?
DevOps was a concept, or an idea rather than a framework as I accept it. - There was a clear industry rift in many organizations between a formal Software Developer community, and the Operations community. These two often were isolated from each other for reasons that are frankly designed around factory processes that helped the Model T come off the line, but have little positive affect anymore. DevOps was the concept of putting the two much closer together, so close that they sliced the two words together and mashed them as close as linguistically possible. Despite that, the lack of definition around what DevOps was, lead to hiring ‘DevOps Engineers’ rather than hiring developers, operations, or any role and asking them to perform in a ‘DevOps culture’.
Here is some reading on DevOps:
Site Reliability Engineering really goes much further, and describes not just a solution for two departments which haven’t gotten along, but really describes how to better divide work and responsibility within a company. The idea stems from a simple premise, but it has some extensive implications. Rather than have top down management be the source of direction and coordination for all things - carve out a scope of ownership and make a team fully responsible for the development, operations, health, financial success, and really as many aspects of an effort as possible for that scope. Examples of SRE in practice vary greatly company to company, but typically are found in cloud-scale, or highly distributed environments where quick effective iteration is desired. This approach hyper-localizes the conversation and creates alignment through extreme focus within the company; enabling meaningful collaboration, responsibility, high initiative, and creativity. SRE is still being defined by the industry for it’s exact definition, but the core tenets involve cross-functional dialogue and enablement for the people close to the problem to work collaboratively to solve it. It dovetails well with the idea of Squad based development since that concept also goes further to describe the even wider extent of cross-functional development. The benefits really start to show themselves when the business can more greatly scale horizontally, and deliver, operate more at greater scale without scaling management tiers so agressively. This provides more focused business spend on delivery of product/services rather than overhead.
Here is some reading on the definition of SRE:
- https://en.wikipedia.org/wiki/Site_Reliability_Engineering
- https://landing.google.com/sre/interview/ben-treynor.html
Some Definition of Squad, which can pair nicely with SRE:
- https://alexwitherspoon.com/agile-squad-organization-models/
- https://medium.com/project-management-learnings/spotify-squad-framework-part-i-8f74bcfcd761
Let’s add some ‘ure’ to the ‘cult’.
TL:DR - Do everything you can to safely build a team dynamic that supports collaboration, responsibility, high initiative, and creativity.
Culture is really hard, and it’s the influence and perception of the successful members of a community on it’s constituents that really form it over time. Many ‘small’ things add up to a combined culture as well, so it’s hard to even attempt to describe them all specifically. At the macro level however, it’s important to create a mutually beneficial environment and remove any parasitic elements you can - those will by definition drain the team with no reward and eventually destroy the team dynamic. At that point the team can’t collaborate, and start to only follow, or lose initiative, and/or eventually leave. If that’s the intention, well congratulations, you’ve built a cult! Make sure the right leaders exist to champion the SRE culture, and the business culture around it. These leaders should embody the culture you want to see, because that is who they are naturally. - Find leaders that honestly and personally value collaboration, responsibility, high initiative, and creativity.
Here are some excellent reading around building these Agile SRE work dynamics:
- https://landing.google.com/sre/book.html
- https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin
Hopefully, this has been a primer on what SRE can be, and give you some ideas to consider how a SRE framework could help enable better collaboration and quality of execution. I found that in environments that were exceptionally good; the culture helped share ownership, which caused more collaboration, which in turn increased quality and reduced risk.