On-Call Design

A minimum viable on-call design guide!



Overview

Systems will fail, and how we respond determines how much damage a failure does. Designing an on-call responder rotation is tricky: it involves trade-offs, tooling, process, empathy, and conversations about company culture. In this document, I’ve included recommendations from industry best practice, along with experience implementing and operating software engineering on-call rotations in a number of large-scale SRE-style environments. Many specific details will need to be fine-tuned for your situation and discussed further, but this can act as the foundation for those conversations.

Goals

  1. Preserve staff sleep as much as possible, and maximize engineer value.
  2. Suggest a minimum viable On-Call model that is easy to implement and improve upon.
  3. Define a scalable and repeatable way to operate our software services.
  4. Increase the ‘quality of life’ for on-call staff.
  5. Reduce risk throughout the process.
  6. Reduce minimum scope of domain expertise required for a given on-call staffer.

TL;DR - Next Steps

Sorry, no short-cuts! Read this document, consider the points within it, and use it to structure your conversation, whether you create your own plan or fine-tune this one for your environment. These models cover many of the primary things that need consideration even if you choose a different approach. When you do create a policy for your environment, commit to re-evaluating it at least quarterly to make sure it is working and to incorporate feedback.

Key Concepts

Many companies with bespoke internal hierarchies like to coin complex phrases and then shorten them into indecipherable acronyms. This often excludes people from conversations, a passive form of structural oppression that makes participation hard, but that’s another white paper for another day. Here are some common concepts spelled out as they are used in this document.

Service Level Agreements, or SLAs

If a customer wants to ensure a specific level of availability for a product or service, they will often ask for an ‘SLA’. An SLA is simply a section of the purchase contract that states that intent, often with financial incentives and legal definitions for the terms. Examples include availability of the service or time-to-resolution requirements. SLAs usually make sense outwardly to customers but don’t always map directly onto internal teams, because a single external service may be composed of many internal services that make it possible. As a principle, a Service Level Agreement should never promise more than the Service Level Objectives of the services it depends on can support.

Service Level Objectives, or SLOs

A given product or service a customer may purchase is often composed of multiple components. A service level objective is the internal goal used to make sure each constituent part or service is capable of working in aggregate to meet service level agreements with customers. SLOs are often internal documents describing each specific component of the system, its criticality, the measurable goals for performance, and the ownership of that component. Another simple way to think about them: they define what normal looks like and when something needs to be fixed.
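The rule that an SLA should not promise more than the underlying SLOs can support can be checked with simple arithmetic. The sketch below assumes the customer-facing service needs every internal component to be up, and that components fail independently; both are simplifying assumptions, not something this guide prescribes. Under those assumptions, the best availability the external service can offer is the product of the component SLO targets. The function names and the sample targets are hypothetical.

```python
from math import prod


def composite_availability(slo_targets):
    """Upper bound on availability when every component is required
    and failures are independent (simplifying assumptions)."""
    return prod(slo_targets)


def sla_is_supportable(sla_target, slo_targets):
    """True if the promised SLA does not exceed what the SLOs support."""
    return sla_target <= composite_availability(slo_targets)


# Hypothetical SLO targets for three internal services behind one product.
component_slos = [0.999, 0.9995, 0.999]

print(f"composite availability: {composite_availability(component_slos):.4%}")
print("99.9% SLA supportable:", sla_is_supportable(0.999, component_slos))
print("99.7% SLA supportable:", sla_is_supportable(0.997, component_slos))
```

Three components that each look comfortable on their own combine to less than 99.9%, which is why an SLA is best written after the SLOs of its dependencies are known.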

Future State

Every Service Level Objective should ultimately be backed by an alert that reaches a responsible party for remediation. There are two main kinds of on-call structure that should be used, and they differ in domain expertise and ownership of a given service. As a principle, it’s preferred to minimize the scope of expertise required to respond to on-call, so the Single Team Ownership Model should be chosen whenever it’s possible.

Single Team Ownership Model

When a service or system is operated regularly by a single long-lived team or group of people, they become the domain experts for that service or system. This model is ideal because alerts escalate directly to the people who know the most about the problem and who can fix it. The team or group can independently resolve and fix issues with less impact on the larger group of support staff.

General Structure

Each team should pick an exact structure that works for them, but use the below as guidelines for what is minimally required by the organization, and as a set of minimal topics to discuss and decide on.

On-Call Membership

All staff whom the manager of that team considers ready to handle the responsibilities of the task should take part (including non-engineering staff). The rubrics below can help define readiness. The manager and staff should mutually agree to the new responsibility. Each month, the manager should check in with all on-call staff who report to them to confirm they are still capable of, and agree to, on-call responsibilities.

Staff On-Call Duration

Discuss with the team, and pick a duration of no less than 3 days and no more than 7 continuous days for an individual to be on-call. If in doubt, I would strongly suggest starting with 7-day durations and adjusting from there. The manager of the team should be the escalation above their staff. If an on-call staffer needs to adjust their on-call schedule, they should get approval from their manager. The manager should review the on-call schedule with the team regularly.

On-Call Schedule Handoffs

Handoffs between on-call staff should be scheduled consistently during normal business hours in the work week; international teams should agree on a consistent time instead. Avoid night-time or ‘sleeping’ schedule handoffs. A debrief meeting with, at minimum, the current and the next on-call staffer should happen at the handoff to discuss anything that has happened and check that everything has been documented and acted on. It’s a great idea for the manager or tech lead to join as well, or the entire team if they want to, but make it optional.
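If the schedule lives in a tool or a script, a small check can flag handoffs that land outside working hours. This is only a sketch: the 09:00 to 17:00, Monday to Friday window is an assumed definition of business hours, the function name is hypothetical, and an international team would substitute its own agreed time.

```python
from datetime import datetime


def is_business_hours_handoff(when, start_hour=9, end_hour=17):
    """True if `when` falls Monday-Friday within [start_hour, end_hour).

    `when` is assumed to already be in the team's agreed local time.
    """
    return when.weekday() < 5 and start_hour <= when.hour < end_hour


# Hypothetical handoff times to validate.
tuesday_morning = datetime(2024, 1, 2, 10, 0)
saturday_morning = datetime(2024, 1, 6, 10, 0)
tuesday_night = datetime(2024, 1, 2, 3, 0)

for candidate in (tuesday_morning, saturday_morning, tuesday_night):
    print(candidate, is_business_hours_handoff(candidate))
```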

Scheduling On-Call and Escalations

Level 1 - Scheduled 1st Responder

Level 2 - Entire Team

Level 3 - Team Manager

The team should have a single rotation for the service. It should be triggerable only when a service level objective has been missed, and it should escalate automatically to the entire team, and then to the manager, after a set amount of time if the alert isn’t acknowledged by the on-call staff. The manager can then confirm and assign the page to another staff member as they see fit.
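The three levels above can be written down as data, which makes the policy easy to review before it is entered into a paging tool. This is a minimal sketch rather than any particular vendor's configuration format; the 15 and 30 minute delays are placeholder assumptions, since the right timeout is for each team to decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    name: str
    notify_after_minutes: int  # minutes unacknowledged before this level is paged


# Delays are placeholders; each team should choose its own.
POLICY = [
    EscalationLevel("Level 1 - Scheduled 1st Responder", 0),
    EscalationLevel("Level 2 - Entire Team", 15),
    EscalationLevel("Level 3 - Team Manager", 30),
]


def levels_to_page(minutes_unacknowledged, acknowledged=False, policy=POLICY):
    """Return the names of every level that should have been paged so far.

    Once the alert is acknowledged, escalation stops and nothing is returned.
    """
    if acknowledged:
        return []
    return [
        level.name
        for level in policy
        if minutes_unacknowledged >= level.notify_after_minutes
    ]


for minutes in (0, 20, 45):
    print(minutes, levels_to_page(minutes))
```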

Pick a single tool to manage on-call schedules and automatic escalation. Then use it. (Examples include PagerDuty, OpsGenie, VictorOps, or a similar tool.)

Bonus points for integrating those tools with your system for managing vacation/(P)TO to make sure people on call are not on vacation. Lacking an automated tool, the manager for the team should check manually that on-call staff’s (P)TO does not conflict with their shifts, or that other staff can cover for the individual.
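Where no integration exists, the manual check amounts to looking for overlapping date ranges. The following sketch assumes shifts and time off can each be exported as (name, first day, last day) tuples; that data shape, the function names, and the sample people are all hypothetical.

```python
from datetime import date


def overlaps(a_start, a_end, b_start, b_end):
    """Inclusive date ranges overlap when each starts on or before the other ends."""
    return a_start <= b_end and b_start <= a_end


def find_pto_conflicts(shifts, time_off):
    """Return the shifts whose owner has time off during the shift.

    Both arguments are lists of (name, first_day, last_day) tuples.
    """
    conflicts = []
    for name, shift_start, shift_end in shifts:
        for pto_name, pto_start, pto_end in time_off:
            if name == pto_name and overlaps(shift_start, shift_end, pto_start, pto_end):
                conflicts.append((name, shift_start, shift_end))
                break  # one conflict is enough to flag the shift
    return conflicts


# Made-up schedule and vacation data.
shifts = [
    ("alex", date(2024, 1, 2), date(2024, 1, 8)),
    ("sam", date(2024, 1, 9), date(2024, 1, 15)),
]
time_off = [
    ("sam", date(2024, 1, 15), date(2024, 1, 19)),
    ("alex", date(2024, 2, 1), date(2024, 2, 5)),
]

for name, start, end in find_pto_conflicts(shifts, time_off):
    print(f"{name} needs cover for the shift {start} to {end}")
```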

Choosing how to Communicate

Pick a primary, company-approved tool to use during an incident. It should be highly reliable, ideally hosted by someone else, and allow visual, verbal, text, and company-confidential file conversations between remote people. Incidents worth waking up for often wake us from our beds, not from our office chairs. (Examples include Slack, Mattermost, Zoom, Google Hangouts, or a similar tool.)

Pick a second tool, hosted by a different company/entity and establish that as a known backup in case the primary tool isn’t working. This can be something simple like phone numbers, but try to avoid passing around private personal information whenever possible.

Publish the tools and how to access them in a place staff can find, and run fire drills at least twice a year to check that the systems are working and that staff remember how to use them.

Playbooks

At a minimum, playbooks should contain the service level objective an alert is triggered by, the symptoms to investigate to identify the root cause, and finally steps for remediation and/or escalation. Playbooks should be stored in a place the entire team can easily find them, and they should be edited and reviewed at least quarterly.
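One way to keep playbooks honest is to store the minimum sections as structured fields and check them mechanically. The record below is a hypothetical shape, not a required format; the field names, the sample SLO text, and the 92-day approximation of "quarterly" are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Playbook:
    """Hypothetical record holding the minimum sections named above."""
    slo: str                 # the service level objective the alert is triggered by
    symptoms: list           # what to investigate to identify the root cause
    remediation_steps: list  # steps for remediation and/or escalation
    last_reviewed: date

    def missing_sections(self):
        """Names of required sections that are empty."""
        required = {
            "slo": self.slo,
            "symptoms": self.symptoms,
            "remediation_steps": self.remediation_steps,
        }
        return [name for name, value in required.items() if not value]

    def review_overdue(self, today, max_age=timedelta(days=92)):
        """True if the last review is older than roughly one quarter."""
        return today - self.last_reviewed > max_age


example = Playbook(
    slo="checkout-api: 99.9% of requests succeed over 30 days",  # made-up SLO
    symptoms=["elevated 5xx rate", "database connection errors"],
    remediation_steps=[],  # deliberately empty to show the check
    last_reviewed=date(2024, 1, 15),
)

print("missing:", example.missing_sections())
print("overdue on 2024-06-01:", example.review_overdue(date(2024, 6, 1)))
```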

Documentation

Document the timeline of the event in an incident report, ideally with automatic tooling plus manual annotations by the on-call staff. Make sure this documentation can be shared with other team members and the manager of the team. (Examples include PagerDuty, OpsGenie, VictorOps, Google Docs, Office365, or similar tools.)

Onboarding & Training

The following process should help train prospective on-call staff and ensure that they are ready for the responsibility.

Buddy System

The manager of the team should assign each prospective on-call staffer an on-call buddy who has been on-call before, has experienced real-world incidents, and has proven to respond well to them per the rubric. The buddy should help mentor, support, evaluate, and train the prospective on-call staffer until the manager approves them to join the rotation.

On-Call Shadowing

Anyone should be able to apply to be part of an on-call rotation, but only those who are ready for the responsibility should be allowed to be on-call without supervision. I recommend that a prospective on-call member should be required to shadow an existing on-call staff and encounter a minimum of 3 real-world incidents of any severity, or show strong evidence they meet all the rubric requirements through mock-scenarios before they are allowed to be on-call. Each on-call buddy should debrief and evaluate the prospective on-call staffer using the rubric to help constructively train and improve. The outcomes of these evaluations should be shared with the manager of the prospective on-call staff.

Tooling

Whatever tools are used to measure, analyse, or investigate the operation of the service or system should be documented, and that documentation should be reviewed by the prospective on-call staffer and their buddy. These tools should be further explored through mock-incidents and On-Call Shadowing sessions.

Role Awareness

During an incident, each person, regardless of their normal day job, will take on roles for the duration of the incident to help bring it to a safe conclusion. Below are the minimum roles to have on hand. The 1st responder is expected to handle all roles and delegate to others as they arrive on scene. A manager is expected to be able to fill all roles, or find someone to fill a needed role.

Incident Review

At least monthly, the entire team should go over the highest-severity and highest-volume incidents that occurred since the prior meeting, review the playbooks, and consider improvements to the service or system. Awareness of real-world problems helps prepare for responding to and fixing the issues. This is also an excellent time to review all Service Level Agreements to make sure they are still accurate, make sense, and are being monitored correctly.

Playbook Review

Each new prospective on-call staffer should familiarize themselves with a playbook and attempt to act it out as a mock-exercise with the help of the on-call buddy. Simulate failures and remediations, and debrief the event. Use this as a chance to learn the playbooks, as well as to suggest playbook improvements.

Regular Manager Evaluation

At least monthly, the manager should review with each on-call staffer the incidents they have encountered, using the rubric below and reviews from peer on-call staff. This is an opportunity to coach and mentor improvements, and to confirm the on-call staff are still ready for the responsibility. If the manager approves, an on-call staffer can be removed from the on-call rotation and go through the onboarding process again to rejoin at a later date. There should be no stigma attached to this: the worst thing possible is having someone on-call who is not ready for the responsibility and is set up to fail, to the detriment of the team and the company.

Shared Ownership Model

Often, when software isn’t abstracted or isolated into distinct functional elements or failure domains, multiple teams or groups have to work together to support a common service or system. Besides general complexity, this is especially tricky for two main reasons: the scope of domain expertise increases, and it becomes harder to get the right person on the problem at the right time. The sections below call out the elements that are different for a Shared Ownership Model.

General Structure

The Shared Ownership Model is similar to the Single Team Ownership Model, with the major exceptions of cross-team collaboration and the need for central operational oversight. In many companies that oversight comes from a support manager, or someone on the operations side of the business. I will refer to this person as the on-call rotation owner.

On-Call Membership

Each team participating in the Shared Ownership Model should nominate a similar number of staff for on-call duty. For example, a company with 10 engineering teams might nominate 10 staff at minimum, one from each team, which spreads the operational burden and exposes a wider range of staff across the 10 teams to this shared on-call responsibility. Using 7-day on-call schedules as an example, this would be one week on-call every ten weeks. The on-call nomination on each team may change from time to time to allow for a break from on-call work.

Staff On-Call Duration

Discuss with the team, and pick a duration of no less than 3 days and no more than 7 continuous days for an individual to be on-call. If in doubt, I would strongly suggest starting with 7-day durations and adjusting from there.
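Combining the duration guideline with the ten-team example above, a round-robin schedule can be generated in a few lines. This is a sketch under assumptions: the staff names and start date are made up, and a real paging tool would normally own the schedule. The start date is a Tuesday so that handoffs fall on a weekday.

```python
from datetime import date, timedelta


def build_rotation(staff, start, shift_days=7, cycles=1):
    """Return (name, first_day, last_day) shifts, round-robin over staff."""
    if not 3 <= shift_days <= 7:
        raise ValueError("shift length should be between 3 and 7 days")
    shifts = []
    for i in range(len(staff) * cycles):
        first = start + timedelta(days=i * shift_days)
        last = first + timedelta(days=shift_days - 1)
        shifts.append((staff[i % len(staff)], first, last))
    return shifts


# Ten hypothetical nominees, one per team.
staff = [f"engineer-{n}" for n in range(1, 11)]
rotation = build_rotation(staff, date(2024, 1, 2), shift_days=7, cycles=2)

# Days between the first nominee's consecutive shifts.
gap = (rotation[10][1] - rotation[0][1]).days
print(f"{rotation[0][0]} is on call again after {gap} days")
```

With ten nominees and 7-day shifts, the first nominee's next shift starts 70 days later, matching the one-week-in-ten figure above.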

On-Call Schedule Handoffs

Handoffs between on-call staff should be scheduled consistently during normal business hours in the work week; international teams should agree on a consistent time instead. Avoid night-time or ‘sleeping’ schedule handoffs. A debrief meeting with, at minimum, the current and the next on-call staffer should happen at the handoff to discuss anything that has happened and check that everything has been documented and acted on. The owner of the shared rotation should be present, along with any other staff needed to run the meeting and take notes.

Scheduling On-Call and Escalations

Level 1 - Scheduled 1st Responders

Level 2 - Scheduled Incident Commanders

Level 3 - On-Call Rotation Owner

The Level 1 escalation should go to the on-call staff scheduled for rotation. The Level 2 escalation should go to a scheduled rotation of Incident Commanders. Finally, the alert should escalate automatically to the owner of the on-call rotation after a set amount of time if it still hasn’t been acknowledged. The owner of the on-call rotation can then confirm and assign the page to another staff member as they see fit.

Pick a single tool to manage on-call schedules and automatic escalation. Then use it. (Examples include PagerDuty, OpsGenie, VictorOps, or a similar tool.)

Bonus points for integrating those tools with your system for managing vacation/(P)TO to make sure people on call are not on vacation. Lacking an automated tool, the manager for the team should check manually that on-call staff’s (P)TO does not conflict with their shifts, or that other staff can cover for the individual.

Role Responsibilities & Rubrics

1st Responder Responsibilities

The 1st responder is often the first person on scene, and may hand the responsibility to another willing and able party when they arrive. While acting as first responder, the following responsibilities are critical.

Acknowledge

Using the tooling of choice, acknowledge the issue, and signal you are ready or not ready to respond. If you know you can’t get to the page because something got in the way, escalate if possible, or let the automatic escalation do its job. Pages don’t always happen at convenient times, and that is what escalation plans are for.

Identify

Given the information available, start looking for symptoms and the probable root cause of the incident. Take quick inventory, and don’t assume obvious problems are the only issue at play. If playbooks are available, check them for any instructions for remediation. Check any tooling available to look for other failures, or systems which are not operating normally.

Communicate

Until an incident commander or other support staff can arrive on scene, your job is to be verbose enough to help others know what is happening and what is currently being done about it. If you need help, clearly ask for the skills or domain expertise needed to assist. Describe what you have identified as you come across it. If the severity of the issue is small, you may be the only responder, but if the incident is larger, your first goal is to communicate the scope of the incident to others who can help. Make sure to avoid ‘going dark’: communicate status at least every 5 minutes, even if it hasn’t changed much, as it helps others find you and support you.
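For a post-incident review, one way to check the 5-minute guideline is to look for silent gaps in the timeline. A minimal sketch, assuming the timestamps of status updates can be exported from whatever chat or paging tool is in use; the function name and the sample timeline are hypothetical.

```python
from datetime import datetime, timedelta


def silent_gaps(update_times, max_gap=timedelta(minutes=5)):
    """Return (previous, next) pairs of consecutive updates further apart than max_gap."""
    ordered = sorted(update_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Made-up timeline: updates at 02:00, 02:04, 02:12 and 02:15.
start = datetime(2024, 1, 2, 2, 0)
updates = [start + timedelta(minutes=m) for m in (0, 4, 12, 15)]

for previous, following in silent_gaps(updates):
    print(f"no update between {previous:%H:%M} and {following:%H:%M}")
```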

Resolve

The key to triage is looking for root causes, and for things that, when fixed, give the best return for effort in a short time. Consider criticality, and for a large enough issue, delegate tasks to others who might be better at solving them, or ask the incident commander for help delegating. Act on playbooks, if available, to recover services as they describe. Make others aware of what issue you are actively working on, and tackle issues calmly one at a time. The incident has already happened at this point; the focus is on recovering the service or system.

Document

As a first responder, documentation often is hard to create when things are broken and we are excited to fix them. Try to focus on keeping track of the timeline, what happened in what sequence of events, and who is doing what on scene. Often an incident commander, or a dedicated responder will take the role as scribe to document the event if the severity of the event requires it. Any documentation is better than none, and even notes on paper can later be documented with the incident report. If the problem being solved hasn’t been documented in a playbook, write down the symptoms, methods of identification, and remediation steps.

Debrief

Depending on how severe an incident was, the event should be documented in a place of record, and a meeting held to discuss the incident with at minimum the manager and on-call staff. As 1st responder you have the obligation to describe the scene as you found it, how you acknowledged, identified, communicated, resolved, and documented the event. For a complex event, you may have delegated to others, and they are accountable for their own roles at that point. Treat these as learning situations and avoid blame and harsh critique.

1st Responder Rubric

Acknowledge
  Below Expectations: Starts working on resolving an issue without acknowledging it, or acknowledges the issue when not available to take further action.
  Meets Expectations: If available is quick to respond and clearly indicates intent to take 1st responder role.
  Exceeds Expectations: If available is quick to respond and clearly indicates intent to take 1st responder role. Manually escalates when not available to another responder.

Identify
  Below Expectations: Is quick to dismiss playbook advice, or doesn't investigate to confirm an issue is present.
  Meets Expectations: Can identify and verify problems outlined in playbooks, and can investigate undocumented issues without much assistance.
  Exceeds Expectations: Can identify and verify problems outlined in playbooks, and can investigate complex or undocumented issues without assistance.

Communicate
  Below Expectations: Disrespectful, unclear or no updates are given. Others are unsure what is going on.
  Meets Expectations: Calmly, respectfully and clearly communicates what others need to know and maintains regular updates of status.
  Exceeds Expectations: Calmly, respectfully and clearly communicates what others need to know and maintains frequent updates of status, and manages updates from others.

Resolve
  Below Expectations: Quickly gives up, or does not request help remediating problems. Does not demonstrate skill or effort to follow playbooks or instructions from others.
  Meets Expectations: Reliably follows playbooks, manual resolution steps, and escalates and requests help when appropriate.
  Exceeds Expectations: Can follow direction from playbooks and other staff, as well as independently investigate resolutions. Quickly escalates as needed to other team members to get Domain Experts on scene.

Document
  Below Expectations: No or little notes of the event timeline, who was involved, or what was done to resolve the incident were taken during or after the event.
  Meets Expectations: Keeps a rough outline of events, who participated, and when services were recovered along with notes about how remediations were attempted during the event.
  Exceeds Expectations: Clear time stamped notes are taken with who is involved, what actions were taken, and what remediations were attempted during the event, and clarifying information added after the event.

Debrief
  Below Expectations: No participation, poor documentation. No further actions, or improvements are suggested.
  Meets Expectations: Clear documentation and explanations provided during debrief. Clarifies any missing details with debrief participants and suggests next steps including playbook improvements or bug fixes.
  Exceeds Expectations: Facilitates meeting, clearly sharing documentation, explanations, and lessons learned during the debrief. Helps gather feedback from others, and suggests playbook improvements or bug fixes.

Incident Commander Responsibilities

There should be one active incident commander per incident, often acting in a support role to coordinate, communicate, take notes, and help responder(s) towards resolution of the incident. If the incident is prolonged, the Incident Commander should be relieved by another Incident Commander. The following are the core responsibilities for the role.

Acknowledge

Using the tooling of choice, acknowledge the issue, look for an already present Incident Commander, and if one is not present, announce your intentions and confirm with the 1st responder.

Communicate

Help the 1st responder by providing updates on the situation, and coordinate others so things stay orderly, clear, and calm during an incident. If you need help, clearly ask for the skills or domain expertise needed to assist. Describe what you have identified as you come across it. If the severity of the issue is small, you may not be needed, but if the incident is larger, your first goal is to communicate the scope of the incident to others who can help. Second is to avoid ‘going dark’: communicate status at least every 5 minutes, even if it hasn’t changed much, as it helps others find you and support you. Help maintain communication discipline and keep the communication lines clear of any non-essential conversation.

Resolve

Help with delegation, assist with finding resources or people. Make others aware of what is being worked on, and by whom. Keep track of

Document

When arriving on scene, try to focus on keeping track of the timeline, what happened in what sequence of events, and who is doing what on scene. Help with documentation so 1st responders can focus on remediation efforts. Begin documentation for an incident report, and open one if appropriate. If the problem being solved hasn’t been documented in a playbook, write down the symptoms, methods of identification, and remediation steps.

Debrief

Depending on how severe an incident was, the event should be documented in a place of record, and a meeting held to discuss the incident with at minimum the manager and on-call staff. As Incident Commander you have the obligation to facilitate meetings, documentation, and communications. Treat these as learning situations and avoid blame and harsh critique.

Incident Commander Rubric

Acknowledge
  Below Expectations: Starts working on resolving an issue without acknowledging it, or acknowledges the issue when not available to take further action.
  Meets Expectations: If available is quick to respond and clearly indicates intent to become, or relieve, the current incident commander.
  Exceeds Expectations: If available is quick to respond and clearly indicates intent to become, or relieve, the current incident commander. Manually escalates when not available to another responder.

Communicate
  Below Expectations: Disrespectful, unclear or no updates are given. Others are unsure what is going on.
  Meets Expectations: Calmly, respectfully and clearly communicates what others need to know and maintains regular updates of status.
  Exceeds Expectations: Calmly, respectfully and clearly communicates what others need to know and maintains frequent updates of status, and manages updates from others.

Resolve
  Below Expectations: Quickly gives up, or does not request help remediating problems. Does not demonstrate skill or effort to assist in the incident.
  Meets Expectations: Assists other responders, and escalates and requests help when appropriate.
  Exceeds Expectations: Can follow direction from playbooks and other staff, as well as independently investigate resolutions. Quickly escalates as needed to other team members to get Domain Experts on scene.

Document
  Below Expectations: No or little notes of the event timeline, who was involved, or what was done to resolve the incident were taken during or after the event.
  Meets Expectations: Keeps a rough outline of events, who participated, and when services were recovered along with notes about how remediations were attempted during the event.
  Exceeds Expectations: Clear time stamped notes are taken with who is involved, what actions were taken, and what remediations were attempted during the event, and clarifying information added after the event.

Debrief
  Below Expectations: No participation, poor documentation. No further actions, or improvements are suggested.
  Meets Expectations: Facilitates meeting, clearly sharing documentation, explanations, and lessons learned during the debrief. Helps gather feedback from others, and suggests playbook improvements or bug fixes.
  Exceeds Expectations: Facilitates meeting, clearly sharing documentation, explanations, and lessons learned during the debrief. Helps gather feedback from others, and suggests playbook improvements or bug fixes and manages follow up actions.

Domain Expert Responsibilities

Acknowledge

If you are requested to provide help with an issue, let the requester know if you can or cannot join the incident response.

Identify

Work with other incident responders to provide information, and start looking for symptoms and the probable root cause of the incident. Take quick inventory, and don’t assume obvious problems are the only issue at play. If playbooks are available, check them for any instructions for remediation. Check any tooling available to look for other failures, or systems which are not operating normally.

Communicate

Make it clear who you are, and what your expertise is. Clearly, respectfully, and calmly help others diagnose and understand complex systems you may understand better than them. Help find documentation as needed.

Resolve

The key to triage is looking for root causes, and for things that, when fixed, give the best return for effort in a short time. Consider criticality, and for a large enough issue, delegate tasks to others who might be better at solving them, or ask the incident commander for help delegating. Act on playbooks, if available, to recover services as they describe. Make others aware of what issue you are actively working on, and tackle issues calmly one at a time. The incident has already happened at this point; the focus is on recovering the service or system.

Domain Expert Rubric

Acknowledge
  Below Expectations: Doesn't make presence known, or coordinate with others.
  Meets Expectations: If available is quick to respond and clearly indicates intent to assist others.
  Exceeds Expectations: If available is quick to respond and clearly indicates intent to assist others and coordinates availability.

Identify
  Below Expectations: Is quick to dismiss playbook advice, or doesn't investigate to confirm an issue is present.
  Meets Expectations: Works well with others to identify and verify problems outlined in playbooks, and can investigate undocumented issues without much assistance.
  Exceeds Expectations: Can identify and verify problems outlined in playbooks, and can investigate complex or undocumented issues without assistance.

Communicate
  Below Expectations: Disrespectful, unclear or no updates are given. Others are unsure what is going on.
  Meets Expectations: Calmly, respectfully and clearly communicates what others need to know and maintains regular updates of status.
  Exceeds Expectations: Calmly, respectfully and clearly communicates what others need to know and maintains frequent updates of status, and manages updates from others.

Resolve
  Below Expectations: Quickly gives up, or does not request help remediating problems. Does not demonstrate skill or effort to follow playbooks or instructions from others.
  Meets Expectations: Reliably follows playbooks, manual resolution steps, and escalates and requests help when appropriate.
  Exceeds Expectations: Can follow direction from playbooks and other staff, as well as independently investigate resolutions. Quickly escalates as needed to other team members to get Domain Experts on scene.

References

https://response.pagerduty.com/

https://www.fema.gov/national-incident-management-system

https://landing.google.com/sre/books/