One of the most common complaints we hear from operations and site reliability engineers is about the quality-of-life impact of their on-call responsibilities and the stress that results. Most of us are already aware that a proper on-call rotation is critical to our engineering organization’s health in terms of both immediate incident response and long-term sustainable growth.
If most of us know that a good on-call rotation is important, why do so many organizations fail to implement one? The reasons range from poorly understood or hidden costs to competing priorities. This post enumerates some of the risks introduced when a single person carries 100% of a team’s on-call duties, and shows why those risks are not eliminated simply by adding more people to the rotation. It isn’t meant to be an exhaustive list. Ideally, it spurs some conversation within your organization about distributing your on-call workload more fairly.
- Burnout and attrition
- Concentration of systems knowledge
- Impaired decision making and accidents
- Gaps in coverage
- If everyone’s responsible, no one is responsible
Burnout and attrition
An organization that expects an engineer to respond to an alert within five minutes and begin triaging severely restricts what that engineer can do in their free time. That person can’t realistically go to a concert, the beach, their kid’s soccer game, the gym or, in an extreme case, even walk their dogs. We can cope with those restrictions for a few days at a time, but expecting someone to carry their laptop around for weeks on end isn’t sustainable. Impairing someone’s quality of life isn’t going to lead to an immediate incident, but given the current demand for SREs, it’s very likely that the individual will move on to a new opportunity.
Organizations most likely wind up with a single-person on-call “team” since developers are expensive. SRE salaries start at around 150k per year¹ at the entry level and easily double with a few years of experience. That’s a high cost, but so is the expense of recruiting. Costlier still is a major incident that occurs while a new hire is still getting up to speed. Given that the demand for engineers far exceeds the supply, it’s only logical that your SRE might seek out a new position if they’re unhappy being a team of one.
Concentration of systems knowledge
There will always be information that is held within a single person’s mind and not documented or shared with the rest of the team. This institutional knowledge is usually held by a few team members who are very familiar with a system, either because they built it or because they have significant experience mitigating its incidents. Ideally, this would be recorded in postmortems and shared with the rest of the team. Sadly, if the entire response burden is placed upon one person, that knowledge will likely stay in their mind.
A lone responder has no incentive to record this information, and no one else is asking the questions that would prompt them to share it. The SRE role is by definition about spending at least half of our time writing things down, whether in playbooks or in actual code. If an SRE isn’t given the incentive to write things down and share knowledge, the person in that role is likely, in practice, just a highly over-burdened sysadmin. That’s a lot of costly, hidden risk for your company to assume.
Impaired decision making and accidents
The mistakes made by engineers during incident response are just that: mistakes. They’re not making a conscious decision to take a risky or inappropriate action; they’re doing the best they can given the information available, their previous training and their current cognitive load. Even the best engineers, like those employed by the major cloud providers, occasionally make mistakes under pressure.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
An engineer who’s been on-call for the last thirteen days with six overnight pages is at a much greater risk of mistyping a command or running the wrong script. This is no fault of their own; it’s just human nature. We know that driving while exhausted is dangerous. The same effect is present in incident response, and we can avoid it here as well.
Gaps in coverage
People simply aren’t available 24/7/365.2422; there will be periods when your single engineer is unavailable for a portion of their on-call rotation. If they’re depended upon for the majority of those responsibilities, the rest of your staff won’t be familiar with your incident response processes. As a result, some incidents will be ignored entirely, and others will be handled by someone who is ill-equipped and unlikely to resolve them promptly.
If everyone’s responsible, no one is responsible
This is a special case of the coverage gap discussed above. Instead of coming from a single overloaded engineer, it shows up on teams where the responsibility for acknowledging and triaging incidents is shared across a group. If you blast out a page to six people, the page will likely end up ignored by the entire group. This is due to the “diffusion of responsibility” phenomenon.
If you’ve taken a CPR class, you may have heard this referred to as the Bystander Effect. It’s widespread enough that it comes up as a discussion point even in everyday emergency medical training. In the world of software, the Bystander Effect occurs because everyone on the team feels safe in their assumption that someone else will take responsibility. The result is that the alert ends up being ignored.
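One common way to counter this is to give every page exactly one owner at any moment, and to escalate only when that person doesn’t acknowledge it. Here’s a minimal Python sketch of that idea. The `Step` class, the `current_owner` function, the responder names and the wait times are all hypothetical, made up for illustration and not taken from any particular paging product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One rung of a hypothetical escalation chain."""
    responder: str
    wait_minutes: int  # how long this responder has to acknowledge


def current_owner(chain, minutes_unacknowledged):
    """Return the single responder who owns the page right now.

    The page starts with the first responder and moves down the chain
    each time a step's wait time passes without an acknowledgment.
    """
    if not chain:
        raise ValueError("an escalation chain needs at least one step")
    elapsed = 0
    for step in chain:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.responder
    # Past the end of the chain: the last responder keeps ownership.
    return chain[-1].responder


# Illustrative chain: names and timings are assumptions.
chain = [
    Step("primary", wait_minutes=5),
    Step("backup", wait_minutes=10),
    Step("engineering-manager", wait_minutes=15),
]
print(current_owner(chain, minutes_unacknowledged=7))
```

The point of the sketch is that the function always returns one name, never a list, so there is never a moment when the page belongs to “everyone”.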
FireHydrant’s three-person rotation
We wouldn’t have any room to speak on this topic if we didn’t practice it in our organization. We started as a three-person team, building a Rails monolith on Kubernetes, with one engineer whose background didn’t include operations experience. We were still able to establish a rotation where each person was the only primary on-call for one week at a time. We did this while maintaining appropriate coverage in incident response.
@bobbytables and I both had significant experience operating Rails and Kubernetes, and we traded off weeks as the backup for Dylan. He could still triage alerts and reports of issues from customers and bring us in as necessary. This allowed us to have a relatively sane life outside of work while maintaining immediate coverage for any incidents that occurred. We also believe this strategy was a win because it naturally increased Dylan’s knowledge of the infrastructure.
We have the additional benefit of me being three hours behind the rest of the organization, providing additional cover for the team if an incident occurs in the first few hours after their normal workday ends. It’s easier for me to just switch contexts and investigate an alert than it is for them to interrupt dinner and dig out their laptops.
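To make the shape of that schedule concrete, here’s a rough Python sketch. It’s an illustration, not the tooling we actually used: the `Shift` class, the `build_rotation` function and the placeholder names are assumptions, as is the rule that an experienced primary is backed up by the other experienced engineer.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    week_start: date
    primary: str
    backup: str


def build_rotation(start, weeks, engineers, experienced):
    """Assign one primary per week, round-robin, plus an experienced backup.

    When the primary is not in `experienced`, the experienced engineers
    take turns as backup. When the primary is experienced, the backup is
    another experienced engineer (an assumption made for this sketch).
    """
    if len(experienced) < 2:
        raise ValueError("need at least two experienced engineers")
    shifts = []
    turn = 0  # counts weeks in which a less experienced engineer is primary
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        candidates = [e for e in experienced if e != primary]
        if primary in experienced:
            backup = candidates[0]
        else:
            backup = candidates[turn % len(candidates)]
            turn += 1
        shifts.append(Shift(start + timedelta(weeks=week), primary, backup))
    return shifts


# Placeholder names; "author" stands in for the writer of this post.
team = ["dylan", "bobbytables", "author"]
for shift in build_rotation(date(2024, 1, 1), 6, team, ["bobbytables", "author"]):
    print(shift.week_start, shift.primary, shift.backup)
```

Each person is primary one week in three, and the less experienced engineer always has someone with operations experience one escalation away.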
Distribute the workload
Your primary on-call engineer doesn’t need to be someone who has deep knowledge of every component. Ideally, anyone who has a working knowledge of your systems should be able to at least triage an incident on a Saturday afternoon. They can determine if it’s something that needs to be addressed immediately, the next day, or next Monday.
Your organization’s availability requirements (SLAs) and customer needs will drive your expectations of response time and escalation policies for your engineers.
Maybe your application is only used between 9 am and 5 pm in a given timezone, and outside of those hours it’s safe to relax your alert acknowledgment requirements to 90 minutes or a few hours. If you’re a credit card processor or maintaining healthcare infrastructure, a five-minute response time may be far too slow, and you’ll likely need a fully staffed NOC 24/7.
It’s unrealistic to expect every team to have enough engineers in different time zones to do a proper “follow the sun” schedule, but we can still take steps to ensure that our engineers are happy and well-rested in an effort to reduce these risks.
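As a small illustration of the first case, a relaxed off-hours policy could be expressed along these lines. This is only a sketch under stated assumptions: the function name, the constants, the five-minute daytime deadline and the choice to treat weekends as off-hours are ours for the example, and the timestamp is assumed to already be in the application’s local timezone.

```python
from datetime import datetime, time

# All of these values are assumptions for the example; tune them to your SLAs.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)
BUSINESS_HOURS_ACK_MINUTES = 5
OFF_HOURS_ACK_MINUTES = 90


def ack_deadline_minutes(local_time: datetime) -> int:
    """Return how many minutes a responder has to acknowledge an alert.

    `local_time` is when the alert fired, expressed in the timezone
    where the application's users work.
    """
    is_weekday = local_time.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = BUSINESS_START <= local_time.time() < BUSINESS_END
    if is_weekday and in_hours:
        return BUSINESS_HOURS_ACK_MINUTES
    return OFF_HOURS_ACK_MINUTES


# A Wednesday morning alert versus a Saturday morning alert.
print(ack_deadline_minutes(datetime(2024, 1, 3, 10, 0)))
print(ack_deadline_minutes(datetime(2024, 1, 6, 10, 0)))
```

Writing the policy down this explicitly, even in a few lines, gives the team something concrete to debate and adjust as availability requirements change.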