Hi. I’m Rich Burroughs, and I’ve just joined FireHydrant as a Senior Developer Advocate. I want to tell you a bit about myself and why I’m excited to be here at FireHydrant. As I’m writing this introduction, though, we’re in the midst of the coronavirus pandemic, which is something I can’t ignore. So I also want to pass on some thoughts about operating production systems during this difficult time.
I started working as a sysadmin in the mid-1990s. My first role was at an Internet Service Provider, where I managed about ten Linux hosts. There was no monitoring or alerting when I arrived, so I set that up. Then my boss bought me a pager and told me that I was on-call all the time. It was a rotation of one. I probably thought that was pretty cool at the time, actually, to have the extra responsibility. While on-call there, I responded to issues with our servers and also some for co-located customers. I remember vividly spending one Thanksgiving on the floor by a server rack, rebuilding our primary DNS server.
After a few years at the ISP, I moved on. I participated in on-call rotations over the years at other companies, and those rotations ranged from relatively easy and manageable to extremely painful. Sometimes the problem was that the team got too small and I was on-call too much. Other times it was the schedule. I had one boss who decided that instead of weekly rotations, he’d assign people times of the day to be on-call. My shift was every evening from 7 PM to midnight. And sometimes the problem was software quality or scaling issues. Sleep deprivation sets in pretty quickly when you’re getting woken up several nights in a row because the system is unstable.
A few years ago, I moved out of Operations and into other kinds of roles. I came to realize that I have a passion for helping people that build and operate production services. Being concerned about mental health is a big part of that. I was diagnosed with Generalized Anxiety Disorder a few years ago. While I’m sure that being on-call for a couple of decades isn’t the only reason for my anxiety, I do believe it has contributed to it. So, the opportunity to work at FireHydrant and help people make their on-call experiences better seemed like a perfect fit. I’m delighted to have joined the team.
Right now, we’re all in an unprecedented situation, at least in my lifetime. People are trying to keep production systems running during a global pandemic. Folks are on-call while they’re surrounded by dire news and chaos. Unfortunately, that’s likely to continue for some time. I feel for people who are in that situation, and I want to pass on a few things I think on-call teams should be keeping in mind in this extremely challenging time.
First, it’s time to think about business continuity, in terms of your team. While we all hope none of our teammates are infected with COVID-19, the reality is that a lot of people will be. You should expect some team members to be unavailable for periods. One immediate thing to consider is where the single points of failure are on your team. Do you share the credentials people need, either with a password manager like 1Password or LastPass, or through some other mechanism? Do people have the accounts and permissions to access the things they need? Do they know how to reach the vendors you all depend on? Do people know how to restore backups? If you’re the only person on your team who knows how to do something, now is the time to write that process or procedure down somewhere visible. It’s time to break down some silos.
It’s also important to consider that people are likely to be dealing with lots of stress. Some are working from home for the first time too. Remember how holiday on-call schedules can be challenging to manage because so many people are on PTO or traveling? The way we make holiday on-call work is by being flexible, and that’s the mode to be in now. An always-on system requires an always flexible team, even under normal circumstances. The best teams I’ve been on always made me feel like people had my back. If I needed someone to cover on-call for me for a bit, they would, and I would cover for them when they needed it. If we saw that a teammate was having a rough week and was sleep-deprived, one of us would tag them out so they could get some rest. That give and take, working together and supporting each other, is critical. If you see someone struggling, don’t wait for them to ask for help.
Community is a basic human need, and many people will be feeling scared and isolated, so finding ways to bring people together is important too. Our team here at FireHydrant got together on Zoom recently and played some quiz games. It was a lot of fun, and it gave us all some much-needed laughter. I’ve also heard of teams and individuals leaving video calls running where people could join and talk in an ad hoc way. I’ve joined some calls like this with friends in the Kubernetes community, and it’s helped me a lot. At my last job, we used a Slack app called Donut that allows people to opt in to 1:1 chats with coworkers. Anything you can do along these lines is helpful, but also give space to people who aren’t up to participating. We all have different situations and needs.
Last, I want to pass on some advice from someone I admire a lot, Dr. Nicole Forsgren, from GitHub. Nicole tweeted this:
This advice resonates with me a lot, and I know Nicole has the receipts from her research to back it up. Earlier in my career, I worked in a shop that did quarterly releases, and they were brutal to deploy and debug. Added change management will cause changes to slow down and batch up into larger changesets. I can understand the instinct to slow down and add process when there is so much chaos around us, but it’s likely to cause you more instability, not less. Keep doing what’s been working for you.
If you’re on-call right now, I feel for you, and I hope these suggestions help. In this post, I’ve used words like unprecedented, difficult, and challenging to describe our current situation, but honestly, I don’t have words that can even express what I’ve been feeling. This is a situation unlike anything I’ve encountered in my career or lifetime. None of us know what the future will bring, but I do believe that we’re stronger if we stick together. It’s more important now than ever that we do what we can to take care of each other.