Not unlike the stumbles experienced in all companies — catastrophic weather, supply chain disruptions and building security breaches — IT failures happen everywhere and to virtually every enterprise.
Engineers know and understand that systems fail on a regular basis.
“When systems fail — and they always do — we need to address them immediately in order to minimize customer impact,” said Miles O’Connell, senior director of infrastructure engineering at Sprout Social.
O’Connell told Built In Chicago that having his entire team rotate into and out of troubleshooting roles, and having each engineer share what they learn from those rotations, enables an informed and proactive approach to preventing problems in the first place. He added that spending too much time in these firefighting roles can lead to burnout, but treating tech failures as a chance to learn and improve the system gives the job a broader role and a higher place of importance.
Sprout Social is a global leader in social media management and analytics software.
Describe your team's current approach to firefighting. How does a consistent rotation enable your team members to balance their time between projects and on-call duties?
An on-call rotation has two purposes, and it can be illuminating to separate them. The first is to designate an individual on the team to field and handle “little fires” like support requests, bugs and so on. This person will be interrupted frequently, so having a rotation allows us to limit the impact of context-switching on the other engineers on the team, while ensuring the team is meeting its responsibilities.
The other purpose is a little more exciting. Here, the responsibility of the on-call engineer isn’t to ride alone into battle. Instead, the rotation allows us to count on someone to monitor for failures and alert the team, especially during off-hours.
But there’s a lot to be learned from these failures, and they can be critical to our customers, so we exercise a shared responsibility model where the whole team will pitch in to fight our fires.
As a leader, how do you keep a pulse on how engineers are handling on-call duties? Do you engage in frequent check-ins to identify possible burnout, and if so, what steps do you take to remedy the situation?
Avoiding on-call burnout starts with your engineering culture. No matter how inevitable — or rote — on-call duties seem to be, you should never allow these failures and interruptions to exist long-term. That means treating every failure as a chance to improve your systems and processes, and every support request as a chance to improve your documentation. Most importantly, it means actually giving your teams the time, resources and encouragement to go out and improve those things.
In my experience, on-call burnout happens either because there are so many off-hours fires that it feels like working a second job, or because you dread spending a week solving the same problems over and over again. When the on-call engineer is engaged, it should result in some takeaways, such as a system to improve or a process to fix.
At Sprout, our engineers care about the success of their teams and are motivated by solving complex problems. When we make on-call a chance to pursue those things, rather than a series of the same operations they’ve done a thousand times, it gets a lot less tiresome.
What are some of the biggest benefits you've seen from having a consistent firefighting rotation?
When you put everything above together, you gain some extra benefits. Solving new and surprising problems in your systems is one of the best ways to learn how they work, and it often leads to identifying opportunities for improvement. It exposes engineers to parts of the system that they might not have touched, and it’s one of the best ways I’ve found to ensure folks level up their technical skills.
So much of engineering is about avoiding failure conditions. Designing a system to scale really means “designing a system to not fail when run at scale.” Whenever we can, we try to share stories of possible failure modes to help everyone learn without learning the hard way — but sometimes there’s no replacement for spending a bit of time on the front lines, watching things we thought were solid burn down and sitting down with our team to figure it out.
Because of this, being on call is a key ingredient for growing technical leaders — one you lose if, by eschewing a rotation, you find the same people handling every fire, while the rest of your engineers focus on other things.