When Fires Get In The Way
Hey!
Welcome back to another week of musings.
My wife came back from a quick work trip to Chicago, so we spent the weekend resting! I hope you had a great weekend and managed to recover for the week ahead.
Things I enjoyed in the past week
- Introducing Strands Agents, an Open Source AI Agents SDK. This Amazon Strands SDK is interesting because it leans into a model-driven approach rather than requiring developers to hard-code every step of the workflow.
- Elephants, Goldfish, and the New Golden Age of Software Engineering. A good reflection on how software engineering is changing with AI, and the kind of skills that might become more important as the tools keep improving.
Lately, I've been thinking more about my usual weekly tasks and goals, and about how often I tend to "jump on a grenade".
As part of that, I've realized I tend to jump from fire to fire regularly. Enough that I now notice myself doing it. And, sadly, in a slow week, I sometimes find myself looking for these fires to feel productive.
The system is never really fixed
In a technology company, incidents are unavoidable.
Complex systems run in a partially broken state all the time, as described in How Complex Systems Fail. There is no final state where the system is fixed, because everything around it keeps changing: the company, the infrastructure, the services, and so on.
If I'm exhausted by these fires, but there is no state where there are "no more fires," what do I do?
We keep treating capacity as optional
One thing that feels more obvious in the current state of tech is that many companies want to do more with fewer humans (especially if you believe all the Bay Area billboards).
After multiple waves of layoffs across the industry, many teams are being asked to maintain their systems with fewer and fewer people. These companies are not investing enough in adaptive capacity: the ability of a system, team, or organization to adjust when reality does not match the plan.
This capacity comes from having enough people who understand the system. Without them, we still adapt, but sadly the adaptation depends on the few people who know how to navigate the system.
Automation does not remove the need for operators
This is also why I keep coming back to The Ironies of Automation.
I recommend reading the paper, but in summary: the more we automate, the more we "deskill" the operators, leaving them with less and less understanding of the system when they have to recover it after a failure. Then, when the incident goes poorly, we blame the operator.
What can I actually do?
Fixing hiring plans is above my pay grade, but I can advocate for more hires.
But still, what can I do in the present?
Sometimes, I am expected to jump into the fire because I am the highest-level IC present, or because the room needs someone to be "the adult in the room." I get that part.
But in many cases, I work with teams to shape a plan, break down the work, and help execute. This is a useful role to fulfill, but it doesn't require unique skills, and others can learn to do it well.
Make more people capable of holding the room
I think at this point, the best outcome is that more people become capable of leading through incidents.
People do not become trusted incident leaders just because we put their name in a rotation. They become trusted because they have safe ways to enter incident response teams, understand how decisions are made, and are set up for success.
The hard part is that this takes time, and time is exactly what understaffed teams rarely have.
Where agents might help
I've also been thinking about bringing agents into this workflow.
There are places where they can help. Searching wikis. Finding runbooks. Summarizing related incidents. Pulling dashboards together. Looking for ownership. Drafting a timeline. Turning chat noise into a cleaner status update.
All of that is useful, especially during the early minutes of an incident when everyone is trying to orient. But I don't think agents replace the human work of incident command.
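To make the "drafting a timeline" item a bit more concrete, here is a minimal sketch of the kind of chore I would happily hand off. It is plain Python with no agent framework or model call involved, and everything in it is an assumption for illustration: the `ChatMessage` shape, the `SIGNAL_KEYWORDS` list, and the `draft_timeline` helper are all hypothetical, and a real chat export will look different.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChatMessage:
    """Hypothetical shape of one chat message; real exports will differ."""
    sent_at: datetime
    author: str
    text: str


# Assumed phrases that mark a message as worth keeping in the timeline.
# This list is made up for the example and would need tuning per team.
SIGNAL_KEYWORDS = ("deployed", "rolled back", "mitigated", "paged", "resolved")


def draft_timeline(messages: list[ChatMessage]) -> list[str]:
    """Return a chronological draft timeline built from noisy chat messages."""
    # Keep only messages that mention one of the signal phrases.
    relevant = [
        m for m in messages
        if any(keyword in m.text.lower() for keyword in SIGNAL_KEYWORDS)
    ]
    # Chat exports are not always ordered, so sort by timestamp.
    relevant.sort(key=lambda m: m.sent_at)
    return [f"{m.sent_at:%H:%M} {m.author}: {m.text}" for m in relevant]


if __name__ == "__main__":
    chat = [
        ChatMessage(datetime(2025, 1, 1, 10, 5), "ana", "Rolled back the deploy"),
        ChatMessage(datetime(2025, 1, 1, 10, 1), "bo", "anyone else seeing errors?"),
        ChatMessage(datetime(2025, 1, 1, 10, 2), "cy", "Paged the database on-call"),
    ]
    for line in draft_timeline(chat):
        print(line)
```

The output is only a draft. A human still decides what the timeline means and what to do next, which is exactly the part I don't expect agents to take over.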
I don't have a clean conclusion
I don't think I have a clean conclusion for my current state of the world.
I can advocate for more hiring. I can invest in people so more teammates can lead incidents. I can use agents and automation to make context easier to find.
In practice, some of these are hard and will take time, especially with fewer teammates around to help or to train.
Your turn!
Do you find yourself jumping from fire to fire at work? How does your team build enough capacity so the same people are not always the ones holding everything together? Let me know by replying to this email!
Happy coding!