Crises Shape Us
Hey!
Welcome back to another week of musings.
I'm finally recovered from the race, and we're observing Memorial Day here in the US.
I hope you had a great weekend, and managed to rest for the week ahead.
Was this forwarded to you? You can subscribe here!
Things I enjoyed in the past week
- The death of good taste. I came across this video from someone in the fashion industry who was intrigued by the idea that Silicon Valley is obsessed with "good taste". I enjoyed seeing the two worlds intersect.
- Scope Is The Steering Wheel is a very short post by Kent Beck (of XP fame, among other things) that reminds us why we talk about good, fast, cheap, and why it still matters.
I started reading Crisis Engineering: Time-Tested Tools for Turning Chaos into Clarity*. I am still early in the book, but it has already given me a useful frame for thinking about the incidents I have been part of this year.
Throughout my tenure at my current job, I have been involved in multiple incidents and crises.
Some were releases gone wrong. Some were multi-day incidents that required tight coordination across teams. Some ended with remediation plans, follow-up meetings, and the familiar promise that we would make sure this same failure never happened again.
But lately, I have been thinking that "never again" is not really a plan. The real work is asking what the incident revealed about the system, and then deciding whether we are willing and able to redesign it around that new understanding.
Incidents reveal the system
One idea I keep coming back to is that incidents do not just break the system. They reveal its actual inner workings.
They show how the system really works, not how the diagrams, org charts, or runbooks say it works. They show which services depend on a person who left six months ago. They show which renewal date nobody owns anymore.
I was involved in an incident in which a system stopped working at a very specific time. At first, it looked random. Something had failed, but there was no obvious code change, no recent deployment, and no clear owner in the room.
After hours of triage, the issue turned out to be an expired license.
The failure was that no one was clearly accountable for noticing that this critical renewal date was approaching. The incident revealed a real part of the system: not the software system, but the socio-technical system around it.
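To make that concrete for myself, here is a minimal sketch of what "someone is accountable for the renewal date" could look like if it were written down as data rather than held in someone's head. Everything in it is an assumption for illustration: the `Renewal` record, the `renewals_needing_attention` function, the 30-day warning window, and the example entries are all hypothetical, not how any system I work on tracks this.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class Renewal:
    """One thing that expires: a license, a certificate, a domain (hypothetical record)."""
    name: str
    expires_on: date
    owner: Optional[str]  # None means nobody is clearly accountable


def renewals_needing_attention(
    renewals: List[Renewal],
    today: date,
    warn_within_days: int = 30,  # assumed warning window, pick your own
) -> List[Tuple[Renewal, str]]:
    """Return (renewal, reason) pairs for entries with no owner or expiring soon."""
    cutoff = today + timedelta(days=warn_within_days)
    flagged = []
    for renewal in renewals:
        if renewal.owner is None:
            # A missing owner is flagged regardless of the date.
            flagged.append((renewal, "no owner"))
        elif renewal.expires_on <= cutoff:
            flagged.append((renewal, "expires soon"))
    return flagged


# Hypothetical inventory, for illustration only.
EXAMPLE_INVENTORY = [
    Renewal("example-license", date(2025, 6, 1), None),
    Renewal("example-tls-cert", date(2025, 6, 10), "platform-team"),
    Renewal("example-domain", date(2026, 1, 1), "web-team"),
]
```

The choice I care about here is that a missing owner gets flagged even when the expiry date is far away. The date was never the hard part in my story. The hard part was that nobody knew it was theirs to watch.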
Complexity keeps accumulating
Most systems tend toward complexity over time.
Sometimes that complexity comes from success. The product grows, the number of customers grows, new features get added, new integrations appear, and the system has to support more use cases than it was originally designed for.
Sometimes the complexity comes from the organization around the system. Hiring freezes. Layoffs. A team that used to own a broad area now owns a smaller one. Another team assumes the first team still has the old scope. Documentation exists, but it describes a version of the organization that no longer exists.
During an incident, we usually discover the mismatch between documentation and the real system at the worst possible time.
Ownership is part of the architecture
One pattern I have seen more than once is an incident getting prolonged because the ownership model is unclear.
After enough re-orgs, each team may have reduced its scope in a locally reasonable way. But if those scope changes were not made explicit across the organization, everyone can end up assuming someone else still owns the thing (whatever that thing may be).
Then a larger project or incident appears and requires coordination across those boundaries.
At that point, everyone is waiting for someone else to clarify the ask. Or each team starts solving only the part in front of them. Or people disappear into side channels to fix their area and forget to share what they are learning.
Most of the time, people are trying to help. But the incident reveals that the organization no longer has a shared model of ownership.
And ownership is not just a management detail. The system includes the people who operate it, the teams that maintain it, the incentives that shape decisions, and the communication paths that exist under pressure.
"Never again" requires capacity
I think this is where the idea of crisis engineering becomes interesting to me.
The goal is not only to respond to the incident and get back to normal. The goal is to learn enough from the crisis to redesign the system, or at least improve the parts that made the crisis harder than it needed to be.
It is also easier said than done.
In many companies, teams are operating with fewer people than before. Hiring is frozen. Teams have been reduced. The people who remain are carrying more context, more systems, and more interrupts.
So after an incident, we ask for the right things to be fixed:
- We need clearer ownership.
- We need better runbooks.
- We need better monitoring.
- We need to reduce complexity.
- We need to prevent this class of failure.
But then the same people who were burned out by the incident have to go back to their roadmap, their on-call rotation, their project deadlines, and the next urgent escalation.
Learning from incidents requires capacity. Redesign requires capacity. Simplifying a system requires capacity.
Without that capacity, "never again" becomes a remediation ticket that sits in a backlog until the next incident gives it urgency again.
The work after the incident is the real test
During an incident, the organization usually knows how to create a sense of urgency.
People join the bridge. Leaders ask for updates. Teams focus. Decisions get made faster than usual.
The harder test comes once the bridge closes and that urgency fades. Do we actually make time to understand what the incident revealed? Do we ask why the system made this failure possible? Do we ask why detection took so long? Do we ask why ownership was unclear? Do we ask why the people in the room lacked the context they needed?
When the same themes keep appearing, the incident is probably pointing at something deeper. Stale ownership. Unclear communication. Brittle dependencies. Too much work concentrated in too few people. A system that only works because a few humans are constantly adapting around it.
What I am trying to do differently
Sadly, I don't think I have a clean answer here.
I still work within the same constraints as everyone else. I cannot magically create more capacity. I cannot undo every re-org. I cannot keep every piece of documentation up to date.
Still, we can ask the expected questions, clarify ownership, and do a lot of this work before an incident occurs. Even if we're not fully ready, we should strive to be in a better position.
Your turn!
Have you been part of an incident that revealed how the organization actually worked, not just how the technical system failed? What did your team do afterward, and did the lesson stick? Let me know by replying to this email!
Happy coding!