Incidents are the friends we made along the way
Hey!
Welcome back to another week of musings. I've had a few hectic weeks, traveling and having guests over.
I think I'm also getting into a rhythm and routine I like, such as walking, journaling, note-taking, etc.
I hope you had a restorative week!
Was this forwarded to you? You can subscribe here!
Things I discovered in the past week
- This is a chill video about taking landscape photos! I like this YouTuber's vibe, in general, so I recommend watching their other videos.
- A Substack post about how to use your blog or newsletter to find your tribe! Very long read, so take your time.
I was having a conversation with someone recently, and they brought up the number of incidents. They said the number was "too high," but when I asked, they didn't have a threshold for what a "good" number of incidents would be. And since software will always have incidents, you can never approach zero anyway.
At my current job, we invest a lot of time (and money) in incidents and their management. Change Advisory Boards (CABs), on-call rotations, bridge calls, incident commanders, root cause analysis (RCAs), post-mortems, remediation tasks, etc.
The conversation was sparked by the CrowdStrike incident and their post-mortem. From the write-up, it's hard to understand how they ended up with the incident, but then again, we don't know the internals of their system.
Coming back to the original conversation: in our immediate system, we suffer from incidents. Still, curiously, leadership asked us for aggregations of "time to recovery" and "time to detect," i.e., the MTTR/MTTD metrics. These are considered "shallow data" for a reason: when we got the data, it gave us no insight into the incidents. They were so unrelated to one another that the metrics were all over the place.
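To make the "shallow data" point concrete, here's a minimal sketch (the incident records and numbers are made up for illustration) of how aggregating MTTR/MTTD can hide more than it reveals when the incidents are unrelated:

```python
from statistics import mean, median

# Hypothetical incident records: minutes to detect (ttd) and minutes to recover (ttr).
# Each incident had completely different contributing factors.
incidents = [
    {"name": "db-connection-exhaustion", "ttd": 4,  "ttr": 35},
    {"name": "cloud-node-restart",       "ttd": 1,  "ttr": 12},
    {"name": "bad-config-deploy",        "ttd": 90, "ttr": 240},
    {"name": "third-party-outage",       "ttd": 15, "ttr": 600},
]

ttds = [i["ttd"] for i in incidents]
ttrs = [i["ttr"] for i in incidents]

# The aggregate numbers leadership asked for...
print(f"MTTD: {mean(ttds):.0f} min, MTTR: {mean(ttrs):.0f} min")

# ...versus the spread, which shows the average describes none of the incidents.
print(f"TTR range: {min(ttrs)}-{max(ttrs)} min, median: {median(ttrs)} min")
```

The averages look like a tidy answer, but they don't correspond to any actual incident, and they say nothing about what to learn or fix.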
Focus on learnings
I'm part of a team that reviews post-mortem documents, and most of the time, people are focused on answering these "shallow data" questions: How long did it take to detect? How long did it take to find the root cause? And so on.
When it comes to learning from the incident, the questions I ask are different: Has the same root cause caused an incident before this one? Were there tickets in the backlog to address it before this incident happened? Had that work already started? How was it prioritized before the incident?
In most cases, people hadn't thought about these types of questions. Generally, teams focus on how much money we lost, or on having "enough" remediation tasks to make it seem like the issue is fixed. One time, we had an incident where the team, along with their leadership, assured me multiple times that it would never happen again, so we should waive the RCA step.
Complex systems have no single root cause
Whenever I have an incident during my on-call shift, the systems are so complex, between our own services, everything else that gets deployed to a host, and the infrastructure running on the cloud, that you can't fully control them. You end up testing assumptions from the host up. Ideally, multiple people are helping out to test different points.
When the moment comes to draft the RCA document, there's no single root cause in practice. The backend service exhausted its database connections, but the database had latency because of the pressure. A random node was restarted by the cloud provider. Every incident ends up being a perfect storm of factors. I prefer the term "contributing factors," as opposed to "root cause"; there's no true "root" in most distributed incidents. I like to joke that the "root cause" is my choosing to work here 8+ years ago (it's me, hi, I'm the problem, it's me).
It's all about people
Reliability work and resilient organizations are all about people. Companies that understand this dedicate time, money, and their people to building the capacity to adapt to surprise.
I'm very new to this field and topic, so I would recommend reading from more expert sources:
- Lorin Hochstein's collection of resources
- The Field Guide to Understanding 'Human Error' by Sidney Dekker
- Adaptive Capacity Labs Blog
Your turn!
Have you ever considered your organization's incidents this way? Have you ever read about resilience engineering or applied some of its lessons to your team? Let me know by replying to this email!
Happy coding!
website | twitter | github | linkedin | mastodon | among others