Root Cause Analysis
Hey!
Welcome back to another week of musings. Thank you for opening this email!
We went to the beach this weekend and watched the sunset over the Golden Gate Bridge. I hope you had a restorative weekend as well.
Was this forwarded to you? You can subscribe here!
Things I discovered in the past week
- How to create software quality, from Lethain, an interesting view of what's required to build it.
- Document your product and software architecture decisions by Indu Alagarsamy, whom I've been following due to her content on architecture modernization!
Over the last few weeks, I've had to meet with some vendors to discuss incident metrics. As part of that exercise, I thoroughly reviewed the RCA (root cause analysis) documents, which have now been submitted for review and approval.
No Two Incidents Are The Same
The first thing that stands out is that time to discovery and time to recovery are all over the place. There's no relationship between incidents in the way we might expect to find one.
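To see concretely why aggregates over those times tell you so little, here's a minimal Python sketch. The recovery times are made up for illustration (not from the RCA documents above); the point is that a couple of long-tail incidents skew the mean far away from the "typical" incident.

```python
# Why aggregate incident metrics can mislead: a couple of long
# incidents dominate the mean, while the median stays close to
# the typical case. Numbers below are purely illustrative.
from statistics import mean, median

time_to_recovery_min = [12, 15, 9, 240, 18, 7, 480, 11]

avg = mean(time_to_recovery_min)    # pulled upward by the 240 and 480
med = median(time_to_recovery_min)  # closer to a "typical" incident

print(f"mean:   {avg:.1f} min")   # 99.0
print(f"median: {med:.1f} min")   # 13.5
```

A single "MTTR" number would report ~99 minutes here, even though most incidents resolved in under 20; looking at the distribution (or at least median and percentiles) is usually more honest.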
In distributed systems, while "database issue" might be a common label, it might not be the same database (especially with microservices), and if you're on a cloud computing vendor, it might not even be the same region or availability zone.
Continuing the database issue example, sometimes that issue might start a chain reaction, and other times it might be a side effect of another incident.
So, in distributed systems, you only get contributing factors that might mix and match in different ways each time.
Reliability and Resilience
If you've ever watched Dr Richard Cook's talk on "How Complex Systems Fail" (which I highly recommend you do!), you'll understand that complex systems are almost always operating in a state of degradation.
So it's more surprising that these systems work at all than that incidents happen!
It also highlights how, even when designing a system for reliability, we need to consider the systems in execution to make them resilient.
Human Operators
As with any other complex system, humans interact with technology in some way to make the whole work.
In that sense, humans are part of the system, and as such, they fulfill two main roles: they "operate" the system to produce output and are also in charge of preventing incidents.
Humans are adaptable to new contexts and constraints. We make decisions in the moment to reallocate resources, prevent an incident, or keep one from growing larger.
This also means that expertise is one of the main assets we bring to distributed systems, but we also need to pass it on to the new people who join the system.
Systems Thinking
One thing that comes into play here is that "systems thinking" is required to operate in this environment.
At scale, the system is defined more by the interactions between its parts than by the individual parts themselves.
Safety
A very interesting topic that came up was "safety engineering." In this whole complex-systems dynamic, safety had never crossed my mind, since I thought of software as "soft," but in reality safety applies here in a much wider sense.
I would recommend listening to the Pre-Accident Investigations Podcast, with Todd Conklin interviewing Ryan Kitchens (at the time working for Netflix): part one and part two.
Your turn!
Have you ever considered incidents at your organization from a complex system perspective? With safety, resilience, and reliability in mind? Let me know by replying to this email!
Happy coding!
website | twitter | github | linkedin | mastodon | among others