Release like no one is watching

                June 16, 2025


            Hey!
Welcome back to another week of musings.
I hope you had a restorative weekend. My wife and I returned some items, and we can finally close the moving chapter for this year.
This week we have a US holiday on Thursday, so it will be an odd week for me.
Was this forwarded to you? You can subscribe here!

Things I discovered in the past week

How to build a team that can “take a punch”: A playbook for building resilient, high-performing teams | Hilary Gridley (Head of Core Product, Whoop). This is another interesting episode from Lenny's Podcast! Worth a listen.
To accompany this issue: GCP had an outage last week, which you may have felt if your organization deploys there. You can now read the incident report (and you should).

If your company has ever implemented a Change Advisory Board (CAB) following an incident, you, like me, may have experienced stringent requirements and certification processes that delayed projects, fixes, and other initiatives.
What is the Change Board?
The Change Advisory Board (CAB) is a group of individuals who review all changes intended for application to the production environment. 
Suppose you have a change for your application. You create a ticket, document the change, its risks, and any other necessary information, and then attend the meeting to discuss when to schedule it, the execution window, and other relevant details. The board might approve the change or request more information before approving it. You may also have a CAB if your company handles IT work for another company (think outsourcing) and needs to communicate changes to its clients constantly.
Where does it go wrong?
People will always have a love-hate relationship with CABs: in some cases they add "overhead" to releases, and in others all that tracking prevents issues.
However, the CAB often goes wrong when people outside of day-to-day operations are involved; they are so far removed from actual work that the requirements become either too vague or too stringent, making it impossible to comply. In other cases, the CAB's rulings might be reactive, meaning that rules stem from a response to incidents rather than from an analysis of actual operating standards and procedures.
One other thing I've seen is a CAB that wants absolute control over changes and artifacts: it wants to be the only group with access to break-glass processes, or it forces rollbacks without having the domain knowledge to judge them.
What are the good parts?
The good parts of change control are generally things like:

Automate all the things (possible)!
All changes are documented and accounted for
Rollback plans exist (and are automated if possible)
A second pair of eyes validates the change
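
As a rough illustration of the list above, some of these checks can be encoded so that a pipeline, rather than a meeting, verifies them. This is only a sketch under stated assumptions: `ChangeRecord`, its fields, and `missing_requirements` are hypothetical names made up for this example, not part of any real change-management tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical record describing one production change."""
    summary: str
    author: str
    risks: str = ""
    rollback_plan: str = ""
    reviewers: list[str] = field(default_factory=list)


def missing_requirements(change: ChangeRecord) -> list[str]:
    """Return the change-control checks this record does not yet satisfy."""
    problems = []
    # The change and its risks are documented and accounted for.
    if not change.summary.strip() or not change.risks.strip():
        problems.append("change and risks must be documented")
    # A rollback plan exists.
    if not change.rollback_plan.strip():
        problems.append("rollback plan is missing")
    # A second pair of eyes means someone other than the author.
    if not any(reviewer != change.author for reviewer in change.reviewers):
        problems.append("needs a reviewer other than the author")
    return problems
```

An empty list means the record passes all three checks; anything else can be shown to the author right away instead of waiting for the next CAB meeting.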

But most of these things won't matter much if your organization hasn't built resilience. It won't matter how much you document a change or how many eyes you have on top of a process if (almost) every change produces some incident.
If that is the case, your organization is not built to adapt to surprise. If you're interested in resilience engineering, I recommend reading this repository.
Socio-Technical Approach
One of the primary aspects of resilience in your organization revolves around understanding the relationship between people and the technology they use. Probably the most famous paper on this topic is "How Complex Systems Fail" by Richard Cook.
You cannot solve problems by focusing solely on the technology. As we progress to staff level and above, many of the issues we solve come from the people rather than the technology itself. While it's interesting to build a tool that summarizes incidents from Slack channels using AI, if people don't learn from the incident and generate new insight, faster summaries won't provide any benefit.
Learning is Action
As teammates and staff engineers, our role is to ensure that teams learn from incidents and develop new ways to deal with surprises.
While we can create a myriad of dashboards to slice and dice data, if the team doesn't use them during incidents, we need to understand why and consider shifting our approach. This is why I have lately preferred using traces to understand and monitor running services. We can add as much data as needed to create the right signals, which we continually improve over time with each release.
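
To make the idea of adding data to traces a bit more concrete, here is a minimal sketch using only the Python standard library. It is a stand-in for a real tracing library, not the API of any one of them: `span`, `FINISHED_SPANS`, `handle_checkout`, and the attribute names are all hypothetical and chosen only for illustration.

```python
import time
from contextlib import contextmanager

# Finished spans collected in memory; a real tracer would export them instead.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Record one unit of work with arbitrary key/value attributes."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    start = time.perf_counter()
    try:
        # The caller may add more attributes while the work is running.
        yield record["attributes"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(record)


def handle_checkout(cart_id, items):
    # Attribute names and values here are illustrative only.
    with span("checkout", release="example-release", cart_id=cart_id) as attrs:
        attrs["item_count"] = len(items)
        return sum(items)
```

Each release can attach whatever new attribute the last incident showed was missing, which is how the signal improves over time without building another dashboard.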
Your turn!
Does your organization have a CAB? Are you rushing to production without oversight? How has this worked for you and your teams? Let me know by replying to this email!
Happy coding!


Don't miss what's next. Subscribe to Oscar Funes: