Why is it hard to keep the lights on?
Hey!
Welcome back to another week of musings.
These past couple of weeks, I've been knee-deep in a migration project that has consumed most of my time, so I haven't had much restorative time. But I do hope that you have!
I've been trying to enjoy the last few weeks of late sunsets before days start to become shorter.
Things I discovered in the past week
- I came across this repository on GitHub about resilience engineering. It's a pretty cool source of information if you want to start reading up on the topic.
- I recently came across a tweet that mentioned Zen in the Art of Writing by Ray Bradbury. I'm always interested in books that discuss how writers think about their craft, so I bought a copy. I'll tell you in a few weeks how it was!
If you work for a company that schedules its work in sprints, you might be using terms like “keep the lights on” (KTLO) or “running the business” (RTB).
We generally use these terms for the day-to-day work needed to keep systems operational: updating dependencies, responding to pages, handling support requests, and monitoring the applications. These tasks often get relegated behind product work (i.e., new features for customers), and new leaders sometimes respond with arbitrary allocations, like declaring that 30% of the work should be KTLO, or creating a dedicated KTLO team.
Why do these strategies tend to fail?
While these strategies have good intentions (e.g., keep these tasks top of mind), I feel they don't fully acknowledge the work's "fluidity."
Upgrading libraries doesn't take the same effort every time. Bumping a utility like lodash is one thing; it's another when the library is 2, 3, or 5 major versions behind and is something foundational like Spring. In other cases, you cannot predict when you're going to get paged or why. And some teams in your org might have an easier time because they only run the newer stack, compared to the team managing the monolith that powers the core business and runs on Perl.
Rethinking the work
Some of these tasks, like upgrading libraries or the versions of other technologies such as your database, nginx, etc., grow with every new service we create, even if the team doesn't grow.
Most of the time, we tackle these tasks by putting in more time or more effort. But this is a case where working smarter is better in the long run. You might have seen tools like OpenRewrite or jscodeshift, which let you modify codebases programmatically. These tools help with upgrading libraries and making the required changes, such as renamed methods, changed parameters, etc. If you pair this with a strong validation/verification process, you can let these changes merge as often as you can tolerate.
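To make the codemod idea a bit more concrete, here's a minimal sketch in Python. To be clear, this is not OpenRewrite or jscodeshift; it's a toy stand-in built on Python's standard `ast` module that shows the same pattern: parse the code, rewrite the tree, print it back. The method names `fetch_all` and `list_all` are made up for illustration.

```python
import ast


class RenameMethodCall(ast.NodeTransformer):
    """Rename every `<something>.old_name(...)` call to `<something>.new_name(...)`."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name
        self.changes = 0

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # visit children first so nested calls are rewritten too
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == self.old_name:
            func.attr = self.new_name
            self.changes += 1
        return node


def apply_codemod(source: str, old_name: str, new_name: str) -> tuple[str, int]:
    """Return the rewritten source and the number of call sites changed."""
    tree = ast.parse(source)
    transformer = RenameMethodCall(old_name, new_name)
    new_tree = transformer.visit(tree)
    return ast.unparse(new_tree), transformer.changes


# Hypothetical "before" code that still uses a deprecated method name.
before = "result = client.fetch_all(limit=10)\nprint(other.fetch_all())\n"
after, count = apply_codemod(before, "fetch_all", "list_all")

# Cheapest possible validation step: the rewritten code must still parse.
compile(after, "<codemod-output>", "exec")
print(f"rewrote {count} call site(s)")
print(after)
```

One caveat: `ast.unparse` throws away comments and the original formatting, which is exactly the kind of thing the real tools are built to preserve. The `compile` call is also only a placeholder for the validation step; in practice that would be your test suite and CI.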
Rethinking expectations
Other tasks, like upgrading databases, queue systems, etc., might not be as easy to rethink as library upgrades.
But I've noticed that dev teams, especially in on-prem deployments, tend not to code defensively against the database (DB) being unavailable. This creates a need for "maintenance windows" because none of the services can tolerate the DB being down. When you're in this situation, you need to start reshaping the culture and expectations, purposefully lowering the availability so that teams start recognizing these events, solving for them, and taking them into account from now on.
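What could "coding defensively" look like? One common option is retrying with exponential backoff and jitter, so a short DB restart turns into a slow request instead of an outage. Here's a minimal sketch of that idea. `DatabaseUnavailable` and `with_retries` are hypothetical names I made up for this example; in real code you'd catch whatever connection error your driver actually raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DatabaseUnavailable(Exception):
    """Hypothetical error standing in for your driver's connection failure."""


def with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying with backoff while the DB is unavailable."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DatabaseUnavailable:
            if attempt == attempts:
                raise  # out of attempts: let the caller degrade or fail loudly
            # Exponential backoff capped at max_delay, with full jitter so
            # every service doesn't hammer the DB at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Simulated query that fails twice (say, during a DB restart) and then succeeds.
state = {"calls": 0}


def flaky_query() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise DatabaseUnavailable("connection refused")
    return "ok"


# Sleeping is stubbed out here so the example runs instantly.
result = with_retries(flaky_query, sleep=lambda _seconds: None)
print(result, "after", state["calls"], "calls")
```

Retries are only safe for operations that can be repeated without side effects, so this is a starting point and not a blanket fix. Queuing writes or serving cached reads are other ways to ride out a short DB outage.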
Reducing the need for maintenance windows allows your database operations team to run upgrades as soon as they are released, tested, and verified. This matters even more if you're on a journey to cloud-native technologies: cloud services come with SLAs that you have to work within while still building highly available solutions for your clients.
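If you've never turned an SLA into actual minutes, it's a useful exercise. Here's a small sketch of the arithmetic. The percentages are illustrative and not taken from any provider's real SLA, and the combined figure assumes your dependencies fail independently, which is a simplification.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` at the given availability."""
    return (1 - availability_pct / 100) * days * 24 * 60


def serial_availability_pct(*dependency_pcts: float) -> float:
    """Availability of a request path that needs every dependency to be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    combined = 1.0
    for pct in dependency_pcts:
        combined *= pct / 100
    return combined * 100


# Illustrative numbers only: a service that needs both a database and a queue.
db_pct, queue_pct = 99.9, 99.95
path_pct = serial_availability_pct(db_pct, queue_pct)

print(f"DB alone: {downtime_budget_minutes(db_pct):.1f} minutes per 30 days")
print(f"DB + queue: {path_pct:.3f}% available")
print(f"DB + queue: {downtime_budget_minutes(path_pct):.1f} minutes per 30 days")
```

A 99.9% target over 30 days works out to about 43.2 minutes of downtime, and every hard dependency you add in series shrinks what your own service can promise. That's why services need to tolerate a dependency being briefly unavailable instead of assuming it's always there.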
Your turn!
Have you noticed how your team or company prioritizes and handles these KTLO tasks? Maybe your team is on top of them, and they've managed to keep everything under control. Or perhaps you do it "just in time" or when a vulnerability comes along. Let me know your thoughts by replying to this email!
Happy coding!