Monorepos and Forced Migrations
By most metrics, Google is a great place to work. Top-tier compensation, free food, and co-workers who largely want to do quality engineering. And yet, many Google engineers burn out. Putting aside problems with bad managers, or ethical objections—both of which are more important than what’s in this article—I talk to many Google engineers whose heart simply isn’t in their work. On top of that, there’s a lot of friction to doing anything at Google. Few tasks are simple. Many engineers, myself included, often feel like their most creative and productive years are being wasted jumping over needless hurdles, all to optimize consumer behavior to boost ad revenue.
Let’s zoom in on one particular component of friction that affects me regularly: how software migrations work at Google. Most open-source software works in terms of versioned releases. New releases add features, and major releases can be backwards-incompatible, but for the most part clients are never forced to upgrade; they can stay on an old version, sometimes at their own risk. And while very old things will eventually lose support, the most laudable software projects take decades to get there. For example, most of the Java API is backwards compatible with its earliest versions. This allows me to run programs I wrote in 2006 on a modern system without changing a line of code.
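To make the contrast concrete, here’s a minimal sketch of the “caret” constraint used by semver-based package managers (the version numbers are made up, and I’m ignoring semver’s special-casing of 0.x releases). The point is that the client, not the library author, decides when to take a breaking upgrade:

```python
# A sketch of semver's "caret" constraint, as used by many open-source
# package managers. The client declares "^2.3.1" and automatically accepts
# compatible releases, but a breaking 3.x release requires opting in.

def parse(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def satisfies_caret(version: str, constraint: str) -> bool:
    """True if `version` falls in the caret range `constraint`."""
    base = parse(constraint.lstrip("^"))
    v = parse(version)
    # Same major version (no breaking changes), and no older than the base.
    return v[0] == base[0] and v >= base

assert satisfies_caret("2.9.0", "^2.3.1")      # compatible upgrade: accepted
assert not satisfies_caret("3.0.0", "^2.3.1")  # breaking release: opt-in only
```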
When software spans companies, some degree of backwards compatibility is necessary. You can’t change someone else’s code for them, and even if you could, at scale it wouldn’t be feasible or worthwhile. At Google, it’s the opposite, and this is the source of much friction.
As is well-known, Google has a massive monorepo, and uses Perforce with extensive customization to support it. The recently-made-free “SWE book” (pdf) describes the reasoning for adopting this system. The Google monorepo comes with a number of policies and processes to make working with it feasible. One such policy is called “One Version,” defined as follows.
For every dependency in our repository, there must be only one version of that dependency to choose. (p. 341)
Here’s how they cheer for it:
We highly endorse the One-Version Rule presented here: developers within an organization must not have a choice where to commit, or which version of an existing component to depend upon. There are few policies we’re aware of that can have such an impact on the organization: although it might be annoying for individual developers, in the aggregate, the end result is far better. (p. 349)
They often refer to this as “trunk-based” development, borrowing a term from older version control systems in which the main branch was called “trunk.” This is in contrast to having very long-lived feature branches that cause chaos when merging them back into the main branch.
Motivations aside, the One Version rule has little to do with (the good practice of) having small, short-lived feature branches. It goes much further. No software library can ever have two versions, so the second an API changes, all clients must migrate. The person making the new API is expected to change client code to match, but is not responsible for ensuring the change does not break the client.
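Here’s a toy sketch of what enforcing the rule amounts to; the build targets, dependency names, and checker are all hypothetical stand-ins, not Google’s actual build system:

```python
# A toy illustration of the One-Version Rule (all names hypothetical).
# Every dependency may appear at exactly one version anywhere in the
# repository; introducing a second version is an error, so the only way
# to change an API is to migrate every client at once.

from collections import defaultdict

build_graph = {
    "//ads/frontend": [("protobuf", "3.21")],
    "//maps/backend": [("protobuf", "3.21")],
    # Uncommenting this line would violate One Version for the whole repo:
    # "//legacy/tool": [("protobuf", "2.6")],
}

def check_one_version(graph):
    versions_seen = defaultdict(set)
    for target, deps in graph.items():
        for dep, version in deps:
            versions_seen[dep].add(version)
    for dep, versions in versions_seen.items():
        if len(versions) > 1:
            raise ValueError(f"One-Version violation: {dep} at {sorted(versions)}")

check_one_version(build_graph)  # passes; fails the moment anyone forks a version
```

Contrast this with a semver world: here there is no constraint to loosen and no version to pin. The moment an API changes, every client in the repository has to move.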
Google leadership made the choice to have a monorepo because enforcing One Version at scale requires efficient repository-wide migrations. And to Google’s credit their tooling supports this. You can write a substitution command for a particular code usage, and the tool finds the appropriate owner for each match (code ownership is heavily regulated at Google), bundles the changes by owner, sends out the changes for review, and then automates merging in the changes once the owners approve. This gives migrators some impressive powers.
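I won’t reproduce the internal tooling, but a sketch of the workflow it automates might look like the following. The OWNERS-file convention is a real and common one, but every name and detail here is a hypothetical stand-in:

```python
# A hypothetical sketch of a repository-wide migration: apply one
# substitution everywhere, then bundle the edited files by owner so that
# each owner reviews a single change. Not Google's actual tool.

import re
from collections import defaultdict
from pathlib import Path

OLD = re.compile(r"\bOldApi\.frobnicate\(")  # made-up API being migrated
NEW = "NewApi.frobnicate("

def find_owner(path: Path) -> str:
    """Walk up to the nearest OWNERS file, a common code-ownership convention."""
    for parent in path.parents:
        owners = parent / "OWNERS"
        if owners.exists():
            return owners.read_text().splitlines()[0]
    return "unowned"

def run_migration(repo_root: str) -> dict:
    """Rewrite every match and group the touched files by their owner."""
    changes_by_owner = defaultdict(list)
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text()
        if OLD.search(text):
            path.write_text(OLD.sub(NEW, text))
            changes_by_owner[find_owner(path)].append(path)
    return changes_by_owner

# From here, the real tooling sends each owner's bundle out for review and
# merges it automatically once approved.
```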
As an aside, the SWE book provides no quantitative analysis of what, exactly, the impact of One Version is compared to, say, semver, which most open-source projects use for versioning. They don’t really back up any claims, except by comparison to hypothetical scenarios. Even worse, they use circular logic to justify it.
In Accelerate and the most recent State of DevOps reports, DORA points out that there is a predictive relationship between trunk-based development [monorepos] and high-performing software organizations (p. 343).
If you read the DORA report, you’ll learn that the way it identifies “high-performing organizations” is by clustering survey responses to questions like, “How often do you deploy?” and, “How often do your changes break things?” See my first newsletter article on why clustering is a red flag. Nothing about that analysis is predictive, and the report clearly already had Google in mind as a high-performing organization to start.
The idea that developers should work in small batches off master or trunk rather than on long-lived feature branches is still one of the most controversial ideas in the Agile canon, despite the fact it is the norm in high-performing organizations such as Google. (p. 30)
Did I mention the report was commissioned, in part, by Google? (End of Aside.)
The SWE book also says the following in its chapter on version control. Before the quote, I can’t help but mention that the book mainly discusses version control as it existed when CVS and Subversion were king (pre-2010). Today git has over 80% market share, according to perforce.com, a competitor of git (and note that git is not allowed at Google).
When the SWE book discusses version control, it describes the monorepo as a critical component of Google’s success. As mentioned above, you can make codebase-wide changes, and, thanks to the One Version rule, you never need to manage versions. When considering alternatives to monorepos, the authors write (my emphasis added):
Most arguments against monorepos focus on the technical limitations of having a single large repository. [… long time to clone a repo, blah, blah …] The other major argument against monorepos is that it doesn’t match how development happens in the Open Source Software (OSS) world. Although true, many of the practices in the OSS world come (rightly) from prioritizing freedom, lack of coordination, and lack of computing resources. Separate projects in the OSS world are effectively separate organizations that happen to be able to see one another’s code. Within the boundaries of an organization, we can make more assumptions: we can assume the availability of compute resources, we can assume coordination, and we can assume that there is some amount of centralized authority. (p. 346-347)
The last two words are the crux of this issue: monorepos deliberately centralize power. The underlying assumption is that the centralized authority makes good decisions. If not, the same efficiency that enables codebase-wide changes magnifies the impact of a bad decision. “Centrally authoritative” decisions are often made by people without a clear view of the systems they’re affecting or the impact of the decisions being made. Front-line engineers can’t make sensible exceptions for their projects, or cauterize the damage easily.
One example is git. At Google, git was supported as a shim around Perforce, but the shim used a Python 2 library for Google-flavored Perforce. The forced migration to Python 3 (admittedly, not Google’s fault) led to the library being dropped, and nobody was staffed to migrate the shim. Poof! Git is dead.
A more recent, smaller example: an engineer in a European time zone migrated a dependency of mine that broke my systems in an unexpected way. That engineer had made reinstating the deprecated API a compilation error, so rolling back required me to obtain special permission. But by the time the breakage rolled out and started alerting us, it was night in Europe, and the entire team responsible for this change was off hours. I could roll back my system, but this new release included a critically important fix for some other problem. There is a break-glass option to force a release, but it comes with its own hassles and red tape. So I had to hunt around for someone who could approve the change, and altogether it wasted most of a work day.
A version of this story happens to me at least twice a month. X is being deprecated, and all new systems must migrate to Y by date Z, or else submit an exception justification (which will be denied) that only delays the forced migration. If you do nothing, your services will stop running. Every aspect of your work life is subject to such mandates, from code APIs to your IDE, from configuration management to permissions and releases. Rather than being able to pin something to a version that works, the culture is that first someone breaks you (or says, “this might break you, please figure out if it will and act accordingly”), then you have to dig around for the change that caused it, find the right person to approve a rollback, and then figure out how to prevent them from breaking you again. Since Google is so big, it doesn’t matter if everyone perfectly learns their lesson. There is always someone new to ignore best practices and break you all over again.
These forced-migration mishaps accumulate to such a large portion of the job that you wonder if it might be more enjoyable to brew craft cider, or more worth your efforts to go address the climate crisis instead of putting cats on the internet.
I can’t help but wonder whether Google Cloud’s low market share is related to this. If the internal culture is that clients must act multiple times a month to deal with impending breaking migrations, at least some of that attitude likely spills over onto external customers.
But even if you balk at the line I drew between monorepos and Cloud adoption, the question still stands: why are mandated migrations so common at Google? Why can’t Google just support things and maintain backwards compatibility?
My guess is it’s intimately related to the promotion process at Google (note, I have never been on a promotion committee). Everyone needs a “promo project,” and building new things (and removing old things) is a much easier promotion case than maintaining an existing service. It provides easy impact metrics like user growth, removal of old code, and shiny new UIs. And it’s extremely hard to measure the cost of doing something poorly versus an alternative, or versus doing nothing at all. In the simplest case, like the git shim above, there is simply no owner left who can migrate it, so perfectly good products simply disappear because someone else needs a promo project.
Google promotes for offense and firefighting, not for being careful and avoiding disasters. “Offense” also means many new projects are rushed to production, or their mandated adoption is sped along, to meet a promotion deadline. As a result, the new thing is often feature-incomplete or unreliable, and compensating for that creates a whole new layer of technical debt that stays with the project until, one day, after the lead is promoted and moves on, the once-new system is viewed as a quagmire of technical debt. So the remaining engineers concoct a brand-new replacement (which, by the way, would make a very tidy promo case), and the cycle repeats.
I hope it doesn’t all boil down to incentives and the difficulty of measuring impact. It does seem like the core assumptions behind the monorepo—good coordination, availability of human effort, and good centralized decision making—were perhaps truer at Google a decade or two ago than they are now.