Monorepos and Forced Migrations
By most metrics, Google is a great place to work. Top-tier compensation, free food, and co-workers who largely want to do quality engineering. And yet, many Google engineers burn out. Putting aside problems with bad managers, or ethical objections—both of which are more important than what’s in this article—I talk to many Google engineers whose heart simply isn’t in their work. On top of that, there’s a lot of friction to doing anything at Google. Few tasks are simple. Many engineers, myself included, often feel like their most creative and productive years are being wasted jumping over needless hurdles, all to optimize consumer behavior to boost ad revenue.
Let’s zoom in on one particular component of friction that affects me regularly: how software migrations work at Google. Most open-source software works in terms of versioned releases. New releases add features, and major releases can be backwards-incompatible, but for the most part clients are never forced to upgrade; they can stay on an old version, sometimes at their own risk. And while very old things will eventually lose support, the most laudable software projects take decades to get there. For example, most of the Java API is backwards compatible with its earliest versions. This allows me to run programs I wrote in 2006 on a modern system without changing a line of code.
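To make the contrast concrete, here’s a minimal sketch of the “caret” constraint used by semver-based package managers (the version numbers are made up, and I’m ignoring semver’s special-casing of 0.x releases). The point is that the client, not the library author, decides when to take a breaking upgrade:

```python
# A sketch of semver's "caret" constraint, as used by many open-source
# package managers. The client declares "^2.3.1" and automatically accepts
# compatible releases, but a breaking 3.x release requires opting in.

def parse(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def satisfies_caret(version: str, constraint: str) -> bool:
    """True if `version` falls in the caret range `constraint`."""
    base = parse(constraint.lstrip("^"))
    v = parse(version)
    # Same major version (no breaking changes), and no older than the base.
    return v[0] == base[0] and v >= base

assert satisfies_caret("2.9.0", "^2.3.1")      # compatible upgrade: accepted
assert not satisfies_caret("3.0.0", "^2.3.1")  # breaking release: opt-in only
```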
When software spans companies, some degree of backwards compatibility is necessary. You can’t change someone else’s code for them, and even if you could, at scale it wouldn’t be feasible or worthwhile. At Google, it’s the opposite, and this is the source of much friction.
As is well-known, Google has a massive monorepo, and uses Perforce with extensive customization to support it. The recently-made-free “SWE book” (pdf) describes the reasoning for adopting this system. The Google monorepo comes with a number of policies and processes to make working with it feasible. One such policy is called “One Version,” defined as follows.
For every dependency in our repository, there must be only one version of that dependency to choose. (p. 341)
Here’s how they cheer for it:
We highly endorse the One-Version Rule presented here: developers within an organization must not have a choice where to commit, or which version of an existing component to depend upon. There are few policies we’re aware of that can have such an impact on the organization: although it might be annoying for individual developers, in the aggregate, the end result is far better. (p. 349)
They often refer to this as “trunk-based” development, borrowing a term from older version control systems in which the main branch was called “trunk.” This is in contrast to having very long-lived feature branches that cause chaos when merging them back into the main branch.
Motivations aside, the One Version rule has little to do with (the good practice of) having small, short-lived feature branches. It goes much further. No software library can ever have two versions, so the second an API changes, all clients must migrate. The person making the new API is expected to change client code to match, but is not responsible for ensuring the change does not break the client.
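Here’s a toy sketch of what enforcing the rule amounts to; the build targets, dependency names, and checker are all hypothetical stand-ins, not Google’s actual build system:

```python
# A toy illustration of the One-Version Rule (all names hypothetical).
# Every dependency may appear at exactly one version anywhere in the
# repository; introducing a second version is an error, so the only way
# to change an API is to migrate every client at once.

from collections import defaultdict

build_graph = {
    "//ads/frontend": [("protobuf", "3.21")],
    "//maps/backend": [("protobuf", "3.21")],
    # Uncommenting this line would violate One Version for the whole repo:
    # "//legacy/tool": [("protobuf", "2.6")],
}

def check_one_version(graph):
    versions_seen = defaultdict(set)
    for target, deps in graph.items():
        for dep, version in deps:
            versions_seen[dep].add(version)
    for dep, versions in versions_seen.items():
        if len(versions) > 1:
            raise ValueError(f"One-Version violation: {dep} at {sorted(versions)}")

check_one_version(build_graph)  # passes; fails the moment anyone forks a version
```

Contrast this with a semver world: here there is no constraint to loosen and no version to pin. The moment an API changes, every client in the repository has to move.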
Google leadership made the choice to have a monorepo because enforcing One Version at scale requires efficient repository-wide migrations. And to Google’s credit their tooling supports this. You can write a substitution command for a particular code usage, and the tool finds the appropriate owner for each match (code ownership is heavily regulated at Google), bundles the changes by owner, sends out the changes for review, and then automates merging in the changes once the owners approve. This gives migrators some impressive powers.
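I won’t reproduce the internal tooling, but a sketch of the workflow it automates might look like the following. The OWNERS-file convention is a real and common one, but every name and detail here is a hypothetical stand-in:

```python
# A hypothetical sketch of a repository-wide migration: apply one
# substitution everywhere, then bundle the edited files by owner so that
# each owner reviews a single change. Not Google's actual tool.

import re
from collections import defaultdict
from pathlib import Path

OLD = re.compile(r"\bOldApi\.frobnicate\(")  # made-up API being migrated
NEW = "NewApi.frobnicate("

def find_owner(path: Path) -> str:
    """Walk up to the nearest OWNERS file, a common code-ownership convention."""
    for parent in path.parents:
        owners = parent / "OWNERS"
        if owners.exists():
            return owners.read_text().splitlines()[0]
    return "unowned"

def run_migration(repo_root: str) -> dict:
    """Rewrite every match and group the touched files by their owner."""
    changes_by_owner = defaultdict(list)
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text()
        if OLD.search(text):
            path.write_text(OLD.sub(NEW, text))
            changes_by_owner[find_owner(path)].append(path)
    return changes_by_owner

# From here, the real tooling sends each owner's bundle out for review and
# merges it automatically once approved.
```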
As an aside, the SWE book provides no quantitative analysis of what, exactly, the impact of One Version is compared to, say, semver, which most open-source projects use for versioning. They don’t really back up any claims, except by comparison to hypothetical scenarios. Even worse, they use circular logic to justify it.
In Accelerate and the most recent State of DevOps reports, DORA points out that there is a predictive relationship between trunk-based development [monorepos] and high-performing software organizations (p. 343).
If you read the DORA report, you’ll learn that the way it identifies “high-performing organizations” is by clustering survey responses to questions like, “How often do you deploy?” and, “How often do your changes break things?” See my first newsletter article on why clustering is a red flag. Nothing about that analysis is predictive, and the report clearly already had Google in mind as a high-performing organization to start.
The idea that developers should work in small batches off master or trunk rather than on long-lived feature branches is still one of the most controversial ideas in the Agile canon, despite the fact it is the norm in high-performing organizations such as Google. (p. 30)
Did I mention the report was commissioned, in part, by Google? (End of Aside.)
The SWE book also says the following in its chapter on version control. Before the quote, I can’t help but mention that the book mainly discusses version control as it existed when CVS and Subversion were king (pre-2010). Today git has over 80% market share, according to perforce.com, a competitor of git (and note that git is not allowed at Google).
When the SWE book discusses version control, it describes the monorepo as a critical component of Google’s success. As mentioned above, you can make codebase-wide changes, and, thanks to the One Version rule, you never need to manage versions. When considering alternatives to monorepos, the authors write (my emphasis added):
Most arguments against monorepos focus on the technical limitations of having a single large repository. [… long time to clone a repo, blah, blah …] The other major argument against monorepos is that it doesn’t match how development happens in the Open Source Software (OSS) world. Although true, many of the practices in the OSS world come (rightly) from prioritizing freedom, lack of coordination, and lack of computing resources. Separate projects in the OSS world are effectively separate organizations that happen to be able to see one another’s code. Within the boundaries of an organization, we can make more assumptions: we can assume the availability of compute resources, we can assume coordination, and we can assume that there is some amount of centralized authority. (p. 346-347)
The last two words are the crux of this issue: monorepos deliberately centralize power. The underlying assumption is that the centralized authority makes good decisions. If not, the same efficiency that enables codebase-wide changes magnifies the impact of a bad decision. “Centrally authoritative” decisions are often made by people without a clear view of the systems they’re affecting or the impact of the decisions being made. Front-line engineers can’t make sensible exceptions for their projects, or cauterize the damage easily.
One example is git. At Google, git was supported as a shim around Perforce, but the shim used a Python 2 library for Google-flavored Perforce. The forced migration to Python 3 (admittedly, not Google’s fault) led to the library being dropped, and nobody was staffed to migrate the shim. Poof! Git is dead.
A more recent, smaller example: an engineer in a European time zone migrated a dependency of mine that broke my systems in an unexpected way. That engineer had made reinstating the deprecated API a compilation error, so rolling back required me to obtain special permission. But by the time the breakage rolled out and started alerting us, it was night in Europe, and the entire team responsible for this change was off hours. I could roll back my system, but this new release included a critically important fix for some other problem. There is a break-glass option to force a release, but it comes with its own hassles and red tape. So I had to hunt around for someone who could approve the change, and altogether it wasted most of a work day.
A version of this story happens to me at least twice a month. X is being deprecated, and all new systems must migrate to Y by date Z, or else submit an exception justification (which will be denied) that only delays the forced migration. If you do nothing, your services will stop running. Every aspect of your work life is subject to such mandates, from code APIs to your IDE, from configuration management to permissions and releases. Rather than being able to pin something to a version that works, the culture is that first someone breaks you (or says, “this might break you, please figure out if it will and act accordingly”), then you have to dig around for the change that caused it, find the right person to approve a rollback, and then figure out how to prevent them from breaking you again. Since Google is so big, it doesn’t matter if everyone perfectly learns their lesson. There is always someone new to ignore best practices and break you all over again.
These forced-migration mishaps accumulate to such a large portion of the job that you wonder if it might be more enjoyable to brew craft cider, or more worth your efforts to go address the climate crisis instead of putting cats on the internet.
I can’t help but wonder whether Google Cloud’s low market share is related to this. If the internal culture is that clients must act multiple times a month to deal with impending breaking migrations, at least some of that attitude likely spills over onto external customers.
But even if you balk at the line I drew between monorepos and Cloud adoption, the question still stands: why are mandated migrations so common at Google? Why can’t Google just support things and maintain backwards compatibility?
My guess is it’s intimately related to the promotion process at Google (note, I have never been on a promotion committee). Everyone needs a “promo project,” and building new things (and removing old things) is a much easier promotion case than maintaining an existing service. It provides easy impact metrics like user growth, removal of old code, and shiny new UIs. And it’s extremely hard to measure the cost of doing something poorly versus an alternative, or versus doing nothing at all. In the simplest case, like the git shim above, there is simply no owner left who can migrate it, so perfectly good products simply disappear because someone else needs a promo project.
Google promotes for offense and firefighting, not for being careful and avoiding disasters. “Offense” also means many new projects are rushed to production, or their mandated adoption is sped along, to meet a promotion deadline. As a result, the new thing is often feature-incomplete or unreliable, and compensating for that creates a whole new layer of technical debt that stays with the project until, one day, after the lead is promoted and moves on, the once-new system is viewed as a quagmire of technical debt. So the remaining engineers concoct a brand-new replacement (which, by the way, would make a very tidy promo case), and the cycle repeats.
I hope it doesn’t all boil down to incentives and the difficulty of measuring impact. It does seem like the core assumptions behind the monorepo—good coordination, availability of human effort, and good centralized decision making—were perhaps truer at Google a decade or two ago than they are now.