Goodhart's Law in Software Engineering
It's not just about your boss.
Blog Hiatus
You might have noticed I haven't been updating my website. I haven't even looked at any of my drafts for the past three months. All that time is instead going into Logic for Programmers. I'll get back to the site when that's done or in 2025, whichever comes first. Newsletter and Patreon will still get regular updates.
(As a comparison, the book is now 22k words. That's like 11 blog posts!)
Goodhart's Law in Software Engineering
I recently got into an argument with some people about whether small functions were mostly a good idea or always 100% a good idea, and it reminded me a lot about Goodhart's Law:
When a measure becomes a target, it ceases to be a good measure.
The weak version of this is that people have perverse incentives to game the metrics. If your metric is "number of bugs in the bug tracker", people will start spuriously closing bugs just to get the number down.
The strong version of the law is that even 100% honest pursuit of a metric, taken far enough, is harmful to your goals, and this is an inescapable consequence of the difference between metrics and values. We have metrics in the first place because what we actually care about is nonquantifiable. There's some thing we want more of, but we have no way of directly measuring that thing. We can measure something that looks like a rough approximation for our goal. But it's not our goal, and if we treat the metric as if it were the goal, we start taking actions that favor the metric over the goal.
Say we want more reliable software. How do you measure "reliability"? You can't. But you can measure the number of bugs in the bug tracker, because fewer open bugs roughly means more reliability. This is not the same thing. I've seen bugs fixed in ways that made the system less reliable, but not in ways that translated into tracked bugs.
I am a firm believer in the strong version of Goodhart's law. Mostly because of this:
What does a peahen look for in a mate? A male with maximum fitness. What's a metric that approximates fitness? How nice the plumage is, because nicer plumage = more calories to waste on plumage.1 But that only approximates fitness, and over generations the plumage itself becomes the point, at the cost of overall bird fitness. Sexual selection is Goodhart's law in action.
If the blind watchmaker can fall for Goodhart, people can too.
Examples in Engineering
Goodhart's law is a warning for pointy-haired bosses who come up with terrible metrics: lines added, feature points done, etc. I'm more interested in how it affects the metrics we set for ourselves, the ones our bosses might never know about.
- "Test coverage" is a proxy for how thoroughly we've tested our software. It diverges when we need to test lots of properties of the same lines of code, or when our worst bugs are emergent at the integration level.
- "Cyclomatic complexity" and "function size" are proxies for code legibility. They diverges when we think about global module legibility, not local function legibility. Then too many functions can obscure the code and data flow.
- Benchmarks are proxies for performant programs, and diverge when improving benchmarks slows down unbenchmarked operations.
- The amount of time spent pairing/code reviewing/debugging/whatever is a proxy for "being productive".
- The DORA report is an interesting case, because it claims four metrics2 are proxies for ineffable goals like "elite performance" and employee satisfaction. It also argues that you should minimize commit size to improve the DORA metrics. A proxy of a proxy of a goal!
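To make the coverage bullet concrete, here's a minimal hypothetical sketch (`safe_divide` and its test are invented for illustration, not taken from any real codebase): one test executes every line, so a coverage tool reports 100%, yet most properties of the function go unchecked.

```python
def safe_divide(a, b):
    """Divide a by b, returning None when b is zero."""
    if b == 0:
        return None
    return a / b

def test_safe_divide():
    # Both branches execute, so line *and* branch coverage hit 100%...
    assert safe_divide(10, 2) == 5
    assert safe_divide(1, 0) is None
    # ...but nothing checks negatives, floats, huge values, or what
    # callers actually do with the None. The metric says "fully tested";
    # the goal ("thoroughly tested") is not met.
```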
What can we do about this?
No, I do not know how to avoid a law that can hijack the process of evolution.
The 2023 DORA report suggests readers should avoid Goodhart's law and "assess a team's strength across a wide range of people, processes, and technical capabilities" (pg 10), which is kind of like saying the fix to production bugs is "don't write bugs". It's a guiding principle, not actionable advice for living up to that principle.
They also say "to use a combination of metrics to drive deeper understanding" (ibid), which makes more sense at first. If you have metrics X and Y to approximate goal G, then overoptimizing X might hurt Y, indicating you're getting further from G. In practice I've seen it turn into "we can't improve X because it'll hurt Y and we can't improve Y because it'll hurt X." This could mean we're at the best possible spot for G, but more often it means we're trapped very far from our goal. You could come up with a weighted combination of X and Y, like 0.7X + 0.3Y, but that too is a metric subject to Goodhart.
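To make that last point concrete, here's a toy sketch with invented numbers (nothing here comes from the DORA report): a weighted blend of two proxies is still a single number, and it can rise while the goal it approximates quietly gets worse.

```python
def composite(x, y, wx=0.7, wy=0.3):
    """Weighted blend of two proxy metrics -- itself just another metric."""
    return wx * x + wy * y

# Quarter 1: balanced effort on both proxies.
q1 = composite(x=60, y=60)   # ~60

# Quarter 2: the team chases easy wins on X while Y quietly degrades.
q2 = composite(x=90, y=20)   # ~69

# The composite rose, so the dashboard says "better", even though the
# goal G that X and Y were meant to approximate may have gotten worse.
print(f"Q1={q1:.1f}, Q2={q2:.1f}, improved={q2 > q1}")
```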
I guess the best I can do is say "use your best engineering judgement"? Evolution is mindless, people aren't. Again, not an actionable or scalable bit of advice, but as I grow older I keep finding "use your best judgement" is all we can do. Knowledge work is ineffable and irreducible.
1. This sent me down a rabbit hole; turns out scientists are still debating what exactly the peacock's tail is used for! Is it sexual selection? Adverse signalling? Something else??? ↩
2. How soon commits get to production, deployment frequency, percent of deployments that cause errors in production, and mean time to recovery. ↩
One of the few meaningful metrics I saw was Keith Braithwaite's. He measured the slope of a log/log comparison of method complexity vs. number of methods of that complexity. It turns out that there was a very strong association between the gradient and whether a code base had unit test coverage (suggesting TDD). Kent Beck did some similar work on power laws in code during his time at Facebook.
It's not a causal relationship but it's a very interesting indicator and, most relevantly, very difficult to game.
The only online description I can find is at https://www.slideshare.net/slideshow/keith-braithwaite-measure-for-measure/309236
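For readers who want to try the measurement described above, here's a hedged sketch of one way to compute it. The complexity numbers are made up, and the use of numpy's polyfit for the log/log fit is my assumption, not Braithwaite's actual tooling.

```python
import math
from collections import Counter

import numpy as np

# Cyclomatic complexity per function, e.g. as reported by a tool like radon.
complexities = [1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 8]

counts = Counter(complexities)          # complexity value -> number of functions
xs = [math.log(c) for c in sorted(counts)]
ys = [math.log(counts[c]) for c in sorted(counts)]

# Least-squares line in log/log space; the gradient is the indicator.
slope, intercept = np.polyfit(xs, ys, 1)
print(f"power-law gradient: {slope:.2f}")
```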
Isn't unit test coverage directly measurable, though? Like with code coverage or lines of test / lines of code.
Unit test coverage is measurable. But what does 80% vs 90% coverage tell you? Those numbers can be gamed. Also, if it is 80% and really good tests vs 90% and poor tests, is the 90% better? The point is what do the measurement numbers mean? And even "objective" numbers can provide a misleading view.
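As a hypothetical sketch of how those numbers get gamed (`parse_config` is invented for the example): the test below executes every line, so coverage rises, but it asserts nothing and can never fail.

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict."""
    entries = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries

def test_parse_config_for_coverage_only():
    # Every line runs (coverage goes up), nothing is checked (confidence doesn't).
    parse_config("host = localhost\nport = 8080")
```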
V.F. Ridgway said something close to this: "Even where performance measures are instituted purely for purposes of information, they are probably interpreted as definitions of the important aspects of that job or activity and hence have important implications for the motivation of behavior."
Once you set any kind of metric, you send the message that the things you are measuring are what matter, and your people will respond accordingly. If that metric diverges in any way from the thing you really want to achieve -- and it almost always must -- then you'll be driving people to do the wrong thing. The only solution I've found is to adjust metrics over time as the situation evolves. The metric you might use when starting a "bug reduction" initiative (for example) might not make sense once you've addressed the "low hanging fruit" and need to move on to more nuanced problems. That runs headlong into most current management practices, which dictate that we should use the same measures over long periods of time.
This blog post by Fred Hebert has an interesting framework for evaluating whether metrics are meaningful, and it provides some ideas for figuring out whether a metric is actually a useful proxy for the ultimate goal.