In Defense of Slow Feedback Loops
Goodhart's Law is a cruel master.
Recently the piece Integrated tests are a scam was making the rounds. I was going to write a rebuttal, but it comes prerebutted by the author himself, who says in other places that integrated tests are not integration tests, that integrated tests are good for non-basic correctness, and that the problem is actually integrated tests without unit tests. I wonder if the original article was a response to the testing trends of the late '00s, which 10 years later reads like an overcorrection.1
I could look into that, or I could take one of his points and freebase a completely different topic. Here are two critical steps in his integrated test death spiral:
- We write more integrated tests, which are bigger and don’t criticize our design as harshly as microtests do.
- Our tests don’t give us as quick nor strong feedback about our design as they could, and we’re humans, so we design things less carefully than we could. We design more sloppily both because we can get away with it and because we don’t notice as easily that it’s happening.
So the problem is that integrated tests are bigger, take longer to run, and "don't give as quick or strong" feedback as unit tests. Or, more fundamentally, unit tests are a fast feedback loop while integration tests are a slow feedback loop.
"Speed up your feedback loop" is like the universal rule of knowledge work. Faster feedback loops let you iterate more, try novel things, see errors immediately instead of far from when you made them. I think in all circumstances, reducing a feedback loop is the best thing you can do to make it more useful.
But maybe a more useful feedback loop is less useful to your process. I think it's possible that workflows benefit from having one or two slow feedback loops in addition to many many fast ones.
Goodhart's Law
When a measure becomes a target, it ceases to be a good measure. — Goodhart's law
This is a fundamental principle of systems theory. If you measure developer productivity as the number of closed issues per day, people will close lots of spurious or trivial bugs. If you measure it by overall software performance, they'll sacrifice safety and security to get it.
You'd think the law comes from people being jerks, but even AIs and dolphins follow it. I think it's universal because it has such simple causes. Here's how it happens in practice:
- We have a goal, like "educated students" or "software quality", that can't be measured.
- We have a measure, like "SAT scores" or "number of bugs", that's an imperfect proxy for the goal.2
- We know how to act on the measure, or at least incentivize people improving the measure, so we do. For some degree of optimization, this also improves the goal.
- But because the measure is an imperfect proxy, there are optimizations that improve the measure while hurting the actual goal. Because the measure and not the goal is the target, people make these perverse/maladaptive optimizations.
- The overall goal is hurt by making the measure a target. Goodhart laughs at us from Hell.3
Now for the claim: Goodhart's law is accelerated by fast feedback loops. This is based on a couple of arguments. First, to optimize a measure you need to be able to check whether your intervention actually improved it. The faster the feedback loop, the more improvements you can check, the more improvements you can make, and so the more perverse optimizations you find. Second, many perverse improvements are also microoptimizations, and you need fast feedback loops to measure tiny improvements.
If you only take a new measurement once a week, then you don't know what in that week contributed, or how much. You'll only be able to detect large-scale changes to the measurement. Slow loops trade efficiency for alignment. They are less good at optimizing your measure but are also harder to subvert.4
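Here's a toy model of that argument (entirely my own construction; nothing like it appears in the article): an optimizer that only ever checks the proxy, where the evaluation budget stands in for how fast the feedback loop is. With a small budget the true goal is barely dented; with a large one the optimizer finds the region where the measure and the goal come apart.

```python
import random

# Toy Goodhart model. The "goal" is the true quality we can't observe
# directly; the "measure" mostly tracks it but rewards drifting toward
# x = 10 regardless of what that does to the goal. The budget stands in
# for feedback-loop speed: how many times we get to check the measure.
# All names and constants here are invented for illustration.

def goal(x):
    return -abs(x)                     # true quality: best at x = 0

def measure(x):
    return goal(x) + 2 * min(x, 10)    # imperfect, exploitable proxy

def optimize(budget, step=0.5, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for _ in range(budget):
        candidate = x + rng.uniform(-step, step)
        if measure(candidate) > measure(x):   # only the proxy is ever checked
            x = candidate
    return x

for budget in (10, 100, 1000):
    x = optimize(budget)
    print(f"budget={budget:4d}  measure={measure(x):6.2f}  goal={goal(x):6.2f}")
```

More checks means a higher measure and a worse goal: the perverse optimization in miniature.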
Pulling it back to software
There are lots of fast-loop processes intended to "write better software": pairing, unit tests, type systems, linters, etc.
There are also some slow-loop processes: integration tests, code review, and QA teams.
Slow-loop processes are criticized for not being fast enough, which is why people say write unit tests instead of integration tests, pair program instead of code review, etc.
By replacing integration tests with unit tests, we're losing alignment guarantees. Consider the failure modes of integration tests: slow tests, lots of setup, flaky results, poor error reporting. These generally make the tests more frustrating to use, but they still serve the goal of higher quality, more correct software. The failure modes of unit testing, on the other hand, are things like overreliance on mocks and designing your code to be more unit testable. These are failure modes that make unit testing a poorer indicator of code quality and correctness.5 They are also failure modes that make unit tests easier to write, speeding up the feedback loop. That's a good example of Goodhart's Law, where the measure takes priority over the actual goal.
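To make the mock failure mode concrete, here's a minimal sketch (the payment-gateway scenario and all the names are mine, not from the linked article). The mock simply echoes the code's own assumptions back at it, so the test is fast and green even if the real gateway returns a differently shaped response; only a test against the real dependency would catch that.

```python
from unittest.mock import Mock

def charge(gateway, amount_cents):
    """Charge a customer, returning True on success."""
    response = gateway.charge(amount_cents)
    return response["status"] == "ok"

def test_charge_succeeds():
    # The mock is configured to return exactly what charge() expects,
    # so this passes even if the real gateway actually responds with,
    # say, {"state": "approved"}. The test verifies our assumptions
    # about the dependency, not the dependency itself.
    gateway = Mock()
    gateway.charge.return_value = {"status": "ok"}
    assert charge(gateway, 1000)
```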
This argument looks similar to the Cleanroom Engineering practice of not letting developers write unit tests, because you supposedly catch more process errors if a separate team writes the tests. They pair this with a form of proto-mob programming: a combination of a slow feedback loop with a fast one.
Anyway, this is mostly speculation; I don't know if in practice slow feedback loops augment fast ones because they are slow, or if that's just an unfortunate consequence of their necessary properties. Further study is warranted, etc etc etc
1. Though I think Surely the Mars rover needed integrated tests! (Maybe not?) goes overboard in suggesting you could find hardware defects without needing integrated tests. ↩
2. God I wish I knew statistics. This is covariance, right? ↩
3. False, he's still alive. ↩
4. But not unsubvertable. Consider a team that only sees the issue tracker once a week, so they write new features as slowly as possible. No new production code means no new issues! ↩
5. Many people would argue that making your code more unit testable is a good thing: the unit test "design pressure" leads to a better design and higher quality code. I tend to agree that in most cases, unit testability is a decent proxy for quality code. But again, it's a proxy. ↩