Program Spaces and Interesting Counterexamples
Okay, so let's say I have a function `f` and ask you to figure out what it is. Input's a list of integers, output is one integer. You gotta figure it out by seeing what inputs give you what outputs.
First test: `f([3]) == 3`. This removes some possibilities, like `len`, but there's still a lot of other possible functions. It could be `head`, it could be `last`, or `min`, or `max`, or `sum`, or `prod`. It could even be the constant function `f(x) = 3`. You can think of these all as a program space of sorts: the set of all functions that satisfy the specification.[^1] There's a lot of programs in this space, and they're all very different, so we know our spec is too loose.
Next test: `f([3, 2]) == 3`. That removes `last`, `sum`, `min`, and `prod`. Could still be `max`, `head`, or the constant function.
Final test: `f([1, 4, 2]) == 4`. That gets rid of `head` and the constant function, so you can be reasonably confident it's `max`.
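Here's a toy sketch of that filtering process. The candidate table and the loop are my own illustration of the program-space idea, not something from a real synthesis tool:

```python
from math import prod

# Candidate functions: the "program space" before any tests.
candidates = {
    "head": lambda l: l[0],
    "last": lambda l: l[-1],
    "min": min,
    "max": max,
    "sum": sum,
    "prod": prod,
    "len": len,
    "const_3": lambda l: 3,
}

# Each test shrinks the space to the functions that agree with it.
tests = [([3], 3), ([3, 2], 3), ([1, 4, 2], 4)]
for inp, expected in tests:
    candidates = {n: f for n, f in candidates.items() if f(inp) == expected}
    print(inp, "->", sorted(candidates))
# [3] -> ['const_3', 'head', 'last', 'max', 'min', 'prod', 'sum']
# [3, 2] -> ['const_3', 'head', 'max']
# [1, 4, 2] -> ['max']
```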
Now I triumphantly reveal the function:
```python
def max_except_19(l: list[int]) -> int:
    if 19 in l:
        return 19
    return max(l)
```
Hah, you never tested it with 19 as an element! If you had, you would have seen that you totally got `f` wrong. What do you say to that?!
...Probably that I'm being obnoxious. Why would anybody ever need a function like that? I'm just being spiteful.
But what if I instead revealed this?
```python
def furthest_from_origin(l):
    return max(l, key=abs)
```
Then `f([3, -4, 1]) == -4`, because `-4` is the furthest number away from 0. Unlike `max_except_19`, `furthest_from_origin` looks like something a reasonable person might actually want.
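In fact (a quick check of my own, continuing from the definition above, not part of the original tests), `furthest_from_origin` passes every test we ran, so nothing we observed could have ruled it out:

```python
# furthest_from_origin is consistent with all three earlier tests...
assert furthest_from_origin([3]) == 3
assert furthest_from_origin([3, 2]) == 3
assert furthest_from_origin([1, 4, 2]) == 4

# ...but it disagrees with max whenever a negative number has the
# largest absolute value.
assert furthest_from_origin([3, -4, 1]) == -4
assert max([3, -4, 1]) == 3
```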
That realism is what distinguishes spurious counterexamples from interesting ones. We use examples as simplified models of the work we do. A gotcha is a boring counterexample because it's very unlikely to exist in reality. We're unlikely to encounter functions that do something very different on specific inputs. We don't learn anything from those counterexamples, except maybe "Hillel is a smartass."
Then again, here's a realistic function that does something different on a specific input:
```python
def f(l):
    if l == []:
        return 0
    return max(l)
```
`max` is only defined for nonempty lists. There's a cluster of functions that extend `max` to handle empty inputs, such as by returning a standard number or raising an error. This is more interesting than `max_except_19` because there's a relevant domain reason why the empty list is special. It's also less interesting than `furthest_from_origin`. I can think of two reasons why:
- Each possible `f` differs from `max` by a single special case. On most inputs they're identical. This is similar to `max_except_19`, which is only different on a very small number of inputs. By contrast, `furthest_from_origin` differs from `max` on a very large number of inputs. You can think of some programs being closer to each other in the program space than other programs are; valid functions that are far away in the program space make more interesting counterexamples (one crude way to measure that distance is sketched after this list).
- The empty input is a "well-known" edge case. People think about it by habit. More people would see my tests as incomplete because they lacked the empty input than because they lacked negative numbers. Counterexamples are more interesting when they realistically trigger under less "pathological" inputs, like "arrays with duplicates" or "arrays with adjacent duplicates".
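How might you measure distance in the program space? Here's one crude operationalization, entirely my own assumption rather than anything from the post: the fraction of random inputs on which two functions disagree. By this measure `max_except_19` sits right next to `max`, while `furthest_from_origin` is far away:

```python
import random

def max_except_19(l):
    return 19 if 19 in l else max(l)

def furthest_from_origin(l):
    return max(l, key=abs)

def disagreement(f, g, trials=10_000):
    """Fraction of random inputs where f and g return different answers."""
    count = 0
    for _ in range(trials):
        l = [random.randint(-50, 50) for _ in range(random.randint(1, 5))]
        if f(l) != g(l):
            count += 1
    return count / trials

print(disagreement(max, max_except_19))         # tiny: only lists containing 19
print(disagreement(max, furthest_from_origin))  # large: any negative-dominated list
```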
Our specifications should eliminate as much of the program space as possible. Some people at Brown University use this for teaching: they have students write tests that pass "wheat" programs and fail "chaff" programs. To my understanding they mostly focus on fine divisions in the space, challenging students to distinguish `mean` from `median` rather than `mean` from `sum`. I've heard tentative reports that this is very effective at improving students' requirements analysis, and it's something I want to start incorporating into my own teaching.
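To show why that division is "fine" (this example is mine, not from the Brown curriculum): a test only separates wheat from chaff if it uses an input where the two functions actually disagree, and `mean` and `median` agree on a lot of natural-looking inputs:

```python
def mean(l):
    return sum(l) / len(l)

def median(l):
    s = sorted(l)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# A weak test: mean and median coincide here, so it can't tell them apart.
assert mean([1, 2, 3]) == median([1, 2, 3]) == 2

# A discriminating test: a skewed input where they disagree.
assert mean([1, 2, 9]) == 4
assert median([1, 2, 9]) == 2
```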
`furthest_from_origin` is really more of a requirements counterexample than an implementation one. It's too conceptually different from `max` to be something you'd accidentally implement. Writing the wrong `max` for an empty input is more likely to happen. Similar "implementation" counterexamples: using a greedy algorithm instead of a thorough one; having a combination of boolean inputs and mishandling exactly one combination; checking an error in the wrong place. These are all mostly correct, which is why they're more likely to "sneak through" in practice. You're going to end up somewhere close in the program space.
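Here's what the boolean-combination version of that might look like. The domain rules are invented for illustration; the point is that the function is right on three of the four input combinations:

```python
# Hypothetical rules: members ship free, non-members pay 5, express adds 5.
def shipping_cost(member: bool, express: bool) -> int:
    if member:
        # Bug: express should also be free for members, but we charge the
        # surcharge anyway. Wrong on exactly one of four combinations.
        return 5 if express else 0
    return 10 if express else 5
```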
That suggests different techniques for catching intention errors vs implementation errors. If you write `max` when you needed `furthest_from_origin`, you fundamentally misunderstand what you're doing, and that won't be addressed by writing tests, because you'll be testing the wrong thing in the first place.
More on Counterexamples
I still think furthest_from_origin
is an interesting counterexample. Or, more pedantically, it's an example of a interesting counterexample. Counterexamples are more interesting when they are realistic, nonobvious, and nontrivial. This heuristic can help us find better counterexamples in general, not just when talking about testing.
Take the claim "premature optimization is the root of all evil". A counterexample to that claim would be game development, where optimization matters a great deal. But this isn't actually realistic! That's because "premature optimization" is most common among enterprise developers, 99% of whom will never, ever have to program a game. Similarly, "replacing bubblesort with quicksort" is too obvious and specific to be an interesting counterexample.
So what is interesting? I'd consider N+1 queries to be sufficiently interesting. N+1 queries are a data access antipattern: you have a collection of entities and need information on their relations, so you loop over the collection and make a database query per entity. This is surprisingly common and hugely inefficient compared to getting all the information in one query, so it's almost always a good idea to prematurely optimize the N+1 out. That's interesting enough to push us to a more nuanced claim, like "premature optimization is magnitudes less effective per unit time than mature optimization". But that doesn't quite roll off the tongue.
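For concreteness, here's a minimal sketch of the antipattern using `sqlite3` from the standard library (the schema and data are made up for the example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 2, 'b'), (3, 1, 'c');
""")

# N+1: one query for the posts, then one extra query per post.
posts = db.execute("SELECT id, author_id FROM posts").fetchall()
authors = {
    pid: db.execute("SELECT name FROM users WHERE id = ?", (aid,)).fetchone()[0]
    for pid, aid in posts
}

# The optimization: a single join fetches everything in one query.
rows = db.execute(
    "SELECT posts.id, users.name FROM posts JOIN users ON users.id = posts.author_id"
).fetchall()
```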
All of this is speculation and theorycrafting. I might think differently about all this in a couple of weeks.
[^1]: "If it's a program space, what's your distance metric?" That's a good question and my answer is LOOK OVER THERE