Why Property Testing Finds Bugs Unit Testing Does Not
I intended this newsletter to be my thoughts without editing, and I have a new thought, so here goes. I want to respond to this discussion:
But Kids These Days think randomly generating inputs (property-based testing) is cooler than finding the most telling inputs. (cf Hamlet&Taylor, "Partition testing does not inspire confidence", 1990, https://t.co/rNLugvGxob)
— Brian Marick (@marick@mstdn.social) (@marick) March 27, 2021
(Standard disclaimer, don't brigade. It's a civil discussion and Brian is a friend, but even if both weren't the case, brigading is chillul Hashem)
For those of you just joining us, Property-Based Testing is when, instead of giving specific input-outputs in your test, you describe the general properties the execution should have. Then you try it on randomized values. A simple example is testing that addition is commutative:
def add(a, b):
    return a + b
@given(integers(), integers())
def test_add_is_commutative(a, b):
    assert add(a,b) == add(b,a)
(I'm using pytest and hypothesis for everything.)
You can read a deeper treatment of PBT here. PBT is powerful because it lets you test a wider range of inputs, but you have to learn how to come up with properties and how to do complex strategies (input generators), which are both skills.
Anyway, Brian's argument in the tweets is roughly this:
- The majority of errors you find with testing are either issues with an entire "partition" of inputs or "boundary" inputs, like INT_MIN.
- For partition errors, you can pick any random issue and find it, in which case you might as well just write a unit test for that input.
- For boundary errors, those arise from the specification of the problem, and to handle those, it's necessary to think hard about the problem and what the unexpected cases will be, regardless of how you plan to test.
- In which case PBT doesn't provide sufficient benefit over manual unit tests.
I don't think that's exactly his line of reasoning, but I think it's close enough that my response to this version should be compatible with the response to his intended argument. The value PBT has over manual unit testing is in the combinatorial explosion of boundary conditions and edge cases in even a moderately-complex problem.
The Geometry of Tests
Consider a function with one integer input:
def f(x: int):
  ...
We can imagine the function as a 2D graph, like the kind you see in high school. That means you can see a "spatial" element to the inputs: it is a one-dimensional line, describing the entire input space. If we were writing all tests manually, we'd try a few "ok" inputs and then some pathological cases:
- 0
- 1
- negative numbers
- very large numbers
- INT_MINand- INT_MAX, if you have those
That's in the realm of feasibility for manual testing. If there are any edge cases in the specification of the problem, you'd probably see them.
What if we have two integer inputs?
def g(x: int, y: int):
  ...
Now the input space is two-dimensional. This immediately squares our pathological cases: not only do we have to test the edge cases for both x and y, we have specifically test all combinations where both of them are pathological! But beyond that, we also have new edge cases that stem from the geometry of the space:
- x = y
- x = -y
- x < y
- x > y
- x >>> y,- x <<< y
- x divides/is divisible by y
- x and y are coprime
Again, with some cleverness and studying of the spec, you can probably identify what the actual problematic edges are and test them manually. Also, you can write tests that hit multiple of these at once, like testing (1, 1). But even so, you can see that the extra structure of having two variables leads to more edge cases. A 2D plane has a lot more complexity than 1D line. And a 3-space has more complexity than a 2D plane- we'd expect another set of edge cases in the interactions of three variables together.
I make the "input space" sound like a geometric thing, but it's a poor analogy. There's no geoemetric analog for a string input, or a list of numbers1, or a nested dict. This weakens our ability to test, since we can't hypothesis problematic edges by thinking about the geometry. More edge cases become unintuitive, and it's more likely that someone could miss them. Randomized testing becomes increasingly useful as it becomes harder for us to think through all the edges of a system.
(A common "intuition" for this is that for an 10-dimensional hypercube, "99.75% of the volume is in the corners". But I'm already including "interesting substructure of the geometry" as examples of edge cases, while that post builds the intuition off edge cases being values "far from the center". And I also already said that the geometric analogy breaks down. The link is a good post but reading both our articles back-to-back is prolly a bad idea.)
Most PBT examples suck
Later in the discussion Brian expresses a distaste for PBT examples "that use numbers and arrays". I agree with him there. My PBT example of "addition is commutative" is an eye-rolling cliche, along with the notorious "reversing a list twice gives you back the same list." They're popular because they're obvious properties with simple input strategies. You don't need to be good at PBT to explain those examples to someone else.
Every time someone uses reversing a list twice to demonstrate property-based testing, I take a drink.
— David R. MacIver (@DRMacIver) June 17, 2019
No, this isn't a drinking game, I'm just being driven to drink by bad examples.
But that simplicity also makes them bad examples. Without complex input spaces, there's no explosion of edge cases, which minimizes the actual benefit of PBT.2 The real benefits come when you have complex input spaces. Unfortunately, you need to be good at PBT to write complex input strategies. I wrote a bit about it here, but that's for Hypothesis only. The strategy APIs vary between PBT libraries, meaning we can't write about the skills in a more general way.
(In contrast, coming up with good properties is generalizable, which is why there are so many more posts on properties than strategies.)
All this means that most examples people use to demonstrate the benefits of PBT... don't. We're better off writing examples that take strings and dictionaries or use more than two inputs. The kinds of problems where you can think you've studied the spec and written good unit tests but still miss an edge case. I have a few ideas on what those examples might look like, but they're still just sketches in a notebook right now.
Update for the Internet
This was a weekly email newsletter (from four years ago). I still update regularly, usually on topics like Common topics are software history, formal methods, unusual technologies, and the theory of software engineering. You can subscribe here.
Also, Brian wrote a rebuttal to this newsletter here.
- 
Okay sure it's kinda like an infinite-dimension vector space, fine you win ↩ 
- 
I just said that lists are basically "infinite-dimensions", which you'd think would make "reversing a list" interesting, but reversing doesn't actually do anything with the input space. It's "only one partition". I'm sure there's a way to tie it back to the type signature and argue that [a] -> [a]is innately partition-resistant. I don't know enough type theory to do that, though. ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.