Oracle Testing
This is another experiment in producing intermediate workhorse essays, combined with a bunch of theorycrafting.
Most unit tests are "example test", showing an example execution of the program:1
assert mult(2,3) == 6
There's two things going on here. There's the actual example of calling mult
, and then there's comparing the output to an expected value. While these are commonly entangled in testing literature, we can separate them. Many testing frameworks use mocks to allow you to abstract out some external portion of the program, letting you "unitize" it. Tests that assert "mocks are called" are example tests, but don't give us an output:
scrape_website(url)
assert HttpLibrary.get is_called_with(url)
We can also have a test that doesn't provide an example of using the call. Maybe we have to do a lot of setup to test something abnormal, or we're calling several different things and combining the results, or whatever. But we can still test the output is correct. I'd say most unit tests are example tests: the tests that check output but aren't also example tests are a tiny minority. But there are more types of tests than unit, and it makes sense to be able to talk about outputs as a category.
We can steal a term from computability theory. An oracle test is a test where we compare the results of an execution against a known correct value, or "oracle".2 In almost all unit tests, the "oracle" is the human brain. We manually compute that 2*3 == 6
, and so the test is assert mult(2,3) == 6
. But it doesn't have to be done manually. If I'm trying to write a fast multiplication algorithm, I don't need to be the oracle. I can use regular multiplication as an oracle function:
assert fastmult(2,3) == 2 * 3
Once we have the oracle function, we don't even need to come up with test cases ourselves. We can generate random inputs and plug them into the oracle function:
x = rand()
y = rand()
assert fastmult(x, y) == x * y
This will test a broader set of possibilities than a single example test can, leading to the idea of property-based testing (PBT). Instead of feeding in specific inputs, you instead test thousands against thousands of oracles. There's a lot of accidental complexity you have to solve, like deterministic tests, but that's for the library developers to figure out.
In practice, property tests rarely use oracle functions. The elephant in the room: oracle tests only make sense if the oracle is more likely to be correct than what you're testing, which implies the oracle is simpler than the function under test. But if it's simpler and more correct, why not just use the oracle as your implementation? Sometimes there's a good reason to not use a simpler oracle:
- The oracle is too slow
- The oracle is a reference implementation
- The oracle only works on a subset of expected inputs
- The oracle only covers the happy path
- The oracle is the old code you're trying to refactor
But these are all uncommon cases. Most property tests instead test properties:3 not fastmult(x, y) == x * y
, but fastmult(x, y) == fastmult(y, x)
. This, in turn, hampers PBT adoption. Our brains easily think of oracles, whereas you need to be clever to come up with program properties that aren't oracles. Anybody can see the test "multiplication works like it does in my head", but how many people will see the test "multiplication is commutative"?
Rampart Speculation
This problem with using oracles has an implicit premise: that all environments can run all code. What if the oracle could run as test code but couldn't run at all as production code? What if the oracle used features of the test environment that weren't in the production environment? What if the test environment was more powerful than the production environment?
This might sound crazy, but consider: production code has to run in production. People depend on it working, it needs to automatically handle failures, it needs to be performant. Test code can be slower, interactive, more fragile. In return, it can include features that are more powerful and expressive at the cost of stability.
Consider Prolog. Prolog is a logic programming language, letting you do things like this:
suffix(Xs, Ys) :-
append(_, Ys, Xs).
prefix(Xs, Ys) :-
append(Ys, _, Xs).
sublist(Xs, Ys) :-
suffix(Xs, Zs),
prefix(Zs, Ys).
% ------------
% Is [a, c] a sublist of [a, b, c]?
?- sublist([a, b, c], [a, c]).
false
% Give me sublists of [a, b, c]
?- sublist([a, b, c], Ys).
Ys = []
Ys = [a]
Ys = [a, b]
Ys = [a, b, c]
% It goes for a bit
% Give me lists that have [a, b] as a sublist
% and c as the final element
?- sublist(Xs, [a, b]), append(_, [c], Xs).
Xs = [a, b, c]
Xs = [a, b, _1340, c]
% etc etc
Prolog is "more expressive" than a lot of production code. It's uncommon in production, though, due to that same expressiveness. Prolog finds solutions through depth-first search and backtracking, meaning it can burn a lot of computational power getting answers, or hang trying to find a deep solution in the tree when a shallow solution is right there. This is especially bad in production, as you're writing a lot of highly complex code that can fail in interesting ways.
But maybe there's a place for Prolog in testing? The test harnesses are a lot simpler and more independent than production, and you're less tied to performance as a bottleneck. So the barriers to use Prolog aren't as present here. I could see it being used for oracle functions: make sure the fifty lines of production code gets the same answer as the three lines of Prolog oracle.
Maybe not Prolog exactly, maybe not even the entire class of programming languages that Prolog represents. But it suggests, at least, that there's the possibility of testing in a more powerful environment than production. Or using more powerful languages to write oracles, which gives you a good reason to not just run the oracle directly.
To be clear, this is all speculation. I'm not aware of any language that has this kind of power difference between testing and prod,4 and I don't know how well it would work in practice. If I wanted to test it, I'd probably try it with Python and J, simply because I know those two well. Maybe you could have one program write input and outputs to a csv and load that in for testing. A form of multiprogram testing.
To somehow wrap this all up: most unit tests are oracle tests, using your brain as an oracle. Some property tests are oracle tests, using a simpler but unusable-for-production oracle function. We may be able to make more property tests into oracle tests by using a more expressive testing environment, or by parameterizing tests by a separate program's output as the oracle. There may be other forms of oracle, too.
-
All examples are in a pseudocode that doesn't resemble anything, this one kinda got away from me ↩
-
I discovered after writing this that the testing literature uses a broader definition of oracle than I am here, but I'll keep restricted to this narrow use for now. ↩
-
That is definitely a thing I just wrote. ↩
-
I think some people use Groovy and Java this way, but know basically nothing about the Java ecosystem. ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.