Cross-Branch Testing

do

                December 21, 2020

            Cross-Branch Testing

            This was inspired by a few conversations I had last week. There's a certain class of problems that are hard to test:

The output isn't obviously inferrable from the input. The code isn't just doing what a human could abstractly do, it's doing something where we don't know the answer without running the program.
The output doesn't have "mathematical" properties, like round-tripness or commutativity.
Errors in the function can be "subtle": there can be a bug that affects only a small subset of possible inputs, so that a set of individual test examples would still be correct.

Some examples of this: computer vision code, simulations, lots of business logic, most "data pipelines". I was struggling to find a good semantic term for this class of problems, then gave up and started mentally calling them "rho problems". There's no meaning behind the name, just the first word that popped into my head.
Most rho problems are complex, such that providing a realistic example would take too long to set up and explain. Here's instead a purely contrived rho problem.
def f(x, y, z):
  out = 0
  for i in range(10):
    out = out * x + abs(y*z - i**2)
    x, y, z = y+1, z, x
  return abs(out)%100 < 10

Not code you'd see in the real world, but it has all the properties of a rho problem. Imagine this is the code in our deploy branch, and we try to refactor this code on a new branch. The refactor introduces a subtle bug that affects only 1% of possible inputs, which we'll simulate with this change:
def f(x, y, z):
  out = 0
  for i in range(10):
    out = out * x + abs(y*z - i**2)
    x, y, z = y+1, z, x
- return abs(out)%100 < 10
+ return abs(out)%100 < 9

What kind of test would detect this error? Standard options are:

Unit testing. Problem is that if the error is uniformly distributed then each unit test only has a 1%-ish chance of finding the bug. We need to test a lot more inputs.
Property testing. Doesn't work because we don't have easily inferable mathematical properties. 
Metamorphic Testing. There's no obvious relationship between inputs and outputs. Knowing the value of f(1, 1, 1) tells us little about f(1, 1, 2).

(Yes, I know we could try decomposing the function and testing the parts individually. That's not the point: we're trying to find general principles here regardless of the problem.)
But we do have a "correct" reference point: the deploy branch! We can use property testing where the property is that our refactored code always gives the same output as our regular code.¹ Step one, use git worktree:
git worktree add ref deploy

That creates a new ref folder in our directory that's identical to our deploy branch. And by "identical", I mean that you're effectively on the deploy branch when in that folder, to the point of being able to commit. Worktree is pretty cool! Anyway, if our code is f, our deploy version is ref.f. We can write a property test (using hypothesis and pytest):
from hypothesis import given
from hypothesis.strategies import integers

import ref.foo as ref
import foo

@given(integers(), integers(), integers())
def test_f(a,b,c):
    assert ref.f(a,b,c) == foo.f(a,b,c)

Running this gives us an erroneous input:
Falsifying example: test_f(
    a=5, b=6, c=5,
)

This is nondeterministic: I've also gotten (0, 4, 0) and (-114, -114, -26354) as error cases. Regardless, I still think it works as a proof of concept. We find an error case where the original code and the refactor diverge. Main blocker I see is that version control systems aren't designed for this kind of use case. Before someone told me about git worktree I had pyest setup code manually change between branches and reload the module, which is... not exactly usable in production. But this seems close enough to something that a sufficiently-motivated person could pull it off.

This bears a lot of similarity to a technique known as "snapshot testing", where you compare the output to a result stored in a text file. The difference is that this is generative: we aren't tied to fixed inputs. Also, snapshot testing is actually used in production while this is a wild fringetech idea.
While we're here, this also counts as metamorphic testing. Instead of our relation being between inputs, it's between versions of the functions. Metamorphic testing is pretty neat! ↩

            If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.

Don't miss what's next. Subscribe to Computer Things:

Start the conversation: