Some tests are stronger than others

strength

                June 14, 2023

            Some tests are stronger than others

                "Stronger" tests are better for determining correctness, while "weaker" tests are useful for localizing errors.

            Hello everyone!
Monday was the final TLA+ workshop of the season! I'm finally done with workshops and all outstanding consulting gigs, leaving the whole summer free for writing and projects. And like, learn new tech in general. Maybe I'll pick up Rust again. Who knows.
I haven't been thinking too much of new ideas, but one came up last night about how everyday programmers can use a concept from formal methods. In FM we have a notion of property strength, defined as
STRONG => WEAK

Any system that has STRONG as a property also has WEAK. This matches our notion of strength because a system can have a bug that breaks STRONG but not WEAK, but can't have a bug that breaks WEAK but not STRONG. In a sense, WEAK is redundant, because it cannot give us any "new information" about correctness.
To transfer this to "regular" programming, let's apply this to testing. TEST1 is stronger than TEST2 if TEST1 passing means that TEST2 will also pass. We'll use the same arrow => to mean "stronger than".¹
If TEST1 => TEST2, then there's no bug which TEST2 would catch that TEST1 does not. It doesn't mean that TEST2 isn't useful: if it passes when TEST1 fails, that helps us localize the bug. But it means that TEST2 doesn't give us any additional confidence about program correctness.
Some Examples
The simplest example would be:
test A {add(1, 2) does_not_raise_error}
test B {add(1, 2) > 1}
test C {add(1, 2) == 3}

C => B => A, as if add(1, 2) == 3, then it's definitely greater than one and definitely didn't raise an error.
That's contrived for instructive purposes. A more common and useful case is integration tests. An integration test crossing modules A and B will imply some unit tests in A and some unit tests on B. If an integration test can decompose into a set of unit tests, it must be stronger than any individual test in that set. Similarly, an end to end test can be decomposed into integration tests.²
E2E => Integration => Unit

This is also demonstrates that weaker doesn't mean not useful: it's easier to localize an error if a unit test fails than if an integration test fails.
Test transformations
One cool thing we can do is take a property P and generate a stronger/weaker property without knowing anything about P. For example:
<>[]P => []<>P

(If a system can stabilize, it's resilient. See my post Stability and Resilience Properties for more detail on this)
An example with tests would be mock-removal:
test weak {f(x) calls mock M}
test strong {f(x) == y}

Another is "lifting" a single-input test to a property test:³
test weak {add(1, 0) == 1}
test strong {
  given(x: int)
  f(x, 0) == x
}

Sometimes two tests together can be stronger than a single test, even if neither test is individually stronger:
test weak {add(2, 4) == 6}
test part1 {add(4, 2) == 6}
test part2 {
  given(x: int, y: int)
  add(x, y) == add(y, x)
}

Uses
If you already know there's an error, a weaker test can be more useful than a stronger test, because it localizes where the bug is more. If you're trying to determine correctness, though, stronger tests are better. I wonder if there's a way to speed up test suites using this. If P is stronger than Q, there's no point to running Q if P already passed.
I also wonder if this gives us a way of formalizing "integration test". It's not clear what distinguishes a "unit" from an "integration" test, but I think any good categorization would have a single integration test imply a set of weaker unit tests.

Nitpick: it's actually "at least as strong as", because TEST2 can also imply TEST1, in which case we have TEST1 <=> TEST2. ↩

I'm not too sure if this is true of all possible integration and E2E tests? Maybe there's some tests that can't be decomposed, and they are extra-important to write. ↩

Nitpick: since property testing generates random inputs, it's possible it won't test (1, 0), and there's only a bug with (1, 0). This is more a case of "moral correctness" than "formal correctness". ↩

            If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.

Don't miss what's next. Subscribe to Computer Things:

Start the conversation: