"Testing can show the presence of bugs but not the absence"
An analysis of the ramifications.
> Program testing can be used to show the presence of bugs, but never to show their absence! — Edsger Dijkstra, *Notes on Structured Programming*
Dijkstra was famous for his spicy quips; he'd feel right at home on tech social media. He said things he knew weren't absolutely true but would get people listening to him. In the same essay he discusses how testing could, in specific cases, show the absence of bugs:
- If a function has a finite set of inputs, like `floor(float)`, you could prove correctness by simply testing all possible inputs.
- Even if a function has infinite inputs, tests prove the absence of bugs for tested inputs. You could say "`date_of_easter` works for any date within 10,000 years of today" and have "proven the absence of bugs" for the subset of dates people care about.¹
- You can use testing to complete a proof. For example, you can divide inputs into "equivalence classes", prove that if the function is correct for one value it will also be correct for all values in that class, and then test one value from each class (see the sketch after this list).
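Here's a minimal sketch of the equivalence-class idea in Python (my own illustration, not from the essay; `sign` and its classes are hypothetical):

```python
# Hypothetical example: partition the inputs of sign() into equivalence
# classes (negatives, zero, positives) and test one representative of each.
def sign(x: int) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    if x < 0:
        return -1
    if x == 0:
        return 0
    return 1

# Each branch treats every member of its class identically; granting that
# argument about the branches, these three tests cover all integers:
assert sign(-37) == -1  # representative of the negatives
assert sign(0) == 0     # representative of zero
assert sign(12) == 1    # representative of the positives
```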
In the majority of cases, however, testing cannot show the absence of bugs. Testing `f(16)` does not give you any guarantees about `all x. f(x)`.
There are two possible takeaways. The reasonable one is that we shouldn't be blindly confident in testing, should use a variety of approaches as part of defense in depth, and always be careful about what we claim. The unreasonable takeaway is that testing is an inferior verification method, and we should use more "rigorous" approaches instead of testing.
It's unreasonable because methods that show the existence of bugs are more robust than methods that show the absence of them, and this makes them more useful more of the time.
Comparing tests and proof
Let's say you have a pseudocode function like this:
```
fun elem_before_x(s: int[], x: int) returns (r: int | NULL) {
    let i: int = index(s, x);
    if (i == NULL) {
        return NULL;
    } else {
        return s[i-1];
    }
}
```
Property:

```
if r == NULL then x notin s
else some i in 0..<len(s): s[i] == r && s[i+1] == x
```
The bug here is that we go out of bounds if `x` is the first element of the array. Testing will find this bug if you remember to test that case; otherwise, we'd have a bug and a passing test suite, as in the sketch below. Tests can't verify the absence of bugs.
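To make that concrete, here's a quick Python rendering of the pseudocode (my translation, not the post's), along with a test suite that passes while never exercising the bad case:

```python
from typing import Optional

# Python translation of the pseudocode above, bug included. Note that
# Python's negative indexing changes how the bug manifests: s[-1] quietly
# returns the last element instead of crashing out of bounds.
def elem_before_x(s: list[int], x: int) -> Optional[int]:
    if x not in s:
        return None
    return s[s.index(x) - 1]  # bug: the index is -1 when x is s[0]

# A passing test suite that forgets "x is the first element":
assert elem_before_x([1, 2, 3], 3) == 2     # happy path
assert elem_before_x([1, 2, 3], 9) is None  # x not in s
# Untested: elem_before_x([1, 2, 3], 1) returns 3, violating the property.
```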
Now imagine we do find it and fix it:
```
else if (i == 0) {
    return s.last;
}
```
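In the Python sketch the patch looks like this (again my illustration; the explicit branch makes the wrap-around deliberate rather than a negative-indexing accident), plus a regression test for the case we missed:

```python
from typing import Optional

def elem_before_x_fixed(s: list[int], x: int) -> Optional[int]:
    if x not in s:
        return None
    i = s.index(x)
    if i == 0:
        return s[-1]  # wrap around to the last element, as in the fix above
    return s[i - 1]

# Regression test for the formerly missing case:
assert elem_before_x_fixed([1, 2, 3], 1) == 3
```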
Modifying the property is left as an exercise for the reader. Once again, a passing test suite doesn't verify the absence of the bug. Now, how does formal verification do on the same problem? I wrote a Dafny version of the same problem, here, both with and without the fix. Looking at v1, we see it finds the `s[0]` error:
```
var v := index.Extract();
return Some(s[v-1]);

(19,16): Error: index out of range
   |
19 | return Some(s[v-1]);
```
This error also goes away in v2: the bug is proved to be absent. But v2 adds a new error:
```
ensures if r.None? then x !in s
  else if s[|s|-1] == r.value then s[0] == x
  else s[IndexOf(s, r.value)+1] == x

(7,20): Error: index out of range
  |
7 | else if s[|s|-1] == r.value then s[0] == x
  |         ^^^^^^^
```
When would `s[|s|-1]` be an index out of range? When the list is empty. But we know that's not possible: if we return a non-null value, the list must be nonempty, because we found `x` in it! The prover doesn't know that; it can't infer the relationship between the returned value and what we know about the list.² We have to annotate our code with proof hints for it to recognize that if we return a number, the list is nonempty.
In short: the prover definitely finds the out-of-bounds bug, but also raises a spurious "empty list" bug. The test might not find the former but also won't raise the latter. Tests can have false negatives but not false positives, while provers can have false positives but not false negatives.
(I'd say "formal verification can prove the absence of a bug but never the presence", but that's not true. Dafny can sometimes give you a concrete counterexample to a property, and type systems detect true positives like "you passed an array into an Int -> Int
function" all the time.)
Knowing there "might" be a bug is not as useful as knowing there is one. Empirically speaking people mostly ignore compiler warnings and plenty of retrospectives find people ignore "might possibly be a bug" in even mission-critical software. Formal verification often becomes an "all or nothing" deal, either you totally prove the absence of bugs in programs or you get no benefits at all.
And that sucks! It takes a lot of effort to prove something correct, orders of magnitude more than running some tests. That's why I consider testing "more robust": even a little bit of testing may find a true bug, while a little bit of verification doesn't get you much.
Correctness is not a thing
I'll end this by getting on my soapbox. Here's something that's really important to understand about "proving the absence of bugs":
You do not prove something's correct. You prove conformance to specification.
It's possible for code to conform to the specification and still have bugs. Two common cases are:
- The spec is wrong or incomplete, like `v2.dfy` is in the above example; spoiler in the footnote.³ A toy sketch of this case follows the list.
- The spec is based on an assumption that's not true in practice, like "no other program is writing to the same file."
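Here's that toy sketch in Python (entirely my own, hypothetical example): the spec only demands sorted output, so a plainly broken implementation conforms to it.

```python
# Hypothetical "conforms to spec, still buggy" example: the spec only says
# the output must be sorted; it forgets to say the output is a permutation
# of the input.
def is_sorted(xs: list[int]) -> bool:
    return all(a <= b for a, b in zip(xs, xs[1:]))

def my_sort(xs: list[int]) -> list[int]:
    return []  # vacuously sorted: meets the incomplete spec, loses all data

assert is_sorted(my_sort([3, 1, 2]))  # spec check passes; behavior is wrong
```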
This is also important because it puts the specification front-and-center. You can write tests without having a specification in mind, but you need a spec to do any kind of formal method. Doing FM trains you to think about specs, which in turn helps you write better tests.
Things I am interested in
- Crossovers from fields besides classical engineering: people who worked professionally as something else before switching to software. What ideas from your old job as a journalist/librarian/social worker/acrobat could be useful in software, and vice versa?
- Wargaming in software. Have you designed or participated in a devops/security/system design/whatever simulated game? Was it helpful?
If you have insights into either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.
1. Though you could argue this isn't sufficient for security. If `easter(-1e8)` causes a memory overflow, then that's a bug even if a normal user won't ever call it. ↩
2. We're lucky that the Dafny standard library defines a postcondition for index functions saying the output is less than the length of the list. Otherwise the prover would raise a potential out-of-bounds error on `s[IndexOf(s, x)]`! ↩
3. `s[|s|-1] == r.value then s[0] == x` isn't true for the input `[a, r, x, r]`. The good news is that Dafny refuses to verify the code because it didn't match the spec. I only realized this when I was error-checking this piece. (This doesn't refute the point I made above about false positives: the false positive was that `s[|s|-1]` was out of bounds!) ↩