Exhaustive Testing is Not a Proof of Correctness

It is often said that exhaustively testing a piece of software is equivalent to performing a proof of correctness. Although this idea is intuitively appealing—and I’ve said it myself a few times—it is incorrect in a technical sense and also in practice.

What’s wrong with exhaustive testing in practice? The problem comes from the question: What is it that we are testing? One possible answer would be “We’re testing the program we wrote.” For example, I might write this program containing an exhaustive test harness and an ostensible identity function which is defined to return the value of its argument for any value in the range -5..5:

#include <stdio.h>

int identity (int x)

int main (void)
  int i;
  for (i=-5; i<=5; i++) {
    printf ("identity(%d) = %d\n", i, identity(i));
  return 0;

Let’s run the exhaustive tester:

$ gcc ident_undef.c ; ./a.out
identity(-5) = -5
identity(-4) = -4
identity(-3) = -3
identity(-2) = -2
identity(-1) = -1
identity(0) = 0
identity(1) = 1
identity(2) = 2
identity(3) = 3
identity(4) = 4
identity(5) = 5

Nice! We have used exhaustive testing to show that our program is correct. But actually not:

$ gcc -O ident_undef.c ; ./a.out
identity(-5) = 1322972520
identity(-4) = 26
identity(-3) = 18
identity(-2) = 18
identity(-1) = 18
identity(0) = 18
identity(1) = 17
identity(2) = 17
identity(3) = 17
identity(4) = 17
identity(5) = 17

The identity function is, of course, woefully wrong even if it happened to work at -O0. So clearly we did not prove the C program to be correct. On the other hand, perhaps we proved the compiled program to be correct? In this case (I looked at the object code) the compiled program is correct.

The thing is, it is not difficult to create a different compiled program that passes exhaustive testing, while being demonstrably incorrect. For example, a program that relies upon buggy operating system functionality or buggy hardware (such as a Pentium with the fdiv bug) could pass exhaustive testing while being totally wrong.

So maybe the thing that we’re proving correct isn’t just the compiled program. Maybe it’s the entire computer system: the hardware platform, the operating system, and also our compiled program. Now, a program that gives the correct result by relying on the fdiv bug is perfectly fine.

There are two problems with saying that we have proved our computer system correct (with respect to its ability to execute a program we have chosen). First, the computer system as a whole is of course incorrect: any real hardware and operating system will have some known bugs and we have failed to argue that none of these bugs can affect the execution of the identity function (to do that, we would have to test every input to our program in every possible state that the hardware and OS could be in—a tough job). Second, the computer is a real physical object and it follows the laws of physics. If we wait long enough, a disturbance will occur (it could be external, like some alpha particles, or internal, like a quantum fluctuation) that causes the computer to fail to give the right result for our program.

The fundamental problem here is that our “proof” was not in terms of an abstraction, it was in terms of a physical object. Physical objects live in the real world where all guarantees are probabilistic.

The second problem with saying that exhaustive testing constitues a proof (actually, the second aspect of the only problem) is that a proof of correctness is a mathematical proof, whereas a collection of successful test cases is not a mathematical proof. A collection of successful test cases, even if it is exhaustive, may form a very compelling argument, but that doesn’t make it a proof.

Although I do not know of anyone who has done this, it should be perfectly feasible to automatically turn an exhaustive collection of test cases into a proof of correctness. To do this you would first need to borrow (or create) a formal semantics for the computing platform. This could be a semantics for C, it could be a semantics for x86-64, or it could be something different. The formal semantics can be used to evaluate the behavior of the computer program for every input; if the behavior is correct for all inputs, then we can finally construct a proof of correctness. The point of using a formal semantics is that it provides a mathematical interpretation, as opposed to a physical one, for the computing platform.


8 responses to “Exhaustive Testing is Not a Proof of Correctness”

  1. Isn’t another in-practice killer of this perspective that most programs accept infinitely many inputs?

  2. It reminds me of the old truth that in practice, a program is more or less tied to the machine on which it was developed. You have to force testing on a wide range of machines (and operating systems) (and compilers) to really smoke out the implicit and invisible assumptions that come from the developers main machine. To often does code break when put into some new context.

  3. Hi Robby, sure. I mainly wrote this up since it helps get at the basic difference between verification and testing, which I find interesting…

    Jakob, definitely.

  4. You know, I’m glad that we’ve made progress on this front since the “old days” (partially by everyone using x86…. but not completely), in that good PLs these days help you abstract away a lot of these details. I spend lots of time working on a complex GUI app and I mostly don’t have to worry about platform-specific issues (and, perhaps more tellingly) few of my bug reports end up being platform-specific, despite the fact that the reporter usually considers that an important piece of information.

  5. Robby are you saying that you guys have moved all of the bugs into Racket??

    But of course I agree with you. The improvement in programming environments just in the time I’ve been programming (~30 years) has been shocking. Part of it is widespread adoption of automatic memory management, part of it is languages with real type systems (whether static or dynamic), part of it is just that there are finally enough MHz and megabytes to go around. There must be even more going on. Is anything written about this? Seems pretty interesting. Of course this is all begging for some crusty Lisp Machine folks to show up and tell us they had it all in 1984. I actually passed on buying several Lisp machines for really cheap at an auction in the 1990s, have kicked myself since then.

  6. Since John was right about some experiments that I ran this week, and I was very wrong, I’m feeling ok with playing devil’s advocate here for a second:

    “A collection of successful test cases, even if it is exhaustive, may form a very compelling argument, but that doesn’t make it a proof.”

    I think if

    1) The formal semantics you are proving over is equivalent to the one of the tested system


    2) The checks in the test are equivalent to the specification used in the proof

    then a collection of successful test cases by definition does constitute a proof. Proof by exhaustive enumeration of all cases, or running on an abstract machine is a perfectly valid approach to proof. Most everyone agrees that successful full model checking constitutes proof, but many model checking algorithms amount to exhaustive testing with some state storage to reduce what is required to be exhaustive, given some constraints on the specification. The problem is just imprecise notion of “exhaustive” tests — using a different notion of correctness (correct under this compile) than is being used in the proof. You could have done the proof in this example over the assembly code and “proved” the compiled version correct. So I think exhaustive testing _is_ a proof method, it’s just one where you have to be especially careful about noting exactly what you’ve proved.

  7. Hi Matt, thanks for the link! Interesting stuff. I’m an extremely poor control theorist myself (having barely scraped by in every differential equations course I ever took) so I may be the wrong person to ask. I think that a few specific kinds of software are amenable to this kind of analysis but in general the discontinuities in functions implemented by real software are just too difficult to deal with. But I could be totally wrong. What do you think of this paper?