Exhaustive Testing is Not a Proof of Correctness

It is often said that exhaustively testing a piece of software is equivalent to performing a proof of correctness. Although this idea is intuitively appealing—and I’ve said it myself a few times—it is incorrect in a technical sense and also in practice.

What’s wrong with exhaustive testing in practice? The problem comes from the question: What is it that we are testing? One possible answer would be “We’re testing the program we wrote.” For example, I might write this program containing an exhaustive test harness and an ostensible identity function which is defined to return the value of its argument for any value in the range -5..5:

#include <stdio.h>

int identity (int x)

int main (void)
  int i;
  for (i=-5; i<=5; i++) {
    printf ("identity(%d) = %d\n", i, identity(i));
  return 0;

Let’s run the exhaustive tester:

$ gcc ident_undef.c ; ./a.out
identity(-5) = -5
identity(-4) = -4
identity(-3) = -3
identity(-2) = -2
identity(-1) = -1
identity(0) = 0
identity(1) = 1
identity(2) = 2
identity(3) = 3
identity(4) = 4
identity(5) = 5

Nice! We have used exhaustive testing to show that our program is correct. But actually not:

$ gcc -O ident_undef.c ; ./a.out
identity(-5) = 1322972520
identity(-4) = 26
identity(-3) = 18
identity(-2) = 18
identity(-1) = 18
identity(0) = 18
identity(1) = 17
identity(2) = 17
identity(3) = 17
identity(4) = 17
identity(5) = 17

The identity function is, of course, woefully wrong even if it happened to work at -O0. So clearly we did not prove the C program to be correct. On the other hand, perhaps we proved the compiled program to be correct? In this case (I looked at the object code) the compiled program is correct.

The thing is, it is not difficult to create a different compiled program that passes exhaustive testing, while being demonstrably incorrect. For example, a program that relies upon buggy operating system functionality or buggy hardware (such as a Pentium with the fdiv bug) could pass exhaustive testing while being totally wrong.

So maybe the thing that we’re proving correct isn’t just the compiled program. Maybe it’s the entire computer system: the hardware platform, the operating system, and also our compiled program. Now, a program that gives the correct result by relying on the fdiv bug is perfectly fine.

There are two problems with saying that we have proved our computer system correct (with respect to its ability to execute a program we have chosen). First, the computer system as a whole is of course incorrect: any real hardware and operating system will have some known bugs and we have failed to argue that none of these bugs can affect the execution of the identity function (to do that, we would have to test every input to our program in every possible state that the hardware and OS could be in—a tough job). Second, the computer is a real physical object and it follows the laws of physics. If we wait long enough, a disturbance will occur (it could be external, like some alpha particles, or internal, like a quantum fluctuation) that causes the computer to fail to give the right result for our program.

The fundamental problem here is that our “proof” was not in terms of an abstraction, it was in terms of a physical object. Physical objects live in the real world where all guarantees are probabilistic.

The second problem with saying that exhaustive testing constitues a proof (actually, the second aspect of the only problem) is that a proof of correctness is a mathematical proof, whereas a collection of successful test cases is not a mathematical proof. A collection of successful test cases, even if it is exhaustive, may form a very compelling argument, but that doesn’t make it a proof.

Although I do not know of anyone who has done this, it should be perfectly feasible to automatically turn an exhaustive collection of test cases into a proof of correctness. To do this you would first need to borrow (or create) a formal semantics for the computing platform. This could be a semantics for C, it could be a semantics for x86-64, or it could be something different. The formal semantics can be used to evaluate the behavior of the computer program for every input; if the behavior is correct for all inputs, then we can finally construct a proof of correctness. The point of using a formal semantics is that it provides a mathematical interpretation, as opposed to a physical one, for the computing platform.

GCC pre-4.8 Breaks Broken SPEC 2006 Benchmarks

UPDATE: Ok, I did something stupid. I wrote most of this post while using a pre-4.8 snapshot that I had sitting around and then I built the actual 4.8.0 release to verify the behavior but I screwed up a path or something. The 4.8.0 release does not have the behavior described in this post. It was patched a few days before the release, see comments 9 and 10 below. I apologize for the sloppiness. Of course, SPEC is still broken and there are still plenty of undefined programs not in SPEC that GCC will break for you…

Near the top of the GCC 4.8.0 release notes we find this:

GCC now uses a more aggressive analysis to derive an upper bound for the number of iterations of loops using constraints imposed by language standards. This may cause non-conforming programs to no longer work as expected, such as SPEC CPU 2006 464.h264ref and 416.gamess. A new option, -fno-aggressive-loop-optimizations, was added to disable this aggressive analysis.

The thing that is happening here—a compiler breaking a previously working program by exploiting undefined behavior—is a topic I’ve written plenty about, but breaking the current version of SPEC seems like a big enough deal to make it worth looking into. If you want the goriest possible details, check out some discussion at the GCC bugzilla and the response from SPEC. Here I’ll just present the essentials.

The problem with h264ref is a function that contains code like this:

int d[16];

int SATD (void)
  int satd = 0, dd, k;
  for (dd=d[k=0]; k<16; dd=d[++k]) {
    satd += (dd < 0 ? -dd : dd);
  return satd;

Using GCC 4.8.0 on Linux on x86-64, we get this (I’ve cleaned it up a bit):

$ gcc -S -o - -O2 undef_gcc48.c
	jmp	.L2

Of course this is an infinite loop. Since SATD() unconditionally executes undefined behavior (it’s a type 3 function), any translation (or none at all) is perfectly acceptable behavior for a correct C compiler. The undefined behavior is accessing d[16] just before exiting the loop. In C99 it is legal to create a pointer to an element one position past the end of the array, but that pointer must not be dereferenced. Similarly, the array cell one element past the end of the array must not be accessed.

One thing we might ask is: What is the compiler thinking here? Why not just give a nice warning or error message if the program is going to unconditionally execute undefined behavior? The basic answer is that the compiler isn’t somehow maliciously saying to itself: “Aha– undefined behavior. I’ll screw up the object code.” Rather, the compiler is simply assuming that undefined behavior never occurs. How does that come into play in SATD()?

In detail, here is what’s going on. A C compiler, upon seeing d[++k], is permitted to assume that the incremented value of k is within the array bounds, since otherwise undefined behavior occurs. For the code here, GCC can infer that k is in the range 0..15. A bit later, when GCC sees k<16, it says to itself: “Aha– that expression is always true, so we have an infinite loop.” The situation here, where the compiler uses the assumption of well-definedness to infer a useful dataflow fact, is precisely analogous to the null-pointer check optimization which got some Linux kernels into trouble.

My view is that exploiting undefined behavior can sometimes be OK if a good debugging tool is available to find the undefined behavior. In this case, a memory safety error, we’ll turn to Valgrind and the address sanitizer (an awesome new feature in GCC 4.8.0, ported over from LLVM). Here goes:

$ gcc undef_gcc48.c
$ valgrind -q ./a.out
$ gcc -fsanitize=address undef_gcc48.c
$ ./a.out 
$ clang -fsanitize=address undef_gcc48.c
$ ./a.out 

No dice, bummer. Valgrind cannot change the layout of objects so I would not have expected it to notice this, but it seems like the address sanitizer should be able to catch the problem. On the other hand, where dynamic analysis fails us, static analysis succeeds:

$ gcc -O3 undef_gcc48.c -Wall 
undef_gcc48.c: In function ‘SATD’:
undef_gcc48.c:6:29: warning: array subscript is above array bounds [-Warray-bounds]
   for (dd=d[k=0]; k<16; dd=d[++k]) {
undef_gcc48.c: In function ‘main’:
undef_gcc48.c:6:29: warning: array subscript is above array bounds [-Warray-bounds]
   for (dd=d[k=0]; k<16; dd=d[++k]) {

For whatever reason, this warning shows up at -O3 but not -O2. We should, however, not pat ourselves on the back quite yet—this kind of static analysis is generally somewhat brittle and we cannot expect to always get such a nice warning.

It would be reasonable to ask the larger question of whether a compiler should be breaking code like this, but it’s late to be opening that can of worms, so I’ll leave this discussion for another day (or for the comments). I have not looked into 416.gamess but I suspect it’s more of the same.

Proofs from Tests

An idea that I’ve been interested in for a while is that a good test suite should be able to directly support formal verification. How would this work, given that testing usually misses bugs? The basic insight is that a test case is usually telling us about more than just one execution: it’s telling us about a class of similar executions that will all give the correct result if the test case succeeds. Partition testing is based on this same idea.

To get started, let’s do a little thought experiment where our test suite is exhaustive: it tests every possible input value. Although this is obviously very restrictive, exhaustive testing is sometimes feasible for non-trivial examples such as a function that decodes a binary ARM instruction.

What does it take to make a proof out of an exhaustive collection of test cases? We need some sort of a mathematical model for how the system works—an executable formal semantics—and also a mathematical model for how the system is intended to work: a formal specification. Then, we basically unroll the execution of the system on each test case and show that the result is equivalent to the specified result. The conjunction of this collection of statements about the system is our proof. The problem is that it’s going to be a very large, unsatisfying proof.

So the next problem is turning the cumbersome proof into a prettier one. The basic strategy is to look for a collection of “similar” conjuncts: these should all have the same control flow and, for all operations performed (addition, subtraction, etc.) the inputs and outputs should live in the same continuous region. Two’s complement addition, for example, has a discontinuity where it wraps around; the 8-bit signed addition operations 50+50 and 100+100 therefore do not live in the same continuous region. I’m guessing that for a suitable definition of “continuous” (which may not be totally straightforward for bitwise operations) it should be possible to automatically collapse a collection of cases into a single case that covers all of them. In terms of partition testing, our final proof will have one conjunct for each partition.

But we still haven’t accomplished anything particularly useful, since we can’t exhaustively test most of the codes that we care about. What I hope is that if we actually went through this exercise, we’d learn something about how to attack the more general case where we have a non-exhaustive collection of test cases and we want to automatically expand each of them to reason about an entire partition of the input space. If we could do that, a collection of test cases achieving 100% partition coverage would lead to a proof of correctness.

Consider a 2-bit saturating adder where the result of the addition sticks at the minimum or maximum representable value instead of overflowing. This function takes a pair of values in the range -2..1 and returns a value in the same range. A C implementation would be:

int twobit_sat_add (int a, int b)
  int result = a+b;
  if (result>1) return 1;
  if (result<-2) return -2;
  return result;

Here is an exhaustive collection of test cases with results:

twobit_sat_add (-2, -2) = -2
twobit_sat_add (-2, -1) = -2
twobit_sat_add (-2,  0) = -2
twobit_sat_add (-2,  1) = -1
twobit_sat_add (-1, -2) = -2
twobit_sat_add (-1, -1) = -2
twobit_sat_add (-1,  0) = -1
twobit_sat_add (-1,  1) =  0
twobit_sat_add ( 0, -2) = -2
twobit_sat_add ( 0, -1) = -1
twobit_sat_add ( 0,  0) =  0
twobit_sat_add ( 0,  1) =  1
twobit_sat_add ( 1, -2) = -1
twobit_sat_add ( 1, -1) =  0
twobit_sat_add ( 1,  0) =  1
twobit_sat_add ( 1,  1) =  1

The proof of correctness requires an executable semantics for C; these are available. The resulting proof will be large, even for this simple example, so let’s look at the partitions.

Only one test case takes the first return path:

twobit_sat_add (1, 1) = 1

Three test cases take the second return path:

twobit_sat_add (-2, -2) = -2
twobit_sat_add (-2, -1) = -2
twobit_sat_add (-1, -2) = -2

And these test cases take the third return path:

twobit_sat_add (-2,  0) = -2
twobit_sat_add (-2,  1) = -1
twobit_sat_add (-1, -1) = -2
twobit_sat_add (-1,  0) = -1
twobit_sat_add (-1,  1) =  0
twobit_sat_add ( 0, -2) = -2
twobit_sat_add ( 0, -1) = -1
twobit_sat_add ( 0,  0) =  0
twobit_sat_add ( 0,  1) =  1
twobit_sat_add ( 1, -2) = -1
twobit_sat_add ( 1, -1) =  0
twobit_sat_add ( 1,  0) =  1

Within each collection of test cases, no discontinuities in C’s addition operator are triggered, so our final proof should contain three top-level conjuncts. Is it possible to reconstruct this proof given only, for example, these three representative test cases?

twobit_sat_add ( 1,  1) =  1
twobit_sat_add (-2, -1) = -2
twobit_sat_add ( 0, -1) = -1

I don’t know, but I hope so. Perhaps techniques such as abstract interpretation can be used to reason effectively about the behavior of entire partitions. If so, then this proof based on three test cases would also work for saturating additions of any width (assuming that the implementation had available to it an addition operation at least one bit wider—if not, a bit more work is required).

To summarize, here’s the work to be done:

  • For some simple programming language where an executable formal semantics already exists, show that an exhaustive collection of test cases for a function can be automatically turned into a proof of correctness.
  • Define partition coverage in an appropriate fashion.
  • Show that an exhaustive proof of correctness can be turned into a proof by partitions.
  • Show that a single test case per partition is sufficient to generate an overall proof of correctness.
  • For extra credit, automatically detect the case where the test suite does not achieve 100% partition coverage.

I suspect that a lot of this hinges upon finding an appropriate way to define partitions. A conservative definition for what constitutes a partition will make the research easier (because there will be less going on inside each partition) but also less useful because more test cases will be required to achieve 100% partition coverage. A liberal definition of partition makes the testing part easier and the proofs harder. Clearly, in practice, it is not OK to insist (as I did above) that every test case in a partition has identical control flow: we’ll need to be able to abstract over loop counts and recursion depth.

The basic insight here is that some of the work of formal verification can be borrowed from the testing phase of software development. We can’t avoid testing even when we’re doing formal verification, so we might as well try to get more benefit out of the sunk costs. My hope is that for a smallish language subset, this project is within the scope of a PhD thesis. If anyone knows of work that already accomplishes some or all of the goals I’ve outlined here, please let me know.

str2long Contest Results Part 1

[NOTE: Reading this post only makes sense if you read and cared about this previous post.]

Ok, evaluating the submissions has been more work than I anticipated, and also things have gotten pretty busy at work, so I’m going to split the evaluation of the submissions into two parts. This first part will discuss objective metrics while the second will be subjective.

I received approximately 78 implementations of str2long(), including multiple submissions from a number of people. I say “approximately” since I deleted some submissions where the sender noticed it was buggy before I did. All submissions can be found on Github here. A total of 35 submissions pass all of my tests, which include many functional tests in addition to integer overflow checks using Clang’s -fsanitize=integer. My tests are reasonably good (I hope) but I still could have missed things, let me know if this is the case. I tested all submissions on an x86-64 Linux machine, targeting both 64-bit and 32-bit modes. The nastiest test case that I used is one that allocates a 9GB string that is mostly leading zeros; this caused a few submissions that passed all other tests to fail, of course because they used a 32-bit index somewhere. One might reasonably argue that this is a vanishingly unlikely use case, but it’s not prohibited by the specification so I tested for it.

Besides correctness, I also looked at code size and performance. Neither were explicit criteria for this contest, but all other things being equal faster code is better. Code size is mainly interesting because it serves as a very crude way to measure algorithmic complexity. In other words, code size answers the question: How much complexity remains after an optimizing compiler has looked at the function?

Since compilers are notoriously finicky about what they will and won’t optimize, I used a variety of compilers and report only the best result. The compilers are:

  • GCC 4.4.7, 4.5.4, 4.6.3, 4.7.2, and a pre-4.8 snapshot (r196703, compiled from SVN a few days ago)
  • Clang 3.0, 3.1, 3.2, and a pre-3.3 snapshot (r177225, compiled from SVN a few days ago)
  • Intel CC 12.0.5 and 13.1.0

Each compiler targeted x86-64 and was invoked at -O0, -O1, -O2, -Os, and -O3. I’m aware there are other flags controlling code generation but I don’t have time to play those games. Code size was measured using the size command after passing the compiler flags -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables which are necessary to prevent extraneous code from being generated and counted. With these preliminaries aside, here are the size results:

bytes compiler submission
96 icc-13.0.1 -O1 libc
109 gcc-snap -Os pascal
110 gcc-4.6 -Os peter
110 gcc-snap -Os ryan
112 icc-13.1.0 -O1 bernd_2
115 gcc-snap -Os tordek_2
116 gcc-snap -Os davidl
116 gcc-snap -Os davidl_2
122 clang-3.0 -Os yolanpa
129 gcc-4.6 -Os daniel_2
129 gcc-snap -Os francois_2
131 gcc-snap -Os jeffrey
136 gcc-4.4 -Os yang_2
137 gcc-snap -Os davide
138 gcc-4.4 -Os david_2
139 gcc-snap -Os thomas
145 gcc-snap -Os matthewf_2
145 clang-snap -Os chucky_2
148 gcc-4.4 -Os mikael_2
159 gcc-snap -Os greg
160 gcc-snap -Os phil
165 gcc-snap -Os mattias
172 icc-13.1.0 -O1 stefan
173 gcc-4.6 -Os matthew_3
175 gcc-4.4 -Os sidney
176 icc-13.1.0 -O1 patrick
182 gcc-snap -Os bastian
185 gcc-4.7 -Os andrew
203 clang-3.2 -Os till
232 clang-3.1 -Os olivier
251 gcc-snap -Os kevin
275 gcc-snap -Os renaud
276 gcc-4.6 -Os ben
283 gcc-snap -Os ethan_2
312 clang-snap -Os mats

The libc entry doesn’t count, it’s just a call to strtol that has been wrapped with a bit of logic to make it conform to the str2long specification. Perhaps interestingly, no compiler consistently wins. Happily, -Os is the most popular compiler flag.

Performance was measured by taking the best out of 10 runs on an otherwise idle Core i7-2600. The metric is the average number of nanoseconds required to convert a string including whatever overhead is incurred by the test harness. As Robert Graham observed, the test cases are perhaps skewed too heavily towards invalid inputs, which would tend to penalize code optimized for the valid case.

Here are the results:

ns compiler submission
21.4 icc-13.1.0 -O2 bernd_2
21.6 icc-13.1.0 -Os peter
22.4 gcc-4.7 -O2 yang_2
22.5 clang-3.0 -O1 ryan
22.6 gcc-4.7 -O2 till
23.8 icc-13.1.0 -O1 yolanpa
24.7 gcc-4.7 -O2 tordek_2
25.3 clang-3.2 -Os andrew
25.6 gcc-4.4 -O3 davidl_2
25.7 gcc-4.7 -O2 pascal
25.7 gcc-4.5 -O3 renaud
26.0 gcc-4.7 -O2 davidl
26.2 icc-12.0.5 -O3 stefan
26.8 gcc-snap -O3 bastian
27.1 gcc-4.7 -O2 chucky_2
27.2 clang-3.1 -O3 olivier
27.5 gcc-snap -O3 mikael_2
27.8 gcc-4.7 -O2 kevin
27.9 clang-3.2 -O2 davide
28.3 gcc-snap -O3 david_2
28.5 clang-3.2 -O1 patrick
28.7 gcc-snap -O3 sidney
29.6 gcc-4.7 -O3 phil
32.1 gcc-snap -O3 greg
33.1 clang-3.1 -O2 ben
33.1 icc-12.0.5 -Os jeffrey
33.8 icc-13.1.0 -O2 mattias
34.5 gcc-snap -O2 thomas
34.6 clang-snap -O3 matthewf_2
37.5 gcc-4.7 -O3 matthew_3
49.0 clang-3.1 -O1 francois_2
52.8 icc-12.0.5 -O2 libc
96.4 gcc-4.7 -Os daniel_2
103.3 gcc-4.4 -O3 mats
128.3 gcc-4.4 -O2 ethan_2

Again, no particular compiler wins, and this time no particular optimization flag consistently wins either. The strtol code from libc performs surprisingly well considering that it is implementing a slightly more generic specification than str2long.

These results would perhaps be improved by repeating them on a modern ARM platform and some good ARM compilers, but I don’t have those available right now.

This concludes the objective portion of the contest results. The followup will be subjective and will look at the strategies and idioms that people used to implement the str2long specification without overflowing integers.

Thanks to everyone who submitted code! This has been fun.

A Quick Coding Contest: Convert String to Integer Without Overflow

UPDATE: As of Saturday March 9, the contest is closed. Results will be posted in a few days. I’ve created a Github repository containing all submissions and my test harness.

Regular readers will know that I’m obsessed with integer overflows. One apparently simple piece of code that many programmers get wrong in this respect is converting strings to integers. It is slightly tricky. The goal of this contest is to submit the nicest (judged by me) implementation in C for this specification:

 * if input matches ^-?[0-9]+\0$ and the resulting integer is
 * representable as a long, return the integer; otherwise if
 * the input is a null-terminated string, set error to 1 and 
 * return any value; otherwise behavior is undefined
extern int error;
long str2long (const char *);

I deliberately didn’t name the function atol or strtol since its semantics are slightly nonstandard. The regular expression above probably isn’t standard either; it is simply trying to specify a null-terminated string containing only a signed decimal integer of arbitrary size. No other characters (even whitespace) are permitted.

A few extra things:

  • Don’t call library functions; your code should perform the conversion itself
  • It is fine to use constants from limits.h
  • You should write conforming C code that works on x86 and x86-64. I’ll probably only test entries using GCC and Clang but even so you are not allowed to use extensions such as GCC’s 128-bit integers
  • You may assume that implementation-defined behaviors are the standard ones for the x86 and x86-64 platforms on Linux machines
  • Your code must not execute undefined behaviors (use a recent Clang to check for this)

Please mail entries to me, not only to avoid spoiling the fun but also because WordPress loves to mangle C code left in the comments. Feel free to leave comments that don’t include code, of course. I’ll announce the winner in a few days, and will send the winner a small prize. By submitting an entry into this contest you are agreeing to let me use your entry in a subsequent blog post, either as a positive or negative example.

NOTE: The best Clang flags to use are: -fsanitize=integer -fno-sanitize=unsigned-integer-overflow

UPDATE: As of Saturday March 9, the contest is closed. Results will be posted in a few days. I’ve created a Github repository containing all submissions and my test harness.