Proofs from Tests

An idea that I’ve been interested in for a while is that a good test suite should be able to directly support formal verification. How would this work, given that testing usually misses bugs? The basic insight is that a test case is usually telling us about more than just one execution: it’s telling us about a class of similar executions that will all give the correct result if the test case succeeds. Partition testing is based on this same idea.

To get started, let’s do a little thought experiment where our test suite is exhaustive: it tests every possible input value. Although this is obviously very restrictive, exhaustive testing is sometimes feasible for non-trivial examples such as a function that decodes a binary ARM instruction.

What does it take to make a proof out of an exhaustive collection of test cases? We need some sort of a mathematical model for how the system works—an executable formal semantics—and also a mathematical model for how the system is intended to work: a formal specification. Then, we basically unroll the execution of the system on each test case and show that the result is equivalent to the specified result. The conjunction of this collection of statements about the system is our proof. The problem is that it’s going to be a very large, unsatisfying proof.

So the next problem is turning the cumbersome proof into a prettier one. The basic strategy is to look for a collection of “similar” conjuncts: these should all have the same control flow and, for all operations performed (addition, subtraction, etc.) the inputs and outputs should live in the same continuous region. Two’s complement addition, for example, has a discontinuity where it wraps around; the 8-bit signed addition operations 50+50 and 100+100 therefore do not live in the same continuous region. I’m guessing that for a suitable definition of “continuous” (which may not be totally straightforward for bitwise operations) it should be possible to automatically collapse a collection of cases into a single case that covers all of them. In terms of partition testing, our final proof will have one conjunct for each partition.

But we still haven’t accomplished anything particularly useful, since we can’t exhaustively test most of the codes that we care about. What I hope is that if we actually went through this exercise, we’d learn something about how to attack the more general case where we have a non-exhaustive collection of test cases and we want to automatically expand each of them to reason about an entire partition of the input space. If we could do that, a collection of test cases achieving 100% partition coverage would lead to a proof of correctness.

Consider a 2-bit saturating adder where the result of the addition sticks at the minimum or maximum representable value instead of overflowing. This function takes a pair of values in the range -2..1 and returns a value in the same range. A C implementation would be:

int twobit_sat_add (int a, int b)
  int result = a+b;
  if (result>1) return 1;
  if (result<-2) return -2;
  return result;

Here is an exhaustive collection of test cases with results:

twobit_sat_add (-2, -2) = -2
twobit_sat_add (-2, -1) = -2
twobit_sat_add (-2,  0) = -2
twobit_sat_add (-2,  1) = -1
twobit_sat_add (-1, -2) = -2
twobit_sat_add (-1, -1) = -2
twobit_sat_add (-1,  0) = -1
twobit_sat_add (-1,  1) =  0
twobit_sat_add ( 0, -2) = -2
twobit_sat_add ( 0, -1) = -1
twobit_sat_add ( 0,  0) =  0
twobit_sat_add ( 0,  1) =  1
twobit_sat_add ( 1, -2) = -1
twobit_sat_add ( 1, -1) =  0
twobit_sat_add ( 1,  0) =  1
twobit_sat_add ( 1,  1) =  1

The proof of correctness requires an executable semantics for C; these are available. The resulting proof will be large, even for this simple example, so let’s look at the partitions.

Only one test case takes the first return path:

twobit_sat_add (1, 1) = 1

Three test cases take the second return path:

twobit_sat_add (-2, -2) = -2
twobit_sat_add (-2, -1) = -2
twobit_sat_add (-1, -2) = -2

And these test cases take the third return path:

twobit_sat_add (-2,  0) = -2
twobit_sat_add (-2,  1) = -1
twobit_sat_add (-1, -1) = -2
twobit_sat_add (-1,  0) = -1
twobit_sat_add (-1,  1) =  0
twobit_sat_add ( 0, -2) = -2
twobit_sat_add ( 0, -1) = -1
twobit_sat_add ( 0,  0) =  0
twobit_sat_add ( 0,  1) =  1
twobit_sat_add ( 1, -2) = -1
twobit_sat_add ( 1, -1) =  0
twobit_sat_add ( 1,  0) =  1

Within each collection of test cases, no discontinuities in C’s addition operator are triggered, so our final proof should contain three top-level conjuncts. Is it possible to reconstruct this proof given only, for example, these three representative test cases?

twobit_sat_add ( 1,  1) =  1
twobit_sat_add (-2, -1) = -2
twobit_sat_add ( 0, -1) = -1

I don’t know, but I hope so. Perhaps techniques such as abstract interpretation can be used to reason effectively about the behavior of entire partitions. If so, then this proof based on three test cases would also work for saturating additions of any width (assuming that the implementation had available to it an addition operation at least one bit wider—if not, a bit more work is required).

To summarize, here’s the work to be done:

  • For some simple programming language where an executable formal semantics already exists, show that an exhaustive collection of test cases for a function can be automatically turned into a proof of correctness.
  • Define partition coverage in an appropriate fashion.
  • Show that an exhaustive proof of correctness can be turned into a proof by partitions.
  • Show that a single test case per partition is sufficient to generate an overall proof of correctness.
  • For extra credit, automatically detect the case where the test suite does not achieve 100% partition coverage.

I suspect that a lot of this hinges upon finding an appropriate way to define partitions. A conservative definition for what constitutes a partition will make the research easier (because there will be less going on inside each partition) but also less useful because more test cases will be required to achieve 100% partition coverage. A liberal definition of partition makes the testing part easier and the proofs harder. Clearly, in practice, it is not OK to insist (as I did above) that every test case in a partition has identical control flow: we’ll need to be able to abstract over loop counts and recursion depth.

The basic insight here is that some of the work of formal verification can be borrowed from the testing phase of software development. We can’t avoid testing even when we’re doing formal verification, so we might as well try to get more benefit out of the sunk costs. My hope is that for a smallish language subset, this project is within the scope of a PhD thesis. If anyone knows of work that already accomplishes some or all of the goals I’ve outlined here, please let me know.

20 replies on “Proofs from Tests”

  1. An exhaustive collection of test cases (as you describe here with the spec) is a proof. QED. IMO, for the realm of machine assisted formal verification, the value of a beautiful proof is pretty minimal.

  2. Robby, a collection of exhaustive test cases isn’t necessarily the kind of proof that we want. It tells us that somehow the entire hardware+software stack that we’re using gives the right results, but if we want to verify just the software that we wrote, we need to do something different.

    The distinction is important if, for example, I have a piece of software that is exhaustively correct when run on a Pentium with the fdiv bug. This proof is of little value if my software fails to work on a fixed CPU.

    Similarly, we can write a C program that is exhaustively correct when compiled without optimization. But again this proof is of little use if the same code fails when compiled with optimization turned on. I shouldn’t need to add that writing this kind of code is all too easy.

    I can’t see a way to avoid these kinds of problems without writing a formal semantics for the implementation language.

  3. Hi John,

    to related notions that come to mind:

    1. You might imagine using Mike Ernst’s “Daikon” or something like it to infer “likely” invariants from the above test suite. You might then feed said invariants into something like ESC/Java or FramaC to “verify” them (i.e. check that they are inductive and are strong enough to imply whatever property you are interested in.)

    2. In a not entirely dissimilar spirit, Corina Pasareanu and co. wrote some papers about executing tests and then “abstracting” the generated executions into “abstract states” which you can then verify to be inductive and so on. I can’t find the exact paper but its one of the Java Pathfinder papers by Corina, Willem Visser and others.



  4. Ranjit, thanks! I’ll definitely dig up the Pasareanu paper which sounds like just what I want to read about.

    I’m a little not-optimistic about dynamic invariants since many invariants just seem too complicated to guess (I’m thinking about red-black invariants or something like that). But it was a long time ago that I read those papers, I’m sure the techniques have gotten better…

  5. Hi Alex, I think I see what you’re getting at but I haven’t spent enough time thinking about predicate abstraction to have any kind of useful intuition here.

  6. Yeah, I need to think it through — there, you’re letting one abstract run “handle” a large set of concrete cases, here you want the concrete tests to serve that purpose…

  7. The folks at U Maryland developed the RubyDust system [POPL 2011], which gives a type soundness guarantee given sufficient test coverage (which is much less than exhaustive). This is weaker than full correctness, but very much along the lines you’re suggesting.

  8. That’s a great research program, to try and get more from tests than just the absence of bugs that you thought of.

    I think you miss Robby’s point, by the way, and it is that this example is too small to illustrate some important issues. In the example you give, the exhaustive test suite is already small, so there’s little point in minimizing it further.

    An example with a bigger exhaustive suite would be one that takes an integer as input and computes a cubic polynomial. Now an exhaustive suite is unreasonably large. The minimal test suite, meanwhile, is just four inputs. Additionally, you need to verify that the program in fact computes a cubic. So there’s the outline of the proof: you verify that the program computes some form of cubic, and then you use testing to verify it is the right one.

    This last point raises another issue: you need to consider the spec, not just the program, in figuring out your minimal test suite. You can’t examine the program and see that it’s computing a cubic and then make four test cases. Maybe the spec wants something more complicated than a cubic.

    Sam’s example sounds really good for coming up with achievable partial results. Instead of going after specs in general, maybe go after some specific forms of specs, such as a lack of null dereference?

  9. How would this apply to the following:

    int MulBy2(int x) { return x + 2; }

    assert(MulBy2(2) == 4);

    What tells us that we need a 2nd test case to distinguish the two functions? As there is only one possible code path (and only one abstract state) one test should be enough from any tools view.

  10. And by “two functions” I mean (x+2) and (x*2) which happen to behave identically regarding the test case.

  11. Or put another way: The simplest proof that a give chunk of code works for a given input set is to run the code with that input and examine the results. The next steps are:

    1) expand that single specific test case to the largest provably equivalent class of cases.
    2) find a new input set that is not in that equivalence class.
    3) run it and (possibly interactively) learn if it is also working correctly.
    3) go back to 1 until it can be proven that the set of discovered equivalence classes is covering (it need not be a partition).

  12. Thanks Sam! I’ll maybe do an updated post later on after reading that and the Pathfinder paper.

    Lex, you’re right, the partitioning depends on the specification as well as the program.

    Hi Tobi, the proof is going to have to show that the code and specification are equivalent for the entire partition. We definitely can’t just look at this one test case and be done with it. My hypothesis is that this is not so difficult when both the program and the specification are continuous functions for that particular set of inputs. In other words: discontinuous behavior is one of the major reasons that proving things about software is hard.

    So basically the answer that we get from running the test case is not the important thing. The important thing about the test case is that it is showing us a concrete path through the system under test (and the specification) that we will need to reason about.

  13. bcs, that’s a pretty good description of concolic testing. I think the Microsoft SAGE approach is a good start to this type of equivalence testing, although the equivalence class inferences may not be ideal.

  14. Sharkey: the big difference is that in concolic testing, there isn’t really an “equivalence class” — just a search for new coverage. Determining that a set of inputs/paths form an equivalence class is (in general) a very difficult task, and not one concolic testing that I know of attempts to perform.

    All coverage (which is a motivation for concolic testing) is guessing (quite bad, but not totally useless) equivalence classes.

    Note that under some (bad) assumptions coverage even works fine. Imagine a program and spec such that:

    1. The program is loop free.
    2. Neither the spec nor the program contain any predicates over the inputs or outputs other than those in the conditional branches of the program.
    3. The program does not output the inputs, it only outputs fixed constants.
    4. The spec only describes, under those predicates over inputs, what the final output of the program will be.
    5. The program never writes any values, it only makes decisions.

    Obviously for this class of programs (and they do exist, at least as functions) branch coverage is a perfect equivalence class criteria, and exhaustive branch coverage tests prove correctness (assuming none of them output wrong by the spec).

    Of course, most programs have loops and write to memory and output things other than constants, etc.

  15. (In case I wasn’t clear enough: I have long thought that a few test cases can help a lot when it comes to proving things about programs. I’ve not found a way to turn this into an research contribution, but it’s been in the back of my mind for at least 10 years now. I was reacting to a different aspect of your post, namely the idea that some proofs are better than others. When you’re dealing with machine-assisted type settings, the value of a better proof is not anything like it is when you just have people reading your proof. In the ‘only people-reading’ case, making a better proof actually provides non-trivial additional evidence for the theorem, since the better proof is probably easier to read and verify. (Not that it has _no_ value, but just that it becomes much less exciting.)

    Obviously, some theorems are better than others. When John wrote “The distinction is important if, for example, I have a piece of software that is exhaustively correct when run on a Pentium with the fdiv bug. This proof is of little value if my software fails to work on a fixed CPU.” then I would characterize this as a different theorem, not just a different proof.

  16. Hi Robby, I agree with you about the value of conciseness in machine generated proofs. The main reason I introduced the proof-optimization step here was to set the stage for its tricky counterpart where we take a proof for a single input (derived from a test case) and try to generalize it to make a statement about an entire partition of the input space.

Comments are closed.