How Does Formal Verification Affect Software Testing?

This has been a difficult piece to write and I’ve already deleted everything and started over more than once. So, I’m going to take the easy way out and structure it as a sequence of questions and answers.

What does formal verification mean? Something like “using mathematical techniques to convincingly argue that a piece of software implements a specification.” The specification may be very domain-specific, as in “the robot may not injure a human being or, through inaction, allow a human being to come to harm,” or it may be generic, as in “no execution of the system dereferences a null pointer.” So far, we know extremely little about what might be involved in verifying a specification such as Asimov’s Laws that identifies entities in the real world, as opposed to entities inside the software system.

Do we need both testing and formal verification? Yes. It’s absurd to think we could do without testing. Testing is how we create systems that work. On the other hand, testing alone suffers from needle-in-haystack problems where critical software almost certainly has surprising behaviors that we need to know about, but we don’t know how to create test inputs that trigger them.

Why would testing and formal verification interact at all? Aren’t these separate activities? Today, testing and formal verification are largely independent activities. I’ve heard of a very small number of cases where formal verification is considered an acceptable substitute for running some kinds of tests. In the long run we can hope to see more of this.

Under what conditions can formal verification reduce the amount of testing that is required? This is difficult. The basic observation is that formal verification tends to eliminate the possibility of certain types of bugs in a narrow layer of the system—the one that is being verified. To be concrete, let’s say that I prove that my robot software faithfully follows Asimov’s laws, but then I compile it with a buggy compiler, run it on a buggy operating system, run it on a buggy processor, or run it inside a robot with a defective camera that cannot reliably recognize human beings. This robot, of course, will fail to follow Asimov’s laws; the proofs about its software are rendered irrelevant by violated assumptions. These problems need to be discovered via testing (which, of course, may also fail to reveal them). To make formal verification apply to a wider range of layers, we need verified operating systems, compilers, libraries, chips, and more. Of course, people are working on all of these things, though at present the chaining together of correctness proofs for a realistic software stack has not yet happened.

Another problem with formal verification is that the specifications themselves are often quite complicated and difficult to debug; gaining confidence in a specification requires a testing process not very different from how we gain confidence in code. Even if we have non-buggy formal specifications, it can be unclear what the specification means in practice, and it can be unclear that the specification describes all of the behaviors that we might care about. Usability considerations are an example of a property of software that we care about, that can be tested, but that may not be possible to formally specify and verify.

But let’s get back to the question: Under what conditions can formal verification replace testing? My answer would be:

  • The formal verification tool is considered to be trustworthy.
  • The specification is considered to be trustworthy.
  • The specification is considered to capture all properties of interest.
  • Layers of the system not being verified (compiler, OS, etc.) are considered to be trustworthy.

Of course, it may be difficult to satisfy all of these criteria. In contrast, testing makes many fewer assumptions, but rather provides end-to-end checks of the compiler, the OS, the code being developed, etc.

Why do we want formal verification to replace some kinds of testing? The reason is simple: cost. Although formal verification is currently quite expensive, techniques are improving every year. Testing techniques are also improving, but not necessarily at a very rapid rate. The fundamental problem with testing is that we can’t do a very good job of it in any realistic amount of time. The great hope of formal verification is that once the large initial investment is made to get it going, incremental re-verification will be relatively inexpensive, permitting high-quality software to be delivered rapidly. As the technologies improve, even the initial costs are likely to come down. In the long run, reduction in cost, not reduction in defects, will be the killer app for formal methods.

How about the other way around: When can testing replace formal verification? Testing all possible behaviors of a system is formal verification; it is a kind of proof by exhaustion. Unfortunately, exhaustive testing is only feasible in very narrow circumstances.

Can testing and formal verification interact in other ways? Definitely, and I think the possibilities here are very exciting. We are already seeing excellent systems like Klee and SAGE where formal methods are used to produce test cases. In the future we will start to see patchwork verification efforts where part of the code is verified to be correct, part of it is verified only to be free of certain bugs, and part of it is not verified at all. At that point we will need Klee-like tools that take these different levels of verification into account and generate test cases that maximally increase our confidence in the correctness of the unverified code. Even a fully formally verified software stack may contain some surprises, for example discontinuities in the functions it implements or perhaps unwarranted sensitivity to changes in input variables. We want test case generators for these too.


11 responses to “How Does Formal Verification Affect Software Testing?”

  1. I very much expect testing will remain the cheapest way to show that the common case works.

    Put another way, testing and formal methods converge as you expand the one and restrict the other. The spec becomes a finite and concrete set of cases and the proof becomes direct evaluation of the machine under those cases.

  2. Hi bcs, I’m more optimistic than you are about formal methods! I’ve heard some nice stories about Centaur (maker of slow x86 chips) using a nightly formal verification run that essentially replaces testing for day-to-day purposes for many developers. In other words, if you checkin a change that causes the design to not verify, then you need to fix it.

    Of course, most of the best formal verification success stories are from hardware. But I believe that will change over coming decades. As in the Centaur example, it’ll look very much like testing to most of us—which is good.

  3. Unless the formal verification can be set up *from scratch* in O(minutes) and can run in <O(minute), it is too slow for the use I'm thinking of.

    I'm looking at the increment from "it compiles" to "I think this might be correct in the most common case". In other words, an improvement on "run the code and eyeball the output".

  4. Hi bcs, it doesn’t seem likely that realistic use cases for FM will take minutes to setup in the foreseeable future.

    I think it is more the case that we’ll use formal verification for durable, critical, and complex artifacts like compilers, OSes, and some kinds of libraries. These live for so long that we can amortize significant tool costs.

    I agree that for the vast majority of software, very lightweight formal methods + testing remains the main thing we do.

  5. I feel like this post doesn’t talk about the most common cases for either testing or formal methods. Mostly, testing tests a single layer of the system in the form of unit tests. Most formal methods used on software are lightweight formal methods like type systems or Java’s definite assignment analysis. I think this is even true on significant systems like OSs and compilers, although not for safety-critical systems. And I think it is in these cases where formal methods have made the most inroads in replacing testing.

  6. Two thoughts:

    * This post isn’t complete without citing Knuth’s quote ““Beware of bugs in the above code; I have only proved it correct, not tried it.”
    * It’ll be interesting to see how the interplay falls out with DO178C and the new formal methods supplement.

    John, do you have an opinion on what the trends are or what they should be, particularly for embedded systems?

  7. “I think it is more the case that we’ll use formal verification for durable, critical, and complex artifacts like compilers, OSes, and some kinds of libraries.”

    I’m going to (I think) join Sam in suggesting that another case may be things that are throw-away but turn out to eventually be pretty easy to use formal methods on (and are in contexts where nobody worries much about the underlying layers). At least as programming currently stands, people code up simple variants of common data structures fairly often, “libraries” for some special purpose with a small interface and where a large portion (but not all) of correctness is equivalence under some projection to a standard library data structure (or just “implements a multiset” etc.) Right now I think random testing is the way to go for these, but it doesn’t take a big stretch to imagine push-button model checking becoming the easiest solution.

  8. Hi Sam, “Mostly, testing tests a single layer of the system in the form of unit tests.” is not generally true (in my experience) for systems like compilers and OSes. Where are the unit tests for units of GCC, LLVM, and Linux for example? I do not believe these exist.

  9. Alex, that’s an interesting proposition that we might someday use heavyweight formal methods for throwaway code. I think it boils down to specification reuse. If, as you say, I can grab a multiset specification then great! But my fear is that much of the throwaway code has little one-off differences that make it incompatible with existing specs. If not, we’d just use an existing implementation of the existing spec.

  10. John — my claim is that a decent percent of them are “and also” implementations. You want it to implement multiset, “and also” do some bookkeeping efficiently, or compress something, etc. For not-quite-multiset, yeah, you’re probably not going formal.