Who Verifies the Verifiers?

Verification tools, like all complex software systems, contain design flaws and implementation errors. Moreover, they are not necessarily very easy to test. The other day I started to address the question: If someone claims that a piece of software has been formally verified, what should you believe about the software? The previous piece was non-technical; this one adds a bit of detail.

It often comes as a surprise to non-specialists that a formally verified program or system can still be wrong, but this is the case. To see how it happens, let’s break down a typical formal verification argument into the following pieces:

  1. A computer program or system that we want to reason about
  2. A model that assigns a mathematical meaning to the program
  3. An intuitive property that we wish to show the program has
  4. A mathematical formalization of the property
  5. A proof showing the model has the property

Even if the system to be verified contains no bugs and we have a useful property in mind, there is significant room for error in elements 2, 4, and 5 of this list. First I’ll discuss the difficulties, then give a few examples, and finally discuss solutions.


A lot of verification efforts go wrong when ascribing a mathematical meaning to the verified program. Although finding a faithful formalism for a small piece of code (merge sort, for example) is pretty easy, doing so for medium-sized or large codes in real programming languages remains a major challenge. These problems are exacerbated by the fact that most verification research focuses on reasoning about models of programs, as opposed to deriving good models in the first place.

Formalizing a system at the right level of abstraction is important. If the level of abstraction is too high, important details will be missing from the formalism. Some years ago I went to a talk by a well-known verification researcher who said he was providing a sound argument that a piece of C code had certain properties. I spoke up, pointing out that nothing in his analysis ruled out problems such as stack smashing attacks; he responded that his result was sound with respect to a model of C where memory safety errors were assumed not to occur. I wondered (but didn’t say aloud) why someone working for Microsoft would ignore the precise class of C program errors that leads to most of the problems with Microsoft’s software. Since that time, the state of the art has improved greatly, but we still see sloppy models of C with respect to integer behaviors, order of evaluation of arguments to functions, inline assembly code, and other details that can easily render a verification result meaningless.

There are plenty of other problems in formalizing a real language like C. The mathematical meaning of some parts of the standard are unclear. There are many dialects: Microsoft C, GNU C, ANSI C, C99, etc. Many multithreaded C and C++ programs depend on processor-level weak memory models, which are horrible. The header files and library functions pull all sorts of hard-for-formalize code into a program. A rigorous formalization of C++0x will be a major challenge because the language is massive.

If a system’s behavior is formalized at a level of detail that is too low, the complexity of the formalism may be so large that it cannot be dealt with. It is only fairly recently that proofs about non-trivial executable programs have become possible. Regardless of the level of abstraction that is chosen, the fundamental leakiness of all real abstractions can be a problem. C code dives down into assembly sometimes; the correctness of assembly code may depend on the timing properties of the chip being used; etc.

A pernicious problem is that models of programs may include subtle assumptions. For example, a CIL-based tool for analyzing C code (many such tools exist) implicitly assumes that the C compiler used to build the system evaluates function arguments in the same order that CIL does. Ferreting out all of these assumptions requires wizard-level expertise, and bogus assumptions may go undetected for years.


There are plenty of famous examples of faulty specifications such as requiring that the output of a sort routine is sorted, but failing to specify that the output is a permutation of the input. Similarly, one traditional definition of correctness is “returns the correct result or else fails to terminate” — not too useful when verifying embedded software that actually needs the answer. These are silly examples but checking the specification for a real piece of software is not easy and a serious problem in the specification may invalidate the whole verification result.

Even relatively simple properties such as memory safety contain plenty of room for error: Do they account for threads, interrupts, signal handlers, and DMA? Are holes poked in the safety system for accessing device registers and invoking inline assembly?

It is, as far as I know, not yet clear how one might formalize a property such as “never enters a no-fly zone” for a UAV control system in terms of a relatively faithful system model that includes aerodynamics, jet engine behavior, lots of sensors and actuators, a network of computers, and similar important details.


At some computer science conferences, a common kind of paper proposes an improved way to perform formal verification. The results section of this kind of paper contains some tables showing how long it took to verify some different pieces of software. Typically verification is faster now, or perhaps verification becomes possible where it wasn’t, before. The thing that is missing from many of these papers is a convincing argument that verification happened as opposed to, for example, a logic error causing the verifier to spin for a while and then return the wrong result. A nice recent paper showed that SAT and QBF solvers — the reasoning engines behind many program verification tools — are not as correct as we might have hoped.

Proofs are supposed to be formalized arguments that break intuitive arguments down into steps small enough that they are individually believable. The problem is that constructing such an argument about a real piece of software is tedious at about the same level as computing the average size of a grain of sand on the beach by measuring all of them. Thus, people have created tools such as HOL and Coq that can eliminate some of the tedium. However, the vast majority of program verification tools fail to emit a proof that could be checked using a mechanized prover. Rather, the proof is implicit in the execution of the tool. If we are to trust a result produced by this kind of tool, the trust has to have a different source.


One of my favorite papers that I’ve written described a tool that took a piece of executable software for AVR microcontrollers and returned an upper bound on the amount of memory that would be consumed by the call stack, including stack memory used by one or more interrupt handlers. Let’s look at the components of the argument made by this tool:

  • The system model is derived from the AVR documentation and is embedded in the C++ code comprising my analyzer.
  • The specification is of the form “under a set of assumptions, the stack memory consumed by this executable does not exceed N.” This was never formalized in math, but rather in English in the paper and also embedded in the code for the analyzer.
  • The proof is based on an abstract interpretation of each sequential flow of control (one per interrupt handler and one for the non-interrupt context) plus a customized algorithm for composing these partial results into an overall stack depth. The algorithms are described informally in the paper. No proof — formal or otherwise — was given.

At first glance it would seem that I’ve done everything possible to create an unconvincing argument. Why would anyone believe a result produced by this tool? The first major piece of evidence is based on the fact that the tool gives a numerical result — the worst-case stack depth — as opposed to the more typical yes/no answer. By comparing this depth against the worst observed stack depth executions of real AVR codes, it was shown to be conservative in complicated cases and exact in simple cases. Second, in the paper I tried to be very explicit about the assumptions under which the tool would produce valid results. The intent was for readers to think about these assumptions and the system model, and conclude that their combination is plausibly correct. Third, the AVR semantics and the abstract interpretation derived from it were extensively tested, for example by turning low-level analysis results into assertions and stress-testing systems compiled with these assertions. Finally, the tool is open source. None of these arguments creates compelling evidence that my tool is bug-free, of course, but they do serve to give us a bit of confidence in its results.

The best way to argue against subtle bugs would be to formalize the AVR instruction set’s semantics and the worst-case stack depth property in, for example, HOL4. Then, my tool would have to be altered to emit a proof witness to back up its stack depth claims. I thought about this back in 2003 when I was doing the work, and decided the 2-3 person years it probably would have required weren’t worth it. So what did my tool really accomplish? The best we can say is that it emits a claim that could, in principle, be formalized, barring design and implementation errors.

Let’s look at another example. A few years ago Patrick Cousot gave a talk called “Proving the Absence of Run-Time Errors in Safety-Critical Avionics Code.” The talk abstract states that

…abstract-interpretation-based static analyzers have no false negative hence are sound by design. As opposed to bug-finders, static analyzers will never omit to signal an error that can appear in some execution environment.

“Proving” is a pretty generous description of an argument that is buried in a large, complicated tool. Moreover, Astree is closed-source, preventing independent validation of this proof. If Astree has not been designed and implemented correctly, or if the semantics of C have not been formalized correctly, the proof holds no water. Was Astree validated, for example by turning its intermediate results into assertions? I assume so, since you’d have to be crazy not to use these kinds of methods to debug a large analyzer. But this is not described in any of the papers (that I’ve read, at least). Rather, the argument for correctness is basically “we’re smart, trust us.” But of course the avionics software developers could have saved everyone some time by making the same claim about their own system.

Does Astree’s result for the Airbus software pass the piano test? Speaking for myself: given the available published evidence, absolutely not. I”m not really criticizing Astree — which is an amazing piece of work — but rather taking a critical look at exactly what kind of argument it makes about the analyzed system.

Could Astree’s argument for formalized as a result that could be verified using a mechanical theorem prover? Again, yes — barring design and implementation errors. But it would probably take in excess of 10 person-years of effort. The Astree team is unlikely to invest that effort because they have already made public claims equivalent to stating that the tool is infallible. (This reminds me of a story I heard from a formal verification researcher who was told by a large safety-critical system vendor that they were allowed to prove the company’s software correct, but by definition they weren’t going to find any bugs in it.)

A rather different kind of example comes from Column 4 of Programming Pearls, where Bentley proves that a pseudocode binary search implementation is correct, and then in the next column, derives a C implementation. In 2006 Joshua Bloch noticed that the proof is faulty for any target programming language that has fixed-width integers, including C, when an array at least half the size of the maximum representable integer value is sorted. Bentley’s verification effort failed because his model assumed that integers in C behave mathematically, whereas integer overflows in C either have 2’s complement behavior or else completely destroy the meaning of the program, depending on whether the ints are unsigned or signed and depending on what version of C is being used. Unfaithful models of integer behavior are a common source of false verification results. Verification engineers and researchers have taken note of this and many modern tools have accurate models of integers.

The most convincing verification arguments are those where the proof is machine-checked. Recent examples include seL4, CompCert, and Myreen’s Lisp interpreter. All of these are great results. The one that I have the most experience with is CompCert, which my group has tested extensively. We found a few bugs in its unproved frontend and a few minor holes in its formalization of the PowerPC assembly language, but we did not find any of the incorrect safety checks in optimization passes that plague every other compiler we have studied. Furthermore, in recent versions CompCert provably (and in practice) respects the volatile invariant, whereas every other compiler we have tested gets it wrong.


The closest thing we have to a gold standard for mathematical proof is to make the proof checkable by machine. Could a proof checker be wrong? Sure, but people have made some very small, simple checkers, and different implementations of them can be run on totally different hardware and software systems, greatly reducing the scope for undetected proof errors. The main problem with machine-checked proofs is that creating them requires a pedantic level of commitment to detail that most people do not have. Proof-producing verifiers are the solution, but currently these are so hard to create that very few researchers do so. In the future we can look forward to program analysis infrastructure that handles most of the grunt-work of producing proofs, giving researchers (and system certifiers) the benefit without most of the pain.

Is there a gold standard for assigning meanings to realistic software artifacts? Generally not; but perhaps so, in a few very limited domains. A formalization of the ARM instruction set architecture by some folks at Cambridge is one example. There are three arguments in favor of trusting this particular formalization. First, Fox proved that it was implemented by a particular version of the ARM hardware. Second, the semantics is executable: it can be used to interpret ARM programs, and the results checked against the results on real hardware. Third, the semantics has been around for a while and has been used by many different research groups. It is surprising to some people that a formal model — the basis for all formal reasoning about programs — should itself require extensive verification and validation, but this is absolutely the case. In the long run I think we must hope that relatively few kinds of artifacts will require formal models, and that well-vetted ways of creating these models will emerge. The current situation, where for example everyone who verifies C code comes up with their own semantics, is pretty terrible.

What about program properties? There don’t seem to be widely accepted specifications for memory safety for C, exception safety for C++, and related reusable properties. There exist plenty of nice specification languages such as SPARK ADA’s and ACSL. Even so, creating nontrivial specifications is really hard.

In my “piano test” post I said that NASA’s “test what you fly, fly what you test” adage needed to be adapted for software. Here’s how to do it. The unverifiable elements of program verification — the models and specifications — need to be thoroughly validated through use, through many-eyes techniques, and through independent mathematical checks. They need to lead to executable system models that can be stress-tested. Whenever possible, intermediate results computed by analyzers must be turned into executable checks and the resulting checked codes must also be stress-tested.

In the long run, are formal verification tools that don’t produce proofs any good? Sure. First, we can gain trust in these tools just like we have gained trust in the software that today controls our avionics systems, antilock brakes, and pacemakers. Second, they can find bugs: a bug is a bug, regardless of the quality of the tool that found it. Third, these tools may be useful simply by reporting false positives that make us look closely at our software to make sure they’re false. There’s a great paper showing that only one static analysis tool refutes the null hypothesis that “static analyzer performance is equivalent to diagonal random guessing.”


A formal verification result is based on a mathematical property that attempts to formalize something we care about, a mathematical model that attempts to formalize the behavior of a program or system of interest, and a proof that attempts to argue that the model has the property. Practically speaking, the model, property, and proof may be one of:

  • informal and implicit (embedded in the source code of a checking tool, for example)
  • documented in technical English (or another human language)
  • written in informal math
  • fully formalized
  • formalized and also machine checked

At present, taking the “fully formalized and machine checked” option adds massive overhead to a verification effort. This is a problem because the systems that we want to verify are getting larger and more complicated each year. Advances in modularity and proof techniques are needed, and appear to be coming.

7 responses to “Who Verifies the Verifiers?”

  1. Hello John,

    I read your new blog post with interest (I was expecting it
    ever since you hinted at it last time).

    There is a three-part invited blog post I wrote a long while
    back on the same kind of subject. Well, it was in reaction
    to another article promoting one of the heuristic, widely
    distributed analyzers as a verification tools. I understand
    how phrases such as “sound by design” in Astrée
    publications may seem dismissive and hand-wavy,
    but there is also the problem of positioning “sound by
    design” analyzers with respect to “unsound by design”
    ones. A lot of people think about the latter when they
    hear “static analysis”, and a lot of people, apparently
    educated way beyond their intelligence, think that
    it’s a mathematical impossibility to build a sound

    So that essay was, in addition to all its weaknesses,
    written for a different public:




  2. Hi John,

    nice post and well-argued. Your points are, in my opinion, the key reason why formal verification can never replace thorough testing, though both techniques should be used to smoothly complement each other. Anyway, this discussion has a long history, if you think about Knuth’s “beware of bugs in the above code; I have only proved it correct, not tried it”.