Who Verifies the Verifiers?

Verification tools, like all complex software systems, contain design flaws and implementation errors. Moreover, they are not easy to test. The other day I started to address the question: if someone claims that a piece of software has been formally verified, what should you believe about the software? The previous piece was non-technical; this one adds a bit of detail.

It often comes as a surprise to non-specialists that a formally verified program or system can still be wrong, but this is the case. To see how it happens, let’s break down a typical formal verification argument into the following pieces:

  1. A computer program or system that we want to reason about
  2. A model that assigns a mathematical meaning to the program
  3. An intuitive property that we wish to show the program has
  4. A mathematical formalization of the property
  5. A proof showing the model has the property

Even if the system to be verified contains no bugs and we have a useful property in mind, there is significant room for error in elements 2, 4, and 5 of this list. First I’ll discuss the difficulties, then give a few examples, and finally discuss solutions.


Many verification efforts go wrong when ascribing a mathematical meaning to the verified program. Although finding a faithful formalism for a small piece of code (merge sort, for example) is pretty easy, doing so for medium-sized or large codes in real programming languages remains a major challenge. These problems are exacerbated by the fact that most verification research focuses on reasoning about models of programs, as opposed to deriving good models in the first place.

Formalizing a system at the right level of abstraction is important. If the level of abstraction is too high, important details will be missing from the formalism. Some years ago I went to a talk by a well-known verification researcher who said he was providing a sound argument that a piece of C code had certain properties. I spoke up, pointing out that nothing in his analysis ruled out problems such as stack smashing attacks; he responded that his result was sound with respect to a model of C where memory safety errors were assumed not to occur. I wondered (but didn’t say aloud) why someone working for Microsoft would ignore the precise class of C program errors that leads to most of the problems with Microsoft’s software. Since that time, the state of the art has improved greatly, but we still see sloppy models of C with respect to integer behaviors, order of evaluation of arguments to functions, inline assembly code, and other details that can easily render a verification result meaningless.

There are plenty of other problems in formalizing a real language like C. The mathematical meaning of some parts of the standard is unclear. There are many dialects: Microsoft C, GNU C, ANSI C, C99, etc. Many multithreaded C and C++ programs depend on processor-level weak memory models, which are horrible. Header files and library functions pull all sorts of hard-to-formalize code into a program. A rigorous formalization of C++0x will be a major challenge because the language is massive.

If a system’s behavior is formalized at a level of detail that is too low, the complexity of the formalism may be so large that it cannot be dealt with. It is only fairly recently that proofs about non-trivial executable programs have become possible. Regardless of the level of abstraction that is chosen, the fundamental leakiness of all real abstractions can be a problem. C code dives down into assembly sometimes; the correctness of assembly code may depend on the timing properties of the chip being used; etc.

A pernicious problem is that models of programs may include subtle assumptions. For example, a CIL-based tool for analyzing C code (many such tools exist) implicitly assumes that the C compiler used to build the system evaluates function arguments in the same order that CIL does. Ferreting out all of these assumptions requires wizard-level expertise, and bogus assumptions may go undetected for years.


There are plenty of famous examples of faulty specifications, such as requiring that the output of a sort routine be sorted but failing to specify that the output is a permutation of the input. Similarly, one traditional definition of correctness is “returns the correct result or else fails to terminate” — not too useful when verifying embedded software that actually needs the answer. These are silly examples, but checking the specification for a real piece of software is not easy, and a serious problem in the specification may invalidate the whole verification result.

Even relatively simple properties such as memory safety contain plenty of room for error: Do they account for threads, interrupts, signal handlers, and DMA? Are holes poked in the safety system for accessing device registers and invoking inline assembly?

It is, as far as I know, not yet clear how one might formalize a property such as “never enters a no-fly zone” for a UAV control system in terms of a relatively faithful system model that includes aerodynamics, jet engine behavior, lots of sensors and actuators, a network of computers, and similar important details.


At some computer science conferences, a common kind of paper proposes an improved way to perform formal verification. The results section of this kind of paper contains some tables showing how long it took to verify some different pieces of software. Typically verification is faster now, or perhaps verification becomes possible where it wasn’t before. The thing that is missing from many of these papers is a convincing argument that verification actually happened, as opposed to, for example, a logic error causing the verifier to spin for a while and then return the wrong result. A nice recent paper showed that SAT and QBF solvers — the reasoning engines behind many program verification tools — are not as correct as we might have hoped.

Proofs are supposed to be formalized arguments that break intuitive arguments down into steps small enough that they are individually believable. The problem is that constructing such an argument about a real piece of software is about as tedious as computing the average size of a grain of sand on the beach by measuring all of them. Thus, people have created tools such as HOL and Coq that eliminate some of the tedium. However, the vast majority of program verification tools fail to emit a proof that could be checked using a mechanized prover. Rather, the proof is implicit in the execution of the tool. If we are to trust a result produced by this kind of tool, the trust has to have a different source.


One of my favorite papers that I’ve written described a tool that took a piece of executable software for AVR microcontrollers and returned an upper bound on the amount of memory that would be consumed by the call stack, including stack memory used by one or more interrupt handlers. Let’s look at the components of the argument made by this tool:

  • The system model is derived from the AVR documentation and is embedded in the C++ code comprising my analyzer.
  • The specification is of the form “under a set of assumptions, the stack memory consumed by this executable does not exceed N.” This was never formalized in math, but rather in English in the paper and also embedded in the code for the analyzer.
  • The proof is based on an abstract interpretation of each sequential flow of control (one per interrupt handler and one for the non-interrupt context) plus a customized algorithm for composing these partial results into an overall stack depth. The algorithms are described informally in the paper. No proof — formal or otherwise — was given.

At first glance it would seem that I’ve done everything possible to create an unconvincing argument. Why would anyone believe a result produced by this tool? The first major piece of evidence is that the tool gives a numerical result — the worst-case stack depth — as opposed to the more typical yes/no answer. By comparing this bound against the worst stack depth observed during executions of real AVR codes, the tool was shown to be conservative in complicated cases and exact in simple cases. Second, in the paper I tried to be very explicit about the assumptions under which the tool would produce valid results. The intent was for readers to think about these assumptions and the system model, and conclude that their combination is plausibly correct. Third, the AVR semantics and the abstract interpretation derived from it were extensively tested, for example by turning low-level analysis results into assertions and stress-testing systems compiled with these assertions. Finally, the tool is open source. None of these arguments creates compelling evidence that my tool is bug-free, of course, but they do give us a bit of confidence in its results.

The best way to argue against subtle bugs would be to formalize the AVR instruction set’s semantics and the worst-case stack depth property in, for example, HOL4. Then, my tool would have to be altered to emit a proof witness to back up its stack depth claims. I thought about this back in 2003 when I was doing the work, and decided the 2-3 person years it probably would have required weren’t worth it. So what did my tool really accomplish? The best we can say is that it emits a claim that could, in principle, be formalized, barring design and implementation errors.

Let’s look at another example. A few years ago Patrick Cousot gave a talk called “Proving the Absence of Run-Time Errors in Safety-Critical Avionics Code.” The talk abstract states that

…abstract-interpretation-based static analyzers have no false negative hence are sound by design. As opposed to bug-finders, static analyzers will never omit to signal an error that can appear in some execution environment.

“Proving” is a pretty generous description of an argument that is buried in a large, complicated tool. Moreover, Astree is closed-source, preventing independent validation of this proof. If Astree has not been designed and implemented correctly, or if the semantics of C have not been formalized correctly, the proof doesn’t hold water. Was Astree validated, for example by turning its intermediate results into assertions? I assume so, since you’d have to be crazy not to use these kinds of methods to debug a large analyzer. But this is not described in any of the papers (that I’ve read, at least). Rather, the argument for correctness is basically “we’re smart, trust us.” But of course the avionics software developers could have saved everyone some time by making the same claim about their own system.

Does Astree’s result for the Airbus software pass the piano test? Speaking for myself: given the available published evidence, absolutely not. I’m not really criticizing Astree — which is an amazing piece of work — but rather taking a critical look at exactly what kind of argument it makes about the analyzed system.

Could Astree’s argument be formalized as a result that could be verified using a mechanical theorem prover? Again, yes — barring design and implementation errors. But it would probably take in excess of 10 person-years of effort. The Astree team is unlikely to invest that effort because they have already made public claims equivalent to stating that the tool is infallible. (This reminds me of a story I heard from a formal verification researcher who was told by a large safety-critical system vendor that they were allowed to prove the company’s software correct, but by definition they weren’t going to find any bugs in it.)

A rather different kind of example comes from Column 4 of Programming Pearls, where Bentley proves that a pseudocode binary search implementation is correct, and then in the next column, derives a C implementation. In 2006 Joshua Bloch noticed that the proof is faulty for any target programming language that has fixed-width integers, including C, when an array at least half the size of the maximum representable integer value is sorted. Bentley’s verification effort failed because his model assumed that integers in C behave mathematically, whereas integer overflows in C either have 2’s complement behavior or else completely destroy the meaning of the program, depending on whether the ints are unsigned or signed and depending on what version of C is being used. Unfaithful models of integer behavior are a common source of false verification results. Verification engineers and researchers have taken note of this and many modern tools have accurate models of integers.

The most convincing verification arguments are those where the proof is machine-checked. Recent examples include seL4, CompCert, and Myreen’s Lisp interpreter. All of these are great results. The one that I have the most experience with is CompCert, which my group has tested extensively. We found a few bugs in its unproved frontend and a few minor holes in its formalization of the PowerPC assembly language, but we did not find any of the incorrect safety checks in optimization passes that plague every other compiler we have studied. Furthermore, in recent versions CompCert provably (and in practice) respects the volatile invariant, whereas every other compiler we have tested gets it wrong.


The closest thing we have to a gold standard for mathematical proof is to make the proof checkable by machine. Could a proof checker be wrong? Sure, but people have made some very small, simple checkers, and different implementations of them can be run on totally different hardware and software systems, greatly reducing the scope for undetected proof errors. The main problem with machine-checked proofs is that creating them requires a pedantic level of commitment to detail that most people do not have. Proof-producing verifiers are the solution, but currently these are so hard to create that very few researchers do so. In the future we can look forward to program analysis infrastructure that handles most of the grunt-work of producing proofs, giving researchers (and system certifiers) the benefit without most of the pain.

Is there a gold standard for assigning meanings to realistic software artifacts? Generally not; but perhaps so, in a few very limited domains. A formalization of the ARM instruction set architecture by some folks at Cambridge is one example. There are three arguments in favor of trusting this particular formalization. First, Fox proved that it was implemented by a particular version of the ARM hardware. Second, the semantics is executable: it can be used to interpret ARM programs, and the results checked against the results on real hardware. Third, the semantics has been around for a while and has been used by many different research groups. It is surprising to some people that a formal model — the basis for all formal reasoning about programs — should itself require extensive verification and validation, but this is absolutely the case. In the long run I think we must hope that relatively few kinds of artifacts will require formal models, and that well-vetted ways of creating these models will emerge. The current situation, where for example everyone who verifies C code comes up with their own semantics, is pretty terrible.

What about program properties? There don’t seem to be widely accepted specifications for memory safety for C, exception safety for C++, and related reusable properties. There exist plenty of nice specification languages such as SPARK Ada’s and ACSL. Even so, creating nontrivial specifications is really hard.

In my “piano test” post I said that NASA’s “test what you fly, fly what you test” adage needed to be adapted for software. Here’s how to do it. The unverifiable elements of program verification — the models and specifications — need to be thoroughly validated through use, through many-eyes techniques, and through independent mathematical checks. They need to lead to executable system models that can be stress-tested. Whenever possible, intermediate results computed by analyzers must be turned into executable checks and the resulting checked codes must also be stress-tested.

In the long run, are formal verification tools that don’t produce proofs any good? Sure. First, we can gain trust in these tools just like we have gained trust in the software that today controls our avionics systems, antilock brakes, and pacemakers. Second, they can find bugs: a bug is a bug, regardless of the quality of the tool that found it. Third, these tools may be useful simply by reporting false positives that make us look closely at our software to make sure they’re false. There’s a great paper showing that only one static analysis tool refutes the null hypothesis that “static analyzer performance is equivalent to diagonal random guessing.”


A formal verification result is based on a mathematical property that attempts to formalize something we care about, a mathematical model that attempts to formalize the behavior of a program or system of interest, and a proof that attempts to argue that the model has the property. Practically speaking, the model, property, and proof may be one of:

  • informal and implicit (embedded in the source code of a checking tool, for example)
  • documented in technical English (or another human language)
  • written in informal math
  • fully formalized
  • formalized and also machine checked

At present, taking the “fully formalized and machine checked” option adds massive overhead to a verification effort. This is a problem because the systems that we want to verify are getting larger and more complicated each year. Advances in modularity and proof techniques are needed, and appear to be coming.

The Piano Test for Program Verification

[Update from Feb 1 2011: I’ve written a new post that adds some technical details.]

Here’s a little thought experiment:

You’re spending the day visiting the Utah computer science department. We’re chatting and I describe a software verification project where my group has proved that some particular version of OpenSSH is memory safe: it will never execute an array bounds violation, null pointer dereference, or similar. Since you are perhaps slightly gullible, you believe me. Now I ask you to look up, and for the first time you notice that a large piano is hanging from my office ceiling, directly above your head. The rope from which the piano is hanging is securely tied off to an I-beam, so there’s no reason to worry about it falling on you accidentally. However, the rope passes through a small box that is connected to my desktop computer using a USB cable. I explain to you that this box contains a very sharp, remotely-activated knife that can instantly sever the rope when commanded to by the computer (don’t worry about this software being wrong, I verified it too).

Now, since you believe my verification result, you should be perfectly willing to keep sitting in your chair while I attack OpenSSH using some sort of highly advanced fuzz tool that tries to crash it by generating evil inputs. Of course, I have rigged OpenSSH’s SIGSEGV handler so that if it ever fires, it will command the USB knife to sever the rope — and you will be squished. Do you keep sitting in the chair?

My opinion is that a person would have to be a bit foolish to sit under the piano. But forget about my work: how many verification researchers believe in their own work to this level? Probably not too many. The first problem is one of semantics: the word “verified” is used extremely loosely by members of the verification community (I recently reviewed a paper that kept saying things like “we rigorously prove” and “we formally verify” — despite the fact that the paper contained nothing but proof sketches). The second problem is deeper: it is extremely difficult to argue convincingly that a verification result is correct. By “correct” I mean not only “mathematically sound” but also “the result is faithful to the intent of the verification effort.”

Although the picture I’ve painted about dropping a piano is silly, the situation is realistic. For example, every time you use the transportation system in a developed country, you are gambling that the benefit of using the system outweighs the risk that some embedded computer will malfunction, harming you. Of course, we willingly make this bet every day, but not because the stoplights and stability control systems are verified (they aren’t) but because we assume that if these systems worked well yesterday, they’ll work well today. The adage (supposedly from NASA) is “test what you fly, fly what you test” and it captures a fundamental fact about how we create reliable engineered systems: systems are tested in realistic environments, failures are rigorously studied, and the results are fed back into the design of the next iteration of the system. This is not incompatible with formal verification, but it’s also not a point of view that comes out very often in the verification literature.

Good Book: Idaho Falls

This book, like the one I wrote about yesterday, is a horror story for engineers. Idaho Falls is about the SL-1, a prototype nuclear reactor in the desert in Idaho. Although it had been designed for a 3 MW thermal capacity, in early 1961 its output briefly reached something like 18 GW when the single main control rod was pulled too far out of the core. This power surge caused the reactor to explode, killing its three operators and necessitating an elaborate cleanup effort and a painful investigation.

Idaho Falls paints a terrifying picture of the SL-1. Apparently it was poorly designed (single primary control rod, no containment vessel) and also had deteriorated significantly over time: not only would the control rods stick sometimes, but also the proper placement for the rods kept changing as neutron-absorbing boron flaked off of the reactor. Furthermore, oversight was limited and operating procedures were poorly documented.

A large part of Idaho Falls focuses on the people involved; this adds human interest, but in the end there is no convincing evidence that personality issues were a direct contributor to the reactor explosion. There is, however, an excellent anecdote describing what may be the least funny practical joke of all time: an operator (one of the three later killed in the explosion) turned off a fan that cooled part of the reactor and watched his colleagues panic for a while before reaching behind a control panel and switching it back on.

I came across this book in an indirect way: last summer, while driving across the Lost River Desert on the way to Borah Peak, I saw an interesting collection of buildings. A sign pointed to the Idaho National Laboratory, which I’d barely even heard of. After getting home, I read up a bit on this lab and eventually found a reference to Idaho Falls. It’s a quick read that people interested in the very early nuclear age should enjoy.

Good Book: Space Systems Failures

Space Systems Failures is like a horror novel for engineers: years of people’s lives and hundreds of millions of dollars are wasted because somebody crossed a wire or skipped a test. The real reasons for failures of launch vehicles and their payloads, however, are more interesting:

  • Margins are slim because adding margin is expensive
  • System state may evolve rapidly with respect to our ability to react, especially during launch or when a spacecraft is far from Earth
  • It’s hard to make things work in a vacuum, in high-radiation environments, and under extreme temperatures
  • Each launch is expensive and often involves untested elements

The result is that the space industry has not, and may never, reach the level of reliability seen for example in aviation. Space Systems Failures is a detailed, engineer-oriented account of a large number of failed space missions. The first part of the book covers launch vehicles and it is organized by rocket family. This part is interesting but perhaps too long: while the failures are surprisingly diverse at first, they begin to blur together after a while. The readability of this section is saved by the (sometimes dryly humorous) discussions of the historical and economic contexts for the various missions.

The second part of the book is about payload failures. It is more interesting for several reasons. First, the functionality of the payloads is diverse, compared to launch vehicles. Second, since failures happen on longer time scales, humans are in the loop. Surprisingly often, an initial apparent failure could be turned into partial or total success by talented ground personnel. Another surprise was how many failures stemmed from error-checking logic, such as shutting down a thruster that was judged to have been firing for too long, preventing a satellite from being stabilized. Unsurprisingly, many serious failures come from interesting feature interactions unforeseen by designers.

I learned a lot from this book, particularly about the environment in which spacecraft live and about the sensors, actuators, and other subsystems that spacecraft use to fulfill their functions. It was interesting to learn that something goes wrong on almost every mission, but often the fault is properly contained instead of propagating into total system failure. The authors conclude that far too many failures result from lessons being improperly learned or forgotten as the relentless drive to lower costs forces tests to be skipped and personnel to be cut.

The Synergy Between Delta Debugging and Compiler Optimization

Before reporting a compiler bug, it’s best to reduce the size of the failure-inducing input. For example, this morning I reported an LLVM bug where the compiler enters an infinite loop when compiling this C code:

static int foo (int si1, int si2) {
  return si1 - si2;
}

void bar (void) {
  unsigned char l_197;
  for (;; l_197 = foo (l_197, 1))
    ;
}

My bug report was a good one because the test program looks pretty close to minimal; it was reduced manually and automatically from a code that was originally 25 KB. Testcase reduction is so important that the GCC folks have an entire document devoted to the topic, including instructions for how to use the Berkeley Delta implementation.

The Berkeley Delta is pretty awesome except when it is not. It often gets stuck at local minima that cannot be avoided except by making coordinated changes to multiple parts of a program. For example, it will never remove an argument from a function definition and also from all calls to the function. Therefore, a person works with delta like so:

  1. Run delta
  2. Do more reduction by hand
  3. Goto step 1

After spending more hours on this kind of activity than I’d care to admit, I realized that most of the stuff I was doing resembled compiler optimizations:

  • inlining a function
  • deleting a dead function argument
  • propagating a constant
  • reducing the dimension of an array
  • removing a level of indirection from a pointer

I think the similarity is a useful observation, but there are important differences between an optimizer and a testcase reducer:

  1. An optimizer must be semantics-preserving, whereas a testcase reducer doesn’t need to be
  2. An optimization is desirable when it speeds up the code, whereas a reduction step is desirable when it both simplifies the test case and also doesn’t make the bug disappear

Thus, to reuse an optimizer as a reducer, we would simply need to drive it in a typical Delta loop, being ready to back out any transformation which makes the bug go away or fails to make the test case simpler. I haven’t had time to do this; hopefully someone else will.

What Peers?

[This post was motivated by, and includes things I learned from, a discussion with some of my blogging and non-blogging colleagues.]

Ideally the peer review process benefits not only the conference or journal that does the reviewing, but also the authors who receive valuable feedback on their submission. And usually it does work like that: I’ve received a large amount of great feedback on my work through paper reviews. However, I feel like some part of the process isn’t working very well. On the reviewer side, I keep getting papers to review that are far enough from my areas of expertise that the reviews I write are superficial (over the weekend I reviewed a paper on mapping temperatures in small bodies of water using sonar — what am I supposed to say?). On the author side, at least 50% of reviews that I get are similarly flawed: they don’t tell me anything new or useful. This is confusing to me because I would have expected that the massive proliferation of conferences and tracks at conferences that has taken place in computer science would facilitate good matching of reviewers to papers. Somehow this doesn’t seem to be happening. In contrast, when I need feedback on a paper I typically send it off to a few acquaintances and friends, and usually get tremendously useful feedback. Also, when I send people feedback on their work I often get a reply along the lines of “thanks — that was way more helpful than the reviews we got from conference X.”

A few random guesses about what’s happening:

  • The peer review system is groaning under the overload of papers produced by all of us who are trying to get a higher h-index, trying to get a faculty position, trying to get tenure, etc.
  • Overworked program chairs are resorting to algorithmic methods for assigning papers to reviewers. This never, ever works as well as manual assignment (though it may be useful as a starting point).
  • There may be an increasing amount of work that crosses community and sub-community boundaries, making it harder to effectively pigeonhole a paper. Conferences and journals do not keep up well as boundaries and interests shift.

The interesting question is: how can we use blogs, online journals, and other modern mechanisms to better match up the people producing research to the people who are best able to evaluate it? It seems like some of the obvious answers, such as adding a reviewing facility to arXiv, suffer from problems where I submit a paper and then get my ten best friends to submit reviews saying how great it is.

Update from Jan 18, evening: Because it’s so easy to throw stones, I wanted to add a couple of positive anecdotes:

  • A student and I submitted a paper to POPL 2011. Although it was rejected, the reviews were super useful and totaled almost 25 KB of text (a bit over 4000 words) after deleting all boilerplate. This is especially amazing considering that POPL got a lot of submissions, or so I heard.
  • I was recently invited to be on the “external review committee” for ISMM, the International Symposium on Memory Management. The invitation said “ISMM reviews are often on the order of one typewritten page” which is again pretty amazing. I’ve submitted only one paper to ISMM and the reviews (again after removing all boilerplate) came out to nearly 21 KB (not quite 3400 words).

ISMM is an interesting conference because it is of unusually high quality for the relatively small community it serves. I’m not sure how they have accomplished that, but my guess is that it’s a fairly tight-knit community where there are good incentives to provide people with useful feedback.