Trust, But Verify


CompCert is an optimizing C compiler whose output provably has the same semantics as its input, at least when the input programs are conforming. Of course this high-level view sweeps a huge number of details under the rug. If we want to gain confidence in CompCert’s correctness we’ll need to either dig into these details — making sure, for example, that the formalizations of x86 and of C are done right — or else we’ll need a different plan.

One kind of alternate plan is differential testing, where we supply both CompCert and one or more other compilers with conforming C programs and check that the resulting executables all have the same behavior. For the Csmith paper we spent quite a bit of time testing CompCert in just this way. We found some problems in it along the way, but were unable to find code generation bugs in the latest version of CompCert at the time that we finalized the paper.

Recently I returned to CompCert testing and wanted to share a few observations about that. The reference compilers were recent development versions of Clang and GCC, targeting x86, with optimizations disabled. One reason that I wanted to get back to CompCert is that its recent versions contain a function inliner. This is a pretty important optimization and it is one that Xavier once told me would be very difficult to handle in CompCert because it changes the shape of the program in a significant way. So this seemed worth testing. Also, CompCert now has a better register allocator, it supports 64-bit integers, and has a few other new features.

I ran 1,000,000 randomly-generated test cases through CompCert 2.0 without asking it to do any inlining. This turned up no problems in CompCert, but — somewhat surprisingly — triggered a GCC wrong-code bug just once out of the million test cases. This bug was rapidly fixed. A few weeks later, I ran another 1,000,000 test cases through CompCert 2.1, this time asking it to do a lot of inlining. This time, no bugs were found in GCC, Clang, or CompCert.

Given that all three of these compilers are capable of translating a million randomly generated programs without a detectable miscompilation, one might ask: Why CompCert? The main answer is that it generates better code than gcc -O0 or clang -O0. If we want better code out of the other compilers we can turn on their optimizers, but at that point they would fail to be capable of translating 1,000,000 random programs without miscompiling a few times. Of course we don’t care about random programs. The real questions are (1) how can we be sure that compiler bugs aren’t going to bite the programs that we care about and (2) when developing safety critical software, how can we make an argument that the compiler has not introduced any flaws? One answer is translation validation where we use an external tool to ensure that the compiler has respected the meaning of the input code. But this is quite difficult to do in the presence of optimizations.

The point I wanted to make in this post is simply that while proofs are great, if the subject of the proof is a piece of code that has to accomplish a real-world task, we can sure breathe a lot easier if we have an independent means of validating that the code works as advertised. In this case, we do and it does.


2 responses to “Trust, But Verify”

  1. Thanks for all the testing, John. Csmith has many uses for CompCert, ranging from validating the formal C semantics to finding missing cases and “assertion failed” in the compiler.

    Regarding inlining in CompCert, it took some work because surgery on control-flow graphs is painful to formalize. But it’s a robust transformation, in the sense that there are few special cases and the general semantic invariant, once written down properly, shows correctness “once and for all”.

    The part of CompCert 2.0/2.1 that was really painful is the support for “long long” integer types, because these types add a whole lot of new cases everywhere, from the formal semantics of CompCert C down to the register allocator (which now has to deal with pairs of 32-bit registers) and the calling conventions. Csmith found a bunch of nasty missing cases where the back-end would just stop on an internal error. This was a great help for me.