The Last Mile for a Verification Problem

A few weeks ago my student Lu Zhao defended his PhD thesis and we were lucky to get his external committee member, Magnus Myreen, to attend in person since he was in the USA anyway for a meeting. While he was out here, Magnus gave a talk on some not-yet-published work that is really cool.

The background for Magnus’s work is L4.verified, a tour de force of formal verification where a microkernel was proved to be correct. Now the first thing you should always ask when anyone says that anything was “proved to be correct” is: What was proved about what, and how? The question is important because historically speaking (and currently, but especially in the past) these words have been used rather loosely. In the case of L4.verified, “proved to be correct” means roughly that:

The implementation of a microkernel in C, whose meaning was determined using a formal semantics for C that was developed by the L4.verified researchers, was proved using Isabelle/HOL to implement a formal specification of the L4.verified kernel. Residual unproved assumptions remain with respect to some inline assembly language and also some unformalized aspects of ARM hardware such as its virtual memory subsystem.

Where could this go wrong? It could be the case that the high-level specification does not formalize what was intended. It could be the case that incorrect assumptions were made about the hardware and about the inline assembly that manipulates it. It could be the case that the C formalization fails to match up with the compiler that is used. This mismatch could be due to an error in the formalization or an error in the compiler. Of course, it is well-known that compilers contain bugs. Finally, the Isabelle/HOL prover could itself be incorrect, though most people would consider this highly unlikely.

Magnus’s work (done in conjunction with Thomas Sewell of NICTA, and currently in progress) is focused on eliminating the dependency on a correct C formalization and C compiler. Of course, L4.verified still has to run on some target machine. Their idea is to replace the C abstract machine (augmented with ARM-specific stuff) with the formal model of the ARM instruction set that was developed at Cambridge at least ten years ago. This model has been used in a variety of projects and is believed to be of high quality.

The most obvious way to eliminate L4.verified’s trusted C compiler is to use CompCert, and in fact Magnus and Thomas tried to do that. This line of work was stopped by incompatibilities between CompCert’s view of the meaning of C and L4.verified’s, and also by practical differences between Isabelle/HOL and Coq — the latter prover being the one used by CompCert. It wasn’t clear to me how much effort would have been required to overcome these problems.

The second approach that Thomas and Magnus tried is decompiling the L4.verified kernel’s ARM object code into math. In other words, they treat each machine code function as a mathematical function. Decompilation was the topic of Magnus’s PhD work and he achieved some very impressive results using it. There are two obvious problems with the decompilation approach: loops and side effects. Loops are simply decompiled into tail recursion, and side-effecting functions can be handled by treating the machine’s memory (or part of it) as an explicit input to and output from each function. Plenty of details are found in Magnus’s papers and in his PhD thesis. The next step is to connect the decompiled functions with their equivalent functions as described by the C semantics, and this is accomplished by some rewriting steps and finally using an SMT solver to do some of the dirty work. Magnus described this last part, connecting the C code with the decompiled code, as working for 65% of the functions in L4.verified. Of course, we might expect that it works for the easiest 65% of functions, so it could be the case that a significant amount of work remains, and my guess is that the last dozen or so functions will turn into serious headaches.

I was very curious whether they had so far run into any compiler bugs and it seems that they have not. On the other hand, this is expected since compiler bugs will force the equivalence check to fail, meaning that all miscompiled functions are still hiding in the 35% of functions whose equivalence has not yet been verified. My guess is that it’s substantially likely that they will not run into any serious compiler bugs while doing this work, for two reasons. First, the L4.verified kernel has been tested fairly heavily already, and has been subjected to worst-case execution time analysis at the object code level. It seems likely that serious miscompilations would have already been discovered. Second, L4.verified is coded in a somewhat careful subset of C that eschews, for example, bitfields in favor of explicit shifts and masks, side-stepping a common area where compiler bugs are found. It seems likely that they might run into a few fairly harmless miscompilations such as illegally reordered memory operations where the reordering doesn’t matter that much.

Anyway, it was great having Magnus out in Utah and hearing about this work, which represents a nice improvement on an already extremely impressive piece of work.

3 responses to “The Last Mile for a Verification Problem”

  1. In the case of L4.verified, there is another quite interesting reason why the formalization may go wrong: the proof might actually not exist. The mechanized proof, and the corresponding C code, have to my knowledge never been published. Quite detailed and very interesting papers have been published about the proof — and of course we trust the researcher doing the work — so one may be extremely confident that the proof exists, but that is no formal evidence.

  2. We have actually found compiler bugs in the past, but none with the toolchains we currently use. As you’ve said though, the subset of C we use does rule out some features that are highly correlated with compiler bugs.

  3. “Incompatibilities between CompCert’s view of the meaning of C and L4.verified’s” doesn’t sound too good, especially if they used a subset of C that is easiest to get correct. Any more details about this?

    ps I’m not the Magnus in the article