The Future of Compiler Correctness

Notes:

This piece is mainly about compilers used for safety-critical or otherwise-critical systems. The intersection of formal methods and compilers for scripting languages is the empty set.
Readers may be interested in a companion piece The Future of Compiler Optimization.

A half-century of experience in developing compilers has not resulted in a body of widely-used techniques for creating compilers that are correct. Rather — as most embedded system developers would be happy to tell you — it is not uncommon to run across nasty bugs in compilers and related tools, even when a lot of money has been spent on them. This piece examines the situation and explores some ideas that will be helpful during the next 50 years.

“Can we create a correct compiler?” is the wrong question to ask because clearly we can, if we choose an easy source language or a very simple style of translation. A better question is:

Can we create a trustworthy compiler that generates high-quality object code and is otherwise practical?

Let’s look a little closer at the different parts of the problem.

A correct compiler is one that, for every valid program in its input language, provably translates it into object code that implements the computation that the language’s semantics ascribe to that program. A trustworthy compiler is one that convinces us — somehow or other — that it has produced a correct translation. Creating a trustworthy compiler is easier than creating a correct one.

A compiler that produces high quality object code is one that supports at least the basic optimizations: constant propagation, dead code elimination, register allocation, instruction scheduling, various peephole optimizations, etc. I’m being a bit vague about “high quality” since in some cases it means small code, in some cases fast code, but is usually a balance between these goals. The degree of speedup due to optimization is highly variable but in general we expect program speed to at least double.

A practical compiler is one that:

accepts a useful programming language
targets at least one useful hardware architecture
compiles at an acceptable speed
can be extended and maintained
integrates with other development tools: debuggers, IDEs, linkers and loaders, etc.
produces usable error messages and warnings
is affordable to its intended audience
has technical support available

People developing production-quality compilers make many concessions in the name of practicality. Of course, they wouldn’t call it “practicality” but rather something like “creating a real compiler instead of a stupid toy.”

The “high quality object code” requirement excludes very simple compilers like TCC from the discussion. TCC is a C compiler that performs a local translation: each line of code (more or less) is independently turned into object code. This kind of compiler can be made to be relatively practical and I believe it could be made trustworthy (though TCC and other simple compilers have been quite buggy the few times I’ve run tests on them). However, I’m not ready to believe that developers — even safety-critical developers, most of the time — are willing to throw away the 2x to 5x performance improvements that are gained by permitting the compiler to look across multiple lines of code.

Now, to the core of the issue. Today there are two paths to trustworthy, practical, and performant compilers.

The High Road: Verification and Translation Validation

Verified compilation is the gold standard. It sets a very high bar, requiring a mathematical description of the source language, a mathematical description of the target language, and a machine-checkable proof that all programs in the source language are correctly translated into object code. Translation validation is an easier but still very difficult goal where an individualized proof for each program in the source language is produced as a side effect of compilation.

Although high-road topics have been studied by the research community for many years, only one verified compiler has ever been produced that I would call useful: CompCert, from INRIA. CompCert takes a low-level language that is similar to a subset of C and compiles it into PowerPC or ARM assembly language in a way that is provably correct. The compiler is pretty fast and the generated code is good (comparable to “gcc -O1”). CompCert applies most of the standard optimizations; function inlining is the most significant omission. As of recently (and perhaps not even in the latest released version) CompCert supports volatile-qualified variables, which are important in the embedded, safety-critical domain.

CompCert is substantially correct (there are gaps in its modeling of machine details and its frontend still has some unproved parts). Its code quality is good. However, it misses out on some aspects of practicality. Most importantly, it is not clear that mortal developers can maintain and extend it, though this paper represents very interesting work in that direction. CompCert is almost certainly not the place to experiment with interesting new optimizations, to experiment with code generation for new instruction set extensions, etc. For now, it is just too difficult to create verifying instances of these things.

The Low Road: Testing

Although vendors test compilers before releasing them, the kind of testing that matters most is testing through usage: does the compiler, during a multi-year development cycle, show itself to be free of crashes and to reliably produce correct object code? Of course, the important thing is not to have a correct compiler — these do not exist — but rather to have a good idea of the situations in which it is wrong. It is commonplace for large embedded projects to have a few modules that have to be compiled with optimizations turned off, to avoid crashes and wrong-code bugs. Other modules will contain code to work around bugs, such as chaining together 32-bit adds if 64-bit math is buggy. Nobody likes these workarounds, but they are considered to be an acceptable (or at least unavoidable) part of the development process.

Is trustworthiness through usage good or bad? It’s a bit of both. It’s good because it usually works in practice. Also, it is cheap: to create a reliable system we have to do nearly continuous testing at multiple levels of granularity anyway. If we’re also testing the compiler more or less as a side effect, no big deal.

Trustworthiness through use is bad because I think it’s fair to say that we (both in industry and academia) do not have a good understanding of its limits. It works due to a variant of the locality principle: the programs we write in the future are likely to resemble those we wrote in the past. If past programs have been translated correctly, it’s likely that future ones will also. But what does “likely” mean here? What does “look like” really mean? Can we count on these properties? In a nutshell: No. This is one of the things that keeps developers, testers, and managers up late at night. To summarize: we are creating life-critical systems using tools that demonstrably contain silent wrong code bugs. We are comfortable doing so because we test these systems heavily, because this methodology has worked in the past, and because we don’t have a better alternative.

In a trustworthiness-through-usage world, upgrading the compiler should be a very scary thing to do. In fact, the compiler is very seldom (many people would say “never”) upgraded in the middle of a development cycle for a critical system.

What Good Is a Verified Compiler?

We can ask: If you drop a verified compiler into a typical mission-critical, security-critical, or safety-critical system development effort, what fraction of system testing effort would go away? Very little. First of all, what does verified even mean? As I mentioned earlier, a verified compiler is one that can show a mathematical correspondence between a model of the source language and a model of the target language. But is that correspondence actually meaningful and useful in terms of meeting the goals of an X-critical system? For example, is the target model complete and accurate? Or does it formalize the previous version of the instruction set and a restricted version of the source language? Is the target model specific enough to capture details like stack overflow or does it assume infinite memory? Second, as we have seen, real systems only become reliable through massive testing efforts. Most of this testing is unrelated to compiler reliability.

The real value provided by a verified compiler is that, looking forward, it can be used as part of a verified development stack. These don’t exist, but realistic examples will appear during the next 25 years. For example, we might develop a system in a future Matlab-like high-level language. A verified Matlab-to-C compiler gives us an optimized C implementation, and then a verified C-to-ARM compiler gives us verified object code. The RTOS will be verified and so will various properties of the hardware. In principle, these proofs can all be stacked, giving us a top-to-bottom proof that the system works as advertised, based on very few residual assumptions. These assumptions would include “our proof checker is correct, “the fab implemented the chip we specified,” and very little else.

Now we can ask: What fraction of testing effort will be eliminated by a fully verified tool stack? It’s hard to say, but I think about a 30% reduction in testing effort would be about the most anyone could possibly hope for. Even so, the savings due to a 30% reduction would be enormous.

Experience with a Verified Compiler

My group has spent the last few years finding crash and wrong-code bugs in C compilers. We do this by generating random C code, compiling it using different compilers (or different flags to the same compiler), and seeing if the computation that we specified retains the same meaning in all compiled versions. I’m making it sound easy, but generating “conforming” C codes — those whose meaning cannot be changed by a correct compiler — required several person-years (pointers are hard), and there were some other problems to solve before we could make this all work well.

Although we love compilers, and we are on friendly terms with as many compiler development teams as possible, when we put on our testing hats, there is only one goal: break the compiler. This is how testing has to work; it is an inherently adversarial activity. From this point of view, CompCert can be viewed as an affront. It spits in the face of our tool which is designed to create maximally evil C code, and which had found silent, previously-unknown wrong code bugs in every single C compiler we have tested, including a number that are routinely used for creating safety-critical embedded systems. So what happens when our unstoppable tool meets an immovable wall of proofs? This is what we set out to determine.

I expected that we would find some flaws in CompCert’s formalisms. My meager experience in formal specification is that getting a large specification right is heroically difficult — perhaps not much easier than getting a complicated software system right in the first place. So far, however, we have found no errors in the CompCert PowerPC specification, their C-light specification, or in their proof strategy. This is pretty amazing and my opinion of the job done by the CompCert team is astronomically high.

Even so, we did find previously unknown wrong code bugs in CompCert. Three of these were in the unverified front end and were basically typechecker bugs. These resulted in silent generation of wrong code. The other three bugs involved emission of incorrect PPC assembly code where CompCert generated a number that was too large for some fixed-size field in the instruction encoding. These bugs were not silent: the assembler would die with an error. Why didn’t the PPC specification include constraints on the range of these fields? Basically the CompCert authors weren’t worried about this case because assemblers reliably catch this kind of error. The front-end bugs are more worrisome and just in the last week or so, Xavier Leroy has replaced a large chunk of unproved code in the CompCert frontend with verified code. This is very cool.

The result from our CompCert testing that I find most interesting is that the CompCert silent wrong-code bugs are qualitatively different from the bugs we find in other optimizing compilers:

CompCert bugs: These are in an early phase of the compiler and can be found using very simple test cases. These kinds of bugs are not often found in mature compilers (CompCert isn’t, yet). It’s strongly possible that an exhaustive test suite could be created to find all instances of this kind of bug.
Bugs in other compilers such as LLVM, GCC, Intel CC: These are in the middle-ends and back-ends of the compilers and often require fairly large test cases (on the order of 10 lines). It is almost certainly impossible to create any fixed test suite that can find all of these kinds of bugs because the number of 10-line C programs is, for practical purposes, infinite.

I take two lessons away from our experience with CompCert. First, formal specification and verification are not a replacement for testing; the techniques are complementary. Second, formal verification can change the tester’s job and make it easier. In the case of CompCert, verification has eliminated all of the complicated optimizer bugs. In contrast, even after a few years of effort we have not been able to ferret out all instances of this kind of bug in some traditional compilers.

Predictions

OK, this post is getting long, let’s get to the point. Here are some predictions.

Prediction 1: No Compiler Is Considered Trustworthy Without Extensive Random Testing

Our random tester finds silent wrong-code bugs in all C compilers we have tested; a total of 270 crash and wrong code bugs so far. I hope that at some point in the future, compiler developers will create, with our help, a compiler that we cannot break. But this hasn’t yet happened (though CompCert may well reach this goal soon). Moreover, there are many ways to increase the power of our tester:

Increase the expressiveness of our program generator; there are still many C programs that we cannot even in principle generate, such as those containing unions or making interesting use of the comma operator.
Become smarter about how we explore the space of C programs; the conventional wisdom among software testing researchers (and I suspect there is a large element of truth to it) is that plain old random testing wastes a lot of time repeatedly exploring the same part of the test space. Solutions include uniform random sampling (difficult) and bounded exhaustive testing (slow).
Start using white-box testing techniques, which consider the structure of the compiler when generating test cases. The goal might be to generate a test suite that gives full branch coverage of a compiler.

Basically we have gotten excellent results using perhaps the stupidest possible technique. Actually that’s not right. Rather, you might say that we have focused all our efforts on creating a random code generator that is reasonably expressive and that targets the parts of the program space where our intuition told us compiler bugs would be found. That intuition was totally correct.

The prediction — which I do not think is even remotely radical — is that in the future the kinds of random testing I’m talking about here will be considered mandatory for any compiler that people might want to trust for X-critical systems.

Prediction 2: Practical Compiler Verification Work Will Converge Around a Small Number of IRs

Formal verification increases the effort required to create a system by at least an order of magnitude. This bar is simply too high to be tolerated by very many research groups. But consider the alternative where a high quality IR (intermediate representation) such as LLVM, or at least a stable and useful subset of it, is formalized and becomes the basis for a large fraction of verified compiler work. Now, one team can create verified x86 backend, another can create a verified PowerPC backend, and both of these efforts would be far less work than was already put into CompCert. Similarly, my guess is that some of the existing LLVM optimization passes could be extended to support translation validation without being totally redesigned. Other passes like the useful-but-dreadful instruction combiner (which was the single largest C++ file in LLVM before they refactored it) will have to be scrapped and replaced.

The result of this line of work will be a (perhaps small) subset of the LLVM passes the are proof-producing, permitting verified optimizing compilers to be constructed with far less effort than was required to do CompCert. My belief is that this kind of centralized infrastructure is critical to lowering the barriers to entry for verified compiler work.

Prediction 3: Testing and Verification Will Merge

The path to trustworthy and practical compilers will be through interesting combinations of testing and verification techniques. Here are two examples of directions that may end up being promising.

Example 1

Assume that we have three independently developed compilers for some programming language, all of which are pretty good but none of which is totally trustworthy. The classic way to exploit independent versions is n-version programming. To use it, we can run three copies of our system (each built using a different compiler) on three different processors, and then take a vote before performing any externally-visible action. This strategy also provides margin against hardware failure provided that we can make the voter highly reliable. On the downside, running three copies wastes a lot of hardware and energy and also requires the replicas to be deterministic. Determinism may actually be pretty hard to achieve in this case. First, the compilers will variously create faster or slower code, so our voter will have to have some timing tolerance. Second, if nodes are concurrent, small timing nondeterminisms will bootstrap themselves into significant behavioral differences as context switches cause significant events to be reordered. Third, programming languages like C and C++ permit compilers to make choices about significant program behaviors such as order of evaluation of arguments to a function.

How can we do better than node-level replication? If we had a powerful equivalence checker for binaries, we could simply use it to prove that the three systems all implement the same computation and then run only one of them. However, a tool capable of proving that three diverse binary images are equivalent is unlikely to appear anytime soon. But what about proving that short sequences of instructions are equivalent? For example, proving that

mov $0, r1

and

xor r1, r1

are equivalent is trivial, assuming that we don’t care about what happens to the condition code flags. Can we scale this checker up to tens of instructions, for example using a clever SAT encoding? Probably. Hundreds or thousands? Perhaps. What could we do with a tool that can reliably prove the equivalence or inequivalence of sequences of hundreds of instructions? If every function in the system we’re verifying is small enough, it should suffice to individually perform equivalence verification of functions. For systems that contain functions that are too large, we could automatically split the source code for the too-big functions into smaller functions — there are well-known techniques for accomplishing this kind of thing — until the pieces are compiled into object code small enough that their equivalence can be verified. I’m leaving out a lot of minor pieces such as verifying the function-splitter, but overall I believe this strategy could be make to work, giving a great deal of extra trustworthiness to a collection of legacy compilers. It seems elegant that we could bootstrap a scalable and workable translation validation mechanism from a poorly-scalable equivalence checker.

Example 2

As I’ve said, having used a compiler successfully in the past does not provide a rigorous basis for believing it will work in the future. Nevertheless, there’s something undeniably useful about verification through usage: why should we need to prove something about all paths through the compiler when we’re going to exercise only an infinitesimally small fraction of them?

It should be feasible to perform “just enough testing” to gain confidence that a particular source program is compiled correctly by a particular compiler configuration. To perform just enough testing, we’d again break the application under test into small pieces. For each, we’d compile it and perform some sort of intelligent white-box testing of the resulting object code. If we can test it thoroughly enough, and if we can choose a compiler configuration where optimizations are not performed across code chunks in harmful ways, maybe we can start to say something about the correctness of the translation.

Prediction 4: Formal Semantics Will Be Constructed Earlier

Most programming languages intended for safety-critical use will be developed in conjunction with a formal semantics: a machine-readable way to assign a mathematical meaning to each program in the language. Verified compilation is, of course, nonsensical without a formal semantics. A completely separate benefit is that a document like the C99 standard would have benefited enormously if its construction had had input from an experienced semanticist. Certain shady parts of the design would have had light shone upon them much earlier, and hopefully these could have been cleared up prior to finalizing the standard. The current situation is that the standard not only contains many dark corners, but also some real absurdities that compiler developers have been forced to ignore.

August 20, 2010

regehr

Computer Science, Futurist, Software Correctness

12 responses to “The Future of Compiler Correctness”

Zach Tatlock says:

August 20, 2010 at 10:35 pm

Thanks for another great compilers post.

While we developed XCert (the PLDI 10 paper you mention), one major challenge was aligning the semantics from our previous optimization checker, PEC (PLDI 09), with the semantics of CompCert’s RTL IR. After this experience, I think Predictions 2 and 4 should be related: to really lower the barrier for verified compiler work you want a set of common IRs *accompanied by* agreed upon, formal semantics for each IR.

Regarding mere mortals hacking on CompCert: it is possible. Coq programming is often a demanding discipline, but Xavier has done a wonderful job building infrastructure that handles common patterns so you can focus on what’s relevant to your optimization. It may take a few months to go from checking out the source to writing your own optimizations, but I think the learning curve for any compiler framework is pretty steep. Maybe CompCert’s is 4x as steep, but it offers great rewards for that struggle.

However, compiler writers can’t anticipate all developer requirements, and most developers can’t afford the time to learn Coq in order to add an essential optimization. Thus, XCert’s goal: dramatically lower the bar for adding optimizations to verified compilers. Hopefully, this guaranteed safe extensibility can make verified compilers an even more attractive tool in “X-critical” applications.
regehr says:

August 20, 2010 at 10:52 pm

Hi Zach- Thanks for the detailed comment. Sounds like we agree on all fronts. Lowering the bar for verified compiler passes is an amazingly cool contribution.
Pete says:

August 21, 2010 at 1:49 pm

Safety critical systems don’t usually depend on a verified or even tested compiler. For example, the DO-178B standard for aviation systems works around imperfect compilers by simply specifying that the systems object code should be verified as matching its source through testing. A coverage metric is used as a minimum standard for such tests (100% MC/DC coverage is required for a level A system in DO-178B).
regehr says:

August 21, 2010 at 2:29 pm

Hi Pete– A strong test coverage requirement like 100% MC/DC will go a long ways towards finding application and compiler bugs, but it’s not any kind of silver bullet. For example, miscompilation of a math expression could easily evade detection by a control-flow-based metric like MC/DC.
Pete says:

August 23, 2010 at 3:02 am

100% MC/DC coverage would guarantee that the math expression was at least assessed in testing and any subsequent control flow would have to be completely followed. Such tests will generally be checking the correctness of the output, so it would be rather unlikely for an error to slip through.

There are also more severe coverage metrics that are being developed and becoming popular. I’m just pointing out that a fully verified and bug-free compiler is by no means necessary for the development of safety critical systems (obviously).

Having recently had to test our own Ada parser against the Ada Conformity Assessment Test Suite (ACATS), I can definitely sympathise with anyone that has to work on verified compiler design. There were a vast number of painful edge-cases to consider 🙂
regehr says:

August 23, 2010 at 9:37 am

Hi Pete– Point taken, thanks.

I hadn’t seen ACATS but just looked at it. It is incredibly cool that this is freely downloadable. The state of C and C++ compilers would be better if the analogous test suites (Plum Hall etc.) were more accessible.
regehr says:

August 23, 2010 at 9:58 am

Pete- A quick question: What is your estimate of the fraction of people developing / managing a software-intensive safety-critical system who would be willing to upgrade the compiler in the middle of the development effort? If we fully trust our testing regime, the answer should be “100%”.
Matthias Felleisen says:

August 25, 2010 at 8:30 am

This is very timely column for my current efforts. Thanks.

A few remarks:

1. The ACL2 folks at UT Austin verified a software stack in the 1990s. They spun off a company called Computational Logic Inc. aka CLInc. It wasn’t C and it isn’t necessarily applicable to your work as an ’embedder’ but it’s close. A handful of companies use it on a routine basis but mostly for hardware now (from what I know).

2. The goal of developing the semantics early WITH TESTING is of course one of the theses of SEMANTICS ENGINEERING, a book co-authored with your very own Matthew Flatt and Robby Findler (Northwestern). Even when you develop a semantics, you’re never quite sure whether it’s quite what you want. So developing a framework for testing a semantics demonstrates problems and pitfalls. For example, Robby (with students) worked with the Revised6 Report committee to produce a semantics, and he discovered problems in the process:

http://journals.cambridge.org/action/displayIssue?jid=JFP&volumeId=19&issueId=S1

When Matthew, Robby and I developed delimited control for a production compiler, we proceeded in the same way. The test cases for the semantics were used for the compiler of course.

3. The intersection between semantics engineering and scripting languages shouldn’t be empty. Think of JavaScript and exploitations for web-based attacks. Lots of people are working on JavaScript now to develop some form of ‘secure’ subset but what does ‘secure’ mean when you don’t have a semantics? But now we’re back at the beginning.

Again, thanks for the extremely timely blog.
regehr says:

August 25, 2010 at 3:54 pm

Hi Matthias– Thanks for the comments! The semantics + testing work you folks are doing (which, of course, I really like) was very much in my mind when I was writing this post. A great project for some student would be to apply these techniques to CompCert’s Clight semantics!

I didn’t know that the “verified stack” ACL2 work had made it out into the world, that’s great to hear.
Matthias Felleisen says:

August 26, 2010 at 8:21 am

I forgot one comment:

4. You are absolutely correct in that testing and formal methods are complementary approaches and don’t exclude each other. As you may know, I started an effort some four years ago on replacing our freshman logic course (follow-up on discrete structure) with a course on verifying programs with actual tools. Here is the rough idea:

— the first-semester HtDP course teaches proper design
— after a month or so, students write interactive GUI programs (editors, games)
— part of the HtDP course design recipe is an examples-test approach
— the second-semester ACL2 course teaches validation based on proper design
— students learn to generalize from unit tests to property (random) tests
— random testing discovers mistakes in their conjectures
— when the random testing doesn’t discover mistakes anymore, prove the theorem with ACL2
(which removes a lot of the tedium of spelling out proofs
IF your programs are properly designed)

To make this course interesting, it ends with students proving theorems about some of their interactive GUI programs.

The key insight is that unit testing prepares the grounds for introducing a forall quantifier,
which is a “real” theorem; random testing confirms that it isn’t easy to break; and theorem
proving verifies that the theorem holds. Converse, when the theorem prover can’t verify a
theorem, the output is basically useless for students (not for ACL2 experts). Indeed, it is
often because the conjecture is wrong and random testing will discover a counterexample.
Then again, it is also often the case that students fail to stick to the HtDP design recipe and
the theorem prover can’t follow their reasoning.

I know Matthew is re-introducing HtDP at Utah. Perhaps the two of you should try a course
that synthesizes testing and verification.
regehr says:

August 26, 2010 at 8:54 am

Matthias, this course sequence sounds ideal! This is how formal methods should be taught. I’ll chat with Matthew.

If only the languages commonly used to construct the systems I”m interested in weren’t so difficult to reason about! There is, perhaps, a bit of hope; tools like Frama-C are getting to be pretty good.
Robby says:

August 27, 2010 at 7:31 pm

Hm, John your last comment makes it sound like you need to be doing more PL research. 😉