Independence in N-Version Programming

N-version programming is a way to reduce the impact of software bugs by independently implementing the same specification N times, running the implementations in parallel, and using voting to heuristically choose the right answer. A key weakness of N-version programming is that the increased reliability is predicated on faults occurring independently. As Knight and Leveson famously showed in 1986, the assumption of independence can easily be unjustified.

Aside from the Knight-Leveson problem, N-version programming is also expensive since the entire development process needs to be replicated. But what if the replicas already exist? Take the example of C compilers: dozens of independent implementations have been created. If I were developing a safety-critical or mission-critical embedded system where compiler bugs were a major source of concern, I could compile my source code using N different compilers (targeting N different architectures if desired) and thus gain some margin with respect to compiler bugs. Of course, my system would incur added hardware costs, but triple modular redundancy has long been an accepted way to gain reliability in avionics and space systems.
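The voting step at the heart of N-version execution can be illustrated in a few lines of Python. This is a minimal sketch, not anything from a real system: `vote` is a hypothetical helper, and a production voter would need to define output equality far more carefully (floating point, timing, I/O ordering).

```python
from collections import Counter

def vote(outputs):
    """Return the strict-majority value among N replica outputs,
    or None when no majority exists (hypothetical helper)."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

# Triple modular redundancy: one faulty replica is outvoted.
print(vote([42, 42, 41]))   # -> 42
print(vote([1, 2, 3]))      # -> None (no majority)
```

With N = 3 this is exactly the triple-modular-redundancy scheme mentioned above: a single wrong replica is outvoted, while a two-way or three-way disagreement is reported as a failure rather than silently resolved.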

We’re left with the question: Are compiler flaws independent? My group has spent the better part of three years finding and reporting 300 bugs to various C compiler development teams. We do this using randomized differential testing: create a random C program, compile it using N compilers, run their outputs, and compare the results. So far, we have never seen a case where even two independent compilers (let alone a majority) have produced the same wrong result for a test case. The intuitive reason is that the kind of compiler bugs we find are typically deep inside the optimizer. At this level, independent implementations really are independent: GCC is operating on GIMPLE or RTL, LLVM is operating on LLVM IR, etc. The problems being solved here don’t relate so much to the C specification, but instead to more fundamental issues like preserving the meaning of a piece of code. My guess is that in experiments like Knight and Leveson’s, more of the complexity of the implementations was at the surface level, with direct relevance to the specification, and therefore the implementations were more likely to go wrong in similar ways.
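The comparison step of differential testing can be sketched as follows: group the compilers by the output their compiled binaries produced, and flag whichever compilers fall outside the largest agreement class. This is an invented illustration (the compiler names and checksums below are made up); a real harness would also have to handle ties, timeouts, crashes, and programs whose behavior is undefined.

```python
from collections import defaultdict

def partition_results(results):
    """Group compiler names by the output their compiled binary
    produced. `results` maps compiler name -> test-program output."""
    classes = defaultdict(list)
    for compiler, output in results.items():
        classes[output].append(compiler)
    return classes

def suspects(results):
    """Compilers outside the largest agreement class: the first
    ones to investigate for a wrong-code bug. (Ties go to whichever
    class max() sees first; a real harness would handle that case
    explicitly.)"""
    classes = partition_results(results)
    majority = max(classes.values(), key=len)
    return sorted(c for group in classes.values()
                  if group is not majority for c in group)

runs = {"cc-a": "checksum=81BE", "cc-b": "checksum=81BE",
        "cc-c": "checksum=0000"}
print(suspects(runs))   # -> ['cc-c']
```

Note that unlike the N-version voter, this step never tries to pick the "right" answer for deployment; a disagreement just produces a test case to reduce and report.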

It is entirely possible that there exist correlated failures in implementations of tricky and lightly-used parts of the C standard like those relating to variadic functions or FP rounding modes. We don’t test these. In earlier work we tested implementations of the volatile qualifier and found that all compilers get this wrong sometimes; unfortunately we didn’t spend any time looking for correlated failures when doing that work.

When we started testing compilers, I half-expected that we would find correlated failures among compilers because some authoritative source such as the Dragon Book or the CCP paper contained a subtly flawed description of a key algorithm, and everyone just implemented it as described. But so far, no dice. In C++ compilers, the EDG frontend is so widely used that it would seem to create an important opportunity for non-independent failures. On the other hand, it is reputed to be high quality and also there are enough non-EDG frontends out there that sufficient cross-checking power is probably available.

Compiler Bug-Finding Project Milestones

Despite the Fall semester being crappy for me (double course load, and I keep getting sick) this project has made good progress:

  • Xuejun gave a talk at the LLVM Developers’ Meeting
  • We submitted our 300th compiler bug report (up from 200 in March 2010)
  • We submitted a paper to PLDI 2011 summarizing our methods and results; I’m keeping fingers crossed that I won’t have to burn an effigy of the dreaded reviewer 3

The next milestone will be to release Csmith, our test-case generator, as open source, hopefully before Christmas.

Computer Science Research Proposal

Last night I sent a link to this article to my colleagues. It’s a really entertaining essay, though I agree with several of the commenters that some of the details do not ring true. Someone responded with a link to this page, which says:

Our writers staff on, that specialize at computer science can provide you with custom written computer science research proposal, that exactly match your expectations.

However sweet the promise of ending the proposal-writing problems that plague faculty and students alike, this sentence is too riddled with errors to give us much confidence in the proposal-writing service. But then again, a lot of funded CS research proposals also display signs of extremely hasty proofreading.

Counting Compiler Crashes

This is a bit of material from my GCC Summit talk that I thought was worth bringing out in a separate post.

As an experiment, we compiled 1,000,000 randomly generated C programs using GCC 3.[0-4].0, GCC 4.[0-5].0, LLVM-GCC 1.9, LLVM-GCC 2.[0-8], and Clang 2.[6-8]. All compilers targeted x86 and were given the -O0, -O1, -O2, -Os, and -O3 options. We then asked: in how many distinct ways can each compiler be made to crash? We count crash bugs by looking for unique “assertion failure” strings in LLVM output and “internal compiler error” strings in GCC output. This count is conservative because typically a compiler will also have a number of crashes due to null pointer dereferences and other memory safety violations, and we don’t try to tell those apart. Here are the three crash messages we got from Clang 2.8:

Statement.cpp:944: void Statement::post_creation_analysis(std::vector<const Fact*, std::allocator<const Fact*> >&, CGContext&) const: Assertion `0' failed.

StatementIf.cpp:81: static StatementIf* StatementIf::make_random(CGContext&): Assertion `ok' failed.

Block.cpp:512: bool Block::find_fixed_point(std::vector<const Fact*, std::allocator<const Fact*> >, std::vector<const Fact*, std::allocator<const Fact*> >&, CGContext&, int&, bool) const: Assertion `0' failed.
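The counting step can be sketched as a small script that scans compiler stderr for crash markers and deduplicates whole message lines, so that messages like the ones above each count once. This is a simplified illustration: the regex and the sample log lines below are made up, and real ICE messages carry more context.

```python
import re

# Crash markers, simplified from the two compilers' actual messages.
CRASH_PAT = re.compile(r"Assertion `.*' failed|internal compiler error")

def unique_crashes(stderr_lines):
    """Deduplicate crash messages: each distinct matching line
    counts as one crash bug, mirroring the conservative counting
    described above."""
    return {line.strip() for line in stderr_lines
            if CRASH_PAT.search(line)}

stderr = [
    "foo.c:10: warning: unused variable 'x'",
    "Block.cpp:512: ... Assertion `0' failed.",
    "Block.cpp:512: ... Assertion `0' failed.",   # duplicate, counted once
    "Statement.cpp:944: ... Assertion `0' failed.",
    "bar.c:3: internal compiler error: Segmentation fault",
]
print(len(unique_crashes(stderr)))   # -> 3
```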

Here are the GCC results:

And here are the LLVM-GCC results:

The thing I really like about these results is the way it shows that the latest versions of GCC and LLVM are really solid. Both teams have put a tremendous amount of work into their tools.

The “bugs fixed” annotations in the graphs refer to fixes to bugs that we found (using random testing) and reported. We had hoped to establish some sort of nice causal link between fixing these bugs and improving the resistance of compilers to crashing, but these graphs are a long way from showing anything like that. There is just a lot going on in each of these projects — the experiment is too uncontrolled to give us any kind of solid evidence.

Are compiler crashes good or bad? Certainly they’re a pain when you’re just trying to get work done, but overall they’re good. First, they can almost always be worked around by changing the optimization level. Second, a crash is the compiler’s way of failing fast, which is good. If that assertion hadn’t been there, it’s entirely possible the compiler would have run to completion, generating incorrect code.

Update from 11/6/2010: Here are the graphs showing rate of crashing, as opposed to number of distinct crash bugs, that Chris asked about in a comment.

Although it’s not totally clear what to read into these numbers, they do seem to tell a good story. GCC steadily improves during the 3.x series, regresses following the major changes that went into 4.0.0, then (more or less) steadily improves again. Similarly, LLVM starts off with a rather high rate of crashes and improves over time. Both compilers are exceptionally solid in their latest releases (note the log scale: both compilers have reduced their crash rates by at least 3 orders of magnitude).

Non-deterministic Compilers?

Compilers are supposed to be deterministic, and they generally are. However, when a compiler has memory safety bugs (usually use-after-free) and runs on an OS with ASLR (address space layout randomization) enabled, it can behave non-deterministically. Compile a file one time and the result is correct; compile the exact same file again and now the executable crashes or generates the wrong answer. This may sound like a silly case that happens only in theory, but while hunting for compiler bugs I’ve seen it happen a number of times.

The scary scenario is this:

  1. Safety-critical embedded code is compiled for testing, and happens to be compiled correctly.
  2. Testing proceeds and finds no problems.
  3. For whatever reason, the system is compiled again and this time the wrong code is generated.
  4. Wrong code is deployed.

Unlikely? Sure! But so are a lot of other things people developing safety-critical systems have to worry about. If I were developing safety-critical code I’d turn off ASLR for development tools; it’s cheap insurance.
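A cheap way to catch this kind of non-determinism during development is to run the same build several times and hash the resulting artifact; any mismatch is worth investigating. A minimal sketch follows; the `builds_identical` helper and the example command are invented, and for a real toolchain you would pass the actual compiler command line and object-file path.

```python
import hashlib
import os
import subprocess
import sys
import tempfile

def builds_identical(cmd, out_path, runs=5):
    """Run a build command several times and check that the output
    artifact hashes identically each run. A mismatch suggests a
    non-deterministic toolchain (or, as in the scenario above, a
    memory-safety bug interacting with ASLR)."""
    digests = set()
    for _ in range(runs):
        if os.path.exists(out_path):
            os.remove(out_path)
        subprocess.run(cmd, check=True)
        with open(out_path, "rb") as f:
            digests.add(hashlib.sha256(f.read()).hexdigest())
    return len(digests) == 1

# Example with a trivially deterministic stand-in "build":
out = os.path.join(tempfile.gettempdir(), "out.bin")
cmd = [sys.executable, "-c",
       f"open({out!r}, 'wb').write(b'object code')"]
print(builds_identical(cmd, out))   # -> True
```

On Linux, ASLR can also be disabled for a single process tree (e.g. via setarch’s address-space-randomization option), which is one way to implement the cheap insurance mentioned above.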