N-version programming is a way to reduce the impact of software bugs by independently implementing the same specification N times, running the implementations in parallel, and using voting to heuristically choose the right answer. A key weakness of N-version programming is that the increased reliability is predicated on faults occurring independently. As Knight and Leveson famously showed in 1986, the assumption of independence can easily be unjustified.
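To make the scheme concrete, here is a minimal sketch of the voting idea in C. The three "versions" and the integer result type are just placeholders standing in for independently developed implementations of some specification, and one replica is deliberately buggy:

/* Minimal sketch of an N-version voter (hypothetical example).
 * version_a/b/c stand in for independently developed implementations
 * of the same specification; version_c deliberately contains a bug. */
#include <stdio.h>

#define NVERSIONS 3

static int version_a(int x) { return x * x; }
static int version_b(int x) { return x * x; }
static int version_c(int x) { return x * x + 1; }  /* faulty replica */

/* Return a majority answer; *agreed is set to 0 if no majority exists. */
static int vote(int x, int *agreed) {
    int answers[NVERSIONS] = { version_a(x), version_b(x), version_c(x) };
    for (int i = 0; i < NVERSIONS; i++) {
        int count = 0;
        for (int j = 0; j < NVERSIONS; j++)
            if (answers[j] == answers[i])
                count++;
        if (count > NVERSIONS / 2) {
            *agreed = 1;
            return answers[i];
        }
    }
    *agreed = 0;  /* correlated or widespread faults defeat the voter */
    return 0;
}

int main(void) {
    int agreed;
    int result = vote(7, &agreed);
    if (agreed)
        printf("majority answer: %d\n", result);
    else
        printf("no majority\n");
    return 0;
}

The voter only helps when a majority of replicas agree on the correct answer, which is exactly why the independence assumption matters.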
Aside from the Knight-Leveson problem, N-version programming is also expensive since the entire development process needs to be replicated. But what if the replicas already exist? Take the example of C compilers: dozens of independent implementations have been created. If I were developing a safety-critical or mission-critical embedded system where compiler bugs were a major source of concern, I could compile my source code using N different compilers (targeting N different architectures if desired) and thus gain some margin with respect to compiler bugs. Of course, my system would incur added hardware costs, but triple modular redundancy has long been an accepted way to gain reliability in avionics and space systems.
We’re left with the question: Are compiler flaws independent? My group has spent the better part of three years finding and reporting 300 bugs to various C compiler development teams. We do this using randomized differential testing: create a random C program, compile it using N compilers, run the resulting executables, and compare the results. So far, we have never seen a case where even two independent compilers (let alone a majority) produced the same wrong result for a test case. The intuitive reason is that the kinds of compiler bugs we find are typically deep inside the optimizer. At this level, independent implementations really are independent: GCC is operating on GIMPLE or RTL, LLVM is operating on LLVM IR, and so on. The problems being solved there don’t relate so much to the C specification as to more fundamental issues like preserving the meaning of a piece of code. My guess is that in experiments like Knight and Leveson’s, more of the complexity of the implementations was at the surface level, with direct relevance to the specification, and therefore errors in implementing the specification were more likely to manifest in similar ways.
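Concretely, one iteration of the testing loop looks roughly like the sketch below. The generator command and the particular compilers are placeholders for whatever is available locally, and a real harness also needs timeouts, crash handling, and undefined-behavior avoidance in the generated programs:

/* Sketch of one iteration of randomized differential testing
 * (hypothetical commands; assumes the generated program is free of
 * undefined behavior and prints a single checksum line). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NCOMPILERS 3

int main(void) {
    const char *steps[NCOMPILERS * 2] = {
        /* build the same random program with each compiler, then run it */
        "gcc   -O2 random.c -o exe0", "./exe0 > result0.txt",
        "clang -O2 random.c -o exe1", "./exe1 > result1.txt",
        "icc   -O2 random.c -o exe2", "./exe2 > result2.txt",
    };

    /* generate one random, well-defined C program (command is a stand-in
       for whatever random program generator is being used) */
    if (system("random-c-gen > random.c") != 0)
        return 1;

    for (int i = 0; i < NCOMPILERS * 2; i++)
        if (system(steps[i]) != 0) {
            printf("step failed: %s\n", steps[i]);
            return 1;
        }

    /* compare every compiler's output against compiler 0's output */
    char ref[256], cur[256];
    for (int i = 0; i < NCOMPILERS; i++) {
        char name[32];
        snprintf(name, sizeof name, "result%d.txt", i);
        FILE *f = fopen(name, "r");
        if (!f)
            return 1;
        if (!fgets(i == 0 ? ref : cur, sizeof ref, f)) {
            fclose(f);
            return 1;
        }
        fclose(f);
        if (i > 0 && strcmp(ref, cur) != 0)
            printf("mismatch: compiler %d disagrees with compiler 0\n", i);
    }
    return 0;
}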
It is entirely possible that there exist correlated failures in implementations of tricky and lightly-used parts of the C standard like those relating to variadic functions or FP rounding modes. We don’t test these. In earlier work we tested implementations of the volatile qualifier and found that all compilers get this wrong sometimes; unfortunately we didn’t spend any time looking for correlated failures when doing that work.
When we started testing compilers, I half-expected that we would find correlated failures because some authoritative source such as the Dragon Book or the CCP paper contained a subtly flawed description of a key algorithm and everyone just implemented it as described. But so far, no dice. Among C++ compilers, the EDG frontend is so widely used that it would seem to create a significant opportunity for non-independent failures. On the other hand, it is reputed to be of high quality, and there are enough non-EDG frontends out there that sufficient cross-checking power is probably available.
4 responses to “Independence in N-Version Programming”
Do I remember correctly that the way you detect compiler bugs is to generate examples which are known to have no undefined behaviour and then run them through various different compilers and see if they produce programs which agree on the result? If so, is it possible that there are some instances of hyper-correlated bugs which you’re missing due to simply everyone getting them wrong?
That seems pretty unlikely, and verging on a de facto standard at that point; I’m just wondering whether there’s some selection bias where correlated bugs are, by their nature, harder to spot.
Hi David- You’re right, this could happen. All we can do to stop it is add more C implementations, under the assumption that someone got it right. But as you say, C’s behavior is defined more by its implementations than its standards, so if all compilers agree on some behavior it’s probably the correct one and the standard needs to be updated. In fact this is what happened with INT_MIN%-1, which seems well-defined to me in C99, but most implementations treat it as undefined. C1x is making it explicitly undefined.
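For concreteness, here’s a tiny test case for that example (the note about SIGFPE reflects what typically happens on x86, not anything the standard guarantees):

/* INT_MIN % -1: mathematically the remainder is 0, but the corresponding
 * division overflows, and on x86 the idiv instruction faults, so this
 * typically dies with SIGFPE instead of printing 0. */
#include <limits.h>
#include <stdio.h>

int main(void) {
    volatile int a = INT_MIN;
    volatile int b = -1;
    printf("%d\n", a % b);
    return 0;
}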
With all the work done on ECMAScript interpreters, it would be interesting to see whether the same conclusion applies to them.
Hi Daniel- We should ask Jesse Ruderman:
http://www.squarefree.com/
He has a great JavaScript fuzzer.