N-version programming is a way to reduce the impact of software bugs by independently implementing the same specification N times, running the implementations in parallel, and using voting to heuristically choose the right answer. A key weakness of N-version programming is that the increased reliability is predicated on faults occurring independently. As Knight and Leveson famously showed in 1986, the assumption of independence can easily be unjustified.
Aside from the Knight-Leveson problem, N-version programming is also expensive since the entire development process needs to be replicated. But what if the replicas already exist? Take the example of C compilers: dozens of independent implementations have been created. If I were developing a safety-critical or mission-critical embedded system where compiler bugs were a major source of concern, I could compile my source code using N different compilers (targeting N different architectures if desired) and thus gain some margin with respect to compiler bugs. Of course, my system would incur added hardware costs, but triple modular redundancy has long been an accepted way to gain reliability in avionics and space systems.
We’re left with the question: Are compiler flaws independent? My group has spent the better part of three years finding and reporting 300 bugs to various C compiler development teams. We do this using randomized differential testing: create a random C program, compile it using N compilers, run their outputs, and compare the results. So far, we have never seen a case where even two independent compilers (let alone a majority) have produced the same wrong result for a test case. The intuitive reason for this is that the kind of compiler bugs we find are typically deep inside the optimizer. At this level, independent implementations really are independent: GCC is operating on Gimple or RTL, LLVM is operating on LLVM IR, etc. The problems being solved here don’t relate so much to the C specification, but instead to more fundamental issues like preserving the meaning of a piece of code. My guess is that for experiments like Knight and Leveson’s, more of the complexity of implementations was at the surface level, having direct relevance to the specification, and therefore errors in implementing the specification were more likely to go wrong in similar ways.
It is entirely possible that there exist correlated failures in implementations of tricky and lightly-used parts of the C standard like those relating to variadic functions or FP rounding modes. We don’t test these. In earlier work we tested implementations of the volatile qualifier and found that all compilers get this wrong sometimes; unfortunately we didn’t spend any time looking for correlated failures when doing that work.
When we started testing compilers, I half-expected that we would find correlated failures among compilers because some authoritative source such as the Dragon Book or the CCP paper contained a subtly flawed description of a key algorithm, and everyone just implemented it as described. But so far, no dice. In C++ compilers, the EDG frontend is so widely used that it would seem to create an important opportunity for non-independent failures. On the other hand, it is reputed to be high quality and also there are enough non-EDG frontends out there that sufficient cross-checking power is probably available.