Finding and Understanding Bugs in C Compilers


Today we finished preparing the camera-ready version of our paper that will appear in PLDI 2011. I’m pretty happy with it. Here’s the abstract:

Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undefined and unspecified behaviors that would destroy its ability to automatically find wrong-code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.

Actually, as of today the total is 344 bugs reported but at some point we had to freeze the statistics in the paper.


8 responses to “Finding and Understanding Bugs in C Compilers”

  1. Nice work, and extra thanks for releasing the software. That is all too rare in academia in my experience.

    As for the results, I can’t say I’m surprised. When I evaluate a compiler, I usually find at least half a dozen bugs within minutes, be it wrong code, compiler crash, or rejecting valid input, and this without even trying.

  2. The statement “compilers should be correct” sounds obvious, but is it right?

    Suppose a compiler had a bug that was impossible for a fast adversary to trigger. For example, suppose a compiler had a statement like “if (ENCRYPT(x) = deadbeef), go into infinite loop,” where ENCRYPT is a secure encryption function. It would be impossible for a sub-exponential adversary to find a value of x which triggered this code, and so the compiler would be correct from the viewpoint of fast adversaries.

    Thus, while there exists a code which will elicit incorrect output, the compiler can never, in practice, encounter this code.

    Are there any circumstances in which this type of compiler incorrectness is acceptable or desirable?

  3. (cont’d)

    Here is a plausible example in which this type of incorrectness might be used.
    Suppose a code contains a condition

    if (data1 == data2)

    where data1 and data2 are large data structures.

    Instead of checking equality bit-for-bit, which might be expensive,
    the compiler transforms this to

    if (hash(data1) == hash(data2))

    where hash is a cryptographically secure hash function.

    Is this transformation permissible? The new program definitely does not have the same semantics as the original one, as there exists an input which elicits different behavior. However, it is impossible to find such an input in less than 2^(256) operations, for example.

  4. Hi David,

    Nice example. In general I think there are plenty of situations where a fast, good-enough computation is fine. This is pretty much the entire idea behind floating point math, right? I mean, we seem to be comfortable getting onto aircraft where the stability analysis was done using a system that cannot represent 0.2 precisely :).

    However, compromising on correctness needs to be done in a way that is understandable, predictable, and controllable by programmers. So if you designed a language where the compiler did these things, I’d expect the incorrectness to be precisely quantified in the language spec. At that point there’s no problem. After all, real processors are only probabilistically correct due to cosmic rays, quantum mechanics, etc.

  5. Apologies if you mention this in the pre-print and I missed it, but I was wondering if based on your results, you could characterize a “safe”(er) subset of C that is highly unlikely to trigger compiler bugs, perhaps even restricted to a specific compiler (e.g., GCC). For example, what percentage of bugs are triggered by constructs violating the MISRA C guidelines ?

  6. Lee, that is a very interesting idea, though my intuition is that it won’t work. For example, I believe that almost all bugs we found could be triggered by a MISRA-compliant test case.

    Of course what I am saying is unscientific. An interesting experiment would be to take Csmith’s output and run it through some source-to-source transformers that remove features such as side-effects within expressions. Then, we could see if the transformed code triggers fewer bugs than the original.

    The problem is the compilers are quite smart and would probably turn both versions of the code into more or less the same IR.

    My mental model for compiler bugs is that there is no model.