Finding Compiler Bugs by Removing Dead Code

I was pretty bummed to miss PLDI this year, it has been my favorite conference recently. One of the talks I was most interested in seeing was Compiler Validation via Equivalence Modulo Inputs by some folks at UC Davis. Although I had been aware of this paper (which I’ll call “the EMI paper” from now on) for a while, I was hesitant to write this post — the work is so close to my work that I can’t avoid having a biased view. So anyway, keep that in mind.

One of the things that makes testing hard is the necessity of oracles. An oracle tells us if a run of the software being tested was buggy or not. A cute idea that no doubt has been around for a long time, but which has more recently been called metamorphic testing, is to work around the oracle problem by taking an existing test case and changing it into a new test case whose output can be predicted. For example, if I have a test case (and expected answer) for a program that classifies a triangle as isosceles/scalene/right/etc., I can scale, translate, and rotate the triangle in my test case in order to create many new test cases that all have the same answer as the original one.

So how should one apply metamorphic testing to compilers? It’s not very hard to come up with bad ideas such as adding layers of parens, rewriting (x+y) to be (y+x), rewriting x to be (x+0), etc. The reason that these are bad ideas is that the changes will be trivially undone by the optimizer, resulting in poor testing of the optimizer logic. Some better ideas can be found in this paper on metamorphic compiler testing (IEEE paywall, sorry) which found a few GCC bugs.

The EMI paper is based on a particularly clever application of metamorphic testing where the program transformation is removal of dead code. Of course compilers know how to remove dead code, so how is this transformation any better than the bad ones I already mentioned? The idea is to remove dynamically dead code: code that is dead in some particular run. This kind of code is easy to detect using a code coverage tool. Of course this code may not be dead in other executions of the program, but this is fine: we’ll just need to be careful not to test the compiled program on anything other than the input that was originally used to do dead code discovery. So basically EMI works like this:

  1. Run a C program on whatever input we have sitting around, using compiler instrumentation to check which lines are executed.
  2. Create a new C program lacking randomly chosen pieces of code that did not execute in Step 1.
  3. Run the new program on the same input. Report a compiler bug if its output has changed.

Notice that the original C program had better not execute undefined behavior or rely on unspecified behavior or else we’ll get false positives.

The cleverness of EMI is not abstract or conceptual. Rather, EMI is clever because it works: at the time the paper was finalized the authors had reported 147 compiler bugs that were confirmed by developers, 110 of which have been fixed. This last number — fixed bugs — is the impressive one, since finding bugs that people care about enough to fix is generally a lot harder than just finding bugs.

The great thing about EMI is that it is a simple and extensible process. For example, it would not be hard to adapt the idea to C++. In contrast, random generation of a meaningful subset of C++11 is a project that we have been reluctant to start because we don’t yet know how to build this generator at an acceptable level of effort. Another easy extension to EMI would be adding or modifying dead code rather than simply deleting it. More generally, metamorphic compiler testing is probably an underused idea.

I was interested to read that the vast majority (all but four, it looks like) of the bugs discovered by EMI were triggered by mutated versions of Csmith programs. One reason that this is interesting is that since Csmith programs are “closed” — they take no inputs — the statically and dynamically dead code in such a program is precisely the same code. Therefore, an apparent advantage of using dynamic information — that it can remove code that is not dead in all executions — turns out to be a bit of a red herring. EMI works in this situation because the dead code elimination passes in compilers are not very precise.

An interesting question is: Why is Csmith+EMI so effective? One reason is that Csmith programs tend to contain a large amount of dead code, giving EMI a lot of room to play. It is just hard to generate expressive random code (containing lots of conditionals, etc.) that isn’t mostly dead, as far as we can tell. We’ve known this for a while and basically we don’t care — but we never guessed that it would turn out to be a hidden advantage.

Another problem with using EMI to mutate non-Csmith programs is that many real C programs execute undefined behaviors and even when they do not, it is generally difficult to verify that fact. Csmith, in contrast, has been somewhat co-designed with Frama-C such that the two tools work together with no additional effort. Automated undefined behavior detection is a crucial part of doing automated test-case reduction using C-Reduce.

One might ask: How useful is EMI when applied to C++ given that there is no C++smith? I look forward to learning the answer. The lack of a robust undefined behavior checker for C++ is another problem, although projects like LLVM’s UBsan are slowly chipping away at this.

The EMI authors say “… the majority of [Csmith’s] reported bugs were compiler crashes as it is difficult to steer its random program generation to specifically exercise a compiler’s most critical components—its optimization phases.” This doesn’t follow. The actual situation is subtle, but keep in mind that the entire purpose of Csmith is to exercise the compiler’s optimization phases. We spent years working on making Csmith good at this exact thing. We did in fact report more crash bugs than wrong code bugs but the real reasons are (1) we aggressively avoided duplicate bug reports by reporting only one wrong code bug at a time, and (2) wrong code bugs tend to be fixed much more slowly than crash bugs. In essence, the reasons that we reported fewer wrong code bugs than crash bugs are complex ones having more to do with social factors (and perhaps our own laziness) than to do with weaknesses of Csmith. Of course it might still be the case that EMI is better than Csmith at discovering middle-end optimizer bugs, but the EMI authors have not yet shown any evidence backing up that sort of claim. Finally, it is not necessarily useful to think of compiler crash bugs and wrong code bugs as being different things. The underlying bugs look much the same, the difference is often that in one case someone put the right assertion into the compiler (causing the inconsistency to be detected, leading to crash via assertion violation) and in the other case the inconsistency was undetected.

On a closely related note, after finishing this paper I was left asking: Given a limited testing budget, would it be better to run Csmith or to run Csmith+EMI? In other words, which method discovers either more bugs or higher-quality bugs? This would be a fun experiment to run, although there are some subtleties such as the fact that GCC and LLVM have (effectively) evolved to be somewhat resistant to Csmith, giving any new testing method an implicit advantage.

One thing about the EMI work that makes me selfishly happy is that over the last couple of years I’ve slacked off on reporting compiler bugs. This makes me feel guilty since Csmith continues to be capable of finding bugs in any given version of GCC or LLVM. Anyway, I feel like I’ve done my time here, so have fun with the bug reporting, guys!

In summary, EMI is very cool and anyone interested in compiler correctness should read the paper. It’s also worth thinking about how to apply its ideas (and metamorphic testing in general) to other application domains.

11 responses to “Finding Compiler Bugs by Removing Dead Code”

  1. What has caused you to slack off on the bug reporting? The cost of computer time or the human effort side? (I’m guessing the second.)

    With regards to only reporting one bug at a time. How does that play out? Do you keep probing but just sit on the new test cases? If so, then it seems like this would be an interesting opportunity for some AI work: you could generate and reduce cases as fast as you are willing to pay for the compute to support and use this to populate a stable of tests cases that get re-run on each new release or nightly build of the LLVM/GCC. The AI bit comes in by assuming that test cases that all start passing at the same time are caused by the same bug which gives you training data for a supervised learning system.

  2. Hi bcs, you are right: the limiting resource is entirely me.

    We have done some work on deduplicating bug reports that come from fuzzing but it’s not perfect and I have (inexcusably) not tied Csmith, C-Reduce, and this newer work into a seamless package:

    This summer would be a good time to get that done…

  3. bcs, that’s a nice idea of a way to bootstrap the supervised learning. Myself and some machine learning folks at OSU have been working on just that, with historical bugs from SpiderMonkey 1.6 up through 1.8.5 as the data set, but even using the bugs fixed between 1.6 and 1.7 (a lot of bugs, in this case), it’s hard to learn a really good distance metric. I think our best from that doesn’t do as well as Levenshtein in that PLDI paper. However, right now we’re using function coverage, since statements produce too many dimensions to just out of the box run the metric learning stuff.

  4. Where can one find this »Orion« EMI tool?
    Or is this closed source software?

  5. Hey Dr Regehr,

    For automatic random generation of valid program code, try grammatical evolution.

    The idea: generate a sequence of random numbers (phenotype). Use each number to steer your way through a defined grammar, which is guaranteed to produce valid genotypes in your target language.

    I did this once for a data mining project at University of Louisville. My grammars produced populations of random classifiers, sets of Java-valid if-then rules or simple expressions, which were given to BeanShell to execute and score. Better classifiers got to mix their sequences with other good classifiers, and the population got better at the task.

    It was horribly slow, but I was able to produce “good” classifiers for some data under study. I’ll bet GE could be applied to your problem. Just take the G part and strip off the E part, very low level of effort. Check out O’Neill and Ryan. Take care!

    — Calvin Miracle, UofL Libraries

  6. @Calvin Miracle:

    If I understand correctly:
    – generating grammatically-valid programs is easy.
    – generating semantically-defined programs is hard.

    I saw the talk at PLDI and it was quite interesting, it’s a neat idea well executed. It was nice to see they found no bug in CompCert’s verified parts either!

  7. Valentin is correct, generating valid random programs seems to be fundamentally hard. Not hard in a theoretical sense, but engineering-hard, like building a compiler is hard.

  8. I think it’s probably theoretically hard, also, if you want to be extremely general in the random programs you generate, but you may not want to be that general. I seem to recall someone telling me this for generating some functional language: generating some random well-typed programs was not hard; generating members that included the full generality of well-typed programs was extremely hard.

  9. As an example of a theory-hard problem in random testing, I might ask your fuzzer to be capable of generating any program in the language that terminates.

  10. Hey man, no fair asking for undecideable things. But the generation issue holds even if your property of interest is (easily) decided — like many type systems. I recall the problem being something like

    1) Generating syntactically correct code is easy, but it’s not guarantee to be well-typed just by the grammar, of course
    2) You can do it by just “guess and check” (the cheap way to do it in all cases where definedness/well-typed/whatever is decideable in reasonable time
    3) But it turns out the set of well-typed programs if you just generate fully general and then guess-and-check is such that your expected time to push out one well-type program is roughly “wait till the sun burns out, then start getting impatient” if you want programs of modest length