The Art and Science of Testcase Reduction for Compiler Bugs

Test case reduction is a common problem faced by programmers where some large input (or more generally, some complicated set of circumstances) causes a program to fail, but we wish to know the smallest input (or the simplest circumstances) that causes the same failure. Reduced test cases are important because:

  • Since the bulk of the reduced test case is usually relevant to the problem at hand, a good guess about the underlying problem can often by made simply by inspecting the test case.
  • Usually, the system under test doesn’t spend very long executing the reduced test case, making debugging much easier.
  • A reduced test case makes a good regression test.
  • Even if the original failure-inducing test case is unreportable because it is proprietary, the reduced test case may be simple enough that it is reportable.

Test case reduction is particularly difficult when parts of the input are implicit — time, context switches, cache misses, etc. Reducing inputs for compiler bugs — the topic of this post — is both easier and harder than other common test case reduction problems. They are easier because compiler behavior (usually) depends only on the contents of a collection of source code files comprising its input (this may not be the case when the compiler executes non-deterministically or when it reads hidden state like precompiled header files). Test case reduction for compilers is harder because the inputs are highly structured. Thus, it may be impossible to get rid of some piece of the input (e.g. a struct field) because it is referenced by other parts of the input.

The GCC developers have written about testcase reduction here and here. LLVM comes with its own test case reducer called Bugpoint, and reduction is also discussed here. The most commonly used automated test case reduction tool for compiler inputs is delta, an implementation of the ddmin algorithm. However, delta improves upon the original by exploiting the natural hierarchy found in block-structured programming languages.

A basic strategy for reducing a test case has been known for a long time. It is:

  1. Remove part of the input.
  2. See if it still causes the failure.
  3. If yes, go to step 1. If no, backtrack to the last version that caused the failure and then go to step 1.

This is boring and easy. The tricky thing is that when a fixpoint is reached (no part of the input can be removed without breaking the test case), the resulting test case is probably not the smallest one that triggers the bug. There are several reasons for this:

  • The definition of “part of the input” may not capture necessary transformations. For example, the delta tool does not understand how to remove an argument from a function and also from all calls to the function (no language-independent tool could be expected to do this).
  • The search is greedy; it may have missed a branch much earlier in the search tree that lead to a smaller input.
  • The actual smallest input triggering the bug may bear no resemblance to the testcase that we have.

Escaping local minima requires at least some domain specificity, and at most a deep understanding of the system under test. Let’s look at an example. A few weeks ago I reported a bug where this program causes GCC to crash:

struct S0
  int f1:1;
int g_155;
int g_229;
int g_360;
struct S0
func_93 (int p_94, int p_95, int p_96)
  struct S0 l_272 = {
  for (; g_155;)
    if (0)
      l_272.f1 = 1;
  for (;;)
      unsigned l_663 = 94967295;
      if (g_229++, g_360 = p_94 || (l_272.f1 &= l_663))

As reduced testcases go, this is perfectly fine. However, Jakub Jelinek, a GCC developer, posted a better version:

struct S { int s : 1; };
int a;

foo (int x, int y)
  struct S s;
  s.s = !!y;
  while (1)
      unsigned l = 94967295;
      a = x || (s.s &= l);

While this is clearly derived from the testcase that I reported, it was not produced simply by deleting stuff. Jakub made changes such as:

  • Simplifying identifiers
  • Changing a function to return void and to eliminate an argument
  • Factoring code out of a conditional
  • An interesting refactoring of some loop code into s.s = !!y;

Why does this matter?  Because it would be awesome if we could automate these transformations and save Jakub some time. The first three are clearly possible, but I’m not sure there’s any general version of the fourth transformation.

It turns out that Jakub is kind of an unsung testcase reduction hero. Consider this awesome example, where his reduced test case is not even in the same programming language as the original. Other good ones are PR 48305, PR 47139, PR 47141, and PR 45059. All of these make my reducers look kind of bad. Many more excellent reduced GCC testcases have been created by people like Richard Guenther, Volker Reichelt, and Andrew Pinski.

What other tricks do people use to create the smallest possible test cases?

  • Inlining a function body
  • Declaring multiple variables in a single line
  • Removing a level of indirection from a pointer or a dimension from an array
  • Replacing a struct or union with one or more scalars

A final reduction technique is to deeply understand what is going wrong in the compiler and trigger that behavior in some different way. For example, see the introduction of __UINTPTR_TYPE__ in PR 47141.

In summary, testcase reduction is both an art and a science. The science part is about 98% of the problem and we should be able to automate all of it. Creating a test case from scratch that triggers a given compiler bug is, for now at least, not only an art, but an art that has only a handful of practitioners.


17 responses to “The Art and Science of Testcase Reduction for Compiler Bugs”

  1. Why is the interesting refactoring hard? This one, at least, seems totally automatable. There are three obvious reductions, and one nonobvious:

    1. (if (0) S1 else S2;) => S2;
    2. (for (; E1; ) S1;) => (if (E1) S1;) where S1 cannot modify result of E1
    3. (X = [undefined]; S1;) => (X = 0; S1;) [this is the nonobvious one]
    4. (X = 0; if (E1) X = 1;) => (X = !!E1;)

    There’s probably a finite number of similar patterns. (!!y is a lovely hack one frequently sees in Javascript for some reason.)

    You’re not saying that your test case is minimized, right?, just that it’s small?

  2. Hi Eddie– yeah this is one of those cases where “minimized” is a lie and we just mean “smallish.”

    You’re right about the transformation, I’ll see about making it happen. This kind of structured transformation is a weak point in our tools at the moment, unfortunately. I was kind of wigged out by the bang bang too. It has appeared in Jakub’s reduced tests before, so there’s something going on with that…

  3. “!!” is C-derivativish for “reduce to boolean”. And for a long time C didn’t even *have* booleans, making it a hack of a hack. I’m not quite sure why people consider “!!x” lovelier than “x != 0”, though. Maybe it’s because C’s notion of “0” is overloaded. Or because of the old “less keystrokes is always better” maxim, which somehow never applies to whitespace… Or because it’s just considered smart (in all the meanings of the word).

    C++ folks are exempt because they can never quite be sure which operators are equivalent, Javascript folks are exempt because they can never quite be sure what value is being used to indicate “nothing” today (and so “!!x” is not equivalent to “x != 0” in Javascript). I suppose reducing the cognitive load across languages would be one reason to favor “!!x” — it works the same in multiple languages.

  4. I think the issue around this being hard is that the set of reduction steps that can be used is effectively unlimited. Maybe what is needed is a reduction scripting system:

    Build the simplest source->AST/AST->source pair you can, and add an scripting language API in between them. From there you could build up a corpus of “steps” that have worked in the past and use some sort of heuristics, random selection or the like to select what to apply when. The application of the steps sounds like an interesting AI challenge. As a first thought, some combination of A* and an autonomous classification engine with feedback (did it reduce the test case and is it still valid?) might be interesting.

    Brother, can you spare a grad student? (Or three?)

  5. bcs, you are speaking my language here. We already have sort of a generic reducer that takes transformations as plugins; I’m going to write it up soon, this post was sort of paving the way for a couple more advanced topics.

    Scripting the transformations themselves would be really cool. There’s this whole area of CS about rewriting systems that I know nothing about, maybe that stuff comes in handy.

    A* would be interesting to try. Currently the algorithm is purely greedy: try every possible transformation until a fixpoint is reached. As you can guess, this is super slow unless we’re careful to run the transformations with the biggest gains first. I’m kind of afraid to add backtracking since reduction already takes a long time sometimes.

    Re students, I just heard last night that I got a Google research award that will pay one of my people for a year to work on exactly this stuff.

  6. How much concurrency do you have available? For something with such a large transition cost, I would think A* would distribute beautifully. As a side effect, that would likely end up with a nice large corpus of triggering/non-triggering pairs and the shape of that edge might be interesting to the dev doing the debugging.

  7. bcs, there’s a ton of concurrency here. Parallel A* is a great idea, I’ll try it out. Needed an excuse to buy a 6-core machine for my group anyway :). Distributing across a network will be a big win initially, but not necessarily in the end since as testcase size gets smaller, tests become quite cheap.

    Re. the corpus of triggering/non-triggering test cases, I agree with your intuition that this is valuable. We have been brainstorming about ways to exploit these, but don’t have anything concrete yet.

  8. Re: the network distribution not being a win in the end-game: that, among other things, assumes you only converge a single branch. The implementation I was thinking of would use as much compute as you can get with each node pulling an item from the queue, generating a collection of mutations, testing them all and injecting all the triggering cases back into the queue. This would run on all nodes until some condition is met: e.g. the next smallest unprocessed item is more than a given amount larger than the best item found or some amount of wall time has passed or you have another case you want to reduce. Parallelism in the end equates to backtracking.

    As for examining the edges: If the deltas are small, just looking at random data points in a diff tool might work to some degree.

  9. 6 cores? Drop me a line and I can arrange to run your code on a lot more cores 🙂

  10. Coming at it from the functional programming perspective, it’s clear that a great number of these reductions are just partial evaluation. The classic example of partial evaluation which imperative programmers would be familiar with is constant propagation. Another is to replace uses of a variable by its value, when the variable isn’t being modified. (Pruning pointer chains is essentially the same thing, just inlining the pointer value instead of inlining an int value or a function value.) The problem here is that the syntax of the language is getting in the way of being able to see that it’s just partial evaluation.

    Indeed, it seems like partial evaluation is the very essence of (algorithmic) test case reduction. When you have an original program that exhibits the bug, to find the smallest (congruent) program that expresses the same bug, all we want to do is keep evaluating the program as much as possible while ensuring that we only do non-buggy evaluations. Whatever is left unevaluated at the end must be the stuff that gets evaluated in the wrong way.

    Also, this frame of mind may make it more tractable to determine where the meaningful choice points are in the search. Whenever we have two evaluation steps that we can perform simultaneously without affecting the final semantics, it doesn’t matter which one we do first (this is confluence of lambda-calculi); however, there’s a meaningful choice to make whenever we have two evaluation steps where (a) performing one of them renders the other invalid, or (b) even though it’s valid to perform both reductions at once, doing so would eliminate the buggy behavior.

    As far as taking these observations and moving forward, the question is how to use partial evaluation in test case reduction with a minimum of work on your part. Surely there are C to C optimizing compilers already out there that you could use? Especially if you can get one that allows you to say you only want the safe optimizations— things like constant propagation, function inlining, loop invariant code motion, etc, but excluding esoteric things that are architecture-dependent and excluding weakening optimizations which do not entirely preserve the semantics of the original (though those could be worth considering too). Of course, even with a C to C optimizing complier, what you really want is something you can tune so that it only does one optimization step at a time, instead of doing everything.

    If there is no such tool already out there, then you may consider using some variant of lambda-calculus as the internal representation when writing your own. The reason why is that it makes it clear that removing unused function arguments is just partial evaluation, and it could help reduce the implementation cost since you could implement things like reducing “if (0) E1 then E2;” to “E2;” in the exact same way as function inlining. Also there’s been a lot of research on partial evaluation in the context of lambda-calculi (all that rewriting systems stuff that you know nothing about 😉 Of course, the downside is that it’d mean that you couldn’t draw as much on all the prior work in optimizing C code. But surely there’s a good way to handle the tradeoff; maybe just making the connection is good enough to simplify the implementation without actually needing to implement it as if C were a functional language.

    If you’re interested in pursuing this topic, send me an email and we can talk more about it.

  11. Matt- cool! But really, I think it’s the random testing that needs the massive resources, the testcase reduction isn’t so critical.

    Wren, I agree that the analogy between reduction and optimization is a good one! But a crucial difference is that reduction transformations do not need to be (and in fact we often don’t want them to be) semantics-preserving. For example, one very useful transformation simply replaces a constant, variable, or subexpression with 0.

    We are right now shopping around for a good C->C translator. My group has lots of experience with CIL, but we don’t think it’ll work here because it’s pretty invasive, making lots of little transformations we don’t want, that are sometimes going to break testcases. My opinion had been that Rose was a solid choice, but my student Yang Chen wants to use Clang. Since he’s going to do all the actual work here, I suspect he’ll win this argument.

    For better or worse, my brain is wired for imperative programming. Oddly, most of my students tend to be good functional programmers….

  12. Yeah, it’s true that you don’t always want test-case reduction to be semantics preserving (or rather, the only semantics worth preserving are the buggy ones). However, the issues you brought up sound like wanting to add all the semantics-preserving transformations to the traditional test-case transformations. Well, changing the return type to void for a function with an infinite loop is more of a type-inference thing, but still.

    Thanks for the Rose pointer. I hadn’t heard of that one before.

  13. Wren, agreed — we’ve implemented most or all of the elimination transformations that are needed, and the remaining transformations are basically source-to-source compiler optimizations. No fun. We need to steal as much of this as possible.

  14. Off topic, but what the hell. Once you get used to !!x it’s not a bad idiom. There are certainly worse. A JS idiom for “Do x and y have different truthiness?”: “!x != !y”. F*CK YEAH.

    Some lingering language lawyer doubts settled: this transformation
    3. (X = [undefined]; S1;) => (X = 0; S1;)
    is not necessary for your example, I forgot that the initialization”struct foo x = {};” will set unmentioned fields of x to 0.

  15. Eddie, doesn’t that code also work for comparing truthiness in C/C++, or is there a better idiom? Three excls in one expression definitely wins.

    Ok, I need to make these loop reductions work. I mean I need to convince a student to do it. I’ll just be sitting in my office reading HN and getting pissed off.

  16. It does work in C/C++ too, but is just less important there. Here are some speculative reasons.

    – JS does not have casts. The C/C++ style would be “(bool) x != (bool) y”; you cannot write this in JS; you can write “Boolean(x) != Boolean(y)”, but that style is really non-idiomatic.

    – JS && and || return the last value evaluated, rather than a boolean or equivalent as in C/C++.

    1 && 2 ==> 2
    “” && true ==> “”
    0 || {hello: “John”} ==> {hello: “John”}

    Very useful behavior. But it means that some “boolean” operations, like == and >, always return either false or true, and other “boolean” operations, like && and ||, can return any type. You get used to the need to “cast to true boolean”.