Design and Evolution of C-Reduce (Part 2)


Part 1 of this series introduced C-Reduce and showed how it combines a domain-independent core with a large collection of domain-specific passes in order to create a highly effective test-case reducer for C and C++ code. This part tells the rest of the story and concludes.

Parallel Test-Case Reduction

C-Reduce’s second research contribution is to perform test-case reduction in parallel using multiple cores. The parallelization strategy, based on the observation that most variants are uninteresting, is to speculate along the uninteresting branch of the search tree. Whenever C-Reduce discovers an interesting variant, all outstanding interestingness tests are killed and a new line of speculation is launched. This is the same approach that was subsequently rediscovered by the creator of halfempty.

C-Reduce has a policy choice between taking the first interesting variant returned by any CPU, which provides a bit of extra speedup but makes parallel reduction non-deterministic, and only taking an interesting variant when all interestingness tests earlier in the search tree have reported that their variants are uninteresting. We tested both alternatives and found that the speedup due to non-deterministic choice of variants was minor. Therefore, C-Reduce currently employs the second option, which always follows the same path through the search tree that non-parallel C-Reduce would take. The observed speedup due to parallel C-Reduce is variable, and is highest when the interestingness test is relatively slow. Speedups of 2-3x vs. sequential reduction are common in practice.
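To make the speculation concrete, here is a minimal sketch, not C-Reduce's actual implementation, of one parallel step. It assumes two hypothetical helpers: next_variants(state, n), which produces the next n candidate variants in the order a sequential reducer would try them, and is_interesting(variant), which runs the user-supplied interestingness test and returns a boolean.

    # Sketch only: speculate that upcoming variants are uninteresting by
    # testing several of them at once, but commit to a variant only after
    # every test earlier in sequential order has reported "uninteresting".
    from concurrent.futures import ProcessPoolExecutor

    def parallel_step(state, next_variants, is_interesting, ncores=4):
        """Return the first interesting variant in sequential order, or None."""
        candidates = next_variants(state, ncores)
        with ProcessPoolExecutor(max_workers=ncores) as pool:
            futures = [pool.submit(is_interesting, v) for v in candidates]
            for v, fut in zip(candidates, futures):
                if fut.result():            # all earlier tests were uninteresting
                    for later in futures:   # abandon the outstanding speculation
                        later.cancel()      # (C-Reduce kills the running tests)
                    return v
        return None                         # every candidate was uninteresting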

Applicability and Moving Towards Domain-Independence

Something that surprised us is how broadly applicable C-Reduce is to test cases in languages other than C and C++. Our users have found it effective in reducing Java, Rust, Julia, Haskell, Racket, Python, SMT-LIB, and a number of other languages. When used in this mode, the highly C/C++-specific C-Reduce passes provide no benefit, but since they fail quickly they don’t do much harm. (C-Reduce also has a --not-c command line option that avoids running these passes in the first place.)

One might ask why C-Reduce is able to produce good results for languages other than C and C++; the answer appears to be based on the common structural elements found across programming languages in the Algol and Lisp families. In contrast, in our limited experience, C-Reduce does a very poor job reducing test cases in binary formats such as PDF and JPEG. Other reducers, such as the afl-tmin tool that ships with afl-fuzz, work well for binary test cases.

A consequence of the modular structure of C-Reduce is that while its transformation passes are aimed at reducing C and C++ code, the C-Reduce core is completely domain-independent. C-Reduce has been forked to create a highly effective reducer for OpenCL programs. We believe it would be relatively straightforward to do something similar for almost any other programming language simply by tailoring the passes that C-Reduce uses.
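As a rough illustration of why retargeting is straightforward, a pass can be thought of as a generator of candidate variants of the current test case. The sketch below is not C-Reduce's actual pass interface (which is written in Perl); it only shows the shape of the idea. A pass like this one, which deletes one line at a time, is already language-agnostic, and a fork targeting another language would mostly add passes that understand that language's syntax.

    # Conceptual sketch of a reduction pass as a variant generator; this is
    # not C-Reduce's real pass API, just an illustration of the idea.
    def remove_one_line(test_case):
        lines = test_case.splitlines(keepends=True)
        for i in range(len(lines)):
            # propose the test case with the i-th line deleted
            yield "".join(lines[:i] + lines[i + 1:])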

Limitations

When used for its intended purpose, when does C-Reduce work badly? We have seen two main classes of failure. First, C-Reduce can be annoyingly slow. This typically happens when the passes early in the phase ordering, which are intended to remove a lot of code quickly, fail to do this. Second, highly templated C++ sometimes forces C-Reduce to terminate with an unacceptably large (say, >1 KB) final result. Of course this is better than nothing, and subsequent manual reduction is usually not too difficult, but it is frustrating to have written 69 different Clang-based passes and to still find effectively irreducible elements in test cases. The only solution — as far as we know — is to strengthen our existing transformation passes and to create more such passes.

A small minority of compiler bugs appears to be impossible to trigger using a small test case. These bugs are exceptions to the small scope hypothesis. They typically manifest only when some resource or limit inside the compiler is stressed. For example, a bug in register spilling code requires the test case to use enough registers that spilling is triggered; a bug in logic for emitting long-offset jumps requires the test case to contain enough code that a long offset is required; etc. These test cases are just difficult to work with, and it is not clear to us that there’s anything we can do to make it easier to debug the issues that they trigger.

C-Reduce Design Principles

In summary, C-Reduce was designed and implemented according to the following principles:

  1. Be aggressive: make the final reduced test case as small as possible.
  2. Make the reducer fast, for example by exploiting parallelism, ordering passes carefully, and avoiding unnecessary I/O traffic, when this can be done without compromising the quality of the final output.
  3. Make it as easy as possible to implement new passes, so that many domain-specific passes can be created.
  4. Keep the C-Reduce core domain-independent.
  5. Focus only on reduction, delegating all other criteria to the user-supplied interestingness test.

Directions for Future Test-Case Reduction Research

Although perhaps a few dozen papers have been written about test-case reduction since Hildebrandt and Zeller’s initial paper 19 years ago, I believe that this area is under-studied relative to its practical importance. I’ll wrap up this article with a collection of research questions suggested by our experience in over a decade of work on creating a highly aggressive reducer for C and C++.

What is the role of domain-specificity in test-case reduction? Researchers who cite C-Reduce appear to enjoy pointing out that it is very domain-specific (nobody seems to notice that the C-Reduce core is domain-independent, and that the pass schedule is easy to modify). The implication is that domain-specific hacks are undesirable and, of course, an argument against such hacks would be forceful if backed up by a test-case reducer that produced smaller final C and C++ code than C-Reduce does, without using domain knowledge. So far, such an argument has not been made.

The open research question is whether domain knowledge is actually necessary, or if a domain-independent test-case reducer can beat C-Reduce at its own game. The most impressive such effort that we are aware of is David MacIver’s structureshrink, which uses relatively expensive search techniques to infer structural elements of test cases that can be used to create variants. Anecdotally, we have seen structureshrink produce reduced versions of C++ files that are smaller than we would have guessed was possible without using domain knowledge.

What is the role of non-greedy search in test-case reduction? In many cases, the order in which C-Reduce runs its passes has little or no effect on the final test case. In other words, the search is often diamond-shaped, terminating at the same point regardless of the path taken through the search space. On the other hand, this is not always the case, and when the search is not diamond-shaped, a greedy algorithm like C-Reduce’s risks getting stopped at a local minimum that is worse than some other, reachable minimum. The research question is how to get the benefit of non-greedy search algorithms without making test-case reduction too much slower.
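For reference, a minimal sketch of the greedy strategy in question looks something like the following, with passes and is_interesting as stand-ins rather than C-Reduce's actual interfaces. Because every accepted step must itself be interesting and strictly smaller, the loop can halt at a local minimum that a non-greedy search might escape.

    # Greedy fixpoint reduction, sketched: keep any smaller interesting
    # variant, never backtrack, stop when no pass makes progress.
    def greedy_reduce(test_case, passes, is_interesting):
        progress = True
        while progress:
            progress = False
            for p in passes:
                for variant in p(test_case):
                    if is_interesting(variant) and len(variant) < len(test_case):
                        test_case = variant   # commit immediately
                        progress = True
                        break                 # derive later variants from the new state
        return test_case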

What other parallelization methods are there? C-Reduce’s parallelization strategy is pretty simple, gives a modest speedup in practice, and always returns the same result as sequential reduction. There must be other parallel test-case reduction strategies that hit other useful points in the design space. This is, of course, related to the previous research question. That is, if certain branchings in the search tree can be identified as being worth exploring in both directions, this could be done on separate cores or machines.

What is the role of canonicalization in test-case reduction? A perfectly canonicalizing reducer — that reduces every program triggering a given bug to the same reduced test case — would be extremely useful. Although it seems impossible to achieve this in all cases, there are many relatively simple strategies that can be employed to increase the degree of canonicalization, such as putting arithmetic expressions into a canonical form, assigning canonical names to identifiers, etc. C-Reduce has a number of transformations that are aimed at canonicalization rather than reduction. For example, the reduced test case at the top of part 1 of this piece has four variables a, b, c, and d, which appear in that order. I believe that more work in this direction would be useful.
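As a toy example of one canonicalization step, the sketch below renames a given set of identifiers to a, b, c, and so on, in order of first appearance. C-Reduce's real renaming is done by a Clang-based pass that understands C/C++ scoping; this regex version is only illustrative and would mangle real code.

    # Toy identifier canonicalization: not how C-Reduce does it, but it
    # shows why canonical names make reduced test cases easier to compare.
    import re
    from string import ascii_lowercase

    def canonicalize_names(src, names_to_rename):
        mapping = {}
        def rename(match):
            name = match.group(0)
            if name not in names_to_rename:
                return name
            mapping.setdefault(name, ascii_lowercase[len(mapping)])
            return mapping[name]
        return re.sub(r"\b[A-Za-z_]\w*\b", rename, src)

    print(canonicalize_names("int foo = bar + baz;", {"foo", "bar", "baz"}))
    # prints: int a = b + c;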

Can we avoid bug hijacking? Test-case reduction sometimes goes awry when the bug targeted by the reduction is “hijacked” by a different bug. In other words, the reduced test case triggers a different bug than the one triggered by the original. During a compiler fuzzing campaign this doesn’t matter much, since one bug is as good as another, but hijacking can be a problem when the original bug is, for example, blocking compilation of an application of interest. Hijacking is particularly common when the interestingness test is looking for a non-specific behavior such as a null pointer dereference. C-Reduce pushes the problem of avoiding hijacking onto the user, who can, for example, add code to the interestingness test looking for specific elements in a stack trace. The research question here is whether there are better, more automated ways to prevent bug hijacking.
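For instance, an interestingness test along the following lines accepts a variant only if the compiler crashes and the crash output still mentions the function where the original bug lives, which makes it much harder for an unrelated crash to hijack the reduction. This is only a sketch; the compiler invocation and the "RegisterCoalescer" string are placeholders, not taken from any particular bug.

    #!/usr/bin/env python3
    # Sketch of an anti-hijacking interestingness test: exit 0 (interesting)
    # only if the compiler crashes *and* the crash output names the expected
    # function. The compiler, flags, and function name are placeholders.
    import subprocess, sys

    result = subprocess.run(["clang", "-O2", "-c", "test.c"],
                            capture_output=True, text=True)
    crashed = result.returncode != 0
    same_bug = "RegisterCoalescer" in result.stderr
    sys.exit(0 if crashed and same_bug else 1)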

Obtaining C-Reduce

Binary C-Reduce packages are available as part of software distributions including Ubuntu, Fedora, FreeBSD, OpenBSD, MacPorts, and Homebrew. Source code can be found here.

Acknowledgments

C-Reduce was initially my own project, but by lines of code the largest contributor by a factor of two is my former student Yang Chen, now a software engineer at Microsoft. Yang wrote effectively all of the Clang-based source-to-source transformation code, more than 50,000 lines in total. Eric Eide, a research professor in computer science at the University of Utah, is the other major C-Reduce contributor (and gave useful feedback on a draft of this piece). Our colleagues Pascal Cuoq, Chucky Ellison, and Xuejun Yang also contributed to the project, and we have gratefully received patches from a number of external contributors. Someone created a fun visualization of the part of C-Reduce’s history that happened on GitHub.


7 responses to “Design and Evolution of C-Reduce (Part 2)”

  1. Those open questions have some interactions I find interesting:
    – The bug-hijacking question seems related to the non-greedy question: If you reduce all the paths, then it can’t hijack as it will just *also* find the other bug (unless *all* the reductions hijack).
    – The non-greedy question seems related to the canonicalization question (and the diamond comment): If a canonicalization can converge with high probability, then with a non-greedy (with duplicate detection) approach the “effective branching factor” will decrease with depth, and improve non-greedy’s desirability.

    One other question I’d be interested in is: How do the answers change with the degree of available parallelism?
    In the extreme, if that degree is larger than the number of producible “next steps” then exploiting that more or less *requires* a non-greedy approach. If enough steps fail quickly (e.g. syntax errors), then that point might come much sooner. Cloud computing might actually allow parallelism that comes within sight of that.

  2. Hi Benjamin, these are interesting questions! I hadn’t thought about all of these.
    I think there’s a problem with “reduce all paths to avoid hijacking” which is that we’ll end up with a lot of reduced test cases and no way to tell which ones trigger which bug. If we knew how to automatically figure out which bugs are triggered, we’d have just included that in the original interestingness test.

  3. I totally agree: with a non-greedy approach and without strong canonicalization, de-duplication becomes a hard but necessary problem.

    On the other hand, without strong canonicalization the branching factor becomes intractable, so some sort of pruning would be needed regardless. Any pruning would tend to increase the risk of hijack. Maybe some happy medium can be found?

    On a third hand, there might be some benefits to having multiple distinct local minimums to report if you can identify them as related. I wonder if line coverage/profiling of the compiler would help there? If many bugs result from corner cases, then they may stand out as statistical clusters, if you can find the right measures to inspect (which IIRC is a well understood stats problem).

  4. @Alex: After skimming that paper, it’s not clear to me what the coverage is being used for. My guess is that the goal was to select a small set of test cases that cover a large amount of tested code. Or maybe the reverse: as a metric to measure how well a selection criterion is doing at covering code. That sounds related to what I was thinking of, but may not be directly applicable. Also, it sounds like it’s only inspecting function coverage. I wonder if branch coverage would be more powerful?

    While thinking about it, it seems to me that questions about techniques for avoiding hijacking and doing de-duplication might actually not be that hard to explore; tedious, but not technically challenging. How about this for a proposed setup:
    1) Pick a pair of versions of a compiler that differ only in a set of fixes to known bugs. Call them Bugs and Clean.
    2) Make sure that the fix for each distinct bug can be patched (or reverted) in isolation and build each of those compilers. Call this set BugN.
    3) Add a third compiler from a different implementation (e.g. LLVM vs. GCC). Call it Base.
    4) Fuzz for test cases that differ between Bugs and Clean but not between Clean and Base.
    (This should select mostly for cases that trigger bugs from the known set.)
    5) Reduce and de-duplicate (using Bugs and Base) these test cases using the technique under study, preserving the reduction trace.
    6) Rebuild and test the trace using each compiler in BugN to detect which bug(s) are being triggered at each step.

    From this, it should be easy to detect hijacking and to find bugs that should be de-duplicated. Unfortunately, this is inherently unusable while trying to find *new* bugs, so its utility is likely strictly limited to research and not application.

  5. Ben, we just used function coverage as a feature for the clustering/furthest-point-first algorithm distance metric, as in the PLDI 2013 paper with John (https://agroce.github.io/pldi13.pdf). So function coverage is part of how we were hoping to distinguish bug A from bug B, and the “reduction trail” is keeping a bunch of the history of the delta-debugging around, more for more information than to avoid hijacking.