C-Reduce 2.1

Over the weekend we released a new version of C-Reduce, a tool for turning a large C/C++ program into a small one that still meets some criterion such as triggering a compiler bug. There are two major improvements since the last release about a year ago:

  1. We are now able to run “interestingness tests” in parallel. The typical result is that C-Reduce speeds up quite a bit if you give it 4 or 8 cores. The parallelization strategy is explained here, but the speedups are now better than those reported in that article.

  2. We’ve added more than 20 new passes for reducing specific features found in C/C++ (mainly C++) programs. For example, C-Reduce now tries to instantiate a template, collapse a level of the class hierarchy, etc. The result is that C-Reduce’s endgame is significantly stronger for nasty C++ programs.

There are also plenty of minor improvements; for example, we now compile against LLVM/Clang 3.3 instead of requiring a random development version. We’ve made sure that C-Reduce works on Cygwin (for sort of a crappy value of “works” — it’s really slow, we believe due to Cygwin overheads). C-Reduce now supports --slow and --sllooww command line options that typically make test cases a little bit smaller, but at a high cost in reduction time.

Originally, C-Reduce was aimed at reducing programs emitted by Csmith. This wasn’t that hard because these programs contain only a limited set of language features. Over the last year or two the focus has been more on reducing C/C++ code found in the wild. This graph summarizes our current results for 15 compilation units from open source C/C++ programs, each of which causes some compiler to crash:

Whereas programs 8 and 12 stumped C-Reduce 2.0, the new version is able to produce fully reduced test cases. Program 3 could probably be improved a bit more (not all compiler bugs can be triggered by a very small test case, but most can). Programs 6 and 7 show a minor regression: the older C-Reduce was able to take some code out of an enclosing for loop, while the new version is for some reason unable to do the same thing. In the graph, Delta is run twice at level 0, twice at level 1, twice at level 2, and finally twice at level 10. My sense is that running it more times, or at other levels, would not improve its results much.

If you ever report bugs in C/C++ compilers, please give this tool a try.

Finding Undefined Behavior Bugs by Finding Dead Code

Here’s a draft of a very cool paper by Xi Wang and others at MIT; it is going to appear at the next SOSP.

The idea is to look for code that becomes dead when a C/C++ compiler is smart about exploiting undefined behavior. The classic example of this class of error was found in the Linux kernel several years ago. The code was basically:

struct foo *s = ...;
int x = s->f;            /* line 2: dereference of s */
if (!s) return ERROR;    /* line 3: null check */
... use s ...

The problem is that the dereference of s in line 2 permits a compiler to infer that s is not null (if the pointer is null then the behavior is undefined; the compiler can simply ignore this case). Thus, the null check in line 3 gets silently optimized away, and now the kernel contains an exploitable bug if an attacker can find a way to invoke this code with a null pointer. Here’s another example that I like where the compiler silently discarded a security check that invoked undefined behavior. There are more examples in the paper and I encourage people to look at them. The point is that these problems are real, and they’re nasty because they can only be seen by looking at the compiler’s output. Compilers are getting smarter all the time, causing code that previously worked to break. A sufficiently advanced compiler is indistinguishable from an adversary.

So the premise of this paper is that if you write some code that executes conditionally, and that code can be eliminated by a compiler that understands how to exploit undefined behavior, then there’s a potential bug that you’d like to learn about. But how can we find this kind of code? One way would be to instrument the compiler optimization passes that eliminate code based on undefined behavior. This has some problems. First, due to complex interactions between optimization passes in a real compiler, it is not at all straightforward to determine whether a particular piece of dead code is that way due to undefined behavior, vs. being dead in a boring way. Chris Lattner’s article about undefined behavior has some good discussion of this issue. The second problem is that if we, for example, instrument LLVM to detect cases where it optimizes away code due to undefined behavior, this doesn’t tell us much about problems that we might run into when using a different compiler, or using the next version of LLVM. In particular, it would be nice to be able to find “time bombs” — undefined behavior bugs that aren’t yet exploited by a compiler, but that might be exploited in the future.

The approach taken in this paper was to develop a custom undefined behavior detector based on static analysis of a single function at a time. It takes LLVM IR and converts it into a SAT instance that is satisfiable when code can be eliminated. This has a couple of advantages. First, by considering one function at a time, the tool can scale to very large programs (note that a single function at the LLVM level may contain inlined code from other functions — this increases precision without slowing things down too much). Second, modern SAT solvers are quite strong, giving the tool the opportunity to find bugs that cannot yet be exploited by production compilers. Third, a standalone tool avoids becoming entangled with all of the incidental complexity found in a collection of real compiler passes.

At this point you are probably thinking: “I do not want some stupid tool telling me about every piece of dead code in a huge legacy application, because there are millions of those and almost none of them are bugs.” The solution has two parts. First, the most obvious dead code gets eliminated by running LLVM passes like mem2reg, SCCP, and DCE. Second, the authors have a clever solution where they run their tool twice: first without knowledge of undefined behavior, and second with it. A bug is only reported if the second pass can eliminate code that the first one cannot. Based on this technique, the paper reports a very low rate of false positives.

One thing that is interesting about the tool in this paper is that it has very conservative criteria for reporting a bug: either undefined behavior is invoked on all paths (e.g. it finds a type 3 function) or else it finds code that is made dead by exploiting undefined behavior. Although this conservatism could be seen as a disadvantage, in many ways it is a big advantage: the problems the tool flags are probably serious, since the code that gets eliminated by the compiler is often a safety check of some sort. The authors were able to find, report, and convince developers to fix 157 bugs in open source programs, which is impressive. They also provide some good anecdotes about developers who could not be convinced to fix undefined behavior bugs, something I’ve also run into.

In summary, by adopting a solid premise (“developers want to know when code they write can be eliminated based on exploitation of undefined behavior”) the authors of this paper have found a way to home in on a serious class of bugs and also to avoid the false positive problems that might otherwise plague an intraprocedural static analysis tool.

UPDATE from August 2013: STACK is available now.

What Other Dynamic Checkers for C/C++ are Needed?

Detectors for memory-related undefined behaviors in C/C++, while imperfect, are at least something that smart people have spent a lot of time thinking about. Integer-related undefined behaviors have received much less attention, but then again they are a lot simpler than memory unsafety.

Today’s question is: What other checkers should we create? Here are a few. I’m not sure that any of these are really pressing problems in practice, but I think it would be far from useless to test important C/C++ codes under these tools. I’d be happy to hear suggestions for other tools that might be useful.

This discussion of undefined behavior checkers for Clang is useful.

Unspecified order of evaluation of arguments to a function

When you write something like f(g(),h()), the arguments can be evaluated in any order. This matters when the arguments perform I/O or when they modify overlapping state.

Writing a rigorous tool for dynamically finding programs whose behavior is sensitive to argument evaluation order would be difficult, and its performance would likely be bad. In contrast, it would be easy to hack a compiler to randomize (either at compile time or at run time) the order in which arguments are evaluated, at which point we would simply look for test cases that fail when this option is turned on.

In a sensible world, left-to-right evaluation would be specified by the standards. Failing that, individual compilers should specify it. Sometimes people say that the unspecified order of evaluation permits the generation of more efficient code, but it is very hard to believe that this effect is ever usefully exploited. A recent version of gcc that I checked consistently evaluates arguments right-to-left; a recent Clang consistently evaluates them left-to-right. Older versions of gcc would switch orders depending on the optimization level.

Unsequenced modifications to lvalues

Modifying an object more than once within a C/C++ expression without an intervening sequence point, or both modifying it and reading it “other than to determine the new value” to be stored, is undefined. Here’s a program with unsequenced modifications to x:

#include <stdio.h>
#include <limits.h>

int main (void)
{
  unsigned x;
  unsigned *p = &x;  /* handy for a variant that accesses x through a pointer */
  x = 0;
  printf ("%u\n", x++ + --x);  /* undefined: unsequenced modifications to x */
  return 0;
}

And the result:

$ gcc -w side_effect.c ; ./a.out
4294967295
$ gcc -w -O side_effect.c ; ./a.out
4294967294
$ clang -w side_effect.c ; ./a.out
0

For some examples (such as this program) GCC and Clang provide good warnings, but minor modifications such as accessing a variable through a pointer will suppress the warnings without making the underlying problem go away.

I think a dynamic checker for this kind of undefined behavior could be pretty efficient because the total number of objects touched while evaluating an expression is usually pretty small.

Strict aliasing violations

The strict aliasing rules in C/C++ basically require that once you cast a pointer to any pointer type other than char *, you must not dereference the new pointer. Unfortunately I’m not sure how to check for violations short of a full type safety implementation, which is no fun for anyone.

Compiler warnings for strict aliasing violations exist, but seem to be unpredictable. For example, neither GCC nor Clang warns about either c1() or check() in this post. These are both extremely simple and obvious violations of strict aliasing. So it’s hard to see how these compilers could be providing robust warnings about more difficult examples found in real code.