Yesterday Pascal wrote a nice article addressing practical issues in automated testcase reduction. One issue is that C-Reduce tends to give you what you asked for instead of what you really wanted. Of course, this problem is a common one when performing algorithmic searches. Beyond computer science, the same problem comes up in basically every story where a genie is in a position to grant a wish. In all cases, our goal is to suitably constrain the space of solutions. As examples, this post will examine a couple of industrial-strength interestingness tests that Csmith’s driver infrastructure uses to reduce compiler crash and wrong-code bugs as it finds them.
As a quick review, a test case reducer starts out with a large test case that is interesting and its job is to produce the smallest test case it can that is also interesting. So it has the invariant that its current test case is interesting. The current test case is transformed to produce a series of smaller variants; any time such a variant is found to be interesting, it replaces the current test case. The reducer terminates when it reaches a fixpoint; that is, when no transformation in its repertoire results in an interesting variant.
A Compiler-Crash Interestingness Test
Here’s the whole test. First of all, we set limits on CPU time and memory usage:
ulimit -t 120 ulimit -v 8000000 ulimit -m 8000000
This is just common sense when doing any kind of fuzzing-related activity. Random test cases tend to evoke pathological behaviors in systems under test and we’d like to avoid bringing our test machine to its knees when this happens. These resource limits cause the interestingness test to terminate (and the variant to be deemed uninteresting) when a resource limit is exceeded.
The rest of the interestingness test is just a series of conditions; the test case is interesting if all of the conditions are met. The conditions should be ordered in such a way that the reduction happens as rapidly as possible. Basically this means: put conditions near the front of the script if they are likely to fail (permitting early bail-out) or if they run quickly.
The first condition is simply:
gcc -c small.c > gcc.out 2>&1
In other words, if the current variant is a legal compilation unit according to GCC, the compiler driver will return a 0 exit code, indicating that the condition is met. Otherwise, the driver returns non-zero and the interestingness test exits here. This accounts for the common case where C-Reduce emits a syntactically incorrect variant.
The next part of the interestingness test looks at the compiler’s output:
! grep 'data definition has no type or storage class' gcc.out &&\ ! grep 'assumed to have one element' gcc.out &&\ ! grep 'control reaches end of non-void function' gcc.out &&\ ! grep 'return type defaults to' gcc.out
Here we are saying that a variant is uninteresting if it causes GCC to print any of these four warnings. We do not want to submit a test case that, for example, falls off the end of a non-void function. Not only is this kind of thing bad style, but also an obviously undefined test case might cause a compiler developer to reject a bug report, even for a crash bug, if the developer is not in a good mood.
Conditions that are based on compiler warnings are a little unsatisfying. First, warnings are often delivered in a best-effort fashion. That is, compilers cannot always be counted upon to reliably warn about every occurrence of some bad behavior. Warnings about use of uninitialized storage, for example, are notoriously unreliable. The second problem is that here we’ve keyed on the specific syntax of these warnings; if we upgrade to a new GCC that changes the wording, part of our interestingness test will fail. Furthermore, if the system lies about the compiler that is being invoked (for example, Xcode 5 invokes clang when you type “gcc”) then these won’t work at all.
The next condition gets to the heart of the matter:
! gcc-4.8.2 -Ofast small.c -c -w > out.txt 2>&1
Here we insist that some version of GCC, at a certain optimization level, fails to compile the variant. We also need to ensure that it failed in an interesting way, as opposed to for example segfaulting when we’re looking for a different buggy behavior:
grep 'verify_ssa failed' out.txt
And now we’re done. Of course the string that we grep for, as well as the compiler options triggering the crash, are specific to the particular bug we are dealing with.
A Wrong-Code Interestingness Test
Here’s the code.
The central problem in reducing a wrong code test case is that we must reject as uninteresting any variant that relies on undefined or unspecified behavior. C-Reduce is sneaky and it’s going to try thousands and thousands of ways to make the test case smaller, many of which introduce undefined behavior — we need to reject them all. We start off by looking at a considerably larger set of compiler warnings:
clang -pedantic -Wall -O0 -c small.c >out_clang.txt 2>&1 &&\ ! grep 'conversions than data arguments' out_clang.txt &&\ ! grep 'incompatible redeclaration' out_clang.txt &&\ ! grep 'ordered comparison between pointer' out_clang.txt &&\ ! grep 'eliding middle term' out_clang.txt &&\ ! grep 'end of non-void function' out_clang.txt &&\ ! grep 'invalid in C99' out_clang.txt &&\ ! grep 'specifies type' out_clang.txt &&\ ! grep 'should return a value' out_clang.txt &&\ ! grep 'uninitialized' out_clang.txt &&\ ! grep 'incompatible pointer to' out_clang.txt &&\ ! grep 'incompatible integer to' out_clang.txt &&\ ! grep 'type specifier missing' out_clang.txt &&\ gcc -Wall -Wextra -O1 small.c -o smallz >out_gcc.txt 2>&1 &&\ ! grep 'uninitialized' out_gcc.txt &&\ ! grep 'without a cast' out_gcc.txt &&\ ! grep 'control reaches end' out_gcc.txt &&\ ! grep 'return type defaults' out_gcc.txt &&\ ! grep 'cast from pointer to integer' out_gcc.txt &&\ ! grep 'useless type name in empty declaration' out_gcc.txt &&\ ! grep 'no semicolon at end' out_gcc.txt &&\ ! grep 'type defaults to' out_gcc.txt &&\ ! grep 'too few arguments for format' out_gcc.txt &&\ ! grep 'incompatible pointer' out_gcc.txt &&\ ! grep 'ordered comparison of pointer with integer' out_gcc.txt &&\ ! grep 'declaration does not declare anything' out_gcc.txt &&\ ! grep 'expects type' out_gcc.txt &&\ ! grep 'pointer from integer' out_gcc.txt &&\ ! grep 'incompatible implicit' out_gcc.txt &&\ ! grep 'excess elements in struct initializer' out_gcc.txt &&\ ! grep 'comparison between pointer and integer' out_gcc.txt &&\
These checks reject certain classes of obviously undefined code, while permitting all manner of more subtly undefined code to pass through. We’re fine with that for the moment. So now to the crux of the matter:
gcc -O0 small.c -o small1 > /dev/null 2>&1 &&\ RunSafely 5 1 /dev/null out_small1.txt ./small1 &&\ gcc-4.8.2 -Ofast small.c -o small2 > /dev/null 2>&1 &&\ RunSafely 5 1 /dev/null out_small2.txt ./small2 &&\ ! diff out_small1.txt out_small2.txt
This compound condition ensures that both gcc and gcc-4.8.2.are able to successfully compile the variant, that both compiled variants execute successfully, and that their outputs are different. This is just standard differential testing where we have a reference compiler and a compiler that is believed to be broken. We don’t care what the result of the program is: we only care that the two compilers produce executables with diverging behavior. The “RunSafely” script is one we inherited from somewhere, it simply imposes a smaller timeout (5 seconds, in this case) on the command that it runs. The shorter timeout helps the reduction make progress more rapidly. C-Reduce tends to often create variants containing infinite loops.
Finally, we’ll use a more serious undefined behavior checker. We do this last because it is usually the slowest part of the test:
frama-c -cpp-command 'gcc -C -Dvolatile= -E -I.' -val -no-val-show-progress -machdep x86_64 -obviously-terminates small.c > out_framac.txt 2>&1 &&\ ! egrep -i '(user error|assert)' out_framac.txt >/dev/null 2>&1
Frama-C is a static analyzer for C code that happens to have a couple of important properties. First, it is sound, meaning that (for the subset of undefined behaviors that it handles) it never determines that a program is free of undefined behaviors unless that is actually the case. Second, Frama-C actually works when handed a nearly arbitrary piece of C code. This property of “actually working” requires a huge amount of effort and it is not one belonging to every tool that processes C code.
As I remarked above, Frama-C does not look for all of C’s ~200 undefined behaviors, but rather for a useful subset including tricky ones like array bounds errors, signed integer overflows, and use of stack variables whose scope has ended. For all other undefined behaviors, we have to trust that either C-Reduce does not introduce them or else one of the compiler warnings above tells us about them. Empirically, the results of these reductions are free of undefined behavior. There may be some luck involved, but it is a systematic kind of luck that comes from properties of the tools and the language, not a random kind of luck like flipping a coin and getting lots of heads.
One might doubt the wisdom of throwing random code at a static analyzer and hoping for precise results. There are a few reasons why this works here. First, as the command above shows, we use the preprocessor to eliminate all volatile qualifiers in the code that we hand to Frama-C. This is necessary because Frama-C (correctly — in the general case) treats reading from a volatile as returning an arbitrary value. However, we happen to know that the only value that will be loaded from a volatile is the last one stored to it, meaning that Frama-C will return the correct result even without these qualifiers. Second, the generated programs contain no other sources of nondeterminism such as system calls. Third, Frama-C tries very hard to avoid losing precision during its analysis. For the “closed” programs emitted by Csmith, it is able to act as an interpreter, never losing any precision at all. This would not be the case in general. The -obviously-terminates option is an optimization added by Pascal to avoid overhead due to checking for a fixpoint in the static analyzer.
In summary, reduction of test cases triggering wrong-code bugs is basically impossible without a tool such as Frama-C. The fact that Csmith and Frama-C work so well together should be viewed as a combination of much hard work and also a collection of lucky correspondences between the design points targeted by these two tools. For the Nth time I’ll plead for someone to write Frama-C++ since at present we can’t reliably reduce C++ wrong-code bugs.