Generating a Random Program vs. Generating All Programs

Generating all possible inputs — up to some maximum length — to a software system is one way of creating test cases, and this technique even has a name: bounded exhaustive testing. Back when we were doing Csmith, my then-student Yang Chen spent a while on a bounded exhaustive C program generator which was in fact pretty decent at finding compiler bugs and had the added advantage of not requiring a test case reducer: if you try test cases in order of increasing size, you’re guaranteed that the first input that triggers a bug is already minimal. In the end, however, Csmith worked better and we dropped the exhaustive generator.

Lately I’ve been working on a little exhaustive LLVM function generator. (NOTE: be careful about running this program: it not only writes a lot of files but is also a barely-controlled forkbomb. It would be better to just download a tarball of its output.) Here I don’t want to write about what we’ve learned using it, but rather about writing such a tool.

The first observation is obvious: the difference between a random generator and an exhaustive generator lies entirely in the implementation of the choice operator. The random tool makes a random choice; the exhaustive generator makes every choice. Here you can see my choice implementation, which uses fork() since it is simple and fairly efficient.
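To make this concrete, here is a minimal sketch (not the implementation linked above) of what a fork()-based choice operator can look like; the names choose and exhaustive are made up for the illustration. In fuzzer mode, choose(n) returns one random alternative; in exhaustive mode it forks so that every alternative is explored by some process, which is also why such a generator is a barely-controlled forkbomb.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int exhaustive = 1;              /* 0 = fuzzer mode, 1 = exhaustive mode */

static unsigned choose(unsigned n) {
  if (!exhaustive)
    return rand() % n;                  /* random mode: pick one alternative */
  for (unsigned i = 0; i + 1 < n; ++i) {
    if (fork() == 0)
      return i;                         /* child process explores alternative i */
    /* the parent falls through and will take a later alternative */
  }
  return n - 1;                         /* this process takes the last alternative */
}

int main(void) {
  unsigned a = choose(2);
  unsigned b = choose(3);
  printf("a=%u b=%u\n", a, b);          /* exhaustive mode prints all six combinations */
  while (wait(NULL) > 0)
    ;                                   /* reap any children this process created */
  return 0;
}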

The rest of this piece is about the less obvious differences between writing a random generator and writing an exhaustive generator. First, most random program generators contain code like this:

restart:
  ch1 = choose(I);
  ch2 = choose(J);
  if (!compatible(ch1, ch2))
    goto restart;

In other words, when the generator gets into a bind, it simply restarts an entire operation. As long as this will eventually succeed, there’s no problem, and in practice a generator that can use this idiom will be moderately simpler than one that doesn’t use it. For exhaustive generation there are two problems. First, wasting choices increases the generator’s branching factor, potentially creating a huge increase in the number of states that must be explored. Second, the exhaustive generator might not even terminate, depending on what kind of “fuel” it uses. There’s no problem if the fuel is the number of choices, but if we want to use a different, user-meaningful kind of fuel (LLVM instructions, for my generator above) then this idiom leads to an infinite fork loop.

To write the same sequence of choices without a restart loop, we end up with this kind of code:

  ch1 = choose(I);
  ch2 = choose(J - incompatible_choices(ch1));
  ch2 = expand(ch1, ch2);

In other words, we need to always choose from a densely populated range of choices and then expand the choice out to the original, sparser choice space. Why does this sort of thing happen in practice? Let’s say that a function in the program we’re generating takes two arguments that should be distinct. It’s easy to generate two random arguments and then restart if they happen to be the same, but slightly harder to generate a random argument and then a second random argument that is guaranteed to be different from the first. The painfulness increases rapidly with the sophistication of the constraints that need to be respected. I’m not sure that it would have been possible to implement Csmith in the non-restart style.
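As a hedged example of this idiom, here is one way the two-distinct-arguments case can be handled without a restart loop, reusing the choose() operator sketched earlier (the function name and signature are invented for illustration):

/* Pick two distinct argument indices out of num_args candidates.  The
 * second choice is made from a dense range with one fewer element and
 * then expanded past the first choice, so no retry is ever needed. */
static void pick_two_distinct(unsigned num_args, unsigned *arg1, unsigned *arg2) {
  *arg1 = choose(num_args);
  *arg2 = choose(num_args - 1);   /* dense range: excludes exactly one value */
  if (*arg2 >= *arg1)
    ++*arg2;                      /* expand: skip over the first argument */
}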

The next issue is the balancing of probabilities that is required to create a good random program generator. This is really hard; we spent an enormous amount of time tuning Csmith so that it often generates something useful. Doing a bad job at this means that a fuzzer — even if it is in principle capable of generating interesting outputs — almost never does so. For example, the main value generation function in my LLVM generator is structured as a sequence of coin tosses. This is fine for exhaustive generation but creates an extremely skewed distribution of programs when used in fuzzer mode.
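A hypothetical chooser with this structure (the value kinds below are invented for illustration) makes the skew easy to see: exhaustive generation visits all four outcomes, but in fuzzer mode they occur with probabilities 1/2, 1/4, 1/8, and 1/8.

/* Each call to choose(2) acts as one coin toss. */
enum value_kind { USE_ARGUMENT, USE_CONSTANT, USE_UNDEF, USE_PREV_RESULT };

static enum value_kind pick_value_kind(void) {
  if (choose(2) == 0) return USE_ARGUMENT;     /* probability 1/2 in fuzzer mode */
  if (choose(2) == 0) return USE_CONSTANT;     /* probability 1/4 */
  if (choose(2) == 0) return USE_UNDEF;        /* probability 1/8 */
  return USE_PREV_RESULT;                      /* probability 1/8 */
}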

The final difference between random and exhaustive generation is that exhaustive generators end up being highly sensitive to incidental branching factors. For example, if my LLVM function generator is permitted to simply return one of its arguments (with the actual contents of the function being dead), the number of programs that it generates (for a given number of instructions) increases significantly, with no added testing benefit. Similarly, if we generate instructions that fold (for example, "add 1, 1"), we get another massive increase in the number of outputs (though these are not useless, since they test the folder — something that we might or might not be interested in doing). Another massive blowup comes from the necessity of choosing every possible value for each literal constant. This is a bit of a showstopper when generating C programs but is not so bad in LLVM, since we can limit inputs to be, for example, three bits wide.

In summary, although in principle a fuzzer and an exhaustive generator are the same program with different implementations of the choice operator, in practice the concerns you face while writing one are significantly different from those you face writing the other. Neither is strictly easier than the other to create.

What afl-fuzz Is Bad At

American fuzzy lop is a polished and effective fuzzing tool. It has found tons of bugs and there are any number of blog posts talking about that. Here we’re going to take a quick look at what it isn’t good at. For example, here’s a program that’s trivial to crash by hand but that afl-fuzz isn’t likely to crash in an amount of time you’re prepared to wait:

#include <stdlib.h>
#include <stdio.h>

int main(void) {
  char input[32];
  if (fgets(input, 32, stdin)) {
    long n = strtol(input, 0, 10);
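    /* divides by zero when the input is -1000000 */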
    printf("%ld\n", 3 / (n + 1000000));
  }
  return 0;
}

The problem, of course, is that we’ve asked afl-fuzz to find a needle in a haystack and its built-in feedback mechanism does not help guide its search towards the needle. Actually there are two parts to the problem. First, something needs to recognize that divide-by-zero is a crash behavior that should be targeted. Second, the search must be guided towards inputs that result in a zero denominator. A finer-grained feedback mechanism, such as how far the denominator is from zero, would probably do the job here. Alternatively, we could switch to a different generation technology such as concolic testing where a solver is used to generate test inputs.
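As a sketch of what that finer-grained signal could be (this is an illustration, not an existing afl-fuzz feature), the feedback for each input might simply be the distance of the denominator from zero, which the search then tries to drive down:

#include <stdlib.h>

/* Hypothetical fitness value for the program above: smaller is better,
 * and zero means the crashing input has been found. */
long denominator_distance(long n) {
  return labs(n + 1000000);
}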

The real question is how to get the benefits of these techniques without making afl-fuzz into something that it isn’t — making it worse at things that it is already good at. To see why concolic testing might make things worse, consider that it spends a lot of time making solver calls instead of actually running tests: the base testing throughput is a small fraction of what you get from plain old fuzzing. A reasonable solution would be to divide up the available CPU time among strategies; for example, devote one core to concolic testing and seven to regular old afl-fuzz. Alternatively, we could wait for afl-fuzz to become stuck — not hitting any new coverage targets within the last hour, perhaps — and then switch to an alternate technique for a little while before returning to afl-fuzz. Obviously the various search strategies should share a pool of testcases so they can interact (afl-fuzz already has good support for communication among instances). Hopefully some concolic testing people will spend a bit of time bolting their tools onto afl-fuzz so we can play with these ideas.

A variant of the needle-in-a-haystack problem occurs when we have low-entropy input such as the C++ programs we would use to test a C++ compiler. Very few ASCII strings are C++ programs, and concolic testing is of very little help in generating interesting C++. The solution is to build more awareness of the structure of C++ into the testcase generator. Ideally, this would be done without requiring the somewhat elaborate effort we put into Csmith. Something like LangFuzz might be a good compromise. (Of course I am aware that afl-fuzz has been used against clang and has found plenty of ways to crash it! This is great, but the bugs being found are not the kind of semantic bugs that Csmith was designed to find. Different testcase generators solve different problems.)

So we have this great fuzzing tool that’s easy to use, but it also commonly runs up against testing problems that it can’t solve. A reasonable way forward will be for afl-fuzz to provide the basic framework and then we can all build extensions that help it reach into domains that it couldn’t before. Perhaps Valgrind is a good analogy: it provides not only an awesome baseline tool but also a rich infrastructure for building new tools.

[Pascal Cuoq gave me some feedback on this post including a bunch of ideas about afl-fuzz-related things that he’ll hopefully blog about in the near future.]