Generating a Random Program vs. Generating All Programs

Generating all possible inputs — up to some maximum length — to a software system is one way of creating test cases, and this technique even has a name: bounded exhaustive testing. Back when we were doing Csmith, my then-student Yang Chen spent a while on a bounded exhaustive C program generator which was in fact pretty decent at finding compiler bugs and had the added advantage of not requiring a test case reducer: if you try test cases in order of increasing size, you’re guaranteed that the first input that triggers a bug is already minimal. In the end, however, Csmith worked better and we dropped the exhaustive generator.

Lately I’ve been working on a little exhaustive LLVM function generator. (NOTE: be careful about running this program, it not only writes a lot of files but is also a barely-controlled forkbomb. It would be better to just download a tarball of its output.) Here I don’t want to write about what we’ve learned using it, but rather about writing such a tool.

The first observation is obvious: the difference between a random generator and an exhaustive generator is simply the implementation of the choice operator: the random tool makes a random choice and the exhaustive generator makes every choice. Here you can see my choice implementation which uses fork() since it is simple and fairly efficient.

The rest of this piece is about the less obvious differences between writing a random generator and writing an exhaustive generator. First, most random program generators contain code like this:

restart:
  ch1 = choose(I);
  ch2 = choose(J);
  if (!compatible(ch1, ch2))
    goto restart;

In other words, when the generator gets into a bind, it simply restarts an entire operation. As long as this will eventually succeed, there’s no problem, and in practice a generator that can use this idiom will be moderately simpler than one that doesn’t use it. For exhaustive generation there are two problems. First, wasting choices increases the generator’s branching factor, potentially creating a huge increase in the number of states that must be explored. Second, the exhaustive generator might not even terminate, depending on what kind of “fuel” it uses. There’s no problem if the fuel is the number of choices, but if we want to use a different, user-meaningful kind of fuel (LLVM instructions, for my generator above) then this idiom leads to an infinite fork loop.

To write the same sequence of choices without a restart loop, we end up with this kind of code:

  ch1 = choose(I);
  ch2 = choose(J - incompatible_choices(ch1));
  ch2 = expand(ch1, ch2);

In other words, we need to always choose from a densely populated range of choices and then expand the choice out to the original, sparser choice space. Why does this sort of thing happen in practice? Let’s say that a function in the program we’re generating takes two arguments that should be distinct. It’s easy to generate two random arguments and then restart if they happen to be the same, but slightly harder to generate a random argument and then a second random argument that is guaranteed to be different from the first. The painfulness increases rapidly with the sophistication of the constraints that need to be respected. I’m not sure that it would have been possible to implement Csmith in the non-restart style.

The next issue is the balancing of probabilities that is required to create a good random program generator. This is really hard, we spent an enormous amount of time tuning Csmith so that it often generates something useful. Doing a bad job at this means that a fuzzer — even if it is in principle capable of generating interesting outputs — almost never does that. For example, the main value generation function in my LLVM generator is structured as a sequence of coin tosses. This is fine for exhaustive generation but creates an extremely skewed distribution of programs when used in fuzzer mode.

The final difference between random and exhaustive generation is that exhaustive generators end up being highly sensitive to incidental branching factors. For example, if my LLVM function generator is permitted to simply return one of its arguments (with the actual contents of the function being dead) the number of programs that it generates (for a given number of instructions) increases significantly, with no added testing benefit. Similarly, if we generate instructions that fold (for example, “add 1, 1”) we get another massive increase in the number of outputs (though these are not useless since they test the folder — something that we might or might not be interested in doing). A massive blowup comes from the necessity of choosing every possible value for each literal constant. This is a bit of a showstopper when generating C programs but is not so bad in LLVM since we can limit inputs to be, for example, three bits wide.

In summary, although in principle a fuzzer and an exhaustive generator are the same program with different implementations of the choice operator, in practice the concerns you face while writing one are significantly different from the other. Neither is strictly easier than the other to create.

May 28, 2015

regehr

Compilers, Computer Science, Software Correctness

11 responses to “Generating a Random Program vs. Generating All Programs”

bcs says:

May 28, 2015 at 6:55 pm

The comments about the difficultly of creating a non-backtracking generator gave me an idea: could you use a non-deterministic AI type construct (NN, RNN, GA, etc.) as the “RNG” and train it to make compatible choices? (To avoid pathologically degenerate output you should probably also include line coverage of the compiler in the reward function, possibly also how varied that coverage is.)
Ally Donaldson says:

May 29, 2015 at 3:19 am

Interesting article – thanks!

Would I be correct in thinking that exhasutive generation of programs that are free from undefined behaviour is a lot more challenging than in the case of random generation? For random generation it suffices to use things like safe math macros, and conservative effect analyses. But if one wants to really be exhaustive one would have to understand precisely whetehr a piece of code can exhibit UBs. I guess with a fast dynamic UB sanitizer this is possible, but in the presence of possibly non-terminating loopy code it could be hard.
Alex Groce says:

May 29, 2015 at 3:36 am

I think of this as being essentially the difference between depth-bounded DFS and BFS.
Ben says:

May 30, 2015 at 9:14 am

This is only barely on-topic, but are you familiar with Monte Carlo tree search (MCTS)? I think of it whenever anyone mentions fiddly search problems. It’s a new-ish family of search algorithms that are largely responsible for go (weiqi, baduk) playing programs improving from weak amateur level to fairly strong amateur over the last decade.

MCTS is a kind of best-first search where the evaluation function for non-leaf nodes is randomized. The descent from root to frontier is also randomized where subtrees with “more promising” nodes are preferred and subtrees that have received less computational effort so far are preferred. In go programs the non-leaf nodes (i.e. incomplete board positions) are evaluated by playing out many games with random moves and observing the win/loss ratio.

I have no idea whether MCTS would be useful for random program generation. I’ve had in the back of my mind to try using it for instruction scheduling and register allocation for a while, but I don’t know if/when I’ll get around to that.

One of the fun things about MCTS is that you can throw all kinds of quick-n-dirty hacks into the evaluation function, because it only needs to be kinda, sorta accurate. (I’m sure the mathematicians in the audience are sharpening their knives, but we engineers can worry about niceties like convergence after there’s a promising prototype.)
regehr says:

May 30, 2015 at 10:48 am

bcs, this is a cool idea, and in fact just the other day I started trying to use the software linked from this post as a C compiler fuzzer:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

But it ate all my RAM and didn’t train anything up in a reasonable amount of time. I might need to get a good GPU. Anyway this definitely seems worth trying, as long as we don’t mind most of the code being invalid.
regehr says:

May 30, 2015 at 10:52 am

Alastair, indeed it does sound challenging, although in practice the space of C programs is so huge that any practical exhaustive generator is going to come with a ton of caveats — so if we add the restriction that it uses safe math macros to get safety, maybe that’s not so much worse. I think that true exhaustive generation of C code really won’t work…
regehr says:

May 30, 2015 at 10:56 am

Ben I’d be interested to talk more about this. Are you going to be at FCRC?

A while ago I read a pile of MCTS papers and got pretty excited about it but then somehow I lost the thread. It definitely seems promising but (if I recall correctly– this was a few years ago) there are some real questions about how to give feedback to the algorithm. The actual feedback metric of interest is bugs, but something like coverage should work as a weak approximation.

Lately I’ve been noticing how impressive afl-fuzz is using (if I recall..) more or less path coverage with path length == 2 as the coverage metric. Perhaps that’s some sort of sweet spot.
regehr says:

May 30, 2015 at 10:59 am

By “pile of MCTS papers” I probably mean about 3, I should add :). But it would be fun to brainstorm a bit about this and see if there are some obvious things that need to be tried.
Alex Groce says:

May 30, 2015 at 11:20 am

It doesn’t say it can’t work, but a smart ML PhD, his advisor, and I played around with MCTS for API testing for a few months and never could beat random if we evaluated fairly. But might have been the setting (python libraries) or our approaches. It was disappointing. We should talk at PLDI — we tried policy rollout to get cost down, which had a little (weak) promise, but haven’t gotten back to it lately.
bcs says:

May 30, 2015 at 1:02 pm

regehr: I actual saw the “Unreasonable Effectiveness” post before yours. It’s what got me thinking.

NN/RNN training really likes to eat compute cycles. If you are interested in that direction, feel free to contact me directly and I might be able to put you in contact with some people who already have the tooling set up to play in that space.
Alex Groce says:

May 30, 2015 at 6:43 pm

The key problem was that in API testing, you needed to grab coverage information often to build a reasonable reward (only at the end of trajectories, of course, but you run a lot of trajectories), and this was expensive enough with coverage.py that just letting random go full-steam ahead was better. Probably not hard enough SUTs to cover, though.