Fuzzer Puzzle

A recent post on the CERT/CC blog says:

… BFF 2.7 and FOE 2.1 do have an optional feature that in our testing drastically increases the number of uniquely-crashing test cases: crasher recycling. When enabled, the frameworks will add [a] uniquely-crashing test case back into the pool of seed files for further fuzzing. When crashing test cases are fuzzed more, this can uncover both new vulnerabilities as well as new ways in which an already-known vulnerability can cause a crash.

It seems pretty clear why adding a crasher back into the seed pool will find new ways to trigger the bug that the crasher already triggers: fuzzing this test case will generate test cases that are near the crasher (in some abstract test case space that we won’t try to define too closely). These nearby test cases are presumably more likely to find different ways to trigger the same bug.
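
To make the mechanism concrete, here is a minimal sketch of a mutational fuzzing loop with crasher recycling. To be clear, this is not how BFF or FOE are implemented: the file names, the bit-flipping mutator, and especially the uniqueness test (one bucket per signal number, whereas the real frameworks examine the crash itself) are assumptions made purely for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define MAX_POOL 1024
#define CASE_LEN 4096

static unsigned char pool[MAX_POOL][CASE_LEN];
static size_t pool_size;

/* Run the (hypothetical) target program on a test case; return the
   fatal signal number if it crashed, or 0 if it exited normally. */
static int run_target(const unsigned char *buf) {
  FILE *f = fopen("testcase.bin", "wb");
  if (!f) return 0;
  fwrite(buf, 1, CASE_LEN, f);
  fclose(f);
  pid_t pid = fork();
  if (pid == 0) {
    execl("./target", "./target", "testcase.bin", (char *)0);
    _exit(127);
  }
  int status;
  waitpid(pid, &status, 0);
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* One bucket per signal number: a much cruder uniqueness test than
   what real fuzzing frameworks perform. */
static int seen[64];

int main(void) {
  srandom(12345);
  FILE *s = fopen("seed.bin", "rb");   /* a valid input to the target */
  if (!s) return 1;
  fread(pool[0], 1, CASE_LEN, s);
  fclose(s);
  pool_size = 1;
  for (;;) {
    unsigned char buf[CASE_LEN];
    memcpy(buf, pool[random() % pool_size], CASE_LEN);
    for (int i = 0; i < 8; i++)              /* flip some random bits */
      buf[random() % CASE_LEN] ^= 1 << (random() % 8);
    int sig = run_target(buf);
    if (sig > 0 && sig < 64 && !seen[sig]++) {
      printf("uniquely-crashing test case (signal %d)\n", sig);
      if (pool_size < MAX_POOL)              /* crasher recycling */
        memcpy(pool[pool_size++], buf, CASE_LEN);
    }
  }
}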

But why is it the case that this technique “drastically increases the number of uniquely-crashing test cases”? I believe the authors intend “uniquely-crashing test case” to mean a test case that triggers a new bug, one that is distinct from any bugs already found during fuzzing. I’ll avoid speculating for now so I can see what you folks think.

UPDATE: Allen Householder from CERT has written some more about this.

Levels of Fuzzing

Differential Testing for Software is one of my favorite papers about software testing: it is one of the few pre-delta-debugging papers to describe automated test-case reduction and it contains many pieces of wisdom about software testing in general and random testing in particular. One of them outlines a hierarchy of test case generation strategies:

For a C compiler, we constructed sample C source files at several levels of increasing quality:

  1. Sequence of ASCII characters
  2. Sequence of words, separators, and white space
  3. Syntactically correct C program
  4. Type-correct C program
  5. Statically conforming C program
  6. Dynamically conforming C program
  7. Model-conforming C program

The paper goes on to describe how test cases at different levels reveal different kinds of flaws in a compiler. McKeeman’s choice of the words “increasing quality” is interesting because the set of test cases that can be generated at each level is a strict subset of the set that can be generated at the previous level. For example, every type-correct C program is syntactically correct, but the converse is not true. Since the expressiveness of the test-case generator decreases with increasing level, so does the size of the set of bugs that can be triggered.

So where does “increasing quality” come from? The problem is that in the real world, testing budgets are finite. Therefore, while a random sequence of ASCII characters can in principle trigger every single compiler bug, in practice we would have to run a very large number of test cases before finding the kind of bug-triggering programs shown here. Actually, it is not necessarily the case that every compiler bug can be triggered by 7-bit ASCII codes. To be safe, we should probably define test level 0: a random bit stream.

Given the low probability of triggering difficult optimizer and backend bugs using random bits or characters, we might interpret “increasing quality” as meaning “increasing coverage of compiler code within a reasonable test budget.” So we have this interesting situation where decreasing the set of inputs that we are willing to generate increases the expected level of coverage of the system under test.

Although McKeeman reports finding bugs at levels 1-5, for the Csmith work we did not bother with any of levels 0-5. It’s not that we didn’t think test cases at these levels would trigger some bugs, but rather that we were mainly interested in sophisticated optimizer bugs that are very hard to spot until level 6. My guess, however, is that a high-quality modern compiler (GCC, Clang, Intel CC, MSVC, etc.) is probably not going to have a lot of bugs that you can find using levels 0-3.

Bart Miller’s original fuzzer operates only at level 1. The fact that it found a lot of bugs tells us that the programs being tested contained a lot of bugs in the code that can be reached using level 1 inputs within a very small (by today’s standards) test budget. Basically, the then-unfuzzed UNIX utilities worked in the common case but were very fragile with respect to non-standard inputs. Also, it is most likely the case that many of these programs lacked the kind of sophisticated constraints on what constitutes a valid input that C compilers have.
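
To make level 1 concrete, here is a minimal generational fuzzer in the spirit of Miller’s tool; the default output length and the restriction to printable characters are arbitrary choices made for this sketch, not properties of the original utility. Piping its output into a program under test (for example, a hypothetical ./fuzz1 100000 | some_utility) is the whole experiment.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
  long n = (argc > 1) ? atol(argv[1]) : 10000;   /* output length */
  srandom((unsigned)time(NULL));
  for (long i = 0; i < n; i++)
    putchar(32 + random() % 95);   /* a random printable ASCII character */
  return 0;
}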

Miller’s fuzzer, McKeeman’s fuzzer, and Csmith are all generational fuzzers: they generate test cases from scratch. In the time since the original fuzzing paper, the term “fuzz” has increasingly come to mean mutational fuzzing, where a valid input to the system under test (such as a PDF document, an Excel spreadsheet, or whatever) is randomly modified. This kind of fuzzer does not fit neatly into McKeeman’s hierarchy; it can be seen as a hybrid of a level 0 fuzzer and a level 7 fuzzer.

I want to finish up by making a few points. First, if you are writing a generational fuzzer, it is worth thinking about which level or levels that fuzzer should operate at. In other words, McKeeman’s hierarchy is not just a way to look at compiler fuzzers: it’s a way to understand all generational fuzzers. I also think it is useful as a means of understanding mutational fuzzers, even if they do not fit into the hierarchy so neatly. For example, one can imagine a mutational fuzzer that, instead of modifying test cases at level 0, modifies them at level 3 or 4.

My second point is that a comprehensive fuzzing campaign should operate at multiple levels of McKeeman’s hierarchy. That is, it would be useful to create a family of fuzzers, for example one operating at each of levels 0-7, and to split the testing budget among them. The details of how this split is performed are probably not that important, but in practice I would look at the bug yield of each level after a couple of weeks of fuzzing and then use that information to increase the budget allocated to the more productive levels. However, if I really cared about the quality of the system under test, under no circumstances would I allocate zero test budget to any of the fuzzers. As I said earlier, for the Csmith work we ignored all but level 6: this was because we had a specific research agenda, as opposed to wanting to maximize the quality of some particular compiler.

Finally, not all of McKeeman’s levels make sense for every system under test. For example, if we are fuzzing an API instead of a standalone tool, it probably makes no sense to pass random bit streams to the API.

Guidelines for Research on Finding Bugs

There’s a lot of research being done on finding bugs in software systems. I do some of it myself. Finding bugs is attractive because it is a deep problem, because it’s probably useful work, and because — at least for some people — finding bugs is highly entertaining. This piece contains a few thoughts about bug-finding research, formulated as guidelines for researchers. Some of these are based on my own experience, others are based on opinions I’ve formed by reviewing lots of papers in this area.

Target Software That’s Already Pretty Good

Not all software is good. In fact, some of it is downright crappy. Here crappy is a technical term: it means a code base containing enough serious known bugs that the developers don’t want to hear about any new ones. Finding bugs in crappy software is like shooting fish in a barrel. Crappy software does not need clever computer science: it needs more resources, new management, or simply to be thrown away. In contrast, software that is pretty good (another technical term) may have outstanding bugs, but the developers are actively interested in learning about new ones and will work to fix them. This is the only kind of software that should be targeted by bug-finding tools. Software that is extremely good is probably not a good target either — but this code is quite uncommon in the real world.

Report the Bugs

I’m always surprised when I read a bug-finding paper that claims to have found bugs but does not mention how the people who developed the buggy software reacted to the resulting bug reports. The implication in many cases is that the bugs were identified but not reported. It is a mistake to operate like this. First, when handed a bug, software developers will often discuss it, and you can learn a lot by listening in. Some bugs found by bug-finding tools aren’t bugs at all but rather stem from a misunderstanding of the intended behavior. Other bugs are genuine but developers aren’t interested in fixing them; it is always interesting to learn how bugs are prioritized. If a bug-finding research tool is not finding useful bugs, then the tool itself is not useful; however, perhaps it can be adapted to filter out the bad bugs or to find better ones in the first place. In all cases, feedback from software developers is valuable to the bug-finding researcher.

Another reason it’s important to report bugs is that fixed software is better for everyone. Obviously it’s better for the software’s users, but it is also better for bug-finding researchers, since bugs tend to hide other bugs. Paradoxically, bug-finding often becomes easier as bugs are fixed. Furthermore, it is important to raise the bar for future bug-finding research: if a dozen different papers all claim credit for finding the same bug, this is hardly a success story.

A final reason to report bugs is that the very best outcome you can report in a paper about a bug-finding tool is that it found bugs that developers cared about enough to fix.

Prefer Open Source Software as a Bug-Finding Target

In reporting a large number of compiler bugs, I have found that bugs in open source compilers are more likely to be fixed than bugs in commercial tools. One reason, I think, is that people developing commercial compilers have a tendency to spend their precious time supporting paying customers as opposed to humoring obnoxious researchers. Another reason to prefer open source software is that it usually has a public bug reporting system, even if it’s just a mailing list, so you can listen in and perhaps participate when bugs are being discussed. Also, when a bug is fixed you can immediately try out the fixed version. In contrast, it is unusual to be granted access to unreleased versions of commercial software, so you may have to wait six months or two years for a new release, at which point your students have graduated, your grants have expired, and your bug-finding tool might not compile any longer (it definitely won’t compile after six months if your code interfaces with a fast-moving code base like LLVM).

Think Carefully when a Bug-Finding Tool Doesn’t

Certain codes are too hardened to be good bug-finding targets. For example, when hunting for integer overflow errors, we found none in several popular libraries such as libjpeg. The problem is that this kind of code has been intensely scrutinized by security people who spent a lot of effort looking for similar kinds of problems. Other codes don’t contain good bugs because they are too small. Of course, a failure to find bugs doesn’t mean they don’t exist: it might mean that the bug-finding tool is not yet mature enough, or it may simply be based on an incorrect premise. I’ve supervised a couple of casual student projects in which students wrote fuzzers for the Linux system call API. Although this is clearly a hardened target, I have confidence that bugs would have been found if the students had persisted (they did not).
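
For what it’s worth, the starting point for such a project can be tiny. The sketch below (Linux-only, and best run in a disposable VM, since random system calls can scribble on your files) just invokes random syscall numbers with random arguments; the upper bound of 400 on the syscall number is a guess, and a serious fuzzer would know the argument types of each call.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* A random 64-bit-ish argument value. */
static long r(void) { return ((long)random() << 31) ^ random(); }

int main(void) {
  srandom((unsigned)getpid());
  for (long i = 0; ; i++) {
    pid_t pid = fork();   /* isolate each call so a killed child
                             doesn't take down the fuzzer */
    if (pid == 0) {
      syscall(random() % 400, r(), r(), r(), r(), r(), r());
      _exit(0);
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
      printf("call %ld: child killed by signal %d\n", i, WTERMSIG(status));
  }
}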

One trick I’ve seen people play when a bug-finding tool doesn’t find any bugs is to relabel the work as an exercise in formal verification where the desired result is evidence of absence of bugs rather than evidence of their presence. In some cases this may work, but generally speaking most bug-finding tools have many embedded assumptions and unsoundnesses that — if we are honest — cause their verification claims to be weak. Although there is certainly some overlap between bug-finding and formal verification, the latter sort of work is often conducted differently, using different tools, and attacking much smaller pieces of software.

Conclusion

I feel like this is all pretty obvious, but that it needs to be said anyway. Other than the superb Coverity article, there’s not a lot of good writing out there about how to do this kind of work. If you know of any, please leave a comment.

Are Compilers Getting More or Less Reliable?

Earlier this summer, Miod Vallat of the OpenBSD project posted some thoughts about compiler reliability. His thesis is that in the good old days of GCC 2.5 through 2.7, the compiler pretty much just worked and you could trust it to emit correct code. Subsequently, the level of code churn in GCC increased for several reasons:

  • new standards appeared, such as C++98
  • more effort was made to exploit platform-specific performance opportunities
  • more people started working on the compiler

The result is that “gcc had bugs, and you were expected to accept that and cope with them.” Upgrading the compiler is one way to get bugs fixed, but this is a mixed bag since new bugs are also added. What is needed, Miod argues, are compiler versions with long-term support.

This post is a collection of responses to Miod’s post and also a bit of an attempt to figure out if open-source compilers have really gotten less reliable over the past 20 years.

First of all, to my eye the GCC project already does a good job of supporting its compilers. For example, 4.4.0 was released in April 2009 and 4.4.7 was released in March 2012 — that’s almost three years’ worth of bug-fix releases. As a frequent reporter of compiler bugs, I can confirm that it is common for the GCC developers to take a bug report against trunk and backport the fix to at least one of the maintenance branches. So what else is needed here? Longer maintenance periods? More fixes being backported? I’d appreciate hearing from people about this. Miod’s mail indicates that he has some specific complaints:

… gcc developers have been eager to release “bug free” new versions by enforcing a policy that only “regressions” would get fixed, and spending more time changing their definition of “regression” or trying to explain why regressions weren’t, so as not need to fix them in any stable release.

I would tend to respect the opinions of an OpenBSD port maintainer, but on the other hand my observation is that the GCC people are in fact putting quite a bit of effort into the maintenance branches. Of course one could always argue that they are not doing enough.

The overall context for Miod’s mail is an ongoing discussion in the OpenBSD community about switching over to Clang as the default system compiler. One problem with this idea is that at present, the LLVM project does not have any maintenance branches at all. So realistically this switch would seem to be a step backward in terms of compiler stability. I’m not complaining about the reliability of Clang (see below for more details on this); rather, my group writes code that interfaces with LLVM/Clang and our experience is that these projects move forward very quickly.

The second thing I wanted to say is that I think we can add even more items to Miod’s list of reasons why GCC might be getting buggier over time:

  • GCC is now significantly larger than it was: 1300 KLOC for 4.8.1 vs. 331 KLOC for GCC 2.7.2.3 — that’s a million new lines of code for bugs to hide in, and much of this code is as hairy as you’ll see anywhere
  • the codes being compiled now are significantly larger and more complex than they were 20 years ago during the GCC 2.5 through 2.7 era; large and complex codes are more likely to trigger compiler bugs
  • the scope of compiler optimizations, in terms of how much code each optimization considers, has increased substantially in the last 20 years — this makes it harder to get the optimizations right
  • automatically generating code that uses SIMD instructions is tricky, and we do a lot more of that now

After writing this list I feel like we’re lucky to have compilers that work at all. But are there any trends indicating that GCC should be getting more reliable? I think so:

  • vastly improved compiler warnings, static analyzers, and dynamic tools such as Valgrind support easy detection of some kinds of bugs that used to be difficult
  • GCC’s move to an SSA-based internal representation, which caused many bugs at first, has permitted optimizations to be cleaned up
  • GCC’s test suite has gotten better
  • GCC attempts to fail rapidly when something goes wrong by using assertions much more heavily than it used to; I count about 12,000 assertions now, compared to fewer than 500 in 2.7.2.3 (these counts are very rough, using grep)

So now we’re ready to tackle the original question: Was GCC 2.7 really more reliable than our modern open-source compilers? Answering this kind of question is hard, and in fact I don’t think anyone — in industry or in academia — has published quantitative information comparing the reliability of various compilers. It just isn’t done.

My method (which has some obvious shortcomings that I’ll discuss later) was to take three compilers — GCC 2.7.2.3 for x86, a recent Clang snapshot for x86-64, and a recent GCC snapshot for x86-64 — and use them to compile 100,000 random programs generated by Csmith. Each program was compiled at -O0, -O1, -O2, -Os (if supported — GCC 2.7 did not have this flag yet), and one of -O3 or -Ofast. If any compiler crashed or failed to terminate within three minutes, this was noted.

Then the compiled programs were run and their checksums were compared. These checksums are a feature of programs generated by Csmith; they are computed by looking at the values in all global variables just before the randomly-generated program terminates. The intent of the checksum is to support easy differential testing where we don’t know the correct result for a test case but we do know that the result is not supposed to change across compilers or optimization options. If changing the optimization level changed a checksum, the compiler was considered to have emitted incorrect code. For more details see the Csmith paper.
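
To make the setup concrete, here is a rough sketch of what one iteration of this loop looks like. This is not the harness we actually used: the two-level comparison, the environment variable pointing at the Csmith runtime headers, and the reliance on the timeout utility are all simplifications for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Run a compiled test program and capture the single "checksum = ..."
   line that Csmith-generated programs print just before exiting. */
static void checksum(const char *cmd, char *out, size_t len) {
  out[0] = 0;
  FILE *p = popen(cmd, "r");
  if (p) {
    if (!fgets(out, (int)len, p))
      strcpy(out, "<no output>");
    out[strcspn(out, "\n")] = 0;
    pclose(p);
  }
}

int main(void) {
  char a[128], b[128];
  /* Generate one random program (assumes csmith is on the PATH). */
  if (system("csmith > test.c"))
    return 1;
  /* Compile it at two optimization levels; a nonzero exit status
     here would be recorded as a compile failure or compiler crash. */
  if (system("gcc -w -I$CSMITH_HOME/runtime -O0 test.c -o test_O0") ||
      system("gcc -w -I$CSMITH_HOME/runtime -O2 test.c -o test_O2"))
    return 1;
  /* Run both binaries, bounding runtime because generated programs
     are not guaranteed to terminate. */
  checksum("timeout 180 ./test_O0", a, sizeof a);
  checksum("timeout 180 ./test_O2", b, sizeof b);
  /* Differential test: the checksum must not depend on the -O level. */
  if (strcmp(a, b) != 0)
    printf("possible wrong-code bug: '%s' vs. '%s'\n", a, b);
  return 0;
}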

Here are the results:

                 crashes   timeouts   wrong code
GCC 2.7.2.3        13866         68          239
GCC r202341            4          0            3
Clang r190166          0          0            7

Here’s an example of a program that, when compiled by GCC 2.7.2.3 at -O2, causes the compiler to run for more than three minutes (in fact, it runs for more than 60 minutes — at which point I got bored and killed it):

int x0;
int x1() {
  for (;;) {
    x0 = 0;
    for (; x0;) {
      int x2 = 1;
      if (x2) return 0;
    }
  }
}

I didn’t investigate the crash bugs or wrong-code bugs in GCC 2.7; these are probably of very little interest now. It’s impressive that Clang never crashed in 100,000 test cases. The four crashes in the current GCC found during this testing run were all variants of a known problem, PR 57393; a patch exists, but it hasn’t been applied to trunk yet. The instances of incorrect code generation in Clang were due to previously unknown bugs that I reported as PR 17158 and PR 17179, which are triggered respectively by the following programs.

PR 17158:

int printf(const char *, ...);
struct x0 {
  int x1;
  char x2;
};
struct x3 {
  struct x0 x4;
  char x5;
} x6;
int main() {
  struct x3 x7 = { { 0, 0 }, 1 };
  x6 = x7;
  x7.x4.x2;
  printf("%d\n", x6.x5);
  return 0;
}

PR 17179:

int printf(const char *, ...);
int x0, x1, x2;
int main (void) {
  int x3;
  for (x1 = 0; x1 != 52; x1++) {
    for (x2 = 0; x2 <= 0; x2 = 1) {
      int x4 = x0 ^ x1;
      x3 = x4;
    }
  }
  printf("%d\n", x3);
  return 0;
}

The GCC miscompilations were also due to previously unknown bugs; I reported them as PR 58364, PR 58365, and PR 58385. Here are the two shorter ones.

PR 58364:

int printf(const char *, ...);
int x3 (int x4) { 
  return x4 < 0 ? 1 : x4; 
}
int x0 = 1, x1, x2;
int main (void) {
  if (x3(x0 > x2 == (x1 = 0))) {
    printf("x\n");
  }
  return 0;
}

PR 58385:

int printf(const char *, ...);
int x0, x1 = 1;
int x2() {
  x1 = 0;
  return 0;
}
int main() {
  ((0 || x0) & x2() >= 0) <= 1 && 1;
  printf("%d\n", x1);
  return 0;
}

The three new wrong-code bugs in GCC were all fixed within a few days of being reported. So, I ran another 100,000 random programs against an updated snapshot and here are the results:

                 crashes   timeouts   wrong code
GCC r202599            9          0            0

Out of the nine crashes, six were due to the PR 57393 problem mentioned above and the other three were due to null pointer dereference issues. The GCC bug database already contains a number of open bugs for segfaults in the compiler so I’m not in any big rush to report these.

I did not run another Clang test since neither of the new bugs in that compiler has been fixed yet.

Let’s get back to the original question: Has the reliability of open source compilers decreased since GCC 2.7? This does not appear to be the case. In fact, this experiment indicates that the opposite might be true. Of course we need to be careful in interpreting these results, since randomly generated C programs are not necessarily representative: they could trigger bugs that no real C program triggers, while also missing all of the bugs that Miod is complaining about. On the other hand, take a look at the C programs above that trigger the new wrong-code bugs and try to convince yourself that these patterns will never occur during compilation of real code. A different source of bias in these results is that GCC 2.7 was never tested with Csmith, but the current versions of GCC and Clang have been subjected to extensive random testing.

Getting back to Miod’s point about the need for LTS versions of compilers, I completely agree: this should be done. Five or even 10 years of support would not be too long, especially in the embedded systems world where old compilers often persist long after their expire-by date. I suspect that the main problem is that compiler maintenance is too difficult and unsexy to attract lots of volunteers. Of course people could be paid to do it, but as far as I know nobody has stepped up to fund this sort of effort.

Miod’s mail received some comments at Hacker News.

Updates from Sep 16:

  • good reading: “FreeBSD 10 Alpha Built With clang”
  • I added a bit of explanation of Csmith’s checksumming strategy
  • I added the test case text for four of the five new wrong-code bugs just because I think they’re interesting

Jay

Jay Lepreau died five years ago today; I wanted to share a few thoughts about him.

In spring 2000 I had a year of school left. Sarah had graduated and gotten several job offers; Utah was one of them. This seemed like a bit of an odd choice for us except that Jay flew me out and subsequently offered me a postdoc position once I finished up — the chance to work with Jay and his group was one of the main deciding factors for us to move to Utah.

My friend Alastair, who worked for Jay at the same time I did, talked about “the spotlight.” If the spotlight was shining on you, then you had Jay’s full attention; if not, then you might not be able to get his attention at all. Jay was very intense and spent many nights in the office, where he kept a ratty sleeping bag. Routine tasks such as ordering office furniture were put off literally for years, but on the other hand when papers and grant proposals went out, they had a sense of purpose and vision that I’ve spent an awful lot of time trying to recreate on my own.

The best memories I have of working with Jay are of a handful of times when a two- or three-hour meeting with him resulted in some substantial change in the direction of my work. In particular, I remember one time we met at a cafe near campus, probably in 2002. I showed up in sort of a confused and frustrated state about some piece of work (I don’t even remember what it was) and we just sat out in the sun on this warm day and talked things out for most of the afternoon. Afterward, somehow, I was no longer frustrated and confused.

Unfortunately I never managed to go on a desert trip with Jay; this short piece, written by a long-time friend of his, gives a sense of what that would have been like. The last time I saw Jay outside of the hospital was in spring 2008, when he and Caroline came over to my house for dinner; I remember it being a really nice evening.

Great Salt Lake Desert Road Trip

A few weeks ago, Matthew Flatt and I took a short road trip to the area west of the Great Salt Lake. We arrived at the Bonneville Salt Flats before dawn.

Unexpectedly, there were plenty of people out on the playa; it turned out to be “Speed Week” where drivers from all over the world show up to test out really fast vehicles. You can see some of their infrastructure here.

Since we had accidentally parked near where the vehicles were starting their speed runs, we watched for a while. People were driving a surprisingly wide variety of vehicles: street-legal motorcycles, 1920s roadsters, cars that looked completely home-made out of sheet metal, extremely fast race cars with parachutes — all sharing the same track. These vehicles were very loud, could barely idle, and were generally incapable of starting from a stop; several needed to be pushed to about 50 MPH before their engines could engage, at which point they roared away. This early in the morning there were very few spectators.

My new obsession is doing a long bike ride across the salt flats.

Next we had 50 miles of driving on dirt roads on a low bench between the salt flats and a small mountain range.

This is a seriously remote area and we didn’t see any other vehicles. Our destination was the Sun Tunnels near the ghost town of Lucin, Utah.

We had an early lunch in the shade of one of the tunnels and then headed home. On the way back I wanted to look for the obscure “Red Man” pictograph in Timpie Valley at the northern end of the Stansbury mountain range. We spent quite a long time hiking up and down steep hillsides in 95 degree heat, looking for caves; we were about to give up when Matthew spotted the right one.

This shot shows the scale; I do not know of any similar pictographs in this part of Utah.

Overall it was good to do a bit of traveling before classes got going.

Counting to 4 Billion Really Fast

Before teaching today I wrote a silly program to make sure that the bitwise complement of any 32-bit value is equal to one less than the two’s complement negation of that value:

#include <stdio.h>
#include <limits.h>
#include <assert.h>

void foo (unsigned x)
{
  unsigned x1 = ~x;
  unsigned x2 = -x - 1;
  assert (x1 == x2);
}

int main (void)
{
  unsigned x = 0;
  long checked = 0;
  while (1) {
    foo (x);
    checked++;
    if (x==UINT_MAX) break;
    x++;
  }
  printf ("checked %ld values\n", checked);
  return 0;
}
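
By the way, the identity itself is easy to justify: in two’s complement, x + ~x has every bit set, so

  x + ~x = 0xFFFFFFFF = -1 (mod 2^32)

and therefore ~x = -1 - x = -x - 1.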

When the program terminated without any perceptible delay, I figured there was a bug, but nope: the code is good. It turns out that both GCC and Clang optimize the program into effectively this:

int main (void)
{
  printf ("checked 4294967296 values\n");
  return 0;
}

The surprise (for me) was that at -O1 — which traditionally does not enable interprocedural optimizations or aggressive loop transformations — both compilers looked inside the function foo() closely enough to figure out that it is a nop, and also that both compilers were able to predict that a not-traditionally-structured loop executes 2^32 times. I do so many posts about compiler bugs here that I figured a bit of antidote would be nice.

This is gcc 4.2.1 and Apple LLVM version 4.2.

FizzBuzz Bugs

It is said that an overwhelming majority of applicants for some programming jobs fail the FizzBuzz test. As part of a simple warmup exercise, I asked my computer systems class (a required course for CS majors, usually taken in the 3rd year) to implement FizzBuzz in C. The problem statement was, I think, the standard one:

Write a C program that prints the numbers from 1 to 100 except that for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz” and for numbers which are multiples of both three and five print “FizzBuzz”. Each number or string should be on its own line.
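
For reference, one straightforward solution (among countless correct variants) looks like this; note that testing divisibility by 15 first sidesteps the most common logic error described below:

#include <stdio.h>

int main(void) {
  for (int i = 1; i <= 100; i++) {
    if (i % 15 == 0)          /* multiple of both 3 and 5 */
      puts("FizzBuzz");
    else if (i % 3 == 0)
      puts("Fizz");
    else if (i % 5 == 0)
      puts("Buzz");
    else
      printf("%d\n", i);
  }
  return 0;
}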

Out of 152 submissions, one failed to compile since the student had for some reason commented out main(). Out of the remaining 151 submissions, 26 did not produce precisely the expected output. If we ignore whitespace-related issues (extra newline at the start, no newline at the end, etc.), 17 submissions were still incorrect. I looked at each of them:

  • Six submissions had what I would characterize as string bugs instead of logic bugs: printing strings like “The number 51 is Fizz” instead of “Fizz”; “Fizzbuzz” or “FissBuzz” instead of “FizzBuzz”; failing to print lines with just the numbers; etc.

  • Four submissions printed a “FizzBuzz” line corresponding to zero instead of starting at one. One of these also neglected to print a line corresponding to 100. An additional submission left off the line corresponding to 100.

  • Four submissions never printed “FizzBuzz” at all due to logic errors, but did print “Fizz” and “Buzz” properly.

  • One submission printed an extra “Fizz” line every time “FizzBuzz” was printed.

  • One submission had a logic error that defies explanation.

While I feel strongly that all third-year students in my program should be able to write a correct FizzBuzz, I was still happy to see that nobody failed the basic task of determining multiples of 3 and 5 and that all of the incorrect solutions had more or less the correct shape. Of course, writing this code at home or in the lab, with plenty of time, is far easier than trying to put it up on a whiteboard in the first 15 minutes of a job interview.