Csmith Released

Here is Csmith, our randomized C program generator. My dream is that it will be a force for good by unleashing a world of hurt upon low-quality C compilers everywhere (it is not uncommon for Csmith to crash a previously-untested tool on the very first try). High-quality C compilers, such as the latest versions of GCC and Clang, will correctly translate many thousands of random programs.

I wanted to release Csmith under the CRAPL, but nobody else in my group thought this was as funny as I do. Actually we put quite a bit of work into creating a stable, usable release. Even so, Csmith is a complex piece of software and there are certainly some lurking problems that we’ll have to fix in subsequent releases.

Finding Integer Undefined Behaviors Using Clang 2.9

My student Peng Li modified Clang to detect integer-related undefined behaviors in C and C++ code. We’ve released the code here, to go along with the recent LLVM 2.9 release. This checker has found problems in PHP, Perl, Python, Firefox, SQLite, PostgreSQL, BIND, GMP, GCC, LLVM, and quite a few other projects I can’t think of right now.

Of course, undefined behaviors are not all created alike. The next thing that someone should do is combine this work with a tainting engine in order to find dangerous operations that depend on the results of operations with undefined behavior. For example, take this code:

a[1<<j]++;

If it is ever the case that 0>j≥32, then a shift-related undefined behavior occurs. The result of this operation is used as an array index — clearly this is dangerous. It may even be exploitable, depending on what a particular C/C++ implementation does with out-of-bounds shifts. Other dangerous operations include memory allocations and system call arguments.

One of my favorite examples of a project getting hosed by an integer undefined behavior is here. An apparently harmless refactoring of code in Google’s Native Client introduced an undefined behavior that subverted the fundamental sandboxing guarantee. This is pernicious because the (flawed) check is sitting right there in the source code, it is only in the compiler output that the check is a nop. If their regression tests used our tester, this problem would have almost certainly been caught right away.

Mistranslation

[Warning: spoilers for Shutter Island below.]

Over the last few nights I watched Shutter Island, a nice atmospheric horror film. Near the end someone pulls out a chalkboard and writes down the names of four important characters, which turn out to contain two pairs of anagrams. This is ridiculous: puzzles of this sort only work when the viewer has a reasonable chance of figuring out the answer in advance. Of course, none of the four names had appeared in writing on the screen at any point, much less all together. Suddenly it was obvious that Shutter Island had been adapted from a book, and also that the filmmakers had missed the opportunity to either turn this textual puzzle into a visual one, or else eliminate it as untranslatable. It’s not as if anagram names are a particularly strong plot device in the first place. Also the “good guy isn’t who he thinks he is” idea is a little old; I think Angel Heart — which I loved, mainly because Alan Parker is awesome — was the last time this one caught me by surprise, and only then because I was 18 when I saw it.

Preemption vs. Event Ordering

There’s no doubt that concurrent programming is hard, but it’s not always clear exactly why. Generally, eliminating data races is not the hard part. On the other hand, dealing with event ordering problems can be extremely difficult. To put that another way, if we remove preemption from the picture by programming with non-preemptive threads or a single-threaded event loop, much of the difficulty of concurrent programming remains.

In 1998/99, while doing an internship at Microsoft Research, I wrote some moderately complicated code for the NT kernel. Since kernel debugging was a pain, I wrote a simple event-driven simulator that emulated the kernel’s execution environment well enough to support running my code in user mode. I was depressed to learn that debugging code in the single-threaded simulator was not hugely easier than debugging the live version. The crux of the issue was atomicity: not “concurrency atomicity” but rather something like “invariant atomicity.” The kernel code I was writing was not only highly stateful, but also quite incremental, meaning that event handlers would often drop what they were doing and other handlers would pick up the job later. Thus, event handlers were seeing extremely unclean system states. The invariants were more complicated than I could easily grasp, which of course was the root of the problem. Software systems using callbacks can be surprisingly hard to get right for similar reasons, and I suspect that many problems with error-handling code have a similar root cause.

In contrast with my NT kernel simulator example, an event driven system will be far easier to get right when stable states of data structures coincide nicely with event boundaries. Adding concurrency/parallelism may not make program development any more difficult if the points of interaction between concurrent flows coincide with relatively stable points in the system’s invariants.

One part of the solution to event ordering problems is to design systems where the invariants are well-structured and unclean system states are exposed to as little code as possible. Another part is to use programming languages and environments that have better support for explicit invariants. Lamentably, almost all of my development these days is either systems programming or scripting — both domains where invariant-based programming (design by contract for example) is weakly supported.

Volatile Bugs, Three Years Later

Almost exactly three years ago Eric Eide and I submitted a paper Volatiles Are Miscompiled, and What to Do about It to the 8th International Conference on Embedded Software (EMSOFT 2008). The points made in this paper were that:

  • C compilers fail to reliably translate accesses to volatile-qualified objects
  • we can automatically detect these failures
  • we can (at least in some cases) automatically work around these failures

Gratifyingly, this paper found an audience among embedded software practitioners and in 2010 it was one of the most highly-downloaded PDFs on the Utah CS web server. Not gratifyingly, it isn’t clear that compilers are, in general, much better at volatile than they were three years ago (we haven’t quantified this, it’s just my general feeling). The root of the problem is that there is a tension between volatile and optimizations: it’s hard to make fast code that is still volatile-correct.

The motivation for writing this paper originated in a lecture for my Advanced Embedded Systems course that I gave in 2006 or 2007. I was presenting some fragments of C code along with their translations into assembly, in order to illustrate the effect of the volatile qualifier, when a student raised his hand and said that one of the translations was incorrect. I assumed that I had made a cut-and-paste error and moved on with the lecture. However, when I checked up later, it turned out there was no cut and paste error: the compiler had been wrong (this was CodeWarrior for ColdFire, I believe). This was surprising, so I kept playing around and the situation got worse every time I tried a new compiler or wrote a new code fragment. Eventually it became clear that systematic wrongness existed and I needed to write something up, though I had no idea how to turn this into any kind of respectable academic paper. Eric saved the day by taking Randprog, an existing random C program generator, and extending it to generate code using volatile. Also, he hacked Valgrind to count accesses to volatiles, giving us the automatic detector. Finally, my student Nathan hacked up a CIL pass for turning volatile accesses into calls to helper functions (I don’t recall why we didn’t include Nathan as an author on the paper — we probably should have). At this point we had a paper. I like this story because it illustrates the way systems research often works in practice. It does not proceed in a nice chain from hypothesis to solution to evaluation. Rather, it begins with a niggling suspicion that something is wrong, and proceeds in little fits and starts, across plenty of dead ends, until finally it becomes clear that a useful result has emerged.

By far the most interesting development in volatile-land during the last three years is CompCert, which has a provably correct implementation of volatile. This is, as I’ve said here before, and will no doubt keep saying, a very impressive result.

The volatile qualifier is a legitimate solution to a real problem in early C compilers: they would optimize away critical accesses to hardware device registers. Prior to the introduction of volatile, extremely dodgy hacks were used to avoid miscompilation. However, in retrospect, I believe volatile has proved more trouble than it’s worth, and that C/C++ would be better off without it. The alternative is to use an explicit function call to access variables that live in special kinds of memory; these calls need not have high overhead since they can be inlined. The argument for explicit accesses comes not just from the compiler side, but also from the user side. Linus Torvalds has ranted about this, and he’s right.

I suspect that Eric and I need to write at least one more volatile paper during the next year or two. Some things that have changed:

  • CompCert supports volatile, so it is available as a basis for comparison
  • GCC and LLVM are less prone to non-volatile miscompilations than they used to be, making it much easier for us to assess the reliability of automatically turning volatile accesses into calls to helper functions
  • my student Yang Chen has created a new Pin-based volatile bug detector that works better than the Valgrind-based one
  • the Pin tool supports testing whether the generated code correctly copes with the case where the volatile location returns a “fresh” value each time it is read — we never tested this before
  • Csmith + the Pin tool support testing whether accesses to volatile locations are illegally reordered, which we never tested before

It will be fun to see what kinds of bugs are turned up by the far more aggressive testing we can now do, compared to what we could do in 2008. However, the more interesting thing will be to implement the automatic volatile bug workaround in GCC and LLVM, by turning volatile accesses into calls to helper functions in the frontends, and turning them back into memory accesses in the backends. This should achieve near total volatile correctness and will also permit hundreds of silly special cases to be removed from these compilers’ optimizers. Ideally the compiler developers will adopt this approach, though I suspect this depends on the performance of the generated code (it should be decent, but won’t be as good as the current, broken approach).

Who Fuzzes the Fuzzer?

Although it’s fun to act like our tool Csmith is an infallible compiler smashing device, this isn’t really true. Csmith is made of ~40,000 lines of C++, some of it quite complicated and difficult. Csmith probably contains about as many bugs per LOC as your average compiler. So how do we debug the bug-finding tool? The answer is simple: it debugs itself, and also the compilers that we are testing help debug it. Specifically:

  • Xuejun put plenty of assertions into the Csmith code.
  • The code emitted by Csmith optionally contains a lot of assertions corresponding to dataflow facts; for example, an alias fact might turn into assert (!p || p==&x). Any time one of these checks fails, either Csmith or a compiler is wrong.
  • Every time Csmith finds a compiler bug, we try to manually verify that the (reduced) test case is in fact a valid one before submitting it. Sometimes the reduced test case ends up being invalid, at which point we’ve found a bug either in Csmith or in the reducer.
  • Just recently we’ve started using Csmith to test some static analysis tools such as Frama-C. As expected, we have been finding bugs in these tools. What we didn’t expect is how effective they would be in spotting tricky flaws in Csmith’s output. The explanation, as far as I can tell, is that compilers don’t look very deeply into the semantics of the compiled code, whereas these tools do.

In the end, Csmith seems to have two advantages in its war on compiler bugs. First, it’s not nearly as big as an aggressive compiler. Second, there are a lot of compilers to test and each of them contains plenty of bugs; in contrast, each Csmith bug has to be fixed just once. Therefore, in practice, we find 10 or more compiler bugs for each Csmith bug.