C-Reduce and Frama-C

Yesterday Pascal wrote a nice article addressing practical issues in automated testcase reduction. One issue is that C-Reduce tends to give you what you asked for instead of what you really wanted. Of course, this problem is a common one when performing algorithmic searches. Beyond computer science, the same problem comes up in basically every story where a genie is in a position to grant a wish. In all cases, our goal is to suitably constrain the space of solutions. As examples, this post will examine a couple of industrial-strength interestingness tests that Csmith’s driver infrastructure uses to reduce compiler crash and wrong-code bugs as it finds them.

As a quick review, a test case reducer starts out with a large test case that is interesting and its job is to produce the smallest test case it can that is also interesting. So it has the invariant that its current test case is interesting. The current test case is transformed to produce a series of smaller variants; any time such a variant is found to be interesting, it replaces the current test case. The reducer terminates when it reaches a fixpoint; that is, when no transformation in its repertoire results in an interesting variant.
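The loop just described can be sketched in C. This is a toy reducer, not C-Reduce: the test case is a string, the only transformation is deleting a single character, and the hypothetical is_interesting() stands in for a real interestingness test.

```c
#include <string.h>

/* Toy interestingness test: the string still contains "bug". */
static int is_interesting(const char *s) {
    return strstr(s, "bug") != NULL;
}

/* Maintain the invariant that cur is interesting; greedily accept any
   smaller interesting variant until no single-character deletion
   succeeds -- that is, until a fixpoint is reached. */
static void reduce(char *cur) {
    int progress = 1;
    while (progress) {
        progress = 0;
        for (size_t i = 0; cur[i] != '\0'; i++) {
            char variant[256];
            memcpy(variant, cur, i);            /* variant = cur minus char i */
            strcpy(variant + i, cur + i + 1);
            if (is_interesting(variant)) {
                strcpy(cur, variant);           /* variant becomes current */
                progress = 1;
                break;                          /* restart the pass */
            }
        }
    }
}
```

Calling reduce() on "a big bug here" converges to "bug". C-Reduce works the same way at heart, but with a large repertoire of source-to-source transformations instead of single-character deletion.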

A Compiler-Crash Interestingness Test

Here’s the whole test. First of all, we set limits on CPU time and memory usage:

ulimit -t 120
ulimit -v 8000000
ulimit -m 8000000

This is just common sense when doing any kind of fuzzing-related activity. Random test cases tend to evoke pathological behaviors in systems under test and we’d like to avoid bringing our test machine to its knees when this happens. These resource limits cause the interestingness test to terminate (and the variant to be deemed uninteresting) when a resource limit is exceeded.

The rest of the interestingness test is just a series of conditions; the test case is interesting if all of the conditions are met. The conditions should be ordered in such a way that the reduction happens as rapidly as possible. Basically this means: put conditions near the front of the script if they are likely to fail (permitting early bail-out) or if they run quickly.

The first condition is simply:

gcc -c small.c > gcc.out 2>&1

In other words, if the current variant is a legal compilation unit according to GCC, the compiler driver will return a 0 exit code, indicating that the condition is met. Otherwise, the driver returns non-zero and the interestingness test exits here. This accounts for the common case where C-Reduce emits a syntactically incorrect variant.

The next part of the interestingness test looks at the compiler’s output:

! grep 'data definition has no type or storage class' gcc.out &&\
! grep 'assumed to have one element' gcc.out &&\
! grep 'control reaches end of non-void function' gcc.out &&\
! grep 'return type defaults to' gcc.out 

Here we are saying that a variant is uninteresting if it causes GCC to print any of these four warnings. We do not want to submit a test case that, for example, falls off the end of a non-void function. Not only is this kind of thing bad style, but also an obviously undefined test case might cause a compiler developer to reject a bug report, even for a crash bug, if the developer is not in a good mood.

Conditions that are based on compiler warnings are a little unsatisfying. First, warnings are often delivered in a best-effort fashion. That is, compilers cannot always be counted upon to reliably warn about every occurrence of some bad behavior. Warnings about use of uninitialized storage, for example, are notoriously unreliable. The second problem is that here we’ve keyed on the specific syntax of these warnings; if we upgrade to a new GCC that changes the wording, part of our interestingness test will fail. Furthermore, if the system lies about the compiler that is being invoked (for example, Xcode 5 invokes clang when you type “gcc”) then these won’t work at all.

The next condition gets to the heart of the matter:

! gcc-4.8.2 -Ofast small.c -c -w > out.txt 2>&1

Here we insist that some version of GCC, at a certain optimization level, fails to compile the variant. We also need to ensure that it failed in an interesting way, as opposed to for example segfaulting when we’re looking for a different buggy behavior:

grep 'verify_ssa failed' out.txt

And now we’re done. Of course the string that we grep for, as well as the compiler options triggering the crash, are specific to the particular bug we are dealing with.

A Wrong-Code Interestingness Test

Here’s the code.

The central problem in reducing a wrong code test case is that we must reject as uninteresting any variant that relies on undefined or unspecified behavior. C-Reduce is sneaky and it’s going to try thousands and thousands of ways to make the test case smaller, many of which introduce undefined behavior — we need to reject them all. We start off by looking at a considerably larger set of compiler warnings:

clang -pedantic -Wall -O0 -c small.c  >out_clang.txt 2>&1 &&\
! grep 'conversions than data arguments' out_clang.txt &&\
! grep 'incompatible redeclaration' out_clang.txt &&\
! grep 'ordered comparison between pointer' out_clang.txt &&\
! grep 'eliding middle term' out_clang.txt &&\
! grep 'end of non-void function' out_clang.txt &&\
! grep 'invalid in C99' out_clang.txt &&\
! grep 'specifies type' out_clang.txt &&\
! grep 'should return a value' out_clang.txt &&\
! grep 'uninitialized' out_clang.txt &&\
! grep 'incompatible pointer to' out_clang.txt &&\
! grep 'incompatible integer to' out_clang.txt &&\
! grep 'type specifier missing' out_clang.txt &&\
gcc -Wall -Wextra -O1 small.c -o smallz >out_gcc.txt 2>&1 &&\
! grep 'uninitialized' out_gcc.txt &&\
! grep 'without a cast' out_gcc.txt &&\
! grep 'control reaches end' out_gcc.txt &&\
! grep 'return type defaults' out_gcc.txt &&\
! grep 'cast from pointer to integer' out_gcc.txt &&\
! grep 'useless type name in empty declaration' out_gcc.txt &&\
! grep 'no semicolon at end' out_gcc.txt &&\
! grep 'type defaults to' out_gcc.txt &&\
! grep 'too few arguments for format' out_gcc.txt &&\
! grep 'incompatible pointer' out_gcc.txt &&\
! grep 'ordered comparison of pointer with integer' out_gcc.txt &&\
! grep 'declaration does not declare anything' out_gcc.txt &&\
! grep 'expects type' out_gcc.txt &&\
! grep 'pointer from integer' out_gcc.txt &&\
! grep 'incompatible implicit' out_gcc.txt &&\
! grep 'excess elements in struct initializer' out_gcc.txt &&\
! grep 'comparison between pointer and integer' out_gcc.txt

These checks reject certain classes of obviously undefined code, while permitting all manner of more subtly undefined code to pass through. We’re fine with that for the moment. So now to the crux of the matter:

gcc -O0 small.c -o small1 > /dev/null 2>&1 &&\
RunSafely 5 1 /dev/null out_small1.txt ./small1 &&\
gcc-4.8.2 -Ofast small.c -o small2 > /dev/null 2>&1 &&\
RunSafely 5 1 /dev/null out_small2.txt ./small2 &&\
! diff out_small1.txt out_small2.txt

This compound condition ensures that both gcc and gcc-4.8.2 are able to successfully compile the variant, that both compiled variants execute successfully, and that their outputs are different. This is just standard differential testing where we have a reference compiler and a compiler that is believed to be broken. We don’t care what the result of the program is: we only care that the two compilers produce executables with diverging behavior. The “RunSafely” script is one we inherited from somewhere; it simply imposes a smaller timeout (5 seconds, in this case) on the command that it runs. The shorter timeout helps the reduction make progress more rapidly, since C-Reduce often creates variants containing infinite loops.

Finally, we’ll use a more serious undefined behavior checker. We do this last because it is usually the slowest part of the test:

frama-c -cpp-command 'gcc -C -Dvolatile= -E -I.' -val -no-val-show-progress -machdep x86_64 -obviously-terminates small.c > out_framac.txt 2>&1 &&\
! egrep -i '(user error|assert)' out_framac.txt >/dev/null 2>&1

Frama-C is a static analyzer for C code that happens to have a couple of important properties. First, it is sound, meaning that (for the subset of undefined behaviors that it handles) it never determines that a program is free of undefined behaviors unless that is actually the case. Second, Frama-C actually works when handed a nearly arbitrary piece of C code. This property of “actually working” requires a huge amount of effort and it is not one belonging to every tool that processes C code.

As I remarked above, Frama-C does not look for all of C’s ~200 undefined behaviors, but rather for a useful subset including tricky ones like array bounds errors, signed integer overflows, and use of stack variables whose scope has ended. For all other undefined behaviors, we have to trust that either C-Reduce does not introduce them or else one of the compiler warnings above tells us about them. Empirically, the results of these reductions are free of undefined behavior. There may be some luck involved, but it is a systematic kind of luck that comes from properties of the tools and the language, not a random kind of luck like flipping a coin and getting lots of heads.

One might doubt the wisdom of throwing random code at a static analyzer and hoping for precise results. There are a few reasons why this works here. First, as the command above shows, we use the preprocessor to eliminate all volatile qualifiers in the code that we hand to Frama-C. This is necessary because Frama-C (correctly — in the general case) treats reading from a volatile as returning an arbitrary value. However, we happen to know that the only value that will be loaded from a volatile is the last one stored to it, meaning that Frama-C will return the correct result even without these qualifiers. Second, the generated programs contain no other sources of nondeterminism such as system calls. Third, Frama-C tries very hard to avoid losing precision during its analysis. For the “closed” programs emitted by Csmith, it is able to act as an interpreter, never losing any precision at all. This would not be the case in general. The -obviously-terminates option is an optimization added by Pascal to avoid overhead due to checking for a fixpoint in the static analyzer.

In summary, reduction of test cases triggering wrong-code bugs is basically impossible without a tool such as Frama-C. The fact that Csmith and Frama-C work so well together should be viewed as a combination of much hard work and also a collection of lucky correspondences between the design points targeted by these two tools. For the Nth time I’ll plead for someone to write Frama-C++ since at present we can’t reliably reduce C++ wrong-code bugs.

Informative Assertion Failures

The other day a student was asking me how to make assertion failures in software more informative. I told him there are two pretty easy ways to do this. First, define a customized assert that takes two arguments: the predicate and also a description. For example, in the Botan crypto library we see lines like this:

BOTAN_ASSERT(signing_key, "Signing key was set");
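A two-argument assert like this can be sketched as a C macro (hypothetical MY_ASSERT, not Botan’s actual implementation): on failure it prints the source location, the text of the failed predicate, and the description, then aborts.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical two-argument assert. Like the standard assert, it can
   be compiled away for production builds (here, by defining MY_NDEBUG). */
#ifdef MY_NDEBUG
#define MY_ASSERT(pred, desc) ((void)0)
#else
#define MY_ASSERT(pred, desc)                                        \
    do {                                                             \
        if (!(pred)) {                                               \
            fprintf(stderr, "%s:%d: assertion '%s' failed: %s\n",    \
                    __FILE__, __LINE__, #pred, (desc));              \
            abort();                                                 \
        }                                                            \
    } while (0)
#endif
```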

The second technique is to use a standard single-argument assert like this:

assert((depth > 0) && "Invalid depth!");

Any other good tricks out there?

Safe, Efficient, and Portable Rotate in C/C++

Rotating a computer integer is just like shifting, except that when a bit falls off one end of the register, it is used to fill the vacated bit position at the other end. Rotate is used in encryption and decryption, so we want it to be fast. The obvious C/C++ code for left rotate is:

uint32_t rotl32a (uint32_t x, uint32_t n)
{
  return (x<<n) | (x>>(32-n));
}

Modern compilers will recognize this code and emit a rotate instruction, if available:

        movb    %sil, %cl
        roll    %cl, %edi
        movl    %edi, %eax

The source and assembly code for right rotate are analogous.

Although this x86-64 rotate instruction will have the expected behavior (acting as a nop) when asked to rotate by 0 or 32 places, this C/C++ function must only be called with n in the range 1-31. Outside of that range, the code has undefined behavior due to shifting a 32-bit value by 32 places. In general, crypto codes are designed to avoid rotating 32-bit words by 32 places. However, as seen in my last post, the current versions of five of the 10 open source crypto libraries that I examined perform rotations by 0 places — a potentially serious bug because C/C++ compilers have unpredictable results when shifting past bitwidth. Bear in mind that the well-definedness of the eventual instruction does not rescue the situation: it is the source code that places (or in this case, fails to place) obligations upon the compiler.

Here’s a safer rotate function:

uint32_t rotl32b (uint32_t x, uint32_t n)
{
  assert (n<32);
  if (!n) return x;
  return (x<<n) | (x>>(32-n));
}

This one protects against the expected case where n is zero by avoiding the erroneous shift and also against the disallowed case of rotating by too many places. If we don’t want the overhead of the assert for production compiles, we can define NDEBUG in the preprocessor. The problem with this code is that (even with the assert compiled out) it contains a branch, which is sort of ugly. Clang generates the obvious branching code whereas GCC chooses to use a conditional move:

        movl    %edi, %eax
        movb    %sil, %cl
        roll    %cl, %eax
        testl   %esi, %esi
        cmove   %edi, %eax

I figured that was the end of the story but then I saw this version:

uint32_t rotl32c (uint32_t x, uint32_t n)
{
  assert (n<32);
  return (x<<n) | (x>>(-n&31));
}

This one is branchless, meaning that it results in a single basic block of code, potentially making it easier to optimize. At the moment it defeats Clang’s rotate recognizer:

        movl    %esi, %ecx
        movl    %edi, %eax
        shll    %cl, %eax
        negl    %ecx
        shrl    %cl, %edi
        orl     %eax, %edi
        movl    %edi, %eax

However, they are working on this.

GCC (built today) has the desired output:

        movl    %edi, %eax
        movb    %sil, %cl
        roll    %cl, %eax

Here is their discussion of that issue.
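As a sanity check, the branchless formulation can be compared against the branching one for every legal rotate count; the comment spells out why -n&31 always yields the right shift exponent. This is a verification sketch, restating the two functions from above.

```c
#include <assert.h>
#include <stdint.h>

/* Why -n & 31 is the right shift count: for n == 0, -n & 31 is also 0,
   so both shifts are by zero and the OR returns x unchanged; for
   1 <= n <= 31, -n is 2^32 - n in unsigned arithmetic, whose low five
   bits equal 32 - n. Either way, no shift exponent ever reaches 32. */
static uint32_t rotl32c(uint32_t x, uint32_t n) {
    assert(n < 32);
    return (x << n) | (x >> (-n & 31));
}

/* The branching version from earlier in the post, used as a reference. */
static uint32_t rotl32b(uint32_t x, uint32_t n) {
    assert(n < 32);
    if (!n) return x;
    return (x << n) | (x >> (32 - n));
}
```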

Summary: If you want to write portable C/C++ code (that is, you don’t want to use intrinsics or inline assembly) and you want to correctly deal with rotating by zero places, the rotl32c() function from this post, and its obvious right-rotate counterpart, are probably the best choices. GCC already generates excellent code and the LLVM people are working on it.

Integer Undefined Behaviors in Open Source Crypto Libraries

Crypto libraries should be beyond reproach. This post investigates integer-related undefined behaviors found in crypto code written in C/C++. Undefined behavior (UB) is bad because according to the standards, it destroys the meaning of any program that executes it. In practice, over the last decade compilers have become pretty smart about exploiting integer undefined behaviors to generate fast but surprising object code. Plenty of examples are here.

This post was motivated by a discussion on the C++ standards committee’s mailing list for undefined behavior where I proposed making signed left-shift work just like unsigned left-shift. In contrast, in C99, C11, and C++11, it is illegal to shift a 1 bit into, out of, or through the sign bit. Many developers are unaware of this restriction. This seemed to me like a pretty safe proposal since it isn’t clear that any existing compiler implements anything other than two’s complement semantics for signed left shifts in the first place — so the change would essentially just mandate the behavior that is already implemented. Chandler Carruth from Google disagreed with this, stating that not only do the LLVM people believe that these UBs can result in useful optimizations but also that developers were appreciative of the bug reports that come from statically and dynamically detecting signed left-shift problems. Of course, if the UBs disappear due to a tighter language standard, then the capability for error detection disappears too. Chandler’s messages contain some good details so I’ll just link to them: 1, 2, 3. Based on that conversation, my goal here is to try to understand where these integer undefined behaviors come from and whether they correspond to “real bugs” or not.

To make matters just a little more complicated, there’s a C++ defect report stating that it should be permissible to shift a 1 into the sign bit, but not out of or through it. The defect report has no effect upon any of the standards until a TC is released, but it does indicate that this particular issue is considered to have been fixed by the committee. There is no corresponding defect report for C, as far as I know.
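As a concrete illustration of the restriction (using a hypothetical bit32 helper, not code from any particular library): with 32-bit int, the signed expression 1 << 31 shifts a 1 bit into the sign bit and is undefined in C99 and C11; forcing the operand to an unsigned type sidesteps the rule entirely.

```c
#include <stdint.h>

/* Hypothetical helper: set bit k of a 32-bit word. The unsigned
   constant keeps the shift well defined for every k in 0..31,
   including k == 31, where the signed form (1 << 31) would shift a
   1 bit into the sign bit. */
static uint32_t bit32(unsigned k) {
    return UINT32_C(1) << k;
}
```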

Let’s get on with the testing. The methodology used here was, for each of a number of open source crypto libraries that include a substantial amount of C/C++ code:

  1. Grab a recent version
  2. Build it using Clang with the -fsanitize=integer flag on an x86-64 machine running Linux
  3. Run the library’s built-in test suite
  4. Examine the resulting undefined behaviors

Note that if you do this kind of work yourself, you’re most likely better off using the -fsanitize=undefined flag because it doesn’t warn about (well-defined) unsigned overflows and it warns about many undefined behaviors not relating to integers.

BeeCrypt 4.2.1

This library executes no integer undefined behaviors while running its test suite. Nice!

Botan 1.11.4

Two of Botan’s rotate functions, found in rotate.h, take an unsigned int and shift it by 32 places. The code looks like this:

template<typename T> inline T rotate_left(T input, size_t rot)
  {
  return static_cast<T>((input << rot) | (input >> (8*sizeof(T)-rot)));
  }

The problem occurs not when rotating by 32, but rather when rotating by 0. Rotate by 0 is undefined because it causes the second shift operator to shift by 32 − 0, i.e., by the full bitwidth. The fix (applied by the code’s author already) is to add a check for rotate-by-0 and in that case return the unmodified input. This is a little bit annoying because in the expected case where the rotate code is translated into a rotate instruction (both GCC and Clang will do this) the test just adds overhead to the generated code, since the machine instruction is perfectly well defined for rotating by zero. The compiler could, in principle, recognize and remove the redundant test, but neither GCC nor Clang has been taught to do so.

Shifting by bitwidth (or more) is a serious kind of undefined behavior that can lead to portability problems across compilers, platforms, and even optimization levels. On the version of Clang from Xcode 5, a little shift-past-bitwidth program that I wrote has different results each time it is run.

HELib from Github on 10/26/13

This library executes numerous integer undefined behaviors, but all of them occur inside code from NTL. However, NTL can be configured using the NTL_CLEAN_INT option, which makes the undefined behaviors go away. In other words, the NTL author (who I had a short discussion with) is aware of the undefined behaviors but prefers the version that executes them since it results in faster code and compilers do not currently (as far as we know) exploit the undefined behaviors to break NTL. My own take is that if NTL were distributed as a library that could be solidly unit tested, this might not be a bad bet to make. However, a lot of NTL code gets included in applications via header files, giving the compiler plenty of opportunity to do tricky things. Personally, I would not trust a modern compiler to fail to exploit an undefined integer overflow when it can see all of the code and propagate constants through it.

I didn’t report anything to the HELib people since the UBs aren’t in their code, and their code appears to be intended for research purposes anyhow.

LibTomCrypt 1.17

This library contains an interesting undefined behavior in a line of code (anubis.c:934) that shifts a value left by 24:

    for (i = 0, pos = 0; i < N; i++, pos += 4) {
      kappa[i] =
         (key[pos    ] << 24) ^
         (key[pos + 1] << 16) ^
         (key[pos + 2] <<  8) ^
         (key[pos + 3]      );
    }

No undefined behaviors can occur when shifting an unsigned value left by 24. However, in this code key is a pointer to unsigned char. The char value is promoted to int before being shifted; if the value in the int is >127, a 1 bit will end up being shifted into the sign bit, which is undefined. This example nicely illustrates how the integer promotion rules in C/C++ can have surprising consequences. There’s another instance of the same problem in the same file at line 1051. I reported these issues but haven’t heard back from the developers yet.
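One fix (a sketch using a hypothetical load_be32 helper, not LibTomCrypt’s actual patch) is to cast each byte to uint32_t before shifting, so that the whole computation stays unsigned:

```c
#include <stdint.h>

/* Assemble a 32-bit big-endian word from four bytes. Casting each
   byte to uint32_t before shifting keeps the computation unsigned, so
   shifting a byte value greater than 127 left by 24 places is well
   defined. For disjoint bit ranges, XOR and OR give the same result. */
static uint32_t load_be32(const unsigned char *key) {
    return ((uint32_t)key[0] << 24) ^
           ((uint32_t)key[1] << 16) ^
           ((uint32_t)key[2] <<  8) ^
           ((uint32_t)key[3]      );
}
```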

Libgcrypt 1.5.3

Ok, at this point things start to get a bit repetitive so I’ll start leaving out the details. This version of Libgcrypt contains:

  • 4 locations in blowfish.c, 8 locations in cast5.c, 3 locations in des.c, and 8 locations in twofish.c where an unsigned char is promoted to int and then left-shifted by 24 places, resulting in undefined behavior
  • 2 locations in cast5.c where a 32-bit integer is shifted by 32 places, again as part of a rotate operation

Keep in mind that the former problems are probably benign for now, whereas the latter ones should definitely be fixed. These have been reported (see developer’s response here).

Crypto++ 5.6.2

3 locations in misc.h where a 32-bit unsigned value is shifted by 32 places, again in rotate functions. Reported.

Sodium 0.4.5

3 locations in aes256-ctr.c where an unsigned char is promoted to signed int and shifted left by 24. Reported, and fixed within hours.

Libmcrypt 2.5.8


  • 2 locations in cast-128.c where the rotate operation leads to a shift by 32 places of a 32-bit integer
  • 26 locations in cast-256.c with very large shift exponents up to almost 2^32; the code is heavily macroized so I didn’t dig in too deeply
  • 2 signed multiplication overflows in enigma.c
  • 1 location in des.c, 1 in gost.c, and 2 in loki97.c where a 1 is left-shifted into the sign bit
  • 1 location in tripledes.c where a 1 is left-shifted out of the sign bit

I didn’t bother reporting since libmcrypt was last released in 2007 and it has a number of long-open, serious-looking bugs. Hopefully most people have phased out their use of this library, though I see that it is still provided as a package in Ubuntu 13.10.

Nettle 2.7.1

1 location in blowfish.c and 1 in twofish.c where an unsigned char is promoted to signed int and then left-shifted by 24. 2 locations in cast128.c where a 32-bit value is shifted by 32 places in a rotate function. Reported, and the rotate has been fixed already.

OpenSSL SNAP-20131112

At a_int.c:397 and obj_dat.c:143 there are undefined signed left-shifts. These — unlike most of what we’ve seen so far — are the result of variables that were declared as signed, as opposed to being the result of implicit conversions. These shifts should be done more carefully, or should be done using unsigned operations.

In c_enc.c there are 20 sites that shift a 32-bit variable by 32 places, all resulting from a macro called E_CAST.

gost89.c has 8 locations where a 1 is left-shifted into the sign bit, all are due to promotion of an unsigned char to int. Same thing happens at 2 locations in gost_crypt.c.

These have been reported.

The Bad News

The bad news, obviously, is that nearly all of the crypto libraries I tested execute integer undefined behaviors. There appear to be several root causes:

  1. C’s weird integer promotion rules practically guarantee that signed types will be introduced into computations that start out with only unsigned types. Signed math is hard because its operators have much stricter rules to follow if we are to avoid undefined behavior. Unsightly explicit casts to unsigned are the solution.
  2. Some developers continue to purposefully use signed math under the faulty assumption that the C/C++ languages are built around two’s complement arithmetic.
  3. Reasoning about function preconditions is hard even for experienced developers. In my opinion, some of these libraries could have used a lot more assertions to go along with their (generally perfectly adequate) test suites.

The Good News

  1. Most of the undefined behaviors seen here stem from a few simple patterns, most notably “rotate by zero” and “unsigned char is promoted to signed int.” Crypto developers can and should learn to recognize and deal correctly with these. In general, the problems seen here are not difficult to fix.
  2. The developers who have responded to my bug reports seem very on the ball and generally happy that people are vetting their software.
  3. Clang’s undefined behavior sanitizer revealed all of these problems trivially. Use it, folks.

Next Steps

The undefined behaviors reported here are a subset of all undefined behaviors executed by these libraries, in three senses:

  1. I only checked for integer undefined behaviors, whereas Clang’s undefined behavior sanitizers can find other problems such as memory safety bugs.
  2. Even Clang’s full set of undefined behavior sanitizers checks only a subset of all undefined behaviors in C/C++. More work on checkers is needed if we are to keep using C/C++ in security-critical roles.
  3. Even if we had a checker for all undefined behaviors, we would still be limited by the quality of each library’s built-in test suite. Static analysis is one solution and better testing is another.

If our security relies on crypto libraries operating correctly, it would not be a bad idea for someone to spend some time and energy vetting crypto code using a tool like Frama-C. Crypto code is nice and mathy: a perfect target for formal methods.

Finally, if I’ve missed your favorite crypto library, please let me know and I’ll take a look. I am interested only in open source code which contains a substantial amount of C/C++ and which can be built on Linux or OS X.

A New Compiler Fuzzing Paper

Google Scholar’s recommendation engine has turned out to be a great resource for learning about new papers. Recently this paper about compiler fuzzing turned up in my feed and I was hooked after noticing that they claim to find a lot more compiler bugs during a 12-hour fuzzing run than Csmith can find.

The work in the paper succeeds by addressing a problem that we punted on while developing Csmith: generating expressions that are statically guaranteed to be free of integer undefined behaviors. They accomplish this using static analysis combined with random search. Csmith, in contrast, wraps all math operations with safety checks in order to prevent undefined behaviors dynamically. This is a klunky solution that — as the authors of the new paper note — reduces Csmith’s expressiveness.

The character of the bugs they found follows directly from the generation strategy.

These bugs are, frankly, a little surprising. GCC and LLVM are mature tools and I’d have hoped they were beyond these simple math errors.

Besides the obvious fact that this new tool is generating expressions that Csmith cannot generate, there’s also a more subtle thing going on, which is that any new C code generator is likely to find bugs that Csmith cannot. This is because there are a hundred little choices that get made while generating test cases such as choice of identifiers, choice of function size, how many parentheses to put in and where, etc. Many of these have no effect on bug-finding power but some do, and there’s really no way to know which of them matter in advance. This is one of the most dissatisfying things about random testing.

As far as I can tell, the new compiler testing tool has not been released. Hopefully the authors will release it. If they do release it under an appropriate license, one thing I would consider doing is integrating it into Csmith. This would be easy to do by having Csmith just fail to generate code for a few functions and then generate them using this new tool. At that point we should get the benefits of both tools.

UPDATE: Here are two more C compiler fuzzers I just learned about:

Producing Good Software From Academia

Writing and maintaining good software from academia isn’t easy. I’ve been thinking about this because last week my student Yang Chen defended his thesis. While I’m of course very happy for him, I’m also depressed since Yang’s departure will somewhat decimate the capacity of my group to rapidly produce good code. Yang looked over my group’s repositories the other day and it turns out he has committed about 200,000 lines of code while working with me.

Specifically, I’m bummed about not having a very good story for maintaining tools like Csmith and C-Reduce. Ironically, I was the original C-Reduce developer and maintained it for several years but then Yang stepped in and wrote 30,000 lines of C++ that made it get much better results — and this work became part of his thesis. However, I’m not super interested in maintaining his code myself: it’s pretty big and pretty hairy and it interacts closely with Clang — a fast-moving project that requires active effort to keep up with. Csmith was never my code base, though I did contribute a little in its early days. The intent was always to throw it away after we figured out the right way to do random program generation. Well, guess what? That goal remains just a goal, and in the meantime it would be nice if we could keep working on Csmith, which continues to be useful to people. Prior to C-Reduce and Csmith my group produced a number of tools that I’d have liked to maintain, but that got abandoned due to lack of development power.

I suspect my situation is a common one for mid-career CS professors who work in systems, software engineering, security, PL, and other engineering-oriented parts of the field. So how should we go about producing good code? Here are some possibilities.

Forget About Coding

I’ve heard people say that it’s not our job in academia to produce good software. I find that to be a silly position to take. Our job is to do research that has impact. If we need to produce good software to do that job, then producing good software is part of the job. There are plenty of other reasons to write code:

  • Ideas without implementations often have gaps and flaws that would have become immediately apparent if an implementation had been attempted.
  • When you write code, you explore and reject a large number of program designs; with each rejected choice you learn something. I’m convinced that the cumulative effect of these small pieces of design feedback is very important in developing and maintaining a solid intuition for research.
  • Feedback from users is also very valuable; you can’t get users without writing code that is at least moderately usable.

Some researchers are capable of doing top-quality work without these types of feedback. Many more are not.

Embrace Crappiness

Even when we researchers do produce software, the code is often released as a versionless tarball with no particular license. The makefile tends to refer to specific paths on the system where the software was developed. There might be a few test cases and if we’re lucky a README. The tarball was uploaded around the time the final copy for some paper was submitted and it has not been updated since then. In all likelihood, the only way to compile it is on some old version of Linux. This version, needless to say, must be determined by trial and error since it’s not mentioned in the README. Is this sounding a little bit familiar? Also see the CRAPL. Of course, a crappy software release is better than none at all, and at least it serves the goal of reproducibility, assuming that compilation challenges can be overcome. Thanks to sites like Github, research software has improved at least a bit, but the bit-rotted tarball is far from dead.

I’m not going to pretend that I haven’t repeatedly embraced crappiness. In fact the ability to do so is one of the main perks of being a researcher in the first place. On the other hand, when an idea ends up having legs, the crapware option stops making sense.

Professor as Hacker

It’s no secret that maybe a third of CS professors are completely inept at writing code. At the other extreme we have people like Xavier Leroy and Matthew Flatt who are principal developers and maintainers of large pieces of high-quality software. When I go to lunch with Matthew he’s always like “Yeah… I really didn’t like the ARM code coming out of Racket so I rewrote the jitter this morning.” Most of us are in the middle somewhere and as far as I can tell, many of us are bummed out that we have very little time to write code anymore.

So in case that wasn’t clear, the problem with maintaining the code ourselves is that we don’t scale very well. For example, I can and do maintain some small things such as the 2,000-line main body of C-Reduce that is now dwarfed by Yang’s C++, but I can’t keep taking on new tasks.

Hire Long-Term Research Staff

A time-honored way to produce high-quality software from academia is to run a research empire large enough that it can sustain full-time research staff over a period of at least a decade. These staffers are people who like academia and are proficient at code craftsmanship, but who lack a desire to run the show.

For many of us, the problem with this plan is that empire builders become managers who spend most of their time acquiring and keeping grant money. The real work gets done via delegation. I recall a time when my advisor in grad school was going to about one PI meeting per month — that cannot have been fun. Empirically, there’s very little overlap between the set of professors who are PIs on major grants capable of supporting long term staff and the set of professors who write code and are otherwise in touch with the low-level details of their operations.

Leverage Open Source

In this scenario, a research group develops a prototype, open sources it, and then a community of volunteers takes over subsequent development. This is the dream outcome. It could happen, but I don’t think it is very common. We’ve gotten a number of patches for Csmith but most of them have been either relatively simple cleanups or pretty substantial sets of changes that were done to support someone’s custom compiler and that we haven’t been too eager to integrate into our version of Csmith since they would make our own work harder. Csmith isn’t very modular. With C-Reduce, we’ve had better luck — we’ve gotten plenty of good bug reports, a number of patches, and even some new features. But overall, it is not easy to build a new open source community. George Necula, who developed the wonderful CIL tool, once complained that CIL had a lot of users but very few were contributing back. Maintenance of CIL was eventually taken over by volunteers, though it may have now stagnated a bit — LLVM has moved into this research niche in a big way.

A Chain of Students

A final alternative is to try to recruit students who have, or can acquire, strong code craftsmanship skills and then put some of the code development and maintenance burden on them. In this model a new student spends some time developing and maintaining existing codes before, and in conjunction with, working on her own projects.

As I said at the top of this piece, this is a strategy that I have used to some extent. It sort of works but has various problems. First, a very light touch is required: it isn’t fair to burden students with too much work that does not contribute to their own progress towards a degree. Second, even PhD students aren’t around for all that long, in the larger scheme of things. Third, not all students are interested in or capable of maintaining large codes, or of writing code that is worth maintaining.


In general, academia is set up to reward quick projects that result in a few papers and maybe a few tarballs linked to grad students’ web pages. Long-term development and maintenance of code requires either heroic effort by the PI or else big funding.

A possibility that I deliberately left out of this piece since it isn’t “producing good software from academia” is spinning off a company. I have a huge amount of respect for academics who form startups, but this option seems so invasive in terms of overall lifestyle that I haven’t seriously considered it yet.

Trust, But Verify

CompCert is an optimizing C compiler whose output provably has the same semantics as its input, at least when the input programs are conforming. Of course this high-level view sweeps a huge number of details under the rug. If we want to gain confidence in CompCert’s correctness we’ll need to either dig into these details — making sure, for example, that the formalizations of x86 and of C are done right — or else we’ll need a different plan.

One kind of alternate plan is differential testing, where we supply both CompCert and one or more other compilers with conforming C programs and check that the resulting executables all have the same behavior. For the Csmith paper we spent quite a bit of time testing CompCert in just this way. We found some problems in it along the way, but were unable to find code generation bugs in the latest version of CompCert at the time that we finalized the paper.
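The differential-testing loop is simple at its core: feed the same input to several implementations of one specification and flag any disagreement, since at least one implementation must then be wrong. Here is a minimal Python sketch of the idea; the two toy “implementations” (of saturating 8-bit addition, one deliberately buggy) are stand-ins of my own invention, since a real harness would compile and run each test program under every compiler instead:

```python
import random

def diff_test(implementations, gen_input, trials=1000):
    """Run the same random input through every implementation and
    return the first input on which their outputs disagree, or None."""
    for _ in range(trials):
        x = gen_input()
        results = [impl(x) for impl in implementations]
        if len(set(results)) != 1:
            return x, results  # divergence: at least one impl is wrong
    return None  # no disagreement observed in this run

# Two "implementations" of the same spec: saturating 8-bit addition.
def sat_add_ref(pair):
    a, b = pair
    return min(a + b, 255)      # correct: clamps at 255

def sat_add_alt(pair):
    a, b = pair
    return (a + b) & 0xFF       # BUG: wraps around instead of saturating

random.seed(0)  # make the run reproducible
gen = lambda: (random.randrange(256), random.randrange(256))
failure = diff_test([sat_add_ref, sat_add_alt], gen)
print(failure is not None)  # True: the wrap-around bug is caught
```

Note that nothing here says which implementation is wrong; the harness only reports disagreement, and a human (or a majority vote among many compilers) assigns blame. That is exactly the situation with Csmith-generated programs, where the generated program’s checksum is the “output” being compared.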

Recently I returned to CompCert testing and wanted to share a few observations about that. The reference compilers were recent development versions of Clang and GCC, targeting x86, with optimizations disabled. One reason that I wanted to get back to CompCert is that its recent versions contain a function inliner. This is a pretty important optimization and it is one that Xavier once told me would be very difficult to handle in CompCert because it changes the shape of the program in a significant way. So this seemed worth testing. Also, CompCert now has a better register allocator, it supports 64-bit integers, and has a few other new features.

I ran 1,000,000 randomly-generated test cases through CompCert 2.0 without asking it to do any inlining. This turned up no problems in CompCert, but — somewhat surprisingly — triggered a GCC wrong-code bug just once out of the million test cases. This bug was rapidly fixed. A few weeks later, I ran another 1,000,000 test cases through CompCert 2.1, this time asking it to do a lot of inlining. This time, no bugs were found in GCC, Clang, or CompCert.

Given that all three of these compilers are capable of translating a million randomly generated programs without a detectable miscompilation, one might ask: Why CompCert? The main answer is that it generates better code than gcc -O0 or clang -O0. If we want better code out of the other compilers we can turn on their optimizers, but then they would no longer be able to translate 1,000,000 random programs without miscompiling a few of them. Of course we don’t care about random programs. The real questions are (1) how can we be sure that compiler bugs aren’t going to bite the programs that we care about and (2) when developing safety-critical software, how can we make an argument that the compiler has not introduced any flaws? One answer is translation validation, where we use an external tool to check that the compiler has respected the meaning of the input code. But this is quite difficult to do in the presence of optimizations.

The point I wanted to make in this post is simply that while proofs are great, if the subject of the proof is a piece of code that has to accomplish a real-world task, we can sure breathe a lot easier if we have an independent means of validating that the code works as advertised. In this case, we do and it does.