Skip to content

Testing with Small Capacities

The goal of testing is to expose bugs. To find a bug we must drive the software under test into a part of its state space where the bug becomes evident, for example by causing a crash. Many bugs lurk in dark corners of the state space that are ordinarily difficult to reach. This piece is about a nice technique for making those dark corners easier to reach using either random and non-random testing. Although the method is one that I have used before, I hadn’t really thought about it explicitly until a conversation with Colin Runciman the other day brought it out into the open.

The issue is the small scope hypothesis, which states that most bugs can be triggered using small test cases. Here we’re talking about a class of bug that falls outside of this hypothesis: these bugs are triggered when some implementation limit is exceeded. For example, if we have a bounded buffer with length 10,000, and this buffer contains a bug in its handling of the buffer-full case, then it is clear that no test case can trigger the bug unless it results in more than 10,000 insertions into the buffer. Similarly, a bug in an operating system’s swapping code will not be hit until we put memory pressure on the machine — this will be difficult if we’re testing on a machine with lots of RAM. A bug in a compiler’s register spill logic will not be hit until we run out of registers, which may happen only rarely on a RISC with a large register file.

The testing technique, then, is simply to recognize capacities in the system under test and to artificially reduce them in order to make capacity bugs much easier to trigger. I think that probalby every accomplished programmer has used this technique, perhaps without even thinking about it: a simple example is testing the code for an auto-resizing hash table by making its initial capacity very small. This is just good sense.

Of course, while some capacities appear as explicit constants in the code that can be easily changed, others, such as the number of registers available to a compiler, will tend to be encoded in a more implicit fashion, and will therefore be more difficult to find and change. Also, the fact that a constant is hard-coded into a program doesn’t mean this constant can be freely changed: in some cases the constant is fake (how many programs that reference the CHAR_BIT constant can really tolerate values other than 8?) and in others the program is only designed to operate for a certain range of choices. To continue with the compiler example, the calling convention (and other factors) will almost certainly cause failures when the number of registers falls below a certain number. Similarly, a pipe or a thread pool can easily cause the system to deadlock when it is instantiated with a capacity that is too small. These things are design constraints, not the implementation errors that we are looking for.

In summary, if you have a system that offers a choice of capacity for an internal resource, it is often very useful to reduce this capacity for testing purposes, even if this results in poor throughput. This effect is particularly pronounced during fuzzing, where the random walks through the state space that are induced by random test cases will tend to hover around mean values with high probability.

Find the Integer Bug

Not all of the functions below are correct. The first person to leave a comment containing a minimal fix to a legitimate bug will get a small prize. I’m posting this not because the bug is particularly subtle or interesting but rather because I wrote this code for a piece about integer overflow and thought — wrongly, as usual — that I could get this stuff right the first time.

By “legitimate bug” I mean a bug that would cause the function to execute undefined behavior or return an incorrect result using GCC or Clang on Linux on x86 or x64 — I’m not interested in unexpected implementation-defined integer truncation behaviors, patches for failures under K&R C on a VAX-11, or style problems. For convenience, here are GCC’s implementation-defined behaviors for integers. I don’t know that Clang has a section in the manual about this but in general we expect its integers to behave like GCC’s.

#include <limits.h>
#include <stdint.h>

 * specification:
 * perform 32-bit two's complement addition on arguments a and b
 * store the result into *rp
 * return 1 if overflow occurred, 0 otherwise

int32_t checked_add_1(int32_t a, int32_t b, int32_t *rp) {
  int64_t lr = (int64_t)a + (int64_t)b;
  *rp = lr;
  return lr > INT32_MAX || lr < INT32_MIN;

int32_t checked_add_2(int32_t a, int32_t b, int32_t *rp) {
  int32_t r = (uint32_t)a + (uint32_t)b;
  *rp = r;
    (a < 0 && b < 0 && r > 0) ||
    (a > 0 && b > 0 && r < 0);

int32_t checked_add_3(int32_t a, int32_t b, int32_t *rp) {
  if (b > 0 && a > INT32_MAX - b) {
    *rp =  (a + INT32_MIN) + (b + INT32_MIN);
    return 1;
  if (b < 0 && a < INT32_MIN - b) {
    *rp =  (a - INT32_MIN) + (b - INT32_MIN);
    return 1;
  *rp = a + b;
  return 0;

UPDATE: The prize has been awarded, see comment 5. The prize is an Amazon gift certificate for $27.49 — the cost of the kindle edition of Hacker’s Delight.

Research Advice from Alan Adler

Although I am a happy French press user, I enjoyed reading an article about Alan Adler and the AeroPress that showed up recently on Hacker News. In particular, I love Adler’s advice to inventors:

  1. Learn all you can about the science behind your invention.
  2. Scrupulously study the existing state of your idea by looking at current products and patents.
  3. Be willing to try things even if you aren’t too confident they’ll work. Sometimes you’ll get lucky.
  4. Try to be objective about the value of your invention. People get carried away with the thrill of inventing and waste good money pursuing something that doesn’t work any better than what’s already out there.
  5. You don’t need a patent in order to sell an invention. A patent is not a business license; it’s a permission to be the sole maker of product (even this is limited to 20 years).

Now notice that (disregarding the last suggestion) we can simply replace “invention” with “research project” and Adler’s suggestions become a great set of principles for doing research. I think #4 is particularly important: lacking the feedback that people in the private sector get from product sales (or not), us academics are particularly susceptible to falling in love with pretty ideas that don’t improve anything.

A New Development for Coverity and Heartbleed

As a consequence of my last post, I spent some time on the phone Friday with Andy Chou, Coverity’s CTO and former member of Dawson Engler’s research group at Stanford, from which Coverity spun off over a decade ago. Andy confirmed that Coverity does not spot the heartbleed flaw and said that it remained stubborn even when they tweaked various analysis settings. Basically, the control flow and data flow between the socket read() from which the bad data originates and the eventual bad memcpy() is just too complicated.

Let’s be clear: it is trivial to create a static analyzer that runs fast and flags heartbleed. I can accomplish this, for example, by flagging a taint error in every line of code that is analyzed. The task that is truly difficult is to create a static analysis tool that is performant and that has a high signal to noise ratio for a broad range of analyzed programs. This is the design point that Coverity is aiming for, and while it is an excellent tool there is obviously no general-purpose silver bullet: halting problem arguments guarantee the non-existence of static analysis tools that can reliably and automatically detect even simple kinds of bugs such as divide by zero. In practice, it’s not halting problem stuff that stops analyzers but rather code that has a lot of indirection and a lot of data-dependent control flow. If you want to make a program that is robustly resistant to static analysis, implement some kind of interpreter.

Anyhow, Andy did not call me to admit defeat. Rather, he wanted to show off a new analysis that he and his engineers prototyped over the last couple of days. Their insight is that we might want to consider byte-swap operations to be sources of tainted data. Why? Because there is usually no good reason to call ntohs() or a similar function on a data item unless it came from the network or from some other outside source. This is one of those ideas that is simple, sensible, and obvious — in retrospect. Analyzing a vulnerable OpenSSL using this new technique gives this result for the heartbleed code:

Nice! As you might guess, additional locations in OpenSSL are also flagged by this analysis, but it isn’t my place to share those here.

What issues are left? Well, for one thing it may not be trivial to reliably recognize byte-swapping code; for example, some people may use operations on a char array instead of shift-and-mask. A trickier issue is reliably determining when untrusted data items should be untainted: do they need to be tested against both a lower and upper bound, or just an upper bound? What if they are tested against a really large bound? Does every bit-masking operation untaint? Etc. The Coverity people will need to work out these and other issues, such as the effect of this analysis on the overall false alarm rate. In case this hasn’t been clear, the analysis that Andy showed me is a proof-of-concept prototype and he doesn’t know when it will make it into the production system.

UPDATE: Also see Andy’s post at the Coverity blog.

Heartbleed and Static Analysis

Today in my Writing Solid Code class we went through some of the 151 defects that Coverity Scan reports for OpenSSL. I can’t link to these results but take my word for it that they are a pleasure to read — the interface clearly explains each flaw and the reasoning that leads up to it, even across multiple function calls. Some of the problems were slightly alarming but we didn’t see anything that looks like the next heartbleed. Unfortunately, Coverity did not find the heartbleed bug itself. (UPDATE: See this followup post.) This is puzzling; here’s a bit of speculation about what might be going on. There are basically two ways to find heartbleed using static analysis (here I’ll assume you’re familiar with the bug; if not, this post is useful). First, a taint analysis should be able to observe that two bytes from an untrusted source find their way into the length argument of a memcpy() call. This is clearly undesirable. The Coverity documentation indicates that it taints the buffer stored to by a read() system call (unfortunately you will need to login to Coverity Scan before you can see this). So why don’t we get a defect report? One guess is that since the data buffer is behind two pointer dereferences (the access is via s->s3->, the scanner’s alias analysis loses track of what is going on. Another guess is that two bytes of tainted data are not enough to trigger an alarm. Only someone familiar with the Coverity implementation can say for sure what is going on — the tool is highly sophisticated not just in its static analysis but also in its defect ranking system.

The other kind of static analysis that would find heartbleed is one that insists that the length argument to any memcpy() call does not exceed the size of either the source or destination buffer. Frama-C in value analysis mode is a tool that can do this. It is sound, meaning that it will not stop complaining until it can prove that certain defect classes are not present, and as such it requires far more handholding than does Coverity, which is designed to unsoundly analyze huge quantities of code. To use Frama-C, we would make sure that its own header files are included instead of the system headers. In one of those files we would find a model for memcpy():

/*@ requires \valid(((char*)dest)+(0..n - 1));
  @ requires \valid_read(((char*)src)+(0..n - 1));
  @ requires \separated(((char *)dest)+(0..n-1),((char *)src)+(0..n-1));
  @ assigns ((char*)dest)[0..n - 1] \from ((char*)src)[0..n-1];
  @ assigns \result \from dest;
  @ ensures memcmp((char*)dest,(char*)src,n) == 0;
  @ ensures \result == dest;
extern void *memcpy(void *restrict dest,
                    const void *restrict src, 
                    size_t n);

The comments are consumed by Frama-C. Basically they say that src and dest are pointers to valid storage of at least the required size, that the buffers do not overlap (recall that memcpy() has undefined behavior when called with overlapping regions), that it moves data in the proper direction, that the return value is dest, and that a subsequent memcmp() of the two regions will return zero.

The Frama-C value analyzer tracks an integer variable using an interval: a representation of the smallest and largest value that the integer could contain at some program point. Upon reaching the problematic memcpy() call in t1_lib.c, the value of payload is in the interval [0..65535]. This interval comes from the n2s() macro which turns two arbitrary-valued bytes from the client into an unsigned int:

#define n2s(c,s)        ((s=(((unsigned int)(c[0]))<< 8)| \
                            (((unsigned int)(c[1]))    )),c+=2)

The dest argument of this memcpy() turns out to be large enough. However, the source buffer is way too small. Frama-C would gripe about this and it would not shut up until the bug was really fixed.

How much effort would be required to push OpenSSL through Frama-C? I don’t know, but wouldn’t plan on getting this done in a week or three. Interestingly, a company spun off by Frama-C developers has recently used the tool to create a version of PolarSSL that they promise is immune to CWEs 119, 120, 121, 122, 123, 124, 125, 126, 127, 369, 415, 416, 457, 562, and 690. I think it would be reasonable for the open source security community to start thinking more seriously about what this kind of tool can do for us.


  • In a comment below, Masklinn states that OpenSSL’s custom allocators would defeat the detection of the too-large argument to memcpy(). This is indeed a danger. To avoid it, as part of applying Frama-C to OpenSSL, the custom malloc/free functions would be marked as being malloc-like using the “allocates” and “frees” keywords supported by ACSL 1.8. Coverity lets you do the same thing, and so would any competent analyzer for C. Custom allocators are regrettably common in oldish C code.
  • I’m interested to see what other tools have to say about heartbleed. If you have a link to results, please put it in a comment.
  • I re-ran Coverity after disabling OpenSSL’s custom freelist and also hacking CRYPTO_malloc() and friends to just directly call the obvious function from the malloc family. This caused Coverity to report 173 new defects: mostly use-after-free and resource leaks. Heartbleed wasn’t in the list, however, so I stand by my guess (above) that perhaps something related to indirection caused this defect to not be ranked highly enough to be reported.
  • HN has some discussion of PolarSSL and of this blog post. Also Reddit.


I’m teaching a small Advanced Operating Systems course this spring. Preparing for the course over winter break, I spent some time reading various Linux subsystems such as the scheduler, and was a bit shocked at how complex it has become. I’ve been using Linux, looking at its code, and occasionally hacking it for more than 20 years, and it seems that my impression of it as a fairly simple, easy-to-follow kernel has become badly out of date. It’s not that this glorious 7991-line file is impossible to understand, but rather that it — along with the other 16,000 lines of code in the kernel/sched directory — isn’t obviously the correct thing to inflict on some undergrads who are interested enough in operating systems to take a second course.

While looking around for alternatives I tried out Xv6, which I had known about for a while but hadn’t looked at closely. Xv6 is a rewrite of v6 UNIX in modern C that runs on multicore x86 chips. It compiles in a couple of seconds and is trivial to boot up in QEMU. It took me a while to see the genius of Xv6, which is that it is simpler than I would have thought a working multicore OS with shell and filesystem could be. For example, it lacks wait queues and ready queues — in Xv6, both wakeup and scheduling are accomplished by looping over the all-process table. Similarly, there’s no malloc() in the kernel, but rather just a page allocator. The pipe implementation copies one byte at a time. Amazingly, even the bootloader is a pleasure to read. Another nice thing about Xv6 is that it comes with a short textbook that explains OS concepts in terms of their implementations in Xv6.

So what did we do with Xv6 beside read it? One exercise was to speed up its pipe implementation as much as possible while preserving UNIX semantics. This turned out to be a really nice exercise (I thought). It is fairly easy to get within a factor of two of Linux pipe throughput on a two-core VM. We implemented a ring buffer for zero-copy bulk data transfer between processes. Finally, we added priorities and ready queues.

I have no real complaints about Xv6: the code is clean and commented, the functions are small, and overall it makes a great instructional OS. The look and feel are very similar to what I remember from hacking on Minix when I took an advanced OS class around 1994. Actually I do have one small complaint, which is that in a few places liberties are taken with error checking. For example, in the allocuvm() function from vm.c, which grows a process, there is a call to mappages() that mysteriously fails to check the return value. Is there some reason that this particular call cannot fail? If so, a comment explaining the reasoning is needed. If not, error checking is required. I’m sensitive about this issue since I tell the students over and over that they cannot ignore failures in this kind of code.

Another OS that we’ve been learning from is the Windows 2000 kernel which is — surprisingly, to many people — a simple and elegant multicore OS that provides some real-world contrast to Xv6′s over-simplicity. I haven’t seen any Windows kernels later than 2000, I’d be curious to hear if they have remained so nice.

Automated Reasoning About LLVM Optimizations and Undefined Behavior

Following up on last week’s toy use of Z3 and my LLVM superoptimizer post from a few weeks ago, I wanted to try to prove things about optimizations that involve undefined behavior.

We’ll be working with the following example:

int triple (int x)
  return x + x + x;

Clang’s -O0 output for this code is cluttered but after running mem2reg we get a nice literal translation of the C code:

define i32 @triple(i32 %x) #0 {
  %1 = add nsw i32 %x, %x
  %2 = add nsw i32 %1, %x
  ret i32 %2

Next we can run instcombine, with this result:

define i32 @triple(i32 %x) #0 {
  %1 = mul i32 %x, 3
  ret i32 %1

The optimization itself (replacing x+x+x with x*3) is uninteresting and proving that it is correct is trivial in Z3 or any similar tool that supports reasoning about bitvector operations. To see this, go to the Z3 site and paste this code into the Python window on the right:

x = BitVec('x', 32)
slow = x+x+x
fast = x*3
prove (slow == fast)

The interesting aspect of this optimization is that the original C code and LLVM code are not using two’s complement addition. Rather, + in C and add nsw in LLVM are only defined when the addition doesn’t overflow. On the other hand, the multiplication emitted by instcombine is not qualified by nsw: it wraps in the expected fashion when the multiplication overflows (see the docs for more detail). My goal was to use Z3 to automatically prove that instcombine dropped the ball here: it could have emitted mul nsw which has weaker semantics and is therefore more desirable: compared to the unqualified mul it gives subsequent optimizations more freedom.

When we talk informally about compiler optimizations, we often say that the compiler should replace code with (faster or smaller) code that is functionally equivalent. Actually I just did this above while showing that x+x+x and x*3 are equivalent. But in fact, a compiler optimization does not need to replace code with functionally equivalent code. Rather, the new code only needs to refine the old code. Refinement occurs many times during a typical compilation. For example this C code:

foo (bar(), baz());

can be refined to either this:

tmp1 = bar();
tmp2 = baz();
foo (tmp1, tmp2);

or this:

tmp2 = baz();
tmp1 = bar();
foo (tmp1, tmp2);

These refinements are not (in general) equivalent to each other, nor are they equivalent to the original non-deterministic C code. Analogously, the LLVM instruction mul %x, 3 refines mul nsw %x, 3 but not the other way around.

So how do we teach Z3 to do refinement proofs? I was a bit worried by some wording in the Z3 documentation that says “in Z3 all functions are total” and then I got stuck but Alexey Solovyev, a formal methods postdoc at Utah, helped me out with the following solution. The idea is to use Z3′s ML-style option types to turn functions with undefined behavior into total functions:


Opt = Datatype('Opt')
Opt.declare('some', ('value', BitVecSort(BITWIDTH)))

Opt = Opt.create()
none = Opt.none
some = Opt.some
value = Opt.value

(t, r) = Consts('t r', Opt)

s = Solver()

# print "sat" if f is refined by g, else print "unsat"
# equivalently, you can say "f can be implemented by calling g"
def check_refined(f, g):
    s.add(ForAll([t, r], Implies(f(t, r) != none, f(t, r) == g(t, r))))
    print s.check()

In other words, f() is refined by g() iff g() returns the same value as f() for every member of f()’s domain. For inputs not in the domain of f(), the behavior of g() is a don’t-care.

Here are some LLVM operations that let us reason about the example code above:

INT_MAX = BitVecVal((1 << BITWIDTH - 1) - 1, BITWIDTH)

def signed_add_overflow(y, z):
    ylong = SignExt(BITWIDTH, y)
    zlong = SignExt(BITWIDTH, z)
    rlong = ylong + zlong
    return Or(rlong > INT_MAX_AS_LONG, rlong < INT_MIN_AS_LONG)

def signed_mul_overflow(y, z):
    ylong = SignExt(BITWIDTH, y)
    zlong = SignExt(BITWIDTH, z)
    rlong = ylong * zlong
    return Or(rlong > INT_MAX_AS_LONG, rlong < INT_MIN_AS_LONG)

def add(t, r):
    return If(Or(t == none, r == none), 
              some(value(t) + value(r)))

def add_nsw(t, r):
    return If(Or(t == none, r == none), 
              If(signed_add_overflow(value(t), value(r)), 
                 some(value(t) + value(r))))

def mul(t, r):
    return If(Or(t == none, r == none), 
              some(value(t) * value(r)))

def mul_nsw(t, r):
    return If(Or(t == none, r == none), 
              If(signed_mul_overflow(value(t), value(r)), 
                 some(value(t) * value(r))))

Hopefully this code is pretty clear: add_nsw and its friend mul_nsw act like add and mul as long as there’s no overflow. If an overflow occurs, they map their arguments to none.

At this point I might as well mention a dirty little secret: formal methods code isn’t any easier to get right than regular code, and often it is considerably more difficult. The solution is testing. This is more than a little bit ironic since working around inadequacies of testing is the point of using formal methods in the first place. So here’s some test code. To make it easier to understand I’ve set BITWIDTH to 8 and also put the results of the print statements in comments.

def opt_str(o):
    p = simplify(o)
    if is_const(p):
        return "none"
        return simplify(value(p)).as_signed_long()
ONE = Opt.some(1)
TWO = Opt.some(2)
NEG_ONE = Opt.some(-1)
NEG_TWO = Opt.some(-2)
INT_MIN = Opt.some(INT_MIN)
INT_MAX = Opt.some(INT_MAX)

print opt_str(mul_nsw(ONE, ONE))           #    1
print opt_str(mul_nsw(INT_MAX, ONE))       #  127
print opt_str(mul_nsw(INT_MIN, ONE))       # -128
print opt_str(mul_nsw(INT_MAX, NEG_ONE))   # -127
print opt_str(mul_nsw(INT_MIN, NEG_ONE))   # none
print opt_str(mul_nsw(INT_MAX, TWO))       # none
print opt_str(mul_nsw(INT_MIN, TWO))       # none
print opt_str(mul_nsw(INT_MAX, NEG_TWO))   # none
print opt_str(mul_nsw(INT_MIN, NEG_TWO))   # none

print opt_str(add_nsw(ONE, ONE))           #    2
print opt_str(add_nsw(INT_MAX, ONE))       # none
print opt_str(add_nsw(INT_MIN, ONE))       # -127
print opt_str(add_nsw(INT_MAX, NEG_ONE))   #  126
print opt_str(add_nsw(INT_MIN, NEG_ONE))   # none

check_refined(add, add)                    # sat
check_refined(add, add_nsw)                # unsat

check_refined(add_nsw, add)                # sat
check_refined(add_nsw, add_nsw)            # sat

The formalization passes this quick smoke test. I’ve done some more testing and won’t bore you with it. None of this is conclusive of course.

Finally, we return to the example that motivated this post. Recall that LLVM’s instcombine pass translates two add nsw instructions into a single mul. Let’s make sure that was correct. Again, for convenience, I’ll put outputs in comments.

(a, b) = Consts('a b', Opt)

def slow(a, b):
    return add_nsw(a, add_nsw(a, a))

def fast1(a, b):
    return mul(a, Opt.some(3))

check_refined(slow, fast1)           # sat
check_refined(fast1, slow)           # unsat

As expected, LLVM’s transformation is correct: x+x+x where both + operators have undefined overflow can be refined by x*3 where the * operator has two’s complement wraparound. The converse is not true.

Next, let’s explore the idea that instcombine could emit a mul nsw instruction.

def fast2(a, b):
    return mul_nsw(a, Opt.some(3))

check_refined(slow, fast2)           # sat
check_refined(fast2, slow)           # sat

Not only does this optimization work, but the refinement property also holds for the converse, indicating that the two formulations are equivalent.

We can also ask Z3 to prove equivalence of these different formulations:

prove (slow(a, b) == fast1(a, b))    # counterexample [a = some(137)]
prove (slow(a, b) == fast2(a, b))    # proved

As expected, fast1 is not equivalent to the original code and additionally Z3 has given us a value for a that demonstrates this (for the 8-bit case). The second equivalence check passes.

What have we learned here? First, automated proofs about compiler optimizations in the presence of integer undefined behaviors are not too difficult. This is nice because manual reasoning about undefined behavior is error-prone. Second, there’s room for at least a little bit of improvement in LLVM’s instcombine pass (I’m sure the LLVM people know this). One idea that I quite like is using a solver to improve LLVM’s peephole optimizations by automatically figuring out when qualifiers like “nsw”, “nuw”, and “exact” can be used correctly.

Here’s the Python code all in one piece. You can run it locally after installing Z3 or else you can paste it into the Z3 website’s Python interface.

That Was Easy

I wanted to play around with the Python interface to Z3 and the classic SEND + MORE = MONEY puzzle seemed like a good way to get started. This turned out to be so easy that my code worked almost on the first try:

from z3 import *

# find the (distinct) integer value in 0..9 for each letter that makes
# this equation work:
#   SEND
# + MORE
# ------

S = Int('S')
E = Int('E')
N = Int('N')
D = Int('D')
M = Int('M')
O = Int('O')
R = Int('R')
Y = Int('Y')

s = Solver()

s.add (S >= 0, S < 10, 
       E >= 0, E < 10, 
       N >= 0, N < 10, 
       D >= 0, D < 10, 
       M >= 0, M < 10, 
       O >= 0, O < 10, 
       R >= 0, R < 10, 
       Y >= 0, Y < 10)

s.add ((D + 10*N + 100*E + 1000*S + 
        E + 10*R + 100*O + 1000*M) == 
       (Y + 10*E + 100*N + 1000*O + 10000*M))

s.add (S != E, S != N, S != D, S != M, S != O, S != R, S != Y, 
       E != N, E != D, E != M, E != O, E != R, E != Y, 
       N != D, N != M, N != O, N != R, N != Y, 
       D != M, D != O, D != R, D != Y, 
       M != O, M != R, M != Y, 
       O != R, O != Y, 
       R != Y)

print s.check()
m = s.model()
for d in m.decls():
    print "%s = %s" % (, m[d])


$ python ./
D = 7
S = 9
N = 6
O = 0
R = 8
E = 5
M = 1
Y = 2

Is there a non-quadratic way to express the uniqueness constraint? That would be nice. Anyhow, it’s really a pleasure when research software is this easy to use.

UPDATE: There is a better way! See comment #1 below. Thanks Taylor.

Reproducibility in Computer Systems Research

These results about reproducibility in CS have been the subject of lively discussion at Facebook and G+ lately. The question is, for 613 papers, can the associated software be located, compiled, and run? In contrast with something I often worry about — producing good software — the bar here is low, since even a system that runs might be without value and may not even support the results in the paper.

It is great that these researchers are doing this work. Probably everyone who has worked in applied CS research for any length of time has wanted to compare their results against previous research and been thwarted by crappy or unavailable software; certainly this has happened to me many times. The most entertaining part of the paper is Section 4.3, called “So, What Were Their Excuses? (Or, The Dog Ate My Program)”. The recommendations in Section 5 are also good, I think all CS grad students and faculty should read over this material. Anecdote 2 (at the very end of the paper) is also great.

On the other hand, it’s not too hard to quibble with the methods here. For example, one of my papers was placed in the irreproducible category with the reason of:

could not install: style, flex, GNU indent, LLVM/clang 3.2

It’s not clear why installing these dependencies was difficult, but whatever. In a Facebook thread a number of other people reported learning that they have been doing irreproducible research for similar reasons. The G+ thread also contains some strong opinions about, for example, this build failure.

Another thing we might take issue with is the fact that “reproducible research” usually means something entirely different than successfully compiling some code. Usually, we would say that research is reproducible if an independent team of researchers can repeat the experiment using information from the paper and end up with similar (or at least non-conflicting) results. For a CS paper, you would use the description in the paper to implement the system yourself, run the experiments, and then see if the results agree with those reported.

Although it’s not too hard to put some code on the web, producing code that will compile in 10 or 20 years is not an easy problem. A stable platform is needed. The most obvious one is x86, meaning that we should distribute VM images with our papers, which seems clunky and heavyweight but perhaps it is the right solution. Another reasonable choice is the Linux system call API, meaning that we should distribute statically linked executables or equivalent. Yet another choice (for research areas where this works) might be a particular version of Matlab or some similar language.

UPDATE: You can help watch the watchmen here.

Let’s Work on an LLVM Superoptimizer

A compiler optimization has two basic parts. First, it needs to recognize an optimizable situation in the IR (intermediate representation), such as adjacent increments of the same variable. For the optimization to be maximally effective, the situation recognizer should cast as wide a net as possible. For example, we might broaden the applicability of an increment-coalescer by noticing that it can also operate on increments in the same basic block that are separated by non-dependent instructions, and that it can also operate on decrements. The second part of an optimization is knowing what code to replace the optimizable code with, once it has been found. In the running example, “i++; i++” becomes “i += 2″. Although some optimizations are much more sophisticated than this one, the basic idea applies everywhere. Optimizing compilers achieve good results via sequential application of a large number of optimization passes, each improving the code in some minor fashion.

Let’s look at a more interesting example. Consider this C function:

int cannot_overflow (int a)
  if (a==INT_MAX) return 0;
  return a+1;

Now build it with Clang’s undefined behavior sanitizer:

clang -S -emit-llvm -O2 cannot_overflow.c -fsanitize=undefined

Here’s the output (at the IR level) that we want to see:

In other words, the code that we want to see implements the function but has no extra UBSan logic — overflow checking is not required because the function obviously cannot execute an undefined overflow regardless of the value of the argument. (Actually, due to elimination of the branch, the LLVM code can execute an undefined overflow. This is OK since an undefined overflow in LLVM only affects the result of the operation, in contrast with the unbounded consequences of UB at the C level.)

The code above is what we want. What we actually get — using Clang/LLVM as of today, and also Clang/LLVM 3.4 — is this:

To characterize LLVM’s missing optimization in terms of the two-part model from the top of this piece, the optimizable situation is “an add-with-overflow instruction can be shown to never signal an overflow” and the replacement code is “regular add instruction” (getting rid of the call to the overflow handler is a separate optimization whose situation is “code that can be shown to not execute” and whose replacement code is “nothing”).

Should we conclude that LLVM isn’t very bright at optimizing integer expressions? No, and in fact that kind of generalization is seldom useful when talking about compiler optimizations — the devil is in the details. For example, let’s try an open-coded version of the overflow check:

static int safe_int_add (int a, int b)
  assert (!(b > 0 && a > INT_MAX-b));
  assert (!(b < 0 && a < INT_MIN-b));
  return a+b;

int cannot_overflow_2 (int a)
  if (a==INT_MAX) return 0;
  return safe_int_add (a, 1);

In this case, LLVM produces the ideal output. On the other hand, if we modify this code in a trivial way, LLVM cannot optimize away the assertions:

int cannot_overflow_3 (int a)
  if (a==INT_MAX) return 0;
  return safe_int_add (1, a);

Switching gears slightly, let’s take a quick look at GCC 4.8.1. It is unable to remove integer overflow checks inserted by its -ftrapv logic. It is, on the other hand, able to fully optimize both cannot_overflow_2() and cannot_overflow_3().

What’s the point? The point is that making a smart compiler is a years-long process of teaching it optimizations. This teaching process is very fine-grained. For example, we just saw that LLVM can emit good code for cannot_overflow_2() but not for the nearly-identical cannot_overflow_3(). To fix this, one or more LLVM passes need to be modified, an error prone process requiring the attention of a smart, experienced, expensive, and probably overworked compiler engineer. A totally different set of changes will be necessary to optimize the original cannot_overflow(). Wouldn’t it be nice if there was a better way to make a compiler smarter, freeing up our engineer to go implement C++14 or something?


Instead of relying on a human, a superoptimizer uses search-based techniques to find short, nearly-optimal code sequences. Since the original 1987 paper there have been several more papers; my favorite one is the peephole superoptimizer from 2006. For a number of years I’ve thought that LLVM would benefit from a peephole superoptimizer that operates on the IR.

Lately I’ve been thinking it might be time to actually work on this project. The thing that pushed me over the edge was a series of conversations with Xi Wang, whose excellent Stack tool already knows how to turn LLVM into an SMT instance so that a solver can answer questions about the LLVM code. For the purposes of a superoptimizer, we need the solver to prove equivalence between some LLVM IR found in a program and a smaller or faster piece of IR that we’d like to replace it with. As you can see in this file, the semantic gap between LLVM IR and a modern bitvector-aware SMT solver is small.

Before getting into the details, we need to look at this excellent work on superoptimizing LLVM IR, which contains two ideas not seen (that I know of) in previous superoptimizers. First, it works on a DAG of instructions instead of a linear sequence. Second, it exploits knowledge of undefined behaviors. On the other hand, this work has a few limitations that need to be fixed up. First, it doesn’t always come up with correct transformations. This issue is relatively easy to address via SMT solver. Second, its purpose is to feed ideas to a human who then adds some patterns to a compiler pass such as the instruction combiner. While manual implementation is pragmatic in the short run, I am guessing it is tedious and error prone and unnecessary, especially as the size of the replaced pieces of LLVM code grows.

Although I am writing up this blog post myself, there’s plenty of stuff here that came out of conversations with Xi Wang and Duncan Sands. Also both of them gave me feedback on drafts of this piece. In fact Duncan gave me so much feedback that I had to go rewrite several sections.

LLVM Superoptimizer Design Sketch

We’ve already seen that a compiler optimization needs to recognize optimizable code and replace it with better but functionally equivalent code. A third requirement is that it does these things very rapidly. Let’s look at how this might work for LLVM.

To find candidates for optimization, we’ll start with a largish collection of optimized LLVM IR and harvest every DAG of eligible instructions up to some size limit. At first, only ALU instructions will be eligible for superoptimization but I would expect we can handle memory operations, local control flow, and vectors without too much trouble. Floating point code and loops are more challenging. Here’s the documentation for LLVM IR in case you have not looked at it lately.

To suppress DAGs that differ only in irrelevant details, we need a canonicalization pass. Then, for each canonical DAG, our goal is to find a cheaper DAG that is functionally equivalent. This is expensive; not only will we need to do it offline, but probably we’ll also want to spread the work across a cluster. The method is to exhaustively enumerate canonical DAGs of LLVM code up to some smaller maximum size than we used in the harvesting phase. Next we need to answer the question: which of the enumerated DAGs are functionally equivalent to DAGs that we harvested? The technique from the 2006 superoptimizer was to use a few test vectors to rapidly rule out a large fraction of inequivalent code sequences and then to use a solver to soundly rule out the rest. The same idea will work here.

So far we’ve burned a lot of CPU time finding DAGs of LLVM instructions that can be replaced by cheaper instructions. The next problem is to make these transformations available at compile time without slowing down the compiler too much. We’ll need to rapidly identify qualifying DAGs in a file being optimized, canonicalize them, and rewrite them. Some sort of hashing scheme sounds like the way to go. Duncan suggests that we can use the superoptimizer itself to generate a fast canonicalizer — this would be cool.

In summary, we have the superoptimizer itself, which is slow. Most people won’t want to run it, but large organizations with significant compute resources might want to do so. We also have an LLVM pass that optimizes code using pre-computed optimizations that are generated automatically and proved correct by an SMT solver.


Here are some ideas that are a bit more speculative.

Exploiting dataflow facts. One of the things I’m most excited to try out is optimizing each DAG in the context of a collection of dataflow facts such as “incoming variable x is a non-negative multiple of 4″ and “incoming pointers p and q are not aliases.” I see this as a solid workaround for one of the main shortcomings of a peephole superoptimizer, which is that it myopically considers only a small amount of code at a time. Dataflow facts will support some action at a distance. Although constraints on inputs to a DAG — “incoming variable z is odd” or whatever — seem the most straightforward, I believe we can also exploit anti-constraints on outgoing values, such as “outgoing variable w is a don’t-care unless it is in the range [0..9].” The latter would happen, for example, if w was about to be used as an index into an array containing 10 values. It goes without saying that this kind of evil exploitation of undefined behavior should be optional and only makes sense when used in combination with UBSan-based testing.

Something that will happen at optimization time is that we’ll have a DAG in the code being optimized whose shape matches a DAG from the optimization database but the associated dataflow information doesn’t quite match up. I believe this is easy to deal with: we can perform the substitution if the dataflow information in the code being optimized is stronger than the dataflow information in the optimization database. For example, if we have a DAG where incoming value x>10, we can apply an optimization derived for the weaker condition x>0 but not the stronger condition x>20.

The problem with exploiting dataflow facts is that it makes every part of the superoptimizer more complicated. We’ll need to carefully consider what kind of dataflow information can actually give the superoptimizer good leverage. I’ve been reading passes like value tracking which seem to provide pretty much just what we want.

Avoiding the DAG enumeration step. Enumerating and testing candidate DAGs is inelegant. Wouldn’t it be nice to avoid this? The Denali and Denali-2 superoptimizers were based on the idea that a SAT solver could be used to directly synthesize an optimal instruction sequence matching a specification. The papers did not provide a lot of evidence that this technique worked, but what if Denali was simply ahead of its time? Improvements in SAT/SMT solvers and in processing power over the last decade may have made this technique, or a closely related one, into a realistic proposition.

Another reason to prefer synthesis vs. enumeration is that an enumeration-based approach can’t do much with constants: there are too many choices. This means that our superoptimizer is not going to discover awesome tricks like the fast inverse square root (which is an approximation, so it would fail the equivalence check, but you get the idea).

Better cost functions. Estimating the execution cost of a DAG of LLVM instructions is not totally straightforward since — despite the name — the IR is pretty high-level (it’s not like estimating the execution cost of actual instructions is easy either, on a modern processor). We can imagine estimating costs in simple, hacky ways such as counting instructions or measuring the longest path through the DAG. On the other hand, it may be worthwhile to do something more accurate such as compiling some DAGs to machine code, measuring their execution costs on some test vectors, and then using the resulting data to fit a cost model. Pluggable cost functions will make it easy to optimize for code size instead of speed — something compilers today are not particularly good at.

Duncan guesses that “number of instructions” is the only cost function that we’ll need, but I’m not so sure. Do we really want the superoptimizer to replace a shift and add with a multiply?

Optimizing entire loops. Although any superoptimizer can improve code inside of loops, I don’t know of any superoptimizer that targets code containing loops. There seem to be two obstacles. First, the search space may become problematically large. Second, the straightforward approach to encoding a loop in SMT — to unroll it a few times — is unsound. Can I use an SMT solver to prove equivalence of two loops with unknown numbers of iterations? I feel like I’ve read papers that discuss how to do this but I couldn’t find them just now — references would be appreciated.

Generalizing optimizations. Duncan’s superoptimizer worked in absence of LLVM type information, so it would derive a single rule permitting us to replace, for example, y-(x+y) with -x regardless of the (common) width of x and y. The superoptimizer that I have outlined will run into a bit of a rule explosion because it will derive a separate optimization for each width. Although the LLVM code that I have looked at is dominated by values of the obvious widths (1, 8, 16, 32, 64, 128), the rule explosion may become a serious problem when we try to take dataflow information into account. A fun piece of research would be make a superoptimizer that generalizes optimizations on its own. This might be done by attempting to generalize each optimization as it is derived or perhaps it should be done after the fact by running a generalization pass over the database of derived optimizations, looking for a collection of transformations that can be subsumed by a single more general rule. In summary, the performance and efficacy of our optimizer will be best if we generate broadly-applicable optimizations whenever possible and narrowly-applicable optimizations whenever necessary.

Integration with LLVM backends. An IR-only superoptimizer is always going to miss optimizations that exploit hardware features that are invisible at the LLVM level. For example, since LLVM has no rotate instruction, the kinds of optimizations discussed here are out of reach. This is an acceptable tradeoff, but on the other hand exploiting the detailed semantics of quirky instructions is an area where we would expect a superoptimizer to excel. Can our LLVM->LLVM superoptimizer be extended so that it helps with backend tasks in addition to optimizing IR, or is that the job of a completely separate superoptimizer? I do not know yet, but I do know that a purely post-pass superoptimizer (the 2006 peephole superoptimizer is an example of such a tool) is problematic because it is going to be difficult to make it aware of the concurrency model. In other words, it’s going to try to optimize away memory references that are used to communicate with other threads, with hardware devices, etc.


What outcomes might we expect if this work is successful?

Novel optimizations on real code. You can tell I’m not a real compiler person because I don’t care about making SPEC faster. That dusty pile of crap runs as fast as it’s ever going to run, as far as I’m concerned. On the other hand, there is almost certainly interesting idiomatic code coming out of Emscripten and GHC that LLVM can’t yet optimize fully. Similarly, as I showed at the top of this piece, when we turn on integer overflow checking in Clang, the generated code has plenty of room for improvement. The holy grail of superoptimization is automated discovery and application of tricks like the ones we read in Hacker’s Delight.

Undefined programs implode. Not too long ago I wrote a modest proposal about aggressive exploitation of undefined behavior. Just to be clear, I was not being entirely serious. Well, guess what? An UB-aware superoptimizer is going to smash undefined programs in more or less the way that I described. Obviously I have mixed feelings about this, but my belief is that as long as every undefined behavior that is exploited by the superoptimizer has a corresponding UBSan check, this might be a reasonable tradeoff.

Existing compiler passes are subsumed. Ideally, we point a superoptimizer at unoptimized LLVM code and once a fixpoint is reached, the code is at least as good as what the LLVM optimizers would have produced. I don’t expect this to happen, but on the other hand the 2006 superoptimizer was able to take a small function in unoptimized x86 and turn it into roughly the equivalent of gcc -O2 output. The intriguing possibility here is that compiler internals, which have been growing more complex for decades, could become simpler by relying on a superoptimizer for some classes of transformations. I think we can hope to replace parts of the instruction combiner, which was one of the most fruitful sources of wrong-code bugs that we found after pointing Csmith at LLVM.

We pave the way for more ambitious tools. A while ago I wrote that CompCert needs a superoptimizer. This is still true. I’m not going to work on it, but I would hope that techniques and lessons gained from one superoptimizer will help us create other ones.


Superoptimizer results have already percolated into LLVM and GCC by suggesting missed optimizations. I think it’s time for this technology to play a more direct role. A realistic superoptimizer, once set in motion, will naturally absorb improvements in SMT solving and in compute power. For example, if we can harvest IR DAGs up to size 10 this year, it might be 12 next year. If this idea works out, it leaves compiler developers with more time to focus on other tasks.

Updates from March 4:

  • The title of this piece says “let’s work” but the subsequent paragraph explaining this got lost during some revision. What I mean is that I think it would be a good idea to work on this project in an open source fashion. My group has already had very good experiences with Csmith and C-Reduce as open source tools. One part of the equation that complicates things is that as a researcher, I will need to publish papers based on this work. The right solution, I believe, is to give everyone who contributed to the work being described in a paper the opportunity to be a co-author.

  • At ASPLOS yesterday I had a good talk with Santosh Nagarakatte; his Verified LLVM work is related to this post, it would be cool to make these projects work together somehow.

  • I also talked to Vikram Adve; his intuition is that a superoptimizer has more to contribute to the backend than the middle end. This may end up being the case, but let’s review the rationale for a middle-end superoptimizer. First, it is wrong to generate dead code in the backend, and the superoptimizer will be doing that sometimes — as well as generating other kinds of code that wants to be further optimized by other passes. Second, by operating only on the IR, we gain all sorts of simplicity that will make it easier to pursue more ambitious projects such as exploiting dataflow information, synthesizing optimal IR, etc. As I said above, a backend superoptimizer is definitely also desirable.

And thanks for all the good comments!