Proposal for a CompCert Superoptimizer

CompCert is a C compiler that is provably correct. It is best characterized as lightly optimizing: it performs a number of standard optimizations but its code improvements are not aggressive when compared to those performed by GCC, Clang, or any number of commercial tools. This piece is about what I believe would be a relatively interesting and low-effort way to make CompCert more aggressive.

Consider these functions:

int foo (int x, int y) {
  return (x==y) || (x>y);
}
int bar (int x) {
  return (x>3) && (x>10);
}

Clearly foo() can be simplified to return “x≥y” and bar() can be simplified to return “x>10.” A random version of GCC emits:

_foo:
        xorl    %eax, %eax
        cmpl    %esi, %edi
        setge   %al
        ret

_bar:
        xorl    %eax, %eax
        cmpl    $10, %edi
        setg    %al
        ret

Any aggressive compiler will produce similar output. In contrast, CompCert 1.8.1 does not perform these optimizations. Its output is bulky enough that I won’t give it here. Collectively, the class of missing optimizations that I’m talking about is called peephole optimizations: transformations that operate on a very limited scope and generally remove a redundancy. Opportunities for peephole optimizations can be found even in cleanly written code since inlined functions and results of macro expansions commonly contain redundancies.

An aggressive compiler contains literally hundreds of peephole optimizations, and reproducing most of them in CompCert would be time consuming, not to mention unspeakably boring. Fortunately, there’s a better way: most of these optimizations can be automatically derived. The basic idea is from Henry Massalin who developed a superoptimizer in 1987; it was significantly improved about 20 years later by Bansal and Aiken. This piece is about how to create a superoptimizer that proves its transformations are sound.

The idea is simple: at a suitable intermediate compilation stage — preferably after the regular optimization passes and before any kind of backend transformation — find all subgraphs of the program dependency graph up to a configurable size. For each subgraph G, enumerate all possible graphs of the CompCert IR up to some (probably smaller) configurable size. For each such graph H, if G and H are equivalent and if H is smaller than G, then replace G with H. Subgraph equivalence can be checked by encoding the problem as an SMT instance and sending the query to some existing solver. The proof of equivalence needed to make CompCert work comes “for free” because there exist SMT solvers that emit proof witnesses. (SMTCoq is an example of a tool that generates Coq proofs from SMT output.) Repeat until a fixpoint is reached — the program being compiled contains no subgraphs that can be replaced.

As an example, the IR for foo() above would contain this code:

t1 = x==y;
t2 = x>y;
t3 = t1 || t2;

When attempting to optimize this subgraph, the superoptimizer would eventually test for equivalence with:

t3 = x>=y;

Since t1 and t2 are not subsequently used, a match would be found and the peephole optimization would fire, resulting in smaller and faster code.

Of course, the superoptimizer that I have sketched is extremely slow. The Bansal and Aiken paper shows how to make the technique fast enough to be practical. All of their tricks should apply here. Very briefly, the speedups include:

  1. Testing many harvested sequences at once
  2. Reducing the search space using aggressive canonicalization
  3. Avoiding most SMT calls by first running some simple equivalence tests
  4. Remembering successful transformations in a database that supports fast lookup

The Bansal and Aiken superoptimizer operated on sequences of x86 instructions. Although this had the advantage of permitting the tool to take advantage of x86-specific tricks, it also had a couple of serious disadvantages. First, a short linear sequence of x86 instructions harvested from an executable does not necessarily encode an interesting unit of computation. In contrast, if we harvest subgraphs from the PDG, we are guaranteed to get code that is semantically connected. Second, the Stanford superoptimizer has no ability to see “volatile” memory references that must not be touched — it will consequently break codes that use multiple threads, UNIX signals, hardware interrupts, or hardware device registers.

The technique outlined in this piece is what I call a weak superoptimizer: it finds short equivalent code sequences using brute force enumeration. A strong superoptimizer, on the other hand, would pose the following queries for each harvested subgraph G:

  1. Does there exist a subgraph of size 0 that is equivalent to G? If not…
  2. Does there exist a subgraph of size 1 that is equivalent to G? If not…
  3. Etc.

Clearly this leans much more heavily on the solver. It is the approach used in the Denali superoptimizer. Unfortunately, no convincing results about the workability of that approach were ever published (as far as I know), whereas the weak approach appears to be eminently practical.

In summary, this post contains what I think are two relatively exploitable ideas. First, a peephole superoptimizer should be based on subgraphs of the PDG rather than linear sequences of instructions. Second, proof-producing SMT should provide a relatively easy path to verified peephole superoptimization. If successful, the result should be a significant improvement in the quality of CompCert’s generated code.

Update from 3/29: A random thing I forgot to include in this post is that it would be easy and useful to teach this kind of superoptimizer to take advantage of (nearly) arbitrary dataflow facts. For example, “x /= 16” (where x is signed) is not equivalent to x >>= 4. However, this equivalence does hold if x can be proved to be non-negative using a standard interval analysis. Encoding a fact like “x ≥ 0” in the input to the SMT solver is trivial. The nice thing about dataflow facts is that they give the superoptimizer non-local information.

I should also add that when I said “if H is smaller than G, then replace G with H” of course I really mean “cheaper” rather than “smaller” where the cost of a subgraph is determined using some heuristic or machine model.

Even after adding a nice IR-level superoptimizer to CompCert, a backend superoptimizer just like Bansal and Aiken’s is still desirable: it can find target-dependent tricks and it can also get rid of low-level grunge such as the unnecessary register-register moves that every compiler’s output seems to contain. The main fix to their work that is needed is a way to prevent it from removing or reordering volatile memory references; this should be straightforward.

Peer Review Poker

Peer review is a bureaucratic, consensus-based approach to making decisions. Thus, it is not inherently entertaining and authors like myself need to amuse ourselves as best we can. One of the games I like to play is peer review poker, where certain combinations of review scores are more desirable than others.

Straight: Review scores form a continuous sequence. A straight, particularly if it is anchored at the high end of the spectrum, is extremely desirable because provoking a diversity of reactions means that there is something interesting about the work. The best kind of straight — which I have accomplished only one time, that I can remember — is when each possible score is received exactly once.

Flush: All reviews are the same score. This is not particularly desirable as it indicates a predictable, uncontroversial paper. Of course, if all scores are “strong accept” then predictability doesn’t seem so bad — but this hardly ever happens. The other extreme, all “strong reject” scores, is certainly achievable.

Obviously, two of a kind, etc. are very common peer review outcomes and are not particularly interesting. By taking reviewer expertise into account it should be possible to define a royal flush but probably that’s pushing the analogy too far.

Finding and Understanding Bugs in C Compilers

Today we finished preparing the camera-ready version of our paper that will appear in PLDI 2011. I’m pretty happy with it. Here’s the abstract:

Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undefined and unspecified behaviors that would destroy its ability to automatically find wrong-code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.

Actually, as of today the total is 344 bugs reported but at some point we had to freeze the statistics in the paper.

Race Condition vs. Data Race

A race condition is a flaw that occurs when the timing or ordering of events affects a program’s correctness. Generally speaking, some kind of external timing or ordering non-determinism is needed to produce a race condition; typical examples are context switches, OS signals, memory operations on a multiprocessor, and hardware interrupts.

A data race happens when there are two memory accesses in a program where both:

  • target the same location
  • are performed concurrently by two threads
  • are not reads
  • are not synchronization operations

There are other definitions but this one is fine; it’s from Sebastian Burckhardt at Microsoft Research. Two aspects of the definition require a bit of care. “Concurrently” means there wasn’t anything like a lock that forced one operation to happen before or after the other. “Are not synchronization operations” refers to the fact that a program probably contains special memory operations, such as those used to implement locks, that are not themselves synchronized. By definition, these do not trigger data races.

In practice there is considerable overlap: many race conditions are due to data races, and many data races lead to race conditions. On the other hand, we can have race conditions without data races and data races without race conditions. Let’s start with a simple function for moving money between two bank accounts:

transfer1 (amount, account_from, account_to) {
  if (account_from.balance < amount) return NOPE;
  account_to.balance += amount;
  account_from.balance -= amount;
  return YEP;
}

Of course this is not how banks really move money, but the example is useful anyway because we understand intuitively that account balances should be non-negative and that a transfer must not create or lose money. When called from multiple threads without external synchronization, this function admits both data races (multiple threads can concurrently try to update an account balance) and race conditions (in a parallel context it will create or lose money). We can try to fix it like this:

transfer2 (amount, account_from, account_to) {
  atomic {
    bal = account_from.balance;
  }
  if (bal < amount) return NOPE;
  atomic {
    account_to.balance += amount;
  }
  atomic {
    account_from.balance -= amount;
  }
  return YEP;
}

Here “atomic” is implemented by the language runtime, perhaps simply by acquiring a thread mutex at the start of the atomic block and releasing it at the end, perhaps using some sort of transaction, or perhaps by disabling interrupts — for purposes of the example it doesn’t matter as long as the code inside the block executes atomically.

transfer2 has no data races when called by multiple threads, but obviously it is an extremely silly function containing race conditions that will cause it to create or lose money almost as badly as the unsynchronized function. From a technical point of view, the problem with transfer2 is that it permits other threads to see memory states where a key invariant — conservation of money — is broken.

To preserve the invariant, we have to use a better locking strategy. As long as atomic’s semanatics are to end the atomic section on any exit from the block, the solution can be blunt:

transfer3 (amount, account_from, account_to) {
  atomic {
    if (account_from.balance < amount) return NOPE;
    account_to.balance += amount;
    account_from.balance -= amount;
    return YEP;
  }
}

This function is free of data races and also of race conditions. Can we change it a bit to make an example that has data races but no race conditions? That is simple:

transfer4 (amount, account_from, account_to) {
  account_from.activity = true;
  account_to.activity = true;
  atomic {
    if (account_from.balance < amount) return NOPE;
    account_to.balance += amount;
    account_from.balance -= amount;
    return YEP;
  }
}

Here we are setting flags indicating that some sort of activity occurred on the accounts. Are the data races on these flags harmful? Perhaps not. For example, in the evening we might shut down all transaction-processing threads and then select 10 random accounts that are flagged as having had activity for manual auditing. For this purpose, the data races are entirely harmless.

We’ve ended up covering all possibilities:

data race no data race
race condition transfer1 transfer2
no race condition transfer4 transfer3

The point of this exercise was to look at transfer2 and transfer4, which illustrate that freedom from data races is a very weak property that is neither necessary nor sufficient for establishing the concurrency correctness of a computer program. Why does anyone care about data races at all? As far as I can tell, there are two reasons. First, a data-race-free program can, with some care, be shown to be independent of the whims of whatever weak memory model it sits on. This lets us breathe easier because reasoning about racy programs running on weak memory models is near impossible for humans. Second, data races are easy to find automatically — notice from the definition that it simply requires memory accesses to be monitored; no knowledge of application semantics is required. Race conditions, on the other hand, are intimately tied to application-level invariants, making them far thornier to reason about (quick — what are all the invariants in Firefox? In Linux? In MySQL?).

Everything in this post is pretty obvious, but I’ve observed real confusion about the distinction between data race and race condition by people who should know better (for example because they are doing research on concurrency correctness). Muddying the waters further, even when people are perfectly clear on the underlying concepts, they sometimes say “race condition” when the really mean “data race.” Certainly I’ve caught myself doing this.

Guidelines for Teaching Assistants

I’ve been teaching university-level courses for the last nine years, usually with the support of teaching assistants (TAs): students who get paid to do things like grading, office hours, fielding email questions, making and debugging assignments, proctoring exams, and perhaps even giving a lecture when I’m sick or traveling.

At the start of each semester I sit down with my new TAs and go over some guidelines. Unfortunately, this is a chaotic time and I’m pretty sure I’ve never managed to give the same advice twice. This piece is an attempt to put all of my guidelines in one place. I’m certainly leaving things out — please fill in details in comments. This is all somewhat specific to computer science classes with programming assignments.

The Golden Rule of the CS TA: Never Type at a Student’s Keyboard

When a student doesn’t get it, the temptation to just sit down and take over can be overwhelming. But you have to resist because as long as the student is driving, the student is thinking. When you take over, their brain switches into passive mode.

Progressive Hints

When a student is struggling on an assignment, first give oblique hints: “Did you read Chpater 7?” “Did you search for that on Wikipedia?” If this fails to help, be more concrete: “Are you checking all return codes?” “Did you run it under Valgrind?” Finally, and only when you are convinced the student has made a genuine effort to solve the problem, give a more direct answer, but be sure that it only helps the student past the current roadblock: “Your foo() function has to return 0 on success, not 1.” “You’re dereferencing a NULL pointer.” With time, you will learn which students can run with a quick hint and which ones probably need stronger advice.

My model for progressive hints is based on the Invisiclues packets for old Infocom games — so awesome.

Amend, But Communicate

Often, when answering a student’s question, it becomes clear that some part of the assignment or lab was underspecified and it should have been more clear. In cases like this, your answer to the student is effectively a supplemental part of the assignment. This is fine, but when this happens it is necessary to let the instructor and/or the rest of the class know the additional information.

When Things Aren’t Working, Tell Me

Sometimes there are infrastructure problems: a bunch of machines in the lab are down, or some critical piece of software doesn’t work correctly. Other times, there is a systematic problem in an assignment or lab: a crucial ambiguity or error. When you notice anything like this happening, let me know — I can’t fix problems that I’m unaware of.

Treat Students With Respect

I’ve only seen one problem out of probably 30 TAs, and it wasn’t that serious. But still this is worth saying.

Don’t Play the 10,000 Questions Game

In a big class there are always one or two students who, instead of reading the assignment, just email the instructor or TAs. If you answer, they immediately have another question. This goes on forever — I do not exaggerate, they apparently never eat or sleep. The solution is to recognize these cycles and delay the next response for 12 hours or so. Sooner or later the student realizes that it’ll be more efficient to just read the assignment.

Do the Assignment First

This sucks, but you have to do each assignment, or at least the ones for which you are the principal TA, before it is handed out. Otherwise, it is effectively impossible to help students get past roadblocks they encounter. This mainly applies to courses like operating systems where the assignments may be really hard. For freshmen and sophomore-level programming courses, any conceivable problem can by eyeballed in a few seconds.

Don’t Miss Office Hours

At least give advance warning if it’ll be impossible to show up so students don’t waste time coming to the lab.

Bounce Problem Cases Up To The Instructor

You’re not being paid enough to deal with difficult students. If anyone is pushy about a grade, rude about any issue, or anything else, simply refer the student to me. This almost never happens.

Don’t Screw Up the File

There is one grade spreadsheet, and it has a specific file name. Do not put grades in any other file, do not create duplicate files, do not put unlabeled columns in the spreadsheet. Label each new column carefully. Every time you finish working with the file, make sure the permissions are such that the other TAs and I can read and write it, but nobody else can.

Anonymize Grades Carefully

When posting grades, delete names, student ID numbers, and email addresses before exporting to HTML.

Never Deal With Cheating

This is my job. If you suspect cheating, talk to me; but never bring it up with a student.

Be Timely

Unless I say otherwise, labs should be graded, and grades entered, within a week of being handed in.

Do Not Flake Out

TAs are themselves students and are subject to getting overwhelmed like anyone else. If this seems to be happening, come talk to me. If you just disappear in the middle of the semester and stop doing your job, I will seriously make sure you never work in this town again.

The Simplest Queue?

My student Jianjun is proving things about ARM executables that handle interrupts. It’s very difficult work, so when I asked him to write up a “simple” case study where an interrupt and the main context communicate through a ring buffer, I thought it would be helpful if I handed him the simplest possible queue that is at all realistic. Since queues are already pretty simple, there seemed to be only two tricks worth playing: force queue size to be a power of two, and waste one slot to simplify full-queue detection. Here’s my code:

#include "q.h"

#if ((QSIZE)<2)||(!((((QSIZE)|((QSIZE)-1))+1)/2==(QSIZE)))
#error QSIZE must be >1 and also a power of 2
#endif

#define MASK ((QSIZE)-1)

static int q[QSIZE];
static int head, tail;

static int
inc (int x)
{
  return (x + 1) & MASK;
}

int
full (void)
{
  return inc (head) == tail;
}

int
mt (void)
{
  return head == tail;
}

int
enq (int item)
{
  if (full ())
    return 0;
  q[head] = item;
  head = inc (head);
  return 1;
}

int
deq (int *loc)
{
  if (mt ())
    return 0;
  *loc = q[tail];
  tail = inc (tail);
  return 1;
}

int
qents (void)
{
  int s = head - tail;
  if (s < 0)
    s += (QSIZE);
  return s;
}

The header file simply defines QSIZE and gives prototypes for the functions. Did I miss any good tricks? The code coming out of compilers I have sitting around isn’t quite as clean as I’d have hoped. There’s some redundancy in the code (calling inc() twice in the enqueue and dequeue functions) but eliminating it should be a simple matter for the compiler.

Negative Correlation Achieved

Recently I reviewed 19 papers that were submitted to CAV 2011. This is the first time I’ve been involved with a pure verification conference, and consequently I greatly enjoyed reading the papers because almost every one contained something new. Each time I submitted a review I looked at the ones that were already submitted for that paper, and kept being surprised at how often I disagreed with the other reviewers. Finally I just computed the correlation between my score and the average of the other reviewers’ scores, and the result was an astonishingly low -0.07.

In contrast, when I review papers at a “systems” venue (doesn’t matter if it’s operating systems, embedded systems, or something else) it’s not at all uncommon for me to give exactly the same score as all, or almost all, of the other reviewers. My guess is that at the last five systems conferences I was involved with, the correlation between my score and the average of the other reviewers’ scores was higher than 0.8.

I’m not sure there’s a take-away message here other than “communities have very different evaluation standards for papers.” However, this does shed a bit of light on why it can be quite difficult to switch areas. Closely related: Doug Comer’s excellent piece on how to insult a computer scientist.