Volatile Bugs, Three Years Later

Almost exactly three years ago Eric Eide and I submitted a paper, “Volatiles Are Miscompiled, and What to Do about It,” to the 8th International Conference on Embedded Software (EMSOFT 2008). The points made in this paper were that:

  • C compilers fail to reliably translate accesses to volatile-qualified objects
  • we can automatically detect these failures
  • we can (at least in some cases) automatically work around these failures

Gratifyingly, this paper found an audience among embedded software practitioners and in 2010 it was one of the most highly-downloaded PDFs on the Utah CS web server. Not gratifyingly, it isn’t clear that compilers are, in general, much better at volatile than they were three years ago (we haven’t quantified this, it’s just my general feeling). The root of the problem is that there is a tension between volatile and optimizations: it’s hard to make fast code that is still volatile-correct.

The motivation for writing this paper originated in a lecture for my Advanced Embedded Systems course that I gave in 2006 or 2007. I was presenting some fragments of C code along with their translations into assembly, in order to illustrate the effect of the volatile qualifier, when a student raised his hand and said that one of the translations was incorrect. I assumed that I had made a cut-and-paste error and moved on with the lecture. However, when I checked later, it turned out there was no cut-and-paste error: the compiler had been wrong (this was CodeWarrior for ColdFire, I believe). This was surprising, so I kept playing around and the situation got worse every time I tried a new compiler or wrote a new code fragment. Eventually it became clear that systematic wrongness existed and I needed to write something up, though I had no idea how to turn this into any kind of respectable academic paper. Eric saved the day by taking Randprog, an existing random C program generator, and extending it to generate code using volatile. Also, he hacked Valgrind to count accesses to volatiles, giving us the automatic detector. Finally, my student Nathan hacked up a CIL pass for turning volatile accesses into calls to helper functions (I don’t recall why we didn’t include Nathan as an author on the paper — we probably should have). At this point we had a paper. I like this story because it illustrates the way systems research often works in practice. It does not proceed in a nice chain from hypothesis to solution to evaluation. Rather, it begins with a niggling suspicion that something is wrong, and proceeds in little fits and starts, across plenty of dead ends, until finally it becomes clear that a useful result has emerged.

By far the most interesting development in volatile-land during the last three years is CompCert, which has a provably correct implementation of volatile. This is, as I’ve said here before, and will no doubt keep saying, a very impressive result.

The volatile qualifier is a legitimate solution to a real problem in early C compilers: they would optimize away critical accesses to hardware device registers. Prior to the introduction of volatile, extremely dodgy hacks were used to avoid miscompilation. However, in retrospect, I believe volatile has proved more trouble than it’s worth, and that C/C++ would be better off without it. The alternative is to use an explicit function call to access variables that live in special kinds of memory; these calls need not have high overhead since they can be inlined. The argument for explicit accesses comes not just from the compiler side, but also from the user side. Linus Torvalds has ranted about this, and he’s right.
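
To make this concrete, here is a minimal sketch of what explicit accessor functions might look like; the names are mine, not from any standard or from the paper. In the workaround described in the paper, the helpers are defined in a separately compiled file so the optimizer cannot interfere with the accesses; a compiler that understood such accessors natively could instead inline them late, after optimization, which is why the overhead need not be significant.

/* vol.h: declarations only; in the workaround the definitions live in
   a separately compiled file so the optimizer cannot see through the calls */
int  vol_read_int  (volatile int *p);
void vol_write_int (volatile int *p, int val);

/* vol.c */
int  vol_read_int  (volatile int *p)          { return *p; }
void vol_write_int (volatile int *p, int val) { *p = val; }

/* before: a device flag polled directly through a volatile object */
volatile int busy_flag;

void wait_for_device (void)
{
  while (busy_flag)
    ;
}

/* after: the same loop with the access made explicit */
void wait_for_device_explicit (void)
{
  while (vol_read_int (&busy_flag))
    ;
}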

I suspect that Eric and I need to write at least one more volatile paper during the next year or two. Some things that have changed:

  • CompCert supports volatile, so it is available as a basis for comparison
  • GCC and LLVM are less prone to non-volatile miscompilations than they used to be, making it much easier for us to assess the reliability of automatically turning volatile accesses into calls to helper functions
  • my student Yang Chen has created a new Pin-based volatile bug detector that works better than the Valgrind-based one
  • the Pin tool supports testing whether the generated code correctly copes with the case where the volatile location returns a “fresh” value each time it is read — we never tested this before (a small example of this case appears after this list)
  • Csmith + the Pin tool support testing whether accesses to volatile locations are illegally reordered, which we never tested before
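
To make the “fresh value” case concrete, here is a minimal example (the register name is made up). A correct compiler must emit two loads here; collapsing them into one is exactly the kind of bug this test targets.

/* DATA_REG stands in for a memory-mapped device register */
volatile int DATA_REG;

int read_twice (void)
{
  /* each volatile read may return a different value, so the compiler
     must perform two loads; folding this into 2*DATA_REG would be a
     volatile bug */
  return DATA_REG + DATA_REG;
}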

It will be fun to see what kinds of bugs are turned up by the far more aggressive testing we can now do, compared to what we could do in 2008. However, the more interesting thing will be to implement the automatic volatile bug workaround in GCC and LLVM, by turning volatile accesses into calls to helper functions in the frontends, and turning them back into memory accesses in the backends. This should achieve near total volatile correctness and will also permit hundreds of silly special cases to be removed from these compilers’ optimizers. Ideally the compiler developers will adopt this approach, though I suspect this depends on the performance of the generated code (it should be decent, but won’t be as good as the current, broken approach).

Who Fuzzes the Fuzzer?

Although it’s fun to act like our tool Csmith is an infallible compiler-smashing device, this isn’t really true. Csmith is made of ~40,000 lines of C++, some of it quite complicated and difficult. Csmith probably contains about as many bugs per LOC as your average compiler. So how do we debug the bug-finding tool? The answer is simple: Csmith helps debug itself, and the compilers we test help debug it too. Specifically:

  • Xuejun put plenty of assertions into the Csmith code.
  • The code emitted by Csmith optionally contains a lot of assertions corresponding to dataflow facts; for example, an alias fact might turn into assert (!p || p==&x). Any time one of these checks fails, either Csmith or a compiler is wrong (a sketch of what such emitted code might look like appears after this list).
  • Every time Csmith finds a compiler bug, we try to manually verify that the (reduced) test case is in fact a valid one before submitting it. Sometimes the reduced test case ends up being invalid, at which point we’ve found a bug either in Csmith or in the reducer.
  • Just recently we’ve started using Csmith to test some static analysis tools such as Frama-C. As expected, we have been finding bugs in these tools. What we didn’t expect is how effective they would be in spotting tricky flaws in Csmith’s output. The explanation, as far as I can tell, is that compilers don’t look very deeply into the semantics of the compiled code, whereas these tools do.
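
Here is a hypothetical fragment in the spirit of the second item above (not actual Csmith output) showing how a dataflow fact becomes a runtime check:

#include <assert.h>

/* suppose that elsewhere in the generated program, g is only ever
   assigned 0 or &x, so the generator's pointer analysis knows that
   fact at this point */
int x;
int *g;

void step (void)
{
  /* the alias fact "g is null or points to x" is emitted as a runtime
     check; if it ever fails, either Csmith or the compiler is wrong */
  assert (!g || g == &x);
  if (g)
    *g = *g + 1;
}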

In the end, Csmith seems to have two advantages in its war on compiler bugs. First, it’s not nearly as big as an aggressive compiler. Second, there are a lot of compilers to test and each of them contains plenty of bugs; in contrast, each Csmith bug has to be fixed just once. Therefore, in practice, we find 10 or more compiler bugs for each Csmith bug.

One Day of Winter and Three Days of Spring in the Fins

Bill and I had already made two attempts to backpack into the Fins area, which is part of Ernie’s Country in the Maze District of Canyonlands NP. Spring 2010 was abnormally snowy and Fall 2010 featured torrential rains, both times making roads impassable. This time — Spring 2011 — the weather cooperated. Also, it was great to have John Veranth along; he has awesome hand-annotated maps of the area, and a second truck provided some margin against mechanical problems or stuck vehicles.

First Day

We left SLC around 6:15 AM and had 3.5 hours of driving before leaving pavement about 20 miles north of Hanksville. Incredibly, at this point we still had 3.5 hours of driving left. Big lobes of sand had blown across the road in places and by the time we got to the ranger station at Hans Flat, the wind was really going and (as the pictures make clear) there was a lot of dust in the air. After picking up our backcountry permit, we drove down the Flint Trail, an amazing piece of road that goes through one of the only breakdowns found anywhere in the Orange Cliffs. Happily, the Flint Trail was not only bone-dry, but also it had been bladed recently and was in excellent condition.

Once below the Orange Cliffs, we quickly got to the top of the Golden Stairs, a trail that drops through the next set of cliffs, giving access to the top level of the Maze District. The alternative to hiking this trail is a long, extremely rough drive around Teapot Rock, which neither John nor Bill was eager to do. At the bottom of the Golden Stairs we walked on the road for maybe a mile and then dropped into Range Canyon, which has two springs that were used by ranchers and improved by the CCC. We skipped the first one and camped near the second, getting there around dusk. Around midnight the wind started to howl, blasting the tents with sand and pebbles and trying to push them over.

[nggallery id=32]

Second Day

Despite the rough night, the winds were down in the morning and (atypically for the desert in March) it was quite warm. After breakfast we walked over to the actual fins, which are massive and amazing. We had lunch in an alcove at the top of the canyon containing the fins and then poked into some side drainages on the way back. Back at the tents the winds had picked up again and everything, including inside the tents, was covered by a nice layer of grit. I ate a lot of sand on this trip.

Before dinner John and I climbed a nasty little 4th-class ramp to the level of slickrock above the canyon bottom near our camp. This gave access to a beautiful expanse of slickrock above the spring; clearly a lot of exploring would be possible there, but we failed to find an easy way out to the canyon rim.

[nggallery id=33]

Third Day

After breakfast we packed up to avoid exposing unoccupied tents to another windy day. We hiked up the canyon between the two springs in Ernie’s Country. It contains a nice surprise: a slot canyon. However, before we had finished exploring it, it started to rain, and rule #0 of slot canyons is that you are not in them when it rains, so we backed out in a hurry. The rain continued on and off all day, eventually accumulating in pools on the slickrock, but it never rained hard enough to be unpleasant. Whitmore Arch is up near the rim of this canyon, easily accessible by a slickrock ramp.

Near the head of Range Canyon there’s a great Anasazi granary and a very short slot canyon. After poking around this area for a while the weather started to look worse and we decided that if it snowed much overnight, we’d have trouble getting up the steep sandstone fin that provides access to this canyon. So, we exited and camped up on the rim. During dinner the drizzle turned into sleet so we turned in early and I read for a while before bed. Although my sleeping bag is rated to 15 degrees, it is not that warm so I slept in most of my layers. Fortunately — unlike my last two backpacking trips — I didn’t spend a miserable, sleepless, shivering-cold night.

[nggallery id=34]

Fourth Day

The weather system cleared out during the night and the morning was cold and spectacular — our first clear weather of the trip. The cool morning was welcome: no need to carry much water up the Golden Stairs. A couple of steep spots on the Flint Trail held snow when we got there, but not enough to cause trouble. After getting to the top we took a short detour on the road to the Big Ridge up to the point where there’s a campsite at a narrow neck between two massive canyon systems: Happy on one side and Cataract on the other. From there we turned around and got back to SLC by 8pm.

[nggallery id=35]

As a backpacking destination, Ernie’s Country is great. Pros:

  • Diverse and excellent scenery, even by Southern Utah standards
  • Springs make water a non-issue
  • Easy walking, heavily cairned routes, and constrained area would make it hard to get seriously lost

Cons:

  • Getting a broken vehicle towed from anywhere under the ledge would probably cost thousands of dollars
  • Some red tape associated with being inside a national park
  • A fairly crowded area relative to its remoteness

Software Bugs and Scientific Progress

When a bug is found in a piece of software, the root cause is often a bug in someone’s thoughts. One way to better understand a bug is to look at how deep the underlying thought error was. In other words: How many assumptions must be revisited as a result of the bug?

Level 1 — Syntax error, such as using the wrong operator or putting a function’s arguments in the wrong order. These are totally unsurprising.

Level 2 — Implementation error, such as wrongly assuming a pointer is non-null. These require developers to revisit whatever implementation-level assumption was made that led to an invariant violation. Implementation errors can also occur at the level of algorithm design; textbooks can contain level 2 errors. Incorrect or sloppy theorems may have to be revised in response to level 2 errors.
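
Two tiny made-up C fragments, not taken from any real code base, illustrate the first two levels:

/* Level 1: wrong operator; the author meant bitwise OR */
unsigned set_flag (unsigned reg, unsigned flag)
{
  return reg || flag;   /* should be reg | flag */
}

/* Level 2: the code silently assumes its argument is non-null */
int string_length (const char *s)
{
  int n = 0;
  while (*s++)          /* crashes if s is NULL: the assumption
                           "callers never pass NULL" was wrong */
    n++;
  return n;
}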

Level 3 — Design error: the system is structured in such a way that it cannot meet its specification. A web server that runs untrusted code in its address space contains a level 3 bug.

Level 4 — Specification error: the software is solving the wrong problem. Assumptions about the purpose of the system must be revisited.

Level 5 — Framework error: the intellectual scaffolding upon which the software was based contains an error or omission. My favorite example comes from the early days of the development of optimizing compilers when people first confronted difficult questions about whether a particular optimization was legal or not. An entire sub-area of computer science had to be created to answer these questions. Something analogous is happening right now where we lack a framework for creating reliable and secure large-scale, software-intensive systems.

Level 6 — Universe error: a previously-unknown physical law prevents the software from working. If good computers had existed in the 19th century, the first person who tried to implement Maxwell’s demon and the first person who tried to implement a universal discriminator for halting vs. non-halting programs would have encountered errors at level 6. Future level 6 errors may result from efforts to build artificial intelligences.

Something interesting about levels 5 and 6 is that they look more like progress than like errors. This is no coincidence: scientific progress occurs when bugs in our collective thinking are fixed. Thus, we could easily construct a scale for scientific advances that mirrors the scale for software bugs. Einstein and Turing operated at level 6 at least a few times in their lives; many other great scientists work at level 5. A typical MS thesis is at level 2 and a dissertation should be no lower than level 3.

Proposal for a CompCert Superoptimizer

CompCert is a C compiler that is provably correct. It is best characterized as lightly optimizing: it performs a number of standard optimizations but its code improvements are not aggressive when compared to those performed by GCC, Clang, or any number of commercial tools. This piece is about what I believe would be a relatively interesting and low-effort way to make CompCert more aggressive.

Consider these functions:

int foo (int x, int y) {
  return (x==y) || (x>y);
}
int bar (int x) {
  return (x>3) && (x>10);
}

Clearly foo() can be simplified to return “x≥y” and bar() can be simplified to return “x>10.” A random version of GCC emits:

_foo:
        xorl    %eax, %eax
        cmpl    %esi, %edi
        setge   %al
        ret

_bar:
        xorl    %eax, %eax
        cmpl    $10, %edi
        setg    %al
        ret

Any aggressive compiler will produce similar output. In contrast, CompCert 1.8.1 does not perform these optimizations. Its output is bulky enough that I won’t give it here. Collectively, the class of missing optimizations that I’m talking about is called peephole optimizations: transformations that operate on a very limited scope and generally remove a redundancy. Opportunities for peephole optimizations can be found even in cleanly written code since inlined functions and results of macro expansions commonly contain redundancies.

An aggressive compiler contains literally hundreds of peephole optimizations, and reproducing most of them in CompCert would be time-consuming, not to mention unspeakably boring. Fortunately, there’s a better way: most of these optimizations can be automatically derived. The basic idea is from Henry Massalin, who developed a superoptimizer in 1987; it was significantly improved about 20 years later by Bansal and Aiken. This piece is about how to create a superoptimizer that proves its transformations are sound.

The idea is simple: at a suitable intermediate compilation stage — preferably after the regular optimization passes and before any kind of backend transformation — find all subgraphs of the program dependency graph up to a configurable size. For each subgraph G, enumerate all possible graphs of the CompCert IR up to some (probably smaller) configurable size. For each such graph H, if G and H are equivalent and if H is smaller than G, then replace G with H. Subgraph equivalence can be checked by encoding the problem as an SMT instance and sending the query to some existing solver. The proof of equivalence needed to make CompCert work comes “for free” because there exist SMT solvers that emit proof witnesses. (SMTCoq is an example of a tool that generates Coq proofs from SMT output.) Repeat until a fixpoint is reached — the program being compiled contains no subgraphs that can be replaced.

As an example, the IR for foo() above would contain this code:

t1 = x==y;
t2 = x>y;
t3 = t1 || t2;

When attempting to optimize this subgraph, the superoptimizer would eventually test for equivalence with:

t3 = x>=y;

Since t1 and t2 are not subsequently used, a match would be found and the peephole optimization would fire, resulting in smaller and faster code.
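
To make the search concrete, here is a small stand-alone sketch (my own illustration, not CompCert code) of what the superoptimizer does to this one subgraph: it enumerates candidate replacements consisting of a single comparison and, in place of the SMT equivalence query, checks each candidate by exhaustive testing over 8-bit inputs. In the real tool the candidate space is all IR graphs up to a size bound and the check is a proof-producing SMT query, but the shape of the search is the same.

#include <stdio.h>

/* the harvested subgraph: (x==y) || (x>y) */
static int orig (int x, int y) { return (x == y) || (x > y); }

/* candidate replacements of size one: a single comparison */
enum op { LT, LE, GT, GE, EQ, NE, NOPS };

static int eval (int o, int x, int y)
{
  switch (o) {
  case LT: return x < y;
  case LE: return x <= y;
  case GT: return x > y;
  case GE: return x >= y;
  case EQ: return x == y;
  default: return x != y;
  }
}

int main (void)
{
  static const char *name[NOPS] = { "x<y", "x<=y", "x>y", "x>=y", "x==y", "x!=y" };
  for (int o = 0; o < NOPS; o++) {
    int equivalent = 1;
    /* exhaustive testing over 8-bit inputs stands in for the SMT
       equivalence query described above */
    for (int x = -128; x <= 127 && equivalent; x++)
      for (int y = -128; y <= 127; y++)
        if (orig (x, y) != eval (o, x, y)) {
          equivalent = 0;
          break;
        }
    if (equivalent)
      printf ("replacement found: t3 = %s\n", name[o]);
  }
  return 0;
}

On this example the program prints a single line, “replacement found: t3 = x>=y,” mirroring the replacement described above.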

Of course, the superoptimizer that I have sketched is extremely slow. The Bansal and Aiken paper shows how to make the technique fast enough to be practical. All of their tricks should apply here. Very briefly, the speedups include:

  1. Testing many harvested sequences at once
  2. Reducing the search space using aggressive canonicalization
  3. Avoiding most SMT calls by first running some simple equivalence tests
  4. Remembering successful transformations in a database that supports fast lookup

The Bansal and Aiken superoptimizer operated on sequences of x86 instructions. Although this permitted the tool to exploit x86-specific tricks, it also had a couple of serious disadvantages. First, a short linear sequence of x86 instructions harvested from an executable does not necessarily encode an interesting unit of computation. In contrast, if we harvest subgraphs from the PDG, we are guaranteed to get code that is semantically connected. Second, the Stanford superoptimizer has no ability to see “volatile” memory references that must not be touched — it will consequently break code that uses multiple threads, UNIX signals, hardware interrupts, or hardware device registers.

The technique outlined in this piece is what I call a weak superoptimizer: it finds short equivalent code sequences using brute force enumeration. A strong superoptimizer, on the other hand, would pose the following queries for each harvested subgraph G:

  1. Does there exist a subgraph of size 0 that is equivalent to G? If not…
  2. Does there exist a subgraph of size 1 that is equivalent to G? If not…
  3. Etc.

Clearly this leans much more heavily on the solver. It is the approach used in the Denali superoptimizer. Unfortunately, no convincing results about the workability of that approach were ever published (as far as I know), whereas the weak approach appears to be eminently practical.

In summary, this post contains what I think are two relatively exploitable ideas. First, a peephole superoptimizer should be based on subgraphs of the PDG rather than linear sequences of instructions. Second, proof-producing SMT should provide a relatively easy path to verified peephole superoptimization. If successful, the result should be a significant improvement in the quality of CompCert’s generated code.

Update from 3/29: A random thing I forgot to include in this post is that it would be easy and useful to teach this kind of superoptimizer to take advantage of (nearly) arbitrary dataflow facts. For example, “x /= 16” (where x is signed) is not equivalent to x >>= 4. However, this equivalence does hold if x can be proved to be non-negative using a standard interval analysis. Encoding a fact like “x ≥ 0” in the input to the SMT solver is trivial. The nice thing about dataflow facts is that they give the superoptimizer non-local information.
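
A quick check makes the difference concrete (my own example; note that right-shifting a negative value is implementation-defined in C, though arithmetic shift is what essentially every compiler does): signed division truncates toward zero, while an arithmetic shift rounds toward negative infinity, so the two disagree for negative values that are not multiples of 16.

#include <stdio.h>

int main (void)
{
  int x = -15;
  /* prints "-15/16 = 0, -15>>4 = -1" on a typical two's-complement
     machine: division truncates toward zero, the shift rounds down */
  printf ("%d/16 = %d, %d>>4 = %d\n", x, x / 16, x, x >> 4);

  x = 47;
  /* for non-negative x the two agree: prints "47/16 = 2, 47>>4 = 2" */
  printf ("%d/16 = %d, %d>>4 = %d\n", x, x / 16, x, x >> 4);
  return 0;
}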

I should also add that when I said “if H is smaller than G, then replace G with H” of course I really mean “cheaper” rather than “smaller” where the cost of a subgraph is determined using some heuristic or machine model.

Even after adding a nice IR-level superoptimizer to CompCert, a backend superoptimizer just like Bansal and Aiken’s is still desirable: it can find target-dependent tricks and it can also get rid of low-level grunge such as the unnecessary register-register moves that every compiler’s output seems to contain. The main fix to their work that is needed is a way to prevent it from removing or reordering volatile memory references; this should be straightforward.

Peer Review Poker

Peer review is a bureaucratic, consensus-based approach to making decisions. Thus, it is not inherently entertaining and authors like myself need to amuse ourselves as best we can. One of the games I like to play is peer review poker, where certain combinations of review scores are more desirable than others.

Straight: Review scores form a continuous sequence. A straight, particularly if it is anchored at the high end of the spectrum, is extremely desirable because provoking a diversity of reactions means that there is something interesting about the work. The best kind of straight — which I have accomplished only once, as far as I can remember — is when each possible score is received exactly once.

Flush: All reviews are the same score. This is not particularly desirable as it indicates a predictable, uncontroversial paper. Of course, if all scores are “strong accept” then predictability doesn’t seem so bad — but this hardly ever happens. The other extreme, all “strong reject” scores, is certainly achievable.

Obviously, two of a kind, etc. are very common peer review outcomes and are not particularly interesting. By taking reviewer expertise into account it should be possible to define a royal flush but probably that’s pushing the analogy too far.