How to Write a C/C++ Compiler That Respects Volatile

The volatile type qualifier in C/C++ means roughly that accesses to the qualified object happen on the actual machine as they do in the abstract machine. I’ve written about volatile pretty extensively, so I won’t repeat myself here.

An interesting problem with volatile is that in practice, compilers fail to respect it: they add, remove, and reorder accesses in ways that break the rules. This happens because the rules for translating accesses to volatile-qualified objects are very different from the rules for accessing regular C variables: the compiler has nearly complete freedom to add, remove, and reorder non-volatile variable accesses so long as this doesn’t change the program’s externally observable behavior. Thus, many optimization passes require special cases like this:

if (!var->is_volatile()) transform_code();

The problem is that every optimization developer must add these extra safety checks every time — any omission is likely to break the properties that volatile is intended to preserve.

A few years ago Eric Eide and I observed that the rules for accessing volatile objects are very similar to the rules for manipulating function calls: when the compiler lacks any special knowledge about a called function, it must not add, remove, or reorder calls to it. This led to the idea that if compiler writers would simply model volatile accesses as function calls, all of those special cases in the optimization passes would go away. We tested this idea by writing a source-to-source transformer for C code that turned accesses to volatiles into calls to automatically generated helper functions. For example, if x is declared as “volatile int x;” then this code:

y=x;

becomes:

y=__volatile_read_int(&x);
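
The helper itself just performs the access; a minimal sketch of what the generated read and write wrappers for int might look like (the transformer’s actual output may differ in detail):

/* Sketch of automatically generated wrapper functions; the idea is that
   the only volatile accesses left in the program live inside these
   small functions, which the optimizer must treat as opaque calls. */
int __volatile_read_int(volatile int *p) {
  return *p;
}

void __volatile_write_int(volatile int *p, int v) {
  *p = v;
}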

Our idea mostly worked. What I mean is that many, but not all, problems in miscompilation of accesses to volatiles went away after transforming programs. Our hypothesis was that when wrapper functions didn’t work, it was always because the compiler was performing a regular miscompilation (i.e., generating the wrong code in a way that has nothing to do with volatile). But we couldn’t back this up since we lacked a correct compiler and we also didn’t have time to manually inspect thousands of failed test cases.

So far this is all old news, but there has been a very nice new development in volatile-land. As of recently, CompCert implements a proved volatile semantics like this:

  1. Accesses to volatile-qualified objects are turned into function calls in the frontend.
  2. The target-independent optimization passes totally ignore these function calls.
  3. In the backend, the function calls are turned into inline code that performs the accesses.

This is basically the strategy that Eric and I came up with, but with a nice improvement: much of the overhead associated with actual function calls is avoided due to the late inline substitution. The overhead of function calls — particularly in terms of code size — would be significant for small embedded applications that consist mainly of accesses to device registers. Some overhead will likely remain due to the calling convention and because CompCert must pessimistically assume that a helper function has updated any global variables that happen to be cached in registers. Hopefully the CompCert folks will quantify the overheads of various alternatives in a paper sometime.

We looked for volatile bugs in CompCert and found a few: they were, for example, in the unproved frontend code that expands C-level struct assignments into Clight. After Xavier fixed these bugs (I think there were two of this ilk), we can no longer find volatile bugs in CompCert. Now we finally come to the point of this post:

Out of the dozen-odd compilers we have tested, there is only one — CompCert — that reliably respects C’s volatile qualifier.

This is an impressive result.

Update from March 1, 2011:

My description of CompCert is a bit out of date. Xavier Leroy explains:

One little correction, though: the handling of volatiles you describe is that of CompCert 1.7.

In the latest release CompCert 1.8, I improved the generated code by, in essence, inlining the _volatile_read and _volatile_write functions after optimizations are done, but before register allocation. (In reality, it’s done a bit differently, through a notion of “inlinable builtin functions” of which the volatile operations is an instance.)

This way, the generated code isn’t constrained by the calling conventions: the compiler knows that the caller-save registers are not destroyed by the _volatile_* functions, and that these “functions” can take their arguments in any register, not just those dictated by the calling conventions.

This sounds like exactly the right solution: not only does it give us correct code and optimization passes that are free of volatile-related clutter, but the performance and size of the generated code should be very good.

Life With BibTeX

For a while I’ve had a blog post about BibTeX on the back burner, but now I don’t need to write it because Dan Wallach has done so. It’s a great post, but for emphasis I’ll repeat a handful of points that I tell students (and would like to tell authors of some papers I review) over and over. BibTeX users must:

  • Repeatedly proofread the BibTeX output. (For important papers I usually proofread to fixpoint: keep reading and editing the paper until I can read the entire document without wanting to make a single change. This is incredibly time-consuming.)
  • Never assume a bib entry found online is accurate.
  • Strive for consistency in fields listed, abbreviations used, etc. across all entries used in any single paper.
  • Proofread yet again every time the bibliography style is changed.
  • Include fields that help people identify and track down the cited paper, and (mostly) leave out fields that do not.

Ah, it feels better getting that off my chest. Thanks, Dan, for spelling out all the details!

Differential Whitebox Testing Is Good

[This post is based on data gathered by Chad Brubaker and Peng Li, respectively an undergrad and grad student in CS at Utah.]

Two courses I’ve taught at Utah, CS 4400 and CS 5785, have assignments where students write short integer functions for which we — the instructors — have automatic graders. In 4400 the students get access to the test framework while they work; in 5785 they do not.

Klee is a whitebox test case generation tool: it looks at a piece of code and attempts to generate a collection of inputs that cover all possible execution paths. I was curious how Klee would fare when compared to the hand-written testers, both of which have been in use for more than five years.

One way to use Klee is like this:

  1. Klee generates test cases for a piece of code written by a student
  2. We run the test cases through the student code and the reference code, looking for unequal outputs

This works poorly: Klee fails to find lots of bugs found by the hand-written testers. Klee can also be used to perform differential whitebox test case generation (which I already briefly discussed here):

  1. Klee generates test cases for the code assert(student_code(input)==reference_code(input))
  2. We run the test cases through the student code and the reference code, looking for unequal outputs

This works very well: Klee’s tests not only find all of the bugs found by the hand-written testers, but also identify some new ones. What appears to be going on is that the hand-written testers use values that the instructors’ intuitions suggest should be good at triggering bugs. But students do not always write code corresponding to the instructors’ intuitions; in that case, paths are missed by the hand-written tester, whereas Klee goes ahead and thoroughly tests them.
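
As a concrete illustration, a differential harness for one of the saturating operations might look roughly like the sketch below; sat_add_student and sat_add_ref are hypothetical stand-ins for a student solution and the reference code, and klee_make_symbolic is Klee’s API for marking an input as symbolic.

#include <assert.h>
#include <klee/klee.h>

/* Hypothetical stand-ins for a student solution and the reference code. */
int sat_add_student(int a, int b);
int sat_add_ref(int a, int b);

int main(void) {
  int a, b;
  klee_make_symbolic(&a, sizeof(a), "a");
  klee_make_symbolic(&b, sizeof(b), "b");
  /* Klee looks for inputs that reach the failing branch of the assert,
     i.e., inputs on which the two implementations disagree. */
  assert(sat_add_student(a, b) == sat_add_ref(a, b));
  return 0;
}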

Results

From CS 4400 we have 1050 bit puzzle solutions from students. 84 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus seven more, for a total of 91.

From CS 5785 we have 336 saturating operations from students. 115 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus six more, for a total of 121.

Differential Klee vs. Exhaustive Testing

Klee can find more bugs, but can it find all of them? To explore this, we exhaustively tested each function that has ≤32 bits of input.
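
For a single 32-bit input, the exhaustive check is just a loop over the whole input space; here is a minimal sketch, where student_fn and ref_fn are hypothetical stand-ins for a student solution and the reference code.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for a student solution and the reference code. */
int student_fn(int32_t x);
int ref_fn(int32_t x);

/* Returns 1 if the two functions agree on every possible 32-bit input. */
int check_exhaustively(void) {
  for (int64_t i = INT32_MIN; i <= INT32_MAX; i++) {
    int32_t x = (int32_t)i;
    if (student_fn(x) != ref_fn(x)) {
      printf("mismatch at input %" PRId32 "\n", x);
      return 0;
    }
  }
  return 1;
}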

For the bit puzzles, exhaustive testing revealed some bugs not found by Klee. I’ll provide details in a subsequent post.

For the saturating operations, exhaustive testing revealed a single bug not found by Klee. The problem turned out to be code that computed the maximum and minimum integer values like this:

int type = 8*sizeof(signed char);
min = -(pow(2,type)/2);
max = min+1;
max = 0-max;

Clang/LLVM fails to optimize away the pow() call, which remains in the bitcode where (for reasons I haven’t looked into) it prevents Klee from generating good test cases. Klee properly found the bug in this function when the code was changed to this:

min = CHAR_MIN;
max = CHAR_MAX;

The student couldn’t have written the code this way, however, since their math operations had to work for four different integer widths.
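
For what it’s worth, the extreme values for an arbitrary width can be computed without pow() and without risking signed overflow by doing the shift in unsigned arithmetic; this is only a sketch (assuming a 32-bit int), not the student’s code.

/* Sketch: extreme two's complement values for a width of `bits` bits,
   1 < bits <= 32; the shift and subtraction happen on unsigned values,
   so there is no undefined behavior and no floating point. */
int bits = 8 * (int)sizeof(signed char);   /* or short, int, etc. */
int max = (int)((1u << (bits - 1)) - 1u);  /* 127 when bits is 8 */
int min = -max - 1;                        /* -128 when bits is 8 */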

Conclusion

A 5%-8% increase in the number of buggy functions identified is pretty significant. Also, by using Klee we no longer need to write a test case generator by hand (although I still would, since I wouldn’t necessarily trust Klee to do a good job without some supporting evidence).

Quality Not Quantity?

The perverse incentives for academics to maximize publication and citation counts, as opposed to maximizing the quality and impact of the underlying research, are well-known. Stan Trimble’s recent letter to Nature suggests a partial solution: academic institutions should limit the number of publications that are part of a tenure or promotion case. This is simple enough that it might actually work. I’d suggest a three-publication limit for faculty applications, five for tenure, and perhaps seven for full professor cases. A bit of gaming would be required to navigate certain situations:

  • Do I list a mediocre paper that appeared at a top conference, or a stronger one from a minor venue?
  • Do I list a major paper where I played a minor role, as opposed to a less impressive single-author paper?
  • If I’m switching areas, do I focus on the previous area (which is what my letter writers are likely to discuss) or the new one?

There’s plenty of gamesmanship in the process already, so these shouldn’t be a big deal.

A more extreme version (attributed to Matthias Felleisen, though I don’t know if he originated it) is: faculty are hired as full professors. Upon publishing their first paper, they are demoted to associate professor, then assistant professor, and then after the third publication they are required to leave academia.

The idea behind these various proposals, of course, is to cause people to view publication as a precious resource, rather than as a metric to be maximized. There would be pushback from researchers who act as paper machines, but I’d expect it could be overcome.

Visualizing Math Bugs

After getting too tired to work tonight, I realized I had no Netflix and that I’m bored with the books I’m reading. Therefore, I present visualizations of several solutions I’ve received to my saturating arithmetic homework over the years.

This is what a correct saturating signed add looks like, where the z-axis is the output:

Here are some solutions that are incorrect. I love the ones that look like origami cranes.

A Few Thoughts About Path Coverage

Klee isn’t the only tool in its class, nor was it the first, but it’s open source and well-engineered and it works. Klee’s goal is to generate a set of test inputs that collectively induce path coverage in a system under test. One of the scenarios I was curious about is the one where Klee achieves full path coverage but still fails to expose a bug, for example because a stronger kind of coverage such as value coverage is required. Coming up with these examples is easy. Consider the following function for computing the number of bits set in a 64-bit word:

const int pop8[256] = {
  0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 4, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};
int popcount (unsigned long long x) {
  int c=0;
  int i;
  for (i=0; i<8; i++) {
    c += pop8[x&0xff];
    x >>= 8;
  }
  return c;
}

This function admits only a single path of execution, and therefore Klee comes up with only a single test case: the 64-bit word containing all zero bits. This popcount function is, in fact, wrong (one of the values in the table is off by one) but Klee’s input does not trigger the bug. At this point we’re more or less stuck — Klee cannot find the bug and we’re back to code inspection or random testing or something. But what if we have a second popcount implementation?

int popcount_slow (unsigned long long x) {
  int c=0;
  int i;
  for (i=0; i<64; i++) {
    c += x&1;
    x >>= 1;
  }
  return c;
}

Unlike the previous implementation, this one is correct. Like the previous implementation, it has only a single path — so again Klee will give us just one test case. The way to find an input where the two functions return different answers is to just ask Klee directly:

assert (popcount(x) == popcount_slow(x));
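
Concretely, the assertion can be wrapped in a tiny harness with a symbolic 64-bit input; this sketch assumes Klee’s klee_make_symbolic interface and that the two popcount functions above are in the same file.

#include <assert.h>
#include <klee/klee.h>

int main(void) {
  unsigned long long x;
  klee_make_symbolic(&x, sizeof(x), "x");
  assert(popcount(x) == popcount_slow(x));
  return 0;
}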

Now there are two paths, a succeeding one and a failing one, and it takes Klee just a few seconds to find inputs that execute both paths. This is a pretty impressive result given the large input space. Of course, in this particular case random testing using uniformly-distributed 64-bit inputs finds this bug very quickly, since roughly one in 32 random inputs contains a byte value that indexes the bad table entry. However, we could write code where the probability of a random input triggering the bug is arbitrarily small (though fooling a whitebox test case generator is not totally trivial: many of the tempting ways to add a bug would also add a new code path, which Klee would duly explore).

The interesting thing about path coverage is that it’s massively insufficient for finding all bugs in real code, and yet it is also impossible to achieve 100% feasible path coverage on real software systems due to the path explosion problem. The conclusion I draw is that future Klee-like tools should provide the option of targeting even stronger kinds of coverage (path + boundary value coverage, for example) and that these tools will be most useful in unit-testing of smallish codes.

The problem in testing my code above stems from its use of a lookup table. Perhaps people who test software for a living have come up with reasonably generic ways to detect problems in lookup tables, but I’m unaware of them. Notice that MC/DC coverage — used in validating Level A software in DO-178B, and one of the stronger coverage metrics that is actually used in practice — is not any better than path coverage for finding the popcount bug.

Undefined Integer Behaviors in Student Code, Part 1

[This post is based on material from Chad Brubaker, a really smart CS undergrad at Utah who did all the work getting these data. The integer undefined behavior checker was created by my student Peng Li.]

Integer undefined behaviors in C/C++, such as INT_MAX+1 or 1<<-1, create interesting opportunities for compiler optimizations, and they also make programming in C/C++ not unlike crossing a minefield while blindfolded. This piece describes the results of running a checker for integer undefined behaviors on code submitted by several years’ worth of students in Utah’s CS 4400. The checker simply inserts checking logic in front of any potentially-undefined operation. We then ran each student’s code on the same inputs used by the test harness the students were given as part of the assignment. CS 4400 is based on Computer Systems: A Programmer’s Perspective, by Bryant and O’Hallaron. It’s a great textbook and I think students get a lot out of the course. I taught it three times in the early/mid 2000s and others have taught it since then.
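
To make the checking concrete, a guard of roughly the following shape could be placed in front of each potentially-undefined left shift. This is purely illustrative; it sketches the idea rather than reproducing Peng’s actual instrumentation.

#include <limits.h>
#include <stdio.h>

/* Illustrative wrapper for a signed left shift "a << b": it reports the
   cases that C99 makes undefined, then performs the original operation. */
static int checked_shl(int a, int b, const char *where) {
  int width = 8 * (int)sizeof(int);
  if (b < 0 || b >= width)
    fprintf(stderr, "%s: shift by %d is undefined\n", where, b);
  else if (a < 0)
    fprintf(stderr, "%s: left-shift of negative value %d is undefined\n", where, a);
  else if (a > (INT_MAX >> b))
    fprintf(stderr, "%s: %d << %d overflows\n", where, a, b);
  return a << b;  /* the original (possibly still undefined) operation */
}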

The assignment is to write C code solving various small “bit puzzles.” Each solution must be straight-line code and can only use C’s bitwise operators: ! ~ & ^ | + << >>. There are a few additional restrictions that are specific to individual problems, such as limits on the number of operations and further restrictions on operator choice.

Students are given reference solutions for the bit puzzles (written in C, but using loops and otherwise not following the rules), and they are also given a harness that runs their code against the reference solutions on some fixed test vectors.

The results below divide students’ solutions into four bins:

  • correct results and no undefined behavior
  • wrong results (for at least one input) and no undefined behavior
  • correct results but contains undefined behavior (for at least one input)
  • wrong results and contains undefined behavior

The most interesting case is where code gives the right answers despite executing undefined behaviors. These solutions are basically just lucky. In other words, the version of GCC installed on our lab machines, at the optimization level specified by the makefile, happens to translate the undefined operations into code that produces the expected answers. But even the same code, on the same compiler, using the same optimization options, could break if called from a different context.

get_least_significant_byte

Extract the least significant byte from an int.

              correct   wrong
no undefined      105       0
undefined           0       0

is_equal

Return 1 if two ints are equal, and zero otherwise (note that == is not an allowed operator).

              correct   wrong
no undefined       97       0
undefined           8       0

All of the undefined behaviors were overflow of signed addition.

any_even_one

Return 1 if any even bit of an int is set, otherwise return 0.

              correct   wrong
no undefined      102       3
undefined           0       0

clear_all_but_lower_bits

Clear all but the N least significant bits of an int.

              correct   wrong
no undefined       67       1
undefined          34       3
  • 16 solutions contained signed addition overflows
  • 4 contained a signed right-shift by negative or past bitwidth
  • 17 contained a signed left-shift by negative or past bitwidth

rotate_right

Rotate the bits in an int to the right.

              correct   wrong
no undefined        7       3
undefined          87       8
  • 2 overflowed a signed addition
  • 8 contained an unsigned left-shift by negative or past bitwidth
  • 7 contained an unsigned right-shift by negative or past bitwidth
  • 8 contained a signed right-shift by negative or past bitwidth
  • 70 contained a signed left-shift by negative or past bitwidth

indicate_rightmost_one

Generate an int with a single set bit corresponding to the rightmost set bit in another int.

              correct   wrong
no undefined        1      11
undefined          93       0

93 solutions overflowed a signed addition — only a single student managed to avoid this problem and also return the right answers.

divide_by_four

Divide an int by four, rounding towards zero.

              correct   wrong
no undefined       94       8
undefined           0       3

3 solutions overflowed a signed addition.

fits_n_bits

Return 1 if a value can be represented as an N-bit 2’s complement integer.

              correct   wrong
no undefined       79      12
undefined          10       4
  • 9 solutions overflowed a signed addition
  • 2 solutions contained a signed right-shift by negative or past bitwidth
  • 3 solutions contained a signed left-shift by negative or past bitwidth

This puzzle was interesting because it was the only one where we found undefined behavior in the reference solution, which executes this code to compute the largest integer representable in n bits:

tmax_n = (1<<(n-1)) - 1;

When n is 32, 1<<31 evaluates in practice to 0x80000000, which is INT_MIN on a machine where int is 32 bits. Then, evaluating INT_MIN-1 is an operation with undefined behavior.

Moreover, in C99, evaluating 1<<31 is undefined.
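
One well-defined alternative is to do the shift and the subtraction in unsigned arithmetic and convert at the end; a sketch, assuming a 32-bit int and 1 ≤ n ≤ 32:

/* The shift and subtraction happen on unsigned values, so nothing here
   is undefined, and the result always fits in an int. */
tmax_n = (int)((1u << (n - 1)) - 1u);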

is_greater

Return 1 if one int is larger than another, and zero otherwise.

              correct   wrong
no undefined        4      11
undefined          84       6

90 solutions overflowed a signed addition.

subtract_overflows

Return 1 if subtracting one int from another overflows, and zero otherwise.

              correct   wrong
no undefined        0      13
undefined          89       3

92 solutions overflowed a signed addition — nobody managed to avoid this hazard while also returning the right answers.

What’s the Problem?

The problem is that a large fraction of our junior-level CS undergraduates are producing code that executes integer undefined behaviors. At best, this kind of code is not portable across compilers, platforms, or even changes in compiler version, optimization options, or calling context. At worst, undefined behaviors are time bombs that will lead to exploitable vulnerabilities or other serious malfunctions in software systems. I want our graduates to understand what undefined behavior means and how to avoid it, not because C/C++ programming is so great, but because the concepts are much more broadly applicable.

What’s the Solution?

Fixing the problem in the context of Utah’s CS 4400 course should be straightforward:

  • Include some lecture material on undefined behaviors
  • Make Peng’s undefined behavior checker part of the test suite that students are given as part of the assignment; since it gives nice clear error messages, the output should be directly actionable

I’ll see if Erin, the current instructor, is interested in working with me to implement this for Fall 2011. We’ll see how it goes.

[Note: I’ve deliberately left code solving the bit puzzles out of this post, and I will reject any comments containing solutions.]

Should Software Evolve?

Weimer, Nguyen, Le Goues, and Forrest wrote a paper, “Automatically Finding Patches Using Genetic Programming,” that proposes and evaluates this approach for fixing a buggy software system:

  1. A failure-inducing test case is found
  2. The system code is mutated randomly, but using smart heuristics to increase the probability that the part of the program that is broken is modified
  3. The new version of the system is tested against the new and old test cases, and is accepted if it passes all of them
  4. The fix is minimized using delta debugging

The paper was published in 2009 at ICSE — a highly selective software engineering conference — and it was one of the award papers the year it appeared. In other words, the software engineering community considered it to be one of the best papers of 2009. There is no doubt the approach is interesting and that it should be tried. On the other hand, it seems pretty clear that fixing bugs by genetic programming is sometimes inappropriate. Since the authors of the paper didn’t speculate about when their technique should and shouldn’t be used, I’ll do it.

First off, the method from the Weimer paper is very similar to something I see when teaching CS courses. Some fraction of the students approaches programming problems like this:

  1. Write code purporting to solve the assigned problem
  2. Run test cases
  3. Tweak the logic more or less randomly
  4. Go back to 2 until time runs out or all test cases succeed

This is not a good approach. When it works, it permits students to believe they do not need to practice incremental development, and that it’s OK to produce code that would be impossible to maintain. It also defeats the purpose of the assignment — learning — by reducing program construction to a series of mindless tweaks. It works poorly in practice, even for fairly simple assignments: the best outcome is an over-constrained and fragile program that solves some or all test cases, but that contains convoluted logic and needless special cases that render it incorrect for inputs not in the test set. The other outcome (and this happens pretty often) is that the student’s code steadily evolves to be more and more wrong until in the end the code must be thrown away and the student starts over.

Of course, just because it’s a disaster when students use evolutionary programming doesn’t mean that computers shouldn’t do it. My first attempt at a rule for when it’s OK to evolve software was this:

Software for non-critical applications may be evolved, whereas software for safety-, security-, or mission-critical systems should not be.

But this isn’t a good rule. It’s perfectly fine for certain parts of the software controlling my car’s engine to evolve (I’ll say why below), and some kinds of totally non-critical software should never evolve. Here’s a rule that I think is better:

Black box software whose correctness can be externally verified can and perhaps should evolve. Program logic that must be understood, modified, analyzed, and (especially) maintained by human developers should not evolve.

For example, if a genetic algorithm is used to evolve the most efficient valve timings for my car engine or the most effective buffer size for my video player, then great — these can be independently verified and the fact that they evolved is irrelevant. On the other hand, if my compiler crashes or emits the wrong code, can it be repaired by an evolutionary technique? No. The logic required to fix any non-trivial compiler bug is intimately entwined with the compiler’s existing logic. Evolving it would be perhaps the dumbest idea on record. My guess is that the core logic of many important software systems (operating systems, virtual machine managers, database engines, etc.) is similarly not amenable to evolutionary bug fixing.

Are evolutionary techniques somehow incompatible with operating systems and compilers? Absolutely not. Parameters to the OS paging or prefetching policy, or the ordering of phases in a compiler, can and should be evolved to be better. But again, this will work because nobody needs to understand the prefetch policy as long as it works. A bad prefetch policy or phase ordering will hurt performance, but won’t break the system.

Of course I’d be happy to be proved wrong, and here’s how the software evolution people can do it: as soon as they ask me to, I’ll start handing them compiler bug reports at the same time I hand these to the LLVM and GCC maintainers. Then, they can try to evolve a fix before the compiler maintainers get around to looking at the problem. This should be easy because evolutionary bug fixing is totally automated whereas manual bug fixing requires stealing time from well-paid and extremely busy developers. Of course the evolved compiler patches should be submitted to the compiler maintainers. If they can get just five evolved patches accepted to the GCC or LLVM trees, I will publicly apologize for being so very, very wrong about their work, and will also buy multiple beers for whatever subset of them I next run into at a conference.

Update: I should add that I would very much like evolutionary fixing of compiler bugs to work out, because this method could be combined with my group’s automated detection of compiler bugs. The result would be a compiler that spontaneously becomes more correct.

Random Testing Gets No Respect

A funny thing about random testing is that it gets little respect from most computer scientists. Here’s why:

  • Random testing is simple, and peer reviewers hate simple solutions.
  • Random testing does not lend itself to formalization and proofs, which peer reviewers love.
  • Random testing provides no guarantees. Peer reviewers love guarantees — regardless of whether they are actually useful.
  • Random testing is hard to do well, but the difficulty is more art than science: it requires intuition, good taste, and domain knowledge. These — like many other important things — can’t be neatly packaged up in a journal paper.
  • People dislike the probabilistic aspect, and would prefer a fixed test suite.
  • Most researchers don’t produce robust software, and therefore have trouble understanding why anyone would need testing techniques that can help produce robust software by uncovering difficult corner-case bugs.

Of course, random testing has plenty of well-known problems, but I often get the feeling that people are responding to prejudices rather than facts. Even so, the picture is far from bleak and there are plenty of good researchers who exploit or at least appreciate random testing. My favorite recent example is found in this paper by Fox and Myreen, who are working on a formal model of the ARM instruction set architecture in the HOL4 theorem prover. This kind of work is the most pedantic, hard-nosed possible kind of formal methods, and yet they validate their model (and find bugs in it) using humble randomized differential testing — against real ARM boards, no less.