Better Testing With Undefined Behavior Coverage

[The bit puzzle results are based on data from Chad Brubaker and the saturating operation results are based on data from Peng Li. They are respectively an undergrad and a grad student in Utah’s CS program.]

Klee is a tool that attempts to generate a collection of test cases inducing path coverage on a system under test. Path coverage means that all feasible control flow paths are executed. It is a strong kind of coverage, but still misses bugs. One way to improve Klee would be to add support for different kinds of coverage metrics: weaker ones like statement coverage would scale to larger programs, and stronger ones such as boundary-value coverage would find more bugs in small codes.

A different way to improve Klee is to continue to target path coverage, but alter the definition of “path.” For example:

  • When testing an x86-64 binary containing a cmov instruction, we could make sure to execute both its condition-true path and condition-false path.
  • When testing the C expression foo(bar(),baz()), we could make sure to test evaluating foo() and bar() in both orders, instead of just letting the compiler pick one.

This piece proposes undefined behavior coverage, which simply means that for any operation that has conditionally-defined behavior, the well-defined and the undefined behaviors are considered to be separate paths. For example, the C expression 3/y has two paths: one where y is zero and the other where y is non-zero.

Obviously, undefined behavior coverage only makes sense for languages such as C and C++ that admit operations with undefined behavior. An undefined behavior, as defined by the C and C++ standards, is one where the language implementation can do anything it likes. The point is to make the compiler developers’ job easier — they may simply assume that undefined behavior never happens. The tradeoff is that the burden of verification is shifted onto language users.

Undefined behavior coverage makes sense for what I call type 2 functions: those whose behavior is conditionally well-defined.

An Example

Here’s a simple C function:

int add_and_shift (int x, int y, int z) {
  return (x+y)<<z;
}

Due to C’s undefined behaviors, this function has a non-trivial precondition:

0 ≤ z < sizeof(int)*CHAR_BIT
INT_MIN ≤ x+y ≤ INT_MAX

(This is for ANSI C; in C99 the precondition is stronger and quite a bit more complicated, but we won’t worry about that.) If the precondition is not satisfied, the function’s return value is unpredictable. In fact, it’s a bit worse than that: as soon as the program executes an undefined behavior the C implementation is permitted to send email to the developer’s mother (though this hardly ever happens).

The point is that although shift_and_add() seems to admit a single path, it really has a number of additional paths corresponding to failed preconditions for its math operators. If we fail to test these paths, we can miss bugs. Since the precondition checks for math operators in C/C++ are pretty simple, we can just add them in an early phase of the compiler, and that’s exactly what Peng’s hacked version of Clang does.

Without undefined behavior checks, LLVM code for add_and_shift() looks like this:

define i32 @add_and_shift(i32 %x, i32 %y, i32 %z) nounwind readnone {
entry:
  %add = add i32 %y, %x
  %shl = shl i32 %add, %z
  ret i32 %shl
}

Obviously there’s just one path, and the test case that Klee picks to exercise this path is:

  • x = 0, y = 0, z = 0

Next, we compile the same function with undefined behavior checks and run Klee again. This time we get four test cases:

  • x = 0, y = 0, z = 0
  • x = 0, y = 0, z = 64
  • x = -2, y = INT_MIN, z = 0
  • x = 2, y = 1, z = 0

The first three tests are exactly the kind of inputs we’d hope to see after looking at the precondition. The 4th input appears to follow the same path as the first. I don’t know what’s going on — perhaps it emerges from some idiosyncrasy of the checked code or maybe Klee simply throws in an extra test case for its own reasons.

Combining Klee with an undefined behavior checker causes Klee to generate additional test cases that — by invoking operations with undefined behavior — should shine some light into dark corners of the system under test. A potential drawback is that all the extra paths are going to cause the path explosion problem to happen sooner than it otherwise would have. However, this should not be serious since we can just run Klee on both versions of the code.

But this is all just talk. The real question is: does this method find more bugs?

Bit Puzzle Results

The first collection of code is several years’ worth of solutions to an early assignment in Utah’s CS 4400. I already discussed these, so I won’t repeat myself. For each bit puzzle, students receive a reference implementation (which they cannot simply copy since it doesn’t follow the rules for student solutions) and a simple test harness that runs their code against the reference implementation on some inputs, compares the results, and complains about any differences. For each of 10 bit puzzles we have 105 solutions written by students. The automated test suite determines that 84 of these 1050 solutions are faulty. In other words, they return incorrect output for at least one input. Differential testing with Klee finds seven additional buggy functions, for a total of 91.

When the students’ codes are augmented with checks for integer undefined behaviors, Klee finds more paths to explore. The test cases that it generates find the 91 incorrect functions that are already known plus 11 more, for a total of 102 buggy functions. Just to be perfectly clear: a buggy function is one that (after being compiled by GCC) returns the wrong output for an input in a test suite. We are not counting instances of undefined behavior as bugs, we are simply using Klee and the undefined behavior checker to generate a better test suite.

We were able to exhaustively test some of the bit puzzles. In these cases, exhaustive testing failed to find any bugs not found by differential Klee with undefined behavior coverage.

Saturating Operation Results

The second collection of code is 336 saturating math operations. In this case, the additional tests generated by Klee to satisfy undefined behavior coverage found no additional buggy functions beyond those found using differential whitebox testing. My hypothesis is that:

  1. The shift-related undefined behaviors in these functions always involved constant arguments, since shifts were used only to compute values like the maximum and minimum representable integer of a certain width. Since the arguments were constant, Klee had no opportunity to generate additional test cases.
  2. The addition and subtraction overflow undefined behaviors were compiled by GCC into modular arithmetic, despite the fact that this behavior is not guaranteed by the standard. This is a natural consequence of generating code using the x86 add and sub instructions. Modular arithmetic was the behavior that people (including me, as described in the previous post) wanted and expected. Therefore, undefined behavior coverage exposed no bugs. Modern C compilers sometimes compile math overflows in a non-modular way (for example, evaluating (x+1)>x to 1), but the saturating arithmetic functions — by chance — do not use code like that.

We were able to exhaustively test saturating operations that take chars (for 16 total bits of input) and short ints (for 32 total bits of input). In these cases, exhaustive testing failed to find any bugs not already found by differential Klee.

Conclusion

Undefined behavior coverage is a special case of a more interesting code coverage metric that I’ll describe in a subsequent post. We need to try Klee + undefined behavior coverage on some real applications and see what happens; I’m cautiously optimistic that it will be useful.

Undefined Integer Behaviors in Student Code, Part 2

[This post is based on data gathered by my student Peng Li. He also wrote the undefined behavior checker.]

The other day I posted about undefined integer behaviors in code written by students in a class I used to teach. This post is more of the same, this time from CS 5785, my advanced embedded systems course. Early in the course I often ask the students to implement saturating versions of signed and unsigned addition and subtraction. Their solutions are required to work regardless of whether an integer datatype is defined to be 8, 16, 32, or 64 bits long. Students are not given any kind of test suite, but I repeatedly emphasize that they should write lots of test cases and inspect the output. The assignment is graded by a tester I wrote that uses a combination of a few hundred values that lie near interesting boundaries, and a few thousand random values. The main reason I give this assignment is to get students warmed up a bit and to motivate some material later on in the course. The assignment is here (see part 2).

Unsigned Saturating Operations

These functions are easy to get right, and also it’s not hard to avoid undefined behavior since the math is unsigned.

ADD correct wrong
no undefined 66 14
undefined 1 3
SUBTRACT correct wrong
no undefined 74 10
undefined 0 0

Signed Saturating Operations

Although the amount of code required to implement these is small (my solutions are 5 lines each), they’re not completely straightforward.

ADD correct wrong
no undefined 15 10
undefined 28 31
SUBTRACT correct wrong
no undefined 11 4
undefined 20 49

The undefined behaviors were a mix of shift errors and overflowing addition and subtraction. Many students used some sort of shift as part of computing the maximum / minimum representable integer.

My Code Was Wrong Too

It turns out that my reference solutions (written years ago, before I understood the horror that is undefined behavior in C) contained two signed overflows. Ouch.

Can We Avoid Undefined Behavior?

90% of the signed operations written by students prior to 2010 executed an undefined integer behavior. In Fall 2010 I included some lecture material on undefined behavior and I also gave students access to Peng’s undefined behavior checker and told them to use it. The result? 46% of the 52 signed functions (written by the 26 students) executed an undefined integer behavior. This is a significant improvement, but still not very good. The real problem — judging from the large number of incorrect solutions — is that many students didn’t test their code very well. Next time I’ll encourage students to graph the functions implemented by their code; this makes some kinds of bugs obvious.

How to Write a C/C++ Compiler That Respects Volatile

The volatile type qualifier in C/C++ means roughly that accesses to the qualified object happen on the actual machine as they do in the abstract machine. I’ve written about volatile pretty extensively, so won’t repeat myself.

An interesting problem with volatile is that in practice, compilers fail to respect it: they add, remove, and reorder accesses in ways that break the rules. This happens because the rules for translating accesses to volatile-qualified objects are very different from the rules for accessing regular C variables: the compiler has nearly complete freedom to add, remove, and reorder non-volatile variable accesses so long as this doesn’t change the program’s externally observable behavior. Thus, many optimization passes require special cases like this:

if (!var->is_volatile()) transform_code();

The problem is that every optimization developer must add these extra safety checks every time — any omission is likely to break the properties that volatile is intended to preserve.

A few years ago Eric Eide and I observed that the rules for accessing volatile objects are very similar to the rules for manipulating function calls. In other words, when the compiler lacks any special knowledge about a called function, it must not add, remove, or reorder function calls. This lead to the idea that if compiler writers would simply model volatile accesses as function calls, all of those special cases in the optimization passes would go away. We tested this idea by writing a source-to-source transformer for C code that turned accesses to volatiles into calls to automatically generated helper functions. In other words, if x is defined as “volatile int x;” then this code:

y=x;

becomes:

y=__volatile_read_int(&x);

Our idea mostly worked. What I mean is that many, but not all, problems in miscompilation of accesses to volatiles went away after transforming programs. Our hypothesis was that when wrapper functions didn’t work, it was always because the compiler was performing a regular miscompilation (i.e., generating the wrong code in a way that has nothing to do with volatile). But we couldn’t back this up since we lacked a correct compiler and we also didn’t have time to manually inspect thousands of failed test cases.

So far this is all old news, but there has been a very nice new development in volatile-land. As of recently, CompCert implements a proved volatile semantics like this:

  1. Accesses to volatile-qualified objects are turned into function calls in the frontend.
  2. The target-independent optimization passes totally ignore these function calls.
  3. In the backend, the function calls are turned into inline code that performs the accesses.

This is basically the strategy that Eric and I came up with, but with a nice improvement: much of the overhead associated with actual function calls is avoided due to the late inline substitution. The overhead of function calls — particularly in terms of code size — would be significant for small embedded applications that consist mainly of accesses to device registers. Some overhead will likely remain due to the calling convention and because CompCert must pessimistically assume that a helper function has updated any global variables that happen to be cached in registers. Hopefully the CompCert folks will quantify the overheads of various alternatives in a paper sometime.

We looked for volatile bugs in CompCert and found a few: they were in, for example, the unproved frontend code that expands C-level struct assignments into Clight. After Xavier fixed these bugs (I think there were two of this ilk) we can no longer find volatile bugs in CompCert. Now we finally come to the point of this post:

Out of the dozen-odd compilers we have tested, there is only one — CompCert — that reliably respects C’s volatile qualifier.

This is an impressive result.

Update from March 1 2011:

My description of CompCert is a bit out of date. Xavier Leroy explains:

One little correction, though: the handling of volatiles you describe is that of CompCert 1.7.

In the latest release CompCert 1.8, I improved the generated code by, in essence, inlining the _volatile_read and _volatile_write functions after optimizations are done, but before register allocation. (In reality, it’s done a bit differently, through a notion of “inlinable builtin functions” of which the volatile operations is an instance.)

This way, the generated code isn’t constrained by the calling conventions: the compiler knows that the caller-save registers are not destroyed by the _volatile_* functions, and that these “functions” can take their arguments in any register, not just those dictated by the calling conventions.

This sounds like exactly the right solution: not only does it give us correct code and optimization passes that are free of volatile-related clutter, but the performance and size of the generated code should be very good.

Life With BibTeX

For a while I’ve had a blog post about BibTeX on the back burner, but now I don’t need to write it because Dan Wallach has done so. It’s a great post, but for emphasis I’ll repeat a handful of points that I tell students (and would like to tell authors of some papers I review) over and over. BibTeX users must:

  • Repeatedly proofread the BibTeX output. (For important papers I usually proofread to fixpoint: keep reading and editing the paper until I can read the entire document without wanting to make a single change. This is incredibly time-consuming.)
  • Never assume a bib entry found online is accurate.
  • Strive for consistency in fields listed, abbreviations used, etc. across all entries used in any single paper.
  • Proofread yet again every time the bibliography style is changed.
  • Include fields that help people identify and track down the cited paper, and (mostly) leave out fields that do not.

Ah, it feels better getting that off my chest. Thanks, Dan, for spelling out all the details!

Differential Whitebox Testing Is Good

[This post is based on data gathered by Chad Brubaker and Peng Li, respectively an undergrad and grad student in CS at Utah.]

Two courses I’ve taught at Utah, CS 4400 and CS 5785, have assignments where students write short integer functions for which we — the instructors — have automatic graders. In 4400 the students get access to the test framework while they work, in 5785 they do not.

Klee is a whitebox test case generation tool: it looks at a piece of code and attempts to generate a collection of inputs that cover all possible execution paths. I was curious how Klee would fare when compared to the hand-written testers, both of which have been in use for more than five years.

One way to use Klee is like this:

  1. Klee generates test cases for a piece of code written by student
  2. We run the test cases through the student code and the reference code, looking for inequal output

This works poorly: Klee fails to find lots of bugs found by the hand-written testers. Klee can also be used to perform differential whitebox test case generation (which I already briefly discussed here):

  1. Klee generates test cases for the code assert(student_code(input)==reference_code(input))
  2. We run the test cases through the student code and the reference code, looking for inequal output

This works very well: Klee’s tests not only find all bugs found by the hand-written testers, but also some new bugs are identified. What appears to be going on is that the hand-written testers use values that the instructors’ intuitions indicate would be good for triggering bugs. But students do not always write code corresponding to the instructors’ intuitions; in that case, paths are missed by the hand-written tester whereas Klee goes ahead and thoroughly tests them.

Results

From CS 4400 we have 1050 bit puzzle solutions from students. 84 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus seven more, for a total of 91.

From CS 5785 we have 336 saturating operations from students. 115 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus six more, for a total of 121.

Differential Klee vs. Exhaustive Testing

Klee can find more bugs, but can it find all of them? To explore this, we exhaustively tested each function that has ≤32 bits of input.

For the bit puzzles, exhaustive testing revealed some bugs not found by Klee. I’ll provide details in a subsequent post.

For the saturating operations, exhaustive testing revealed a single bug not found be Klee. The problem turned out to be code that computed the maximum and minimum integer values like this:

int type = 8*sizeof(signed char);
min = -(pow(2,type)/2);
max = min+1;
max = 0-max;

Clang/LLVM fails to optimize away the pow() call, which remains in the bitcode where (for reasons I haven’t looked into) it prevents Klee from generating good test cases. Klee properly found the bug in this function when the code was changed to this:

min = CHAR_MIN;
max = CHAR_MAX;

The student couldn’t have written the code this way, however, since their math operations had to work for four different integer widths.

Conclusion

A 5%-8% increase in the number of buggy functions identified is pretty significant. Also, by using Klee we no longer need to write a test case generator by hand (although I still would, since I wouldn’t necessarily trust Klee to do a good job without some supporting evidence).

Quality Not Quantity?

The perverse incentives for academics to maximize publication and citation counts, as opposed to maximizing the quality and impact of the underlying research, are well-known. Stan Trimble’s recent letter to Nature suggests a partial solution: academic institutions should limit the number of publications that are part of a tenure or promotion case. This is simple enough that it might actually work. I’d suggest a three-publication limit for faculty applications, five for tenure, and perhaps seven for full professor cases. A bit of gaming would be required to navigate certain situations:

  • Do I list a mediocre paper that appeared at a top conference, or a stronger one from a minor venue?
  • Do I list a major paper where I played a minor role, as opposed to a less impressive single-author paper?
  • If I’m switching areas, do I focus on the previous area (which is what my letter writers are likely to discuss) or the new one?

There’s plenty of gamesmanship in the process already, so these shouldn’t be a big deal.

A more extreme version (attributed to Matthias Felleisen, though I don’t know if he originated it) is: faculty are hired as full professors. Upon publishing their first paper, they are demoted to associate professor, then assistant professor, and then after the third publication they are required to leave academia.

The idea behind these various proposals, of course, is to cause people to view publication as a precious resource, rather than as a metric to be maximized. There would be pushback from researchers who act as paper machines, but I’d expect it could be overcome.