Differential Whitebox Testing Is Good

[This post is based on data gathered by Chad Brubaker and Peng Li, respectively an undergrad and grad student in CS at Utah.]

Two courses I’ve taught at Utah, CS 4400 and CS 5785, have assignments where students write short integer functions for which we — the instructors — have automatic graders. In 4400 the students get access to the test framework while they work, in 5785 they do not.

Klee is a whitebox test case generation tool: it looks at a piece of code and attempts to generate a collection of inputs that cover all possible execution paths. I was curious how Klee would fare when compared to the hand-written testers, both of which have been in use for more than five years.

One way to use Klee is like this:

  1. Klee generates test cases for a piece of code written by a student
  2. We run the test cases through the student code and the reference code, looking for unequal outputs

This works poorly: Klee misses many bugs that the hand-written testers find. Klee can also be used to perform differential whitebox test case generation (which I already briefly discussed here):

  1. Klee generates test cases for the code assert(student_code(input)==reference_code(input))
  2. We run the test cases through the student code and the reference code, looking for unequal outputs

This works very well: Klee’s tests not only find every bug found by the hand-written testers, they also identify some new ones. What appears to be going on is that the hand-written testers use values that the instructors’ intuitions indicate would be good for triggering bugs. But students do not always write code corresponding to the instructors’ intuitions; in that case, paths are missed by the hand-written tester whereas Klee goes ahead and thoroughly tests them.


From CS 4400 we have 1050 bit puzzle solutions from students. 84 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus seven more, for a total of 91.

From CS 5785 we have 336 saturating operations from students. 115 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus six more, for a total of 121.

Differential Klee vs. Exhaustive Testing

Klee can find more bugs, but can it find all of them? To explore this, we exhaustively tested each function that has ≤32 bits of input.

For the bit puzzles, exhaustive testing revealed some bugs not found by Klee. I’ll provide details in a subsequent post.

For the saturating operations, exhaustive testing revealed a single bug not found by Klee. The problem turned out to be code that computed the maximum and minimum integer values like this:

int type = 8*sizeof(signed char);
min = -(pow(2,type)/2);
max = min+1;
max = 0-max;

Clang/LLVM fails to optimize away the pow() call, which remains in the bitcode where (for reasons I haven’t looked into) it prevents Klee from generating good test cases. Klee properly found the bug in this function when the code was changed to this:

min = CHAR_MIN;
max = CHAR_MAX;

The student couldn’t have written the code this way, however, since their math operations had to work for four different integer widths.


A 5%-8% increase in the number of buggy functions identified is pretty significant. Also, by using Klee we no longer need to write a test case generator by hand (although I still would, since I wouldn’t necessarily trust Klee to do a good job without some supporting evidence).

Visualizing Math Bugs

After getting too tired to work tonight, I realized I had no Netflix and that I’m bored with the books I’m reading. Therefore, I present visualizations of several solutions I’ve received to my saturating arithmetic homework over the years.

This is what a correct saturating signed add looks like, where the z-axis is the output:

Here are some solutions that are incorrect. I love the ones that look like origami cranes.

A Few Thoughts About Path Coverage

Klee isn’t the only tool in its class, nor was it the first, but it’s open source and well-engineered and it works. Klee’s goal is to generate a set of test inputs that collectively induce path coverage in a system under test. One of the scenarios I was curious about is the one where Klee achieves full path coverage but still fails to expose a bug, for example because a stronger kind of coverage such as value coverage is required. Coming up with these examples is easy. Consider the following function for computing the number of bits set in a 64-bit word:

const int pop8[256] = {
  0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 4, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};

int popcount (unsigned long long x) {
  int c=0;
  int i;
  for (i=0; i<8; i++) {
    c += pop8[x&0xff];
    x >>= 8;
  }
  return c;
}

This function admits only a single path of execution, and therefore Klee comes up with only a single test case: the 64-bit word containing all zero bits. This popcount function is, in fact, wrong (one of the values in the table is off by one) but Klee’s input does not trigger the bug. At this point we’re more or less stuck — Klee cannot find the bug and we’re back to code inspection or random testing or something. But what if we have a second popcount implementation?

int popcount_slow (unsigned long long x) {
  int c=0;
  int i;
  for (i=0; i<64; i++) {
    c += x&1;
    x >>= 1;
  }
  return c;
}

Unlike the previous implementation, this one is correct. Like the previous implementation, it has only a single path — so again Klee will give us just one test case. The way to find an input where the two functions return different answers is to just ask Klee directly:

assert (popcount(x) == popcount_slow(x));

Now there are two paths, a succeeding one and a failing one, and it takes Klee just a few seconds to find inputs that execute both paths. This is a pretty impressive result given the large input space. Of course, in this particular case random testing using uniformly-distributed 64-bit inputs finds this bug very quickly: any of the eight input bytes can hit the bad table entry, so roughly one in 32 random inputs triggers the bug. However, we could write code where the probability of a random input triggering the bug is arbitrarily small (note however that fooling a whitebox test case generator is not totally trivial: many of the tempting ways to add a bug would also add a new code path, which Klee would duly explore).

The interesting thing about path coverage is that it’s massively insufficient for finding all bugs in real code, and yet it is also impossible to achieve 100% feasible path coverage on real software systems due to the path explosion problem. The conclusion I draw is that future Klee-like tools should provide the option of targeting even stronger kinds of coverage (path + boundary value coverage, for example) and that these tools will be most useful in unit-testing of smallish codes.

The problem in testing my code above stems from its use of a lookup table. Perhaps people who test software for a living have come up with reasonably generic ways to detect problems in lookup tables, but I’m unaware of them. Notice that MC/DC coverage — used in validating Level A software in DO-178B, and one of the stronger coverage metrics that is actually used in practice — is not any better than path coverage for finding the popcount bug.

Undefined Integer Behaviors in Student Code, Part 1

[This post is based on material from Chad Brubaker, a really smart CS undergrad at Utah who did all the work getting these data. The integer undefined behavior checker was created by my student Peng Li.]

Integer undefined behaviors in C/C++, such as INT_MAX+1 or 1<<-1, create interesting opportunities for compiler optimizations and they also make programming in C/C++ not unlike crossing a minefield while blindfolded. This piece describes the results of running a checker for integer undefined behaviors on code submitted by several years’ worth of students in Utah’s CS 4400. The checker simply inserts checking logic in front of any potentially-undefined operation. Then, we ran each student’s code on the same inputs used by the test harness the students were given as part of the assignment. CS 4400 is based on Computer Systems: A Programmer’s Perspective, by Bryant and O’Hallaron. It’s a great textbook and I think students get a lot out of the course. I taught it three times in the early/mid 2000s and others have taught it since then.

The assignment is to write C code solving various small “bit puzzles.” Each solution must be straight-line code and can only use C’s bitwise operators: ! ~ & ^ | + << >>. There are a few additional restrictions that are specific to individual problems, such as limiting the number of operations and further restrictions on operator choice.

Students are given reference solutions for the bit puzzles (written in C, but using loops and otherwise not following the rules) and also they are given a harness that runs their code against the reference solutions on some fixed test vectors.

The results below divide students’ solutions into four bins:

  • correct results and no undefined behavior
  • wrong results (for at least one input) and no undefined behavior
  • correct results but contains undefined behavior (for at least one input)
  • wrong results and contains undefined behavior

The most interesting case is where code gives the right answers despite executing undefined behaviors. These solutions are basically just lucky: the version of GCC installed on our lab machines, at the optimization level specified by the makefile, happens to translate the undefined operations into code that gives the expected results. But even the same code, on the same compiler, using the same optimization options, could break if called from a different context.


Extract the least significant byte from an int.

              correct   wrong
no undefined      105       0
undefined           0       0


Return 1 if two ints are equal, and zero otherwise (note that == is not an allowed operator).

              correct   wrong
no undefined       97       0
undefined           8       0

All of the undefined behaviors were overflow of signed addition.


Return 1 if any even bit of an int is set, otherwise return 0.

              correct   wrong
no undefined      102       3
undefined           0       0


Clear all but the N least significant bits of an int.

              correct   wrong
no undefined       67       1
undefined          34       3
  • 16 solutions contained signed addition overflows
  • 4 contained a signed right-shift by negative or past bitwidth
  • 17 contained a signed left-shift by negative or past bitwidth


Rotate the bits in an int to the right.

              correct   wrong
no undefined        7       3
undefined          87       8
  • 2 overflowed a signed addition
  • 8 contained an unsigned left-shift by negative or past bitwidth
  • 7 contained an unsigned right-shift by negative or past bitwidth
  • 8 contained a signed right-shift by negative or past bitwidth
  • 70 contained a signed left-shift by negative or past bitwidth


Generate an int with a single set bit corresponding to the rightmost set bit in another int.

              correct   wrong
no undefined        1      11
undefined          93       0

93 solutions overflowed a signed addition — only a single student managed to avoid this problem and also return the right answers.


Divide an int by four, rounding towards zero.

              correct   wrong
no undefined       94       8
undefined           0       3

3 solutions overflowed a signed addition.


Return 1 if a value can be represented as an N-bit 2’s complement integer.

              correct   wrong
no undefined       79      12
undefined          10       4
  • 9 solutions overflowed a signed addition
  • 2 solutions contained a signed right-shift by negative or past bitwidth
  • 3 solutions contained a signed left-shift by negative or past bitwidth

This puzzle was interesting because it was the only one where we found undefined behavior in the reference solution, which executes this code to compute the largest representable integer:

tmax_n = (1<<(n-1)) - 1;

When n is 32, 1<<31 evaluates to 0x80000000, which is INT_MIN on a machine where int is 32 bits; subtracting 1 from INT_MIN is then an operation with undefined behavior. Moreover, in C99, evaluating 1<<31 is itself undefined.


Return 1 if one int is larger than another, and zero otherwise.

              correct   wrong
no undefined        4      11
undefined          84       6

90 solutions overflowed a signed addition.


Return 1 if subtracting one int from another overflows, and zero otherwise.

              correct   wrong
no undefined        0      13
undefined          89       3

92 solutions overflowed a signed addition — nobody avoided this hazard.

What’s the Problem?

The problem is that a large fraction of our junior-level CS undergraduates are producing code that executes integer undefined behaviors. At best, this kind of code is not portable across compilers, platforms, or even changes in compiler version, optimization options, or calling context. At worst, undefined behaviors are time bombs that will lead to exploitable vulnerabilities or other serious malfunctions in software systems. I want our graduates to understand what undefined behavior means and how to avoid it, not because C/C++ programming is so great, but because the concepts are much more broadly applicable.

What’s the Solution?

Fixing the problem in the context of Utah’s CS 4400 course should be straightforward:

  • Include some lecture material on undefined behaviors
  • Make Peng’s undefined behavior checker part of the test suite that students are given as part of the assignment; since it gives nice clear error messages, the output should be directly actionable

I’ll see if Erin, the current instructor, is interested in working with me to implement this for Fall 2011. We’ll see how it goes.

[Note: I’ve deliberately left code solving the bit puzzles out of this post, and I will reject any comments containing solutions.]

More Saturating Arithmetic

In a previous post I guessed that 91 bytes was close to the minimum size for implementing signed and unsigned saturating 32-bit addition and subtraction on x86. A commenter rightly pointed out that it should be possible to do better. Attached is my best shot: 83 bytes. Given the crappy x86 calling convention I’m going to guess it’s hard to do much better unless the saturating SSE instructions offer a way. Or can the branches in the signed functions be avoided? It seems like it, but I couldn’t figure out how.

# unsigned saturating add
	movl	4(%esp), %eax
	xor     %edx, %edx
	not     %edx
	addl	8(%esp), %eax
	cmovb	%edx, %eax
	ret

# unsigned saturating subtract
	movl	4(%esp), %eax
	xor     %edx, %edx
	subl	8(%esp), %eax
	cmovb	%edx, %eax
	ret

# signed saturating add
	movl	4(%esp), %eax
	movl	$2147483647, %edx
	movl	$-2147483648, %ecx
	addl	8(%esp), %eax
	jno     out1
	cmovg	%edx, %eax
	cmovl	%ecx, %eax
out1:	ret

# signed saturating subtract
	movl	4(%esp), %eax
	movl	$2147483647, %edx
	movl	$-2147483648, %ecx
	subl	8(%esp), %eax
	jno     out2
	cmovg	%edx, %eax
	cmovl	%ecx, %eax
out2:	ret

Another question: How difficult is it to create a superoptimizer that turns obvious C formulations of saturating operations into the code shown here? This is beyond the capabilities of current superoptimizers, I believe, while being obviously feasible.

Feedback is appreciated.

Fun With Saturating Arithmetic

An assignment that I often give my Advanced Embedded Systems class early in the semester is to implement saturating versions of signed and unsigned addition and subtraction. Saturating operations “stick” at the maximum or minimum value. (INT_MAX +sat 1) therefore evaluates to INT_MAX. The only wrinkle in the assignment is that solutions must be capable of being compiled (via a typedef) for integers of 8, 16, 32, or 64 bits. I give this assignment for several reasons:

  • I like people to see saturating arithmetic since it is useful in some embedded applications
  • it should be easy and fun, but ends up being a bit tricky
  • the (typically) poor performance of the class helps motivate later course material on testing and on integer problems

Over the years I’ve ended up with nearly 100 submissions from students. Tonight I ran them all through my grading script and found that 23 students got all 4 functions right, for all integer widths. Also I rediscovered a few gems:

Segfaults From Integer Code

Two students managed to create solutions that segfault. How is this impressive result even possible, you ask? It was accomplished by implementing addition and subtraction as mutually recursive functions and failing to stop the recursion. Thus, the crash is a stack overflow.

Exponential Running Time

Two more students implemented iterative solutions with running time exponential in the size of the inputs. This can be accomplished for unsigned addition by, for example, repeatedly incrementing one of the arguments and decrementing the other, stopping when the first saturates or the second reaches zero. While spectacularly inefficient, one of these solutions was correct, as far as I could tell. I had to give the student full points because the assignment failed to specify that the functions should terminate in a reasonable time in the 64-bit case.

The Most Creative Way To Pass Tests

My test harness invokes the students’ saturating operations with random inputs as well as values at or near corner cases. By far the most creative wrong solution came from a student who implemented three of the four functions correctly, but forgot to return a value from the fourth. Of course this is an undefined behavior in C, but not a compile time error. Astoundingly, Intel CC, a recent Clang snapshot, and GCC 3.4 all report zero errors from this student’s code, when compiling with optimizations. The reason for this is entertaining:

  1. The compiler, seeing a function that always relies on undefined behavior (I call these Type 3 functions), turns it into a nop.
  2. My test harness calls the reference version of the saturating function, which leaves its result in eax.
  3. The value is copied into ebx, but eax still retains a copy.
  4. The student’s code is called. It is supposed to store its result in eax, but remember that it is now a nop.
  5. The test harness compares eax against ebx. Of course they are equal, and so the test passes.

Have I said before that the way a C compiler silently destroys Type 3 functions is effectively evil? I have, and I’m saying it again.

A Corner Case

Each year a few people implement saturating signed subtraction like this:

mysint sat_signed_sub (mysint a, mysint b)
{
  return sat_signed_add (a, -b);
}

This is tantalizingly simple, but wrong: consider what happens when b is the most negative representable value, for which -b is not representable.

Big Code == Wrong Code

After being compiled (for the 32-bit integer case, using a recent version of gcc for x86 at -Os) solutions I’ve received range from 76 to 1159 bytes of object code, totaled across all four functions. The largest correct solution is 226 bytes. Bigger solutions tend to be more complicated, and complicated solutions are usually wrong.

The Best Code

I tried quite a few compilers and could not manage to generate a correct solution smaller than 91 bytes of x86 code (again, for the 32-bit case). This is from a recent GCC snapshot. The C code for this solution was produced by one of the brightest undergrads I’ve had the pleasure of teaching; it succeeds by using bit operators instead of arithmetic. I’m sure a good assembly programmer could shave some bytes off of this, but I suspect that the optimal solution is not a whole lot smaller. I’d love to include the code here but would like to avoid invalidating this entertaining assignment.