Undefined Integer Behaviors in Student Code, Part 2

[This post is based on data gathered by my student Peng Li. He also wrote the undefined behavior checker.]

The other day I posted about undefined integer behaviors in code written by students in a class I used to teach. This post is more of the same, this time from CS 5785, my advanced embedded systems course. Early in the course I often ask the students to implement saturating versions of signed and unsigned addition and subtraction. Their solutions are required to work regardless of whether an integer datatype is defined to be 8, 16, 32, or 64 bits long. Students are not given any kind of test suite, but I repeatedly emphasize that they should write lots of test cases and inspect the output. The assignment is graded by a tester I wrote that uses a combination of a few hundred values that lie near interesting boundaries, and a few thousand random values. The main reason I give this assignment is to get students warmed up a bit and to motivate some material later on in the course. The assignment is here (see part 2).

Unsigned Saturating Operations

These functions are easy to get right, and also it’s not hard to avoid undefined behavior since the math is unsigned.

ADD correct wrong
no undefined 66 14
undefined 1 3
SUBTRACT correct wrong
no undefined 74 10
undefined 0 0

Signed Saturating Operations

Although the amount of code required to implement these is small (my solutions are 5 lines each), they’re not completely straightforward.

ADD correct wrong
no undefined 15 10
undefined 28 31
SUBTRACT correct wrong
no undefined 11 4
undefined 20 49

The undefined behaviors were a mix of shift errors and overflowing addition and subtraction. Many students used some sort of shift as part of computing the maximum / minimum representable integer.

My Code Was Wrong Too

It turns out that my reference solutions (written years ago, before I understood the horror that is undefined behavior in C) contained two signed overflows. Ouch.

Can We Avoid Undefined Behavior?

90% of the signed operations written by students prior to 2010 executed an undefined integer behavior. In Fall 2010 I included some lecture material on undefined behavior and I also gave students access to Peng’s undefined behavior checker and told them to use it. The result? 46% of the 52 signed functions (written by the 26 students) executed an undefined integer behavior. This is a significant improvement, but still not very good. The real problem — judging from the large number of incorrect solutions — is that many students didn’t test their code very well. Next time I’ll encourage students to graph the functions implemented by their code; this makes some kinds of bugs obvious.

Differential Whitebox Testing Is Good

[This post is based on data gathered by Chad Brubaker and Peng Li, respectively an undergrad and grad student in CS at Utah.]

Two courses I’ve taught at Utah, CS 4400 and CS 5785, have assignments where students write short integer functions for which we — the instructors — have automatic graders. In 4400 the students get access to the test framework while they work, in 5785 they do not.

Klee is a whitebox test case generation tool: it looks at a piece of code and attempts to generate a collection of inputs that cover all possible execution paths. I was curious how Klee would fare when compared to the hand-written testers, both of which have been in use for more than five years.

One way to use Klee is like this:

  1. Klee generates test cases for a piece of code written by student
  2. We run the test cases through the student code and the reference code, looking for inequal output

This works poorly: Klee fails to find lots of bugs found by the hand-written testers. Klee can also be used to perform differential whitebox test case generation (which I already briefly discussed here):

  1. Klee generates test cases for the code assert(student_code(input)==reference_code(input))
  2. We run the test cases through the student code and the reference code, looking for inequal output

This works very well: Klee’s tests not only find all bugs found by the hand-written testers, but also some new bugs are identified. What appears to be going on is that the hand-written testers use values that the instructors’ intuitions indicate would be good for triggering bugs. But students do not always write code corresponding to the instructors’ intuitions; in that case, paths are missed by the hand-written tester whereas Klee goes ahead and thoroughly tests them.

Results

From CS 4400 we have 1050 bit puzzle solutions from students. 84 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus seven more, for a total of 91.

From CS 5785 we have 336 saturating operations from students. 115 buggy functions are identified by the hand-written test suite. Differential test case generation using Klee finds all of these bugs plus six more, for a total of 121.

Differential Klee vs. Exhaustive Testing

Klee can find more bugs, but can it find all of them? To explore this, we exhaustively tested each function that has ≤32 bits of input.

For the bit puzzles, exhaustive testing revealed some bugs not found by Klee. I’ll provide details in a subsequent post.

For the saturating operations, exhaustive testing revealed a single bug not found be Klee. The problem turned out to be code that computed the maximum and minimum integer values like this:

int type = 8*sizeof(signed char);
min = -(pow(2,type)/2);
max = min+1;
max = 0-max;

Clang/LLVM fails to optimize away the pow() call, which remains in the bitcode where (for reasons I haven’t looked into) it prevents Klee from generating good test cases. Klee properly found the bug in this function when the code was changed to this:

min = CHAR_MIN;
max = CHAR_MAX;

The student couldn’t have written the code this way, however, since their math operations had to work for four different integer widths.

Conclusion

A 5%-8% increase in the number of buggy functions identified is pretty significant. Also, by using Klee we no longer need to write a test case generator by hand (although I still would, since I wouldn’t necessarily trust Klee to do a good job without some supporting evidence).

Visualizing Math Bugs

After getting too tired to work tonight, I realized I had no Netflix and that I’m bored with the books I’m reading. Therefore, I present visualizations of several solutions I’ve received to my saturating arithmetic homework over the years.

This is what a correct saturating signed add looks like, where the z-axis is the output:

Here are some solutions that are incorrect. I love the ones that look like origami cranes.





















A Few Thoughts About Path Coverage

Klee isn’t the only tool in its class, nor was it the first, but it’s open source and well-engineered and it works. Klee’s goal is to generate a set of test inputs that collectively induce path coverage in a system under test. One of the scenarios I was curious about is the one where Klee achieves full path coverage but still fails to expose a bug, for example because a stronger kind of coverage such as value coverage is required. Coming up with these examples is easy. Consider the following function for computing the number of bits set in a 64-bit word:

const int pop8[256] = {
  0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 4, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};
int popcount (unsigned long long x) {
  int c=0;
  int i;
  for (i=0; i<8; i++) {
    c += pop8[x&0xff];
    x >>= 8;
  }
  return c;
}

This function admits only a single path of execution, and therefore Klee comes up with only a single test case: the 64-bit word containing all zero bits. This popcount function is, in fact, wrong (one of the values in the table is off by one) but Klee’s input does not trigger the bug. At this point we’re more or less stuck — Klee cannot find the bug and we’re back to code inspection or random testing or something. But what if we have a second popcount implementation?

int popcount_slow (unsigned long long x) {
  int c=0;
  int i;
  for (i=0; i<64; i++) {
    c += x&1;
    x >>= 1;
  }
  return c;
}

Unlike the previous implementation, this one is correct. Like the previous implementation, it has only a single path — so again Klee will give us just one test case. The way to find an input where the two functions return different answers is to just ask Klee directly:

assert (popcount(x) == popcount_slow(x));

Now there are two paths, a succeeding one and a failing one, and it takes Klee just a few seconds to find inputs that execute both paths. This is a pretty impressive result given the large input space. Of course, in this particular case random testing using uniformly-distributed 64-bit inputs finds this bug very quickly since approximately one in eight inputs triggers the bug. However, we could write code where the probability of a random input triggering the bug is arbitrarily small (note however that fooling a whitebox test case generator is not totally trivial: many of the tempting ways to add a bug would also add a new code path, which Klee would duly explore).

The interesting thing about path coverage is that it’s massively insufficient for finding all bugs in real code, and yet it is also impossible to achieve 100% feasible path coverage on real software systems due to the path explosion problem. The conclusion I draw is that future Klee-like tools should provide the option of targeting even stronger kinds of coverage (path + boundary value coverage, for example) and that these tools will be most useful in unit-testing of smallish codes.

The problem in testing my code above stems from its use of a lookup table. Perhaps people who test software for a living have come up with reasonably generic ways to detect problems in lookup tables, but I’m unaware of them. Notice that MC/DC coverage — used in validating Level A software in DO-178B, and one of the stronger coverage metrics that is actually used in practice — is not any better than path coverage for finding the popcount bug.

Undefined Integer Behaviors in Student Code, Part 1

[This post is based on material from Chad Brubaker, a really smart CS undergrad at Utah who did all the work getting these data. The integer undefined behavior checker was created by my student Peng Li.]

Integer undefined behaviors in C/C++, such as INT_MAX+1 or 1<<-1, create interesting opportunities for compiler optimizations and they also make programming in C/C++ not unlike crossing a minefield while blindfolded. This piece describes the results of running a checker for integer undefined behaviors on code submitted by several years’ worth of students in Utah’s CS 4400. The checker simply inserts checking logic in front of any potentially-undefined operation. Then, we ran each student’s code on the same inputs used by the test harness the students were given as part of the assignment. CS 4400 is based on Computer Systems: A Programmer’s Perspective, by Bryant and O’Hallaron. It’s great textbook and I think students get a lot out of the course. I taught it three times in the early/mid 2000s and others have taught it since then.

The assignment is to write C code solving various small “bit puzzles.” Each solution must be straight-line code and can only use C’s bitwise operators: ! ~ & ^ | + << >>. There are a few additional restrictions that are specific to individual problems, such as limiting the number of operations and further restrictions on operator choice.

Students are given reference solutions for the bit puzzles (written in C, but using loops and otherwise not following the rules) and also they are given a harness that runs their code against the reference solutions on some fixed test vectors.

The results below divide students’ solutions into four bins:

  • correct results and no undefined behavior
  • wrong results (for at least one input) and no undefined behavior
  • correct results but contains undefined behavior (for at least one input)
  • wrong results and contains undefined behavior

The most interesting case is where code gives the right answers despite executing undefined behaviors. These solutions are basically just lucky. In other words, the version of GCC installed on our lab machines, at the optimization level specified by the makefile, happens to have a particular behavior. But even the same code, on the same compiler, using the same optimization options, could break if called from a different context.

get_least_significant_byte

Extract the least significant byte from an int.

correct wrong
no undefined 105 0
undefined 0 0

is_equal

Return 1 if two ints are equal, and zero otherwise (note that == is not an allowed operator).

correct wrong
no undefined 97 0
undefined 8 0

All of the undefined behaviors were overflow of signed addition.

any_even_one

Return 1 if any even bit of an int is set, otherwise return 0.

correct wrong
no undefined 102 3
undefined 0 0

clear_all_but_lower_bits

Clear all but the N least significant bits of an int.

correct wrong
no undefined 67 1
undefined 34 3
  • 16 solutions contained signed addition overflows
  • 4 contained a signed right-shift by negative or past bitwidth
  • 17 contained a signed left-shift by negative or past bitwidth

rotate_right

Rotate the bits in an int to the right.

correct wrong
no undefined 7 3
undefined 87 8
  • 2 overflowed a signed addition
  • 8 contained an unsigned left-shift by negative or past bitwidth
  • 7 contained an unsigned right-shift by negative or past bitwidth
  • 8 contained a signed right-shift by negative or past bitwidth
  • 70 contained a signed left-shift by negative or past bitwidth

indicate_rightmost_one

Generate an int with a single set bit corresponding to the rightmost set bit in another int.

correct wrong
no undefined 1 11
undefined 93 0

93 solutions overflowed a signed addition — only a single student managed to avoid this problem and also return the right answers

divide_by_four

Divide an int by four, rounding towards zero.

correct wrong
no undefined 94 8
undefined 0 3

3 solutions overflowed a signed addition

fits_n_bits

Return 1 if a value can be represented as an N-bit 2’s complement integer.

correct wrong
no undefined 79 12
undefined 10 4
  • 9 solutions overflowed a signed addition
  • 2 solutions contained a signed right-shift by negative or past bitwidth
  • 3 solutions contained a signed left-shift by negative or past bitwidth

This puzzle was interesting because it was the only one where we found undefined behavior in the reference solution, which executes this code to compute the largest representable integer:

tmax_n = (1<<(n-1)) - 1;

1<<31 evaluates to 0x80000000, which is INT_MIN on a machine where int is 32 bits. Then, evaluating INT_MIN-1 is an operation with undefined behavior.

Moreover, in C99, evaluating 1<<31 is undefined.

is_greater

Return 1 if one int is larger than another, and zero otherwise.

correct wrong
no undefined 4 11
undefined 84 6

90 solutions overflowed a signed addition

subtract_overflows

Return 1 if subtracting one int from another overflows, and zero otherwise.

correct wrong
no undefined 0 13
undefined 89 3

92 solutions overflowed a signed addition — nobody avoided this hazard

What’s the Problem?

The problem is that a large fraction of our junior-level CS undergraduates are producing code that executes integer undefined behaviors. At best, this kind of code is not portable across compilers, platforms, or even changes in compiler version, optimization options, or calling context. At worst, undefined behaviors are time bombs that will lead to exploitable vulnerabilities or other serious malfunctions in software systems. I want our graduates to understand what undefined behavior means and how to avoid it, not because C/C++ programming is so great, but because the concepts are much more broadly applicable.

What’s the Solution?

Fixing the problem in the context of Utah’s CS 4400 course should be straightforward:

  • Include some lecture material on undefined behaviors
  • Make Peng’s undefined behavior checker part of the test suite that students are given as part of the assignment; since it gives nice clear error messages, the output should be directly actionable

I’ll see if Erin, the current instructor, is interested in working with me to implement this for Fall 2011. We’ll see how it goes.

[Note: I’ve deliberately left code solving the bit puzzles out of this post, and I will reject any comments containing solutions.]

More Saturating Arithmetic

In a previous post I guessed that 91 bytes was close to the minimum size for implementing signed and unsigned saturating 32-bit addition and subtraction on x86. A commenter rightly pointed out that it should be possible to do better. Attached is my best shot: 83 bytes. Given the crappy x86 calling convention I’m going to guess it’s hard to do much better unless the saturating SSE instructions offer a way. Or can the branches in the signed functions be avoided? It seems like it, but I couldn’t figure out how.

sat_unsigned_add:
	movl	4(%esp), %eax
	xor     %edx, %edx
	not     %edx
	addl	8(%esp), %eax
	cmovb	%edx, %eax
	ret

sat_unsigned_sub:
	movl	4(%esp), %eax
	xor     %edx, %edx
	subl	8(%esp), %eax
	cmovb	%edx, %eax
	ret

sat_signed_add:
	movl	4(%esp), %eax
	movl	$2147483647, %edx
	movl	$-2147483648, %ecx
	addl	8(%esp), %eax
	jno     out1
	cmovg	%edx, %eax
	cmovl	%ecx, %eax
out1:	ret

sat_signed_sub:
	movl	4(%esp), %eax
	movl	$2147483647, %edx
	movl	$-2147483648, %ecx
	subl	8(%esp), %eax
	jno     out2
	cmovg	%edx, %eax
	cmovl	%ecx, %eax
out2:	ret

Another question: How difficult is it to create a superoptimizer that turns obvious C formulations of saturating operations into the code shown here? This is beyond the capabilities of current superoptimizers, I believe, while being obviously feasible.

Feedback is appreciated.