Undefined Integer Behaviors in Student Code, Part 1

[This post is based on material from Chad Brubaker, a really smart CS undergrad at Utah who did all the work getting these data. The integer undefined behavior checker was created by my student Peng Li.]

Integer undefined behaviors in C/C++, such as INT_MAX+1 or 1<<-1, create interesting opportunities for compiler optimizations, and they also make programming in C/C++ not unlike crossing a minefield while blindfolded. This piece describes the results of running a checker for integer undefined behaviors on code submitted by several years’ worth of students in Utah’s CS 4400. The checker simply inserts checking logic in front of any potentially-undefined operation. Then, we ran each student’s code on the same inputs used by the test harness the students were given as part of the assignment. CS 4400 is based on Computer Systems: A Programmer’s Perspective, by Bryant and O’Hallaron. It’s a great textbook and I think students get a lot out of the course. I taught it three times in the early/mid 2000s and others have taught it since then.
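To give a flavor of the instrumentation, here is a minimal sketch of a pre-check for signed addition; this is my own illustration, not code from Peng’s checker, and checked_add is a hypothetical name. Note that the check itself must be phrased so that it cannot overflow:

#include <limits.h>
#include <stdio.h>

/* Report if a + b would overflow a signed int, then perform the
   addition anyway, since the point is to observe the student code as
   it executes (possibly undefined) operations. */
static int checked_add(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        fprintf(stderr, "undefined: signed overflow in %d + %d\n", a, b);
    return a + b;
}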

The assignment is to write C code solving various small “bit puzzles.” Each solution must be straight-line code and may use only a restricted set of C operators: ! ~ & ^ | + << >>. There are a few additional restrictions that are specific to individual problems, such as limiting the number of operations and further restrictions on operator choice.

Students are given reference solutions for the bit puzzles (written in C, but using loops and otherwise not following the rules), along with a harness that runs their code against the reference solutions on some fixed test vectors.

The results below divide students’ solutions into four bins:

  • correct results and no undefined behavior
  • wrong results (for at least one input) and no undefined behavior
  • correct results but contains undefined behavior (for at least one input)
  • wrong results and contains undefined behavior

The most interesting case is where code gives the right answers despite executing undefined behaviors. These solutions are basically just lucky. In other words, the version of GCC installed on our lab machines, at the optimization level specified by the makefile, happens to have a particular behavior. But even the same code, on the same compiler, using the same optimization options, could break if called from a different context.
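Here is a hypothetical illustration (it is not a solution to any of the puzzles below) of what “lucky” looks like. This function executes an operation with undefined behavior whenever x is INT_MAX:

#include <limits.h>

/* At -O0, a typical 32-bit x86 GCC wraps INT_MAX + 1 around to
   INT_MIN and this returns 0 when x == INT_MAX. At -O2, the compiler
   is entitled to assume signed overflow never happens and may fold
   the comparison to 1. Same code, same compiler, different answers. */
int looks_smaller(int x)
{
    return x < x + 1;
}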

get_least_significant_byte

Extract the least significant byte from an int.

                correct   wrong
no undefined        105       0
undefined             0       0

is_equal

Return 1 if two ints are equal, and zero otherwise (note that == is not an allowed operator).

                correct   wrong
no undefined         97       0
undefined             8       0

All of the undefined behaviors were signed addition overflows.

any_even_one

Return 1 if any even bit of an int is set, otherwise return 0.

                correct   wrong
no undefined        102       3
undefined             0       0

clear_all_but_lower_bits

Clear all but the N least significant bits of an int.

                correct   wrong
no undefined         67       1
undefined            34       3
  • 16 solutions contained signed addition overflows
  • 4 contained a signed right-shift by a negative amount or past the bitwidth
  • 17 contained a signed left-shift by a negative amount or past the bitwidth

rotate_right

Rotate the bits in an int to the right.

                correct   wrong
no undefined          7       3
undefined            87       8
  • 2 overflowed a signed addition
  • 8 contained an unsigned left-shift by a negative amount or past the bitwidth
  • 7 contained an unsigned right-shift by a negative amount or past the bitwidth
  • 8 contained a signed right-shift by a negative amount or past the bitwidth
  • 70 contained a signed left-shift by a negative amount or past the bitwidth

indicate_rightmost_one

Generate an int with a single set bit corresponding to the rightmost set bit in another int.

                correct   wrong
no undefined          1      11
undefined            93       0

93 solutions overflowed a signed addition — only a single student managed to avoid this problem and also return the right answers.

divide_by_four

Divide an int by four, rounding towards zero.

                correct   wrong
no undefined         94       8
undefined             0       3

3 solutions overflowed a signed addition.

fits_n_bits

Return 1 if a value can be represented as an N-bit 2’s complement integer.

                correct   wrong
no undefined         79      12
undefined            10       4
  • 9 solutions overflowed a signed addition
  • 2 solutions contained a signed right-shift by a negative amount or past the bitwidth
  • 3 solutions contained a signed left-shift by a negative amount or past the bitwidth

This puzzle was interesting because it was the only one where we found undefined behavior in the reference solution, which executes this code to compute the largest value representable in n bits:

tmax_n = (1<<(n-1)) - 1;

1<<31 evaluates to 0x80000000, which is INT_MIN on a machine where int is 32 bits. Then, evaluating INT_MIN-1 is an operation with undefined behavior.

Moreover, in C99, evaluating 1<<31 is undefined.
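One way to compute the same value without executing undefined behavior, and this is my own sketch rather than the course’s fix, is to perform the shift and the subtraction on an unsigned operand, where both operations are defined, and convert back at the end:

/* For 1 <= n <= 32 with 32-bit ints: the unsigned shift and the
   unsigned subtraction are both fully defined, and the result always
   fits in an int, so the final conversion is well-defined too. */
tmax_n = (int)(((unsigned)1 << (n - 1)) - 1u);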

is_greater

Return 1 if one int is larger than another, and zero otherwise.

                correct   wrong
no undefined          4      11
undefined            84       6

90 solutions overflowed a signed addition.

subtract_overflows

Return 1 if subtracting one int from another overflows, and zero otherwise.

                correct   wrong
no undefined          0      13
undefined            89       3

92 solutions overflowed a signed addition — nobody avoided this hazard.

What’s the Problem?

The problem is that a large fraction of our junior-level CS undergraduates are producing code that executes integer undefined behaviors. At best, this kind of code is not portable across compilers, platforms, or even changes in compiler version, optimization options, or calling context. At worst, undefined behaviors are time bombs that will lead to exploitable vulnerabilities or other serious malfunctions in software systems. I want our graduates to understand what undefined behavior means and how to avoid it, not because C/C++ programming is so great, but because the concepts are much more broadly applicable.

What’s the Solution?

Fixing the problem in the context of Utah’s CS 4400 course should be straightforward:

  • Include some lecture material on undefined behaviors
  • Make Peng’s undefined behavior checker part of the test suite that students are given as part of the assignment; since it gives nice clear error messages, the output should be directly actionable

I’ll see if Erin, the current instructor, is interested in working with me to implement this for Fall 2011. We’ll see how it goes.

[Note: I’ve deliberately left code solving the bit puzzles out of this post, and I will reject any comments containing solutions.]

Should Software Evolve?

Weimer, Nguyen, Le Goues, and Forrest wrote a paper, Automatically Finding Patches Using Genetic Programming, that proposes and evaluates this approach for fixing a buggy software system:

  1. A failure-inducing test case is found
  2. The system code is mutated randomly, but using smart heuristics to increase the probability that the part of the program that is broken is modified
  3. The new version of the system is tested against the new and old test cases, and is accepted if it passes all of them
  4. The fix is minimized using delta debugging

The paper was published in 2009 at ICSE — a highly selective software engineering conference — and it was one of the award papers the year it appeared. In other words, the software engineering community considered it to be one of the best papers of 2009. There is no doubt the approach is interesting and that it should be tried. On the other hand, it seems pretty clear that fixing bugs by genetic programming is sometimes inappropriate. Since the authors of the paper didn’t speculate about when their technique should and shouldn’t be used, I’ll do it.

First off, the method from the Weimer paper is very similar to something I see when teaching CS courses. Some fraction of the students approaches programming problems like this:

  1. Write code purporting to solve the assigned problem
  2. Run test cases
  3. Tweak the logic more or less randomly
  4. Go back to 2 until time runs out or all test cases succeed

This is not a good approach. When it works, it permits students to believe they do not need to practice incremental development, and that it’s OK to produce code that would be impossible to maintain. It also defeats the purpose of the assignment — learning — by reducing program construction to a series of mindless tweaks. It works poorly in practice, even for fairly simple assignments: the best outcome is an over-constrained and fragile program that solves some or all test cases, but that contains convoluted logic and needless special cases that render it incorrect for inputs not in the test set. The other outcome (and this happens pretty often) is that the student’s code steadily evolves to be more and more wrong until in the end the code must be thrown away and the student starts over.

Of course, just because it’s a disaster when students use evolutionary programming doesn’t mean that computers shouldn’t do it. My first attempt at a rule for when it’s OK to evolve software was this:

Software for non-critical applications may be evolved, whereas software for safety-, security-, or mission-critical systems should not be.

But this isn’t a good rule. It’s perfectly fine for certain parts of the software controlling my car’s engine to evolve (I’ll say why below), and some kinds of totally non-critical software should never evolve. Here’s a rule that I think is better:

Black box software whose correctness can be externally verified can and perhaps should evolve. Program logic that must be understood, modified, analyzed, and (especially) maintained by human developers should not evolve.

For example, if a genetic algorithm is used to evolve the most efficient valve timings for my car engine or the most effective buffer size for my video player, then great — these can be independently verified and the fact that they evolved is irrelevant. On the other hand, if my compiler crashes or emits the wrong code, can it be repaired by an evolutionary technique? No. The logic required to fix any non-trivial compiler bug is intimately entwined with the compiler’s existing logic. Evolving it would be perhaps the dumbest idea on record. My guess is that the core logic of many important software systems (operating systems, virtual machine managers, database engines, etc.) is similarly not amenable to evolutionary bug fixing.

Are evolutionary techniques somehow incompatible with operating systems and compilers? Absolutely not. Parameters to the OS paging or prefetching policy, or the ordering of phases in a compiler, can and should be evolved to be better. But again, this will work because nobody needs to understand the prefetch policy as long as it works. A bad prefetch policy or phase ordering will hurt performance, but won’t break the system.

Of course I’d be happy to be proved wrong, and here’s how the software evolution people can do it: as soon as they ask me to, I’ll start handing them compiler bug reports at the same time I hand these to the LLVM and GCC maintainers. Then, they can try to evolve a fix before the compiler maintainers get around to looking at the problem. This should be easy because evolutionary bug fixing is totally automated whereas manual bug fixing requires stealing time from well-paid and extremely busy developers. Of course the evolved compiler patches should be submitted to the compiler maintainers. If they can get just five evolved patches accepted to the GCC or LLVM trees, I will publicly apologize for being so very, very wrong about their work, and will also buy multiple beers for whatever subset of them I next run into at a conference.

Update: I should add that I would very much like evolutionary fixing of compiler bugs to work out, because this method could be combined with my group’s automated detection of compiler bugs. The result would be a compiler that spontaneously becomes more correct.

Random Testing Gets No Respect

A funny thing about random testing is that it gets little respect from most computer scientists. Here’s why:

  • Random testing is simple, and peer reviewers hate simple solutions.
  • Random testing does not lend itself to formalization and proofs, which peer reviewers love.
  • Random testing provides no guarantees. Peer reviewers love guarantees — regardless of whether they are actually useful.
  • Random testing is hard to do well, but the difficulty is more art than science: it requires intuition, good taste, and domain knowledge. These — like many other important things — can’t be neatly packaged up in a journal paper.
  • People dislike the probabilistic aspect, and would prefer a fixed test suite.
  • Most researchers don’t produce robust software, and therefore have trouble understanding why anyone would need testing techniques that can help produce robust software by uncovering difficult corner-case bugs.

Of course, random testing has plenty of well-known problems, but I often get the feeling that people are responding to prejudices rather than facts. Even so, the picture is far from bleak and there are plenty of good researchers who exploit or at least appreciate random testing. My favorite recent example is found in this paper by Fox and Myreen, who are working on a formal model of the ARM instruction set architecture in the HOL4 theorem prover. This kind of work is the most pedantic, hard-nosed possible kind of formal methods, and yet they validate their model (and find bugs in it) using humble randomized differential testing — against real ARM boards, no less.

Two Rules for Random Testing

When you run a test case on a piece of software, you’re conducting an experiment where the null hypothesis is “the system under test correctly executes this test.” The problem is that the value of each such experiment is small because there are so many possible inputs. Random testing has a stronger null hypothesis: “the system under test correctly executes as many generated test cases as we have time to run.” The idea is to explore parts of the space of possible system inputs that a human wouldn’t think to test. This leads us to:

Random Testing Rule 1: Err on the side of expressiveness by prohibiting the generation of as few (legal) inputs as possible.

In other words, a better random tester is one that will — given time — explore a larger part of the input space. Expressiveness, however, can be difficult to achieve:

  1. It is easy to fail to notice parts of the input space that should be explored. How many of us know all of Linux’s system calls? All of C’s trigraphs? All of GNU C’s inline assembly directives?
  2. Some valid inputs are very hard to distinguish from invalid inputs. For example, assume that we’re attempting to generate random JavaScript programs that terminate. See the problem? On the other hand, often there are shortcuts — maybe we can just kill the apparently non-terminating programs after a timeout; a sketch of this appears after the list. In other cases, the system under test can itself tell us whether an input is valid.
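Here is a minimal sketch of the timeout shortcut from item 2, assuming a POSIX system; run_with_timeout and the test-case path are invented names. The generated program runs in a child process and is killed by SIGALRM if it fails to finish in time:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one generated test case with a time limit; returns its exit
   status, or -1 if it crashed or had to be killed. */
int run_with_timeout(const char *path, unsigned seconds)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                  /* fork failed */
    if (pid == 0) {                 /* child: become the test case */
        alarm(seconds);             /* the pending SIGALRM survives exec
                                       and terminates a hung test */
        execl(path, path, (char *)NULL);
        _exit(127);                 /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}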

Of course, expressiveness isn’t the whole story for random testing. For example, we can generate all possible JavaScript programs by generating random bit strings. However, if we actually do this we’ll waste almost all of our testing time generating bit strings and getting them rejected by the same tiny part of the JS lexer. This observation leads us to:

Random Testing Rule 2: Restrict the expressiveness of generated programs as needed to achieve effective testing.

Trivially, the random bit string generator was too expressive to be an effective tester for a JavaScript implementation. Similarly, if I’m testing a new, buggy floating point optimization pass for a compiler, simply turning off integer variables in the test case generator may be the way to proceed. In general we must shape the inputs so that they exercise the right parts of the system under test.

Good taste is required to distinguish between parts of the input space that are germane vs. those that are irrelevant. For example, there is a large family of JavaScript programs differing only in choice of identifier names, and another large family differing only in choice of floating point constants. These sub-spaces of the input space probably do not need to be explored deeply because they are not tightly connected to the difficult logic in a JS VM. On the other hand, see here for a story about a magic FP constant that hangs PHP on certain platforms.

Summary: Good random testing is achieved by balancing the tension between too much and too little expressiveness in the test case generator.

Probabilities in Random Testing

A typical real computer system has an extremely large input space and no testing method can cover more than an infinitesimal part of it. On the other hand, broad regions of the input space will trigger bugs. The problem is that we do not know the shapes or locations of the buggy parts of the space. Random testing is a shotgun approach to finding these regions; empirically, it works well.

I’ve written, or supervised the writing of, about 10 random test case generators. The most difficult and unsatisfactory part of engineering a good random tester is setting the probabilities properly. For example, if I’m generating JavaScript programs to stress-test a browser, I need to decide how often to create a new variable vs. referencing an existing one, how often to generate a loop vs. a conditional, etc. The shape of the eventual JS program depends strongly on dozens of little decisions like these.
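To make those little decisions concrete, here is a toy statement generator; it is my own sketch, not one of the generators mentioned above, and the knob names and percentages are invented. Each decision is a weighted coin flip, combined with a depth cap of the kind discussed in the observations below:

#include <stdio.h>
#include <stdlib.h>

/* Invented tuning knobs: changing them reshapes every generated program. */
#define PCT_NEW_VAR  30   /* chance a statement declares a new variable */
#define PCT_LOOP     20   /* chance a statement is a loop               */
#define MAX_DEPTH     3   /* heuristic size limit, not a probability    */

static int chance(int percent) { return rand() % 100 < percent; }

static void gen_statement(int depth)
{
    if (depth >= MAX_DEPTH) {       /* cap nesting regardless of the dice */
        printf("x = x + 1;\n");
        return;
    }
    if (chance(PCT_NEW_VAR)) {
        printf("var v%d = 0;\n", rand() % 100);
    } else if (chance(PCT_LOOP)) {
        printf("for (i = 0; i < 10; i++) {\n");
        gen_statement(depth + 1);
        printf("}\n");
    } else {
        printf("x = x + 1;\n");
    }
}

int main(void)
{
    srand(1);   /* fixed seed so a failure-inducing case can be reproduced */
    gen_statement(0);
    return 0;
}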

A few more observations:

  • If the probabilities are chosen poorly, the tester will be far less effective than it could have been, and may not find any bugs at all
  • Probabilities alone are an insufficient mechanism for engineering good test cases, and need to be combined with other mechanisms such as heuristic limits on the size of various parts of a test case
  • Equivalently, probabilities sometimes want to be a function of the test case being constructed; for example, the probability of creating a new function may decrease as the number of generated functions increases
  • The connection between probabilities and high-level properties of test cases may be very indirect
  • Alternatives to playing probability games, such as uniform sampling of the input space or bounded exhaustive testing, are generally harder or less effective than just getting the probabilities right

In practice, probability engineering looks like this:

  1. Think hard about what kind of test cases will drive the system under test into parts of its state space where its logic is weak
  2. Tweak the probabilities to make the generated tests look more like they need to look
  3. Look at a lot of generated test cases to evaluate the success of the tweaking
  4. Go back to step 1

Random testing almost always has a big payoff, but only when combined with significant domain knowledge and a lot of hard work.

Outgunned

Several years ago I published my favorite kind of paper: it took a problem that was hard to solve by hand, and solved it using 20% cleverness and 80% brute force. The details don’t matter here, but the solution had scalability problems. Therefore, the next iteration of the work (done primarily by a very smart student) was more like 80% cleverness and 20% brute force. The student, Usit, disappeared into Microsoft after getting an MS degree, leaving me to give the talk about his work. After giving this talk I spent a while talking with a member of the group that produced Astree, a team that contains some of the heavy hitters in the program analysis world. Of course, they had encountered similar problems to ours in constructing good transfer functions, and they also had some ideas for solving these problems. Not long after this conversation, I decided to drop this line of research. My impression then — and five years later I still think it’s true — was that while I could have continued to publish in this area, my contributions would not have been very deep or interesting ones. Basically I realized that (1) the problems in this area were highly mathematical and (2) any math power I could bring to the table was insignificant compared to groups containing people like the guy I talked to.

I take a positive and a negative lesson from this little story. The negative one is about the state of math education in the US. Why am I so often outclassed by people (who usually don’t have a degree in math) educated outside the USA? It’s ridiculous, and I actually do have a degree in math. I’m not saying that I’m supposed to be a math genius because I took some classes (I chose to go to grad school in CS instead of math for a reason), but the difference is pretty noticeable. Looking back, I only learned real mathematical thinking in a handful of upper-division classes open only to math majors. In contrast, the math classes that were in service of an engineering degree emphasized cookie-cutter methods for predefined problems.

The positive lesson is that I found it very freeing to realize that a research area is covered by other people who are more capable than I am. This has happened to me a few more times since then and it’s always a relief. As a researcher, life is good as long as there exists just one important problem that I’m most capable of attacking.

The Day I Learned To Love Perl

Following my second year of grad school, I spent the summer of 1997 as an intern for Myricom, a company in the L.A. area that makes really fast local-area networks. It was a great place to work: small, and filled with super-smart ex-Caltech people.

One day I was hacking while the important people were in a meeting with a big customer who was having some sort of network problem. Bob Felderman, my supervisor, came out and told me they had a hypothesis about the problem, but some data needed to be analyzed to make sure; could I do it, like, right now? At the time I was basically a C hacker, and groveling through lots of complicated text files using a C program would have been pure hell. I probably gave Bob a skeptical look. He walked over to someone else’s desk, handed me a Perl book, and said this was the right tool for the job. I had never so much as touched a Perl interpreter, but I started digging and within probably 30 minutes I had solved the problem; I went into the meeting and handed them the smoking gun they’d been looking for.

I do not love Perl for all tasks; I definitely do not love it for large projects; and, I seldom love the Perl that anyone else writes. But that day I learned to love Perl and even now 95% of my LOC are in it. Of course this is partly because my students get to write all the fun code and I’m left parsing output files, graphing stuff, and otherwise stringing together long chains of UNIX processes.