It’s Time to Get Serious About Exploiting Undefined Behavior

[Note: I promise, this is almost my last post about undefined behavior for a while. Maybe just one more in the queue.]

The current crop of C and C++ compilers will exploit undefined behaviors to generate efficient code (lots of examples here and here), but not consistently or well. It’s time for us to take this seriously. In this piece I’ll show that aggressively exploiting undefined behavior can massively speed up real codes, while remaining 100% faithful to the language standards.

Since undefined behavior is tricky to talk about, I’ll need to lay a bit of groundwork.

Understanding a C or C++ Program

The meaning of a C or C++ program is determined by an “abstract machine” that is described in the standards. You can think of these abstract machines as the simplest possible non-optimizing interpreters for C and C++. When the abstract machine runs a program, it does so as a sequence of steps, each of which is stated or implied by the standard. We’ll call this sequence of steps an execution. In general, a program will admit many different executions. For example, if we use gzip to compress the Gettysburg Address, the C abstract machine will perform a few million steps which together constitute one possible execution of gzip. If we compress the Gettysburg Address again, we’ll get the same execution (in this post we’ll ignore the fact that these abstract machines are non-deterministic). If we compress “I Have a Dream” then we’ll get a different execution. It is useful to talk about the set of all possible executions of a program. Of course, this set often has a large or infinite number of members.

The Compiler’s Job

The compiler’s job is to emit object code that has the same observable behavior as the C/C++ abstract machine for all non-erroneous executions of the program. Observable behavior is pretty much just like it sounds: it covers the abstract machine’s interactions with hardware (via volatile variables), the operating system, etc. Internal operations such as incrementing a variable or calling malloc() are not considered to be observable. The compiler is free to generate any code it likes as long as observable behaviors are preserved. Consequently, the compiler does not need to translate code that is not touched by any execution of the program. This is dead code elimination, a useful optimization.
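
To make the observable-behavior point concrete, here is a tiny sketch of mine (the names are made up): the volatile store is observable and must survive compilation, while the unused internal computation need not.

volatile int device_reg;        /* stands in for a memory-mapped hardware register */

void f(void)
{
  int unused = 42 * 42;         /* internal operation whose result is never observed:
                                   a compiler may simply drop it */
  (void)unused;
  device_reg = 1;               /* volatile access is observable behavior:
                                   this store has to appear in the object code */
}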

Now let’s work undefined behavior into the picture. Every execution step performed by the C/C++ abstract machine is either defined or undefined. If an execution of a program contains any undefined steps, then that execution is erroneous. The compiler has no obligations at all with respect to erroneous executions. Notice that “erroneous” is an all-or-nothing property; it’s not the case that an execution is non-erroneous up to the point where it executes an operation with undefined behavior. (I wrote about this issue in more detail here.)

As a thought experiment, let’s take an arbitrary C or C++ program and add a new line of code at the start of main():

-1<<1;

To evaluate this expression, the abstract machine must perform an operation that (in the C99, C11, and C++11 dialects) has undefined behavior. Every execution of the program must evaluate this expression. Thus, the program admits no non-erroneous executions. In other words, the compiler has no obligations. An efficient course of action for the compiler would be to report success without generating any object code.

What if we put the undefined execution at the end of the program, just before main() returns? Assuming that the program contains no calls to exit() or similar, this has exactly the same effect—the compiler can report success and exit without generating any code.

Do Real Codes Have Lots of Erroneous Executions?

Let’s consider the SPEC benchmark programs. Making them fast is something that most compiler vendors would like to do.  Using IOC we found that a majority of the programs in SPEC CINT 2006 execute undefined behavior during the course of a “reportable” run. This means that a correct C/C++ compiler could generate code that is almost infinitely faster than the code generated by current compilers. The SPEC harness would of course complain that the compiled programs produce the wrong results, but this is actually a bug in the SPEC harness, not a bug in the compilers. If we insist that a compiler turns these SPEC programs into executables that produce the specified results, then we are effectively mandating that this compiler implements a bizarre, undocumented dialect of C where undefined behavior can sometimes be exploited and sometimes not.

Considered as C99 programs themselves, LLVM/Clang 3.1 and GCC (SVN head from July 14, 2012), running on Linux on x86-64, do not appear to admit any non-trivial, non-erroneous executions. I expect the same is true of most or all previous versions of these compilers. IOC tells us that these compilers execute undefined behaviors even when compiling an empty C or C++ program with optimizations turned off. The implication is that a suitably smart optimizing compiler could create executables for Clang and GCC that are dramatically faster than the executables we use every day. Think how fast you could build a Linux kernel with one of these compilers!

My guess is that most or all executions of GCC and LLVM (and basically all other large programs) would be undefined by C89 (and not just C99) but we can’t tell for sure without using a comprehensive undefined behavior checker for C89. This tool does not exist. I confess to not personally having the gumption necessary for cramming GCC or LLVM through the best available dynamic undefined behavior checkers: KCC and Frama-C.

How Optimizing Compilers Should Work

Compilers are perfectly capable of reasoning about all possible executions of the program being compiled. They do this with static analysis, which avoids decidability problems using approximation. In other words, detecting dead code is perfectly possible as long as we accept that there will sometimes be dead code that we cannot statically detect. The resulting optimization, dead code elimination, is ubiquitous in optimizing compilers. So how can we leverage static analysis to aggressively exploit undefined behavior?
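
As a concrete illustration of the approximation at work (my own sketch, not taken from any particular compiler), consider the difference between dead code that is statically detectable and dead code that is not:

#include <limits.h>

int g(int x)
{
  if (INT_MAX < 0)              /* provably false at compile time: this branch is
                                   dead code that an optimizer can delete */
    return -1;
  if (x == 1234567)             /* whether this branch is dead depends on runtime
                                   input, so the compiler must keep it even if it
                                   never actually executes */
    return 1;
  return 0;
}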

Step 1: Make Undefined Behavior Explicit in the IR

LLVM already has a limited form of explicit undefinedness: an “undef” value. This is useful but it should be augmented with an undef instruction that represents unconditional execution of undefined behavior.

[Update: Owen Anderson points out in the comments that LLVM’s unreachable instruction already serves this purpose! Cool. Also, the undef value has stronger semantics than C/C++ undefined behavior so either its meaning would need to be tweaked or else a new value would need to be added before the undef to unreachable promotion I mention below would work.]

Step 2: Turn Dataflow Facts into Undefined Behaviors

Compilers already turn undefined behaviors into dataflow facts. For example, in the famous example from the Linux kernel, GCC used a pointer dereference to infer that the pointer was not null. This is all good, though there are plenty of instances where compilers could be much more aggressive about this; Pascal has a nice example involving restrict.
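
Here is a simplified sketch of that kernel pattern (illustrative code of mine, not the actual kernel source): the dereference allows the compiler to assume the pointer is non-null, after which the explicit null check looks always-false and can be folded away.

struct sock { int type; };

int get_type(struct sock *sk)
{
  int t = sk->type;             /* undefined if sk is NULL, so the compiler may
                                   infer sk != NULL from this point on */
  if (sk == NULL)               /* under that inference the test is always false
                                   and the check may be deleted */
    return -1;
  return t;
}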

On the other hand, not as much has been done on aggressively inferring undefined behavior using dataflow facts. For example:

  • If a program evaluates x+y and we know that both x and y are larger than 1.1 billion (and both variables are 32-bit ints), the addition can be replaced with an undef value (a sketch of this case appears just after this list).
  • If a program evaluates (*x)++ & (*y)++ and the pointer analysis can show that x and y are aliases, the expression can again be turned into an undef value.
  • Any statement that unconditionally evaluates an expression containing an undef value can be turned into an undef instruction.
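
Here is a sketch of the first case in the list (hypothetical; I am not aware of a compiler pass that currently does this): the guard establishes dataflow facts under which the signed addition must overflow, so the addition could be replaced by an undef value and the branch eventually pruned.

int sum_if_large(int x, int y)
{
  if (x > 1100000000 && y > 1100000000) {
    return x + y;               /* both operands exceed 1.1e9, so the 32-bit signed
                                   sum exceeds INT_MAX: undefined behavior */
  }
  return 0;
}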

There are around 191 kinds of undefined behavior in C99, so a complete set of optimization passes looking for them will require a fair amount of work.

Step 3: Prune Undefined Executions

Now we want to delete any code that is only needed by erroneous executions. This will build upon dominator analysis, which is already done by optimizing compilers. A piece of code X dominates Y if you can only get to Y by first executing X. All code that is dominated by an undef instruction can be removed. The opposite is not true: code can’t be deleted just because it dominates an undef. However, there is a mirror-image analysis called post-dominator analysis: X is post-dominated by Y when every execution that passes through X must also pass through Y afterwards. Using the results of this analysis, we can safely delete any code that is post-dominated by an undef.
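
To illustrate the two rules, here is a sketch of mine, with all function names hypothetical and trigger_ub() standing in for an operation the compiler has proven to execute undefined behavior:

static void step_a(void) {}
static void step_b(void) {}
static void step_c(void) {}
static void trigger_ub(void) { (void)(-1 << 1); }   /* stand-in: the shift example from earlier */

void example(int c)
{
  step_a();                     /* dominates the undef, but executions with c == 0
                                   pass through here without ever reaching the
                                   undef, so this call cannot be deleted */
  if (c) {
    trigger_ub();
    step_b();                   /* dominated by the undef: reachable only after
                                   executing undefined behavior, so removable */
  }
  step_c();                     /* neither dominated nor post-dominated by the
                                   undef: must be kept */
}

void example2(void)
{
  step_a();                     /* post-dominated by the undef: every execution
                                   through here later hits trigger_ub(), so this
                                   call is removable as well */
  trigger_ub();
}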

Dominator-based undefined behavior optimization will be particularly effective when used interprocedurally at link time. For example, if I write a function that unconditionally uses an uninitialized variable, the intraprocedural analysis will reduce that function to a single undef instruction, but the interprocedural optimization will potentially be able to destroy much more code: everything leading up to, and away from, calls to this function, across the whole program. Imagine what could be accomplished if a common C library function such as malloc(), in a widely deployed version of libc, could be shown to unconditionally execute undefined behavior: almost all C programs on that system would totally disappear. A real Linux distribution could be fit onto a couple of floppy disks, like in the old days.

The big picture is something like this:

Conclusions

Let me be clear: compilers already try to avoid generating code corresponding to erroneous executions. They just don’t do this very well. The optimizations that I’m proposing differ in degree, not in kind.

Perhaps I’ll get a student to implement the more aggressive optimizations as a proof of concept. We’ll impress everyone with 2-second builds of the Linux kernel, 3-second GCC bootstraps, and 99.9% code-size reductions for Firefox.

Undefined Behavior Consequences Contest Winners

The contest that I posted the other day received some very nice entries. I decided to pick multiple winners since the best entries illustrate consequences of several kinds of undefined behavior.

First, here’s the runner-up, submitted by Octoploid:

int main (void) {
  int x = 1;
  while (x) x <<= 1;
  return x;
}

This program is undefined by C99, where signed left shifts must not push a 1 bit into or past the sign bit. Recent GCC versions such as r189449 for x86-64 on Linux compile this program into an infinite loop at -O2 and higher. This entry is great because in general, compiler developers have been very reluctant to start enforcing the very strict rules for signed left shift in C99. On the other hand, GCC still performs this optimization when asked to compile in C89 mode, where I don’t think the behavior is undefined. Based on that observation, I filed a GCC bug report. One of the main GCC developers, Joseph Myers, replied with this:

Shifts in GCC are supposed to be defined as long as the shift amount is in range, independent of the LHS, so it should not be folding like that. (Although we document in implement-c.texi that this is subject to change for signed left shift, I don’t think changing it would be a particularly good idea.)

I think it is great to see compiler developers taking this kind of a stand against exploiting certain undefined behaviors. Anyway, although the matter has not yet been completely resolved, and the optimization is perfectly valid in C99, it seems that the developers will back out this transformation. So, while I find this example to be very interesting, it does not seem like a contest winner.

Winner #1

Amonakov submitted this code:

enum {N = 32};
int a[N], pfx[N];
void prefix_sum (void)
{
  int i, accum;
  for (i = 0, accum = a[0]; i < N; i++, accum += a[i])
    pfx[i] = accum;
}

This function forms a bit of an argument against the very terse style of for-loop afforded by C, which (in my opinion) makes the undefined behavior harder to see than it might otherwise have been. The undefined operation here is the read of a[32], which the abstract machine performs in the increment expression accum += a[i] at the end of the final iteration, just before the loop test would fail. Recent versions of GCC, apparently as a consequence, decide that the loop exit test can be removed, giving an infinite loop that triggers a segfault once i becomes large enough.
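
For comparison, here is one way (a sketch of mine, reusing the declarations from the entry above) to compute the same prefix sums without ever reading a[32]:

void prefix_sum_fixed(void)
{
  int i, accum = 0;
  for (i = 0; i < N; i++) {
    accum += a[i];              /* accumulate first; never reads past a[N-1] */
    pfx[i] = accum;             /* pfx[i] = a[0] + ... + a[i], as intended */
  }
}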

Winner #2

Nick Lewycky submitted this code:

#include <stdio.h>
#include <stdlib.h>

int main() {
  int *p = (int*)malloc(sizeof(int));
  int *q = (int*)realloc(p, sizeof(int));
  *p = 1;
  *q = 2;
  if (p == q)
    printf("%d %d\n", *p, *q);
}

Using a recent version of Clang (r160635 for x86-64 on Linux):

$ clang -O realloc.c ; ./a.out 
1 2

To see the undefined behavior, you need to read the fine print in the C99 standard, which says:

The realloc function deallocates the old object pointed to by ptr and returns a pointer to a new object that has the size specified by size.

In other words, it is an error to use a pointer after it has been passed as an argument to realloc. This particular bit of language does not appear in the ANSI C standard and the situation there is more ambiguous. However, reading between the lines in A.6.2, I believe we can infer that this code was intended to be undefined in C89 as well.
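
For what it’s worth, a defined way to write essentially the same program (my sketch, not part of the contest entry) only ever touches the pointer returned by realloc:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int *p = (int *)malloc(sizeof(int));
  int *q = (int *)realloc(p, sizeof(int));   /* after this call, p must not be used */
  if (q != NULL) {
    *q = 2;
    printf("%d\n", *q);
    free(q);
  } else {
    free(p);                                 /* realloc failed: the old object is
                                                still live and should be freed */
  }
  return 0;
}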

The man pages for realloc() that I looked at do not mention or imply this caveat. If a major compiler is going to exploit this behavior, the documentation must be updated.

Conclusion

I chose these winners because in each case there is an element of surprise. In other words, I would not have guessed that the compiler would exploit these particular undefined behaviors, even after I understood the program or program fragment completely. An entry submitted by Pascal Cuoq, which is otherwise very interesting, fails this test—I would have expected a modern C compiler to exploit those undefined behaviors.

UPDATE: See Pascal’s response here.

Twenty Years of Linux

Twenty years ago, almost to the day, my friend Jamie showed up with four floppy disks. Two of them were a boot/root pair for 386BSD (none of FreeBSD, NetBSD, or OpenBSD existed yet) and two of them were for Linux 0.96c. It’s pretty hard to overstate how cool and amazing it was to see protected-mode multitasking on a PC. This was an old 20 MHz 386 that my parents had given me when they bought a 486; I mostly used it for Pascal programming, dialing into the campus mainframe to read Usenet, and playing games—Wing Commander, mainly, I think, since Doom was still a ways off.

Both operating systems were an obvious improvement over Minix, which I had hacked in OS class. Minix had pretty source code but, lacking memory protection and such, it wasn’t very useful. I wasn’t sure which of Linux or 386BSD to start using, but Jamie had been reading alt.os.linux for a few weeks and was impressed by the excitement it was generating, so I decided to go with Linux. The boot/root system was very limited—not many utilities could be fit onto the root floppy. However, a handful of rudimentary Linux distributions already existed at this point so it wasn’t too hard to get it installed onto my machine’s hard disk. Even years after this initial installation my friends and I could be found walking around the CS building carrying shoeboxes full of floppies.

This post doesn’t have a point, I just thought it was kind of weird that (since I turn 40 next week) I’ve had a Linux machine on my desk for pretty much exactly half of my life.

Anniversary Hike

Sarah and I have been married for 11 years today; we celebrated by going hiking with the kids up at Alta, where our wedding was. Unfortunately, after a couple of hours of great conditions, a cold rain storm reminded us that weather in the mountains can be serious. At least there was no lightning! We ate our picnic lunch at home.


Contest: Craziest Compiler Output due to Undefined Behavior

[UPDATE: Winners are here.]

The C and C++ standards fail to assign a meaning to a program that overflows a signed integer or performs any of the 190+ other kinds of undefined behavior. In principle, a conforming compiler can turn such a program into a binary that posts lewd messages to Twitter and then formats your disk. In practice, of course, we don’t expect such baroque consequences. Rather, the compiler is just going to do its job, which is to generate efficient code for all executions that don’t have undefined behavior. Disregarding Easter eggs such as the old version of GCC that would exec nethack, the compiler isn’t going to go out of its way to make undefined executions behave badly. It just doesn’t care about them. The problem is that as optimizers have gotten more clever, the consequences of undefined behavior have become a bit more strange. For example:

  • Using uninitialized data as an extra source of randomness caused a compiler to delete an entire seed computation. link
  • A compiler can evaluate (x+1)>x to both 0 and 1 in the same program, compiled at the same optimization level, for the same value of x (a sketch of this one appears after this list). link
  • An infinite loop such as a counterexample search (where no counterexample exists) can be terminated by the compiler, permitting, for example, Fermat’s Last Theorem to be “disproved.” link
  • An undefined shift operation caused a compiler to delete an important safety check in Google’s Native Client. link
  • A reference to a possibly-null pointer in the Linux kernel caused the compiler to remove a subsequent null check, creating an exploitable vulnerability. link
  • A division instruction (that potentially crashes the process) can be moved in front of code that has externally visible side effects. link
  • A compiler can run the code inside if (p) { ... } and also inside if (!p) { ... } when p is not initialized. link
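
As an illustration of the second item in the list, here is a sketch of mine. Under common compilers the function call may be folded to 1 at compile time while the second, runtime evaluation wraps and yields 0, so the program can print “1 0”; of course, since the overflow is undefined, no particular output is guaranteed.

#include <limits.h>
#include <stdio.h>

int bigger(int x) { return (x + 1) > x; }    /* often folded to "return 1" */

int main(void)
{
  volatile int x = INT_MAX;                  /* volatile discourages folding here */
  printf("%d %d\n", bigger(INT_MAX), (x + 1) > x);
  return 0;
}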

I will send a small but nice prize to the person who posts a comment describing the most interesting, surprising, or just plain crazy object code emitted by a compiler as a consequence of undefined behavior. Preference will be given to entries that:

  • Are posted before or on July 18 2012.
  • Describe interesting object code as opposed to describing a secondary consequence of that object code. A secondary consequence would be something like an ROP attack, launching a missile, or whatever.
  • Describe object code legitimately generated by a production-quality C or C++ compiler. In other words, I’m not interested in compiler bugs or in your own hand-crafted C compiler that emails Santa Claus when someone divides by zero.
  • Can be reproduced—so please, if possible, include compiler version, compiler options, a pointer to the code, etc.

If you readers can come to a consensus about the winner, I will go along with it. Otherwise, I’ll choose the winner. An example of an entry that would probably win (assuming that it could be reproduced) is “At -O2, LLVM 2.6 turns the attached C++ code into object code THAT ACTUALLY MAKES DEMONS FLY OUT OF MY NOSE.”

Here’s a similar discussion on Stack Overflow. However, people there are mostly talking about secondary consequences rather than interesting object code.

UPDATE: WordPress’s comment system likes to mangle C code, beware. I don’t know of an easy fix.

Parallelizing Delta Debugging

Delta debugging is a technique for taking an input to a computer program that causes the program to display a certain behavior (such as crashing) and automatically creating a smaller input that triggers the same behavior. For complex programs that typically process large inputs (compilers, web browsers, etc.) delta debugging is an important part of the overall debugging workflow.

A key invariant in a delta debugger is that the current version of the input is always “interesting”—it triggers the bug or whatever. The delta debugger speculatively transforms this input into a smaller input and then tests whether the new input is still interesting. If so, it becomes the current input. Otherwise, it is discarded and a different transformation is tried. Termination occurs when no transformation can make the current input smaller. The greedy search looks something like this:

Here the original input (version 1) was transformed four times. Three of the resulting inputs failed to have the property of interest but the fourth was interesting, giving version 2. Version 2 was successfully transformed the first time, giving version 3 which, since it cannot be transformed any more, constitutes the delta debugger’s output.
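
In code, the greedy loop might look something like this minimal sketch (mine, not C-Reduce’s actual implementation; the helper functions are hypothetical and memory management is elided):

#include <stddef.h>

typedef struct { char *data; size_t len; } input_t;

int is_interesting(const input_t *in);                        /* e.g. "does this still crash the compiler?" */
int num_transforms(void);
int apply_transform(const input_t *in, int t, input_t *out);  /* 0 if transform t is not applicable */

void delta_reduce(input_t *current)
{
  int progress = 1;
  while (progress) {                      /* stop at a fixpoint: nothing makes the input smaller */
    progress = 0;
    for (int t = 0; t < num_transforms(); t++) {
      input_t candidate;
      if (!apply_transform(current, t, &candidate))
        continue;
      if (is_interesting(&candidate)) {   /* invariant: 'current' is always interesting */
        *current = candidate;             /* keep it and continue from the smaller input */
        progress = 1;
      }                                   /* otherwise discard the candidate */
    }
  }
}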

While the original formulation of delta debugging uses only a single kind of transformation (removing a contiguous chunk of the input), we found that this didn’t produce nicely reduced C programs. My group’s tool, C-Reduce, is parameterized by a collection of pluggable transformations that, together, work better for this specific task.

In many cases, a run of a delta debugger takes only a few minutes, which is fast enough that we probably have no motivation to make it faster. On the other hand, we have found that reducing multi-megabyte C++ programs that trigger compiler bugs can be quite slow, on the order of a couple of days in some of the worst cases we’ve seen. Delta debugging is slow when the test for interestingness is slow and also when the greedy search fails to make rapid progress during its initial phases. Both of these tend to be the case for large C++ programs; rapid initial progress appears to be difficult due to the complex and dense dependency web in typical C++ code.

The other day I decided to parallelize C-Reduce for shared-memory machines. My first idea was to relax the delta debugging invariant so that each processor could have its own current version of the input. This requires a way to merge results so that processors can take advantage of each other’s progress. It would work something like this:

Here the code running on Core 2 has decided that instead of performing a regular C-Reduce transformation in version 3, it is rather going to merge version 3 with Core 1’s current version, which happens to be 2. Of course the result may not be interesting, but this diagram assumes that it is.

A problem is that existing utilities like “merge” are quite conservative; they aggressively signal conflicts. Also, it seems hard to know how often to merge. After a bit of experimentation I decided that this wasn’t the way to go. My next idea was to parallelize only the part of C-Reduce that performs line-based deletion of chunks of a file. If we rig C-Reduce so that instead of deleting lines it replaces them with blank lines, then merging becomes trivial. I think this was a reasonable idea but I stopped working on it after thinking of a simpler way.

The strategy used by the current (development) version of C-Reduce maintains the delta debugging invariant that we always have a global best result. Parallelization happens by speculatively transforming the current input and then forking off subprocesses to test these inputs for interestingness. The strategy for speculation is to assume that no transformed input is interesting; I chose this heuristic because, in practice, transformations produce uninteresting results more often than not. In this illustration, two speculative processes have been forked but neither has completed:

As interestingness tests terminate, more are forked:

As soon as any transformation succeeds in producing an interesting result, all remaining speculative processes are killed off and a new batch is forked:
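
A rough sketch of this fork/kill strategy follows (mine, not C-Reduce’s actual implementation; the helper functions are hypothetical, and for simplicity the sketch waits for children in the order they were forked rather than as they finish):

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 4

int  next_candidate(char *path, size_t len);   /* write the next speculative variant to a file; 0 if none left */
int  is_interesting(const char *path);         /* run the interestingness test on that variant */
void promote(const char *path);                /* install the variant as the new current input */

void parallel_reduce(void)
{
  for (;;) {
    pid_t kids[NPROC];
    char  paths[NPROC][256];
    int   n = 0;

    while (n < NPROC && next_candidate(paths[n], sizeof paths[n])) {
      pid_t pid = fork();                      /* fork a batch of speculative tests */
      if (pid < 0)
        break;                                 /* fork failed: run with fewer speculations */
      if (pid == 0)
        _exit(is_interesting(paths[n]) ? 0 : 1);   /* child: exit 0 iff still interesting */
      kids[n++] = pid;
    }
    if (n == 0)
      return;                                  /* nothing left to try: we are done */

    int winner = -1;
    for (int i = 0; i < n; i++) {
      int status;
      if (winner >= 0) {                       /* already found a success: kill the rest */
        kill(kids[i], SIGKILL);
        waitpid(kids[i], &status, 0);
        continue;
      }
      waitpid(kids[i], &status, 0);
      if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        winner = i;
    }
    if (winner >= 0)
      promote(paths[winner]);                  /* new global best; the next iteration forks a fresh batch */
  }
}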

Since this kind of parallelization does not break the delta debugging invariant, not much of C-Reduce needed to be modified to support it. Although there are many improvements still to implement, performance seems decent. Here’s a graph showing reduction time vs. degree of parallelism on a Core i7-2600 with hyperthreading disabled for a 165 KB Csmith program that triggers this LLVM bug:

Why does performance get worse when C-Reduce uses all four cores? My guess is that some resource such as L3 cache or main memory bandwidth becomes over-committed at that point. It could also be the case that the extra speculation is simply not worth it.

Here’s what happens when we reduce a big C++ program that crashes a commercial compiler using an i7-2600 with hyperthreading turned on:

When the degree of parallelism exceeds five, performance falls off a cliff. For example, when C-Reduce tries to use six CPUs the reduction takes more than seven hours. This graph seems to tell the same story as the previous one: parallelism can help, but some resource saturates well before all (virtual) processors are used.

I don’t yet know how C-Reduce will perform on a serious machine, but we have some new 32-core boxes that we’ll experiment with as we get time.

Perhaps my favorite thing about this little experiment with parallelization is that it leverages the purely functional nature of C-Reduce’s API for pluggable transformations. Previous versions of this API had been stateful, which would have made it hard to deal with speculation failures. I’m not particularly a fan of functional programming languages, but it’s pretty obvious that a functional programming style is one of the keys to making parallel programming work.

I decided to write this up since, as far as I know, nobody has written a parallelized test case reducer before.

Scooping and New Publication Models

If I come up with a great new research idea, sit on it for a couple of years, and then someone else publishes it before I do, then I got scooped. This is unfortunate but as long as the ideas were developed independently, nobody has done anything wrong. Of course, in the real world, establishing independence can be hard and sometimes this causes nastiness like the Leibniz-Newton controversy. This piece is about some flavors of scooping where ideas are not independently developed.

First I’ll quickly summarize the ground rules. It is always OK for me to publish work that builds upon someone else’s ideas providing that the earlier work has been published and that I properly acknowledge it. It is sometimes OK for me to publish ideas building upon someone else’s work without citation, for example if the work is old enough and well-known enough that it can be considered part of the common knowledge. Without this loophole we’d have to cite a pile of Knuth and Dijkstra (and Leibniz and Newton and Aristotle…) papers almost any time we wrote anything. It is never OK for me to publish ideas that build upon someone’s unpublished work that I acquired through unofficial channels such as reviewing a paper, sitting on a grant panel, finding a paper at the printer, or similar.

These rules are pretty simple and usually there’s no trouble following them. But how do they interact with newer publication possibilities such as arXiv or blogs? I recently heard about a case where a professor placed a paper draft on arXiv. Subsequently, this professor submitted a revised version of the arXiv paper to a conference, where it found itself in competition with a different paper that also built on his work. That competing paper was written by someone else entirely, who had simply read the draft on arXiv, liked it, and wanted to extend it. Was this OK? How about if something similar happens based on a blog post? What if I host a paper that I’m working on in a public Github repository and someone reads it there, extends it, and wins the publication race?

I propose that the guideline for “proper use” of someone’s ideas should be based on a standard similar to the public disclosure rule for patents in the US. In other words, if a researcher has publicly disclosed his or her ideas—in a blog, on arXiv, in a technical report, on Github, or anywhere else—then I am free to build upon them, provided of course that I cite the source properly. Conversely, a confidential disclosure of research ideas (peer review of grant proposals and papers would fall into this category) would not give me the right to scoop a researcher.

The patent analogy is not perfect but I think we should borrow the general idea that any disclosure is public unless it is explicitly confidential. I’ll be interested to hear what people think. Finally, to return to the example, posting the work on arXiv constituted a public disclosure of the ideas, so the competing researcher was perfectly within his/her rights to submit a paper that would potentially scoop the original author.