Polar Bears 2011

My work life, as readers of this blog have probably gathered, seems to mainly involve trying to keep funding agencies happy, teaching concepts like deadlock avoidance to bored undergraduates, exhorting grad students to work harder, going to pointless meetings, and spending any remaining time responding to emails. Of course I have a blog where I can write endlessly about all of this.

My younger brother Eric works for the US Fish and Wildlife Service and spends a good chunk of his working life inside the Arctic Circle studying the planet’s largest land predators, using helicopters and fixed-wing aircraft, both on land and on sea ice. Of course he does not have a web site.

Anyway, today Eric sent me some pictures from this year’s field season, which was apparently very successful, operating out of the Red Dog Mine facility. He said it was OK to post them, so some of them are below. Hopefully this serves as a bit of an antidote to the computer-systems-related minutiae that constitute the bulk of the content here.

An Executable Semantics For C Is Useful

The goal of a C/C++ compiler is to turn every sequence of ASCII characters into executable instructions. OK, not really — though it does seem that way sometimes. The real goal of a C/C++ compiler is to map every conforming input into executable instructions that correspond to a legal interpretation of that input. The qualifiers “conforming” and “legal interpretation” are very important. First, the compiler has extremely weak requirements about what it should do with non-conforming inputs, such as programs that contain undefined behaviors (array bounds violations, etc.). Second, all realistic C/C++ programs have a large number of possible interpretations, for example corresponding to different integer sizes, different orders of evaluation for function arguments, etc. The compiler chooses a convenient or efficient one, and the remaining interpretations are latent. They may emerge later on if the compiler options are changed, if the compiler is upgraded, or if a different compiler is used. The point is that the compiler has no obligation to tell us whether the input is conforming or not, nor how many possible interpretations it has.

Thus, while C/C++ compilers are very good at turning conforming programs into efficient executables, they are just about useless for answering other kinds of questions:

  • Does the program ever execute undefined behaviors, causing it (in principle) to have no meaning and (in practice) to execute attack code or crash?
  • Does the program rely on unspecified behaviors, making it non-portable across compilers, compiler versions, and changes in compiler options?
  • Does the program rely on implementation-defined behaviors, affecting its portability to other compilers and platforms?
  • Why does the program behave in a certain way? In other words, what part of the standard forced that interpretation?

To answer these questions, a wide variety of static analyzers, model checkers, runtime verifiers, and related tools have been developed. These tools are great. However, even taken all together, they are incomplete: there exist plenty of bad (or interesting) program behaviors that few or none of them can find. For example:

  • Very few tools exist that can reliably detect uses of uninitialized storage.
  • Few, if any, tools can correctly diagnose problems resulting from C/C++’s unspecified order of evaluation of function arguments.
  • An lvalue must not be modified multiple times, or be both read and written, in between sequence points. I’m not aware of many tools that can correctly detect that evaluating this function results in undefined behavior when p1 and p2 are aliases:
int foo (int *p1, int *p2) {
  return (*p1)++ % (*p2)++;
}

The Missing Tool

The missing tool (or one of them, at any rate) is an executable semantics for C. An executable semantics is an extremely careful kind of interpreter where every action it takes directly corresponds to some part of the language standard. Moreover, an executable semantics can be designed to tell us whether the standard assigns any meaning at all to the program being interpreted. In other words, it can explicitly check for all (or at least most) undefined, unspecified, and implementation-defined behaviors. For example, when an executable semantics evaluates (*p1)++ % (*p2)++, it won’t assign a meaning to the expression until it has checked that:

  • both pointers are valid
  • neither addition overflows (if the promoted types are signed)
  • p1 and p2 are not aliases
  • *p2 is not 0
  • either *p1 is not INT_MIN or *p2 is not -1

Moreover, the tool should make explicit all of the implicit casts that are part of the “usual arithmetic conversions.” And it needs to do about 500 other things that we usually don’t think about when dealing with C code.
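
To make this concrete, here is a rough sketch of my own (not the tool’s actual output) showing the integer promotions, the first step of the usual arithmetic conversions, written out as explicit casts. It assumes a 32-bit int and a 16-bit short:

#include <stdio.h>

int main (void) {
  unsigned short us = 65535;
  signed char sc = -1;

  /* what we write */
  int implicit = us + sc;

  /* what the standard actually specifies: both operands are promoted
     to int before the addition, so no wraparound or sign confusion
     occurs here */
  int with_casts = (int)us + (int)sc;

  printf ("%d %d\n", implicit, with_casts);  /* both print 65534 */
  return 0;
}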

Who Needs an Executable Semantics?

Regular programmers won’t need it very often, but they will occasionally find it useful for settling the kinds of annoying arguments that happen when people don’t know how to read the all-too-ambiguous English text in the standard. Of course, the executable semantics can only settle arguments if we agree that it has captured the sense of the standard. Better yet, we would treat the executable semantics as definitive and the document as a derivative work.

Compiler developers need an executable semantics. It would provide a simple, automated filter to apply to programs that purportedly trigger compiler bugs. A web page at Keil states that “Fewer than 1% of the bug reports we receive are actually bugs.” An executable semantics would rapidly find code fragments that contain undefined or unspecified behaviors — these are a common source of bogus bug reports. Currently, compiler developers do this checking by hand. The GCC bug database contains 4966 bug reports that have been marked as INVALID. Not all of these could be automatically detected, but some of them certainly could be.

People developing safety-critical software may get some benefit from an executable semantics. Consider CompCert, a verified compiler that provably preserves the meaning of C code when translating it into assembly. CompCert’s guarantee, however, is conditional on the C code containing no undefined behaviors. How are we supposed to verify the absence of undefined behaviors when existing tools don’t reliably check for initialization and multiple updates to lvalues? Moreover, CompCert is free to choose any legitimate interpretation of a C program that relies on unspecified behaviors, and it does not need to tell us which one it has chosen. We need to verify up-front that (under some set of implementation-defined behaviors) our safety-critical C program has a single interpretation.

My students and I need an executable semantics, because we are constantly trying to figure out whether random C functions are well-defined or not. This is surprisingly hard. We also need a reliable, automated way to detect undefined behavior because this enables automated test case reduction.

An Executable Semantics for C Exists

I spent a few years lamenting the non-existence of an executable C semantics, but no longer: the tool now exists. It was created by Chucky Ellison, a PhD student at the University of Illinois working under the supervision of Grigore Rosu. They have written a TR about it, and the tool can be downloaded. Hopefully Chucky does not mind if I provide this link; the tool is very much a research prototype (mainly, it is not very fast). But it works:

regehr@home:~/svn/code$ cat lval.c
int foo (int *p1, int *p2) {
  return (*p1)++ % (*p2)++;
}

int main (void) {
  int a = 1;
  return foo (&a, &a);
}
regehr@home:~/svn/code$ kcc lval.c
regehr@home:~/svn/code$ ./a.out
=============================================================
ERROR! KCC encountered an error while executing this program.
=============================================================
Error: 00003
Description: Unsequenced side effect on scalar object with value computation of same object.
=============================================================
File: /mnt/z/svn/code/lval.c
Function: foo
Line: 2
============================================================

As I mentioned earlier, very few tools for analyzing C code find this error. Chucky’s tool can also perform a state space exploration to find order of evaluation problems and problems in concurrent C codes. Finally, it can run in profile mode. Unlike a regular profiler, this one profiles the rules from the C semantics that fire when the program is interpreted. This is really cool and we plan to use it to figure out what parts of the C standard are not exercised by Csmith.

Chucky’s tool is already an integral part of one of our test case reducers. This reducer takes as input a huge, ugly, bug-triggering C program generated by Csmith. It then uses Delta debugging to output a much smaller bug-triggering program that (ideally) can be included in a compiler bug report without further manual simplification. Before Chucky’s tool arrived, we had spent several years trying to deal with the fact that Delta always introduces undefined behavior. We now seem to have a bulletproof solution to that problem.

The benefits of executable semantics have long been recognized in the PL community. The new thing here is a tool that actually works, for the C language. Hopefully, as Chucky’s tool matures people will find more uses for it, and perhaps it can even evolve into a sort of de facto litmus test for ascertaining the meaning — or lack thereof — of difficult C programs.

Antelope Island

The boys had no school today, so I took them and a friend to Antelope Island, one of the more accessible islands in the Great Salt Lake. Despite the odd weather (it snowed, got sunny, and snowed again about eleven times during the day) we had a fun trip. I brought my backpacking stove and made us a hot lunch; for some reason this is always a hit with kids. We saw the bison herd, but not close up.

Uninitialized Variables

I’ve been tempted, a couple of times, to try to discover how much performance realistic C/C++ programs gain through the languages’ failure to automatically initialize function-scoped storage. It would be easy to take a source-to-source transformer like CIL and use it to add an explicit initializer to every variable that lacks one. Then, presumably, a modern optimizing compiler would eliminate the large fraction of initializers that are dominated by subsequent stores. If compilers are not smart enough to do this, and overhead remains high, a more sophisticated approach would be to:

  • Only initialize a variable when some powerful (but conservative) interprocedural static analyzer thinks it may be needed
  • Initialize a variable close to its first use, to avoid problems with locality and with artificially lengthening the live range

My guess is that many programs would not be slowed down noticeably by automatic initialization, but that it would not be too hard to find codes that slow down 5%-20%.
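
As a concrete illustration, here is a minimal before/after sketch of the naive transformation (the function is made up for illustration; a real CIL-based pass would rewrite the AST rather than the program text):

/* before: the source as written; total is read uninitialized on the
   first loop iteration, which is a genuine bug */
int sum_squares (const int *buf, int n) {
  int i, total;
  for (i = 0; i < n; i++)
    total += buf[i] * buf[i];
  return total;
}

/* after: an explicit initializer is added to every local that lacks
   one; the initializer for i is dead and should be optimized away,
   while the initializer for total turns garbage into a defined zero */
int sum_squares_initialized (const int *buf, int n) {
  int i = 0, total = 0;
  for (i = 0; i < n; i++)
    total += buf[i] * buf[i];
  return total;
}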

Lacking automatic initialization, most of us get by using compiler warnings and dynamic tools like Valgrind. I was recently surprised to learn that warnings+Valgrind are not as reliable as I’d have hoped. First, the compiler warnings are fundamentally best-effort in the sense that common compilers more or less give up on code using arrays, pointers, function calls, and some kinds of loops. For example:

int foo1 (void) {
  int y,z;
  for (y=0; y<5; y++)
    z++;
  return z;
}

int foo2 (void) {
  int a[15];
  return a[10];
}

void bar (int *p) {
}

int foo3 (void) {
  int x;
  bar (&x);
  return x;
}

Each of foo1(), foo2(), and foo3() returns a value that depends on a read from uninitialized storage, but recent versions of GCC and Clang fail to give a warning about this, at least for the command line options I could think to try. Intel CC 12.0 finds the first two and Clang’s static analyzer finds the problem with foo2(). However, both ICC and Clang’s analyzer are fairly easy to fool. For example, neither gives any warning about this code:

int foo2b (void) {
  int a[6], i;
  for (i=0; i<5; i++) {
    a[i] = 1;
  }
  return a[5];
}

Next let’s talk about Valgrind. It is greatly handicapped by the fact that optimizing compilers seek out and destroy code relying on undefined behaviors. This means that Valgrind never gets a chance to see the problems, preventing it from reporting errors. We can turn the functions above into a complete program by adding:

int main (void) {
  return foo1() + foo2() + foo3();
}

When compiled by GCC or Clang at -O2, the resulting executable passes through Valgrind with zero errors found. But let’s be clear: the problem still exists, it is just hidden from Valgrind. Each function still has to return something, and the programmer has failed to specify what it is. Basically the compiler will fabricate a value out of thin air. Or, as I like to say in class, it will somehow manage to generate the worst possible value — whatever it is. On the other hand, when optimizations are turned off, the output of either GCC or Clang is properly flagged as returning a value depending on uninitialized storage.

Is Valgrind reliable when it monitors code produced at -O0? Unfortunately not: the following functions pass through without error, regardless of optimization level:

void foo5 (void) {
  int x,y;
  for (x = 0; x < 5; x++)
    y;
}

void foo6 (void) {
  int a[1];
  a[0];
}

Moreover, neither GCC nor Clang warns about these. On the other hand, these functions are perhaps harmless since the uninitialized data aren’t actually used for anything. (Keep in mind, however, that according to the C standard, these functions are unambiguously wrong, and executing either of them destroys the meaning of the entire program.)

Failure to initialize function-scoped variables is one of the many holes in C/C++ that were introduced under the philosophies “trust the programmer” and “performance above all.” The resulting problems have been patched in a not-totally-satisfactory way using a collection of tools.

Is there a lesson here? Perhaps. I think it would be reasonable for tool developers to ensure that the following invariant holds:

For all C/C++ programs, any values read from uninitialized storage either:

  • result in a compiler warning,
  • result in a Valgrind error, or
  • fail to propagate (via data flow or control flow) to any system call.

I believe (but am not 100% sure) that Valgrind does not need to be modified to make this happen. Compilers definitely need to be changed. Basically, the compiler has to (1) without optimizations, generate code that permits Valgrind to detect all uses of uninitialized storage and (2) when optimizing, restrain itself from performing any transformation that conceals a problem from Valgrind, unless it emits a warning corresponding to the runtime behavior that has disappeared. Condition 1 is already the case for GCC and Clang (as far as I know) but a fair amount of work would probably be required to guarantee that condition 2 holds.

The other solution — automatic initialization — is of dubious value to C/C++ programmers because the resulting code would not be portable to other compilers. The standard would have to mandate initialization before this became generally useful, and there’s no way that’s going to happen. On the other hand, if I were an operating system vendor, and I cared at all about reliability and security, I’d probably hack the bundled compiler to do automatic initialization. Programmers writing conforming code would never notice, and sloppy code would be silently fixed instead of being perhaps exploitable.

Finally, I modified the six foo* functions above to include explicit initializers. When compiled using GCC, the code size overhead due to initialization is 0 bytes. Using Clang, 15 bytes. Using ICC, 40 bytes. This is all at -Os on x86-64. The ICC overhead all comes from foo2() — the compiler emits code initializing all 15 array elements even though 14 of these are obviously useless. Perhaps GCC does such a good job because its middle-end has already been tuned for ahead-of-time compilation of Java code.
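
For what it’s worth, an explicit initializer for foo2() presumably looks something like the version below; the = {0} form zero-initializes the whole array, which would account for ICC emitting stores for all 15 elements even though only a[10] is ever read:

int foo2_initialized (void) {
  int a[15] = {0};  /* zero-initializes all 15 elements */
  return a[10];
}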

How Can Computer Science Help Us Get To Mars?

SpaceX thinks it can get a person to Mars within 20 years. This seems optimistic, given that SpaceX does not enjoy the significant chunk of the USA’s federal budget that permitted NASA to get to the Moon on a relatively short time scale. Nevertheless, it’s a good goal, and presumably 50 years of improvements in space systems engineering will make up for some of the shortfall.

Since I strongly believe that getting humans off the rock needs to be a priority, I want to help. What are the CS problems that (1) can be addressed in a University and (2) will directly support getting off the rock? Since much of my research is aimed towards increasing embedded software reliability, I suspect that I’m already in the right ballpark, but once we get into the specifics there are a lot of ways to go astray. Perhaps I should see if I can do my next sabbatical at SpaceX. I recently read a book on how space systems can go wrong and (no surprise) there are very many ways, some of them software-related.

Value Loss Coverage

The ugly reality of computer arithmetic on fixed-length bit vectors is that many operations are not equivalent to the mathematical operators they superficially resemble. For example, only a tiny fraction of the possible outcomes of a 32-bit multiplication can be represented in a 32-bit destination register. Although bignum and rational packages exist — and have existed for many years — there are unsolved problems (highly unpredictable runtime and memory allocation behavior) that preclude their use in important classes of applications such as embedded systems that must execute in a predictable fashion, probably on constrained hardware.

The canonical example of harmful value loss was an assignment from a floating point value representing velocity into a 16-bit integer onboard Ariane 5 flight 501. The loss of value initiated a chain of events that resulted in the destruction of the rocket.

Modern compilers warn about potential value loss, which is nice because it makes it easier for us to inspect these code sites. But even so, it is often extremely hard to figure out which of these sites can lead to actual value loss. For example, the Ariane software could not lead to value loss on the Ariane 4, but value loss did occur when the same software module was reused on the Ariane 5. Clearly we cannot expect the compiler to understand the maximum horizontal velocity of a rocket.

The idea behind this piece is that for purposes of test coverage, every operation that potentially results in value loss should be tested in both its lossy and non-lossy modes. Consider these two C functions:

char foo1 (int x) {
  return x;
}

signed char foo2 (unsigned char x) {
  return x;
}

foo1() has value loss when x is outside the range of values representable in a char. We can fully test it by passing it the values 0 and 128 (assuming CHAR_MAX is 127). foo2() does not have a bitwidth constraint like foo1(), but it still can lose values because the value ranges representable in a signed and unsigned char do not fully overlap. foo2(), also, can be fully covered by passing it the values 0 and 128.

As a software testing goal, high value-loss coverage may only make sense when a whitebox test case generator like Klee is available. My student Peng Li implemented a patch to LLVM’s Clang frontend that takes a C program and compiles it into a form where the normal and lossy paths are distinct. These instrumented codes can be passed to Klee, which attempts to generate inputs that induce path coverage on the program. This works well for small test codes. However, we don’t yet have results from real applications showing that this extra coverage is useful for exposing bugs.
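
I don’t know exactly what code Peng’s patch emits, but conceptually the instrumented foo1() looks something like the sketch below, where the lossy and non-lossy conversions become distinct paths for Klee to cover (assuming CHAR_MIN is -128 and CHAR_MAX is 127):

#include <limits.h>

/* sketch only: the real instrumentation happens inside the compiler;
   the branch here just makes the two coverage targets explicit */
char foo1_instrumented (int x) {
  if (x < CHAR_MIN || x > CHAR_MAX) {
    /* lossy path: the conversion discards information */
    return (char)x;
  } else {
    /* non-lossy path: the value is preserved exactly */
    return (char)x;
  }
}

With this structure, the inputs 0 and 128 from above cover both paths.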

How Much and What to Read?

Grad students often aren’t quite sure how much of their work time should be spent reading, and may also have trouble figuring out what to read during that time. (In principle this problem also applies to professors, but it’s not much of an issue in practice since we have almost no time for discretionary reading.) I usually tell people to err on the side of doing too much, as opposed to reading too much. Averaging one day a week on reading is probably about right, but it may be a lot higher or lower than this during certain periods of grad school.

Of the time spent reading, about half should go towards solving immediate research problems. The purpose of this is getting unstuck, learning about alternate approaches, avoiding reinvention of known techniques, and learning how and what the current players in your field are thinking. Of the remaining 50% of reading time, half of it should go towards “deep background reading” — reading in the thesis area that goes a bit further afield and a bit further back in the chain of references than is absolutely necessary to get the work done. This reading will form the basis for the related work sections of papers and the dissertation. The last 25% of reading time is purely discretionary: classic papers, good books, new results from conferences, etc. This material is not expected to be directly relevant (though it may turn out to be) but keeps a person up to date on random interesting topics.

Csmith Released

Here is Csmith, our randomized C program generator. My dream is that it will be a force for good by unleashing a world of hurt upon low-quality C compilers everywhere (it is not uncommon for Csmith to crash a previously-untested tool on the very first try). High-quality C compilers, such as the latest versions of GCC and Clang, will correctly translate many thousands of random programs.

I wanted to release Csmith under the CRAPL, but nobody else in my group thought this was as funny as I do. Actually we put quite a bit of work into creating a stable, usable release. Even so, Csmith is a complex piece of software and there are certainly some lurking problems that we’ll have to fix in subsequent releases.

Finding Integer Undefined Behaviors Using Clang 2.9

My student Peng Li modified Clang to detect integer-related undefined behaviors in C and C++ code. We’ve released the code here, to go along with the recent LLVM 2.9 release. This checker has found problems in PHP, Perl, Python, Firefox, SQLite, PostgreSQL, BIND, GMP, GCC, LLVM, and quite a few other projects I can’t think of right now.

Of course, undefined behaviors are not all created alike. The next thing that someone should do is combine this work with a tainting engine in order to find dangerous operations that depend on the results of operations with undefined behavior. For example, take this code:

a[1<<j]++;

If it is ever the case that j is negative or j ≥ 32 (assuming a 32-bit int), then a shift-related undefined behavior occurs. The result of this operation is used as an array index, which is clearly dangerous. It may even be exploitable, depending on what a particular C/C++ implementation does with out-of-bounds shifts. Other dangerous operations include memory allocations and system call arguments.
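
For illustration, a defensive rewrite that avoids both the shift-related undefined behavior and the out-of-bounds index might look like the sketch below; N, a, and bump_entry are hypothetical names and a 32-bit int is assumed:

#define N 64

static int a[N];

void bump_entry (int j) {
  /* requiring j < 31 also rules out 1 << 31, which overflows a
     32-bit signed int */
  if (j >= 0 && j < 31 && (1 << j) < N) {
    a[1 << j]++;
  }
  /* else: report the error instead of executing undefined behavior */
}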

One of my favorite examples of a project getting hosed by an integer undefined behavior is here. An apparently harmless refactoring of code in Google’s Native Client introduced an undefined behavior that subverted the fundamental sandboxing guarantee. This is pernicious because the (flawed) check is sitting right there in the source code; it is only in the compiler output that the check is a nop. If their regression tests used our tester, this problem would almost certainly have been caught right away.

Mistranslation

[Warning: spoilers for Shutter Island below.]

Over the last few nights I watched Shutter Island, a nice atmospheric horror film. Near the end someone pulls out a chalkboard and writes down the names of four important characters, which turn out to contain two pairs of anagrams. This is ridiculous: puzzles of this sort only work when the viewer has a reasonable chance of figuring out the answer in advance. Of course, none of the four names had appeared in writing on the screen at any point, much less all together. Suddenly it was obvious that Shutter Island had been adapted from a book, and also that the filmmakers had missed the opportunity to either turn this textual puzzle into a visual one, or else eliminate it as untranslatable. It’s not as if anagram names are a particularly strong plot device in the first place. Also the “good guy isn’t who he thinks he is” idea is a little old; I think Angel Heart — which I loved, mainly because Alan Parker is awesome — was the last time this one caught me by surprise, and only then because I was 18 when I saw it.