Nobody Expects the Spanish Inquisition, or INT_MIN to be Divided by -1

INT_MIN % -1 and INT_MIN / -1 in C/C++ are little gifts that keep on giving. Recently, Xi Wang has been using this construct to knock over languages implemented in C/C++. Then today Tavis Ormandy posted an excellent local DoS for a Windows 8 machine.

But the fun doesn’t stop there. For one thing, as I wrote a while ago, it’s not even clear that INT_MIN % -1 is undefined behavior in C99, C++98, or any earlier version. It is explicitly undefined in C11 and C++11.

Since undefined behavior is undefined, we get fun results from programs like this:

#include <limits.h>

int foo (int a, int b) {
  return a / b;
}

int main (void) {
  return foo (INT_MIN, -1);
}

Now we’ll compile at a few optimization levels using gcc 4.7.2 on either x86 or x86-64 and see what happens:

$ gcc -O0 intmin.c -o intmin
$ ./intmin 
Floating point exception (core dumped)
$ gcc -O1 intmin.c -o intmin
$ ./intmin 
Floating point exception (core dumped)
$ gcc -O2 intmin.c -o intmin
$ ./intmin 
$

Why does it crash at low optimization levels and not at high ones? It happens that -O2 is the first optimization level where the call to foo() is inlined, exposing its guts to constant propagation and folding, with the result that no idiv needs to be executed, and thus no exception gets thrown. A recent build of Clang has exactly the same behavior. If we replace / with %, the behavior is unchanged (under both compilers).

What should we take away from this? A few things:

  • If you can’t prove that the division and modulus operations in your C/C++ code never see these operand values, explicit checks need to be added (a sketch of such a check appears after the sanitizer output below).
  • If you are testing a C/C++ program, try to get it to perform these operations, but keep in mind that all such testing can be unreliable due to the effect shown above.
  • Use Clang’s new -fsanitize=integer mode. It’s not in 3.2 but it’s in trunk and should be in 3.3 when it comes out in a few months. The integer sanitizer is based on the excellent work that Will Dietz has done in getting IOC into LLVM’s trunk.

Here’s what happens:

$ clang -fsanitize=integer intmin.c -o intmin
$ ./intmin 
intmin.c:5:12: runtime error: division of -2147483648 by -1 cannot be represented in type 'int'
Floating point exception (core dumped)

How is this different from the errors above? First, we get the nice diagnostic. Second, we’re guaranteed to get the error regardless of optimization level.
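Returning to the first item in the list above, here’s a rough sketch (mine, not anything blessed by a standard or a compiler) of what an explicit check might look like:

#include <limits.h>

/* Returns 0 and stores the quotient through result on success;
   returns -1 when the division would be undefined. A minimal sketch:
   real code needs a policy for these cases, not just an error code. */
int checked_div (int a, int b, int *result) {
  if (b == 0) return -1;                   /* division by zero */
  if (a == INT_MIN && b == -1) return -1;  /* quotient unrepresentable */
  *result = a / b;
  return 0;
}

An analogous check works for %, since INT_MIN % -1 trips over the same hardware behavior.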

UPDATE: A comment on Xi’s post shows how to crash a Bash shell. For example, on my Ubuntu 12.10 machine:

$ ($((-2**63/-1)))
Floating point exception (core dumped)
$ bash -version
GNU bash, version 4.2.37(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ 

The behavior on Ubuntu 12.04 is the same. I would love to see more examples like this. If you know of one, please leave a comment.

C and C++ Aren’t Future Proof

A C or C++ program is expected to follow a collection of rules such as “don’t access out-of-bounds array elements.” There are a lot of these rules and they are listed (or are implicit) in the various language standards. A program that plays by all of these rules—called a conforming program—is one where we might, potentially, be able to make a guarantee about the program’s behavior, such as “This program doesn’t crash, regardless of input.” In contrast, when a C or C++ program breaks a rule, the standard doesn’t permit us to say anything about its behavior.

So where’s the problem? It comes in three parts:

  1. Some of the rules imposed by the C and C++ standards are routinely violated. For example, it is not uncommon to see creation (but not dereference) of invalid pointers, signed integer overflow, and violations of strict aliasing rules (a sketch illustrating all three appears after this list).
  2. Programmers expect a C/C++ implementation to behave in a certain way even when some of the rules are violated. For example, many people expect that creating an invalid pointer (but, again, not dereferencing it) is harmless. Program analyzers that warn about these problems are likely to lose users.
  3. C/C++ compilers have a standard-given right to exploit undefined behaviors in order to generate better code. They keep getting better and better at this. Thus, every year, some programs that used to work correctly become broken when compiled with the latest version of GCC or Clang or whatever.
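To make the first item concrete, here is a sketch of my own (not drawn from any particular real program) showing all three routinely violated rules in a few lines:

#include <limits.h>

int a[10];

void examples (void) {
  int *p = a + 11;        /* undefined: points more than one past the
                             end of a[], even though it is never
                             dereferenced */
  int x = INT_MAX;
  x = x + 1;              /* undefined: signed integer overflow */
  float f = 1.0f;
  int bits = *(int *)&f;  /* undefined: violates strict aliasing */
  (void)p;
  (void)bits;
}

A compiler is free to do anything at all with this function, and yet code much like it compiles and “works” every day.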

This propensity for today’s working programs to be broken tomorrow is what I mean when I say these languages are not future proof. In principle a big C/C++ program that has been extensively tested would be future-proof if we never upgraded the compiler, but this is often not a viable option.

There is a long, sad history, over the last 10 years, of programmers becoming seriously annoyed at the GCC developers as GCC’s increasingly sophisticated code generation exploits the undefinedness of signed integer overflows. Similarly, any time a compiler starts to do a better job at interprocedural optimization (this has recently been happening with LLVM, I believe), a rash of programs that do stupid stuff like not returning values from non-void functions breaks horribly. Programmers used to think it was OK to read uninitialized storage, and then compilers began destroying code that did this.

Let’s look at a specific example. In a recent series of posts (1, 2, 3), Pascal Cuoq has been using a formal verification tool called Frama-C to verify zlib. Why zlib? First, it’s not that big. Second, it’s ubiquitous. Third, it is believed to be high quality—if we ignore a build problem on Solaris, the last security issues were fixed in 2005. I would guess that it would be difficult to find a widely-used library that is clearly more solid than zlib.

So what kinds of problems has Pascal found in this solid piece of C code? Well, so far nothing absolutely awful, but it does appear to create invalid pointers and to compare these against valid pointers. Is this bad? That depends on your point of view. It is possible (and indeed likely) that no current compiler exploits this undefined behavior. On the other hand, it is not straightforward to perform formal verification of zlib unless we treat it as being written in a language that is quite similar to C, but that assigns a semantics to invalid pointers. Furthermore, a new compiler could show up at any time that does something horrible (like opening an exploitable vulnerability) whenever zlib computes an invalid pointer.

Of course zlib isn’t the real problem; it’s small, and probably pretty close to being correct. The real problem is that there are billions of lines of C and C++ out there. For every thousand lines of existing code there are probably a handful of undefined behavior time bombs waiting to go off. As we move forward, one of these things has to happen:

  1. We ditch C and C++, and port our systems code to Objective Ruby or Haskell++ or whatever.
  2. Developers take undefined behavior more seriously, and proactively eliminate these bugs not only in new code, but in all of the legacy code.
  3. The C/C++ standards bodies and/or the compiler writers decide that correctness is more important than performance and start assigning semantics to certain classes of undefined operations.
  4. The undefined behavior time bombs keep going off, causing minor to medium-grade pain for decades to come.

I’m pretty sure I know which one of these will happen.

UPDATE from 1/24/2013: A comment on Hacker News pointed me to this excellent example of C not being future-proof. This is commonplace, folks. This particular undefined behavior, signed integer overflow, can be caught by our IOC tool which is now integrated into Clang as -fsanitize=integer, and will be in the 3.3 release.

The Space Child’s Mother Goose

I just noticed that this old favorite has been reprinted. A quick excerpt:

This is the Theory Jack built.

This is the Flaw
That lay in the Theory Jack built.

This is the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is Constant K
That saved the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is the Erudite Verbal Haze
Cloaking Constant K
That saved the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is the Turn of a Plausible Phrase
That thickened the Erudite Verbal Haze
Cloaking Constant K
That saved the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is Chaotic Confusion and Bluff
That hung on the Turn of a Plausible Phrase
That thickened the Erudite Verbal Haze
Cloaking Constant K
That saved the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is the Cybernetics and Stuff
That covered Chaotic Confusion and Bluff
That hung on the Turn of a Plausible Phrase
And thickened the Erudite Verbal Haze
Cloaking Constant K
That saved the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is the Button to Start the Machine
To make with the Cybernetics and Stuff
To cover Chaotic Confusion and Bluff
That hung on the Turn of a Plausible Phrase
And thickened the Erudite Verbal Haze
Cloaking Constant K
That saved the Summary
Based on the Mummery
Hiding the Flaw
That lay in the Theory Jack built.

This is the Space Child with Brow Serene
Who pushed the Button to Start the Machine
That made with the Cybernetics and Stuff
Without Confusion, exposing the Bluff
That hung on the Turn of a Plausible Phrase
And, shredding the Erudite Verbal Haze
Cloaking Constant K
Wrecked the Summary
Based on the Mummery
Hiding the Flaw
And Demolished the Theory Jack built.

Hiding Bugs from Branch Coverage

It’s hard to know when a piece of software has been sufficiently tested. Code coverage metrics are a partial answer. Basically, a coverage metric is a function that maps each execution to a set of coverage targets. For example, if we are performing function coverage, then the universe of targets consists of the functions from the source code (or object code), and an execution covers every function from which it causes at least one statement or instruction to execute.

Can a bug hide from function coverage? This isn’t hard:

extern void bug (void);
int x;  /* global flag */

void foo (void) {
  if (!x) bug();
}

As long as foo() is called with x having a non-zero value, the function will be covered but the bug will not fire. We can fix this particular problem by tightening up the coverage metric, for example by insisting on statement coverage, where each statement in the program is a separate coverage target.

Let’s talk now about branch coverage, which insists that every conditional branch executes both ways. This is very similar to statement coverage, except that it fixes a major hole: with statement coverage alone, you can get 100% coverage without ever executing the false side of a conditional that lacks an else branch.
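Here’s a small illustration of my own. Any test input with x != 0 executes every statement in bar(), giving 100% statement coverage, but the null dereference fires only when x == 0, which is exactly the case that branch coverage forces us to try:

void bar (int x) {
  int *p = 0;
  if (x) p = &x;
  *p = 1;  /* null dereference when x == 0 */
}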

Branch coverage is interesting because it’s about the strongest coverage metric that makes sense in practice. We can easily define metrics that have more coverage targets, such as path coverage, but achieving full coverage then becomes extremely difficult, and it also forces us to confront hard questions about the feasibility of paths. Here’s a page that provides a quick overview of the common and some not-so-common coverage metrics.

The rest of this piece is a partial list of ways that bugs can hide from branch coverage, and also some discussion of how we might go about finding these bugs.

Bug-Hiding Strategy #1: Concurrency

Hiding a bug using multithreading is really easy. For example:

void called_by_thread_1 (void) {
  p = NULL;
  ... do something quick ...
  p = q;
  ... do something slow ...
}

int called_by_thread_2 (void) {
  return *p;
}

Assume that p is an atomic variable and that q is never NULL. We can easily get full branch coverage without testing the case where called_by_thread_2() dereferences a null pointer, and in fact the odds are that this is exactly what will happen. Concurrent programs want specialized coverage metrics; here’s a paper that I like.

Bug-Hiding Strategy #2: Math Overflow

This code looks innocent:

int average (int x, int y)
{
  return (x+y)/2;
}

But unfortunately, on a platform with 32-bit ints, average (1000000000, 2000000000) returns -647483648, despite the fact that the actual result is perfectly representable in 32-bit two’s complement. The problem is that the intermediate value x+y overflows. This is undefined behavior in C/C++. In practice we expect wrapping behavior from this kind of overflow, but even wrapping produces the wrong answer here.

Common test coverage metrics, including very difficult ones like path coverage, do not help us find math overflow bugs. One possibility would be to add overflow checks to the source code and then aim for branch coverage, but this is horrible. One could also add overflow checks using the compiler and then aim for branch coverage of the object code; this probably only makes sense in conjunction with test-case generation tools such as Klee.
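To make the source-level option concrete, here is a sketch of what an explicit overflow check might look like; notice that the check itself adds exactly the kind of new branch that a coverage-driven test suite would then be forced to exercise:

#include <assert.h>
#include <limits.h>

int average_checked (int x, int y)
{
  /* fail loudly if x+y would overflow in either direction */
  assert (!((y > 0 && x > INT_MAX - y) ||
            (y < 0 && x < INT_MIN - y)));
  return (x + y) / 2;
}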

The real solution to math overflow bugs is systematic documentation and enforcement of constraints on integer values, supported by a tool such as Frama-C. For example, the average() function would be preceded by this annotation:

/*@ requires x + y <= INT_MAX; */

Adding x and y inside the annotation does not cause overflow problems because integer operations in the annotation language are the mathematical ones, not the C ones. The problem with writing these annotations is that it is very hard work.

Bug-Hiding Strategy #3: Loop Error

There is much potential for hiding bugs inside loops:

#include <assert.h>

int factorial (int n)
{
  assert (n < 13);
  int f = 1;
  do {
    f *= n;
    n--;
  } while (n>0);
  return f;
}

This code looks pretty good and, in fact, it returns the correct result for 0<n<13. The assertion protects against integer overflow on 32-bit machines. The remaining interesting case is n=0, for which this function is wrong: it returns 0 instead of 1. Branch coverage doesn’t help find this kind of bug.

Test coverage for loops is tricky; some sources recommend forcing each loop to execute 0 times, 1 time, and more than once, when possible.
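For this particular function, one possible fix (a sketch of mine; note that it also rejects negative arguments, which the original did not) is to use a loop structure that the “execute 0 times” rule would have forced us to test:

#include <assert.h>

int factorial (int n)
{
  assert (n >= 0 && n < 13);
  int f = 1;
  while (n > 0) {  /* for n == 0 the body never runs and f stays 1 */
    f *= n;
    n--;
  }
  return f;
}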

Bug-Hiding Strategy #4: Lookup Table

Certain kinds of codes get a good speedup by precomputing some sub-results and storing these in a lookup table. This is commonly seen for checksums, math library functions, and crypto codes. For example, this AES implementation has several lookup tables. They cannot be verified by casual inspection. If any entry in such a table is incorrectly computed or gets corrupted somehow, the resulting code will be wrong and branch coverage again does not help find the problem. The famous Pentium fdiv bug was precisely this kind of error.

We might insist that every entry in each lookup table is accessed during testing, or we might simply double-check that each table contains the proper values.
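The second option might look something like this sketch, where TABLE_SIZE, table, and slow_but_obvious() are hypothetical names standing in for the real table and a slow reference recomputation of its entries:

#include <assert.h>

#define TABLE_SIZE 256  /* hypothetical */

extern const unsigned table[TABLE_SIZE];  /* the precomputed table */
extern unsigned slow_but_obvious (int i); /* obviously-correct recomputation */

/* run once at startup or from the test suite */
void check_table (void)
{
  for (int i = 0; i < TABLE_SIZE; i++)
    assert (table[i] == slow_but_obvious (i));
}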

Bug-Hiding Strategy #5: Array Index

Code like this is pretty easy to write:

int update (int index, value_t value)
{
  if (index > incorrect_table_size) return ERROR;
  table[index] = value;
  return OK;
}

Here we’ll need to test the OK and ERROR conditions, but if we’re comparing against the wrong table size, RAM corruption can result from code that received 100% branch coverage. The table size is unlikely to be wrong for a statically sized table, but I’ve seen real programs where a piece of code failed to properly notice when a dynamically sized table changed size.
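Here is a sketch of a more defensive version, assuming table_size is the size that actually describes table[]; note that the original also let negative indices through and used > where >= is needed:

extern int table_size;  /* the real, current size of table[] */

int update (int index, value_t value)
{
  if (index < 0 || index >= table_size) return ERROR;
  table[index] = value;
  return OK;
}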

Bug-Hiding Strategy #6: Path-Sensitive Conditionals

This code is highly contrived, but it illustrates the problem:

int silly (int a) {
  int *p = &x;
  if (a > 10) p = NULL;
  if (a < 20) return *p;
  return 0;
}

Assume x is a global int. If we test this code with parameter values of 5 and 25, branch coverage is satisfied, and yet we will not trigger the undefined behavior that happens when a==15. Besides going for path coverage, I don’t know of a good way to find these bugs.

Bug-Hiding Strategy #7: Heap Shape

Tons of problems can hide in heap data structures. For example, assume that we have a red-black tree implementation where the delete() operation leaves the tree in a broken state where subsequent insert() calls will fail. We could easily end up with 100% branch coverage of insert(), delete(), and the other operations without even actually testing the case where a node is deleted and then inserted.
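A test that catches this has to exercise the interaction between operations, not just the operations themselves. Here is a sketch against a hypothetical red-black tree API (rb_tree, rb_init, rb_insert, rb_delete, and rb_check_invariants are all made-up names):

#include <assert.h>

void test_delete_then_insert (void)
{
  rb_tree t;
  rb_init (&t);
  rb_insert (&t, 1);
  rb_insert (&t, 2);
  rb_insert (&t, 3);
  rb_delete (&t, 2);                  /* suppose this leaves hidden damage */
  rb_insert (&t, 4);                  /* only this call can expose it... */
  assert (rb_check_invariants (&t));  /* ...if we check the whole structure */
}

Randomized testing that interleaves operations and checks a strong structural invariant after each one tends to work well here.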

Conclusions

100% branch coverage is often attainable at the unit level, but almost never at the system level. Regardless, it is insufficient for finding many kinds of bugs (the list here is just off the top of my head; if you have a favorite that I missed, please let me know).

There’s no general-purpose way to find these bugs, but rather a variety of partial solutions. First, we can pay close attention to idioms such as lookup tables that can subvert branch coverage. Second, we can add extra paths to the code in an attempt to make implicit predicates explicit, such as integer overflow checks. Third, we can play with the coverage metric, for example by augmenting branch coverage with additional loop coverage targets. Finally, we can add annotations to our code to express necessary preconditions, postconditions, and loop invariants, and then use a tool to check these.

Life After Tenure

I wanted to dequeue one more post before the semester gets going for real. This is a collection of random observations about the contrast between being an untenured and tenured professor.

Tenure is an up-or-out process: either you get it or you lose your job. This is stressful for everyone. Getting tenure is excellent in that this source of stress is removed, but it is also a little bit confusing since we are suddenly forced to ask what exactly it is that we are doing with our lives now that—at long last—we aren’t living up to some random collection of external expectations. You don’t have to look too hard to find examples of people who have voluntarily left academia after getting tenure; my guess is that this effect is sometimes a factor.

Basically everyone who gets tenure, including me, finds him/herself continuing to get busier every year. Not only do the demands on our time increase, but we can no longer fall back on the convenient selfish excuse “sorry—can’t possibly do that time-consuming task while I’m on the tenure track.” The constantly overloaded TODO list gets old, and it also makes it hard to get actual research done. It’s no coincidence that a lot of tenured professors rely on students to do all of the technical work associated with their research programs. Untenured professors not only have more time to do research but also may be reluctant to risk their careers by putting students on too many critical paths.

I used to knock off work between 11 and midnight, whereas I now generally quit between 10 and 11. Additionally, during many of my pre-tenure years I got very little exercise, which resulted in high blood pressure and various other problems; now I spend some time staying in shape. On the other hand, I don’t take as many days off now, probably due to more stuff going on in general and the kids being on a school schedule. Even so, I probably work fewer hours per day now than I used to.

I’m trying to write fewer papers now, and to make each paper a really solid one. This blog is part of that effort: it siphons off a good deal of writing energy. At some point leading up to tenure I realized that writing a mediocre paper makes no sense; it is almost as much work as writing a good one and it dilutes whatever reputation I may be managing to build up. A few of my more recent papers, most notably the Csmith one, are the result of a large amount of effort that would simply not have made sense to allocate to a single paper before tenure. I could have turned Csmith into a series of papers, but why? I can’t see that as being useful to anyone. There’s an avalanche of crappy papers and I would love to not be part of it. It is possible that I’m doing my PhD students a disservice by publishing less; I try to deal with this by being up-front with them about the academic track: if they want to get on it, they need to learn to write fairly rapidly and extremely well. I’m happy to work with them on that project, but I’m not going to push. The impetus needs to come from them.

When I got tenure, one of my plans was to avoid letting my teaching be compromised by research pressures. I’m afraid that I have not accomplished this goal. I do a pretty good job teaching, but no better than I used to. I do, however, feel more free to take risks in the classroom by trying teaching techniques that aren’t totally tried and true and by throwing out old assignments more often. This causes me to take a hit on teaching evaluations sometimes, which is OK.

Finally, I’d like to say that I now take more time to think and read, but I’m not sure that this is the case. Probably the main thing that I do differently along these lines is write blog posts, which is certainly useful as a way to motivate thinking. I read relatively few books and papers related to my research, except in occasional bursts where I need to learn some new research thread or subfield.

When Software Ages Badly

In some respects, software ages gracefully: it generally starts out working poorly but gets better over time as bugs are fixed. Unlike hardware, there’s no physical wearing out of parts. This post is about a few ways in which software doesn’t get better with age.

In The Mythical Man-Month, Brooks observes that any sufficiently complex software system will tend to have an “irreducible number of errors”—a reference to the fact that in a crufty old software base, any bug fix is likely to break something else. It is possible to fight this phenomenon through aggressive modularization and refactoring, but these are expensive.

The second way for old software to become less reliable is when the meaning of its code changes. Obviously this can happen when the software is ported to a new architecture where, for example, pointers or integers have a different size. Also, a library or operating system update—even an obvious fix for a simple bug—can break an application that relied on the previous semantics. A compiler upgrade can break working code in any number of insidious ways. For example, an improved optimizer can make code run faster, exposing previously latent race conditions. A C/C++ compiler’s exploitation of undefined behavior can be ratcheted up a notch, breaking code that previously worked. This has been continuously happening for at least a decade and it’s not going to stop in the foreseeable future.

The inputs that are provided to a software system can change in harmful ways. The best example of this that I can think of is when a new class of attack becomes well understood, such as stack buffer overflows, heap overflows, and integer overflows. When a nasty new fuzzing tool appears, a previously solid-seeming piece of code can be shown to be not correct at all. The input domain can also change shape when a piece of software is reused in a new environment; Ariane 5 flight 501 is an excellent example.

Finally, in the long run, good old software rot can happen. Programming idioms, programming languages, software architectures, and other aspects of code are always evolving and eventually progress gets the better of almost any piece of software.

We’ve been writing software for hardly more than half a century, and I suspect that the vast majority of lines of code in existence were written since 2000. What will the world look like when we’ve had software for as long as we’ve had the printing press? What about when we’ve had computer languages for as long as we’ve had spoken languages? How are we going to perform refactoring on a code base of a hundred billion lines? Vernor Vinge envisions professional programmer-archaeologists whose job it is to deal with these systems; I’m not sure we’re that far off from needing these people now.