Finding and Understanding Bugs in C Compilers

Today we finished preparing the camera-ready version of our paper that will appear in PLDI 2011. I’m pretty happy with it. Here’s the abstract:

Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undefined and unspecified behaviors that would destroy its ability to automatically find wrong-code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.

Actually, as of today the total is 344 reported bugs, but at some point we had to freeze the statistics in the paper.

Race Condition vs. Data Race

A race condition is a flaw that occurs when the timing or ordering of events affects a program’s correctness. Generally speaking, some kind of external timing or ordering non-determinism is needed to produce a race condition; typical examples are context switches, OS signals, memory operations on a multiprocessor, and hardware interrupts.

A data race happens when there are two memory accesses in a program where both:

  • target the same location
  • are performed concurrently by two threads
  • are not reads
  • are not synchronization operations

There are other definitions but this one is fine; it’s from Sebastian Burckhardt at Microsoft Research. Two aspects of the definition require a bit of care. “Concurrently” means there wasn’t anything like a lock that forced one operation to happen before or after the other. “Are not synchronization operations” refers to the fact that a program probably contains special memory operations, such as those used to implement locks, that are not themselves synchronized. By definition, these do not trigger data races.

In practice there is considerable overlap: many race conditions are due to data races, and many data races lead to race conditions. On the other hand, we can have race conditions without data races and data races without race conditions. Let’s start with a simple function for moving money between two bank accounts:

transfer1 (amount, account_from, account_to) {
  if (account_from.balance < amount) return NOPE;
  account_to.balance += amount;
  account_from.balance -= amount;
  return YEP;
}

Of course this is not how banks really move money, but the example is useful anyway because we understand intuitively that account balances should be non-negative and that a transfer must not create or lose money. When called from multiple threads without external synchronization, this function admits both data races (multiple threads can concurrently try to update an account balance) and race conditions (in a parallel context it will create or lose money). We can try to fix it like this:

transfer2 (amount, account_from, account_to) {
  atomic {
    bal = account_from.balance;
  }
  if (bal < amount) return NOPE;
  atomic {
    account_to.balance += amount;
  }
  atomic {
    account_from.balance -= amount;
  }
  return YEP;
}

Here “atomic” is implemented by the language runtime, perhaps simply by acquiring a thread mutex at the start of the atomic block and releasing it at the end, perhaps using some sort of transaction, or perhaps by disabling interrupts — for purposes of the example it doesn’t matter as long as the code inside the block executes atomically.
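
For concreteness, here is a minimal C-with-pthreads sketch of the mutex flavor of this idea; the account struct, its long balance field, and the single global lock are invented for illustration and are not part of the original pseudocode.

#include <pthread.h>

/* Hypothetical account type; the field name is invented for this sketch. */
struct account { long balance; };

/* One global lock stands in for the runtime's "atomic" blocks. */
static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

/* transfer2 with each atomic block realized as a lock/unlock pair. */
int transfer2 (long amount, struct account *account_from, struct account *account_to)
{
  long bal;

  pthread_mutex_lock (&atomic_lock);
  bal = account_from->balance;
  pthread_mutex_unlock (&atomic_lock);

  if (bal < amount)
    return 0; /* NOPE */

  pthread_mutex_lock (&atomic_lock);
  account_to->balance += amount;
  pthread_mutex_unlock (&atomic_lock);

  pthread_mutex_lock (&atomic_lock);
  account_from->balance -= amount;
  pthread_mutex_unlock (&atomic_lock);

  return 1; /* YEP */
}

Each atomic block becomes one lock/unlock pair, which is exactly why this version still permits the race conditions discussed next.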

transfer2 has no data races when called by multiple threads, but obviously it is an extremely silly function containing race conditions that will cause it to create or lose money almost as badly as the unsynchronized function. From a technical point of view, the problem with transfer2 is that it permits other threads to see memory states where a key invariant — conservation of money — is broken.

To preserve the invariant, we have to use a better locking strategy. As long as atomic’s semantics are to end the atomic section on any exit from the block, the solution can be blunt:

transfer3 (amount, account_from, account_to) {
  atomic {
    if (account_from.balance < amount) return NOPE;
    account_to.balance += amount;
    account_from.balance -= amount;
    return YEP;
  }
}

This function is free of data races and also of race conditions. Can we change it a bit to make an example that has data races but no race conditions? That is simple:

transfer4 (amount, account_from, account_to) {
  account_from.activity = true;
  account_to.activity = true;
  atomic {
    if (account_from.balance < amount) return NOPE;
    account_to.balance += amount;
    account_from.balance -= amount;
    return YEP;
  }
}

Here we are setting flags indicating that some sort of activity occurred on the accounts. Are the data races on these flags harmful? Perhaps not. For example, in the evening we might shut down all transaction-processing threads and then select 10 random accounts that are flagged as having had activity for manual auditing. For this purpose, the data races are entirely harmless.

We’ve ended up covering all possibilities:

                    data race    no data race
race condition      transfer1    transfer2
no race condition   transfer4    transfer3

The point of this exercise was to look at transfer2 and transfer4, which illustrate that freedom from data races is a very weak property that is neither necessary nor sufficient for establishing the concurrency correctness of a computer program. Why does anyone care about data races at all? As far as I can tell, there are two reasons. First, a data-race-free program can, with some care, be shown to be independent of the whims of whatever weak memory model it sits on. This lets us breathe easier because reasoning about racy programs running on weak memory models is nearly impossible for humans. Second, data races are easy to find automatically — notice from the definition that detecting them requires only monitoring memory accesses; no knowledge of application semantics is required. Race conditions, on the other hand, are intimately tied to application-level invariants, making them far thornier to reason about (quick — what are all the invariants in Firefox? In Linux? In MySQL?).

Everything in this post is pretty obvious, but I’ve observed real confusion about the distinction between data race and race condition among people who should know better (for example because they are doing research on concurrency correctness). Muddying the waters further, even when people are perfectly clear on the underlying concepts, they sometimes say “race condition” when they really mean “data race.” Certainly I’ve caught myself doing this.

Guidelines for Teaching Assistants

I’ve been teaching university-level courses for the last nine years, usually with the support of teaching assistants (TAs): students who get paid to do things like grading, holding office hours, fielding email questions, making and debugging assignments, proctoring exams, and perhaps even giving a lecture when I’m sick or traveling.

At the start of each semester I sit down with my new TAs and go over some guidelines. Unfortunately, this is a chaotic time and I’m pretty sure I’ve never managed to give the same advice twice. This piece is an attempt to put all of my guidelines in one place. I’m certainly leaving things out — please fill in details in comments. This is all somewhat specific to computer science classes with programming assignments.

The Golden Rule of the CS TA: Never Type at a Student’s Keyboard

When a student doesn’t get it, the temptation to just sit down and take over can be overwhelming. But you have to resist because as long as the student is driving, the student is thinking. When you take over, their brain switches into passive mode.

Progressive Hints

When a student is struggling on an assignment, first give oblique hints: “Did you read Chapter 7?” “Did you search for that on Wikipedia?” If this fails to help, be more concrete: “Are you checking all return codes?” “Did you run it under Valgrind?” Finally, and only when you are convinced the student has made a genuine effort to solve the problem, give a more direct answer, but be sure that it only helps the student past the current roadblock: “Your foo() function has to return 0 on success, not 1.” “You’re dereferencing a NULL pointer.” With time, you will learn which students can run with a quick hint and which ones probably need stronger advice.

My model for progressive hints is based on the Invisiclues packets for old Infocom games — so awesome.

Amend, But Communicate

Often, when answering a student’s question, it becomes clear that some part of the assignment or lab was underspecified. In cases like this, your answer to the student is effectively a supplemental part of the assignment. This is fine, but when it happens you need to pass the additional information along to the instructor and/or the rest of the class.

When Things Aren’t Working, Tell Me

Sometimes there are infrastructure problems: a bunch of machines in the lab are down, or some critical piece of software doesn’t work correctly. Other times, there is a systematic problem in an assignment or lab: a crucial ambiguity or error. When you notice anything like this happening, let me know — I can’t fix problems that I’m unaware of.

Treat Students With Respect

I’ve only seen one problem out of probably 30 TAs, and it wasn’t that serious. But still this is worth saying.

Don’t Play the 10,000 Questions Game

In a big class there are always one or two students who, instead of reading the assignment, just email the instructor or TAs. If you answer, they immediately have another question. This goes on forever — I do not exaggerate, they apparently never eat or sleep. The solution is to recognize these cycles and delay the next response for 12 hours or so. Sooner or later the student realizes that it’ll be more efficient to just read the assignment.

Do the Assignment First

This sucks, but you have to do each assignment, or at least the ones for which you are the principal TA, before it is handed out. Otherwise, it is effectively impossible to help students get past roadblocks they encounter. This mainly applies to courses like operating systems where the assignments may be really hard. For freshman- and sophomore-level programming courses, any conceivable problem can be eyeballed in a few seconds.

Don’t Miss Office Hours

At least give advance warning if it’ll be impossible to show up so students don’t waste time coming to the lab.

Bounce Problem Cases Up To The Instructor

You’re not being paid enough to deal with difficult students. If anyone is pushy about a grade, rude about any issue, or anything else, simply refer the student to me. This almost never happens.

Don’t Screw Up the File

There is one grade spreadsheet, and it has a specific file name. Do not put grades in any other file, do not create duplicate files, do not put unlabeled columns in the spreadsheet. Label each new column carefully. Every time you finish working with the file, make sure the permissions are such that the other TAs and I can read and write it, but nobody else can.

Anonymize Grades Carefully

When posting grades, delete names, student ID numbers, and email addresses before exporting to HTML.

Never Deal With Cheating

This is my job. If you suspect cheating, talk to me; but never bring it up with a student.

Be Timely

Unless I say otherwise, labs should be graded, and grades entered, within a week of being handed in.

Do Not Flake Out

TAs are themselves students and are subject to getting overwhelmed like anyone else. If this seems to be happening, come talk to me. If you just disappear in the middle of the semester and stop doing your job, I will seriously make sure you never work in this town again.

The Simplest Queue?

My student Jianjun is proving things about ARM executables that handle interrupts. It’s very difficult work, so when I asked him to write up a “simple” case study where an interrupt and the main context communicate through a ring buffer, I thought it would be helpful if I handed him the simplest possible queue that is at all realistic. Since queues are already pretty simple, there seemed to be only two tricks worth playing: force queue size to be a power of two, and waste one slot to simplify full-queue detection. Here’s my code:

#include "q.h"

#if ((QSIZE)<2)||(!((((QSIZE)|((QSIZE)-1))+1)/2==(QSIZE)))
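/* The check works because for a power of two, (QSIZE|(QSIZE-1))+1 is
   exactly 2*QSIZE, so halving it gives back QSIZE; any other value of
   QSIZE fails this identity. */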
#error QSIZE must be >1 and also a power of 2
#endif

#define MASK ((QSIZE)-1)

static int q[QSIZE];
static int head, tail;

static int
inc (int x)
{
  return (x + 1) & MASK;
}

int
full (void)
{
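  /* Full when advancing head would run into tail: one slot is
     deliberately wasted so that full and empty are distinguishable. */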
  return inc (head) == tail;
}

int
mt (void)
{
  return head == tail;
}

int
enq (int item)
{
  if (full ())
    return 0;
  q[head] = item;
  head = inc (head);
  return 1;
}

int
deq (int *loc)
{
  if (mt ())
    return 0;
  *loc = q[tail];
  tail = inc (tail);
  return 1;
}

int
qents (void)
{
  int s = head - tail;
  if (s < 0)
    s += (QSIZE);
  return s;
}

The header file simply defines QSIZE and gives prototypes for the functions. Did I miss any good tricks? The code coming out of the compilers I have sitting around isn’t quite as clean as I’d have hoped. There is some redundancy in the code (enq() computes inc(head) twice, once inside full() and once to advance head), but eliminating it should be a simple matter for the compiler.
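
For reference, the header amounts to little more than the following sketch; the particular QSIZE value below is an arbitrary example, since nothing above depends on it.

/* q.h -- plausible reconstruction: defines QSIZE and declares the queue API. */
#ifndef Q_H
#define Q_H

#define QSIZE 32  /* arbitrary example value; must be a power of 2 */

int full (void);
int mt (void);
int enq (int item);
int deq (int *loc);
int qents (void);

#endif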

Negative Correlation Achieved

Recently I reviewed 19 papers that were submitted to CAV 2011. This is the first time I’ve been involved with a pure verification conference, and consequently I greatly enjoyed reading the papers because almost every one contained something new. Each time I submitted a review I looked at the ones that were already submitted for that paper, and kept being surprised at how often I disagreed with the other reviewers. Finally I just computed the correlation between my score and the average of the other reviewers’ scores, and the result was an astonishingly low -0.07.

In contrast, when I review papers at a “systems” venue (doesn’t matter if it’s operating systems, embedded systems, or something else) it’s not at all uncommon for me to give exactly the same score as all, or almost all, of the other reviewers. My guess is that at the last five systems conferences I was involved with, the correlation between my score and the average of the other reviewers’ scores was higher than 0.8.

I’m not sure there’s a take-away message here other than “communities have very different evaluation standards for papers.” However, this does shed a bit of light on why it can be quite difficult to switch areas. Closely related: Doug Comer’s excellent piece on how to insult a computer scientist.

The Little C Function From Hell

The other day a student and I were trying to understand a subtle part of the C standard. Often, the easiest way to clarify this kind of issue is to recognize that compiler writers have already grappled with it — so just write some code and see what various compilers do with it. I wrote this function:

int foo (char x) {
  char y = x;
  return ++x > y;
}

Since the expression ++x evaluates to the incremented value of x, it is clear that this function should return “1” for most values of x. The question is: What does it compute for CHAR_MAX? One possibility is that the function reliably returns “0” for that input; the other is that ++x is undefined on platforms where char is a signed type. For completeness, here’s the test harness that prints foo()’s output for all possible inputs:

#include <limits.h>
#include <stdio.h>

int main (void) {
  int i;
  for (i=CHAR_MIN; i<=CHAR_MAX; i++) {
    printf ("%d ", foo(i));
    if ((i&31)==31) printf ("\n");
  }
  return 0;
}

This code only works if char is narrower than int, but I’ve never seen a platform where that is not the case.

The first sign that something odd was going on appeared when I compiled the code using Clang:

regehr@home:~$ clang -O foo.c -o foo
regehr@home:~$ ./foo
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Not cool — the function is supposed to mostly return true. But then I realized that my default Clang was out of date (2.7) so I tried a very recent Clang snapshot (rev 126534):

regehr@home:~$ clang -O0 overflow.c -o overflow
regehr@home:~$ ./overflow
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
regehr@home:~$ clang -O1 overflow.c -o overflow
regehr@home:~$ ./overflow
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The result (look at the last character) changed when we changed the optimization level — this is OK if incrementing CHAR_MAX is undefined, and is a compiler bug otherwise.

The Intel C compiler (12.0.2 for x86-64) has similar behavior:

[regehr@bethe ~]$ icc -O0 foo.c -o foo
[regehr@bethe ~]$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
[regehr@bethe ~]$ icc -O foo.c -o foo
[regehr@bethe ~]$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

A very recent GCC (rev 170512 for x86) gives a consistent output:

regehr@home:~$ current-gcc -O0 foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
regehr@home:~$ current-gcc -O2 foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

However, the story changes if we ask it not to perform function inlining:

regehr@home:~$ current-gcc -O2 -fno-inline foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

So far, Clang 2.7 appears to be wrong, but all other observed behavior is consistent with ++x being undefined when x is CHAR_MAX. Then I tried CompCert and things took a turn for the worse:

regehr@home:~$ ccomp foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

This is very odd because CompCert contains a verified version of C’s tricky implicit casts — the exact thing that foo() was designed to test in the first place.

To make a long story short: when the “char” type is narrower than the “int” type and when x has type “signed char” and value CHAR_MAX, ++x is well-defined by both ANSI C and C99. We know that it is well-defined because:

  1. The standard says: “The expression ++E is equivalent to (E+=1).”
  2. The standard says: “A compound assignment of the form E1 op= E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.”
  3. The RHS of that simple assignment expression, “E1 op (E2)”, is subject to C’s “usual arithmetic conversions.”
  4. The usual arithmetic conversions ensure that when the two operands of a “+” operator have type signed char, both are promoted to signed int before the addition is performed.
  5. When int is wider than char, there is no possibility of the resulting addition overflowing. Thus, the behavior is well-defined and every correct compiler must emit this output:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

In other words, to evaluate ++x > y when both variables have type signed char and value CHAR_MAX:

  1. Signed char 127 is promoted to signed int 127.
  2. Signed int 127 is incremented to signed int 128.
  3. Signed int 128 is cast to signed char -128, which is the new value for x and also the value of the subexpression ++x. (Update: As Mans points out in comment #5, the result of this type cast is implementation defined. All common C implementations define the behavior to be truncation of a 2’s complement integer.)
  4. Signed char -128 is promoted to signed int -128.
  5. Signed int -128 > signed int 127 is evaluated to 0.
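
Spelled out in C with the implicit conversions written as explicit casts, the computation looks roughly like this; it is a sketch that assumes plain char is signed and narrower than int, and foo_explicit is just a name for the rewritten function.

#include <limits.h>
#include <stdio.h>

/* foo() with the implicit conversions made explicit.
   Assumes plain char is signed and narrower than int. */
int foo_explicit (char x) {
  char y = x;
  int promoted = (int) x + 1;  /* steps 1-2: promote to int, increment */
  x = (char) promoted;         /* step 3: implementation-defined truncation */
  return (int) x > (int) y;    /* steps 4-5: promote both, compare */
}

int main (void) {
  /* On the platforms discussed above, this prints 0. */
  printf ("%d\n", foo_explicit (CHAR_MAX));
  return 0;
}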

But what about the fact that none of the four compilers I commonly use reliably returns the correct result?

  • In GCC the bug is known and has existed for some time.
  • In LLVM/Clang the bug was not known but was fixed in less than 24 hours.
  • The Intel compiler is wrong too.
  • In CompCert there is no bug. However, there is an unfortunate interaction between its definition of “char” to be unsigned and the signed values for CHAR_MIN and CHAR_MAX found in my Linux machine’s limits.h file. Verifying a compiler is an extremely difficult problem and verifying its entire environment (header files, libraries, assemblers, I/O routines, etc.) is pretty much an open problem. This is why we test.

That’s a lot of trouble being caused by a two-line function. C may be a small language, but it’s not a simple one.

[Thanks to Chucky Ellison from UIUC and to Xavier Leroy for helping me puzzle through all this.]

Better Testing With Undefined Behavior Coverage

[The bit puzzle results are based on data from Chad Brubaker and the saturating operation results are based on data from Peng Li. They are respectively an undergrad and a grad student in Utah’s CS program.]

Klee is a tool that attempts to generate a collection of test cases inducing path coverage on a system under test. Path coverage means that all feasible control flow paths are executed. It is a strong kind of coverage, but still misses bugs. One way to improve Klee would be to add support for different kinds of coverage metrics: weaker ones like statement coverage would scale to larger programs, and stronger ones such as boundary-value coverage would find more bugs in small codes.

A different way to improve Klee is to continue to target path coverage, but alter the definition of “path.” For example:

  • When testing an x86-64 binary containing a cmov instruction, we could make sure to execute both its condition-true path and condition-false path.
  • When testing the C expression foo(bar(),baz()), we could make sure to test evaluating bar() and baz() in both orders, instead of just letting the compiler pick one.

This piece proposes undefined behavior coverage, which simply means that for any operation that has conditionally-defined behavior, the well-defined and the undefined behaviors are considered to be separate paths. For example, the C expression 3/y has two paths: one where y is zero and the other where y is non-zero.

Obviously, undefined behavior coverage only makes sense for languages such as C and C++ that admit operations with undefined behavior. An undefined behavior, as defined by the C and C++ standards, is one where the language implementation can do anything it likes. The point is to make the compiler developers’ job easier — they may simply assume that undefined behavior never happens. The tradeoff is that the burden of verification is shifted onto language users.

Undefined behavior coverage makes sense for what I call type 2 functions: those whose behavior is conditionally well-defined.

An Example

Here’s a simple C function:

int add_and_shift (int x, int y, int z) {
  return (x+y)<<z;
}

Due to C’s undefined behaviors, this function has a non-trivial precondition:

0 ≤ z < sizeof(int)*CHAR_BIT
INT_MIN ≤ x+y ≤ INT_MAX

(This is for ANSI C; in C99 the precondition is stronger and quite a bit more complicated, but we won’t worry about that.) If the precondition is not satisfied, the function’s return value is unpredictable. In fact, it’s a bit worse than that: as soon as the program executes an undefined behavior the C implementation is permitted to send email to the developer’s mother (though this hardly ever happens).

The point is that although add_and_shift() seems to admit a single path, it really has a number of additional paths corresponding to failed preconditions for its math operators. If we fail to test these paths, we can miss bugs. Since the precondition checks for math operators in C/C++ are pretty simple, we can just add them in an early phase of the compiler, and that’s exactly what Peng’s hacked version of Clang does.
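
To give a flavor of what those checks amount to, here is a rough source-level sketch of add_and_shift() with its ANSI C precondition made explicit; the real instrumentation is inserted by the modified Clang rather than written by hand, and the ub_trap() reporting function is invented for illustration.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical reporting function; the real checker reports differently. */
static void ub_trap (const char *msg) {
  fprintf (stderr, "undefined behavior: %s\n", msg);
  abort ();
}

/* add_and_shift() with its precondition checked explicitly.  Note that
   the overflow test must itself avoid performing an overflowing add. */
int add_and_shift_checked (int x, int y, int z) {
  if (z < 0 || z >= (int) (sizeof (int) * CHAR_BIT))
    ub_trap ("shift amount out of range");
  if ((y > 0 && x > INT_MAX - y) || (y < 0 && x < INT_MIN - y))
    ub_trap ("signed addition overflow");
  return (x + y) << z;
}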

Without undefined behavior checks, LLVM code for add_and_shift() looks like this:

define i32 @add_and_shift(i32 %x, i32 %y, i32 %z) nounwind readnone {
entry:
  %add = add i32 %y, %x
  %shl = shl i32 %add, %z
  ret i32 %shl
}

Obviously there’s just one path, and the test case that Klee picks to exercise this path is:

  • x = 0, y = 0, z = 0

Next, we compile the same function with undefined behavior checks and run Klee again. This time we get four test cases:

  • x = 0, y = 0, z = 0
  • x = 0, y = 0, z = 64
  • x = -2, y = INT_MIN, z = 0
  • x = 2, y = 1, z = 0

The first three tests are exactly the kind of inputs we’d hope to see after looking at the precondition. The 4th input appears to follow the same path as the first. I don’t know what’s going on — perhaps it emerges from some idiosyncrasy of the checked code or maybe Klee simply throws in an extra test case for its own reasons.

Combining Klee with an undefined behavior checker causes Klee to generate additional test cases that — by invoking operations with undefined behavior — should shine some light into dark corners of the system under test. A potential drawback is that all the extra paths are going to cause the path explosion problem to happen sooner than it otherwise would have. However, this should not be serious since we can just run Klee on both versions of the code.

But this is all just talk. The real question is: does this method find more bugs?

Bit Puzzle Results

The first collection of code is several years’ worth of solutions to an early assignment in Utah’s CS 4400. I already discussed these, so I won’t repeat myself. For each bit puzzle, students receive a reference implementation (which they cannot simply copy since it doesn’t follow the rules for student solutions) and a simple test harness that runs their code against the reference implementation on some inputs, compares the results, and complains about any differences. For each of 10 bit puzzles we have 105 solutions written by students. The automated test suite determines that 84 of these 1050 solutions are faulty. In other words, they return incorrect output for at least one input. Differential testing with Klee finds seven additional buggy functions, for a total of 91.

When the students’ codes are augmented with checks for integer undefined behaviors, Klee finds more paths to explore. The test cases that it generates find the 91 incorrect functions that are already known plus 11 more, for a total of 102 buggy functions. Just to be perfectly clear: a buggy function is one that (after being compiled by GCC) returns the wrong output for an input in a test suite. We are not counting instances of undefined behavior as bugs, we are simply using Klee and the undefined behavior checker to generate a better test suite.

We were able to exhaustively test some of the bit puzzles. In these cases, exhaustive testing failed to find any bugs not found by differential Klee with undefined behavior coverage.

Saturating Operation Results

The second collection of code is 336 saturating math operations. In this case, the additional tests generated by Klee to satisfy undefined behavior coverage found no additional buggy functions beyond those found using differential whitebox testing. My hypotheses are:

  1. The shift-related undefined behaviors in these functions always involved constant arguments, since shifts were used only to compute values like the maximum and minimum representable integer of a certain width. Since the arguments were constant, Klee had no opportunity to generate additional test cases.
  2. The addition and subtraction overflow undefined behaviors were compiled by GCC into modular arithmetic, despite the fact that this behavior is not guaranteed by the standard. This is a natural consequence of generating code using the x86 add and sub instructions. Modular arithmetic was the behavior that people (including me, as described in the previous post) wanted and expected. Therefore, undefined behavior coverage exposed no bugs. Modern C compilers sometimes compile math overflows in a non-modular way (for example, evaluating (x+1)>x to 1), but the saturating arithmetic functions — by chance — do not use code like that.

We were able to exhaustively test saturating operations that take chars (for 16 total bits of input) and short ints (for 32 total bits of input). In these cases, exhaustive testing failed to find any bugs not already found by differential Klee.

Conclusion

Undefined behavior coverage is a special case of a more interesting code coverage metric that I’ll describe in a subsequent post. We need to try Klee + undefined behavior coverage on some real applications and see what happens; I’m cautiously optimistic that it will be useful.