ISSTA 2011

Earlier this week I gave one of the keynote talks at ISSTA, the International Symposium on Software Testing and Analysis. A year ago Matt Dwyer, the general chair, sent me the following invitation:

I would like to invite you to give a keynote talk to the meeting about the challenges in testing, dynamic and static analysis aimed at fault detection for embedded software and particularly sensor network applications. I believe that as sensor network apps continue to mature that new, perhaps domain-specific, V&V techniques will be needed in order to field reliable systems. This topic has received very little attention…

I thought this was pretty cool because it sounds almost exactly like something that I’d have written. The premise of my talk was that there’s a huge amount of interesting research on testing that still needs to be done in the embedded domain, and that — unlike in the past — there are now a number of really nice open-source embedded platforms such as Arduino, TinyOS, Android, and ROS that should provide ready-made audiences for solid tool work. Here are the slides:

Slides: Issta11 (http://www.slideshare.net/regehr/issta11)

Of the ISSTA talks I saw (which unfortunately wasn’t all that many due to the multi-tracked nature of the event and the fact that I had to skip a couple of sessions to get my own talk done) one of the ones I really liked was Philip Guo’s talk about a hacked Python interpreter that makes computations persistent and also transparently memoizes them when this is possible and profitable. The idea is that a lot of people use languages like Python to do data analysis and end up spending a lot of time parsing and printing temporary files, and also end up having a difficult time figuring out which scripts need to be re-run when something changes. Persistence means you don’t have to explicitly write data to files, and memoization means that you can just invoke all of the code every time you want the results, and that you will get a cached result unless something has changed that actually necessitates re-running. Other than manually deciding what to re-run (which is what I do) it’s possible to write a makefile but that is a big pain. His infrastructure takes care of this stuff automatically. I’d use it except for the fact that I do all of this kind of work in Perl. Oh well. Philip is also the person who wrote the excellent CDE packager for Linux applications.

Lionel Briand gave a great talk about some problems with the idea of adaptive random testing (ART), which is a variation of black box fuzzing where test inputs are selected to be “far” from previous test inputs under some distance metric. The hypothesis is that if we look at the space of test inputs under this metric, buggy regions will be somewhat compact. Therefore, we should be trying to spread test cases as widely as possible over the input space. Briand’s paper makes several points, but the main criticism is that an ART implementation requires a number of distance comparisons that grows quadratically in the number of tests (with 10^5 tests, for example, that means billions of distance computations). Therefore, for realistic choices of testing parameters, a distance check has to be something like 1e5 times cheaper than running a test for ART to have any chance of paying off. The point isn’t that ART is bad, but rather that its proponents had better think of ways to avoid getting killed by the distance checks. In general I love this kind of talk, which takes a critical look at previous work. I feel sure that it caused contention among the program committee and I’m glad they decided to accept it. Even the targets of this work (the people who wrote the previous ART papers) should be flattered. What I mean is that since most academic work goes unread, we should feel lucky if someone criticizes our work because it means they read it and thought about it carefully.

Another talk I liked very much was the other keynote by Laurie Hendren, whose current project is to provide support for the large number of scientists and engineers who do most of their work in Matlab. Matlab is one of those programming languages whose specification is encoded in a single implementation and she entertainingly described the process of reverse-engineering things like how Matlab looks up a name, which should be — but is not at all — simple.

Overall I found ISSTA (which I had never attended before) to be a very friendly conference with a lot of smart people and interesting work. Actually, “smart” doesn’t make it stand out, since every computer science conference is full of smart people. The thing I liked most was the practical focus on solving real software problems.


Split Vote

In my group’s recent compiler testing paper we wrote:

We have never seen an “interesting” split vote where randomized differential testing of a collection of C compilers fails to produce a clear consensus answer

Randomized differential testing is just a fancy way of describing this process:

  1. Randomly generate a test input
  2. Run it through several different implementations of the same specification
  3. Check if all implementations produced equivalent output
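
Just to make that concrete, here is a minimal sketch of such a harness in C (my own illustration, not the actual Csmith infrastructure; the binary names are made up, and popen is a POSIX call):

#include <stdio.h>
#include <string.h>

/* Run one compiled copy of the test program and capture its (small) output. */
static int run (const char *cmd, char *out, size_t len) {
  FILE *p = popen (cmd, "r");
  if (!p)
    return -1;
  size_t n = fread (out, 1, len - 1, p);
  out[n] = '\0';
  return pclose (p);
}

int main (void) {
  /* the same generated test program, built by different compilers */
  const char *binaries[] = { "./test_gcc", "./test_clang", "./test_msvc" };
  char ref[4096], cur[4096];
  size_t i;
  if (run (binaries[0], ref, sizeof ref) < 0)
    return 1;
  for (i = 1; i < sizeof binaries / sizeof binaries[0]; i++) {
    if (run (binaries[i], cur, sizeof cur) < 0)
      return 1;
    if (strcmp (ref, cur) != 0)
      printf ("output mismatch: %s disagrees with %s\n", binaries[i], binaries[0]);
  }
  return 0;
}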

Today we saw our first split vote using a program generated by Csmith. The reduced test case is:

#include <stdio.h>
struct S0 {
  unsigned f1 : 1;
};

struct S0 s;

int main (void) {
  int x = -3;
  int y = x >= (0, s.f1);
  printf ("%d\n", y);
  return 0;
}

GCC, KCC, and CompCert all print “0\n”. MSVC 2010, Intel CC 12.0.2, and today’s development snapshot of Clang (all for x86-64) print “1\n”. All compilers have optimizations turned off. There are two possibilities:

  1. The test case is ill-formed or ambiguous.
  2. Three of the six tools are wrong.

I’m pretty sure that (using the C99 standard) the test case is fine and the correct value for y is 0. The reasoning is that s.f1, a 1-bit unsigned value whose entire range fits in an int, is promoted to signed int before the comparison is performed; the comparison is then a signed comparison, and -3 >= 0 is false, giving zero. The type and value of the left operand to the comma operator should be irrelevant.
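
Here is a minimal program that spells out that promotion reasoning (my own illustration; it forces the conversion explicitly rather than reproducing the original expression):

#include <stdio.h>

struct S0 {
  unsigned f1 : 1;   /* a 1-bit field can only hold 0 or 1 */
} s;                 /* static storage, so s.f1 starts out as 0 */

int main (void) {
  int x = -3;
  int promoted = s.f1;            /* value 0, with type int after promotion */
  printf ("%d\n", x >= promoted); /* signed comparison: -3 >= 0 is false, prints 0 */
  return 0;
}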

There are a few interesting things going on here:

  • Two of the three (apparently) correct results were produced by relatively lightly-tested research compilers.
  • Most compiler bugs are in the optimizers. Therefore, problems like this that show up with optimizations disabled are relatively rare.
  • C is not the simple “portable assembly language” that people like to claim it is. Nobody gets all of its corner cases right, even for something relatively simple like integers.
  • Just yesterday, Xuejun — the main Csmith hacker — added support for the comma operator. Most compilers’ implementations of it are probably not very well tested.
  • Intuitively, n-version programming should work. Knight and Leveson famously showed that it may not.

Related previous posts from this blog are here and here.

Update: I should have added that I’m interested to see if there are any other compilers that get this wrong. If you have access to a compiler not on my list above and wouldn’t mind running the test and posting the result in a comment, I’d appreciate it.

Update from July 13: The behavior of this program is a bit more subtle than my explanation above indicates. John McCall’s comment has the best explanation I’ve seen so far.

Update from July 14: In C, a global variable DOES NOT need an explicit initializer. It is automatically initialized to zero.

Why Verify Software?

People like me who work on software verification (I’m using the term broadly to encompass static analysis, model checking, and traditional formal verification, among others) like to give talks where we show pictures of exploding rockets, stalled vehicles, inoperable robots, and crashed medical devices. We imply that our work is helping, or at least could help, prevent very serious software-related problems. It’s not clear that this sort of claim stands up to a close examination.

What would it take to show a causal link between verification efforts and software safety? A direct demonstration using a controlled experiment would be expensive. An indirect argument would need several parts. First, we’d have to show that flaws revealed by verification efforts are of the kind that could compromise safety. Second, we’d need to show that these flaws would not have been found prior to deployment by traditional V&V — those bugs are merely expensive, not harmful. Third, we’d need to argue that a given dollar spent on verification adds more safety than that same dollar spent on other ways of improving software. Finally, we would need to argue that a successful verification effort won’t have unintended consequences such as emboldening management to substantially accelerate the schedule.

Of course, none of this criticizes software verification research, which I work on and very much believe in. We simply need to be clear about its purpose, which is to reduce overall cost. A top-level goal for software verification that fails to mention cost (for example “minimize damage caused by bugs in software intensive systems”) is untenable because obviously the best way to minimize such damage is to radically simplify, or even eliminate, the software. Of course, in practice we do not wish to radically simplify or eliminate the software because it brings so many benefits.

A more reasonable high-level goal for software verification might be “increase, to the largest possible extent given the methods available, total system utility.” “Total system utility” has both positive and negative components, and verification is mainly about mitigating some of the negative components, or costs, including not just development and manufacturing costs, but also maintenance and accident-related costs. In the next couple of days I’ll post a more specific example where the cost-based analysis of verification is much more useful than the feel-good analysis promoted by exploding rocket pictures.

Safe From Compiler Bugs?

A few people have asked me: Does there exist a subset of the C language that is not, in practice, miscompiled? The intuition behind the question is perfectly reasonable. First, it is clear that there exist C features, such as bitfields and volatile variables, whose compiler support is not so reliable. Second, there exist C subsets like MISRA C that are intended for safety critical application development and we would hope they can be reliably compiled.

There probably do exist subsets of C that avoid compiler bugs. For example, if we avoid all math, control flow, and I/O, then it’s at least conceivable that we’re on safe ground. If not, then almost certainly we’d be OK by permitting only “return 0;” in function bodies. However, my group’s experience in reporting compiler bugs has convinced me that there is no remotely useful subset of C that reliably avoids compiler bugs. Let’s take this C subset as an example:

  • only variables of type int (or unsigned int, if that seems better)
  • no dynamic memory allocation
  • only functions of 30 lines or less
  • only if/else statements, function calls, and simple for loops for flow control (no break, continue, do/while, goto, longjmp, etc.)
  • only single-level structs, single-level pointers, and one-dimensional arrays
  • no type qualifiers
  • all expressions are in three-address form; for example “x = y + z;”

We could add more restrictions, but that would seem to make it hard to get work done in the subset. Of course, to make the subset real we’d need to nail down a lot of additional details and also write a checking tool (neither activity would be difficult).
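
To make this concrete, here is a hypothetical function that stays inside the subset sketched above (my own illustration, since the subset is only loosely specified):

/* int-only variables, a simple for loop, a one-dimensional array reached
   through a single-level pointer, and three-address expressions throughout */
int sum_scaled (int *a, int n, int scale) {
  int i;
  int t;
  int total;
  total = 0;
  for (i = 0; i < n; i++) {
    t = a[i];
    t = t * scale;
    total = total + t;
  }
  return total;
}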

Next we ask the question: what percentage of the compiler’s optimization passes will have useful work to do when compiling code written in this subset? The answer, unfortunately, is “most of them.” For an aggressive compiler, this amounts to a lot of extremely subtle code. In practice, some of it will be wrong. Thus, I am claiming that even a fairly severe C subset provides little shelter from compiler bugs.

The best way to back up my claim would be to rewrite some large C codes in my subset and show that they are still miscompiled. Unfortunately that is too much work. Rather, I’ll point to a few compiler bugs that we’ve found which, while they are not generally triggered by programs in the subset I outlined above, are triggered by pretty simple test cases. I would argue that — taken together — these effectively kill the idea of a bug-safe subset. Keep in mind that most of these test cases are not fully reduced; in some cases a compiler developer was able to do considerably better (GCC hacker Jakub Jelinek is the most amazing test case reducer I know of). Also, my guess is that most of these bugs could be tickled by programs in a smaller subset of C. For example, test cases for bugs that we report often use the volatile qualifier not because it has anything to do with the bug, but rather because volatile objects serve as a convenient mechanism for suppressing optimizations that would otherwise mask the bug.

GCC bug 42952 results in this code being miscompiled:

#include <stdio.h>

static int g_16[1];
static int *g_135 = &g_16[0];
static int *l_15 = &g_16[0];
static void foo (void) {
  g_16[0] = 1;
  *g_135 = 0;
  *g_135 = *l_15;
  printf("%d\n", g_16[0]);
}

GCC bug 42512 results in this code being miscompiled:

#include <stdio.h>

int g_3;
int main (void) {
  long long l_2;
  for (l_2 = -1; l_2 != 0; l_2 = (unsigned char)(l_2 - 1)) {
    g_3 |= l_2;
  }
  printf("g_3 = %d\n", g_3);
  return 0;
}

GCC bug 41497 results in this code being miscompiled:

#include <stdint.h>

static uint16_t add (uint16_t ui1, uint16_t ui2) {
  return ui1 + ui2;
}

uint32_t g_108;
uint8_t f3;
uint8_t f0;

void func_1 (void) {
  for (f3 = 0; f3 <= 0; f3 = 1) {
    for (g_108 = -13; g_108 == 0; g_108 = add (g_108, 0)) {
      f0 = 1;
    }
  }
}

LLVM bug 2487 results in this code being miscompiled:

int g_6;
void func_1 (void) {
  char l_3;
  for (l_3 = 0; l_3 >= 0; ++l_3) {
    if (!l_3) {
      for (g_6 = 1; 0; --g_6);
    }
  }
}

LLVM bug 2716 results in this code being miscompiled:

int func_3 (void) {
  long long p_5 = 0;
  signed char g_323 = 1;
  return (1 > 0x14E7A1AFC6B86DBELL) <= (p_5 - g_323);
}

LLVM bug 3115 results in this code being miscompiled:


unsigned int g_122;

int func_1 (void) {
  unsigned int l_19 = 1;
  if (1 ^ l_19 && 1) return 0;
  return 1;
}

I could go on, but six examples should suffice.

A better question than “Is there a bug-safe C subset?” might be “Is there a subset of C that, when combined with restrictions on the optimization passes used, is safe from compiler bugs?” I don’t know the answer to this, but I do know that disabling optimizations is probably the only reliable way to keep large, possibly-buggy parts of the compiler from executing. Also, organizations creating safety critical systems often limit the amount of optimization that is performed by compilers they use (though that is probably as much to give traceable, debuggable code as it is to avoid miscompilations). In the medium to long run, an even better idea would be to forget about artificial constraints on the language and instead use a verified compiler or a tool such as one of the emerging translation validators for LLVM.

Generalizing and Criticizing Delta Debugging

Delta debugging is a search-based technique for taking an input to a program that triggers a bug and making that input smaller. For example, you might have a sequence of GUI operations that causes Thunderbird to crash. Assuming the crash is deterministic and the input can be replayed automatically, you can iteratively remove UI actions until you end up at a local minimum: a collection of events where removing even one of them makes the crash go away. If Delta is implemented well and properly takes advantage of the structure of the input, the resulting failure-inducing input is not only small, but also most of it is relevant to the failure. Debugging a failure triggered by a small, highly-relevant input is generally pretty easy. The definitive Delta debugging reference is Simplifying and Isolating Failure-Inducing Input by Andreas Zeller and Ralf Hildebrandt (ZH02 from now on).

Delta is fantastically useful, particularly in the context of random testing. As part of my group’s compiler bug-finding project we implemented three new test case minimizers using Delta-like ideas, and we also use an existing line-based Delta implementation. Each of the four reducers has different strengths and weaknesses, and in fact local minima can often be escaped by running the implementations one after the other.

Significantly, none of the four test case reducers is a straightforward implementation of an algorithm from the original Delta paper — each of them generalizes it in one or more ways. After spending a lot of time working on test case reduction and thinking about this, I got the idea of writing a paper perhaps called “Generalized Delta Debugging” which weakens many of the assumptions found in the original work. The problem was that the more parts of Delta debugging I generalized, the more the result looked just like a generic greedy search. Thus, it started to look extremely doubtful whether there was any research component to generalizing Delta debugging. This piece explores the consequences of that observation.

Delta == Greedy Search

Just to be clear, by “greedy search” I mean the class of optimization algorithms that are based on a transformation operator and a fitness function. They work by repeatedly transforming the current solution to arrive at a new solution, and replacing the current with the new if the fitness level has increased. No doubt I’m butchering the accepted terminology, but the ideas here are really simple.

The “minimizing delta debugging algorithm” from ZH02 is an instance of greedy search where the fitness of a failure-inducing input is the inverse of its size and the fitness of all non-failure-inducing inputs is zero. The transformation operator removes a contiguous chunk of the input. When the transformation gets stuck — by running out of choices of chunks to remove — it reduces the chunk size. When it gets stuck at the minimum chunk size (a single line, character, or other atom of input) the search is finished.
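
Here is a bare-bones sketch of that search, written in C (my own rendering of the idea described above, not the ddmin pseudocode from ZH02; is_interesting is an assumed predicate that returns nonzero when a candidate input still triggers the failure):

#include <stdlib.h>
#include <string.h>

extern int is_interesting (const char *input, size_t len);   /* assumed */

size_t reduce (char *input, size_t len) {
  size_t chunk = len / 2;
  while (chunk >= 1) {
    int progress = 0;
    size_t start = 0;
    while (start + chunk <= len) {
      /* transformation operator: delete input[start .. start+chunk) */
      char *cand = malloc (len - chunk + 1);
      if (!cand)
        return len;                  /* give up gracefully on OOM */
      memcpy (cand, input, start);
      memcpy (cand + start, input + start + chunk, len - start - chunk);
      if (is_interesting (cand, len - chunk)) {
        /* higher fitness (a smaller failing input): accept the new solution;
           start stays put because the next chunk has shifted into its place */
        memcpy (input, cand, len - chunk);
        len -= chunk;
        progress = 1;
      } else {
        start += chunk;              /* reject the candidate, try the next chunk */
      }
      free (cand);
    }
    if (!progress)
      chunk /= 2;                    /* stuck at this granularity: refine */
  }
  return len;                        /* length of the reduced, still-failing input */
}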

The “general delta debugging algorithm” from ZH02 is very similar but its goal is to minimize the difference between the current solution and a given input, instead of simply minimizing size. Since I haven’t found many uses for this algorithm in practice, and since it’s not all that different from the minimizing Delta, I won’t discuss it further. Whenever I mention the “Delta algorithm” or similar, it is the minimizing Delta to which I refer.

Which parts of the Delta algorithms from ZH02 can be usefully generalized? As it turns out, pretty much all of them. Let’s look at the different elements in turn.

Generalizing the Transformation Operator

The Delta transformer that deletes contiguous chunks of input at ever-finer levels of granularity is reasonably generic and efficient. However, when attacking highly-structured test cases it often gets stuck at a local maximum long before the test case is fully reduced. (Sorry if I keep switching between minimum and maximum. When discussing size the goal is minimization, when discussing fitness in a generic context, I’ll stick to the convention that the goal is maximization.) Hierarchical delta debugging is a variant that improves performance by operating on sub-trees of tree-structured inputs.

Another generalization is to use a transformer that replaces a chunk of input with something else, instead of simply deleting it. For example, one of our new reducers for C code tries to replace uses of variables with constants. Another replaces function calls with their results, including side effects. These are very effective in practice.

It is also useful to delete parts of the input in a non-local way. For example, to remove an argument to a function in a C program, we must delete it from the definition, declaration, and all uses. Making this transformation work requires a painful degree of friendliness with the C code, but again it’s very useful in practice.

Finally, we sometimes use transformations that don’t even make the test case smaller. For example it may be desirable to replace a small, complex construct (like a call to a trig function in a C program) with a larger but simpler construct (a math expression approximating the trig function’s behavior, perhaps). Similarly, it may be desirable to replace an array with a collection of scalars or a struct assignment with a collection of assignments to members. The scalars or the assignments are then vulnerable to subsequent reduction.
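
As a tiny before/after example of the last kind of transformation (my own illustration), scalarizing a struct assignment leaves behind member assignments that can then be attacked individually:

struct S { int a; int b; };

/* before */
void copy_before (struct S *dst, const struct S *src) {
  *dst = *src;
}

/* after: each member assignment is now independently removable */
void copy_after (struct S *dst, const struct S *src) {
  dst->a = src->a;
  dst->b = src->b;
}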

All of these examples point towards a more general idea which is that there is a strong synergy between test case reduction and compiler optimization (which I wrote about earlier).

Generalizing the Fitness Function

ZH02’s minimizing Delta uses 1/size as its fitness function and its general Delta uses the inverse of the string distance between current solution and goal. There are plenty of other useful fitness functions. As I mentioned in the previous paragraph, considering the complexity of different program constructs is useful. We’ve also experimented with using Delta techniques to minimize the number of instructions executed by the test case. The insight is that the complexity of a failing execution depends not only on syntactic characteristics of the failure-inducing input, but also on the dynamic behavior induced by the test case.

A major gap in the ZH02 paper is that it does not address the validity problem: does the transformed test case satisfy the constraints imposed on test inputs? For some uses of Delta no validity test is required because the system under test can detect invalid inputs. On the other hand, the validity problem for C programs is extremely difficult to deal with (in theory, it’s undecidable; in practice, no fun at all) and this has been a major stumbling block in our C compiler bug-finding work — but now solved (thank goodness not by us). Sometimes it is desirable to test software with invalid inputs, but for the C compiler work we want to say that all invalid test cases have zero fitness.

Generalizing the Search Framework

The third element of the Delta debugging algorithms from ZH02 that can be usefully generalized is the search algorithm itself. Backtracking can be used to avoid getting stuck too early, as can other techniques such as simulated annealing or genetic algorithms. Basically, test case reduction can be based on any search algorithm, greedy or not.

A Verdict on Delta

Now I’d like to ask a couple of specific questions about the ZH02 paper. First, why weren’t the algorithms presented as “plugins” to the greedy search framework? This would have improved the presentation by making it clear which elements of the algorithms are important vs. boilerplate. Also, it would have made it easier for readers to understand how subsequent improvements to delta should be structured.

Second, given that delta debugging is a fairly straightforward application of greedy search, how significant is its research contribution? The answer that I lean towards is that the main contribution of delta is giving a nice name to an excellent idea that was, in 2002, somewhat obvious.

Since it is dangerous to criticize an idea by saying, in retrospect, that it was obvious, I’ll provide a bit of justification. First, two of the previously published papers cited by Zeller and Hildebrandt (references 8 and 9 in the paper) apply ideas that are basically the same as Delta debugging, but without calling it that. Additionally, a paper they missed — Differential Testing for Software, published in 1998 — described a search-based, automated test case reducer. So it’s clear that by 2002 the ideas forming the core of Delta debugging were floating around.

My opinion is that the two Delta algorithms from the ZH02 paper have little enduring value because they simply do not work very well without modification. At least, we couldn’t make them work without significant generalization. The enduring value of the paper is to popularize and assign a name to the idea of using a search algorithm to improve the quality of failure-inducing test cases.

Moving Forward

As the complexity of computer systems continues to increase, the advantages derived from deterministic execution and automated test case reduction will also continue to increase, and Delta debugging provides a conceptual framework for the reduction side of that. Unfortunately, it seems that few useful papers on reducing the size of failure-inducing programs have appeared since the original Delta work. A notable exception is the hierarchical delta debugging work I mentioned earlier. Iterative Delta Debugging is interesting but solves a slightly different problem. Deriving Input Syntactic Structure is a technique that can make hierarchical delta easier to implement. If anyone knows of more work along these lines, I’d like to hear about it.

An Executable Semantics For C Is Useful

The goal of a C/C++ compiler is to turn every sequence of ASCII characters into executable instructions. OK, not really — though it does seem that way sometimes. The real goal of a C/C++ compiler is to map every conforming input into executable instructions that correspond to a legal interpretation of that input. The qualifiers “conforming” and “legal interpretation” are very important. First, the compiler has extremely weak requirements about what it should do with non-conforming inputs, such as programs that contain undefined behaviors (array bounds violations, etc.). Second, all realistic C/C++ programs have a large number of possible interpretations, for example corresponding to different integer sizes, different orders of evaluation for function arguments, etc. The compiler chooses a convenient or efficient one, and the remaining interpretations are latent. They may emerge later on if the compiler options are changed, if the compiler is upgraded, or if a different compiler is used. The point is that the compiler has no obligation to tell us whether the input is conforming or not, nor how many possible interpretations it has.

Thus, while C/C++ compilers are very good at turning conforming programs into efficient executables, they are just about useless for answering other kinds of questions:

  • Does the program ever execute undefined behaviors, causing it (in principle) to have no meaning and (in practice) to execute attack code or crash?
  • Does the program rely on unspecified behaviors, making it non-portable across compilers, compiler versions, and changes in compiler options?
  • Does the program rely on implementation-defined behaviors, affecting its portability to other compilers and platforms?
  • Why does the program behave in a certain way? In other words, what part of the standard forced that interpretation?

To answer these questions, a wide variety of static analyzers, model checkers, runtime verifiers, and related tools have been developed. These tools are great. However, even taken all together, they are incomplete: there exist plenty of bad (or interesting) program behaviors that few or none of them can find. For example:

  • Very few tools exist that can reliably detect uses of uninitialized storage.
  • Few, if any, tools can correctly diagnose problems resulting from C/C++’s unspecified order of evaluation of function arguments.
  • An lvalue must not be modified multiple times, or be both read and written, in between sequence points. I’m not aware of many tools that can correctly detect that evaluating this function results in undefined behavior when p1 and p2 are aliases:
int foo (int *p1, int *p2) {
  return (*p1)++ % (*p2)++;
}

The Missing Tool

The missing tool (or one of them, at any rate) is an executable semantics for C. An executable semantics is an extremely careful kind of interpreter where every action it takes directly corresponds to some part of the language standard. Moreover, an executable semantics can be designed to tell us whether the standard assigns any meaning at all to the program being interpreted. In other words, it can explicitly check for all (or at least most) undefined, unspecified, and implementation-defined behaviors. For example, when an executable semantics evaluates (*p1)++ % (*p2)++, it won’t assign a meaning to the expression until it has checked that:

  • both pointers are valid
  • neither addition overflows (if the promoted types are signed)
  • p1 and p2 are not aliases
  • *p2 is not 0
  • either *p1 is not INT_MIN or *p2 is not -1

Moreover, the tool should make explicit all of the implicit casts that are part of the “usual arithmetic conversions.” And it needs to do about 500 other things that we usually don’t think about when dealing with C code.
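
As a small example of the sort of implicit conversion such a tool would surface (a hypothetical illustration, not output from any particular semantics):

#include <stdio.h>

int main (void) {
  /* what the programmer writes */
  unsigned short us = 2;
  signed char sc = -1;
  int written = us + sc;

  /* what the usual arithmetic conversions actually mean, assuming int is
     wider than both short and char (true on common platforms): both
     operands are promoted to int before the addition */
  int explicit_form = (int)us + (int)sc;

  printf ("%d %d\n", written, explicit_form);   /* prints "1 1" */
  return 0;
}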

Who Needs an Executable Semantics?

Regular programmers won’t need it very often, but they will occasionally find it useful for settling the kinds of annoying arguments that happen when people don’t know how to read the all-too-ambiguous English text in the standard. Of course, the executable semantics can only settle arguments if we agree that it has captured the sense of the standard. Better yet, we would treat the executable semantics as definitive and the document as a derivative work.

Compiler developers need an executable semantics. It would provide a simple, automated filter to apply to programs that purportedly trigger compiler bugs. A web page at Keil states that “Fewer than 1% of the bug reports we receive are actually bugs.” An executable semantics would rapidly find code fragments that contain undefined or unspecified behaviors — these are a common source of bogus bug reports. Currently, compiler developers do this checking by hand. The GCC bug database contains 4966 bug reports that have been marked as INVALID. Not all of these could be automatically detected, but some of them certainly could be.

People developing safety-critical software may get some benefit from an executable semantics. Consider CompCert, a verified compiler that provably preserves the meaning of C code when translating it into assembly. CompCert’s guarantee, however, is conditional on the C code containing no undefined behaviors. How are we supposed to verify the absence of undefined behaviors when existing tools don’t reliably check for initialization and multiple updates to lvalues? Moreover, CompCert is free to choose any legitimate interpretation of a C program that relies on unspecified behaviors, and it does not need to tell us which one it has chosen. We need to verify up-front that (under some set of implementation-defined behaviors) our safety-critical C program has a single interpretation.

My students and I need an executable semantics, because we are constantly trying to figure out whether random C functions are well-defined or not. This is surprisingly hard. We also need a reliable, automated way to detect undefined behavior because this enables automated test case reduction.

An Executable Semantics for C Exists

I spent a few years lamenting the non-existence of an executable C semantics, but no longer: the tool now exists. It was created by Chucky Ellison, a PhD student at the University of Illinois working under the supervision of Grigore Rosu. They have written a TR about it, and the tool can be downloaded. Hopefully Chucky does not mind if I provide this link — the tool is very much a research prototype (mainly, it is not very fast). But it works:

regehr@home:~/svn/code$ cat lval.c
int foo (int *p1, int *p2) {
  return (*p1)++ % (*p2)++;
}

int main (void) {
  int a = 1;
  return foo (&a, &a);
}
regehr@home:~/svn/code$ kcc lval.c
regehr@home:~/svn/code$ ./a.out
=============================================================
ERROR! KCC encountered an error while executing this program.
=============================================================
Error: 00003
Description: Unsequenced side effect on scalar object with value computation of same object.
=============================================================
File: /mnt/z/svn/code/lval.c
Function: foo
Line: 2
============================================================

As I mentioned earlier, very few tools for analyzing C code find this error. Chucky’s tool can also perform a state space exploration to find order of evaluation problems and problems in concurrent C codes. Finally, it can run in profile mode. Unlike a regular profiler, this one profiles the rules from the C semantics that fire when the program is interpreted. This is really cool and we plan to use it to figure out what parts of the C standard are not exercised by Csmith.

Chucky’s tool is already an integral part of one of our test case reducers. This reducer takes as input a huge, ugly, bug-triggering C program generated by Csmith. It then uses Delta debugging to output a much smaller bug-triggering program that (ideally) can be included in a compiler bug report without further manual simplification. Before Chucky’s tool arrived, we had spent several years trying to deal with the fact that Delta always introduces undefined behavior. We now seem to have a bulletproof solution to that problem.

The benefits of executable semantics have long been recognized in the PL community. The new thing here is a tool that actually works, for the C language. Hopefully, as Chucky’s tool matures people will find more uses for it, and perhaps it can even evolve into a sort of de facto litmus test for ascertaining the meaning — or lack thereof — of difficult C programs.

Uninitialized Variables

I’ve been tempted, a couple of times, to try to discover how much performance realistic C/C++ programs gain through the languages’ failure to automatically initialize function-scoped storage. It would be easy to take a source-to-source transformer like CIL and use it to add an explicit initializer to every variable that lacks one. Then, presumably, a modern optimizing compiler would eliminate the large fraction of initializers that are dominated by subsequent stores. If compilers are not smart enough to do this, and overhead remains high, a more sophisticated approach would be to:

  • Only initialize a variable when some powerful (but conservative) interprocedural static analyzer thinks it may be needed
  • Initialize a variable close to its first use, to avoid problems with locality and with artificially lengthening the live range

My guess is that many programs would not be slowed down noticeably by automatic initialization, but that it would not be too hard to find codes that slow down 5%-20%.
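
Here is a hypothetical before/after view of the simple version of that transformation (my own sketch; compute is just a stand-in for arbitrary code):

extern int compute (void);   /* stand-in for arbitrary code */

/* before: x starts out uninitialized */
int before (void) {
  int x;
  x = compute ();
  return x;
}

/* after: the inserted "= 0" is dominated by the store below, so a
   reasonable optimizer should delete it and the cost should be zero */
int after (void) {
  int x = 0;
  x = compute ();
  return x;
}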

Lacking automatic initialization, most of us get by using compiler warnings and dynamic tools like Valgrind. I was recently surprised to learn that warnings+Valgrind are not as reliable as I’d have hoped. First, the compiler warnings are fundamentally best-effort in the sense that common compilers more or less give up on code using arrays, pointers, function calls, and some kinds of loops. For example:

int foo1 (void) {
  int y,z;
  for (y=0; y<5; y++)
    z++;
  return z;
}

int foo2 (void) {
  int a[15];
  return a[10];
}

void bar (int *p) {
}

int foo3 (void) {
  int x;
  bar (&x);
  return x;
}

Each of foo1(), foo2(), and foo3() returns a value that depends on a read from uninitialized storage, but recent versions of GCC and Clang fail to give a warning about this, at least for the command line options I could think to try. Intel CC 12.0 finds the first two and Clang’s static analyzer finds the problem with foo2(). However, both ICC and Clang’s analyzer are fairly easy to fool. For example, neither gives any warning about this code:

int foo2b (void) {
  int a[6], i;
  for (i=0; i<5; i++) {
    a[i] = 1;
  }
  return a[5];
}

Next let’s talk about Valgrind. It is greatly handicapped by the fact that optimizing compilers seek out and destroy code relying on undefined behaviors. This means that Valgrind never gets a chance to see the problems, preventing it from reporting errors. We can turn the functions above into a complete program by adding:

int main (void) {
  return foo1() + foo2() + foo3();
}

When compiled by GCC or Clang at -O2, the resulting executable passes through Valgrind with zero errors found. But let’s be clear: the problem still exists, it is just hidden from Valgrind. Each function still has to return something, and the programmer has failed to specify what it is. Basically the compiler will fabricate a value out of thin air. Or, as I like to say in class, it will somehow manage to generate the worst possible value — whatever it is. On the other hand, when optimizations are turned off, the output of either GCC or Clang is properly flagged as returning a value depending on uninitialized storage.

Is Valgrind reliable when it monitors code produced at -O0? Unfortunately not: the following functions pass through without error, regardless of optimization level:

void foo5 (void) {
  int x,y;
  for (x = 0; x < 5; x++)
    y;
}

void foo6 (void) {
  int a[1];
  a[0];
}

Moreover, neither GCC nor Clang warns about these. On the other hand, these functions are perhaps harmless since the uninitialized data aren’t actually used for anything. (Keep in mind, however, that according to the C standard, these functions are unambiguously wrong, and executing either of them destroys the meaning of the entire program.)

Failure to initialize function-scoped variables is one of the many holes in C/C++ that were introduced under the philosophies “trust the programmer” and “performance above all.” The resulting problems have been patched in a not-totally-satisfactory way using a collection of tools.

Is there a lesson here? Perhaps. I think it would be reasonable for tool developers to ensure that the following invariant holds:

For all C/C++ programs, any value read from uninitialized storage must do at least one of the following:

  • result in a compiler warning,
  • result in a Valgrind error,
  • fail to propagate (via data flow or control flow) to any system call.

I believe (but am not 100% sure) that Valgrind does not need to be modified to make this happen. Compilers definitely need to be changed. Basically, the compiler has to (1) without optimizations, generate code that permits Valgrind to detect all uses of uninitialized storage and (2) when optimizing, restrain itself from performing any transformation that conceals a problem from Valgrind, unless it emits a warning corresponding to the runtime behavior that has disappeared. Condition 1 is already the case for GCC and Clang (as far as I know) but a fair amount of work would probably be required to guarantee that condition 2 holds.

The other solution — automatic initialization — is of dubious value to C/C++ programmers because the resulting code would not be portable to other compilers. The standard would have to mandate initialization before this became generally useful, and there’s no way that’s going to happen. On the other hand, if I were an operating system vendor, and I cared at all about reliability and security, I’d probably hack the bundled compiler to do automatic initialization. Programmers writing conforming code would never notice, and sloppy code would be silently fixed instead of being perhaps exploitable.

Finally, I modified the six foo* functions above to include explicit initializers. When compiled using GCC, the code size overhead due to initialization is 0 bytes. Using Clang, 15 bytes. Using ICC, 40 bytes. This is all at -Os on x86-64. The ICC overhead all comes from foo2() — the compiler emits code initializing all 15 array elements even though 14 of these are obviously useless. Perhaps GCC does such a good job because its middle-end has already been tuned for ahead-of-time compilation of Java code.