Isolation with a Very Small TCB

A basic service provided by most computer systems is isolation between different computations. Given reliable isolation we can safely run programs downloaded from the web, host competing companies’ sites on the same physical server, and perhaps even run your car’s navigation or entertainment software on the same processor that is used to control important system functions.

Other factors being equal, we’d like an isolation implementation to have a trusted computing base (TCB) that is as small as possible. For example, let’s look at the TCB for isolating Javascript programs from different sites when we browse the web using Firefox on a Linux box. It includes at least:

  • pretty much the entire Firefox executable including all of the libraries it statically or dynamically links against (even unrelated code is in the TCB due to the lack of internal isolation in a C/C++ program)
  • the Linux kernel including all kernel modules that are loaded
  • GCC and G++
  • some elements of the hardware platform (at least the processor, memory controller, memory modules, and any peripherals that use DMA)

A deliberate or accidental malfunction of any of these elements may subvert the isolation guarantee that Firefox would like to provide. This is a large TCB, though probably somewhat smaller than the corresponding TCB for a Windows machine. Microkernel-based systems potentially have a much smaller TCB, and a machine-verified microkernel like seL4 removes even the microkernel from the TCB. Even so, the seL4 TCB is still nontrivial, containing for example an optimizing compiler, the substantially complicated specification of the microkernel’s functionality, a model of the hardware including its MMU, and the hardware itself.

My student Lu Zhao’s PhD work is trying to answer the question: How small can we make the trusted computing base for an isolation implementation? The target for this work is ARM-based microcontrollers that are large enough to perform multiple tasks, but way too small to run Linux (or any other OS that uses memory protection). Lu’s basic strategy is to implement software fault isolation (SFI). His tool rewrites an ARM executable in order to enforce two properties: memory safety and control flow integrity. We’re using a weak form of memory safety that forces all stores to go into a memory region dedicated to that task. Control flow integrity means that control never escapes from a pre-computed control flow graph. This boils down to rewriting the binary to put a dynamic check in front of every potentially dangerous operation. Using the excellent Diablo infrastructure, this was pretty easy to implement. I won’t go into the details but they can be found in a paper we just finished.
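
Conceptually, the check placed in front of a store looks something like the C-level sketch below. This is purely illustrative: the real tool rewrites ARM machine code, and the region bounds, the trap behavior, and the helper name are my assumptions rather than details of Lu’s rewriter.

#include <stdint.h>

/* Hypothetical bounds of the task's dedicated data region. */
#define REGION_BASE 0x20001000u
#define REGION_SIZE 0x00001000u

/* The dynamic check placed in front of a potentially dangerous store:
   the store happens only if the target address lies inside the task's
   region; otherwise the task is halted rather than allowed to corrupt
   memory it does not own. */
static inline void checked_store(uint32_t *addr, uint32_t value) {
  uintptr_t a = (uintptr_t)addr;
  if (a >= REGION_BASE && a - REGION_BASE <= REGION_SIZE - sizeof(uint32_t)) {
    *addr = value;    /* in bounds: perform the store */
  } else {
    for (;;) ;        /* out of bounds: trap and halt this task */
  }
}

The checks for control flow integrity are analogous: before an indirect branch, the target is compared against the destinations permitted by the pre-computed control flow graph.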

The trick to making Lu’s work have a small TCB was to remove the binary rewriter from the TCB by proving that its output has the desired properties. To do this, he needed to create mathematical descriptions of memory safety and control flow integrity, and also to borrow an existing mathematical description of the ARM ISA. Reusing the ARM model was ideal because it has been used by several previous projects and it is believed to be of very high quality. Automating the construction of a non-trivial, machine-checked proof is not so easy, and Lu spent a lot of time engineering proof scripts. The saving grace here is that since we control where to add safety checks, the proof is in some sense an obvious one — the postcondition of the check provides exactly the property needed to establish the safety of the potentially-unsafe operation that comes after the check.

The TCB for Lu’s isolation solution contains the HOL4 theorem prover, his definitions of memory safety and control flow integrity (less than 60 lines of HOL), the model of the ARM ISA, and of course the ARM processor itself. All of these elements besides the processor are of quite manageable size. This work currently has two major drawbacks. First, the proofs are slow to construct and the automated proving part only succeeds for small executables. Second, overhead is far higher than it is for a state of the art sandboxing system such as Native Client. In principle the theorem proving infrastructure will be a powerful tool for optimizing sandboxed executables — but we haven’t gotten to that part of the work yet.

As far as we can tell, there’s not much more that can be removed from the TCB. Lu’s SFI implementation is the first one to do full formal verification of its guarantees for real executables. An important thing I learned from working with Lu on this project is that dealing with a mechanical theorem prover is a bit like having kids in the sense that you can’t approach it casually: you either go whole-hog or avoid it entirely. Perhaps in the future the theorem proving people will figure out how to make their tools accessible to dabblers like me.

Proposal for Automated Compiler Bug Reports

[Yesterday I submitted a proposal to Google for a modest amount of money to work on turning large random programs that expose compiler flaws into concise bug reports. Below is a transcription that is mostly faithful (citations are omitted and I changed the example bug report from a floating figure into inline text). Feedback is appreciated.]

Abstract

Csmith, our random testing tool for C compilers, has helped us discover and report nearly 400 previously unknown compiler bugs, mostly in GCC and LLVM. More than 90% of these bugs have now been fixed by compiler developers. Moreover, most of these bugs were in the “middle ends” of the compilers, meaning that they impact multiple programming languages (Go, C++, Ada, etc.) as well as many target architectures.

By proactively finding compiler bugs missed by existing test suites, we help compiler teams eliminate flaws that would otherwise make it into deployed tools. Though it is uncommon to encounter a serious compiler bug, when one does bite a development effort, a lot of time is typically wasted. We help by performing nearly continuous testing of the GCC and LLVM development versions. After reporting hundreds of bugs, we appear to have reached a steady state where both compilers are substantially correct but every week or two a new bug is found.

The process of turning a bug-triggering compiler input into a good bug report is tedious and time consuming, while also requiring technical skill and good taste. We propose removing humans from this part of the picture by automating the construction of compiler bug reports.

Research Goals

The goal of the proposed work is to enable a workflow where someone installs our software on a fast machine and tells it how to build the latest compiler version and where to send bug reports.

The machine is now totally unattended. Each day it will build the latest compiler version and spend the rest of the day testing it. When a new bug is discovered, it will be reported. The bug reports will look like this:

Bug-triggering program:

extern int printf (const char *, ...);

static int g_16[1];
static int *g_135 = &g_16[0];
static int *l_15 = &g_16[0];

int main (void) {
  g_16[0] = 1;
  *g_135 = 0;
  *g_135 = *l_15;
  printf("%d\n", g_16[0]);
}
  • Expected output — produced by GCC 3.4, GCC 4.1, Intel CC 11.0, Clang 2.8, and Microsoft VC9 — is “0”.
  • Using r156486 for i686-pc-linux-gnu, “gcc -O” produces “1”. The same version at -O0 produces the expected output.
  • First version of GCC containing this bug is r147613.
  • Probable fault location: the dead store elimination pass.

The manually generated report for this bug is here.

The problem we need to solve is taking a large chunk of random C code (empirically, the sweet spot for randomized compiler bug finding is around 80KB of code) and turning it into an actionable bug report. Our three years of experience reporting compiler bugs has given us a good working understanding of what compiler developers want and need in a bug report. The rest of this proposal explains how the different parts of this report will be produced.

The Most Important Problem: Test Case Minimization

The most important problem is test case minimization: turning a large pile of random C code into a small failure-triggering input like the one above. The ddmin Delta debugging algorithm is the best-known solution for test case minimization. In a nutshell, a Delta debugger is just a greedy search that works by iteratively removing a chunk of the input and checking if the new input still triggers the bug. Delta can be significantly optimized by progressively removing smaller and smaller chunks of the input and also by coordinating its choice of chunks with the structure of the test input. The best-known Delta implementation, by Wilkerson, incorporates both of these optimizations.
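
As a concrete sketch of that greedy loop, here is roughly what a simple line-based reducer does. This is a simplified illustration, not the full ddmin algorithm, and triggers_bug() is a hypothetical oracle that writes the candidate out, runs the compiler, and reports whether the failure still occurs.

#include <stdlib.h>
#include <string.h>

/* Hypothetical oracle: returns nonzero if the candidate input still
   triggers the compiler bug. */
extern int triggers_bug(char **lines, size_t n);

/* Greedy reduction over an array of input lines: repeatedly try to
   delete a contiguous chunk, keep the deletion only if the bug still
   triggers, and shrink the chunk size when stuck. Returns the new
   number of lines. */
size_t reduce(char **lines, size_t n) {
  size_t chunk = n / 2;
  while (chunk >= 1) {
    int progress = 0;
    for (size_t start = 0; start + chunk <= n; ) {
      size_t m = n - chunk;
      char **cand = malloc(m * sizeof *cand);
      /* candidate = lines with lines[start .. start+chunk) removed */
      memcpy(cand, lines, start * sizeof *cand);
      memcpy(cand + start, lines + start + chunk,
             (n - start - chunk) * sizeof *cand);
      if (triggers_bug(cand, m)) {
        memcpy(lines, cand, m * sizeof *lines);  /* commit the deletion */
        n = m;
        progress = 1;          /* stay at this position and try again */
      } else {
        start++;               /* deletion lost the bug; move on */
      }
      free(cand);
    }
    if (!progress)
      chunk /= 2;              /* stuck: move to a finer granularity */
  }
  return n;
}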

Wilkerson’s Delta works, but it has two serious problems when used to reduce the size of C programs. First, because it is line-based, it gets stuck at local minima that are far from the (apparent) true minimum sizes for failure-inducing compiler inputs; many interesting reduction operations for C programs need to be aware of a program’s token-level and AST-level structure. Second, Wilkerson’s Delta usually creates reduced test cases that invoke undefined behavior. The GCC developers — to put it mildly — do not appreciate bug reports containing test cases that rely on undefined behavior.

We propose investigating two different approaches to fixing these problems. The first approach has two parts:

  • A C-aware Delta debugger that will significantly improve on the Wilkerson Delta, first by performing local transformations such as reducing x+y to x or x to 0, and second by performing global simplifications such as removing an argument from a function definition and from all uses of the function, removing a level of indirection from a pointer, and performing inline substitution of simple function bodies (see the sketch after this list).
  • We are working with Chucky Ellison at UIUC who is developing an executable semantics for C that robustly detects most kinds of undefined behavior (including several, such as unsequenced updates to an lvalue, that are largely undetected by other tools). His tool is a research prototype; we are helping refine it into a useful part of a Delta debugging loop.
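
As referenced in the first item above, here is the kind of before/after pair such a C-aware reducer might produce; the code is a made-up example rather than output from an actual tool:

/* Before reduction: */
static int helper(int a, int b) { return a + b * 2; }
int caller(void) { return helper(3, 4); }

/* After the local transformation "x + y -> x" inside helper, the
   argument b becomes dead, so the global simplification removes it
   from the definition and from every call site: */
static int helper_reduced(int a) { return a; }
int caller_reduced(void) { return helper_reduced(3); }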

Our second approach to test case reduction is entirely different: it uses Csmith itself to perform test case minimization. Our insight is that this will work because Csmith understands not only the structure of C, but also how to avoid introducing undefined behavior. Drawbacks of the Delta-in-Csmith approach are that it increases the complexity of an already complex tool and that it cannot be used to reduce failure-inducing C programs that did not come from Csmith in the first place. We believe that it is worth exploring both approaches.

Secondary Research Problems

Although our major focus will be on creating reportable test cases, we also plan to attack the following problems. First, making a robust guess at the first revision that contains a compiler bug. We have already tried using a naive binary search to find the first broken version, and too often the result of the search is a version that contains an incidental change that merely exposes the bug. By randomly exploring a variety of similar test cases and compiler options, we hope to robustly locate the first version that actually contains the bug.

Second, we will perform fault localization—try to guess where in the compiler the problem lies. When the problem lies in an optional part of the compiler (i.e., an optimization pass) this is fairly easy using Whalley’s work. When the bug lies in the frontend or backend, the situation is more difficult and is beyond the scope of this proposal.
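
Returning to the first of these problems, the sketch below shows one plausible shape for a more robust version search; build_and_test(), the variant and option counts, and the revision numbering are illustrative assumptions rather than part of the proposal.

/* Hypothetical helper: builds the compiler at the given revision,
   compiles the given test-case variant at the given optimization
   level, and returns nonzero if a miscompilation is observed. */
extern int build_and_test(long rev, int variant, int optlevel);

#define N_VARIANTS  20
#define N_OPTLEVELS 4

/* Call a revision buggy if any variant/option combination fails; this
   reduces the chance of blaming an incidental change that merely
   exposes the bug for one particular test case. */
static int revision_is_buggy(long rev) {
  for (int v = 0; v < N_VARIANTS; v++)
    for (int o = 0; o < N_OPTLEVELS; o++)
      if (build_and_test(rev, v, o))
        return 1;
  return 0;
}

/* Standard bisection: good is known clean, bad is known buggy. */
long first_buggy_revision(long good, long bad) {
  while (bad - good > 1) {
    long mid = good + (bad - good) / 2;
    if (revision_is_buggy(mid))
      bad = mid;
    else
      good = mid;
  }
  return bad;
}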

Outcome

Evaluating this work is easy. The only important question is: Are our bug reports good enough that they motivate compiler developers to fix the bugs? We will answer this question by continuing to report bugs to major open source compiler projects like GCC and LLVM. Furthermore, our test case reduction and related tools will be released as open source as part of the already open-sourced Csmith.

Relation to Previous Work

We are already taking advantage of as much previous work as seems to be available. Mainly, this consists of Zeller’s Delta debugging papers and a handful of follow-up papers that have appeared over the last ten years, in addition to Wilkerson’s Delta implementation and Ellison’s executable C semantics. No previous compiler bug-finding effort (that we are aware of) has resulted in anything close to the nearly 400 bugs we have reported, and our guess is that for this reason nobody has yet had to develop good tool support for automated bug reports.

ISSTA 2011

Earlier this week I gave one of the keynote talks at ISSTA, the International Symposium on Software Testing and Analysis. A year ago Matt Dwyer, the general chair, sent me the following invitation:

I would like to invite you to give a keynote talk to the meeting about the challenges in testing, dynamic and static analysis aimed at fault detection for embedded software and particularly sensor network applications. I believe that as sensor network apps continue to mature that new, perhaps domain-specific, V&V techniques will be needed in order to field reliable systems. This topic has received very little attention…

I thought this was pretty cool because it sounds almost exactly like something that I’d have written. The premise of my talk was that there’s a huge amount of interesting research on testing that still needs to be done in the embedded domain, and that — unlike in the past — there are now a number of really nice open-source embedded platforms such as Arduino, TinyOS, Android, and ROS that should provide ready-made audiences for solid tool work. Here are the slides:

[Slides: Issta11, available on SlideShare at http://www.slideshare.net/regehr/issta11]

Of the ISSTA talks I saw (which unfortunately wasn’t all that many due to the multi-tracked nature of the event and the fact that I had to skip a couple of sessions to get my own talk done), one of the ones I really liked was Philip Guo’s talk about a hacked Python interpreter that makes computations persistent and also transparently memoizes them when this is possible and profitable. The idea is that a lot of people use languages like Python to do data analysis and end up spending a lot of time parsing and printing temporary files, and also end up having a difficult time figuring out which scripts need to be re-run when something changes. Persistence means you don’t have to explicitly write data to files, and memoization means that you can just invoke all of the code every time you want the results, and that you will get a cached result unless something has changed that actually necessitates re-running. Other than manually deciding what to re-run (which is what I do), it’s possible to write a makefile, but that is a big pain. His infrastructure takes care of this stuff automatically. I’d use it except for the fact that I do all of this kind of work in Perl. Oh well. Philip is also the person who wrote the excellent CDE packager for Linux applications.

Lionel Briand gave a great talk about some problems with the idea of adaptive random testing (ART), which is a variation of black box fuzzing where test inputs are selected to be “far” from previous test inputs under some distance metric. The hypothesis is that if we look at the space of test inputs under this metric, buggy regions will be somewhat compact. Therefore, we should be trying to spread test cases as widely as possible over the input space. Briand’s paper makes several points but the main criticism is that an ART implementation requires a number of distance comparisons that grows quadratically in the number of tests executed. Therefore, for realistic choices of testing parameters, a distance check has to be something like 1e5 times cheaper than running a test for ART to have any chance of paying off. The point isn’t that ART is bad, but rather that its proponents had better think of ways to avoid getting killed by the distance checks. In general I love this kind of talk, which takes a critical look at previous work. I feel sure that it caused contention among the program committee and I’m glad they decided to accept it. Even the targets of this work (the people who wrote the previous ART papers) should be flattered. What I mean is that since most academic work goes unread, we should feel lucky if someone criticizes our work because it means they read it and thought about it carefully.
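
To make the cost argument concrete, here is a sketch of the ART selection step; the names and the fixed candidate-set size are illustrative rather than taken from any particular ART paper. Out of k random candidates, ART picks the one whose nearest previously executed test is farthest away, so selecting the n-th test costs on the order of k*n distance computations and a run of n tests performs on the order of k*n^2 of them.

#include <stddef.h>

typedef struct test test;                              /* opaque test input   */
extern double distance(const test *a, const test *b);  /* the distance metric */

/* Pick, from k candidates, the one that maximizes the distance to its
   nearest previously executed test. */
const test *art_select(const test *executed[], size_t n_executed,
                       const test *candidates[], size_t k) {
  const test *best = candidates[0];
  double best_min = -1.0;
  for (size_t i = 0; i < k; i++) {
    double min_d = 1e300;   /* distance to nearest executed test */
    for (size_t j = 0; j < n_executed; j++) {
      double d = distance(candidates[i], executed[j]);
      if (d < min_d)
        min_d = d;
    }
    if (min_d > best_min) {
      best_min = min_d;
      best = candidates[i];
    }
  }
  return best;
}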

Another talk I liked very much was the other keynote by Laurie Hendren, whose current project is to provide support for the large number of scientists and engineers who do most of their work in Matlab. Matlab is one of those programming languages whose specification is encoded in a single implementation and she entertainingly described the process of reverse-engineering things like how Matlab looks up a name, which should be — but is not at all — simple.

Overall I found ISSTA (which I had never attended) to be a very friendly conference with a lot of smart people and interesting work. Actually, “smart” doesn’t make it stand out since all computer science conferences are mostly smart people. The thing that I liked most was the practical focus on solving real software problems.


Split Vote

In my group’s recent compiler testing paper we wrote:

We have never seen an “interesting” split vote where randomized differential testing of a collection of C compilers fails to produce a clear consensus answer

Randomized differential testing is just a fancy way of describing this process (a minimal driver sketch appears after the list):

  1. Randomly generate a test input
  2. Run it through several different implementations of the same specification
  3. Check if all implementations produced equivalent output
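
In code, the process above is roughly the following; the specific compiler invocations, file names, and use of system() are an illustrative sketch of the workflow, not how our actual test harness is implemented.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  /* 1. Randomly generate a test input. */
  if (system("csmith > test.c") != 0) return 1;

  /* 2. Run it through several implementations of the same specification. */
  if (system("gcc   -O2 -o out_gcc   test.c && ./out_gcc   > gcc.txt") != 0) return 1;
  if (system("clang -O2 -o out_clang test.c && ./out_clang > clang.txt") != 0) return 1;

  /* 3. Check whether all implementations produced equivalent output. */
  if (system("cmp -s gcc.txt clang.txt") != 0) {
    printf("possible miscompilation: outputs differ\n");
    return 2;
  }
  printf("outputs agree\n");
  return 0;
}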

Today we saw our first split vote using a program generated by Csmith. The reduced test case is:

#include <stdio.h>
struct S0 {
  unsigned f1 : 1;
};

struct S0 s;

int main (void) {
  int x = -3;
  int y = x >= (0, s.f1);
  printf ("%d\n", y);
  return 0;
}

GCC, KCC, and CompCert all print “0\n”. MSVC 2010, Intel CC 12.0.2, and today’s development snapshot of Clang (all for x86-64) print “1\n”. All compilers have optimizations turned off. There are two possibilities:

  1. The test case is ill-formed or ambiguous.
  2. Three of the six tools are wrong.

I’m pretty sure that (using the C99 standard) the test case is fine and the correct value for y is 0. The reasoning is that s.f1, an unsigned bitfield, is promoted to signed int before the comparison is performed, so the comparison is done on signed ints: -3 >= 0, which is false, or zero. The type and value of the left operand to the comma operator should be irrelevant.

There are a few interesting things going on here:

  • Two of the three (apparently) correct results were produced by relatively lightly-tested research compilers.
  • Most compiler bugs are in the optimizers. Therefore, problems like this that show up with optimizations disabled are relatively rare.
  • C is not the simple “portable assembly language” that people like to claim it is. Nobody gets all of its corner cases right, even for something relatively simple like integers.
  • Just yesterday, Xuejun — the main Csmith hacker — added support for the comma operator. Most compilers’ implementations of it are probably not very well tested.
  • Intuitively, n-version programming should work. Knight and Leveson famously showed that it may not.

Related previous posts from this blog are here and here.

Update: I should have added that I’m interested to see if there are any other compilers that get this wrong. If you have access to a compiler not on my list above and wouldn’t mind running the test and posting the result in a comment, I’d appreciate it.

Update from July 13: The behavior of this program is a bit more subtle than my explanation above indicates. John McCall’s comment has the best explanation I’ve seen so far.

Update from July 14: In C, a global variable DOES NOT need an explicit initializer. It is automatically initialized to zero.

Why Verify Software?

People like me who work on software verification (I’m using the term broadly to encompass static analysis, model checking, and traditional formal verification, among others) like to give talks where we show pictures of exploding rockets, stalled vehicles, inoperable robots, and crashed medical devices. We imply that our work is helping, or at least could help, prevent very serious software-related problems. It’s not clear that this sort of claim stands up to a close examination.

What would it take to show a causal link between verification efforts and software safety? A direct demonstration using a controlled experiment would be expensive. An indirect argument would need several parts. First, we’d have to show that flaws revealed by verification efforts are of the kind that could compromise safety. Second, we’d need to show that these flaws would not have been found prior to deployment by traditional V&V — those bugs are merely expensive, not harmful. Third, we’d need to argue that a given dollar spent on verification adds more safety than that same dollar spent on other ways of improving software. Finally, we would need to argue that a successful verification effort won’t have unintended consequences such as emboldening management to substantially accelerate the schedule.

Of course, none of this criticizes software verification research, which I work on and very much believe in. We simply need to be clear about its purpose, which is to reduce overall cost. A top-level goal for software verification that fails to mention cost (for example “minimize damage caused by bugs in software intensive systems”) is untenable because obviously the best way to minimize such damage is to radically simplify, or even eliminate, the software. Of course, in practice we do not wish to radically simplify or eliminate the software because it brings so many benefits.

A more reasonable high-level goal for software verification might be “increase, to the largest possible extent given the methods available, total system utility.” “Total system utility” has both positive and negative components, and verification is mainly about mitigating some of the negative components, or costs, including not just development and manufacturing costs, but also maintenance and accident-related costs. In the next couple of days I’ll post a more specific example where the cost-based analysis of verification is much more useful than the feel-good analysis promoted by exploding rocket pictures.

Safe From Compiler Bugs?

A few people have asked me: Does there exist a subset of the C language that is not, in practice, miscompiled? The intuition behind the question is perfectly reasonable. First, it is clear that there exist C features, such as bitfields and volatile variables, whose compiler support is not so reliable. Second, there exist C subsets like MISRA C that are intended for safety critical application development and we would hope they can be reliably compiled.

There probably do exist subsets of C that avoid compiler bugs. For example, if we avoid all math, control flow, and I/O, then it’s at least conceivable that we’re on safe ground. If not, then almost certainly we’d be OK by permitting only “return 0;” in function bodies. However, my group’s experience in reporting compiler bugs has convinced me that there is no remotely useful subset of C that reliably avoids compiler bugs. Let’s take this C subset as an example:

  • only variables of type int (or unsigned int, if that seems better)
  • no dynamic memory allocation
  • only functions of 30 lines or less
  • only if/else statements, function calls, and simple for loops for flow control (no break, continue, do/while, goto, longjmp, etc.)
  • only single-level structs, single-level pointers, and one-dimensional arrays
  • no type qualifiers
  • all expressions are in three-address form; for example “x = y + z;”

We could add more restrictions, but that would seem to make it hard to get work done in the subset. Of course, to make the subset real we’d need to nail down a lot of additional details and also write a checking tool (neither activity would be difficult).
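
For concreteness, a function written in this subset might look like the following; the particular function is of course just an illustration.

/* int-only data, a one-dimensional array, a simple counted for loop,
   and every expression in three-address form */
int sum_squares(int a[10]) {
  int i;
  int t;
  int sq;
  int total;

  total = 0;
  for (i = 0; i < 10; i = i + 1) {
    t = a[i];
    sq = t * t;
    total = total + sq;
  }
  return total;
}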

Next we ask the question: what percentage of the compiler’s optimization passes will have useful work to do when compiling code written in this subset? The answer, unfortunately, is “most of them.” For an aggressive compiler, this amounts to a lot of extremely subtle code. In practice, some of it will be wrong. Thus, I am claiming that even a fairly severe C subset provides little shelter from compiler bugs.

The best way to back up my claim would be to rewrite some large C codes in my subset and show that they are still miscompiled. Unfortunately that is too much work. Rather, I’ll point to a few compiler bugs that we’ve found which, while they are not generally triggered by programs in the subset I outlined above, are triggered by pretty simple test cases. I would argue that — taken together — these effectively kill the idea of a bug-safe subset. Keep in mind that most of these test cases are not fully reduced; in some cases a compiler developer was able to do considerably better (GCC hacker Jakub Jelinek is the most amazing test case reducer I know of). Also, my guess is that most of these bugs could be tickled by programs in a smaller subset of C. For example, test cases for bugs that we report often use the volatile qualifier not because it has anything to do with the bug, but rather because volatile objects serve as a convenient mechanism for suppressing optimizations that would otherwise mask the bug.

GCC bug 42952 results in this code being miscompiled:

static int g_16[1];
static int *g_135 = &g_16[0];
static int *l_15 = &g_16[0];
static void foo (void) {
  g_16[0] = 1;
  *g_135 = 0;
  *g_135 = *l_15;
  printf("%d\n", g_16[0]);
}

GCC bug 42512 results in this code being miscompiled:

int g_3;
int main (void) {
  long long l_2;
  for (l_2 = -1; l_2 != 0; l_2 = (unsigned char)(l_2 - 1)) {
    g_3 |= l_2;
  }
  printf("g_3 = %d\n", g_3);
  return 0;
}

GCC bug 41497 results in this code being miscompiled:

static uint16_t add (uint16_t ui1, uint16_t ui2) {
  return ui1 + ui2;
}

uint32_t g_108;
uint8_t f3;
uint8_t f0;

void func_1 (void) {
  for (f3 = 0; f3 <= 0; f3 = 1) {
    for (g_108 = -13; g_108 == 0; g_108 = add (g_108, 0)) {
      f0 = 1;
    }
  }
}

LLVM bug 2487 results in this code being miscompiled:

int g_6;
void func_1 (void) {
  char l_3;
  for (l_3 = 0; l_3 >= 0; ++l_3) {
    if (!l_3) {
      for (g_6 = 1; 0; --g_6);
    }
  }
}

LLVM bug 2716 results in this code being miscompiled:

int func_3 (void) {
  long long p_5 = 0;
  signed char g_323 = 1;
  return (1 > 0x14E7A1AFC6B86DBELL) <= (p_5 - g_323);
}

LLVM bug 3115 results in this code being miscompiled:


unsigned int g_122;

int func_1 (void) {
  unsigned int l_19 = 1;
  if (1 ^ l_19 && 1) return 0;
  return 1;
}

I could go on, but six examples should suffice.

A better question than “Is there a bug-safe C subset?” might be “Is there a subset of C that, when combined with restrictions on the optimization passes used, is safe from compiler bugs?” I don’t know the answer to this, but I do know that disabling optimizations is probably the only reliable way to keep large, possibly-buggy parts of the compiler from executing. Also, organizations creating safety critical systems often limit the amount of optimization that is performed by compilers they use (though that is probably as much to give traceable, debuggable code as it is to avoid miscompilations). In the medium to long run, an even better idea would be to forget about artificial constraints on the language and instead use a verified compiler or a tool such as one of the emerging translation validators for LLVM.

Generalizing and Criticizing Delta Debugging

Delta debugging is a search-based technique for taking an input to a program that triggers a bug and making that input smaller. For example, you might have a sequence of GUI operations that causes Thunderbird to crash. Assuming the crash is deterministic and the input can be replayed automatically, you can iteratively remove UI actions until you end up at a local minimum: a collection of events where removing even one of them makes the crash go away. If Delta is implemented well and properly takes advantage of the structure of the input, the resulting failure-inducing input is not only small, but also most of it is relevant to the failure. Debugging a failure triggered by a small, highly-relevant input is generally pretty easy. The definitive Delta debugging reference is Simplifying and Isolating Failure-Inducing Input by Andreas Zeller and Ralf Hildebrandt (ZH02 from now on).

Delta is fantastically useful, particularly in the context of random testing. As part of my group’s compiler bug-finding project we implemented three new test case minimizers using Delta-like ideas, and we also use an existing line-based Delta implementation. Each of the four reducers has different strengths and weaknesses, and in fact local minima can often be escaped by running the implementations one after the other.

Significantly, none of the four test case reducers is a straightforward implementation of an algorithm from the original Delta paper — each of them generalizes it in one or more ways. After spending a lot of time working on test case reduction and thinking about this, I got the idea of writing a paper perhaps called “Generalized Delta Debugging” which weakens many of the assumptions found in the original work. The problem was that the more parts of Delta debugging I generalized, the more the result looked just like a generic greedy search. Thus, it started to look extremely doubtful whether there was any research component to generalizing Delta debugging. This piece explores the consequences of that observation.

Delta == Greedy Search

Just to be clear, by “greedy search” I mean the class of optimization algorithms that are based on a transformation operator and a fitness function. They work by repeatedly transforming the current solution to arrive at a new solution, and replacing the current with the new if the fitness level has increased. No doubt I’m butchering the accepted terminology, but the ideas here are really simple.

The “minimizing delta debugging algorithm” from ZH02 is an instance of greedy search where the fitness of a failure-inducing input is the inverse of its size and the fitness of all non-failure-inducing inputs is zero. The transformation operator removes a contiguous chunk of the input. When the transformation gets stuck — by running out of choices of chunks to remove — it reduces the chunk size. When it gets stuck at the minimum chunk size (a single line, character, or other atom of input) the search is finished.
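
Viewed this way, the minimizing Delta is one instantiation of a generic greedy skeleton along the lines of the sketch below; the names and the plugin structure are mine, not ZH02’s.

typedef struct testcase testcase;   /* opaque candidate input */

typedef struct {
  /* For minimizing Delta: 1/size if the input still triggers the
     failure, 0 otherwise. */
  double (*fitness)(const testcase *t);
  /* Produce the i-th variant of t, or NULL when no more variants
     exist; for minimizing Delta this deletes the i-th chunk at the
     current granularity, falling back to finer chunks when stuck. */
  testcase *(*transform)(const testcase *t, unsigned i);
} greedy_plugins;

/* Generic greedy search: repeatedly try variants of the current
   solution and accept any variant with strictly higher fitness. */
testcase *greedy_search(testcase *cur, const greedy_plugins *p) {
  int improved = 1;
  while (improved) {
    improved = 0;
    for (unsigned i = 0; ; i++) {
      testcase *next = p->transform(cur, i);
      if (!next)
        break;                      /* variants exhausted at this point */
      if (p->fitness(next) > p->fitness(cur)) {
        cur = next;                 /* accept the improvement */
        improved = 1;
      }
      /* else: discard next (memory management omitted in this sketch) */
    }
  }
  return cur;
}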

The “general delta debugging algorithm” from ZH02 is very similar but its goal is to minimize the difference between the current solution and a given input, instead of simply minimizing size. Since I haven’t found many uses for this algorithm in practice, and since it’s not all that different from the minimizing Delta, I won’t discuss it further. Whenever I mention the “Delta algorithm” or similar, it is the minimizing Delta to which I refer.

Which parts of the Delta algorithms from ZH02 can be usefully generalized? As it turns out, pretty much all of them. Let’s look at the different elements in turn.

Generalizing the Transformation Operator

The Delta transformer that deletes contiguous chunks of input at ever-finer levels of granularity is reasonably generic and efficient. However, when attacking highly-structured test cases it often gets stuck at a local maximum long before the test case is fully reduced. (Sorry if I keep switching between minimum and maximum: when discussing size the goal is minimization; when discussing fitness in a generic context, I’ll stick to the convention that the goal is maximization.) Hierarchical delta debugging is a variant that improves performance by operating on sub-trees of tree-structured inputs.

Another generalization is to use a transformer that replaces a chunk of input with something else, instead of simply deleting it. For example, one of our new reducers for C code tries to replace uses of variables with constants. Another replaces function calls with their results, including side effects. These are very effective in practice.

It is also useful to delete parts of the input in a non-local way. For example, to remove an argument to a function in a C program, we must delete it from the definition, declaration, and all uses. Making this transformation work requires a painful degree of friendliness with the C code, but again it’s very useful in practice.

Finally, we sometimes use transformations that don’t even make the test case smaller. For example it may be desirable to replace a small, complex construct (like a call to a trig function in a C program) with a larger but simpler construct (a math expression approximating the trig function’s behavior, perhaps). Similarly, it may be desirable to replace an array with a collection of scalars or a struct assignment with a collection of assignments to members. The scalars or the assignments are then vulnerable to subsequent reduction.
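
As a made-up illustration of a transformation that enlarges the test case while simplifying its constructs:

struct point { int x; int y; };

/* Before: a struct assignment. */
void copy_before(struct point *dst, struct point *src) {
  *dst = *src;
}

/* After: the assignment is scalarized into per-member assignments,
   each of which can now be deleted or simplified independently. */
void copy_after(struct point *dst, struct point *src) {
  dst->x = src->x;
  dst->y = src->y;
}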

All of these examples point towards a more general idea which is that there is a strong synergy between test case reduction and compiler optimization (which I wrote about earlier).

Generalizing the Fitness Function

ZH02’s minimizing Delta uses 1/size as its fitness function and its general Delta uses the inverse of the string distance between current solution and goal. There are plenty of other useful fitness functions. As I mentioned in the previous paragraph, considering the complexity of different program constructs is useful. We’ve also experimented with using Delta techniques to minimize the number of instructions executed by the test case. The insight is that the complexity of a failing execution depends not only on syntactic characteristics of the failure-inducing input, but also on the dynamic behavior induced by the test case.

A major gap in the ZH02 paper is that it does not address the validity problem: does the transformed test case satisfy the constraints imposed on test inputs? For some uses of Delta no validity test is required because the system under test can detect invalid inputs. On the other hand, the validity problem for C programs is extremely difficult to deal with (in theory, it’s undecidable; in practice, no fun at all) and this has been a major stumbling block in our C compiler bug-finding work — but now solved (thank goodness not by us). Sometimes it is desirable to test software with invalid inputs, but for the C compiler work we want to say that all invalid test cases have zero fitness.

Generalizing the Search Framework

The third element of the Delta debugging algorithms from ZH02 that can be usefully generalized is the search algorithm itself. Backtracking can be used to avoid getting stuck too early, as can other techniques such as simulated annealing or genetic algorithms. Basically, test case reduction can be based on any search algorithm, greedy or not.

A Verdict on Delta

Now I’d like to ask a couple of specific questions about the ZH02 paper. First, why weren’t the algorithms presented as “plugins” to the greedy search framework? This would have improved the presentation by making it clear which elements of the algorithms are important vs. boilerplate. Also, it would have made it easier for readers to understand how subsequent improvements to delta should be structured.

Second, given that delta debugging is a fairly straightforward application of greedy search, how significant is its research contribution? The answer that I lean towards is that the main contribution of delta is giving a nice name to an excellent idea that was, in 2002, somewhat obvious.

Since it is dangerous to criticize an idea by saying, in retrospect, that it was obvious, I’ll provide a bit of justification. First, two of the previously published papers cited by Zeller and Hildebrandt (references 8 and 9 in the paper) apply ideas that are basically the same as Delta debugging, but without calling it that. Additionally, a paper they missed — Differential Testing for Software, published in 1998 — described a search-based, automated test case reducer. So it’s clear that by 2002 the ideas forming the core of Delta debugging were floating around.

My opinion is that the two Delta algorithms from the ZH02 paper have little enduring value because they simply do not work very well without modification. At least, we couldn’t make them work without significant generalization. The enduring value of the paper is to popularize and assign a name to the idea of using a search algorithm to improve the quality of failure-inducing test cases.

Moving Forward

As the complexity of computer systems continues to increase, the advantages derived from deterministic execution and automated test case reduction will also continue to increase. Delta debugging provides a conceptual framework for the reduction part. Unfortunately, it seems that few useful papers on reducing the size of failure-inducing programs have appeared since the original Delta work. A notable exception is the hierarchical delta debugging work I mentioned earlier. Iterative Delta Debugging is interesting but solves a slightly different problem. Deriving Input Syntactic Structure is a technique that can make hierarchical delta easier to implement. If anyone knows of more work along these lines, I’d like to hear about it.