Generalizing and Criticizing Delta Debugging

Delta debugging is a search-based technique for taking an input to a program that triggers a bug and making that input smaller. For example, you might have a sequence of GUI operations that causes Thunderbird to crash. Assuming the crash is deterministic and the input can be replayed automatically, you can iteratively remove UI actions until you end up at a local minimum: a collection of events where removing even one of them makes the crash go away. If Delta is implemented well and properly takes advantage of the structure of the input, the resulting failure-inducing input is not only small, but also most of it is relevant to the failure. Debugging a failure triggered by a small, highly-relevant input is generally pretty easy. The definitive Delta debugging reference is Simplifying and Isolating Failure-Inducing Input by Andreas Zeller and Ralf Hildebrandt (ZH02 from now on).

Delta is fantastically useful, particularly in the context of random testing. As part of my group’s compiler bug-finding project we implemented three new test case minimizers using Delta-like ideas, and we also use an existing line-based Delta implementation. Each of the four reducers has different strengths and weaknesses, and in fact local minima can often be escaped by running the implementations one after the other.

Significantly, none of the four test case reducers is a straightforward implementation of an algorithm from the original Delta paper — each of them generalizes it in one or more ways. After spending a lot of time working on test case reduction and thinking about this, I got the idea of writing a paper perhaps called “Generalized Delta Debugging” which weakens many of the assumptions found in the original work. The problem was that the more parts of Delta debugging I generalized, the more the result looked just like a generic greedy search. Thus, it started to look extremely doubtful whether there was any research component to generalizing Delta debugging. This piece explores the consequences of that observation.

Delta == Greedy Search

Just to be clear, by “greedy search” I mean the class of optimization algorithms that are based on a transformation operator and a fitness function. They work by repeatedly transforming the current solution into a new one, and replacing the current solution with the new one if its fitness is higher. No doubt I’m butchering the accepted terminology, but the ideas here are really simple.

The “minimizing delta debugging algorithm” from ZH02 is an instance of greedy search where the fitness of a failure-inducing input is the inverse of its size and the fitness of all non-failure-inducing inputs is zero. The transformation operator removes a contiguous chunk of the input. When the transformation gets stuck — by running out of choices of chunks to remove — it reduces the chunk size. When it gets stuck at the minimum chunk size (a single line, character, or other atom of input) the search is finished.
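
To make the connection concrete, here is a minimal sketch of the minimizing Delta loop viewed as a greedy search over a character string. It is a simplification rather than the exact ddmin algorithm from ZH02, and the still_fails oracle is a placeholder for whatever script decides that a candidate input still triggers the bug:

#include <stdlib.h>
#include <string.h>

typedef int (*oracle_fn)(const char *candidate, size_t len);

/* Try removing every contiguous chunk of the given size; keep any removal
   that preserves the failure. Returns nonzero if any removal succeeded. */
static int removal_pass (char *input, size_t *len, size_t chunk, oracle_fn still_fails) {
  int progress = 0;
  size_t start = 0;
  while (start + chunk <= *len) {
    size_t newlen = *len - chunk;
    char *candidate = malloc (newlen + 1);
    if (!candidate)
      return progress;
    memcpy (candidate, input, start);
    memcpy (candidate + start, input + start + chunk, *len - (start + chunk));
    candidate[newlen] = '\0';
    if (still_fails (candidate, newlen)) {
      memcpy (input, candidate, newlen + 1); /* higher fitness: keep it */
      *len = newlen;
      progress = 1;                          /* do not advance start */
    } else {
      start += chunk;                        /* failure lost: move past this chunk */
    }
    free (candidate);
  }
  return progress;
}

/* Greedy search: when no removal at the current granularity makes progress,
   halve the chunk size; stop after single-character granularity. */
void reduce (char *input, size_t *len, oracle_fn still_fails) {
  for (size_t chunk = *len / 2; chunk >= 1; ) {
    if (!removal_pass (input, len, chunk, still_fails))
      chunk /= 2;
  }
}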

The “general delta debugging algorithm” from ZH02 is very similar but its goal is to minimize the difference between the current solution and a given input, instead of simply minimizing size. Since I haven’t found many uses for this algorithm in practice, and since it’s not all that different from the minimizing Delta, I won’t discuss it further. Whenever I mention the “Delta algorithm” or similar, it is the minimizing Delta to which I refer.

Which parts of the Delta algorithms from ZH02 can be usefully generalized? As it turns out, pretty much all of them. Let’s look at the different elements in turn.

Generalizing the Transformation Operator

The Delta transformer that deletes contiguous chunks of input at ever-finer levels of granularity is reasonably generic and efficient. However, when attacking highly-structured test cases it often gets stuck at a local maximum long before the test case is fully reduced. (Sorry if I keep switching between minimum and maximum. When discussing size the goal is minimization; when discussing fitness in a generic context, I’ll stick to the convention that the goal is maximization.) Hierarchical delta debugging is a variant that improves performance by operating on sub-trees of tree-structured inputs.

Another generalization is to use a transformer that replaces a chunk of input with something else, instead of simply deleting it. For example, one of our new reducers for C code tries to replace uses of variables with constants. Another replaces function calls with their results, including side effects. These are very effective in practice.
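
As a tiny, hypothetical illustration (our real reducers operate on machine-generated C, not toy code like this), replacing a call with its result and then a variable with a constant turns the first function below into the second:

/* before: a call and a variable stand between the reducer and a constant */
static int g (void) { return 3; }
int h_before (void) { int x = g (); return x + x; }

/* after: the call is replaced by its result, then the variable by a constant */
int h_after (void) { return 3 + 3; }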

It is also useful to delete parts of the input in a non-local way. For example, to remove an argument to a function in a C program, we must delete it from the definition, declaration, and all uses. Making this transformation work requires a painful degree of friendliness with the C code, but again it’s very useful in practice.
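
Here is a contrived before/after sketch of such a non-local deletion; dropping the second parameter yields a compilable candidate only if the declaration, the definition, and every call site are edited in one step (the names are invented for illustration):

/* before */
int add_before (int a, int b);
int add_before (int a, int b) { return a + b; }
int use_before (void) { return add_before (1, 2); }

/* after: one transformation touches all three locations at once */
int add_after (int a);
int add_after (int a) { return a; }
int use_after (void) { return add_after (1); }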

Finally, we sometimes use transformations that don’t even make the test case smaller. For example it may be desirable to replace a small, complex construct (like a call to a trig function in a C program) with a larger but simpler construct (a math expression approximating the trig function’s behavior, perhaps). Similarly, it may be desirable to replace an array with a collection of scalars or a struct assignment with a collection of assignments to members. The scalars or the assignments are then vulnerable to subsequent reduction.
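
A hypothetical example of such a size-increasing rewrite: a reducer might replace a trig call with a short polynomial approximation, trading a syntactically tiny but semantically opaque construct for a larger expression whose terms can then be reduced one by one:

#include <math.h>

/* before: small, but the reducer cannot see inside the call */
double f_before (double x) { return sin (x); }

/* after: larger, but every term is now a candidate for further reduction */
double f_after (double x) { return x - (x * x * x) / 6.0; }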

All of these examples point towards a more general idea which is that there is a strong synergy between test case reduction and compiler optimization (which I wrote about earlier).

Generalizing the Fitness Function

ZH02’s minimizing Delta uses 1/size as its fitness function and its general Delta uses the inverse of the string distance between the current solution and the goal. There are plenty of other useful fitness functions. As I mentioned in the previous section, considering the complexity of different program constructs is useful. We’ve also experimented with using Delta techniques to minimize the number of instructions executed by the test case. The insight is that the complexity of a failing execution depends not only on syntactic characteristics of the failure-inducing input, but also on the dynamic behavior induced by the test case.

A major gap in the ZH02 paper is that it does not address the validity problem: does the transformed test case satisfy the constraints imposed on test inputs? For some uses of Delta no validity test is required because the system under test can detect invalid inputs. On the other hand, the validity problem for C programs is extremely difficult to deal with (in theory, it’s undecidable; in practice, no fun at all) and this has been a major stumbling block in our C compiler bug-finding work — but now solved (thank goodness not by us). Sometimes it is desirable to test software with invalid inputs, but for the C compiler work we want to say that all invalid test cases have zero fitness.
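
A minimal sketch of a generalized fitness function that folds in both dynamic behavior and the validity test; the struct fields, the helper measurements, and the weighting are all hypothetical choices for illustration:

#include <stddef.h>

struct candidate {
  size_t size_in_bytes;         /* syntactic size of the test case */
  unsigned long insns_executed; /* measured by running the test case */
  int is_valid;                 /* e.g., no undefined behavior detected */
  int triggers_failure;         /* the compiler bug still manifests */
};

double fitness (const struct candidate *c) {
  if (!c->is_valid || !c->triggers_failure)
    return 0.0; /* invalid or non-failing inputs are worthless */
  /* prefer smaller, dynamically simpler test cases; the weight is arbitrary */
  return 1.0 / (1.0 + (double)c->size_in_bytes + 0.01 * (double)c->insns_executed);
}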

Generalizing the Search Framework

The third element of the Delta debugging algorithms from ZH02 that can be usefully generalized is the search algorithm itself. Backtracking can be used to avoid getting stuck too early, as can other techniques such as simulated annealing or genetic algorithms. Basically, test case reduction can be based on any search algorithm, greedy or not.
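
For instance, keeping the transformation operator and fitness function fixed and swapping only the acceptance rule turns the greedy loop into something like simulated annealing. The sketch below is illustrative rather than code we actually use:

#include <math.h>
#include <stdlib.h>

/* greedy: accept only strict improvements in fitness */
int accept_greedy (double old_fit, double new_fit, double temperature) {
  (void)temperature;
  return new_fit > old_fit;
}

/* annealing: occasionally accept a worse candidate to escape local maxima */
int accept_annealing (double old_fit, double new_fit, double temperature) {
  if (new_fit >= old_fit)
    return 1;
  double p = exp ((new_fit - old_fit) / temperature);
  return ((double)rand () / (double)RAND_MAX) < p;
}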

A Verdict on Delta

Now I’d like to ask a couple of specific questions about the ZH02 paper. First, why weren’t the algorithms presented as “plugins” to the greedy search framework? This would have improved the presentation by making it clear which elements of the algorithms are important vs. boilerplate. Also, it would have made it easier for readers to understand how subsequent improvements to delta should be structured.

Second, given that delta debugging is a fairly straightforward application of greedy search, how significant is its research contribution? The answer that I lean towards is that the main contribution of delta is giving a nice name to an excellent idea that was, in 2002, somewhat obvious.

Since it is dangerous to criticize an idea by saying, in retrospect, that it was obvious, I’ll provide a bit of justification. First, two of the previously published papers cited by Zeller and Hildebrandt (references 8 and 9 in the paper) apply ideas that are basically the same as Delta debugging, but without calling it that. Additionally, a paper they missed — Differential Testing for Software, published in 1998 — described a search-based, automated test case reducer. So it’s clear that by 2002 the ideas forming the core of Delta debugging were floating around.

My opinion is that the two Delta algorithms from the ZH02 paper have little enduring value because they simply do not work very well without modification. At least, we couldn’t make them work without significant generalization. The enduring value of the paper is to popularize and assign a name to the idea of using a search algorithm to improve the quality of failure-inducing test cases.

Moving Forward

As the complexity of computer systems continues to increase, the advantages derived from deterministic execution and automated test case reduction will also continue to increase. Delta debugging provides a conceptual framework for doing so. Unfortunately, it seems that few useful papers on reducing the size of failure-inducing programs have appeared since the original Delta work. A notable exception is the hierarchical delta debugging work I mentioned earlier. Iterative Delta Debugging is interesting but solves a slightly different problem. Deriving Input Syntactic Structure is a technique that can make hierarchical delta easier to implement. If anyone knows of more work along these lines, I’d like to hear about it.

An Executable Semantics For C Is Useful

The goal of a C/C++ compiler is to turn every sequence of ASCII characters into executable instructions. OK, not really — though it does seem that way sometimes. The real goal of a C/C++ compiler is to map every conforming input into executable instructions that correspond to a legal interpretation of that input. The qualifiers “conforming” and “legal interpretation” are very important. First, the compiler has extremely weak requirements about what it should do with non-conforming inputs, such as programs that contain undefined behaviors (array bounds violations, etc.). Second, all realistic C/C++ programs have a large number of possible interpretations, for example corresponding to different integer sizes, different orders of evaluation for function arguments, etc. The compiler chooses a convenient or efficient one, and the remaining interpretations are latent. They may emerge later on if the compiler options are changed, if the compiler is upgraded, or if a different compiler is used. The point is that the compiler has no obligation to tell us whether the input is conforming or not, nor how many possible interpretations it has.
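
A small example of latent interpretations: the order in which the arguments to f() are evaluated is unspecified, so a conforming compiler may produce an executable that prints 12, or one that prints 21, and it is under no obligation to tell us which choice it made:

#include <stdio.h>

static int g (void) { putchar ('1'); return 0; }
static int h (void) { putchar ('2'); return 0; }
static void f (int a, int b) { (void)a; (void)b; putchar ('\n'); }

int main (void) {
  f (g (), h ()); /* unspecified: g() and h() may run in either order */
  return 0;
}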

Thus, while C/C++ compilers are very good at turning conforming programs into efficient executables, they are just about useless for answering other kinds of questions:

  • Does the program ever execute undefined behaviors, causing it (in principle) to have no meaning and (in practice) to execute attack code or crash?
  • Does the program rely on unspecified behaviors, making it non-portable across compilers, compiler versions, and changes in compiler options?
  • Does the program rely on implementation-defined behaviors, affecting its portability to other compilers and platforms?
  • Why does the program behave in a certain way? In other words, what part of the standard forced that interpretation?

To answer these questions, a wide variety of static analyzers, model checkers, runtime verifiers, and related tools have been developed. These tools are great. However, even taken all together, they are incomplete: there exist plenty of bad (or interesting) program behaviors that few or none of them can find. For example:

  • Very few tools exist that can reliably detect uses of uninitialized storage.
  • Few, if any, tools can correctly diagnose problems resulting from C/C++’s unspecified order of evaluation of function arguments.
  • An lvalue must not be modified multiple times, or be both read and written, in between sequence points. I’m not aware of many tools that can correctly detect that evaluating this function results in undefined behavior when p1 and p2 are aliases:
int foo (int *p1, int *p2) {
  return (*p1)++ % (*p2)++;
}

The Missing Tool

The missing tool (or one of them, at any rate) is an executable semantics for C. An executable semantics is an extremely careful kind of interpreter where every action it takes directly corresponds to some part of the language standard. Moreover, an executable semantics can be designed to tell us whether the standard assigns any meaning at all to the program being interpreted. In other words, it can explicitly check for all (or at least most) undefined, unspecified, and implementation-defined behaviors. For example, when an executable semantics evaluates (*p1)++ % (*p2)++, it won’t assign a meaning to the expression until it has checked that:

  • both pointers are valid
  • neither addition overflows (if the promoted types are signed)
  • p1 and p2 are not aliases
  • *p2 is not 0
  • either *p1 is not INT_MIN or *p2 is not -1

Moreover, the tool should make explicit all of the implicit casts that are part of the “usual arithmetic conversions.” And it needs to do about 500 other things that we usually don’t think about when dealing with C code.
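
To get a feel for how much checking hides in that one expression, here is a hand-written approximation of the guards. This is not how an executable semantics is implemented, and the null test is only a crude stand-in for real pointer validity, but it makes the checklist above concrete:

#include <limits.h>
#include <stddef.h>

/* returns 1 and stores the result only if the expression has defined behavior */
int checked_foo (int *p1, int *p2, int *result) {
  if (p1 == NULL || p2 == NULL)         /* both pointers must be valid */
    return 0;
  if (p1 == p2)                         /* aliases: unsequenced side effects */
    return 0;
  if (*p1 == INT_MAX || *p2 == INT_MAX) /* the increments must not overflow */
    return 0;
  if (*p2 == 0)                         /* remainder by zero is undefined */
    return 0;
  if (*p1 == INT_MIN && *p2 == -1)      /* INT_MIN % -1 overflows */
    return 0;
  *result = (*p1)++ % (*p2)++;
  return 1;
}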

Who Needs an Executable Semantics?

Regular programmers won’t need it very often, but they will occasionally find it useful for settling the kinds of annoying arguments that happen when people don’t know how to read the all-too-ambiguous English text in the standard. Of course, the executable semantics can only settle arguments if we agree that it has captured the sense of the standard. Better yet, we would treat the executable semantics as definitive and the standard document as a derivative work.

Compiler developers need an executable semantics. It would provide a simple, automated filter to apply to programs that purportedly trigger compiler bugs. A web page at Keil states that “Fewer than 1% of the bug reports we receive are actually bugs.” An executable semantics would rapidly find code fragments that contain undefined or unspecified behaviors — these are a common source of bogus bug reports. Currently, compiler developers do this checking by hand. The GCC bug database contains 4966 bug reports that have been marked as INVALID. Not all of these could be automatically detected, but some of them certainly could be.

People developing safety-critical software may get some benefit from an executable semantics. Consider CompCert, a verified compiler that provably preserves the meaning of C code when translating it into assembly. CompCert’s guarantee, however, is conditional on the C code containing no undefined behaviors. How are we supposed to verify the absence of undefined behaviors when existing tools don’t reliably check for uses of uninitialized storage or multiple updates to lvalues? Moreover, CompCert is free to choose any legitimate interpretation of a C program that relies on unspecified behaviors, and it does not need to tell us which one it has chosen. We need to verify up-front that (under some set of implementation-defined behaviors) our safety-critical C program has a single interpretation.

My students and I need an executable semantics, because we are constantly trying to figure out whether random C functions are well-defined or not. This is surprisingly hard. We also need a reliable, automated way to detect undefined behavior because this enables automated test case reduction.

An Executable Semantics for C Exists

I spent a few years lamenting the non-existence of an executable C semantics, but no longer: the tool now exists. It was created by Chucky Ellison, a PhD student at the University of Illinois working under the supervision of Grigore Rosu. They have written a TR about it, and the tool can be downloaded. Hopefully Chucky does not mind if I provide this link — the tool is very much a research prototype (mainly, it is not very fast). But it works:

regehr@home:~/svn/code$ cat lval.c
int foo (int *p1, int *p2) {
  return (*p1)++ % (*p2)++;
}

int main (void) {
  int a = 1;
  return foo (&a, &a);
}
regehr@home:~/svn/code$ kcc lval.c
regehr@home:~/svn/code$ ./a.out
=============================================================
ERROR! KCC encountered an error while executing this program.
=============================================================
Error: 00003
Description: Unsequenced side effect on scalar object with value computation of same object.
=============================================================
File: /mnt/z/svn/code/lval.c
Function: foo
Line: 2
============================================================

As I mentioned earlier, very few tools for analyzing C code find this error. Chucky’s tool can also perform a state space exploration to find order of evaluation problems and problems in concurrent C codes. Finally, it can run in profile mode. Unlike a regular profiler, this one profiles the rules from the C semantics that fire when the program is interpreted. This is really cool and we plan to use it to figure out what parts of the C standard are not exercised by Csmith.

Chucky’s tool is already an integral part of one of our test case reducers. This reducer takes as input a huge, ugly, bug-triggering C program generated by Csmith. It then uses Delta debugging to output a much smaller bug-triggering program that (ideally) can be included in a compiler bug report without further manual simplification. Before Chucky’s tool arrived, we had spent several years trying to deal with the fact that Delta always introduces undefined behavior. We now seem to have a bulletproof solution to that problem.

The benefits of executable semantics have long been recognized in the PL community. The new thing here is a tool that actually works, for the C language. Hopefully, as Chucky’s tool matures people will find more uses for it, and perhaps it can even evolve into a sort of de facto litmus test for ascertaining the meaning — or lack thereof — of difficult C programs.

How Can Computer Science Help Us Get To Mars?

SpaceX thinks it can get a person to Mars within 20 years. This seems optimistic, given that SpaceX does not enjoy the significant chunk of the USA’s federal budget that permitted NASA to get to the Moon on a relatively short time scale. Nevertheless, it’s a good goal, and presumably 50 years of improvements in space systems engineering will make up for some of the shortfall.

Since I strongly believe that getting humans off the rock needs to be a priority, I want to help. What are the CS problems that (1) can be addressed in a University and (2) will directly support getting off the rock? Since much of my research is aimed towards increasing embedded software reliability, I suspect that I’m already in the right ballpark, but once we get into the specifics there are a lot of ways to go astray. Perhaps I should see if I can do my next sabbatical at SpaceX. I recently read a book on how space systems can go wrong and (no surprise) there are very many ways, some of them software-related.

Value Loss Coverage

The ugly reality of computer arithmetic on fixed-length bit vectors is that many operations are not equivalent to the mathematical operators they superficially resemble. For example, only a tiny fraction of the possible outcomes of a 32-bit multiplication can be represented in a 32-bit destination register. Although bignum and rational packages exist — and have existed for many years — there are unsolved problems (highly unpredictable runtime and memory allocation behavior) that preclude their use in important classes of applications such as embedded systems that must execute in a predictable fashion, probably on constrained hardware.
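
As a small demonstration of the mismatch, multiplying 2^16 by itself wraps a 32-bit destination to zero, while a 64-bit destination preserves the mathematical result; this program prints 0 followed by 4294967296:

#include <stdint.h>
#include <stdio.h>

int main (void) {
  uint32_t a = 0x10000, b = 0x10000;
  uint32_t lossy = a * b;           /* 2^32 wraps to 0 in 32 bits */
  uint64_t exact = (uint64_t)a * b; /* the mathematical result */
  printf ("%u %llu\n", (unsigned)lossy, (unsigned long long)exact);
  return 0;
}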

The canonical example of harmful value loss was an assignment from a floating point value representing velocity into a 16-bit integer onboard Ariane 5 flight 501. The loss of value initiated a chain of events that resulted in the destruction of the rocket.

Modern compilers warn about potential value loss, which is nice because it makes it easier for us to inspect these code sites. But even so, it is often extremely hard to figure out which of these sites can lead to actual value loss. For example, the Ariane software could not lead to value loss on the Ariane 4, but value loss did occur when the same software module was reused on the Ariane 5. Clearly we cannot expect the compiler to understand the maximum horizontal velocity of a rocket.

The idea behind this piece is that for purposes of test coverage, every operation that potentially results in value loss should be tested in both its lossy and non-lossy modes. Consider these two C functions:

char foo1 (int x) {
  return x;
}
signed char foo2 (unsigned char x) {
  return x;
}

foo1() has value loss when x is outside the range of values representable in a char. We can fully test it by passing it the values 0 and 128 (assuming CHAR_MAX is 127). foo2() does not have a bitwidth constraint like foo1(), but it still can lose values because the value ranges representable in a signed and unsigned char do not fully overlap. foo2(), also, can be fully covered by passing it the values 0 and 128.

As a software testing goal, high value-loss coverage may only make sense when a whitebox test case generator like Klee is available. My student Peng Li implemented a patch to LLVM’s Clang frontend that takes a C program and compiles it into a form where the normal and lossy paths are distinct. These instrumented codes can be passed to Klee, which attempts to generate inputs that induce path coverage on the program. This works well for small test codes. However, we don’t yet have results from real applications showing that this extra coverage is useful for exposing bugs.
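
As a rough illustration of the instrumentation idea (hand-written here, not the actual output of the Clang patch), splitting foo1() so that the lossy and non-lossy conversions sit on different paths gives a path-coverage tool such as Klee a branch to aim at for each mode:

#include <limits.h>

char foo1_instrumented (int x) {
  if (x < CHAR_MIN || x > CHAR_MAX) {
    /* lossy path: the conversion cannot represent x exactly */
    return (char)x;
  } else {
    /* non-lossy path: the value is preserved */
    return (char)x;
  }
}

Covering both branches of each such test is exactly the value-loss coverage described above.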

How Much and What to Read?

Grad students often aren’t quite sure how much of their work time should be spent reading, and may also have trouble figuring out what to read during that time. (In principle this problem also applies to professors, but it’s not much of an issue in practice since we have almost no time for discretionary reading.) I usually tell people to err on the side of doing too much, as opposed to reading too much. Averaging one day a week on reading is probably about right, but it may be a lot higher or lower than this during certain periods of grad school.

Of the time spent reading, about half should go towards solving immediate research problems. The purpose of this is getting unstuck, learning about alternate approaches, avoiding reinvention of known techniques, and learning how and what the current players in your field are thinking. Of the remaining 50% of reading time, half of it should go towards “deep background reading” — reading in the thesis area that goes a bit further afield and a bit further back in the chain of references than is absolutely necessary to get the work done. This reading will form the basis for the related work sections of papers and the dissertation. The last 25% of reading time is purely discretionary: classic papers, good books, new results from conferences, etc. This material is not expected to be directly relevant (though it may turn out to be) but keeps a person up to date on random interesting topics.

Preemption vs. Event Ordering

There’s no doubt that concurrent programming is hard, but it’s not always clear exactly why. Generally, eliminating data races is not the hard part. On the other hand, dealing with event ordering problems can be extremely difficult. To put that another way, if we remove preemption from the picture by programming with non-preemptive threads or a single-threaded event loop, much of the difficulty of concurrent programming remains.

In 1998/99, while doing an internship at Microsoft Research, I wrote some moderately complicated code for the NT kernel. Since kernel debugging was a pain, I wrote a simple event-driven simulator that emulated the kernel’s execution environment well enough to support running my code in user mode. I was depressed to learn that debugging code in the single-threaded simulator was not hugely easier than debugging the live version. The crux of the issue was atomicity: not “concurrency atomicity” but rather something like “invariant atomicity.” The kernel code I was writing was not only highly stateful, but also quite incremental, meaning that event handlers would often drop what they were doing and other handlers would pick up the job later. Thus, event handlers were seeing extremely unclean system states. The invariants were more complicated than I could easily grasp, which of course was the root of the problem. Software systems using callbacks can be surprisingly hard to get right for similar reasons, and I suspect that many problems with error-handling code have a similar root cause.

In contrast with my NT kernel simulator example, an event driven system will be far easier to get right when stable states of data structures coincide nicely with event boundaries. Adding concurrency/parallelism may not make program development any more difficult if the points of interaction between concurrent flows coincide with relatively stable points in the system’s invariants.

One part of the solution to event ordering problems is to design systems where the invariants are well-structured and unclean system states are exposed to as little code as possible. Another part is to use programming languages and environments that have better support for explicit invariants. Lamentably, almost all of my development these days is either systems programming or scripting — both domains where invariant-based programming (design by contract for example) is weakly supported.

Software Bugs and Scientific Progress

When a bug is found in a piece of software, the root cause is often a bug in someone’s thoughts. One way to better understand a bug is to look at how deep the underlying thought error was. In other words: How many assumptions must be revisited as a result of the bug?

Level 1 — Syntax error, such as using the wrong operator or putting a function’s arguments in the wrong order. These are totally unsurprising.

Level 2 — Implementation error, such as wrongly assuming a pointer is non-null. These require developers to revisit whatever implementation-level assumption was made that led to an invariant violation. Implementation errors can also occur at the level of algorithm design; textbooks can contain level 2 errors. Incorrect or sloppy theorems may have to be revised in response to level 2 errors.

Level 3 — Design error: the system is structured in such a way that it cannot meet its specification. A web server that runs untrusted code in its address space contains a level 3 bug.

Level 4 — Specification error: the software is solving the wrong problem. Assumptions about the purpose of the system must be revisited.

Level 5 — Framework error: the intellectual scaffolding upon which the software was based contains an error or omission. My favorite example comes from the early days of the development of optimizing compilers when people first confronted difficult questions about whether a particular optimization was legal or not. An entire sub-area of computer science had to be created to answer these questions. Something analogous is happening right now where we lack a framework for creating reliable and secure large-scale, software-intensive systems.

Level 6 — Universe error: a previously-unknown physical law prevents the software from working. If good computers had existed in the 19th century, the first person who tried to implement Maxwell’s demon and the first person who tried to implement a universal discriminator for halting vs. non-halting programs would have encountered errors at level 6. Future level 6 errors may result from efforts to build artificial intelligences.

Something interesting about levels 5 and 6 is that they look more like progress than like errors. This is no coincidence: scientific progress occurs when bugs in our collective thinking are fixed. Thus, we could easily construct a scale for scientific advances that mirrors the scale for software bugs. Einstein and Turing operated at level 6 at least a few times in their lives; many other great scientists work at level 5. A typical MS thesis is at level 2 and a dissertation should be no lower than level 3.