Generalizing and Criticizing Delta Debugging


Delta debugging is a search-based technique for taking an input to a program that triggers a bug and making that input smaller. For example, you might have a sequence of GUI operations that causes Thunderbird to crash. Assuming the crash is deterministic and the input can be replayed automatically, you can iteratively remove UI actions until you end up at a local minimum: a collection of events where removing even one of them makes the crash go away. If Delta is implemented well and properly takes advantage of the structure of the input, the resulting failure-inducing input is not only small, but also most of it is relevant to the failure. Debugging a failure triggered by a small, highly-relevant input is generally pretty easy. The definitive Delta debugging reference is Simplifying and Isolating Failure-Inducing Input by Andreas Zeller and Ralf Hildebrandt (ZH02 from now on).

Delta is fantastically useful, particularly in the context of random testing. As part of my group’s compiler bug-finding project we implemented three new test case minimizers using Delta-like ideas, and we also use an existing line-based Delta implementation. Each of the four reducers has different strengths and weaknesses and in fact local minima can often by escaped by running the implementations one after the other.

Significantly, none of the four test case reducers is a straightforward implementation of an algorithm from the original Delta paper — each of them generalizes it in one or more ways. After spending a lot of time working on test case reduction and thinking about this, I got the idea of writing a paper perhaps called “Generalized Delta Debugging” which weakens many of the assumptions found in the original work. The problem was that the more parts of Delta debugging I generalized, the more the result looked just like a generic greedy search. Thus, it started to look extremely doubtful whether there was any research component to generalizing Delta debugging. This piece explores the consequences of that observation.

Delta == Greedy Search

Just to be clear, by “greedy search” I mean the class of optimization algorithms that are based on a transformation operator and a fitness function. They work by repeatedly transforming the current solution to arrive at a new solution, and replacing the current with the new if the fitness level has increased. No doubt I’m butchering the accepted terminology, but the ideas here are really simple.

The “minimizing delta debugging algorithm” from ZH02 is an instance of greedy search where the fitness of a failure-inducing input is the inverse of its size and the fitness of all non-failure-inducing inputs is zero. The transformation operator removes a contiguous chunk of the input. When the transformation gets stuck — by running out of choices of chunks to remove — it reduces the chunk size. When it gets stuck at the minimum chunk size (a single line, character, or other atom of input) the search is finished.

The “general delta debugging algorithm” from ZH02 is very similar but its goal is to minimize the difference between the current solution and a given input, instead of simply minimizing size. Since I haven’t found many uses for this algorithm in practice, and since it’s not all that different from the minimizing Delta, I won’t discuss it further. Whenever I mention the “Delta algorithm” or similar, it is the minimizing Delta to which I refer.

Which parts of the Delta algorithms from ZH02 can be usefully generalized? As it turns out, pretty much all of them. Let’s look at the different elements in turn.

Generalizing the Transformation Operator

The Delta transformer that deletes contiguous chunks of input at ever-finer levels of granularity is reasonably generic and efficient. However, when attacking highly-structured test cases it often gets stuck at a local maximum long before the test case is fully reduced. (Sorry if I keep switching between minimum and maximum. When discussing size the goal is minimization, when discussing fitness in a generic context, I’ll stick to the convention that the goal is maximization.) Hierarchical delta debugging is a variant that improves performance by operating on sub-trees of tree-structured inputs.

Another generalization is to use a transformer that replaces a chunk of input with something else, instead of simply deleting it. For example, one of our new reducers for C code tries to replace uses of variables with constants. Another replaces function calls with their results, including side effects. These are very effective in practice.

It is also useful to delete parts of the input in a non-local way. For example, to remove an argument to a function in a C program, we must delete it from the definition, declaration, and all uses. Making this transformation work requires a painful degree of friendliness with the C code, but again it’s very useful in practice.

Finally, we sometimes use transformations that don’t even make the test case smaller. For example it may be desirable to replace a small, complex construct (like a call to a trig function in a C program) with a larger but simpler construct (a math expression approximating the trig function’s behavior, perhaps). Similarly, it may be desirable to replace an array with a collection of scalars or a struct assignment with a collection of assignments to members. The scalars or the assignments are then vulnerable to subsequent reduction.

All of these examples point towards a more general idea which is that there is a strong synergy between test case reduction and compiler optimization (which I wrote about earlier).

Generalizing the Fitness Function

ZH02’s minimizing Delta uses 1/size as its fitness function and its general Delta uses the inverse of the string distance between current solution and goal. There are plenty of other useful fitness functions. As I mentioned in the previous paragraph, considering the complexity of different program constructs is useful. We’ve also experimented with using Delta techniques to minimize the number of instructions executed by the test case. The insight is that the complexity of a failing execution depends not only on syntactic characteristics of the failure-inducing input, but also on the dynamic behavior induced by the test case.

A major gap in the ZH02 paper is that it does not address the validity problem: does the transformed test case satisfy the constraints imposed on test inputs? For some uses of Delta no validity test is required because the system under test can detect invalid inputs. On the other hand, the validity problem for C programs is extremely difficult to deal with (in theory, it’s undecidable; in practice, no fun at all) and this has been a major stumbling block in our C compiler bug-finding work — but now solved (thank goodness not by us). Sometimes it is desirable to test software with invalid inputs, but for the C compiler work we want to say that all invalid test cases have zero fitness.

Generalizing the Search Framework

The third element of the Delta debugging algorithms from ZH02 that can be usefully generalized is the search algorithm itself. Backtracking can be used to avoid getting stuck too early, as can other techniques such as simulated annealing or genetic algorithms. Basically, test case reduction can be based on any search algorithm, greedy or not.

A Verdict on Delta

Now I’d like to ask a couple of specific questions about the ZH02 paper. First, why weren’t the algorithms presented as “plugins” to the greedy search framework? This would have improved the presentation by making it clear which elements of the algorithms are important vs. boilerplate. Also, it would have made it easier for readers to understand how subsequent improvements to delta should be structured.

Second, given that delta debugging is a fairly straightforward application of greedy search, how significant is its research contribution? The answer that I lean towards is that the main contribution of delta is giving a nice name to an excellent idea that was, in 2002, somewhat obvious.

Since it is dangerous to criticize an idea by saying, in retrospect, that it was obvious, I’ll provide a bit of justification. First, two of the previously published papers cited by Zeller and Hildebrand (references 8 and 9 in the paper) apply ideas that are basically the same as Delta debugging, but without calling it that. Additionally, a paper they missed — Differential Testing for Software, published in 1998 — described a search-based, automated test case reducer. So it’s clear that by 2002 the ideas forming the core of Delta debugging were floating around.

My opinion is that the two Delta algorithms from the ZH02 paper have little enduring value because they simply do not work very well without modification. At least, we couldn’t make them work without significant generalization. The enduring value of the paper is to popularize and assign a name to the idea of using a search algorithm to improve the quality of failure-inducing test cases.

Moving Forward

As the complexity of computer systems continues to increase, the advantages derived from deterministic execution and automated test case reduction will also continue to increase. Delta debugging provides a conceptual framework for doing so. Unfortunately, it seems that few useful papers on reducing the size of failure-inducing programs that have appeared since the original Delta work. A notable exception is the hierarchical delta debugging work I mentioned earlier. Iterative Delta Debugging is interesting but solves a slightly different problem. Deriving Input Syntactic Structure is a technique that can make hierarchical delta easier to implement. If anyone knows of more work along these lines, I’d like to hear about it.


6 responses to “Generalizing and Criticizing Delta Debugging”

  1. I think it would be interesting to see what other static/dynamic analyses tools would improve the step selection. Right off hand, any tool that could detect what code can’t or isn’t run (dead code elimination and code profiling come to mind) would give a good idea where to trim next.

  2. Hi bcs, one of our implementations does exactly this. It’s very effective on randomly generated testcases which contain a lot of dead code.

  3. Fantastic blog post. I’ve long wondered about why there’s not more work on delta debugging. I like your insights that delta debugging is a greedy search algorithm.

    I gotta disagree with your statement that delta debugging was obvious. I think it’s one of those great ideas that are obvious once you see them, but non-obvious before then. One of the great insights was the idea of even trying to solve the problem. Sure, once you formulate the problem and get the idea that it might solvable, it’s not too hard to come up with technical solutions, but that’s not the contribution; a large part of the contribution is coming up with the problem statement in the first place.

    I suspect one of the reasons we don’t have more work on delta debugging is because it’s not easy to figure out how to extend it to get a research direction. All the directions that are useful are likely to be viewed as “just engineering” and “too easy”. And we know program committees prefer papers that pose difficult problems that require deep, clever insights to solve over papers that achieve useful results with “mere engineering”.

    Hierarchical delta debugging is very cool. I’ve had some ideas for extending it. Hierarchical delta debugging requires a perfect parse of the input data, i.e., it requires us to have a complete parser for the input file format. That’s annoying, because often we don’t have an existing parser we can easily reuse; we have to build one, before we can start to use hierarchical delta debugging.

    So here’s an idea. It would be possible to build a version of hierarchical delta debugging that only needs a “heuristic partial parser”. A perfect parser outputs the block structure of the file, as a tree on the input bytes; for a file with n bytes, each node in the tree corresponds to a range of bytes, say bytes [i,j] of the input. A “heuristic partial parser” might relax the requirement for a tree (allowing a DAG of such nodes), and might also relax the requirement for a unique parse. So, imagine we had a heuristic partial parser which outputs a set of ranges of bytes. Each range [i,j] corresponds to a consecutive sequence of bytes from the input file which might be a logical block in the input (no promises). Thus, each range represents a “guess” at what might be a logical block. Because there is no significant penalty to “wrong guesses”, you might be able to build a heuristic partial parser without complete knowledge of the input file format. For instance, for each pair of matching delimiters (e.g., (), {}, /* */, []), emit a range corresponding to each matching pair of delimiters. Emit a range corresponding to each line of the file. For each of a number of plausible terminators (e.g., newline, semicolon, blank line, #, //), emit a range corresponding to every region between two consecutive terminators. And so on.

    Once you’ve got a heuristic partial parser, it should be easy to design an extension of hierarchical delta debugging that works with it. I thought about trying something like this, but I wasn’t convinced it would be publishable. How would I evaluate it? Even if it worked acceptably well, would referees appreciate something that is so heavily heuristic-based? And, it seemed likely to be limited to text-based formats.

    Another research idea I thought about at one point is trying to handle binary file formats as follows. Take any of the published techniques for reverse-engineering the format of an input file, by dynamically observing/instrumenting the program as it reads and operates on that input file. Then, use this deduced file format and feed it into hierarchical delta debugging. The idea is that given a binary that reads a binary file format, we’d automatically be able to use delta debugging on that program, without needing to explicitly build a parser for the input file format. It would be fun to see how well this works, compared to standard delta debugging with a hand-crafted parser for the input file format. However, it didn’t seem likely to be publishable, because it’s just plugging together two previously published idea, so I was never able to convince myself that it is a super-promising direction.

    I’ve also wondered about whether we could use delta debugging to improve security fuzz testing. However the only plausible angle I was able to come up with is to find a minimal change that causes the program to crash (we naturally end up with an input file that doesn’t crash, and a modified variant that does crash; we could minimize the diff, to make it easier to debug the crash). However, this didn’t seem compelling or publishable, either.

    Anyway, those were my attempts to build a research project based upon delta debugging. Not very successful, alas. Oh well.

    While we’re on the subject, I gotta say that I never liked ZH02 very much as an introduction to the ideas. I feel the idea is so simple and beautiful, but the ZH02 paper hides it behind a mass of notation and unnecessary complexity in the algorithm. My favorite exposition of the idea is found in the documentation for the Lithium tool, by Jesse Ruderman, which can be found here: http://www.squarefree.com/lithium/using.html http://www.squarefree.com/lithium/algorithm.html

  4. Hi AnonSecurityGuy, thanks for the detailed comments!

    Did you take a look at the pre-2002 papers I mentioned that did automated test case reduction? It is only in the context of that work that I say the ideas for Delta were just sitting around waiting to be given a name.

    Definitely, most Delta extensions are “just engineering.” We’re going to write up our work on C code reduction (after putting a ton of effort into it!) but even so I anticipate a bit of a struggle getting it past a program committee.

    The “heuristic partial parser” idea is excellent. Did you take a look at the “Deriving Input Syntactic Structure” paper I mentioned? It is very closely related to your ideas. I would expect that research in this direction is definitely publishable. Basically you just need to show that you significantly outperform naive Delta for a variety of input formats, and without adding a lot of engineering effort in creating the reducer.

    I agree about the ZH02 paper’s readability. In the authors’ defense, they probably felt that the extra formality was needed to get the work accepted to a top journal. There are always reviewers waiting to pounce on a paper as too applied, not formal enough, not principled enough, etc.

    Finally I had not seen the Lithium tool, thanks for the link!

  5. Thanks for prodding me to take a look at the pre-2002 papers. I confess I hadn’t read them before. It is interesting to see the shoulders that the Delta Debugging work is standing on.

    I still perceive a contribution in Delta Debugging, though I’m not sure if I can articulate it or persuade anyone else. For one thing, Delta Debugging puts this in a simpler, more generally applicable framework. One difference that feels particularly significant to me is that the Delta Debugging paper has “binary search”-like efficiency, where it starts by removing large chunks and then reduces the chunk sizes bit by bit. The prior works, if I’m reading them properly, seem to be about removing a single unit at a time (a single line, a single function, etc) rather than a large consecutive sequence of units.

    But I realize these kinds of comparisons are probably ultimately pretty subjective. I don’t feel any great confidence in my assessment, and I can certainly understand where you are coming from.

    P.S. No, I hadn’t seen the “Deriving Input Syntactic Structure” paper previously. I’ve put it on my pile to read; it looks fascinating, from a quick skim of the introduction. I’m looking forward to reading it. Thanks for highlighting it! I always appreciate pointers to good reading.

  6. AnonSecurityGuy, it sounds like we agree about this paper — it’s clearly a very influential paper but kind of hard to pick out the actual technical contribution.

    BTW I think I first heard the idea of using binary search to reduce failure-inducing inputs in Bentley’s _Programming Pearls_, from 1986.