Squeezing a CS Research Idea

[Disclaimer: I wrote about this topic once before, but it was buried in a longer piece. Also, it was in one of the first posts on this blog and I don’t think anyone read it. Update: I checked the stats. Nobody read it.]

A hammer I like to use when reviewing papers and PhD proposals is one that (lacking a good name) I call the “squeeze technique”; it applies to research that optimizes something. To squeeze an idea you ask:

  1. How much of the benefit can be attained without the new idea?
  2. If the new idea succeeds wildly, how much benefit can be attained?
  3. How large is the gap between these two?

Though this is simple, people surprisingly often fail to work it out for themselves. Let’s go through an example and then get back to the details.

In 2004 Markus Mock wrote a clever paper that squeezes the idea of programmer-specified aliasing as supported by C99. The “restrict” qualifier added in that version of the language is a promise from the developer to the compiler that the qualified pointer will not be aliased by other pointers, permitting better code generation in some situations.
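For readers who haven’t run into it, here is a small illustration of the qualifier (my own example, not one from Mock’s paper). With “restrict” on both parameters the compiler may assume the store to a[0] cannot modify *b, so *b needs to be loaded only once; without the qualifiers, *b must be reloaded after each store in case a and b alias:

void scale (int *restrict a, const int *restrict b)
{
  a[0] = *b * 2;
  a[1] = *b * 2;
}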

Mock’s idea was to instrument a collection of benchmarks to record actual pointer behavior and then use this information to add the maximum possible number of “restrict” qualifiers. In other words, he used dynamic information to optimistically add a restrict qualifier in every location where this was permissible for the inputs used during benchmarking. He then re-ran the modified benchmarks and found less than 1% average speedup.

This result means that for the given compilers and benchmarks, no conceivable human intervention or static pointer analysis can achieve more than 1% average speedup through the addition of restrict qualifiers. The conclusion was that since “restrict” is dangerous in practice (misplaced qualifiers will cause incorrect code to be generated) programmer-specified aliasing is not generally a good idea.

Squeezing a research idea has three elements. The first — left implicit in the list above — is to select a metric. In general this should be externally meaningful: something that people other than computer scientists would care about, perhaps because it has economic consequences. For example speedup, throughput, MTTF, or energy consumption might make a good metric (when coupled with good workloads, a good test platform, good experiment design, etc.). In contrast, the number of “restrict” qualifiers, the average size of an alias set, and the number of context switches are probably poor top-level metrics because they enjoy (at best) an indirect relationship with something a person on the street might be worried about.

The second important choice is the baseline: the amount of benefit achieved without the current idea. This can be tricky and there are many ways of going wrong (probably baseline selection deserves its own blog post). A pernicious mistake is to choose a straw man: an obviously primitive or defective baseline. For example, maybe I’ve created a highly parallelizable algorithm for coalescing hoovulators. Then, I choose the single-processor version of my algorithm as the baseline and go on to show that I can achieve near-linear scalability up to 512 processors. This is a great result unless I am ignoring Stickleback’s 1983 algorithm that is non-parallelizable but performs the same job 750 times faster than my single-processor version, on the same hardware.

The third important element of the squeeze technique is determining an upper bound on improvement. In many cases — as Mock showed — a little creativity is all that is required. Other examples:

  • Amdahl’s Law bounds parallel speedup (see the formula just after this list)
  • The power consumption of the disk, display, and memory often bounds the increase in mobile device lifetime that can be attained by reducing processor power usage
  • Time spent copying data can bound throughput improvements in networking
  • Time spent traveling at 0.3c can bound latency improvements in networking
  • CPU, memory, or I/O saturation can bound improvements in throughput due to better resource scheduling
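To put a formula behind the first item in the list: if a fraction p of a program’s running time can be parallelized across N processors, Amdahl’s Law says the best possible speedup is

speedup = 1 / ((1 - p) + p/N) <= 1 / (1 - p)

so if, say, 90% of the run time is parallelizable, no number of processors can deliver more than a 10x speedup. This is a textbook bound, not something specific to the examples above, but it is exactly the kind of upper limit the squeeze technique asks for.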

In summary, no matter what we are trying to optimize, there is probably a component that we can improve and a component that we cannot. Understanding the relative magnitudes of these components is a powerful tool for helping to select a research problem to work on. I find myself extremely reluctant to sign off on a document like a PhD proposal that does not display evidence of this sort of thinking.

Iterating Over The Full Range

Every couple of months I need to write code that runs for every value of, for example, a char. This is usually to exhaustively test something, so the code should be fast, which is why I write it in C. At first I always want to write:

for (c = CHAR_MIN; c <= CHAR_MAX; c++) {
  use (c);
}

This is very wrong when c has type “char”: every char value satisfies c <= CHAR_MAX, so the loop condition can never fail and the loop never terminates (the final increment simply wraps c around or produces an implementation-defined value). Declaring c as “int” is one solution, and here’s another:

c = CHAR_MIN;
do {
  use (c);
} while (c++ != CHAR_MAX);

There is no signed overflow because the char is promoted to int before being incremented. However, neither solution can iterate over all values of an int. It would be possible to cast to unsigned before incrementing (ugly) or use a larger type than int (not supported by all compilers, and not fast on some platforms). Thus I usually end up with something like this:

i = INT_MIN;
while (1) {
  use (i);
  if (i == INT_MAX) break;
  i++;
}

But this is super clunky. Anyone have a better idiom for this case?
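For what it’s worth, here is what the “larger type” option mentioned above looks like (just a sketch, assuming the compiler provides a 64-bit “long long”; as noted, this is not the fastest choice on every platform):

long long i;
for (i = INT_MIN; i <= INT_MAX; i++) {
  use ((int)i);
}

Because i is wider than int, it can hold INT_MAX + 1 and the loop condition eventually fails without any overflow.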

Downtown SLC Without a Car

People who visit Salt Lake City for a meeting or something often end up staying downtown, and often don’t bother renting a car. The other day I was mailing a friend a list of things to do in this situation. Here it is, cleaned up a bit:

  • Sam Weller’s Bookstore is on Main St between 200 S and 300 S; it’s a great store (be sure to check out all three levels) and has a good coffee shop with free wireless
  • Ken Sanders bookstore at 268 South on 200 East is another of my favorites
  • The taco stand on State St just north of 800 S (in the Sears parking lot) is awesome; $4 gets you four carne asada tacos plus a coke; catty-corner from here is the Epic Brewery which makes some good products (not a brewpub — just fridges full of bottles)
  • City Creek Canyon is a very short walk from downtown; the paved road is a good place to walk; the total length of the canyon is 14.5 miles and there are plenty of side trails, so there’s a lot to do here; the upper part of this canyon is seldom visited and quite wild; even-numbered days are most pleasant for walking the road since bikes are not allowed (directions)
  • The main Salt Lake library is a neat building on 400 S between 200 E and 300 E; if you take the elevators to the top floor, the big sloping ramp that descends back to ground level has nice views of the city and mountains
  • Salt Lake Roasting Company on 400 S between 300 E and 400 E (just east of the library) has great coffee and pastries
  • The light rail system provides easy access to some other parts of the city
  • Lots of restaurants can be found along 300 South between Main St and 300 W
  • Some good bars are scattered throughout; the best beer selection is found at Bayou and Beerhive (but the food isn’t that great at either place)

This list is heavily biased towards what I would do given a free day in SLC. Hopefully it’s useful.

Size Matters

Not long ago I served on an NSF panel: a collection of researchers who look at some grant proposals and provide recommendations to NSF program officers who then make decisions about which, if any, of the proposals to fund. After the panel finished, the program manager asked us for feedback about various issues including the question: How large should NSF awards be? To put the question in context, in the programs that fund my kind of research, NSF offers “small” awards of up to $500,000 over up to three years, “medium” awards of up to $1.2M over up to four years, and “large” awards of up to $3M over up to five years. Few large awards are given and the bar for getting these is very high. Most grant proposals in a given size category ask for close to the maximum amount of money. For a given amount of grant money, should NSF (and other agencies) give more small awards or fewer large ones? I’ll present various arguments and give my opinion at the end.

  • Most grant money in computer science in the US pays PhD students around $20k per year to be “research assistants” working not only towards their degrees, but also towards larger research goals (however, in terms of grant money, the cost can be upwards of $50k per year due to university overhead and other factors). A problem with short-duration grants is that they cannot support PhD students for their entire lifetime in the system, requiring multiple grants to be pieced together.
  • Small grants cause significant administrative overhead at the NSF (running panels, negotiating budgets, etc.). Large grants reduce NSF overheads but push this burden out to the universities, which now have to deal with more complicated accounting and reporting.
  • In Europe a lot of the money has moved towards very large grants. From what I hear, this necessitates a lot of jockeying for position, forming coalitions with the correct geographical distribution, etc. Nobody from Europe who I’ve talked to loves this or thinks it is very effective.
  • It can be difficult for small institutions to compete for large grants.
  • Small grants (unless accompanied by higher hit rates) force researchers to spend more time writing proposals, reducing the amount of real work they get done.
  • Large grants favor the status quo: they most often go to groups who have previously received large grants. Often, these groups belong to research centers and institutes that have evolved very sophisticated machinery for acquiring large grants and could not survive without them.
  • Some work is almost impossible to do using small grants. For example, expensive equipment or large software implementation efforts may be required to make a project succeed.
  • Large grants may engender collaboration, which can be beneficial. On the other hand, collaboration has overhead. Furthermore, some fraction of large grants support what I call “false collaboration” which happens when researchers get together and acquire a large grant, but then the individuals just go off and work on whatever they want, separately.

Clearly some sort of balance is needed. My guess is that most funding, maybe 80%, should go towards a large number of small awards, with the rest supporting carefully-chosen larger efforts.

Externally Relevant Open Problems in Computer Science

Most academic fields have some externally relevant problems: problems whose solutions are interesting or useful to people who are totally ignorant of, and uninterested in, the field itself. For example, even if I don’t want to know anything about virology, I would still find a cure for the common cold to be an excellent thing. Even if I had failed calculus, I would still find it interesting if physicists invented a room-temperature superconductor or figured out why the universe exists.

This piece is about computer science’s externally relevant open problems. I have several reasons for exploring this topic. First, it’s part of my ongoing effort to figure out which problems matter most, so I can work on them. Second, I review a lot of proposals and papers and often have a strong gut reaction that a particular piece of work is either useful or useless. An idea I wanted to explore is that a piece of research is useless precisely when it has no transitive bearing on any externally relevant open problem.

A piece of work has “transitive bearing” on a problem if it may plausibly make the problem easier to solve. Thus, an esoteric bit of theoretical physics may be totally irrelevant or it may be the key to room temperature superconductivity. Of course, nobody thinks they’re doing irrelevant work. Nevertheless, it’s hard to escape the conclusion that a lot of CS (and other) research is in fact irrelevant. My third motivation for writing this piece is that I think everyone should do this sort of analysis on their own, rather than just believing the accepted wisdom about which research topics are worth working on. The analysis is important because the accepted wisdom is — at best — a lagging indicator.

Below is my list. I’ll be interested to see what people think I’ve missed (but note that I’m deliberately leaving out problems like P vs. NP which I don’t think are externally relevant).

Giving More People Meaningful Control Over Computation

There are plenty of people — scientists, analysts, managers, etc. — who are not programmers but who would benefit from gaining better control over computers. I’m using the phrase “meaningful control over computation” as a bit of a hedge because it’s clear that most of these people don’t have 2-5 years to spare in which to become competent programmers. The goal is to give people the power that they need to solve their problems without miring them in the Turing tarpit. A good example is the class of spreadsheet programming languages which expose a useful kind of computation without most of the problems traditionally associated with programming. Overall, this problem is maybe 1/3 programming language design, 1/3 user interface design, and 1/3 domain specific.

Trustworthy Automation

Can we create computer systems that do what we want them to do? This problem encompasses both security and reliability. It’s a good problem to work on because solutions not only have short-term economic benefit but in the long run directly support getting humans off the rock, which as I’ve said before is something we need to be working very hard on.

The whole problem of specifying what we want systems to do is enormously hard, and in fact we generally have no precise ideas about this. Even highly mathematical objects like compilers are most commonly specified in human language, if at all. Moreover, the programming languages that we use to specify computations contain elements such as floating point computations, fixed-width integers, and excessive use of side effects — all of which seem designed to impede reasoning.

Intelligent Systems

Can computers be created that interface gracefully with humans? How can augmented cognition be used to sidestep limitations of humans and computers? Which sub-problems of AI are achievable and useful? Here I mean “intelligent” literally, not in the sense it is usually used, where it means “not as obviously bad as the previous thing.” Of course, the goal of AI may turn out to conflict with “trustworthy automation” but we’ll have to cross that bridge when we come to it.

Observing, Simulating, Modeling, and Predicting Everything

The universe produces a gigantic amount of data at scales from atoms to galaxies. Luckily, the theoretical bounds on the amount of computation that can be done using the available energy are very generous. The ugly little space heaters with which we currently compute have not even started to scratch the surface of what’s possible.

This one is pretty vague, but I’m not sure right now how to improve it.

Further Reading

Vernor Vinge and Hans Moravec have created some of the best unconstrained thinking about computers. Asimov was pretty good in his day, but Sladek nailed it. The transhumanist material on the web seems pretty bad, as is Kurzweil’s stuff. Dertouzos’s Unfinished Revolution was not inspiring. There must be more (good and bad) writing on this subject, please make suggestions.

Update from evening of 5/10: This list of external problems is of course subjective. I’d be very interested to see other people’s blog posts describing what they think CS has to offer the non-CS world.

What I Want From a Bibliography System

As a professor I spend a fair amount of time wrangling with references. Because it’s free and reasonably simple, I use BibTeX: an add-on tool for LaTeX that automates the construction of a bibliography by pulling references out of a separate text file, assigning them numbers (or other identifiers), and formatting the entries appropriately.

BibTeX entries are widely available on the web. In principle this is great, but in practice the quality of downloadable BibTeX entries is poor: they contain so many typos, omissions, misclassifications, and other problems that I no longer download them, but rather create a new entry from scratch, perhaps using the existing entry as a rough specification. Even BibTeX entries from the ACM’s Digital Library — a for-pay service provided by the premier professional society for computer science — are not directly usable; basically every one of them requires editing. Over time I’ve come more and more to blame BibTeX for these problems, rather than blaming the people creating the entries. Here are the design points I’d like to see in a bibliography system.

Style neutrality: A bibliography entry should be a collection of factual information about a document. The “bibliography style” — a collection of formatting guidelines specific to a particular journal or similar — should be a separate concern. BibTeX does support bibliography styles but the implementation is incomplete, forcing entries to be tweaked when the style is changed.

Minimal redundancy: Any large BibTeX file contains a substantial amount of redundancy, because BibTeX poorly supports the kind of abstraction that would permit, for example, the common parts of two papers appearing at the same conference to be factored out. Duplication is bad because it consumes time and also makes it basically impossible to avoid inconsistencies.

Standalone entries: I should be able to download a bibliography entry from the net and use it right away. This goal is met by BibTeX but only at the expense of massive redundancy. Meeting both goals at the same time may be tricky. One solution would be to follow an entry’s dependency chain when exporting it, in order to rip a minimal, self-sufficient collection of information out of the database. This seems awkward. A better answer is probably to rely on conventions: if there is a globally recognized name for each conference and journal, a self-sufficient entry can simply refer to it.

Plain-text entries: Putting the bibliography in XML or some random binary format isn’t acceptable; the text-only database format is one of the good things about BibTeX.

Network friendliness: It should be easy to grab bibliography entries from well-known sources on the network, such as arXiv. Assuming that the problem of these entries sucking can be solved, I should not even need a local bibliography file at all.

User friendliness: BibTeX suffers from weak error checking (solved somewhat by add-on tools like bibclean) and formatting difficulties. For example, BibTeX’s propensity for taking capitalized letters from the bib entry and making them lowercase in the output causes numerous bibliographies I read to contain proper nouns that are not capitalized. Messing with BibTeX styles is quite painful.

BibTeX is a decent tool; it just hasn’t aged or scaled well. CrossTeX and BibLaTeX sound nice and solve some of the problems I’ve mentioned here. However, since I have about 37,000 lines of BibTeX sitting around, I’m not interested in migrating to a new system until a clear winner emerges.

Publishing for Impact

Choices made in the process of doing research can lead to a 5-10x difference in the number of publications for exactly the same underlying work. For example, do I publish my idea in a workshop, then a conference, and then a journal? Or do I just do the conference version? Do I take the trouble to write up a paper about an MS thesis where the student has somehow managed to escape without giving me a paper? (The common tactic for accomplishing this is to get a great job offer with a short fuse, and then utterly disappear.) Do I incrementalize my work into a collection of neat little papers, each based on a single idea, or do I glom it all together into monster papers that contain a half-dozen or more research contributions?

Today’s question is:

If my goal is simply to maximize my own research impact over my career, should I aim for the low end (1-3 papers per year) or the high end (ten times more) of this spectrum?

I’ve puzzled through this and have concluded that as long as we keep the goal simple — maximize impact — there are few arguments for publishing more.

The first problem with publishing more is that writing a paper requires significant effort over and above that of doing the research in the first place. I usually estimate the cost of writing a paper at one month, but it could be considerably more or less depending on a number of factors. Since teaching uses up about three months of my effort each year, grant proposals cost 1-3 months, and service and administration another three months, the loss of a month of research time is a serious matter. In a nutshell: since my time is finite, writing more papers means I get less research done. Over a 40-year career, the loss of productivity due to writing 5-10 extra papers per year would be huge.

The other big problem with publishing more is that my target audience consists of extremely busy people. Having impact means that people who matter need to read and understand the material that I produce. If I produce 20 papers per year, how many will be read carefully? Probably not that many. My own reading time is highly constrained, and therefore I have become strongly prejudiced against reading papers by authors who seem like paper machines, because all too often any individual paper is weak and/or it contains a lot of similar thinking to something I’ve already read by that author. On the other hand, there are other researchers who produce less material where I’ll drop everything to read their latest piece of work because it reliably teaches me something new. In a nutshell: producing watered-down content creates a disincentive for people to read my work.

The thing that surprised me, when I sat down and thought through these issues, was how one-sided the arguments against prolific publication seem to be. Of course I’m not unaware of the arguments in the other direction. First, a larger number of papers impresses people who evaluate researchers primarily by counting papers. Second, publishing more benefits students by giving them practice writing and by building up their vitae. Third, prolific publishing in a wide variety of venues probably exposes a larger number of people to one’s work — though I’m not sure this factor is very important given the social, searchable web as it exists today.

A Day in the West Desert

Southern Utah is famous for its canyons and arches. Utah’s West Desert, on the other hand, is not as well-known and is perhaps not as easy to appreciate. It encompasses a large area, containing entire mountain ranges that are virtually unknown (ever heard of the Confusion Range or the Wah Wahs?). It is also very remote: services are hard to come by and in many areas you could go weeks or months without seeing another person.

Yesterday Matthew Flatt and I drove to the House Range with our combined children. The main goal was to look for trilobite fossils: Isaac, my four-year-old, has become obsessed with them. Being completely clueless about fossil hunting we decided to make things easy by visiting U-Dig Fossils, a privately owned quarry where they use machines to remove the overburden and break up the rocks somewhat. This turned out to be a great idea. The fossil-finding process is to select a thick piece of shale and repeatedly split it using a rock hammer. This is not hard since the shale has a distinct bedding plane — the kids could do it too, by selecting smaller rocks that had already weathered a bit. Nineteen times out of 20, the split rock reveals nothing, but every now and then a happy little trilobite is hiding inside — having waited in the dark for half a billion years.

After a couple hours’ fossil hunting and a picnic lunch, we drove through Marjum Canyon on the original unpaved road connecting Delta, Utah, with Nevada, superseded by Highway 6 in the 1950s. We hiked a short distance up a side canyon to see the cave where a hermit lived from the 1920s to the 1940s. It was a lot more comfortable-looking than we’d expected, with a concrete floor, stove, shelves, and two windows.

After exiting Marjum Canyon we entered Tule Valley, a bleak basin between the House Range and the Confusion Range. The main thing I wanted to do was to see Notch Peak’s west face, which at 4450 vertical feet is one of the highest cliffs in North America. I had climbed this mountain (from the other side, obviously) in 2006 but had never seen it from the west. I’m not especially afraid of heights, but I had found myself totally incapable of looking directly over the edge of those cliffs.

We got back to Delta around dinner time, had pizza, and got home around dusk. A fun trip.

Generalizing and Criticizing Delta Debugging

Delta debugging is a search-based technique for taking an input to a program that triggers a bug and making that input smaller. For example, you might have a sequence of GUI operations that causes Thunderbird to crash. Assuming the crash is deterministic and the input can be replayed automatically, you can iteratively remove UI actions until you end up at a local minimum: a collection of events where removing even one of them makes the crash go away. If Delta is implemented well and properly takes advantage of the structure of the input, the resulting failure-inducing input is not only small, but also most of it is relevant to the failure. Debugging a failure triggered by a small, highly-relevant input is generally pretty easy. The definitive Delta debugging reference is Simplifying and Isolating Failure-Inducing Input by Andreas Zeller and Ralf Hildebrandt (ZH02 from now on).

Delta is fantastically useful, particularly in the context of random testing. As part of my group’s compiler bug-finding project we implemented three new test case minimizers using Delta-like ideas, and we also use an existing line-based Delta implementation. Each of the four reducers has different strengths and weaknesses, and in fact local minima can often be escaped by running the implementations one after the other.

Significantly, none of the four test case reducers is a straightforward implementation of an algorithm from the original Delta paper — each of them generalizes it in one or more ways. After spending a lot of time working on test case reduction and thinking about this, I got the idea of writing a paper perhaps called “Generalized Delta Debugging” which weakens many of the assumptions found in the original work. The problem was that the more parts of Delta debugging I generalized, the more the result looked just like a generic greedy search. Thus, it started to look extremely doubtful whether there was any research component to generalizing Delta debugging. This piece explores the consequences of that observation.

Delta == Greedy Search

Just to be clear, by “greedy search” I mean the class of optimization algorithms that are based on a transformation operator and a fitness function. They work by repeatedly transforming the current solution to arrive at a new solution, and replacing the current with the new if the fitness level has increased. No doubt I’m butchering the accepted terminology, but the ideas here are really simple.

The “minimizing delta debugging algorithm” from ZH02 is an instance of greedy search where the fitness of a failure-inducing input is the inverse of its size and the fitness of all non-failure-inducing inputs is zero. The transformation operator removes a contiguous chunk of the input. When the transformation gets stuck — by running out of choices of chunks to remove — it reduces the chunk size. When it gets stuck at the minimum chunk size (a single line, character, or other atom of input) the search is finished.
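To make this concrete, here is a minimal sketch of that loop for a line-based test case — a simplification, not the exact ddmin algorithm from ZH02 (in particular, it does not test complements the way ddmin does). The function still_fails() is a hypothetical stand-in for re-running the system under test on a candidate input:

#include <stdlib.h>
#include <string.h>

/* hypothetical oracle: returns nonzero if the candidate input still
   triggers the original failure */
extern int still_fails (char **lines, size_t n);

/* greedy reducer: delete contiguous chunks of lines, keeping a deletion
   whenever the failure persists (fitness 1/size increases); when stuck,
   halve the chunk size; stop when stuck at single-line granularity */
size_t reduce (char **lines, size_t n)
{
  size_t chunk = n > 1 ? n / 2 : 1;
  for (;;) {
    int progress = 0;
    size_t start = 0;
    while (start + chunk <= n) {
      char **candidate = malloc ((n - chunk + 1) * sizeof *candidate);
      memcpy (candidate, lines, start * sizeof *candidate);
      memcpy (candidate + start, lines + start + chunk,
              (n - start - chunk) * sizeof *candidate);
      if (still_fails (candidate, n - chunk)) {
        /* keep the smaller failure-inducing input */
        memcpy (lines, candidate, (n - chunk) * sizeof *lines);
        n -= chunk;
        progress = 1;
      } else {
        /* this deletion lost the failure; move on to the next chunk */
        start += chunk;
      }
      free (candidate);
    }
    if (!progress) {
      if (chunk == 1)
        break;      /* stuck at the finest granularity: done */
      chunk /= 2;   /* stuck: refine the granularity and try again */
    }
  }
  return n;         /* number of lines remaining */
}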

The “general delta debugging algorithm” from ZH02 is very similar but its goal is to minimize the difference between the current solution and a given input, instead of simply minimizing size. Since I haven’t found many uses for this algorithm in practice, and since it’s not all that different from the minimizing Delta, I won’t discuss it further. Whenever I mention the “Delta algorithm” or similar, it is the minimizing Delta to which I refer.

Which parts of the Delta algorithms from ZH02 can be usefully generalized? As it turns out, pretty much all of them. Let’s look at the different elements in turn.

Generalizing the Transformation Operator

The Delta transformer that deletes contiguous chunks of input at ever-finer levels of granularity is reasonably generic and efficient. However, when attacking highly-structured test cases it often gets stuck at a local maximum long before the test case is fully reduced. (Sorry if I keep switching between minimum and maximum. When discussing size the goal is minimization, when discussing fitness in a generic context, I’ll stick to the convention that the goal is maximization.) Hierarchical delta debugging is a variant that improves performance by operating on sub-trees of tree-structured inputs.

Another generalization is to use a transformer that replaces a chunk of input with something else, instead of simply deleting it. For example, one of our new reducers for C code tries to replace uses of variables with constants. Another replaces function calls with their results, including side effects. These are very effective in practice.
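As a tiny, made-up illustration of the variable-to-constant transformation (the names compute_limit, cfg, and buf are hypothetical), a reducer might attempt the rewrite below and keep it only if the failure still triggers; the declaration of n and the call to compute_limit() then become dead code that a later deletion pass can remove:

/* before: */
int n = compute_limit (cfg);
for (int i = 0; i < n; i++)
  buf[i] = 0;

/* after replacing the use of n with a small constant: */
for (int i = 0; i < 2; i++)
  buf[i] = 0;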

It is also useful to delete parts of the input in a non-local way. For example, to remove an argument to a function in a C program, we must delete it from the definition, declaration, and all uses. Making this transformation work requires a painful degree of friendliness with the C code, but again it’s very useful in practice.

Finally, we sometimes use transformations that don’t even make the test case smaller. For example it may be desirable to replace a small, complex construct (like a call to a trig function in a C program) with a larger but simpler construct (a math expression approximating the trig function’s behavior, perhaps). Similarly, it may be desirable to replace an array with a collection of scalars or a struct assignment with a collection of assignments to members. The scalars or the assignments are then vulnerable to subsequent reduction.

All of these examples point towards a more general idea which is that there is a strong synergy between test case reduction and compiler optimization (which I wrote about earlier).

Generalizing the Fitness Function

ZH02’s minimizing Delta uses 1/size as its fitness function and its general Delta uses the inverse of the string distance between current solution and goal. There are plenty of other useful fitness functions. As I mentioned in the previous paragraph, considering the complexity of different program constructs is useful. We’ve also experimented with using Delta techniques to minimize the number of instructions executed by the test case. The insight is that the complexity of a failing execution depends not only on syntactic characteristics of the failure-inducing input, but also on the dynamic behavior induced by the test case.

A major gap in the ZH02 paper is that it does not address the validity problem: does the transformed test case satisfy the constraints imposed on test inputs? For some uses of Delta no validity test is required because the system under test can detect invalid inputs. On the other hand, the validity problem for C programs is extremely difficult to deal with (in theory, it’s undecidable; in practice, no fun at all) and this has been a major stumbling block in our C compiler bug-finding work — but now solved (thank goodness not by us). Sometimes it is desirable to test software with invalid inputs, but for the C compiler work we want to say that all invalid test cases have zero fitness.

Generalizing the Search Framework

The third element of the Delta debugging algorithms from ZH02 that can be usefully generalized is the search algorithm itself. Backtracking can be used to avoid getting stuck too early, as can other techniques such as simulated annealing or genetic algorithms. Basically, test case reduction can be based on any search algorithm, greedy or not.

A Verdict on Delta

Now I’d like to ask a couple of specific questions about the ZH02 paper. First, why weren’t the algorithms presented as “plugins” to the greedy search framework? This would have improved the presentation by making it clear which elements of the algorithms are important vs. boilerplate. Also, it would have made it easier for readers to understand how subsequent improvements to delta should be structured.

Second, given that delta debugging is a fairly straightforward application of greedy search, how significant is its research contribution? The answer that I lean towards is that the main contribution of delta is giving a nice name to an excellent idea that was, in 2002, somewhat obvious.

Since it is dangerous to criticize an idea by saying, in retrospect, that it was obvious, I’ll provide a bit of justification. First, two of the previously published papers cited by Zeller and Hildebrandt (references 8 and 9 in the paper) apply ideas that are basically the same as Delta debugging, but without calling it that. Additionally, a paper they missed — Differential Testing for Software, published in 1998 — described a search-based, automated test case reducer. So it’s clear that by 2002 the ideas forming the core of Delta debugging were floating around.

My opinion is that the two Delta algorithms from the ZH02 paper have little enduring value because they simply do not work very well without modification. At least, we couldn’t make them work without significant generalization. The enduring value of the paper is to popularize and assign a name to the idea of using a search algorithm to improve the quality of failure-inducing test cases.

Moving Forward

As the complexity of computer systems continues to increase, the advantages derived from deterministic execution and automated test case reduction will also continue to increase. Delta debugging provides a conceptual framework for doing so. Unfortunately, it seems that few useful papers on reducing the size of failure-inducing programs have appeared since the original Delta work. A notable exception is the hierarchical delta debugging work I mentioned earlier. Iterative Delta Debugging is interesting but solves a slightly different problem. Deriving Input Syntactic Structure is a technique that can make hierarchical delta easier to implement. If anyone knows of more work along these lines, I’d like to hear about it.