Should Software Evolve?


Weimer, Nguyen, Le Goues, and Forrest wrote a paper, “Automatically Finding Patches Using Genetic Programming,” that proposes and evaluates this approach to fixing a buggy software system (a code sketch of the loop follows the list):

  1. A failure-inducing test case is found
  2. The system code is mutated randomly, but with smart heuristics that make it more likely that the broken part of the program is the part that gets modified
  3. The new version of the system is tested against the new and old test cases, and is accepted if it passes all of them
  4. The fix is minimized using delta debugging
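
Here is a minimal sketch of that loop in Python. The mutation operator, the fault-localization weights, and the shape of the test harness are stand-ins of my own, not the machinery from the paper:

    import random

    def repair(program, tests, weights, max_tries=10000):
        """Search for a variant of `program` (a list of statements) that
        passes every test in `tests` (callables: variant -> bool)."""
        for _ in range(max_tries):
            variant = mutate(program, weights)    # step 2: biased random edit
            if all(t(variant) for t in tests):    # step 3: new and old tests
                return minimize(variant, tests)   # step 4: shrink the fix
        return None

    def mutate(program, weights):
        """Apply one random edit (delete, swap, or duplicate a statement),
        favoring statements with high weights, i.e. those executed by the
        failing test."""
        variant = list(program)
        i = random.choices(range(len(variant)), weights=weights)[0]
        op = random.choice(["delete", "swap", "insert"])
        if op == "delete":
            del variant[i]
        elif op == "swap":
            j = random.randrange(len(variant))
            variant[i], variant[j] = variant[j], variant[i]
        else:
            variant.insert(i, random.choice(program))  # reuse existing code only
        return variant

    def minimize(variant, tests):
        """Crude stand-in for delta debugging: greedily drop any statement
        whose removal still leaves all tests passing."""
        i = 0
        while i < len(variant):
            shorter = variant[:i] + variant[i + 1:]
            if all(t(shorter) for t in tests):
                variant = shorter
            else:
                i += 1
        return variant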

The paper was published in 2009 at ICSE — a highly selective software engineering conference — and it was one of the award papers the year it appeared. In other words, the software engineering community considered it to be one of the best papers of 2009. There is no doubt the approach is interesting and that it should be tried. On the other hand, it seems pretty clear that fixing bugs by genetic programming is sometimes inappropriate. Since the authors of the paper didn’t speculate about when their technique should and shouldn’t be used, I’ll do it.

First off, the method from the Weimer paper is very similar to something I see when teaching CS courses. Some fraction of the students approaches programming problems like this:

  1. Write code purporting to solve the assigned problem
  2. Run test cases
  3. Tweak the logic more or less randomly
  4. Go back to 2 until time runs out or all test cases succeed

This is not a good approach. When it works, it permits students to believe they do not need to practice incremental development, and that it’s OK to produce code that would be impossible to maintain. It also defeats the purpose of the assignment — learning — by reducing program construction to a series of mindless tweaks. It works poorly in practice, even for fairly simple assignments: the best outcome is an over-constrained and fragile program that passes some or all test cases, but that contains convoluted logic and needless special cases that render it incorrect for inputs not in the test set. The other outcome (and this happens pretty often) is that the student’s code steadily evolves to be more and more wrong until, in the end, it must be thrown away and the student starts over.

Of course, just because it’s a disaster when students use evolutionary programming doesn’t mean that computers shouldn’t do it. My first attempt at a rule for when it’s OK to evolve software was this:

Software for non-critical applications may be evolved, whereas software for safety-, security-, or mission-critical systems should not be.

But this isn’t a good rule. It’s perfectly fine for certain parts of the software controlling my car’s engine to evolve (I’ll say why below), and some kinds of totally non-critical software should never evolve. Here’s a rule that I think is better:

Black box software whose correctness can be externally verified can and perhaps should evolve. Program logic that must be understood, modified, analyzed, and (especially) maintained by human developers should not evolve.

For example, if a genetic algorithm is used to evolve the most efficient valve timings for my car engine or the most effective buffer size for my video player, then great — these can be independently verified and the fact that they evolved is irrelevant. On the other hand, if my compiler crashes or emits the wrong code, can it be repaired by an evolutionary technique? No. The logic required to fix any non-trivial compiler bug is intimately entwined with the compiler’s existing logic. Evolving it would be perhaps the dumbest idea on record. My guess is that the core logic of many important software systems (operating systems, virtual machine managers, database engines, etc.) is similarly not amenable to evolutionary bug fixing.
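
To make the black-box cases concrete: here, roughly, is what evolving a single tunable parameter might look like. The measure_throughput benchmark is a hypothetical stand-in, and the search is plain hill climbing rather than a full genetic algorithm; the point is that verification is nothing more than an external measurement:

    import random

    def evolve_buffer_size(measure_throughput, lo=1024, hi=1 << 24, steps=200):
        """Hill-climb a buffer size; `measure_throughput(size)` is assumed
        to run the video player on a workload and report its throughput."""
        best = random.randrange(lo, hi)
        best_score = measure_throughput(best)
        for _ in range(steps):
            # Mutate: perturb the current best multiplicatively.
            cand = min(hi, max(lo, int(best * random.uniform(0.5, 2.0))))
            score = measure_throughput(cand)
            if score > best_score:              # "verification" is just asking
                best, best_score = cand, score  # whether the measurement improved
        return best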

Are evolutionary techniques somehow incompatible with operating systems and compilers? Absolutely not. Parameters to the OS paging or prefetching policy, or the ordering of phases in a compiler, can and should be evolved to be better. But again, this will work because nobody needs to understand the prefetch policy as long as it works. A bad prefetch policy or phase ordering will hurt performance, but won’t break the system.
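
The phase-ordering case looks much the same, except that the search is over permutations. Again, run_benchmark is a hypothetical harness, and the compiler’s own test suite still guards correctness, so the worst a bad ordering can do is produce slow code:

    import random

    def evolve_phase_order(passes, run_benchmark, steps=500):
        """Search orderings of compiler passes; `run_benchmark(order)` is
        assumed to build a benchmark with that pass order and return its
        speed (higher is better)."""
        order = list(passes)
        best_speed = run_benchmark(order)
        for _ in range(steps):
            cand = list(order)
            i, j = random.sample(range(len(cand)), 2)  # swap two phases
            cand[i], cand[j] = cand[j], cand[i]
            speed = run_benchmark(cand)
            if speed > best_speed:
                order, best_speed = cand, speed
        return order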

Of course I’d be happy to be proved wrong, and here’s how the software evolution people can do it: as soon as they ask me to, I’ll start handing them compiler bug reports at the same time I hand these to the LLVM and GCC maintainers. Then, they can try to evolve a fix before the compiler maintainers get around to looking at the problem. This should be easy because evolutionary bug fixing is totally automated whereas manual bug fixing requires stealing time from well-paid and extremely busy developers. Of course the evolved compiler patches should be submitted to the compiler maintainers. If they can get just five evolved patches accepted to the GCC or LLVM trees, I will publicly apologize for being so very, very wrong about their work, and will also buy multiple beers for whatever subset of them I next run into at a conference.

Update: I should add that I would very much like evolutionary fixing of compiler bugs to work out, because this method could be combined with my group’s automated detection of compiler bugs. The result would be a compiler that spontaneously becomes more correct.


14 responses to “Should Software Evolve?”

  1. John,

    I’ll be happy to sit back and see who “wins”. The quotation marks are there because, in the end, everybody wins: better compilers, better techniques, less-stressed people.

    There is one point, however, that you did not mention: the goodness of the ‘fitness function’. In this case it’s the test cases.

    If we limit ourselves to evolving software to pass test cases, we’ll get just that: software that passes those test cases. And assuming the test cases account for every possible condition is a big assumption, one that shifts responsibility from the coder to the person writing the tests. It’s something like teaching kids to pass tests rather than to understand the subject.
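
    A toy illustration of what I mean (the function and the tests are made up):

        # "Evolved" against three test cases; passes all of them, still wrong.
        def abs_evolved(x):
            if x == -3:      # special case grown to satisfy test 1
                return 3
            if x == 0:       # special case grown to satisfy test 2
                return 0
            return x         # happens to satisfy test 3 (x == 5)

        assert abs_evolved(-3) == 3 and abs_evolved(0) == 0 and abs_evolved(5) == 5
        assert abs_evolved(-7) != 7   # wrong on anything outside the test set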

    Don’t get me wrong: genetic programming is a great tool for searching a vast space, but in the end it’s just a filtering function. Its weaknesses lie elsewhere. It’s like trying to write a computer virus by mutating bits of random code: you might end up with a great piece of code that doesn’t do what you intended it to do, and you throw it away.

    I think that, in the end, genetic programming might be one of the tools on the (black!) belt of a competent software tester, especially for software that isn’t vital.

    That said, interesting point, and keep us posted 🙂

  2. Your rule is difficult to apply as is because many terms it uses are vague. At this moment, you have nothing to publicly apologize for because you are not even wrong yet. Let me give some examples of this vagueness, by way of asking you to sharpen (and perhaps simplify) your statement of (and perhaps argument for) the rule.

    What counts as “evolving” programs, besides genetic programming? Many SAT solvers perform mindless random search; does that put Bodík’s sketching work into your “should not” category? Most large programs today, including gcc and llvm, can be said to have “evolved”: their developers, being the non-deterministic beings that humans are, literally “tweak the logic more or less randomly”. If that is not the sense of “evolve” you mean that is “the dumbest idea on record”, what is?

    Whether a piece of software is “black box” and whether it can be “externally verified” depend on what is sealed in the box and what mechanisms of external verification are allowed. What if the software’s job is to print out a patch which, when presented to the compiler maintainers, would be accepted into the trees? This specification is externally verifiable: just run the software and email the result. What if the software’s job is to print out a program along with a succinct proof that it is correct? Succinctness is well approximated by shortness in a language with good abstraction facilities.

    Maybe, as the race you propose suggests, the problem with evolution is just that it takes too much time. This problem is especially severe when external verification of each candidate solution takes a long time (such as when humans are required to check that a paging policy does not risk resource starvation) or when the search space is large and irregular (so good heuristics for combining and mutating candidates are required). But then the rule might just be: when making or changing programs by whatever means, be sure they satisfy your needs, and don’t take too long because that would take too long.

  3. Hi Chung-chieh,

    I’m glad you brought up SAT — an earlier version of this post used it as an example. A solution to SAT (or any other NP-complete problem) is the perfect thing to evolve because the solution can be efficiently verified. The asymmetry between creating and verifying a solution seems to be a key part of the puzzle here.
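
    Here is the asymmetry in miniature: checking a candidate assignment is linear in the size of the formula, no matter how the candidate was produced (evolved, guessed, or solved):

        def satisfies(clauses, assignment):
            """Clauses are lists of nonzero ints in the DIMACS convention
            (3 means x3, -3 means not-x3); `assignment` maps variable -> bool.
            True iff every clause has at least one satisfied literal."""
            return all(
                any(assignment[abs(lit)] == (lit > 0) for lit in clause)
                for clause in clauses
            )

        # (x1 or not x2) and (x2 or x3)
        assert satisfies([[1, -2], [2, 3]], {1: True, 2: False, 3: True})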

    I don’t agree that there is a lot of random changing going on in the core logic of GCC and LLVM. Could you perhaps provide some evidence? There is, I’m sure, plenty of trial and error at the design level and at the level of tuning parameters to optimizations.

    You’re right that I’ve used pretty vague terms, but it doesn’t bother me very much. Software maintenance is difficult and requires good taste — why shouldn’t the decision about whether to use randomized methods in software maintenance also require good taste?

  4. The comment in your update makes me think there should be a proof that it won’t work based on it breaking the second law of thermodynamics.

  5. BCS, I like the second law analogy, but in this case there’s a different wrinkle: a second compiler would probably serve as an oracle that determines the correct answer for each test case, so it’s not as if the compiler fixes itself magically. It would have an example to follow.
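
    Roughly, I would imagine the harness looking something like this (the compiler paths and file names are placeholders):

        import subprocess

        def oracle_agrees(test_c, evolved_cc, reference_cc="gcc"):
            """Compile the same test program with a trusted reference
            compiler and with the compiler being evolved, run both
            binaries, and treat the reference output as the answer."""
            outputs = []
            for cc, exe in [(reference_cc, "./ref"), (evolved_cc, "./cand")]:
                subprocess.run([cc, "-O2", test_c, "-o", exe], check=True)
                result = subprocess.run([exe], capture_output=True, timeout=10)
                outputs.append(result.stdout)
            return outputs[0] == outputs[1]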

  6. Good points, Lorenzo.

    Regarding the fitness function, something I meant to bring up in the article, but left out of the final version, was that if we can exhaustively test the software system, then it becomes perfectly OK to evolve patches. I’m not sure this gets us anywhere in practice, though.

  7. Using the 2nd (and 3rd and 4th…) compilers as an (imperfect) oracle is an interesting way to look at it. At a guess, I’d expect it to put an upper bound on the correctness. The Byzantine Generals’ Problem and forward error correction limits come to mind.

  8. When I read the original paper, I thought to myself, ‘this sounds like a good way to get insight into the bug – see what genetic patch can fix it, and as part of the usual debugging process, one reverse-engineers the patch into something better’.

  9. Hello again,

    I’m quite happy to require, as you suggested, that whoever makes “the decision about whether to use randomized methods in software maintenance” have good taste. (Perhaps that is what the authors of the paper assumed as well.) But then your rule seems superfluous—who does it help?

    (As for “random changing going on in the core logic of GCC and LLVM”, I just meant that people are non-deterministic and so the changes people make to software are necessarily random ones.)

  10. Chung-chieh, the second part of the rule says: “Program logic that must be understood, modified, analyzed, and (especially) maintained by human developers should not evolve.” This seems to offer pretty clear guidance.

    But of course this is just a blog post — I’m happy for people to disagree.

  11. One key distinction between cases like engine valve timing and compiler software is that the set of inputs to the former can be regarded as having a fixed probability distribution. We generally seek to maximize expected performance, and by choosing a sufficiently large test set (larger than the search space of potential algorithms) we ensure that any algorithm which performs well on the test set has a high probability of exhibiting similarly good expected performance when faced with problems randomly drawn from the probability distribution.
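
    Roughly, this is a union bound over the (finite) search space: writing err(A) for the true failure probability of candidate A on the fixed input distribution, and testing on m independent inputs,

        \Pr\bigl[\exists A \in \mathcal{A} :\ \mathrm{err}(A) > \varepsilon \text{ and } A \text{ passes all } m \text{ tests}\bigr] \le |\mathcal{A}|\,(1 - \varepsilon)^{m},

    which is negligible once m is a modest multiple of \ln|\mathcal{A}| / \varepsilon (so a test set larger than the search space is more than enough).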

    For security software, however, we seek to prevent an adversary from finding an input which causes insecure behavior. This adversary may depend on the algorithm itself — it cannot be regarded as fixed. (Note that it is not necessary for a compiler, etc. to be correct — it is only necessary that it appear correct from the point of view of fast adversaries.) Hence no fixed, sub-exponential test set can adequately represent the inputs the compiler must face.