How Did Software Get So Reliable Without Proof?


Tony Hoare’s 1996 paper How Did Software Get So Reliable Without Proof? addresses the apparent paradox where software more or less works without having been either proved correct or tested on more than an infinitesimal fraction of its execution paths. Early in the paper we read:

Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations. The arguments were sufficiently persuasive to trigger a significant research effort devoted to the problem of program correctness. A proportion of this research was based on the ideal of certainty achieved by mathematical proof.

Fortunately, the problem of program correctness has turned out to be far less serious than predicted. A recent analysis by Mackenzie has shown that of several thousand deaths so far reliably attributed to dependence on computers, only ten or so can be explained by errors in the software: most of these were due to a couple of instances of incorrect dosage calculations in the treatment of cancer by radiation. Similarly predictions of collapse of software due to size have been falsified by continuous operation of real-time software systems now measured in tens of millions of lines of code, and subjected to thousands of updates per year.

So what happened? Hoare’s thesis—which is good reading, but largely uncontroversial—is that a combination of factors was responsible. We got better at managing software projects. When testing is done right, it is almost unreasonably effective at uncovering defects that matter. Via debugging, we eliminate the bugs that bite us often, leaving behind an ocean of defects that either don’t manifest often or don’t matter very much when they do. Software is over-engineered, for example by redundant checks for exceptional conditions. Finally, the benefits of type checking and structured programming became widely understood during the decades leading up to 1996.

The rest of this piece looks at what has changed in the 16 years since Hoare’s paper was published. Perhaps most importantly, computers continue to mostly work. I have no idea if software is better or worse now than it was in 1996, but most of us can use a computer all day without major hassles. As far as I know, the number of deaths reliably attributed to software is still not hugely larger than ten.

Software, especially in critical systems, has grown dramatically larger since 1996: by a factor of at least 100 in some cases. Since the factor of growth was probably similar during the 1980-1996 period, it seems a bit odd that Hoare didn’t include much discussion of modularity; this would be at the top of my list of reasons why large software can work. The total size of a well-modularized software system is almost irrelevant to most developers, whereas even a small system can become intractable to develop if it is sufficiently badly modularized. My understanding is that both the Windows NT and Linux kernels experienced serious growing pains at various times when coupling between subsystems made development and debugging more difficult than it needed to be. In both cases, a significant amount of effort had to be explicitly devoted to cleaning up interfaces between subsystems. When I attended the GCC Summit in 2010, similar discussions were underway about how to slowly ratchet up the degree of modularity. I’m guessing that every large project faces the same issues.

A major change since 1996 is that security has emerged as an Achilles’ Heel of software systems. We are impressively poor at creating networked systems (and even non-networked ones) that are even minimally resistant to intrusion. There does not appear to be any evidence that critical software is more secure than regular Internet-facing software, though it does enjoy some security through obscurity. We’re stuck with a deploy-and-patch model, and the market for exploits that goes along with it, and as far as I know, nobody has a better way out than full formal verification of security properties (some of which will be very difficult to formalize and verify).

Finally, formal correctness proofs have not been adopted (at all) as a mainstream way to increase the quality of software. It remains almost unbelievably difficult to produce proofs about real software systems even though methods for doing so have greatly improved since 1996. On the other hand, lightweight formal methods are widely used; static analysis can rapidly detect a wide variety of shallow errors that would otherwise result in difficult debugging sessions.
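
To give a sense of what “shallow” means here, below is a minimal sketch in C (a contrived example, not taken from any particular tool’s documentation) of the kind of defect a good static analyzer flags on sight: an allocation failure gets logged, but the pointer is then dereferenced anyway.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Copy a string; the NULL check logs the failure but does not return,
       so strcpy below can still be reached with copy == NULL. */
    char *dup_name(const char *name) {
        char *copy = malloc(strlen(name) + 1);
        if (copy == NULL)
            fprintf(stderr, "out of memory\n");
        strcpy(copy, name); /* analyzer: possible NULL pointer dereference */
        return copy;
    }

    int main(void) {
        char *n = dup_name("hoare");
        printf("%s\n", n);
        free(n);
        return 0;
    }

Bugs of this shape rarely surface during ordinary testing, since malloc almost never fails on a developer machine, but a path-sensitive checker finds the bad path in seconds.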


23 responses to “How Did Software Get So Reliable Without Proof?”

  1. I think the key insight is that testing removes all bugs that matter. In that sense the first time a piece of code goes into production the most thorough and important test starts. After a few patch releases all bugs that matter are gone.

    GCC is a great example of this: It is buggy as hell, but most bugs don’t matter. Everything that people actually do with the product has been rigorously tested through practical use.

    And that is what people mean by “a mature piece of software”. Software matures.

  2. Hi tobi, of course the problem is that testing removes many of the bugs that matter, but not all, and it’s better at some kinds of bugs than others. At home my wife has a Windows 7 machine and I have one running the latest Ubuntu and each of these machines downloads many MB of security patches per month.

  3. I saw Sir Tony give a talk about 4 years back in which he stated that software correctness is the “big problem” in software engineering, analogous to other “big problems” such as incompleteness, halting, and some other examples from the sciences. He also discussed SLAM (model checking device drivers), which is relevant to this conversation. It seems like model checking is halfway between coverage testing and formal proof systems.

  4. Since I am someone who publicly argues that meta-proofs in PL are overrated, I can’t argue with overrating proofs for software.

    I agree with some parts of your analysis, especially “testing catches bugs” and “deployment plus maturity eliminates bugs”.

    But I will argue with your analysis of Hoare’s paper. First, in 1996 type checking was mostly done with unsound type systems. They have about the same value as a style checker — useful, but meaningless. Second, we seem to forget pain much faster than we forget joyful experiences. Sure enough, software worked back then, especially for us who used Unix-y systems. But ask some consumers who used Windows Foo.Bar. They have war-stories to tell about blue screens of death wiping out hours/weeks/months worth of work. Third, it is wrong to measure the flaws of software in terms of death. It’s like measuring how many bridges collapsed and killed how many people, a favorite occupation of civil engineering students. Once we get as old as civ eng, I am sure our discipline will have killed as many people as any other engineering discipline.

    So let’s measure the software problem in terms of ‘dollars lost’. Bridge-building engineers have much less of a problem there. The only cost is maintenance. Our broken OSes destroy people’s work. Stop aircraft carriers in the middle of oceans. Eliminate web browsers from all government offices for a few months (come to think of it, this might be a benefit).

    How did we get better, a lot better, since 1996? MS adopted static analysis, a product of academic CS. Others did too. Advanced parts of MS adopted model checking, another CS product. Some hardware companies do use interactive theorem provers, and that’s effective for low-level software, too.

    We do have a software quality problem left, we have a cost problem left, and we should figure out how to diagnose these properly.

  5. Hi Matthias, I suspect we mostly agree but let me play the devil’s advocate a bit…

    Does anyone besides POPL authors really care about the soundness of a type system? For example, if you claim I should care about this, can you quantify the benefits, perhaps in terms of programmer productivity?

    Do you think Microsoft’s static analysis and model checking have had a measurable effect on product quality? This is probably very tough to quantify. The facts are that (1) most of the bugs that matter are still caught by testing and (2) MS software (and everyone else’s) is still patched like crazy. It’s not so totally clear to me that these tools are yet having a major effect.

    Finally, of course I agree that overall it’s all about money. If we wanted an error-free OS we’d use NASA-style development methods and it would cost only 5% of the GDP.

  6. The soundness of a type system has a directly measurable consequence for every programmer. By analogy, a type system is a weather predictor; when the program is run, you experience the weather. A sound type system predicts the weather accurately; an unsound type system may predict the weather correctly for your living room but not the back yard. [Words carefully chosen here.]

    The purpose of a type system is to rule out misinterpretations of bits by operations, e.g., bits representing a function pointer should not be negated as if they were a boolean. The specific statement is that a program in a language with a sound type system either produces a value of the correct kind, goes into an infinite loop, or stops the very moment an operation might manipulate the wrong bits — and a real theorem enumerates the precise situations when this may happen.

    How is the ordinary consumer concerned? A language with an unsound type system — say C++ — checks that certain rules are satisfied and then runs machine code on bits. If the instructions tell it to negate the bits of a function pointer and to interpret them afterwards as a number, the computer will happily do so. Eventually this program may print ’42’ and you have no clue whether the bits it printed belong to a number. Think of a pacemaker printing the wrong number.
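
    A minimal sketch in plain C of what I mean (C is just as unsound here as C++, and the details are invented for illustration): the bits of a function pointer are copied into an integer, flipped, and printed, and the language never objects.

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        static int halve(int x) { return x / 2; }

        int main(void) {
            int (*fp)(int) = halve;
            uintptr_t bits = 0;
            /* Assumes a function pointer and a uintptr_t have the same size,
               which holds on common platforms. */
            memcpy(&bits, &fp, sizeof bits);
            bits = ~bits;                               /* "negate" the pointer's bits */
            printf("%llu\n", (unsigned long long)bits); /* a number with no meaning */
            return 0;
        }

    The program compiles, runs, and prints a number; nothing tells you the number is meaningless.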

    How is the ordinary programmer affected? If the programmer is lucky, a faulty interpretation of bits eventually triggers a seg fault, a bus error, etc. Then he needs to search for the actual bug. BUT since he cannot rely on the type checker’s prediction, the typings do NOT narrow down the bug search. The programmer has to assume that in principle the bug can be anywhere in the program. In contrast, a type-sound language narrows down the search for bugs to very narrow slices of the program. Programmers become much more productive.

    Industry confirmation: IBM conducted a large-scale product development experiment in the mid 1990s. The project, dubbed San Francisco (it may still be searchable), involved 120 software architects trying to create a complete ‘back room’ management system for large corporations. They started in C++ and switched to Java because they measured a factor (not percent) of 3 to 4 more productivity per programmer in the project. It was this project and a follow-up measurement that convinced IBM to switch almost all of its new software creation to Java.

    The use of C++ in industry is no longer justified technically. The proper approach would be to use a type-sound language and to drop to C for hot spots as needed. A good research topic is how to make this ‘drop’ as cheap as possible.

    p.s. C’s type system is extremely honest. It explicitly says that a type simply specifies how many bits are allocated. There is no pretense that it is sound at all. For C++, however, this is not the case. Programmers are made to believe that classes define bundles of bits and operations on them and that they are safe. Nothing could be further from the truth.

    p.p.s. The other part of Java’s advantage is automatic memory management. In C++, a programmer must make ‘resource claims’ at component boundaries, e.g., this pointer is mine and this one is yours. Of course, nobody really writes them down and nothing checks them if they are written down in English. More generally, the lack of automatic memory management forces programmers to extreme care in a program with many components. By comparison, Java’s memory management relieves programmers from this task, vastly increasing the modularity of the program. — You and I agree that modularity is critically important. We may disagree how much it helped the Java programmers in the IBM experiment. I am willing to split the benefits at the 50-50 level, because I have a long-running experience programming in an untyped world.

  7. Microsoft experienced disastrous software problems at the turn of the century. I am sure if you google for them, you can still find them.

    In response, Gates eventually requested two changes to their software production process: a day of bug fixing and the use of static analysis tools. These tools had been in the making for a decade; indeed, one large tool is implemented as the largest Scheme program in the world (to this day). One of the people who ran this tool project at MSR, Daniel Weise, moved to MS/development and realized he didn’t need this heavyweight tool but simple analyses (“scanning” and “parsing” he labeled them). So they injected such tools into the daily software process (array bounds checks, deallocation checks, etc.), faking memory safety and type safety as well as they could. That is, the tools helped make individual programs ‘sound’ because the language did not guarantee it for all programs. Daniel used to give talks on this story, and he may still do so.

    The story of model checkers at MS is well documented in academic papers. I do not know how deep its impact is, but it is certainly much shallower than that of static analysis tools.

  8. Hi Matthias, I guess my skepticism about soundness comes from the fact that in the end I’m a systems programmer and at some point during the day I am going to take a pile of bits from a hardware register and interpret some of them as pointer offsets, some of them as regular integers, and some of them as flags.
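
    For concreteness, here is a minimal sketch of the kind of thing I mean (the register name, value, and field layout are invented for illustration): one 32-bit status word carrying an offset, a count, and a couple of flags.

        #include <stdio.h>
        #include <stdint.h>

        /* Hypothetical device status register layout:
           bits 0-15  : buffer offset
           bits 16-23 : element count
           bit  24    : ready flag
           bit  25    : error flag */
        #define STATUS_OFFSET(r) ((uint32_t)(r) & 0xFFFFu)
        #define STATUS_COUNT(r)  (((uint32_t)(r) >> 16) & 0xFFu)
        #define STATUS_READY(r)  (((uint32_t)(r) >> 24) & 0x1u)
        #define STATUS_ERROR(r)  (((uint32_t)(r) >> 25) & 0x1u)

        int main(void) {
            /* In a real driver this value would come from a volatile MMIO read. */
            uint32_t reg = 0x012A0040u;
            printf("offset=%u count=%u ready=%u error=%u\n",
                   (unsigned)STATUS_OFFSET(reg), (unsigned)STATUS_COUNT(reg),
                   (unsigned)STATUS_READY(reg), (unsigned)STATUS_ERROR(reg));
            return 0;
        }

    Nothing about the register tells the language which bits are an offset, which are a count, and which are flags; that knowledge lives only in the driver writer’s head and the datasheet.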

    Also, however much I love GC in day to day life, I’m going to write programs that cannot tolerate it, meaning that any language which mandates GC isn’t going to get the job done.

    Does Perl have a sound type system? Does Python? I’m not trying to make a point here but rather actually asking the question.

  9. Yes. Racket, Perl, and Python are type-sound as ‘ideals’ but their implementations are likely to contain bugs.

    I understand systems programming. I feel your pain.

  10. A book really worth reading about these problems: Les Hatton’s “Safer C”. If K&R is the bible for C programmers, this one should come second.

    Now, my humble opinion is that many new programming technologies are overrated; Java or unit testing (or whatever you like) do not prevent silly programmers (a sort of PEBKAC). Simplicity (both in design and implementation), program lifetime (multiple use cases too), and programmers who know what they are doing and master their tools… these are the keys to robustness.

  11. Hi John, dealing with hardware registers and bags of bits does indeed require that the language have some special mechanisms, but I don’t think that necessarily means the language should give programmers the ability to violate its own constructs. For example, reading and writing memory buffers and hardware registers doesn’t necessitate having unsafe casts between object types or unsafe covariance of object arrays.

    There have been some successful projects that do low-level programming in safe languages, without introducing violations of language abstractions. Typically it is instead done by introducing new types and values for the low-level things and then allowing the rest of the language constructs to remain unchanged; if the language has nice compositional properties, then the new low-level things can easily be incorporated. Languages where this has been successful include Oberon, Haskell, Ocaml, Smalltalk, and C# at the very least. It has been rather less successful in languages like Java. (For the record, I would agree with Matthias that C++ has failed spectacularly, in the sense that it repeatedly violates its own fictions and every single one of its abstractions is leaky.)

    The problem seems to be that there is huge inertia to overcome, and generation after generation of languages has been unable to compete with the C/not C mindset.

  12. “If we wanted an error-free OS we’d use NASA-style development methods and it would cost only 5% of the GDP.”

    We need a term more specific than NASA-style. JPL drives around Mars OK, but I assure you it does not use any technique that would get you an error-free OS, or anything like it.

  13. “static analysis can rapidly detect a wide variety of shallow errors that would otherwise result in difficult debugging sessions.”

    I think static analysis is being undersold here. I think a large part of the benefit of moderate adoption of reasonably heavy static analysis is catching bugs that are low probability and escape most testing altogether. Including some that would escape massive-scale random testing, even if it were applicable, because they don’t fit any probability distribution that works well for random generation. They don’t just increase productivity/decrease bug detection cost, they can simply eliminate important bugs.
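
    A minimal sketch of the kind of bug I mean (a textbook example, not taken from any particular codebase): the classic midpoint overflow fires only when lo + hi exceeds INT_MAX, an input that neither hand-written nor random tests ever construct, yet an overflow-aware analyzer flags the expression directly.

        #include <limits.h>
        #include <stdio.h>

        /* Overflows (undefined behavior) only when lo + hi exceeds INT_MAX,
           i.e. for index ranges in the billions that ordinary tests never build. */
        int midpoint(int lo, int hi) {
            return (lo + hi) / 2;
        }

        /* The usual repair, safe for all 0 <= lo <= hi. */
        int midpoint_fixed(int lo, int hi) {
            return lo + (hi - lo) / 2;
        }

        int main(void) {
            printf("%d %d\n", midpoint(0, 100), midpoint_fixed(0, 100));
            printf("%d\n", midpoint_fixed(0, INT_MAX)); /* still correct at the extreme */
            return 0;
        }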

  14. Sorry Alex, I should have been more specific! I was thinking about methods more like the ones I’ve heard described for space shuttle software development where every change goes through a review board, etc.

  15. I assumed. Shuttle avionics is around 2MLOC, right? I wonder what the cost/line is.

    Of course, the shuttle suggests that perhaps comparing to the perfection of other engineering fields is a bit dubious…

  16. My second response was too terse, so let me expand on it.

    What I would like to argue is that systems programming can be done in languages other than C. See old Lisp machines for a counter-example. See MrEd for a 12-year-old example. Gambit was ported to tiny little ‘bots and is used for embedded, OS-style programming there, continuations and GC and all (no numeric hierarchy). And see Singularity for a recent one. Modula-3 was intended for this level, too. There is no intrinsic need to write all systems code in C.

    Systems people have ignored explorations in this direction for reasons that I consider psychological and sociological in nature. They have chosen C over alternatives for an eternally long time, and now they argue that there is too much C code to be ignored. Should we really feel sorry for them?

  17. I would say that the exponential increase in system complexity has emerged as the Achilles’ Heel of software systems, not just security. In essence, growing interdependence between systems that “work,” but are not correct, results in interfaces that are increasingly “leaky” or unpredictable (whether the interface be an API, an Android UI, or a financial product). Malicious folks exploit this phenomenon for their own gain, but the underlying issues with interdependence manifest themselves in other important ways, such as the 2010 “flash crash” and other erratic financial market behavior. IMHO the importance of bringing these historically esoteric reliability techniques toward mainstream testing and development practices is correlated with the increase in complexity of the software that supports our daily activities (and, by extension, with time).

  18. > Should we really feel sorry for them?

    “Them”? You’re using lots of devices programmed in C.

  19. @Matthias:

    The proof of the pudding is in the eating, not in the contemplation thereof. In other words, arguments for “Lisp for system programming” would be a lot more persuasive with a few more real-world examples. Even better, real-world examples plus well-written “what worked well, what didn’t” papers about them. (A classic in this genre is Howard Trickey, “C++ versus LISP: A Case Study”, ACM SIGPLAN Notices 23(2), Feb 1988, pp 9-18, http://dx.doi.org/10.1145/43908.43909)

    Linux has demonstrated that the barrier to new general-purpose OS kernels isn’t that high (plenty of us early-adopters ran kernels of the 0.9* vintage a few years back). So… how about some Unix-kernel-written-in-Lisp projects?

    Or if an OS kernel is too big of a job, how about reengineering some existing non-trivial system code in Lisp and demonstrating how much better (easier/faster to program, easier/faster to debug, freer of bugs, more secure, etc.) it is than the original? LCC and PCC demonstrate that “interesting” C compilers don’t have to be multi-million-lines-of-code monsters, so how about an LCC/PCC-level compiler written in Lisp? How about an office suite to compete with LibreOffice?

    Or how about a TCP/IP stack? From what I’ve read, OS’s existing TCP/IP stacks tend to be really ugly code full of scads of void* type-punning, so rewrites that make them cleaner would be very desirable. So… given the claims that going back and forth to C for hot spots is so easy, how about a Lisp TCP/IP stack that plugs into an existing free OS kernel (Linux, *BSD, etc)?

    Or how about reengineering our esteemed blogmaster’s Csmith (currently ~40K lines of C++) in Lisp?

  20. The number of deaths is surely more than something approaching ten. I’m pretty sure that software caused deaths per year are >100, based on the number of medical scanner mishaps alone.

  21. Of course there are potentially indirect deaths too. Somebody uses a security exploit to steal nuclear technology from a military computer and uses it to blow up Detroit, or whatever.