The Future of Software System Correctness

A few weeks ago I re-read Tanenbaum et al.’s 2006 article Can We Make Operating Systems Reliable and Secure. They begin by observing that it would be nice if our general-purpose operating systems were as reliable as our cars and televisions. Unfortunately, Tanenbaum’s vision is being realized in the worst way: as the amount of software in cars and televisions increases, these products are becoming far more prone to malfunctions resulting from software errors. This begs the question: can we create a future where large, important, software-intensive systems work?

Let’s briefly look at some non-solutions to the software problem. First, we can’t just prove everything to be correct. This is way too expensive and most real systems lack formal specifications. At present, it is not even clear that the correct behavior of large systems can be formalized at all, though hopefully this will be possible someday (exercise for the reader: formalize Asimov’s Three Laws in HOL, Coq, or a similar language). Another non-route to safe software is pervasive use of hardware interlocks to prevent decisions made by software from doing damage. This is not going to happen because interlocks are too expensive and in many cases humans are an inappropriate fallback, for example because they are too slow or not in the right place at the right time. There’s no silver-bullet programming language, certainly; nor does traditional software engineering have the answers. A final non-solution would be to reduce our reliance on software systems. For reasons I think we do not fully understand, the universe wants lots of computation to be embedded in it, and we are going to put it there.

Given that all non-dystopian futures for the human race (and many dystopian ones too) include massive reliance on software-intensive systems, and taking into account the obvious impossibility of eliminating all bugs in these systems, the best we can reasonably hope for is a future where bugs do not have serious real-world impact. In other words, they do not (often) make mistakes that cost a lot of money, damage the environment, or kill a lot of people. Realizing this future is going to require many technological innovations and also we’ll need to significantly rethink the process of creating software systems.

The rest of this piece discusses some of the technical ideas that I believe will be most important during the next 25 years or so. Wherever possible, I’ll provide links to examples of the kinds of thinking that I’m talking about. Obviously the software safety problem has many human aspects; I’m not writing about those.

Automated Generation of Test Inputs

Whitebox testcase generators that take the structure of the system under test into account have improved greatly in the last few years. One of the best examples is Klee, which is not only open source but (atypically for research products) is engineered well enough that it works for a broad class of inputs other than the ones tested by the original developers. Drawbacks of whitebox testing tools include:

  • the path explosion problem prevents them from testing medium to large systems (the current limit seems to be around 10 kloc)
  • they do not support complex validity constraints on inputs
  • code living behind hash functions and other hard-to-reverse computations is inherently resistant to whitebox techniques

A different approach is to randomly generate well-formed testcases; the recently announced cross_fuzz tool and my group’s compiler testing tool are good examples, having found ~100 and ~300 bugs respectively in deployed systems. Random testing of this kind has significant, well-known drawbacks, but it can be extremely effective if the generator is carefully designed and tuned.

Future tools for automatically generating test cases will combine the best features of whitebox and constrained random testing. In fact, an example combining these features already exists; it is really cool work but it’s not clear (to me at least) how to reuse its ideas in more complicated situations. I think it’s fair to say that the problems encountered when mixing whitebox-type constraints and sophisticated validity constraints (e.g. “input is a valid collection of sensor inputs to the aircraft” or “input is a valid C program”) are extremely difficult and it will take a while to work out good solutions. Note that you can’t get around the validity problem by saying something like “well, the system should operate properly for all inputs.” Even if this is true, we still want to spend much of our time testing valid inputs and these typically make up an infinitesimal part of the total input space.

Scalability problems in software testing are inherent; the only possible solutions I can see are those that exploit modularity. We need to create smallish pieces of software, test them thoroughly, and then come up with ways to make the testing results say something about compositions of these smaller units (and this reasoning needs to apply at all levels of software construction). While “unit testing is good” may seem too stupidly obvious to be worth saying, there are many complex software systems with substantial internal modularity (GCC and Linux, for example) whose component parts get little or no individual testing.

Large-Scale Virtualized Environments

Executable models of complex systems — the Internet, the financial system, a fleet of ships, a city-wide automated driving system — are needed. The National Cyber Range is an example of an effort in this direction. These environments will never capture all the nuances of the emulated system (especially if it includes humans) but they’re a lot better than the alternative which is to bring up fragile systems that have never been tested against meaningful large-scale events, such as an earthquake or a concerted DoS attack in the case of a city-scale automatic driving system.

Increasingly, software verification tools and testcase generators will be applied at the level of these large-scale emulated environments, instead of being applied to individual machines or small networks. What kind of rogue trading program will maximally stress the stock market systems? What kind of compromised vehicles or broken network links will maximally wig out the automated driving system? We’d sure like to know this before something bad happens.

Years ago I did a bunch of reading about federated simulation environments. At that point, the status seemed to be that building the things was a huge amount of work but that most of the technical problems were fairly straightforward. What I would hope is that as software systems increase in number and complexity, the number of kinds of physical objects with which they interact will grow only slowly. In that case. the economics of reuse will lead us to create more interoperable physics engines, astro- and hydrodynamics packages, SoC simulators, etc., greatly reducing the cost of creating future large-scale simulation environments.

Self-Checking Systems

What would happen if we created a large system under the restriction that 98% of all system resources had to be used for redundancy, error detection, checkpointing and rollback, health monitoring, and related non-functional activities? Would we think of interesting ways to use all those cycles and bytes? Probably so. Large-scale future systems are going to have to put a much larger fraction of their resources into this kind of activity. This will not be a problem because resources are still rapidly getting cheaper. As a random example, the simplex architecture is pretty awesome.

Independent Failure

Independent failures are at the core of all reliability arguments. A major problem with computer systems is that it is very difficult to show that failures will be independent. Correlated failures occur at different levels:

  • all Windows machines may be affected by the same zero-day exploit
  • hard drives from the same batch may all fail at around 15,000 hours
  • a VMM bug or a bad RAM cell can take out multiple VMs on the same physical platform
  • n-version programming is known to not be a panacea

Simply put, we need to develop better ways to create defensible arguments about independent failure. This is a tough nut to crack and I don’t have good ideas. I did, however, recently see an interesting talk with some ideas along these lines for the case of timing faults.

Margin for Software Systems

Margin is at the center of all reliable engineered systems, and yet the concept of margin is almost entirely absent from software. I’ve written about this before, so won’t repeat it here.


Modularity is the only reason we can create large software systems at all. Even so, I’m convinced that modularity is far from being played out as a source of benefit and ideas. Here are some areas where work is needed.

The benefits that we get from modularity are fewer if the interfaces between modules are the wrong ones. Our interfaces tend to be very long lived (consider TCP/IP, the UNIX system call interface, and x86) and the suboptimality of these is a constant low-level drag on system construction. Furthermore, the increasing size of software systems has made it quite difficult to start from scratch, meaning that even academic researchers are quite often constrained by 30 year-old interfaces. Eddie Kohler’s work represents an excellent recent body of work on interface re-thinking at the operating system level.

Big system problems often happen due to unintended interactions between components, which are basically failures of modularity. These failures occur because we pretend to be using abstractions, but we’re actually using pieces of code. Real software tends to not be very modular at all with respect to bugs, timing behavior, or memory allocation behavior. Coming to grips with the impact of leaky abstractions on system construction is a critical problem. Plugging the leaks is hard and in most cases it has to be done one at a time. For example, switching to a type-safe language partially plugs the class of abstraction leaks caused by memory safety violations; synchronous languages like Esterel render systems independent of certain kinds of timing behavior; etc. An alternative to plugging leaks is to roll the leaking behaviors into the abstraction. For example, one of my favorite programming books is Leventhal and Saville’s 6502 Assembly Language Subroutines. Each routine contains documentation like this:

Registers Used: All

Execution time: NUMBER OF SHIFTS * (16 + 20 * LENGTH OF OPERAND IN BYTES) + 73 cycles

Data Memory Required: Two bytes anywhere in RAM plus two bytes on page 0.

Obviously it is challenging to scale this up, but how else are we going to reason about worst-case behavior of timing-sensitive codes?

Increasingly, some of the modules in a software system, like compilers or microkernels, will have been proved correct. In the not-too-distant future we’ll also be seeing verified device drivers, VMMs, and libraries. The impact of verified modules on software testing is not, as far as I know, well-understood. What I mean is: you sure as hell don’t stop testing just because something got proved correct — but how should you test differently?


We have a long way to go. The answers are pretty much all about modularity, testing, and the interaction between modularity and testing.

Almost Everything in Subversion

Every file I use on a day to day basis — excluding only data shared with other people, email folders, and bulk media — is kept in a big subversion repository. For the five years that I’ve been doing this, I’ve averaged 3.5 commits per day. Overall it works really well. Advantages of this scheme include:

  • Seamless operation across Mac, UNIX, and Windows.
  • I’m always using a local copy of data, so access is fast and there’s no problem if I lose the network.
  • Synchronization is fast and low-bandwidth, so it works fine using airport and coffee shop connections.
  • My machines become stateless; reinstalling an OS is hassle-free and I never backup hard disks anymore.
  • Explicit add of files means that large temporary objects are never dragged across the network.

On the other hand, a few aspects of the plan are less than ideal:

  • Explicit push, pull, and add means I have to always remember to do these, though I seldom forget anymore.
  • Every now and then I have a file that seems a bit too large to checkin to svn, but that doesn’t really fit into my bulk media backup plan.
  • Subversion’s diffing and patching is geared towards text files, and works less than ideally for binary objects such as the .docx and .pptx files I sometimes need to deal with.

But overall I’m totally happy with working this way and probably won’t change for years to come.

A Critical Look at the SCADE Compiler Verification Kit

While searching for related work on compiler testing and certification, I ran across the SCADE Compiler Verification Kit: a collection of SCADE-generated C code along with some test vectors. The idea is to spot compiler problems early and in a controlled fashion by testing the compiler using the kinds of C code that SCADE generates.

SCADE is a suite of tools for building safety critical embedded systems; it generates C code that is then cross-compiled for the target platform using a convenient compiler. Using C as a portable assembly language is a fairly common strategy due to the large variety of embedded platforms out there.

Given that there is a lot of variation in the quality of embedded C compilers (in other words, there are some really buggy compilers out there), something like the Compiler Verification Kit (CVK) probably had to be created. It’s a great idea and I’m sure it serves its purpose. Unfortunately, the developers over-state their claims. The CVK web page says:

Once this verification of the compiler is complete, no further low-level testing is needed on the SCADE-generated object code. [emphasis is theirs]

This is saying that since the CVK tests all language features and combinations of language features used by SCADE, the CVK will reveal all compiler bugs that matter. This claim is fundamentally wrong. Just as a random example, we’ve found compiler bugs that depend on the choice of a constant used in code. Is the CVK testing all possible choices of constants? Of course not.

The web page also states that that CVK has:

…the test vectors needed to achieve 100% MC/DC of structural code coverage.

No doubt this is true, but it is misleading. 100% coverage of the generated code is not what is needed here. 100% coverage of the compiler under test would be better, but even that would be insufficient to guarantee the absence of translation bugs. Creating safety-critical systems that work is a serious business, and it would be nice if the tool vendors in this sector took a little more trouble to provide accurate technical claims.

Grading on a Curve

Although university professors aren’t trained as teachers, many parts of teaching — lecturing, answering questions, creating tests — come naturally to most of us. Assigning grades, on the other hand, is a part of the game that I’ve never felt very comfortable about. Part of the problem is that as a student I found grades to be kind of irrelevant, so it’s disconcerting to realize how important grades are for many students.

It’s easy to just use a standard 90/80/70/60 scale but this often feels unsatisfactory. One problem is that the desirable grades are compressed in the top 17% of the range of scores: the statistical significance of the median A- vs. the median B+ seems seriously questionable, for example. Grades aren’t required to be statistically significant, but certainly we should avoid grading practices that are especially sensitive to noise.

A different grading approach is to decide beforehand that a fixed fraction of the class will get As, another fraction will get Bs, etc. While this doesn’t reduce the fundamental noisiness in students’ scores, it enables me to use the entire range of scores and it avoids the (I would argue artificial) coupling between students’ scores and specific grades. However, students really hate this approach: they feel much more comfortable being judged against an objective standard (nevermind that this doesn’t really exist). Also, if students feel they are competing with each other, they’re less likely to help each other out; that is undesirable since ideally, they learn as much or more from each other as they do from me. Finally, I’ve seen instances where a small class is abnormally strong or weak: in these cases it makes less sense to apply a bell curve.

My sense is that most professors mix these approaches, for example by adjusting the A/B/C/D scale to something other than 90/80/70/60. I’ve talked to several people who make a scatter plot of scores and then put the grade boundaries between clusters of students. This approach has a certain logic to it, but one of my colleagues argues that it’s statistical nonsense. When pressed, he failed to describe a convincing model of student behavior under which one could argue about the relative amounts of signal and noise in grades, but my feeling is that his intuition is sound. An even worse curving practice is “curve to highest” where everyone is graded against the highest achieved score instead of the maximum possible score. This is bad because the highest score is by definition an outlier — we should try to minimize the effect of outliers, not amplify it.

Over the years I’ve evolved a pretty simple approach to assigning grades in “real classes” (as opposed to graduate seminars and such): I look at the grades that would be assigned using a 90/80/70/60 scale and decide — based on what I know from watching students work, talking to them, looking at their code, etc. — if the grades are appropriate or if they are too low. In the latter case, I curve students scores by giving then back a fixed fraction of points they missed. For example, I may give back 1 point for every 3 points missed, turning a 70% in the course into an 80%, a 0% into a 33%, etc. This is simple and has a pleasing rationale: my assumption is that some missed points are because the student failed to study or whatever, but other missed points are because I worded a question poorly, bungled a lecture, or similar. This approach also doesn’t compress scores near the top very much. In principle, if the 90/80/70/60 scale leads to scores that are too high, I should apply a reverse curve. In practice, this would piss off the students too much, and anyway I’m not sure that I’ve seen it happen.

I’d be interested to hear others’ rationales for assigning grades in a particular way. Probably this discussion is very US-specific — due to working on graduate admissions I’m well aware that grading conventions differ considerably around the world.

On the Synergy Between Facebook and CAN Bus Error Confinement

One of my favorite lectures in my embedded systems class covers the design of CAN bus: a highly robust network most often used in automotive applications. CAN’s design is elegant in many ways and it includes several mechanisms to help keep the network operating under adverse conditions. One of these mechanisms is error confinement (see pp. 61-63 of the CAN 2.0 standard). Each CAN node maintains a transmit error count and a receive error count. These are incremented when various transmit and receive errors are detected, and are decremented when the bus is used successfully. The idea is to build up a crude estimate of how faulty a node is. When either count exceeds 127 a node loses the right to transmit, and when either count exceeds 255 the node is required to remove itself from the network completely.

The other day I realized that I deal with Facebook friends in a similar way to how CAN bus deals with faulty nodes: if someone’s signal to noise ratio falls below some critical value I hide their subsequent updates. Previously I hadn’t formalized the rules but it’s something like this:

  • Each friend’s score starts at 0
  • +1 for any unoriginal political comment
  • +3 for a particularly asinine political statement
  • +1 for a remark about the performance of a college or professional sports team
  • +1 for a cryptic personal comment
  • +1 for describing the contents of any meal
  • +1 for too much information
  • Reset to 0 anytime the friend posts something offbeat, funny, original, relevant, or interesting
  • Hide anyone whose count reaches 6

So far I’ve hidden just 5 friends out of 161 — not too bad. I suspect the number would be a lot higher if those appallingly stupid Farmville updates couldn’t be blocked separately.

A Quick Guide to Surviving Departmental Politics

One of my favorite cartoons from What They Didn’t Teach You in Graduate School, a book all junior faculty should read

There are some good guides to navigating office politics. This isn’t one of them. Rather, it’s a bare-bones survival guide for junior professors who would rather be doing anything other than wondering what their Sun Tzu-reading, Machiavelli-emulating colleagues are up to.

  1. Buy a copy of Straight Man and read it carefully. Although it is an extremely funny book, resist the temptation to laugh by reminding yourself that all of the characters actually exist, and you work with them. If this seems to not be the case, it’s because you don’t know your colleagues well enough yet. Russo was a professor and everything in this book is true, especially the part about the goose.
  2. When you’re at a faculty meeting and a high-stakes decision is made uncharacteristically quickly, people are acting a bit out of character, or a group of people changes their story for no obvious reason, trust your intuition. Eight times out of ten you’re seeing the public face of some back-hallway deal. Keep your mouth shut and later on ask someone what was really going on. If they tell you nothing was going on, find someone smarter or more honest — eventually you’ll learn about a vast conspiracy that ends with grassy knolls and chemtrails (if not, your colleagues are a lot less creative than mine are). Then make up your own mind.
  3. Don’t be perceived as a threat. The easiest ways to do this are to appear to be career-centered and politically inept. Both of these come naturally to many of us.
  4. In a politically charged situation, saying the first thing that comes to mind is always wrong, unless it serves the goal of appearing inept (see #3).
  5. The way to understand almost any situation is to follow the money.  Well, there’s probably not much money going around, so follow the resources that matter: faculty slots and space. When people speak, ask yourself what they stand to gain or lose; it won’t take long to notice that there are people who never speak against their short-term self interest. It’s useful to know.
  6. As long as you are single-mindedly career-focused, nobody is likely to stand in your way because the department benefits if your goals are met. Once you get entwined in hiring or curriculum issues, help out a colleague or student who’s being treated poorly, or similar, the possibility of conflict arises. Just be clear about your goals and try to understand others’ stake in the situation before charging in.
  7. Tenure, by putting incompatible, unfireable people in close proximity for decades on end, leads to behavior no daycare teacher would tolerate for a second. Learn some of the history behind the personality issues. The details will not be surprising or interesting, but they may help you understand (or at least diagnose) your colleagues’ behavior.

So far I’ve muddled my way through using a combination of strategies: careerism, political ineptness, being too busy raising kids to care about much beyond my own teaching and research, and finding a large group of colleagues I genuinely like and respect. Let’s hope it keeps working.