Academic Bug-Finding Projects in the Long Run

A number of computer science researchers, including me, have made careers out of creating tools that automate, at least partially, the process of finding bugs in computer programs. Recent work like this can be found in almost any good systemsy conference proceedings such as SOSP, OSDI, ASPLOS, ICSE, or PLDI. Examples go back many years, but a good early-modern paper in this mold is “Warlock – A Static Data Race Analysis Tool” from USENIX 1993.

How do bug-finding papers get published? Two things are required. First, bugs should be found. Finding a few bugs is fine if they are high-value, deep, or otherwise interesting, but it’s also OK to find a large number of garden-variety bugs. It’s not required that the bugs be turned into bug reports that are well-received by the people who maintain the buggy software, but that is definitely a plus. It’s not even required that a bug-finding tool (or algorithm) finds any bugs at all, though I personally don’t have much use for that sort of paper. The second requirement is some level of novelty in the bug-finding tool. For example, the novel thing about Csmith was taking undefined and unspecified behavior seriously in a program generator and, in particular, the way that it interleaves program generation and static analysis.

So now you know how it works: someone creates a tool, finds some bugs, writes a paper, and that’s that. The problem, as I’ve increasingly come to realize, is that this cycle is not as useful as we would hope. For example, let’s take one of my tools, IOC (here I’m using “my” in a bit of a broad sense: I wrote very little of IOC, though I did cause it to be created). We created the tool, found some integer overflow bugs, wrote a paper. At this point, if I want to maximize things about me that can be measured (papers, citations, grants, etc.), then I’ll stop right there and move on to the next project. But this is a problem — at the time my paper is published, I’ll have only found and reported a tiny fraction of the integer overflow bugs lurking in all the C/C++ codes that are out there. Furthermore, even if I find and report, for example, all of the integer overflows in the Perl interpreter or the Linux kernel, it’s not like these code bases are going to stay clean. Overflow mistakes are so easy to make that developers will just add them back in unless they get frequent feedback from a tool. So, how can we extract more benefit from a tool?

  • One possibility is that my actual tool doesn’t matter since people will read the paper and implement similar techniques. From a selfish point of view this is the best case since my work has impact without my doing any additional work, but given the flood of CS papers, it’s a lot to expect that busy people in the real world will re-implement all of the useful bug-finding work that’s out there.
  • If I open-source my tool, people can use it. However, the bar is pretty high if (as is the case for IOC) people have to build a patched Clang, build their project using the new compiler and some special flags, run some tests, and then interpret the tool’s output. Some academic bug-finding tools are easier to use than this (and in fact IOC will be easier to use if we can get it included in LLVM), some are harder, and many are impossible to use since the source is not available or the tool is effectively unusable.
  • If I create a company based around my tool (think Coverity) then it becomes widely available, but so far I haven’t been interested in facing the big lifestyle change that comes along with a startup. Also, I believe that the products of publicly funded research should be open source.

The point is that the hit-and-run style of bug-finding research is depressingly inadequate as a way of improving software quality in the long run. We need to do better.

Damn You, Zork II

Some significant part of my middle school years (ages 11-14, roughly) was spent struggling with a couple of text adventure games, and none of them was more tantalizing or frustrating than Zork II. In particular, the bank and baseball maze parts of this game completely stopped me, despite my large piles of notes and countless hours spent typing increasingly random commands. My family lived in Africa at the time, so local Zork expertise was hard to come by. Moreover, I had no access to USENET, a BBS, or any similar resource (this was the mid-1980s), so I eventually shelved the game and only returned to it after we moved back to the USA and I found an employee at a local computer store who I could pester for hints. I had more or less forgotten about this odd part of my life until today, when the Digital Antiquarian singled out exactly those two Zork II puzzles as being particularly difficult and unfair. It was an infuriating game and I wish I could play it afresh.

Recording a Class at Udacity

A good chunk of my time between the end of Spring semester and now was spent recording a class on software testing at Udacity. This piece isn’t a coherent essay but rather some random impressions and thoughts.

The main questions are: What is an online course supposed to look like? How can we keep students engaged and interested without in-person contact? Nobody really knows the answers but Udacity’s current solution is:

  • A course consists of seven week-long units.
  • A unit contains about an hour’s worth of nuggets: short video segments (one to ten minutes, usually) mainly showing either a programming environment or a person’s hand writing on a tablet, but supplemented by interviews, field trips, and other material.
  • Nuggets are punctuated by quizzes: multiple choice, short answer, and coding in the online Python IDE.
  • Units are followed up by homework assignments.

After spending some time going through their course material, I decided that I like this structure. For one thing, it matches my short attention span. After talking to Dave Evans and some Udacity TAs, it became clear that designing good programming quizzes is one of the keys to turning lecture material into actual learning. Tight integration between listening and hacking is one of the reasons that online learning will — in some cases — end up being superior to sitting in class. So this became my main challenge: creating a sequence of programming quizzes that basically leads students through the material. It was hard and I probably succeeded only unevenly. The other really difficult thing was creating course material without the continuous feedback from students that I’m used to getting (even if it’s only counting how many people are working on their laptops during lecture).

The recording process is this: you shut yourself into a dark, soundproofed room, put on a mic, turn on a video camera and a program that records the tablet, and then record written and spoken material. The editors can remove bad takes and pauses, which, after years of lecturing to a live audience, was hard for me to internalize. However, in the end it was very freeing since I could draw pictures, change colors, look up details on Wikipedia, and attempt to write neatly (always a struggle for me) — all without boring anyone except the editors. The editors are 100% crucial to what Udacity is doing, and the difference between unedited material and the final output is striking. I can’t say enough about how good their work is.

Initially I lectured from a very detailed script. This was useful at first, but eventually I felt that I was wasting too much time scripting things that didn’t quite work, so I switched to going from coarser notes. Even these coarser notes, however, were quite a bit more detailed than what I’d typically put on PowerPoint slides. Recording for 8-12 hours a day was intense and left me fried, but on the other hand the compressed schedule had advantages: when course prep is spread across an entire semester, it’s sometimes hard to see the big picture, and there are often unfortunate compromises due to work travel and paper deadlines.

In addition to learning how to teach in a new format, I learned Python in order to teach this class. More than 15 years ago I realized I needed a scripting language and chose Perl since I hated Python’s significant whitespace. I still find it extremely annoying, but many other things about the language are beautiful and it’s probably the best current choice for teaching lower-division CS classes. My imperfect command of advanced Python may actually work in the students’ favor since it led me to write obvious, not-too-idiomatic code.

To better support programming-intensive courses, Udacity’s Python IDE is going to need additional features, such as access to documentation, to saved files, and to previously submitted solutions (ideally, it’ll connect to my Dropbox). I think it wants to evolve into a workspace something like Matlab’s.

The people working at Udacity are mostly young and work super hard: there were always people in the office when I recorded on evenings and weekends. Basically they’re in full startup mode. Food and drink are available, and even beer, which came in handy after a long day of recording.

Although I have sort of a faint dislike for some parts of Silicon Valley, the California Ave neighborhood in Palo Alto where Udacity’s offices are located is very pleasant, with plenty of mom and pop businesses and great espressos available at Zombie Runner nearby.

After my first day recording, Sebastian Thrun had me, the two other profs recording classes at the time, and several Udacity folks over for dinner. This was a great way to get to know people, and it was fun trying on the Google Goggles and hearing Sebastian’s views on education. He has a huge amount of passion for this topic, as you might expect. He made an interesting analogy between university education now and show business around the time when film was invented. It took years for people to figure out how to use film and, indeed, conventions are still evolving. Moreover, film didn’t eliminate theater; it just replaced a significant part of it. I’m reproducing his points badly, but the analogy stuck with me and I think it has value.

My class hasn’t gone online yet but I’ll try to post more updates as that happens.

Why Isn’t C Memory Safe Yet?

Memory safety errors in C/C++ code include null pointer dereferences, out-of-bounds array accesses, use after free, double free, etc. Taken together, these bugs are extremely costly since they are at the root of a substantial fraction of all security vulnerabilities. At the same time, lots of nice research has been done on memory safety solutions for C, some of which appear to not be incredibly expensive in terms of runtime overhead. For example, baggy bounds checking was shown to reduce Apache throughput by 8% and to increase SPECINT runtime by 60% — both effects are probably of a similar magnitude to compiling at -O0. Here I’m talking about strong versions of memory safety; weaker guarantees from sandboxing can be cheaper.
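
To make these bug classes concrete, here is a deliberately broken C sketch of my own (not taken from any of the papers mentioned above). Each marked line is the kind of violation a strong memory safety tool would trap at runtime, whereas an ordinary build may silently corrupt memory or appear to work:

    /* Deliberately buggy examples of the memory safety error classes
       discussed above; a checked build should trap each marked line. */
    #include <stdlib.h>

    int main(void) {
        int a[4] = {0, 1, 2, 3};
        int i = 4;
        int x = a[i];     /* out-of-bounds read: valid indices are 0..3 */

        int *p = malloc(sizeof *p);
        *p = x;
        free(p);
        x = *p;           /* use after free */
        free(p);          /* double free */

        int *q = NULL;
        x = *q;           /* null pointer dereference */

        return x;
    }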

Although we commonly use memory safety tools for debugging and testing, these are basically never used in production despite the fact that they would eliminate this broad class of security errors. Today’s topic is: Why not? Here are some guesses:

  • The aggregate cost of all vulnerabilities due to memory unsafety is less than the cost of a 60% (or whatever) slowdown. Although this is clearly the case sometimes, I don’t buy it in general. For example, wouldn’t it be nice for Acrobat Reader to be guaranteed free of all vulnerabilities due to memory safety violations, forever, even if it slows down a little? I would take that tradeoff.
  • Memory safety solutions come with hidden costs not advertised in the papers such as toolchain headaches, instability, false positives, and higher overheads on real workloads. My guess is that these costs are real, but that people may also exaggerate them because we don’t like toolchain changes.
  • People are not very good at estimating the aggregate costs that will be incurred by not-yet-discovered vulnerabilities. My guess is that this effect is also real.

Anyway, I’m interested in this kind of apparent paradox, where a technical problem seems solvable but the solutions aren’t used. There are lots of other examples, such as integer overflows. This seems related to Worse is Better.

The Last Mile for a Verification Problem

A few weeks ago my student Lu Zhao defended his PhD thesis and we were lucky to get his external committee member, Magnus Myreen, to attend in person since he was in the USA anyway for a meeting. While he was out here, Magnus gave a talk on some not-yet-published work that is really cool.

The background for Magnus’s work is L4.verified, a tour de force of formal verification where a microkernel was proved to be correct. Now the first thing you should always ask when anyone says that anything was “proved to be correct” is: What was proved about what, and how? The question is important because historically speaking (and currently, but especially in the past) these words have been used rather loosely. In the case of L4.verified, “proved to be correct” means roughly that:

The implementation of a microkernel in C, whose meaning was determined using a formal semantics for C that was developed by the L4.verified researchers, was proved using Isabelle/HOL to implement a formal specification of the L4.verified kernel. Residual unproved assumptions remain with respect to some inline assembly language and also some unformalized aspects of ARM hardware such as its virtual memory subsystem.

Where could this go wrong? It could be the case that the high-level specification does not formalize what was intended. It could be the case that incorrect assumptions were made about the hardware and about the inline assembly that manipulates it. It could be the case that the C formalization fails to match up with the compiler that is used. This mismatch could be due to an error in the formalization or an error in the compiler. Of course, it is well-known that compilers contain bugs. Finally, the Isabelle/HOL prover could itself be incorrect, though most people would consider this highly unlikely.

Magnus’s work (done in conjunction with Thomas Sewell of NICTA, and currently in progress) is focused on eliminating the dependency on a correct C formalization and C compiler. Of course, L4.verified still has to run on some target machine. Their idea is to replace the C abstract machine (augmented with ARM-specific stuff) with the formal model of the ARM instruction set that was developed at Cambridge at least ten years ago. This model has been used in a variety of projects and is believed to be of high quality.

The most obvious way to eliminate L4.verified’s trusted C compiler is to use CompCert, and in fact Magnus and Thomas tried to do that. This line of work was stopped by incompatibilities between CompCert’s view of the meaning of C and L4.verified’s, and also by practical differences between Isabelle/HOL and Coq — the latter prover being the one used by CompCert. It wasn’t clear to me how much effort would have been required to overcome these problems.

The second approach that Thomas and Magnus tried is decompiling the L4.verified kernel’s ARM object code into math. In other words, they treat each machine code function as a mathematical function. Decompilation was the topic of Magnus’s PhD work and he achieved some very impressive results using it. There are two obvious problems with the decompilation approach: loops and side effects. Loops are simply decompiled into tail recursion, and side-effecting functions can be handled by treating the machine’s memory (or part of it) as an explicit input to and output from each function. Plenty of details are found in Magnus’s papers and in his PhD thesis. The next step is to connect the decompiled functions with their equivalent functions as described by the C semantics, and this is accomplished by some rewriting steps and finally using an SMT solver to do some of the dirty work. Magnus described this last part, connecting the C code with the decompiled code, as working for 65% of the functions in L4.verified. Of course, we might expect that it works for the easiest 65% of functions, so it could be the case that a significant amount of work remains, and my guess is that the last dozen or so functions will turn into serious headaches.
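
To give a flavor of the decompilation idea, here is a rough sketch of my own in C (the real tool produces mathematical function definitions in Isabelle/HOL, not C): a loop becomes a tail-recursive function, and the memory it touches becomes an explicit argument rather than an ambient side effect.

    /* Illustrative only: a simple loop and a hand-written tail-recursive
       counterpart in the style a decompiler might produce. */
    #include <stddef.h>
    #include <stdint.h>

    /* The original loop, as compiled kernel code might look. */
    uint32_t sum_loop(const uint32_t *mem, size_t n) {
        uint32_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += mem[i];
        return acc;
    }

    /* "Decompiled" form: the loop is tail recursion, and the memory being
       read is an explicit input.  A function that also wrote to memory
       would return the updated memory as an explicit output as well. */
    static uint32_t sum_rec(const uint32_t *mem, size_t i, size_t n, uint32_t acc) {
        if (i == n)
            return acc;
        return sum_rec(mem, i + 1, n, acc + mem[i]);   /* tail call */
    }

    uint32_t sum_decompiled(const uint32_t *mem, size_t n) {
        return sum_rec(mem, 0, n, 0);
    }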

I was very curious whether they had so far run into any compiler bugs, and it seems that they have not. On the other hand, this is expected, since a compiler bug will force the equivalence check to fail, meaning that any miscompiled functions are still hiding in the 35% of functions whose equivalence has not yet been verified. My guess is that they will probably not run into any serious compiler bugs while doing this work, for two reasons. First, the L4.verified kernel has already been tested fairly heavily and has been subjected to worst-case execution time analysis at the object code level, so serious miscompilations would likely have been discovered by now. Second, L4.verified is coded in a somewhat careful subset of C that eschews, for example, bitfields in favor of explicit shifts and masks, side-stepping a common area where compiler bugs are found. They might still run into a few fairly harmless miscompilations, such as illegally reordered memory operations where the reordering doesn’t matter much.

Anyway, it was great having Magnus out in Utah and hearing about this work, which represents a nice improvement on an already extremely impressive piece of work.

When Will Software Verification Matter?

When will average people start to use formally verified software? Let me be clear right from the start that I’m not talking about software that has been subjected to static analysis or about software that implements proved-correct algorithms, but rather about software whose functional correctness has been verified at the implementation level in a way that might pass the piano test. As far as I know, such software is not yet in widespread use — if anyone knows differently, please let me know.

The problem isn’t that we cannot formally verify software, the problem is that at present the cost/benefit ratio is too high. Let’s look at these in turn. First, how much does full formal verification (i.e., a machine-checked correctness proof of an actual implementation, using believable specifications) increase the cost of development? For many projects, the multiplier might as well be infinity since we don’t know how to verify something like MS Word. For other projects (most notably L4.verified and CompCert, but there are plenty of other good examples) the cost multiplier is probably well in excess of 10x. Next, how much benefit is gained by formal verification? This is harder to estimate, but let me argue by analogy that the value implicitly assigned to software correctness is very low in most circumstances. For example, compilers (empirically) become more likely to miscompile code as the optimization level is increased. But how much of the software that we use in practice was compiled without optimization? Almost none of it. The 2x or 3x performance gain far outweighs the perceived risk of incorrectness due to compiler defects. The only domain that I am aware of where optimization is routinely disabled for correctness reasons is avionics.

The cost/benefit argument tells us, with a good degree of accuracy, where we will first see mass use of formally verified software. It will be codes that:

  • have very stable interfaces, permitting specification and verification effort to be amortized over time
  • are amenable to formal specification in the first place
  • are relatively small

The codes that come immediately to mind are things like crypto libraries, compression and decompression utilities, and virtual machine monitors. I mean, why hasn’t anyone verified gzip yet? Is it really that hard? What about xmonad, can that be verified? A verified virtual machine monitor is clearly more difficult but this is a very high-value domain, and there are several such projects underway.

In all cases above — hypervisor, gzip, OS kernel, compiler — the question I’m trying to get at isn’t whether the artifacts can be created, because they can. The question is: Will they see widespread use? For this to happen, the verified codes are going to have to be relatively fully-featured (right now, most projects would never choose CompCert over GCC due to missing features) and are going to have to perform on a par with the best non-verified implementations.

What I suspect will happen is that we’ll start to see little islands of verified programs floating in the larger ocean of non-verified code. For example, I’d be happy to drop a proved-correct gzip into my Linux machine as long as it was respectably fast. Ideally, these islands will spread out to encompass more and more security-critical code. I think we are on the cusp of this starting to happen, but progress is going to be slow. I worry that creating these fully-featured, high-performance verified programs is going to require a huge amount of work, but that it will not be particularly novel (so the research community won’t care about it) or commercially viable. These problems go away only if we can dramatically reduce the cost of implementing verified programs, or (less likely!) of verifying implemented programs.

Software Testing Using Function/Inverse Pairs

One of the hardest problems in software testing is finding convenient, strong oracles—programs that can tell us whether the software under test worked correctly. Weak oracles, such as checking to make sure a program doesn’t crash, are always available, but not always of much use. An example of a strong oracle would be a reference implementation.

Sometimes software comes in function/inverse pairs, meaning that we get a program implementing some function f and another that implements f⁻¹. When this happens we can (ideally) make up an input x and then check to make sure that f⁻¹(f(x)) = x. Often this works only in one direction. For example, let’s consider an assembler/disassembler pair. If I take some arbitrary assembly language, assemble it, and then disassemble the resulting object code, I’m unlikely to get exactly what I started with. On the other hand, if I take some arbitrary object code, disassemble it, and then assemble the resulting assembly code, I should get back the same object code. The question is whether or not f or f⁻¹ is prepared to accept inputs in non-canonical forms. This problem can usually be solved by running a second cycle through the system. For example, Jesse Ruderman of jsfunfuzz fame reports taking JavaScript code in non-canonical form, compiling and decompiling it, and then doing the same thing again. If the second and third versions of the program differ, a bug has been found. I don’t know how many of the impressive 1500+ bugs found by jsfunfuzz come from this method.
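
As a concrete example, here is a minimal C sketch of this kind of round-trip test built on zlib’s compress/uncompress pair (one of the pairs listed below); the random input and fixed buffer sizes are just placeholders for whatever a real test harness would generate.

    /* Round-trip test: check that uncompress(compress(x)) == x.
       Build with -lz.  A real harness would try many generated inputs. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    int main(void) {
        unsigned char in[4096], out[4096], comp[8192];
        uLongf clen = sizeof comp, olen = sizeof out;

        srand(1);
        for (size_t i = 0; i < sizeof in; i++)
            in[i] = rand() & 0xff;                           /* arbitrary input x */

        if (compress(comp, &clen, in, sizeof in) != Z_OK)    /* f(x) */
            return 1;
        if (uncompress(out, &olen, comp, clen) != Z_OK)      /* f^-1(f(x)) */
            return 1;

        /* The oracle: the round trip must reproduce the original input. */
        if (olen != sizeof in || memcmp(in, out, sizeof in) != 0) {
            fprintf(stderr, "round-trip mismatch: bug in compress or uncompress\n");
            return 1;
        }
        puts("round trip OK");
        return 0;
    }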

Here’s a list of common function/inverse pairs that I came up with:

  • pickle/unpickle, or save/load, of in-memory data
  • checkpoint/restore of process or virtual machine state
  • assemble/disassemble and compile/decompile
  • encode/decode of data for transmission
  • encrypt/decrypt
  • compress/decompress, either lossless or lossy

The thing I would like to learn from you readers is what other function/inverse pairs you can think of. I feel like I must be missing many.

This topic was suggested by Alastair Reid, who additionally suggested that an assembler/disassembler pair for an architecture such as ARM could be exhaustively tested in this fashion since there are only 2³² possible instruction words. Very cool!
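
For what it’s worth, a sketch of that exhaustive check might look like the following in C; disassemble() and assemble() are hypothetical wrappers around the pair of tools under test, not real APIs.

    /* Exhaustive round-trip check over all 2^32 instruction words.
       disassemble() and assemble() are hypothetical stand-ins for the
       assembler/disassembler being tested. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    bool disassemble(uint32_t word, char *text, size_t len);  /* false if not a valid instruction */
    bool assemble(const char *text, uint32_t *word_out);

    int check_all_instruction_words(void) {
        char text[128];
        uint32_t word = 0;
        int mismatches = 0;
        do {
            if (disassemble(word, text, sizeof text)) {
                uint32_t back;
                if (!assemble(text, &back) || back != word) {
                    printf("mismatch at 0x%08x: %s\n", (unsigned)word, text);
                    mismatches++;
                }
            }
        } while (word++ != UINT32_MAX);   /* visits every 32-bit word exactly once */
        return mismatches;
    }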