Avoidable Failures of Peer Review


This piece is about a specific kind of peer review failure where a paper is rejected despite there being sufficient evidence to warrant acceptance. In other words, all the facts are available but the wrong decision gets made anyway. In my experience this is extremely common at selective computer science conferences. The idea here is to identify and discuss some of the causes of this kind of peer review failure, in order to make them easier to recognize and fight.

Not in the Current Problem Set

In most academic communities, most of the researchers are attacking a fairly small set of problems at any given time. Moving as a herd is good because it increases the problem-solving power of a research community as people compete and build upon each others’ solutions. Drawbacks of this style of research are that the research topic may remain alive well beyond the point of diminishing returns and also the area can suffer mission drift into irrelevant subtopics.

Hot topics can also crowd out other work. At best, papers not on a hot topic are implicitly held to a higher standard because the reviewers must perform a more significant context switch in order to understand the nuances of the problem being solved. Or, to flip that around, the authors are unable to exploit the fact that everyone already considers the problem to be worth attacking and understands not only its structure but also the state of its major solutions.

A pernicious variant of “not in the current problem set” occurs when the paper being reviewed addresses a problem that the community views as solved. It should not come as a shock if I say that we see some non-standard definitions of “solved” in academia. One of my most jaw-dropping moments in the research world (and this is saying a lot) happened at an NSF workshop some years ago where a well-known formal verification researcher got up in front of a group of industrial people, government people, and academics and said more or less: “We have solved all of the key problems. Why aren’t you people adopting our work?” As another example, a researcher working on verified compilation has told me horror stories about reviews that attempt to invalidate his work by denying the existence of compiler bugs.

Solutions:

  • Conferences should actively pursue eclectic work. This is easier said than done, but I often feel that ASPLOS is reasonably eclectic and USENIX ATC used to be.
  • Reviewers need to be sensitive to the “we solved that problem” phenomenon. Of course, in some fraction of cases this reaction is justified — but not always.

Vocal Detractor

Sometimes the reviewers pretty much like a paper — perhaps it even has one or two strong supporters — but one person just wants to kill it. If this person is loud, determined, and forceful this tactic will usually succeed. A frequent vocal detractor will subvert the peer review process. Furthermore, one has to suspect that these people are sometimes politically or ideologically motivated.

Solutions:

  • The committee as a whole (and particularly the chair) needs to create an environment in which abusive behavior is not tolerated. Frequent vocal detractors should not be invited to be on conference committees.
  • Electronic meetings probably make it more difficult to be a vocal detractor since force of personality is blunted by the medium. On the other hand, electronic meetings tend to fragment the discussion, making systematic use of this tactic harder to detect.

Conservatism

Ground-breaking research requires boldness and risk-taking, at least in some cases. The NSF explicitly wants research to be “transformative” as opposed to incremental. It follows that just a bit of boldness and risk-taking is also required on the part of the program committees. But too often, we see exactly the opposite: professors display a huge amount of conservatism in the peer review process. Nobody wants to accept a paper that might be embarrassing or wrong, nor do they want to accept a paper where the authors have displayed a less-than-masterful command of the related work, nor do they want to accept a paper where the evaluation section uses a geometric mean instead of arithmetic, or fails to explain Figure 17 in sufficient detail.

A common rhetorical tactic seen at program committee meetings is for someone to say: “Of course we can’t accept this paper, since two reviewers gave negative recommendations.” This happens so frequently that I have a few stock responses:

  • On the contrary — we can do whatever we like with this paper.
  • Can you tell me again why we all flew out here? To not discuss this paper?
  • Might we consider the possibility that some of these reviewers are inexpert, biased, or wrong?

Although conservatism can come from an individual, just as often it emerges out of group dynamics. Someone makes an offhand remark during the discussion of a paper and suddenly the conversation enters a death spiral from which it is impossible to reach a positive consensus.

Conservatism is rooted in the fact that so much of academia is about status. When playing status games, we fear above all losing face. Conservative decisions minimize the risk of looking like idiots, but of course they also minimize the risk of learning something new.

Would it really be that bad to risk publishing some wrong results, if this kind of risk-taking had other benefits such as accepting a larger amount of interesting new work? I don’t think so. Informal post-publication peer review should be able to deal with these problems — and it’s not like conservative acceptance can catch all instances of bad research.

A sub-problem of conservatism is that papers that fail to conform get rejected. In computer systems, paper submissions from non-PhDs in industry are not uncommon. Usually these papers don’t quite look right and they often get rejected. In a number of cases I’ve seen reviews so blisteringly ugly that nobody in their right mind who received them would ever submit another paper to that community.

Solutions:

  • Accept more papers. At a 10% acceptance rate, conservatism dominates all other considerations since only papers that look perfect will be accepted. At a 40% acceptance rate this is much less likely. Given the trend away from paper conference proceedings, there is little downside.
  • Give each program committee member a limited number of whiteballs: unilateral acceptances for papers that have merit, but that are dying in committee. The OOPSLA 2010 frontmatter has interesting thoughts on this.

Conclusion

Avoidable rejections have two main root causes: human nature and the artificial scarcity of accepted papers at prestigious conferences and journals. Human nature can’t be fixed but it can be worked around, for example using the “unilateral accept” mechanism. Artificial scarcity can be fixed.

I wrote up this piece for two reasons. First, in the last couple of years I have found myself on the losing side of many arguments that a paper should be accepted to a good conference; I’m trying to better understand why this happens. Second, I’m co-chairing an embedded software conference this year and need to figure out how I can keep my program committee from rejecting too many papers that could otherwise be accepted.

Tricking a Whitebox Testcase Generator

The awesome but apparently discontinued Underhanded C Contest invited us to write C code that looks innocent but acts malicious. Today’s post is an alternate take on the same idea: we don’t really care what the malicious code looks like, but it needs to escape detection by an automated whitebox testcase generator. In some respects this is much easier than tricking a person, since whitebox tools are (at best) idiot savants. On the other hand, a number of tried-and-true methods for tricking humans will be stopped cold by this kind of tool.

As a simple example, we’ll start with two solutions to my bit puzzle from last month: the slow reference implementation and a much faster optimized version. We’ll also include a little test harness.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

uint64_t high_common_bits_reference(uint64_t a, uint64_t b)
{
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i = 63; i >= 0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t high_common_bits_portable(uint64_t a, uint64_t b)
{
  uint64_t x = a ^ b;  /* the highest set bit of x marks the first position where a and b differ */
  /* smear that highest set bit into every lower position */
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  /* (a & ~x) keeps the common prefix; (x & ~(x >> 1)) isolates the first differing bit */
  return (a & ~x) | (x & ~(x >> 1));
}

int main(void)
{
  srand(time(NULL));
  int i;
  for (i=0; i < 100000000; i++) {
    uint64_t a = rand64();
    uint64_t b = mutate(a);
    assert (high_common_bits_reference (a,b) ==
            high_common_bits_portable (a,b));
  }
  return 0;
}

The code in main() is just a trivial fuzz tester: rand64() gives uniformly distributed 64-bit random integers and mutate() randomly flips 0..63 random bits in its argument (making both arguments independently random does not do a very good job testing this code).
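
The full source for this example (linked in the notes at the end of this post) presumably defines these helpers; here is one plausible way to write them, as a sketch of my own that may differ from the real definitions:

uint64_t rand64 (void)
{
  /* rand() typically supplies no more than 31 random bits, so combine several calls */
  uint64_t r = 0;
  int i;
  for (i = 0; i < 5; i++) {
    r = (r << 15) ^ (uint64_t)rand ();
  }
  return r;
}

uint64_t mutate (uint64_t x)
{
  /* flip between 0 and 63 randomly chosen bits of the argument */
  int flips = rand () % 64;
  int i;
  for (i = 0; i < flips; i++) {
    x ^= 1ULL << (rand () % 64);
  }
  return x;
}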

Can we add malicious code that this fuzzer will not detect? Of course this is easy, for example we can add this code to the first line of either function:

if (a==0x12345678 && b==0x9abcdef0) return -77;

Why return -77? Let’s assume that the return value from this function (which is guaranteed to be in the range 0..63) is used as an array index, and offset -77 lets us turn an array write into a stack smash.
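
For concreteness, here is the sort of hypothetical caller we have in mind; nothing like it appears in the example source, but it shows why a negative index is so useful to an attacker:

void record_result (int idx, uint64_t v)
{
  uint64_t table[64];
  /* if idx is trusted to be in 0..63, idx == -77 writes 77 slots below
     table[], clobbering whatever else lives on the stack */
  table[idx] = v;
}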

First, will our fuzz tester discover this modification? Nope… even with 100 million random trials, the probability of detection is on the order of 10^-30.

Next, let’s try to discover the modification using Klee, an excellent open source whitebox test case generator. We’ll change the test harness to this:

#include <assert.h>
#include <stdint.h>
#include <klee/klee.h>

int main(void)
{
  uint64_t a,b;
  klee_make_symbolic (&a, sizeof (a), "a");
  klee_make_symbolic (&b, sizeof (b), "b");
  assert (high_common_bits_reference (a,b) == high_common_bits_portable (a,b));
  return 0;
}

As expected, Klee discovers the problem almost immediately:

KLEE: ERROR: /mnt/z/svn/code/bitdiff-klee.c:129: ASSERTION FAIL: high_common_bits_reference (a,b) == high_common_bits_portable (a,b)

How did it find these inputs? The point of a whitebox testcase generator is to take advantage of the structure of the code while generating test cases. So here, it simply asked “What values of the arguments lead to the assertion-failed branch being taken?” The answer, in this case, is trivial.

Do tools like Klee pose a serious problem for our effort to insert a backdoor? Let’s try to stop Klee from seeing our bug. The canonical way to fool a whitebox testcase generator is to exploit the fact that these tools solve hard inverse problems — so we insert a computation that is hard to invert. A checksum computation should do nicely. Our bug code now reads:

if (obfuscate_crc(a)==0xdf590898 && obfuscate_crc(b)==0x970b5f84) return -77;

where:

uint64_t obfuscate_crc (uint64_t x)
{
  return crc32 ((unsigned char *)&x, sizeof (x));
}

crc32() is simply a standard 32-bit checksum function that my stupid WordPress plugin for rendering C code cannot render, but you can find it in the full source code for this example.
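
For reference, here is a minimal bit-at-a-time CRC-32 using the standard reflected polynomial 0xEDB88320. It is not necessarily identical to the one in the full source, so if you substitute it, the magic constants in the bug code above would need to be recomputed:

uint32_t crc32 (const unsigned char *buf, size_t len)
{
  /* bitwise CRC-32; many implementations use a 256-entry table instead */
  uint32_t crc = 0xffffffff;
  size_t i;
  int j;
  for (i = 0; i < len; i++) {
    crc ^= buf[i];
    for (j = 0; j < 8; j++) {
      crc = (crc >> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return ~crc;
}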

Now we run Klee again and — damn! — after about 30 minutes (on my slowish home machine) it finds an input that triggers our buggy code. This is impressive; I did not expect Klee to be able to invert a couple of CRC32 operations. While it’s clear that we could push this line of attack further, and eventually confuse Klee to the point where it gives up, let’s try a different tactic. Our new bug code is:

if (obfuscate_pipe(a)==0x12345678 && obfuscate_pipe(b)==0x9abcdef0) return -77;

where:

#include <stdint.h>
#include <unistd.h>

uint64_t obfuscate_pipe (uint64_t x)
{
  int pipefd[2];
  int res = pipe (pipefd);
  // assert (res==0);
  res = write (pipefd[1], &x, sizeof(x));
  // assert (res==sizeof(x));
  uint64_t ret;
  res = read (pipefd[0], &ret, sizeof(x));
  // assert (res==sizeof(x));
  close (pipefd[0]);
  close (pipefd[1]);
  return ret;
}

This time, rather than trying to overcome Klee using brute force, we’re counting on its inability to follow data through certain system calls. This time we get the desired result: not only does Klee not find anything amiss with our code, but it reports the same number of total paths (65) as it reports for the unmodified code.

A third way to trick Klee is to exploit a weakness in its goal of achieving full path coverage of the system under test. It is not hard to create a code path that is malicious for particular values of variables, but not for other choices. For example, Klee does not see the problem in this code:

typedef uint64_t value_t;   /* stand-in for the real element type, which is defined elsewhere */

int indices[] = { 0, 2, 1, 5, 3, 0, -77, 0, 3, 6 };

value_t storage[7];

void stash (int i, value_t v)
{
  if (i<0 || i>=10) {
     error();
  } else {
    storage[indices[i]] = v;
  }
}

When Klee tests the path containing the array store, it won’t notice the malicious index unless it happens to generate inputs that result in stash() being invoked with i==6 (it doesn’t, at least for me).

I feel certain there are additional ways to fool this kind of tool and will be interested to hear people’s thoughts.

Note that Microsoft does massive whitebox testing of its products at the binary level (Klee, in contrast, operates on LLVM bitcodes). Anyone who wants to put a backdoor into these products had better be covering their tracks carefully using techniques like the ones in this post (of course other strategies, such as ensuring that the affected modules are never tested by SAGE, would seem simpler and more reliable).

Finally, I might be able to come up with a new underhanded C contest where the goal is to fool both humans and a whitebox tester; let me know if that sounds interesting.

[Notes: Source code is available here. While preparing this piece I used this self-contained Klee distribution which is super easy to use and is described in a bit more detail on the Klee “getting started” page.]

Update from Jan 25:

As arrowdodger points out below, Klee should find the problem with the third method in this post. The fact that I was unable to get it to do so stemmed from a silly mistake I made: giving Klee the “--optimize” flag, which runs LLVM’s optimizers, which deleted the array write as dead code. Thanks to the people on the Klee mailing list for setting me straight!

Discovering New Instructions

Sometimes I wonder what instruction sets are supposed to look like. That is, what instructions would there be if computers were redesigned by smart people who understood our fabrication capabilities and who knew what we wanted to accomplish using computers, but who didn’t care about backwards compatibility and who had never seen our architectures? We can get little glimpses into that world by looking at network processors, DSPs, GPUs, ISA extensions like SSE4 and NEON, extensible processors like Tensilica’s, and others. But still, these are too rooted in the past.

Although the machine code emitted by our compilers is inextricably tied to our processors, perhaps this code can still be leveraged to discover new instructions. As a thought experiment, let’s start with a collection of executables whose performance we care about. Preferably, some of these will have been compiled from programs in Haskell, OCaml, and other languages that are not well-represented in today’s benchmark suites. We’ll run these programs in a heavily instrumented execution environment that creates a dynamic dataflow graph for the computation; the excellent Redux paper shows how this can be done. Next, we’ll need to clean up the dataflow graphs. First, we rewrite processor-specific operations (condition code dependencies, CISC instructions, etc.) into a simpler, portable form. Next, we optimize away as much dead, redundant, and vacuous code as possible, including, hopefully, all irrelevancies such as stack frame manipulation, dynamic linking, and garbage collection. The result — perhaps — will be something beautiful: the essence of the original computation, stripped of all sequential constraints, processor-specific operations, and bookkeeping.

Of course this dataflow graph has some limitations. First, it only encodes the meaning of a single execution of the program. Second, it encodes a lot of incidental facts about the computation such as the bitwidths of all integer operations, the specific hashing methods used, etc. We’ll just have to live with these problems. The Redux paper contains a great example where factorial codes written in C, in Haskell, and in a stack machine are shown to all produce basically the same dataflow graph.

So now we have a collection of giant dataflow graphs: one for each execution of each program that we’re interested in. Our goal is to design an instruction set that can compute these dataflow graphs. Trivially, this can be done by partitioning the graphs into very small units of computation corresponding to a RISC instruction set. But that ISA is boring and won’t show any performance wins. To do better we’ll use a search-based strategy to find subgraphs that:

  • occur a large number of times across the collection of dataflow graphs — these are operations that are executed frequently by our workloads
  • contain a lot of parallelism — making them good candidates for hardware acceleration
  • contain a lot of internal symmetry — supporting SIMD-style execution
  • have a small number of dynamic inputs and outputs
  • rely on a small number of constants
  • do not contain dependency chains that are too long — we don’t want to create instructions that are too slow

I think this can be done; none of these properties seems particularly difficult to test for. The major problem necessitating cleverness will be the huge size of the dataflow graphs. We’ll end up with a list of candidate instructions ranked by some fitness function, such as performance or code density. We can build as many of these into our new ISA as we have hardware budget for.
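
As a sketch of what the ranking step might look like (the field names and weights below are entirely made up; only the selection criteria come from the list above):

struct candidate {
  unsigned long occurrences;  /* matches across all of the dataflow graphs */
  unsigned parallelism;       /* independent operations available at once */
  unsigned symmetry;          /* identical sub-operations (SIMD potential) */
  unsigned inputs, outputs;   /* dynamic inputs and outputs */
  unsigned constants;         /* embedded constants */
  unsigned critical_path;     /* longest dependency chain, in operations */
};

double fitness (const struct candidate *c)
{
  /* reward frequent, parallel, symmetric subgraphs; penalize wide
     interfaces, many constants, and long dependency chains */
  double score = (double)c->occurrences * (1 + c->parallelism + c->symmetry);
  score /= 1 + c->inputs + c->outputs + c->constants;
  score /= 1 + c->critical_path;
  return score;
}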

Would this method discover saturating arithmetic instructions when applied to signal processing codes? Would it find clever ways to optimize bounds checks and exception handling in high-level programming languages? It’s possible (though I’d be super disappointed) that the new machines are just incidental variations on existing RISC and CISC designs. If this happened, I would suspect that we had failed to abstract away a sufficient number of processor artifacts. Or, perhaps it was a mistake to compile our computations to an existing machine architecture before building the dataflow graphs. Rather, perhaps we should start with a source-level program and its operational semantics, unrolling it into a dataflow graph without any compilation to machine code. This avoids ties to our existing processors, but also risks coming up with graphs that are very hard to map back onto actual hardware. Of course, many languages don’t even have a real semantics, but researchers are diligently working on that sort of thing. An even more difficult option would be to build up the functional representation of a source program (or executable) without running it, but this has the disadvantage of losing the “profile data” that is built into a dynamic dataflow graph — we’d need to add that in separately.

An aspect of this exercise that I find interesting is that it gives insight into what our processors really do. Many years ago (I don’t have a reference handy, unfortunately) a study showed that computers spend most of their time in sorting algorithms. That cannot be true anymore — but what does the average mobile phone do? What about the average supercomputer chip? The average machine in Google or Amazon’s cloud? Of course we know the answers at a high level, but are there surprises lurking inside the computations? I would expect so — it’s tough to take a close look at a complicated system and not learn something new. Are there new instructions out there, waiting to be discovered, that can help these workloads execute more efficiently? I have to think so, at least for workloads that are not well-represented in Intel’s, AMD’s, and ARM’s benchmark suites.

Wanted: Epitaphs for Hot Topics

Any given research community always has a few hot topics that attract an inordinate number of paper submissions. Sometimes these are flashes in the pan, other times they mature into full-fledged areas having their own workshops and such — but most often they endure for a few years, result in a pile of PhDs, and then slowly die off. After this happens I often find myself with unanswered questions:

  • Were the underlying research problems solved or were they too hard?
  • Did the solutions migrate into practice or are they withering on the academic vine?
  • What lessons were learned?

For example, consider software-based distributed shared memory, where MMU tricks are used to provide a shared address space for a cluster of machines. My former colleague John Carter wrote some of the basic papers in this area and they are hugely cited: more than 2000 citations for just the top four papers (including the patent). Clearly, a lot of followup papers were written. But then, as far as I know, software DSM is not widely used today and certainly it’s not an area where papers are being written. My understanding (which could easily be flawed) is that DSM just abstracted away a few too many things, such as performance and node failures, and this made DSM hard to use compared to explicit messaging. On the other hand, the DSM community created a detailed understanding of memory coherence models and (I believe) this directly fed into the formalization of the Java and C++ memory models. Thus, the practical impact of this research area is large, even if a bit oblique.

A couple more examples: distributed hash tables, and peer-to-peer systems more generally.

My experience is that the people who were there — writing papers about DSM or DHTs or P2P or whatever — always know the answers to my questions above. The problem is, by the time the answers become apparent everyone is working on the next hot topic and going up for tenure or whatever, and has zero time to write a proper retrospective. Furthermore, in many cases an honest retrospective would need to be a bit brutal, for example to indicate which papers really just were not good ideas (of course some of these will have won “best paper” awards). In the old days, these retrospectives would have required a venue willing to publish them, perhaps ACM Computing Surveys or similar, but today they could be uploaded to arXiv. I would totally read and cite these papers if they existed, and would make my students do the same.

Updates:

If you know of a good research epitaph, please send a pointer.

  • Michael Hind’s paper Pointer Analysis: Haven’t We Solved This Problem Yet? isn’t quite what I’m looking for, but it’s not too far off either.
  • David Andersen sent a link to this excellent epitaph for QoS.
  • Bhaskar Krishnamachari sent a pointer to this paper about the state of simulation studies in wireless ad hoc networks; it’s not really an epitaph but serves a similar role — capturing lessons that need to be learned.

It’s All About Interfaces

The Frank system — see also this recent post — is intended to reduce the amount of code needed to create a usable desktop software stack by about 1000x. I’m pretty sure that this goal is sufficient but not necessary. In other words, if we can reduce software size to 0.1% of its current size then that’s great and a lot of current software problems will be far less severe. However, the 1000x reduction is probably unnecessarily difficult — there may be easier ways to achieve similar benefits.

To support my claim that “1000x reduction in software size” is too strong of a goal, consider this question: Would a reasonable person object to running Frank on a modern Intel processor containing more than two billion transistors? In other words, does the gigantic complexity of one of these chips somehow detract from Frank’s simplicity? Of course not. The x86-64 ISA forms a fairly leak-proof abstraction boundary between Frank and all those transistors.[1. Is the ISA exported by a modern Intel chip really leak-proof? Pretty much so as far as the functional semantics goes. However, the abstraction is incredibly leaky with respect to performance — but this only matters for real-time systems and performance-sensitive codes — neither of which is in Frank’s target domain.] I believe the same argument can be made about other interfaces. Let’s say that we have a compiler or an operating system that we trust, perhaps because it has been proved correct. In this case, does it really matter how much code implements the proved-correct interface? I don’t think so.

What we need, then, are more interfaces that are:

  • durable
  • non-leaky
  • beautiful (or, at least, extremely workable)

A durable interface lasts. For example, the UNIX system call interface is reasonably durable — it is often the case that a statically linked UNIX executable will run on many years’ worth of systems, whereas a dynamically linked program rots fairly quickly. x86 has proved to be highly durable.

A non-leaky interface reveals little or nothing about the interface’s implementation. Trivially, a non-leaky interface requires some sort of language-based or OS-based memory isolation — otherwise peeks and pokes (and their unintentional equivalents) cause leaks. Non-leaky interfaces require a lot of error-checking code to ensure that each side adheres to its part of the contract. Performance leaks, alas, are nearly impossible to avoid, unless we give up on implementing optimizations. Bugs are perhaps the ultimate abstraction leak, forcing evil workarounds in code using the interface. Formal verification and automatically-inserted contract checks constitute partial solutions.
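
As a tiny illustration of contract checking at an interface boundary (all of the names here are invented for the example), the interface validates its own preconditions rather than trusting the caller, so mistakes surface as documented errors instead of leaking through as undefined behavior:

#include <errno.h>
#include <stddef.h>
#include <string.h>

struct buffer {
  unsigned char data[4096];
  size_t used;
};

int buffer_write (struct buffer *b, const void *src, size_t len)
{
  if (b == NULL || src == NULL)
    return -EINVAL;                        /* contract: valid pointers */
  if (len > sizeof (b->data) - b->used)
    return -ENOSPC;                        /* contract: enough free space */
  memcpy (b->data + b->used, src, len);    /* trusted internal path */
  b->used += len;
  return 0;
}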

Beautiful interfaces do not expose too much or too little functionality, nor are they more complex than they need to be. I’m not sure that I’ve seen a lot of specific examples of beautiful interfaces lately, but at a slightly more abstract level I would consider regular expressions, database transactions, stream sockets, SMT queries, RISC instruction sets, functional programming languages, and hierarchical filesystems to all have a certain amount of beauty. A lot of work and a lot of iteration is required to create a beautiful interface — so this conflicts somewhat with durability.

The worthwhile research problem, then, is to create interfaces having these properties in order that the code living in between interface layers can be developed, understood, debugged, and maintained in a truly modular fashion. To some extent this is a pipe dream since the “non-leaky” requirement demands both correctness and performance opacity: both extremely hard problems. Another problem with this idea — from the research point of view — is that a grant proposal “to create durable, non-leaky, beautiful interfaces” is completely untenable, nor is it even clear that most of this kind of work belongs in academia. On the other hand, it seems clear that we don’t want to just admit defeat either. If we disregard the people who are purely chasing performance and those who are purely chasing correctness, a substantial sub-theme in computer systems research can be found where people are chasing beautiful interfaces. There are probably a lot of good examples I could give here, but Matthew Flatt (example) and Eddie Kohler (example) come to mind.

In summary, I’ve tried to argue that creating a really small software stack, as in Frank, is a good goal, but not necessarily the very best one.

Fixing Computers

Pretty often, a friend or relative asks if I can fix something that’s wrong with their computer. Because I’m very irritable in some respects, my first impulse is always to say something like “Sure! And I bet a brain surgeon would love to put a band-aid on that ouchie, too.” Usually, I manage to resist. Over the years I’ve found that fixing random PC problems is a surprisingly entertaining form of puzzle solving. Also, when successful, it makes people really happy. This post contains a few things I’ve learned.

Most often the problem is easy to diagnose and also easy to fix, like a flaky fan, a full disk, not enough RAM, flaky RAM, a loose cable or I/O card, or a dead peripheral.

A noisy fan should always be fixed. Even if the noise isn’t especially irritating, it probably stems from a fan that’s about to fail because it’s rubbing against something or its bearings are wearing out. Failed fans lead to nasty secondary consequences, so it’s best to just nip problems in the bud. Identifying the noisy fan is super easy: use a pencil or similar to stop each externally-reachable fan, one at a time. If this fails, crack open the case and repeat for each internal fan.

Some problems are easiest to diagnose by swapping out parts. Unfortunately, most people don’t keep a reserve of spare parts, but even so, common stuff like keyboard and mouse, monitor and video card, DVD drive, etc. are not hard to deal with. When in doubt, throwing a bit of money at the problem often works.

Some problems are very hard to diagnose. One of the worst I’ve seen stemmed from a slightly flaky DIMM and also a slightly flaky CDROM drive. Swapping out either part made the problem better, but did not make it go away.

Some fraction of the things that go wrong with Windows machines — malware, DLL hell, disk corruption, and similar — simply cannot be fixed. Luckily, these problems often only become serious fairly late in a computer’s useful operating life. Thus, people don’t mind too much being told they’re stuck with the problem unless they feel like re-installing the OS. Other Windows problems can be fixed by a mixture of Googling, registry editing, judicious uninstallation of crapware, etc.

It’s best to err on the side of caution, especially for old machines that are held together using duct tape and baling wire. Not too long ago I permanently killed my wife’s primary work machine either by defragmenting the hard disk or by uninstalling a few programs that I suspected were slowing down the boot process. Neither of these actions should have stopped the machine from booting — but one of them did, possibly by exposing some latent HDD corruption. Of course the dead machine was my fault and I had to dig myself out by building her a new, fast machine.

Finally, people are generally miserably poor about backups, and computer problems present an excellent excuse to lecture them on backup discipline. These days it couldn’t be easier to perform proper backups, even if it’s just copying the “my documents” folder to a USB drive or to Dropbox every day or two.

Can Simplicity Scale?

Software has gotten really big, with many systems — even, apparently, cars — running into the hundreds of millions of lines of code. The drawbacks of code bases this large are numerous: they are hard to understand, hard to modify, hard to test, and virtually guaranteed to contain huge numbers of bugs. My understanding is that up to this point, we have survived principally by burying mountains of code under abstraction layers, hiding much of the complexity from the next layer in the software stack. This is fine but it only gets us so far: large codes tend to export quirky, unclean abstractions and of course bugs cause even more abstraction leaks. We should be exploring alternative approaches that might be able to radically reduce the size of our code bases.

This annual report describes an ongoing effort to implement Frank, a full software stack for an interactive desktop machine, using less than 20,000 lines of code. This is a great project and I love the fact that they’ve made an annual report — typically an internal-only document required by a funding agency — publicly available. Another nice aspect of this project is that creating an innovative GUI is explicitly not a goal; they simply want to mimic existing systems using 1000 times fewer lines of code.

The technical approach, in a nutshell, is to design a number of domain-specific programming languages, each suited to concisely expressing some part of the system. For example, consider this text from the report which describes the core of Frank’s windowing system, which is written in the Nile language:

The several dozen standard compositing rules, shading, stroking, gradients, sampling, shadowing, etc.—457 lines in total—were written in Nile and debugged to make a working graphics system, strong enough to be used for Frank, and to do all the graphics required for personal computing (and hence this report).

The Nile code can be found here. What is really being talked about — though the report doesn’t use the term — is executable specifications. Specifications describe how a system should operate; they are written in English, mathematics, pseudocode, etc. and in general they can’t be executed efficiently (or at all). On the other hand, if the domain and the specification language are constrained, it should be possible to create executable specifications. Even in cases where the simple version of the specification cannot be executed efficiently, this approach supports the separation of optimizations from the core specification, making it easier to reason about their correctness.
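
To make the idea concrete with a trivial example of my own (it has nothing to do with Nile or the report): a specification-style population count next to an optimized version, where the optimization is written separately and can be checked against the spec:

#include <assert.h>
#include <stdint.h>

/* specification: count set bits the obvious way */
unsigned popcount_spec (uint64_t x)
{
  unsigned n = 0;
  while (x) {
    n += (unsigned)(x & 1);
    x >>= 1;
  }
  return n;
}

/* optimization, written and reasoned about separately from the spec */
unsigned popcount_fast (uint64_t x)
{
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
  return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* the correctness argument reduces to checking the optimization against the spec */
void check_popcount (uint64_t x)
{
  assert (popcount_spec (x) == popcount_fast (x));
}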

Although I love this research project, at some point we have to ask ourselves if the basic goal — a 1000x reduction in lines of code, or a full system in 20 KLOC — will survive contact with the enemy. To play devil’s advocate for a moment, I suspect that a minimal OS + GUI + applications could be written in 20,000 lines of code even in C++ if the design was careful, the feature set limited, and the implementers talented. To continue with that line of thought: How much of Frank’s elegance will survive performance tuning, porting to many platforms, feature creep, internationalization and related hacks, mediocre developers, security hardening, and all of the other things that happen to software as it is pushed into production and then used by real people to accomplish complex tasks?

My guess is that a significant fraction of Frank’s gains go away in the face of real-world engineering concerns. However, even if we are left with a 10x or 100x reduction in code size, instead of the original 1000x, the exercise is worthwhile. The thing that worries me most about pushing a system like Frank into the real world is that the hidden costs are larger than we might hope. Let’s make an analogy with microkernels. Speaking abstractly, there is much to love about, for example, a 10 KLOC kernel that has been proved correct. However, people whose judgement I trust have said that (1) debugging code running on a microkernel can be extremely hard and (2) creating high performance code requires very talented developers. The first problem stems mainly from concurrency, the second from the fact that in a microkernel-based system, an arbitrary piece of data is no longer a single pointer indirection away. It seems that a Frank-like system is likely to have similar problems. Creating a clean, coherent, and efficient system using a big pile of little languages, some of them highly mathematical, probably requires serious talent. In contrast, if I’m a medium-grade developer and you give me a big, crappy mountain of C++, I can just start hacking and probably I’ll eventually get the desired effect — even if it is fragile and impossible to maintain. This is a classic worse-is-better situation — which has nothing to do with “worse” or “better”, but rather is about the conflict between two value systems: one which prizes cleanliness and clarity, the other which values pragmatic code that works efficiently in practice and can be easily evolved to solve new problems.

In summary, I desperately hope that we won’t be seeing, in a few years, 50 billion lines of code on the desktop and in the automobile. This future seems all too plausible and we should be actively working to avoid it. Approaches such as Frank are promising. The Racket project has a significant amount of overlap with Frank (though its top-level goals are different, or at least differently stated); see for example the recent CACM article by my colleague Matthew Flatt.

NSF Data Management Plans

As of a year ago, all grant proposals submitted to NSF must be accompanied by a data management plan. Basically, the PIs must explain:

  • how sensitive data (for example, data that contains personal information about experimental subjects) will be managed
  • how data resulting from the research will be archived and how access to it will be preserved

The requirements seem reasonable and certainly people asking for public funding should have thought about these issues. The problem is that little guidance is given about what constitutes an appropriate plan, or how the quality of the plan will play into the evaluation of the proposal. Since I (happily!) do not work with sensitive data, this post will deal primarily with the second item — preserving public access to data.

The NSF policy on Dissemination and Sharing of Research Results makes it clear that researchers are expected to share the results of their research in three ways. First, research results must be published. Second, peripheral data must be shared “at no more than incremental cost and within a reasonable time.” Third, NSF grantees are permitted to “retain principal legal rights to intellectual property developed under NSF grants to provide incentives for development and dissemination of inventions, software and publications that can enhance their usefulness, accessibility and upkeep.” In other words, it is permissible to create proprietary products based on technologies developed using NSF money.

It is the second item in the previous paragraph that concerns us here, and which seems to be a main focus of the data management plan. The problems that PIs need to solve when writing this plan include determining:

  • How much data will need to be archived? A few megabytes of code or a few petabytes of imaging data?
  • Over what time period must data be stored? The duration of the grant or 75 years?
  • What kind of failures must be tolerated? The failure of a single disk or tape? A major, region-wide disaster (like the one Salt Lake City is certain to get, eventually)?
  • In the absence of disasters, how many bit flips in archived data are permissible per year?

Of course the cost of implementing such a plan might vary over many orders of magnitude depending on the answers. Furthermore, there is the difficult question of how to pay for archiving services after the grant expires. I’m personally a bit cynical about the ability of an educational organization to make a funding commitment that will weather unforeseeable financial problems and changes in administration. Also, I’m guessing that data management plans along the lines of “we’ll mail a couple of TB disks to our buddies across the country” aren’t going to fly anymore.

For large-scale data, a serious implementation of the NSF’s requirements seems to be beyond the capabilities of most individual PIs and probably also out of reach of smallish organizations like academic departments. Rather, it will make sense to pool resources at least across an entire institution and probably across several of them. Here’s an example where this is happening for political and social research. Here’s some guidance provided by my institution, which includes the Uspace repository. I don’t believe they are (yet) equipped to handle very large data sets, and I’d guess that similar repositories at other institutions are in the same boat.

This post was motivated by the fact that researchers like me, who haven’t ever had to worry about data management — because we don’t produce much data — now need to care about it, at least to the level of producing a plausible plan. This has led to a fair amount of confusion among people I’ve talked to. I’d be interested to hear what’s going on at other US CS departments to address this.