How to Evaluate a Computer Systems Research Paper

Some excellent resources exist about how to write a good systems paper. This post is about a slightly different topic.

In a typical recent year I review about 100 papers, mostly conference papers 8-14 pages long in 9 or 10 point font. People in similar positions — mid-career computer systems professors — are generally in the same situation and some have it worse. Since a good review takes three or four hours to write, it’s important to develop some shortcuts. Many of mine take the form of “sniff tests”: ways to rapidly discover that a paper contains bogus or useless results. Perhaps one third of papers that I review fall into this category. If I can save time by writing relatively brief reviews of them, then I can spend more time reviewing the marginal-to-good papers: these are the ones that stand to benefit most from detailed feedback. The best papers, like the worst, require relatively little effort to review.

I almost always skip the abstract. For one thing, I’ve probably already read it when deciding how to rank the paper in my reviewing preferences. Additionally, the introduction to the paper almost always contains the same information but with more motivation and background.

The first thing I do when evaluating a paper is to read its conclusion. Authors are almost invariably more truthful about their contributions in the conclusion than in the abstract and introduction. Why? First, people often write their papers front-to-back, and usually the introduction gets written before the final results have arrived. The authors are still optimistic. Second, it is simply psychologically harder to oversell the results of paper in the conclusion, where the authors are aware that the reader has just finished reading a perhaps not totally conclusive evaluation section.

The most important questions to ask when reading a paper are: Does it contain a new idea? A useful idea? Both can be hard to answer.

One complication in evaluating novelty is that the actual contribution of a paper is often different than the one(s) stated in the paper. Most of the time this is not due to any deliberate misrepresentation by the authors. Rather, people usually emphasize what they thought was hard or fun about the work, neglecting the fact that many hard and fun — and often substantially similar — research projects have been done in the past. Also, the authors’ perception of their contributions tend to be heavily colored by their previous work and their backgrounds. Furthermore, sometimes the actual contributions of a piece of work become clear only years after its initial publication.

Another problem is that evaluating the novelty of an idea requires a huge breadth of knowledge covering many thousands of papers in related, and not-obviously-related, research areas. It is not uncommon for zero or one of the people evaluating a paper submitted to a conference to have a really solid idea about where the paper fits into the literature. Complicating matters, many people submit papers to a venue that they know and like, where it will be evaluated by people they know and like, even if these are not the most appropriate people to be evaluating the paper. Changing research sub-areas seems to be much easier and more appealing than changing communities.

To show that an idea is useful, it is customary to evaluate it analytically and/or experimentally. Experimental results come in many sub-flavors, including those based on simulation and implementation. Regardless of the actual technique, there are many sniff-tests that should be applied to any computer systems paper.

For analytical results: Does the result make sense? Is it grounded in reality? Does it tell us anything new? Are the theorems and lemmas actually formal or are they “pseudo-formal”: written in the formal style but lacking key definitions and steps in reasoning? Would a theoretician with the appropriate background find the results useful or interesting?

For simulation results: Did the authors have to use simulation or were they simply too lazy to find a better evaluation method? Are the simulation parameters realistic? Does the simulation tell us anything new? I once reviewed a paper that presented a collection of analytical results and then a collection of simulation results that were based on an exact implementation of the analytical model. Of course, the “experimental” results matched the predicted results nearly perfectly (and trivially). Another paper that I once reviewed performed its evaluation on a simulation of a large multiprocessor computer where the parameters and workload were chosen in such a way that the aggregate throughput of the multiprocessor was around one instruction per cycle. Of course any conclusions reached from this kind of simulation are useless.

For experimental results: Did the authors measure the right quantities? Were appropriate tests of statistical significance used? (In computer systems, confidence intervals are quite avant-garde.) Is the measured effect robust? Is the baseline a sensible one? A commonly used trick is to compare the new work against an obsolete or otherwise obviously defective baseline, rather than the state of the art. Another common trick is to report on the degree of improvement offered by a new technique, but to omit the absolute numbers.

One sniff test I like to apply to a piece of research is to ask the following questions. First, what percent of actual systems cannot benefit from the proposed technique no matter what? Second, what percent of systems get the benefit of the proposed technique without any special effort? Third, what percent of systems are left? Perhaps surprisingly, a lot of research fails this trivial test. As an example, let’s suppose I’m proposing a new CPU scheduling technique for desktop Linux/Windows/MacOS boxes (this is not a random example: my PhD thesis was basically about this). First we ask: what class of systems cannot benefit from the proposed technique? There are several answers, but probably the most obvious one is overloaded machines: those that cannot finish their workloads (displaying video frames, decoding audio chunks, etc.) on time regardless of scheduling discipline. Second, we ask: what class of systems gets the benefits of good scheduling without a good scheduler? The answer is: those with low CPU loads. If the average length of the run queue is not greater than one, then all work-conserving scheduling algorithms are equivalent. Finally, we ask: which systems are left? That is, who actually benefits from the smart new scheduler? The answer is: systems whose load is in a fairly narrow range between too low and too high. The narrowness of this band, and the presence of techniques for getting into the underload region (for example manually reducing the degree of multiprogramming), were the ultimate reasons why I stopped working on scheduling problems.

Other common problems include: A solution to a problem that comes with significant, unavoidable costs that are not discussed. Solutions that cannot scale to the situations they are meant to address. Improvements that are far too small to be significant. Techniques that exploit the same source of benefit as an existing technique, but that are more complicated or are otherwise inferior.

Looking back over this post I see that it could be read as being very cynical: just looking for ways to reject papers. But the fact is, if the research community is unwilling to self-censor, then the pushback has to come from somewhere else. It’s also worth noting that the interaction between the relentless pressure to publish and program committees full of people like me has resulted in an incredible proliferation of conferences and workshops. To a limited extent, this kind of community diversification and evolution is good, it helps the field adapt to changes. On the other hand, most people would agree that things have gone too far. But this is a subject for another post.

There is plenty of excellent systems research being done. Fantastic new ideas are changing how systems are built and making it possible to build new kinds of computer systems. The problem, on the other hand, is that the research community produces a lot of mediocre and useless work (I have produced some myself). Lacking a marketplace, we rely on humans to evaluate both the good work and the bad, and those of us who have to evaluate a broad cross-section of research results need to adapt our strategies in order to make effective use of time. In summary, Sturgeon’s Law applies.