Some Good and Bad Things About Academia

Partially in response to Matt and Daniel's posts, I wanted to list a few things I like about being a professor:

  1. It’s a good match for my short attention span. If I get interested in something new, I can drop everything and work on it for a while. When I get tired of a project, I can just drop it (subject to constraints imposed by students and ongoing grants).
  2. As long as I basically do my job, nobody can tell me what to do, or would even think of trying to.
  3. There are few secrets. Within certain broad limits, I can tell anyone about anything I’m working on, release anything I want as open source code, and blog about whatever I like.
  4. I can be annoying, for example by airing dissenting opinions or pointing out flaws in popular systems. These things are harder to do when your employer has to maintain good relations with customers and otherwise keep making money.
  5. Writing papers is at the core of the job, and I enjoy doing this.
  6. Basically, I'm being paid to be smart and to know a lot of stuff; how great is that?
  7. Dealing with students, both in courses and in the advisor/advisee relationship, is fun.
  8. I can work on things that nobody would pay me to do, if I think they advance the field or serve the greater good.

Of course it’s not all roses. Here are some things I dislike:

  1. The capriciousness of the peer review system is tiresome.
  2. I’m constantly spread too thin, and do a bad job at things because of this.
  3. Dealing with students is not fun when they are apathetic, entitled, dishonest, or inept.
  4. The incentives in the research funding system are screwed up.
  5. Sometimes I have to go to meetings that bore me.
  6. I end up spending quite a bit of time being an administrator and manager, and I’m not good at these things.
  7. Pre-tenure, the incentives are set up to strongly reward work that has a reliable, short-term payoff, which is precisely not what professors should be working on.

Reviewers Get Worse Over Time

This is a depressing paper. The study shows that:

92% of peer reviewers deteriorated during 14 years of study in the quality and usefulness of their reviews (as judged by editors at the time of decision)

The quality of reviews that I write has definitely decreased over the 12 or so years that I've been doing it. The main factor is time: I'm far busier than I used to be and I'm asked to review many more papers, grant proposals, etc. There is just not enough time anymore to write a pages-long discussion of the pros and cons of a paper. Ennui is also a factor: there are certain classes of bad paper that I've seen so many times that I can't be bothered to write a detailed account of why they're bad. Additionally, senior people often write bad reviews because they farm them out to very inexperienced students, postdocs, etc., and then fail to rewrite the review themselves (for some reason, students often write the harshest reviews). Finally, as people become more senior, they can start to lose touch with the technical details.

A reasonable response to this trend is to ensure that younger researchers make up a significant fraction of review committees. I deliberately did this the last time I formed a program committee. This has the additional benefit of giving younger researchers experience and visibility that they require at early career stages.

Computer Systems Evaluation Mistake #1

Tonight I wrote this in a paper review:

The results section of this paper contains what is probably the most elementary and most annoying flaw that could possibly plague a computer systems paper.  All of the numbers are relative and we’re never told the absolute values.  It’s great that you can achieve a 30% reduction in overhead, but it matters whether this is 30% of 0.1% or 30% of 10000%.  I have to recommend against acceptance of a paper that does this.

Enough said.

Independence in N-Version Programming

N-version programming is a way to reduce the impact of software bugs by independently implementing the same specification N times, running the implementations in parallel, and using voting to heuristically choose the right answer. A key weakness of N-version programming is that the increased reliability is predicated on faults occurring independently. As Knight and Leveson famously showed in 1986, the assumption of independence can easily be unjustified.
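The voting step can be sketched in a few lines. This is only a toy illustration: the function names and the deliberately buggy third version are invented for the example, not taken from any real system.

```python
from collections import Counter

def n_version_vote(implementations, input_value):
    """Run every implementation on the same input and return the
    majority answer, or None if no strict majority exists."""
    outputs = [impl(input_value) for impl in implementations]
    answer, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return answer
    return None  # no majority: the vote is inconclusive

# Hypothetical 3-version setup: two correct implementations of
# absolute value and one buggy one that mishandles negative inputs.
def v1(x): return abs(x)
def v2(x): return x if x >= 0 else -x
def v3(x): return x          # buggy: forgets to negate

print(n_version_vote([v1, v2, v3], -5))  # the two correct versions outvote the buggy one
```

Note that the scheme only helps when the faulty minority disagrees with the correct majority; if two of the three versions share a bug, the vote confidently returns the wrong answer, which is exactly the independence assumption at issue.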

Aside from the Knight-Leveson problem, N-version programming is also expensive since the entire development process needs to be replicated. But what if the replicas already exist? Take the example of C compilers: dozens of independent implementations have been created. If I were developing a safety-critical or mission-critical embedded system where compiler bugs were a major source of concern, I could compile my source code using N different compilers (targeting N different architectures if desired) and thus gain some margin with respect to compiler bugs. Of course, my system would incur added hardware costs, but triple modular redundancy has long been an accepted way to gain reliability in avionics and space systems.

We’re left with the question: Are compiler flaws independent? My group has spent the better part of three years finding and reporting 300 bugs to various C compiler development teams. We do this using randomized differential testing: create a random C program, compile it using N compilers, run their outputs, and compare the results. So far, we have never seen a case where even two independent compilers (let alone a majority) have produced the same wrong result for a test case.  The intuitive reason for this is that the kind of compiler bugs we find are typically deep inside the optimizer. At this level, independent implementations really are independent: GCC is operating on Gimple or RTL, LLVM is operating on LLVM IR, etc. The problems being solved here don’t relate so much to the C specification, but instead to more fundamental issues like preserving the meaning of a piece of code. My guess is that for experiments like Knight and Leveson’s, more of the complexity of implementations was at the surface level, having direct relevance to the specification, and therefore errors in implementing the specification were more likely to go wrong in similar ways.
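The testing loop described above (generate a random input, run it through every implementation, compare the results) can be sketched with stand-in "compilers". Everything here, including the injected bug and the names, is hypothetical; real differential testing of compilers would invoke actual toolchains on generated C programs.

```python
import random
from collections import defaultdict

def differential_test(compilers, generate_case, trials=1000):
    """Randomized differential testing sketch: generate a random test
    case, run it through every implementation, and record any case
    where the implementations disagree. 'compilers' maps a name to a
    function standing in for compile-and-run."""
    disagreements = []
    for _ in range(trials):
        case = generate_case()
        # Partition implementations into equivalence classes by output.
        results = defaultdict(list)
        for name, run in compilers.items():
            results[run(case)].append(name)
        if len(results) > 1:  # at least two distinct answers
            disagreements.append((case, dict(results)))
    return disagreements

# Hypothetical stand-ins: three "compilers" computing x*2; one has an
# injected bug that triggers only on large inputs.
impls = {
    "cc-a": lambda x: x * 2,
    "cc-b": lambda x: x + x,
    "cc-c": lambda x: x * 2 if x < 100 else 0,  # injected bug
}
bad = differential_test(impls, lambda: random.randint(0, 127))
# Every disagreement isolates cc-c in its own equivalence class,
# mirroring how a lone wrong result fingers a single buggy compiler.
```

A majority-correct disagreement like this is the common case we observe; a correlated failure would instead show up as two or more compilers sharing the wrong equivalence class.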

It is entirely possible that there exist correlated failures in implementations of tricky and lightly-used parts of the C standard like those relating to variadic functions or FP rounding modes. We don’t test these. In earlier work we tested implementations of the volatile qualifier and found that all compilers get this wrong sometimes; unfortunately we didn’t spend any time looking for correlated failures when doing that work.

When we started testing compilers, I half-expected that we would find correlated failures among compilers because some authoritative source such as the Dragon Book or the CCP paper contained a subtly flawed description of a key algorithm, and everyone just implemented it as described. But so far, no dice. In C++ compilers, the EDG frontend is so widely used that it would seem to create an important opportunity for non-independent failures. On the other hand, it is reputed to be high quality and also there are enough non-EDG frontends out there that sufficient cross-checking power is probably available.

Compiler Bug-Finding Project Milestones

Despite the Fall semester being crappy for me (double course load, and I keep getting sick), this project has made good progress:

  • Xuejun gave a talk at the LLVM Developer’s Meeting
  • We submitted our 300th compiler bug report (up from 200 in March 2010)
  • We submitted a paper to PLDI 2011 summarizing our methods and results; I’m keeping fingers crossed that I won’t have to burn an effigy of the dreaded reviewer 3

The next milestone will be to release Csmith, our test-case generator, as open source, hopefully before Christmas.

Computer Science Research Proposal

Last night I sent a link to this article to my colleagues. It’s a really entertaining essay, though I agree with several of the commenters that some of the details do not ring true. Someone responded with a link to this page, which says:

Our writers staff on, that specialize at computer science can provide you with custom written computer science research proposal, that exactly match your expectations.

However sweet the promise of ending the proposal-writing problems that plague faculty and students alike, this sentence is too riddled with errors to give us much confidence in the proposal-writing service. But then again, a lot of funded CS research proposals also display signs of extremely hasty proofreading.

Counting Compiler Crashes

This is a bit of material from my GCC Summit talk that I thought was worth bringing out in a separate post.

As an experiment, we compiled 1,000,000 randomly generated C programs using GCC 3.[0-4].0, GCC 4.[0-5].0, LLVM-GCC 1.9, LLVM-GCC 2.[0-8], and Clang 2.[6-8]. All compilers targeted x86 and were given the -O0, -O1, -O2, -Os, and -O3 options. We then asked: in how many distinct ways does each compiler crash? We count crash bugs by looking for unique “assertion failure” strings in LLVM output and “internal compiler error” strings in GCC output. This is conservative because typically a compiler will also have a number of crashes due to null pointer dereferences and other memory safety violations, and we don’t try to tell these apart. Here are the three crash messages we got from Clang 2.8:

Statement.cpp:944: void Statement::post_creation_analysis(std::vector<const Fact*, std::allocator<const Fact*> >&, CGContext&) const: Assertion `0' failed.

StatementIf.cpp:81: static StatementIf* StatementIf::make_random(CGContext&): Assertion `ok' failed.

Block.cpp:512: bool Block::find_fixed_point(std::vector<const Fact*, std::allocator<const Fact*> >, std::vector<const Fact*, std::allocator<const Fact*> >&, CGContext&, int&, bool) const: Assertion `0' failed.
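The counting step just described (look for assertion-failure and internal-compiler-error lines, then keep the distinct ones) might look like this sketch. The sample lines are abbreviated versions of the messages above plus an invented GCC-style ICE; a real harness would scan captured compiler stderr instead of a string literal.

```python
import re

# Hypothetical captured compiler output: two duplicated crashes and
# three distinct ones (messages abbreviated for the example).
compiler_output = """
Statement.cpp:944: void Statement::post_creation_analysis(...): Assertion `0' failed.
StatementIf.cpp:81: static StatementIf* StatementIf::make_random(CGContext&): Assertion `ok' failed.
Statement.cpp:944: void Statement::post_creation_analysis(...): Assertion `0' failed.
internal compiler error: in build2_stat, at tree.c:3110
internal compiler error: in build2_stat, at tree.c:3110
"""

# One pattern per compiler family: LLVM-style assertion failures and
# GCC-style internal compiler errors.
patterns = [
    re.compile(r"^.*Assertion `.*' failed\.$"),
    re.compile(r"^internal compiler error:.*$"),
]

# Deduplicate: each distinct matching line counts as one crash bug.
signatures = set()
for line in compiler_output.splitlines():
    for pat in patterns:
        if pat.match(line.strip()):
            signatures.add(line.strip())

print(len(signatures))  # 3 distinct crash signatures from 5 crash lines
```

This is why the count is conservative: a segfault produces no matching line at all, so it contributes no signature even though it is certainly a bug.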

Here are the GCC results:

And here are the LLVM-GCC results:

The thing I really like about these results is the way they show that the latest versions of GCC and LLVM are really solid. Both teams have put a tremendous amount of work into their tools.

The “bugs fixed” annotations in the graphs refer to fixes to bugs that we found (using random testing) and reported. We had hoped to establish some sort of nice causal link between fixing these bugs and improving the resistance of compilers to crashing, but these graphs are a long way from showing anything like that. There is just a lot going on in each of these projects — the experiment is too uncontrolled to give us any kind of solid evidence.

Are compiler crashes good or bad? Certainly they’re a pain when you’re just trying to get work done, but overall they’re good. First, they can almost always be worked around by changing the optimization level. Second, a crash is the compiler’s way of failing fast, which is good. If that assertion hadn’t been there, it’s entirely possible the compiler would have run to completion, generating incorrect code.

Update from 11/6/2010: Here are the graphs showing rate of crashing, as opposed to number of distinct crash bugs, that Chris asked about in a comment.

Although it’s not totally clear what to read into these numbers, they do seem to tell a good story. GCC steadily improves during the 3.x series, regresses following the major changes that went into 4.0.0, then (more or less) steadily improves again. Similarly, LLVM starts off with a rather high rate of crashes and improves over time. Both compilers are exceptionally solid in their latest releases (note the log scale: both compilers have reduced their crash rates by at least 3 orders of magnitude).