How Good Does The Writing Need To Be?

One of the hard parts about reviewing for a conference is trying to rank imperfect papers. Does the novel paper without an evaluation get ranked higher than the incremental work with beautiful experiments? Does the interesting paper in a near-dead area get ranked above the borderline paper that is attacking an important problem?

Over the weekend I reviewed a dozen conference submissions that ended up being quite a bit more interesting than I had feared, but a good 5 or 6 of them had presentation problems. One paper described strong work and was well written in all respects except that it contained literally 50 or 60 errors of this general form:

One of the major reason for…

Obviously this hurts the paper, but how much? The larger context is that although English is the dominant language for scientific publishing, papers are increasingly coming from all over the world. More and more, none of the authors has English as a first language and it’s easy to imagine circumstances where it’s hard for the authors to find someone to do a solid proofreading pass. Having lived overseas for several years when I was younger, I can sympathize with how difficult it is to operate in a foreign language.

The position I’ve arrived at is to try to get a paper rejected when the level of presentation is so poor that it seriously interferes with the technical message, but to let errors slide otherwise. Ideally, I’d also include a detailed list of grammatical mistakes in my review, but in practice this is prohibitively time-consuming when papers contain many dozens of errors.

The 5+5 Commandments of a Ph.D.

[This post is co-authored and co-posted with my colleagues Matt and Suresh. Comments should all go to Suresh’s version.]

There have been a lot of Ph.D.-bashing articles lately. There have been some spirited defenses of a Ph.D. too. Most of these articles make good observations, but they’re often about the larger Ph.D. ecosystem and therefore fail to provide actionable advice to (potential) Ph.D. students.

We observe that most failures of the Ph.D. system — including both failure to get the degree and failure to see a good return on time and money invested in obtaining the degree — boil down to a small set of root causes. These causes are on both sides of the implicit contract between advisor and advisee. Here’s our pragmatic view of the conditions that need to be met for a Ph.D. to make sense. (Please keep in mind that we’re all computer science professors, though we’ve made an effort to avoid field-specificity.)

The advisor shall…

  1. Advise the student: help find a thesis topic, teach how to do research, write papers, give talks, etc.
  2. Provide protection from and information about funding concerns (to the level of expectations of the field, which vary widely).
  3. Proactively provide realistic, honest advice about post-Ph.D. career prospects.
  4. Provide early and clear guidance about the time frames and conditions for graduation.
  5. Introduce the student to the academic community, through conference talks, invited talks, letters of recommendation, etc.

The student shall…

  1. As early as possible, do due diligence in researching career prospects. It’s not hard to get people to talk about this and there’s also plenty of written advice, in books and on the web. Carefully filter what you read since the situations may be very different between engineering fields, science fields, and the humanities. There may also be significant differences between sub-fields such as theoretical computer science vs. operating systems. A new student should glance at job postings and NSF statistics to determine the ratio of new Ph.D.s to open tenure-track slots.
  2. As early as possible, determine if the actual career prospects are a reasonable match for her needs/expectations. Until the student makes her expectations clear, the advisor has no clue whether she simply must have an academic job or whether she'll be perfectly happy getting a Ph.D. and then going to law school or being a stay-at-home parent.
  3. Not be deluded or blinded by catchphrases like “life of the mind.” Indeed, this life does exist, but probably only during the ABD portion of a Ph.D. A professor would be extremely lucky to live the life of the mind 15 hours a week, with the other 60 hours going to advising, teaching, reviewing, writing grant proposals, traveling, and sitting in meetings.
  4. Be a good investment in terms of time and money. In other words, work hard. Students who periodically disappear for long bouts of skiing, soul searching, or contract work tend to be put on the back burner by their advisor, making it much more difficult to get re-engaged later on. An easy litmus test: if acting a certain way would get you fired from a real job, then it’s probably a bad idea to try that in a Ph.D. program too.
  5. Jump through the administrative hoops appropriately. The hurdles are important and generally not too burdensome: take some classes, do a qualifying exam, write a proposal, and so on. These are easy to ignore until they become a problem. Your advisor is not likely to remind you, or even remember that you need to do them.

Since nothing is obvious on the Internet, a disclaimer: These edicts might come across as cold and overly pragmatic, and might suggest that we are ignoring the joy of discovery, the thrill of learning, and the excitement of doing cutting-edge research that go along with doing a Ph.D. Far from it: we’ve chosen this life because we experience all of this and enjoy it. But the easiest way to crash and burn in what is a long, multi-year haul is to forget about the brass tacks and float in the clouds.

Gear for Getting Outside in Winter

Since I’m not a big skier, I didn’t get out a lot in winter during my first few years in Utah. When springtime came, I would just tough it out and get into shape the hard way — go on a few long hikes and suffer appropriately. This worked because I was in my 20s. Now that I’m close to 40, that strategy works very poorly so I try to stay in shape all year. This means having some reasonable winter clothes, which is what this post is about.

The basic ideas behind putting together a good collection of clothing on a modest budget are:

  • Waterproof items are expensive and almost never necessary. The main exceptions are hard rain (rare in Utah) or spending a lot of time wallowing in wet snow (rare outside of self-arrest practice sessions).
  • Items should be multi-purpose. Many pieces of outdoor gear are things a person would want to own anyway, just to shovel the driveway or whatever.
  • Everything can be found for at least 30% off retail price (and often quite a bit more) if you are patient.

I’m posting this since it took me a while to come up with a system that’s warm, comfortable, and durable.

Basics

These are all you need down to about 5°F, as long as you keep moving and there’s little wind. More layers are needed (see below) to provide margin against getting lost or hurt, to stay comfortable over a lunch break, or when there’s weather.

Pants

Soft shell pants are awesome: stretchy, warm, and breathable. The Mammut Champ pants I’ve used for about three years are not showing much wear and are comfortable in a wide variety of conditions. They seem to be a great choice.

Top

Layering is not needed: a single breathable, windproof, stretchy top is what you want. My top is a Sporthill “zone 3” model and it is excellent, though getting threadbare after 4 years of use. Many other companies make similar gear. The idea is to keep it simple: the shirt is a pullover with no pockets, hood, or any of that.

Socks and Boots

Smartwool socks are my favorite, and not too hard to find on sale. Otherwise, anything that isn’t cotton should be fine. Boots don’t seem to matter that much; you just want decent treads. I’m not partial to Gore-Tex boots, but they’re hard to avoid unless you go to heavy leather boots.

Gaiters

Everyone in the Midwest and Northeast of the US has a hat and gloves, but most people who visit us from those regions have never heard of gaiters. I don’t understand this because gaiters are the best thing ever: they keep snow (and sticks and leaves) out of your boots and they keep your legs warm. I’d rather go snowshoeing in tennis shoes and gaiters than in a good pair of boots without gaiters.

Gloves and Hat

The “windstopper” kind of fleece seems to work well for a light hat and gloves. Gloves last a few years and hats last until you lose them.

Extras

For trips longer than an hour or two, or for non-benign winter weather, some extra gear is nice.

Balaclava

Even with a good hat and temperatures not much below freezing, windy weather can easily lead to numb cheeks, neck, and chin. A balaclava solves this problem; in winter I almost always hike/run with a Smartwool balaclava in my pocket; it weighs almost nothing and is really warm and comfortable. I also have a heavy storm balaclava but don’t use it that often.

Long Underwear

Backpacking in crappy weather taught me to love wool long underwear: it stays comfortable and doesn’t get (extremely) smelly after wearing it continuously for a week. Mine is Smartwool but there are several other brands like Ibex and Icebreaker that are probably just as good. This stuff is so comfortable I wear it around the house in winter.

Headlamp

This is a good idea for winter day hikes, and mandatory for evening hiking. I used to use a small headlamp with 3 AAA batteries inside the unit; these are awesome for emergency use and for backpacking trips, but don’t provide enough light to illuminate a trail if you’re moving fast. My current headlamp has an external battery pack, which is annoying, but it’s far brighter and can be used for trail running at night.

Poles

I usually hike with poles in winter; they provide extra points of balance on icy hillsides and make it much easier to self-extricate from a snowbank.

Shell Pants

As I said, these are seldom necessary unless it’s very windy or rainy. But still, it’s nice to have a pair. I bought cheapo Gore-Tex pants about 10 years ago and they still work.

Puffy Jacket

A puffy jacket is great to pull on as soon as you stop moving. Mine is a Montbell UL Thermawrap, which I love because it’s light and small and fairly close-fitting, so it layers well under other stuff. Since it’s thin, it’s not very warm by itself.

Shell or Soft-Shell Jacket

One of these is nice for the crappiest conditions.

Serious Hand Protection

You definitely want some gloves heavier than the light ones mentioned above. I also have a pair of pullover mittens, mainly for insurance. You can’t actually do anything with your hands while wearing the damn things, but I’m pretty sure that (pulled over warm gloves) they’d stave off frostbite in most conditions found in the lower 48.

The Future of Software System Correctness

A few weeks ago I re-read Tanenbaum et al.’s 2006 article Can We Make Operating Systems Reliable and Secure? They begin by observing that it would be nice if our general-purpose operating systems were as reliable as our cars and televisions. Unfortunately, Tanenbaum’s vision is being realized in the worst way: as the amount of software in cars and televisions increases, these products are becoming far more prone to malfunctions resulting from software errors. This raises the question: can we create a future where large, important, software-intensive systems work?

Let’s briefly look at some non-solutions to the software problem. First, we can’t just prove everything to be correct. This is way too expensive and most real systems lack formal specifications. At present, it is not even clear that the correct behavior of large systems can be formalized at all, though hopefully this will be possible someday (exercise for the reader: formalize Asimov’s Three Laws in HOL, Coq, or a similar language). Another non-route to safe software is pervasive use of hardware interlocks to prevent decisions made by software from doing damage. This is not going to happen because interlocks are too expensive and in many cases humans are an inappropriate fallback, for example because they are too slow or not in the right place at the right time. There’s no silver-bullet programming language, certainly; nor does traditional software engineering have the answers. A final non-solution would be to reduce our reliance on software systems. For reasons I think we do not fully understand, the universe wants lots of computation to be embedded in it, and we are going to put it there.

Given that all non-dystopian futures for the human race (and many dystopian ones too) include massive reliance on software-intensive systems, and taking into account the obvious impossibility of eliminating all bugs in these systems, the best we can reasonably hope for is a future where bugs do not have serious real-world impact. In other words, systems do not (often) make mistakes that cost a lot of money, damage the environment, or kill a lot of people. Realizing this future is going to require many technological innovations, and we’ll also need to significantly rethink the process of creating software systems.

The rest of this piece discusses some of the technical ideas that I believe will be most important during the next 25 years or so. Wherever possible, I’ll provide links to examples of the kinds of thinking that I’m talking about. Obviously the software safety problem has many human aspects; I’m not writing about those.

Automated Generation of Test Inputs

Whitebox testcase generators that take the structure of the system under test into account have improved greatly in the last few years. One of the best examples is Klee, which is not only open source but (atypically for research products) is engineered well enough that it works for a broad class of inputs other than the ones tested by the original developers. Drawbacks of whitebox testing tools include:

  • the path explosion problem prevents them from testing medium to large systems (the current limit seems to be around 10 kloc)
  • they do not support complex validity constraints on inputs
  • code living behind hash functions and other hard-to-reverse computations is inherently resistant to whitebox techniques (see the sketch below)
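
To make the last point concrete, here is a minimal C sketch (everything in it is invented for illustration, not taken from any real system) of the kind of code that defeats whitebox tools: the only way to generate an input that reaches the guarded branch is to invert the hash function, which is exactly the kind of constraint a solver cannot handle.

  #include <stdint.h>

  /* Toy illustration of code that resists whitebox testcase generation:
     covering the first branch requires finding an input whose hash equals a
     specific value, i.e. inverting the hash. */
  static uint32_t toy_hash(const uint8_t *buf, int len) {
    uint32_t h = 2166136261u;             /* FNV-1a-style mixing */
    for (int i = 0; i < len; i++)
      h = (h ^ buf[i]) * 16777619u;
    return h;
  }

  int process(const uint8_t *msg, int len) {
    if (toy_hash(msg, len) == 0x1234abcd)
      return 1;                           /* interesting path: rarely covered */
    return 0;                             /* boring path: any tool reaches it */
  }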

A different approach is to randomly generate well-formed testcases; the recently announced cross_fuzz tool and my group’s compiler testing tool are good examples, having found ~100 and ~300 bugs respectively in deployed systems. Random testing of this kind has significant, well-known drawbacks, but it can be extremely effective if the generator is carefully designed and tuned.
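
For readers who haven’t seen one, here is a bare-bones sketch (in C, entirely invented for illustration and far simpler than any real tool) of a generator of well-formed random testcases. The hard part that real generators solve, and that this sketch deliberately ignores, is enforcing deeper validity constraints such as avoiding signed overflow and other undefined behaviors in the emitted code.

  #include <stdio.h>
  #include <stdlib.h>

  /* Emit a syntactically valid C function containing a random arithmetic
     expression. A real generator would also enforce semantic validity
     (no signed overflow, no division by zero, etc.). */
  static void gen_expr(int depth) {
    if (depth == 0 || rand() % 3 == 0) {
      printf("%d", rand() % 100);         /* leaf: a small constant */
    } else {
      const char *ops = "+-*";
      printf("(");
      gen_expr(depth - 1);
      printf(" %c ", ops[rand() % 3]);
      gen_expr(depth - 1);
      printf(")");
    }
  }

  int main(void) {
    srand(12345);                         /* fixed seed: reproducible output */
    printf("int test(void) { return ");
    gen_expr(4);
    printf("; }\n");
    return 0;
  }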

Future tools for automatically generating test cases will combine the best features of whitebox and constrained random testing. In fact, an example combining these features already exists; it is really cool work but it’s not clear (to me at least) how to reuse its ideas in more complicated situations. I think it’s fair to say that the problems encountered when mixing whitebox-type constraints and sophisticated validity constraints (e.g. “input is a valid collection of sensor inputs to the aircraft” or “input is a valid C program”) are extremely difficult and it will take a while to work out good solutions. Note that you can’t get around the validity problem by saying something like “well, the system should operate properly for all inputs.” Even if this is true, we still want to spend much of our time testing valid inputs and these typically make up an infinitesimal part of the total input space.

Scalability problems in software testing are inherent; the only possible solutions I can see are those that exploit modularity. We need to create smallish pieces of software, test them thoroughly, and then come up with ways to make the testing results say something about compositions of these smaller units (and this reasoning needs to apply at all levels of software construction). While “unit testing is good” may seem too stupidly obvious to be worth saying, there are many complex software systems with substantial internal modularity (GCC and Linux, for example) whose component parts get little or no individual testing.

Large-Scale Virtualized Environments

Executable models of complex systems — the Internet, the financial system, a fleet of ships, a city-wide automated driving system — are needed. The National Cyber Range is an example of an effort in this direction. These environments will never capture all the nuances of the emulated system (especially if it includes humans) but they’re a lot better than the alternative, which is to bring up fragile systems that have never been tested against meaningful large-scale events, such as an earthquake or, in the case of a city-scale automated driving system, a concerted DoS attack.

Increasingly, software verification tools and testcase generators will be applied at the level of these large-scale emulated environments, instead of being applied to individual machines or small networks. What kind of rogue trading program will maximally stress the stock market systems? What kind of compromised vehicles or broken network links will maximally wig out the automated driving system? We’d sure like to know this before something bad happens.

Years ago I did a bunch of reading about federated simulation environments. At that point, the status seemed to be that building the things was a huge amount of work but that most of the technical problems were fairly straightforward. What I would hope is that as software systems increase in number and complexity, the number of kinds of physical objects with which they interact will grow only slowly. In that case, the economics of reuse will lead us to create more interoperable physics engines, astro- and hydrodynamics packages, SoC simulators, etc., greatly reducing the cost of creating future large-scale simulation environments.

Self-Checking Systems

What would happen if we created a large system under the restriction that 98% of all system resources had to be used for redundancy, error detection, checkpointing and rollback, health monitoring, and related non-functional activities? Would we think of interesting ways to use all those cycles and bytes? Probably so. Large-scale future systems are going to have to put a much larger fraction of their resources into this kind of activity. This will not be a problem because resources are still rapidly getting cheaper. As a random example, the simplex architecture is pretty awesome.

Independent Failure

Independent failures are at the core of all reliability arguments. A major problem with computer systems is that it is very difficult to show that failures will be independent. Correlated failures occur at different levels:

  • all Windows machines may be affected by the same zero-day exploit
  • hard drives from the same batch may all fail at around 15,000 hours
  • a VMM bug or a bad RAM cell can take out multiple VMs on the same physical platform
  • n-version programming is known to not be a panacea

Simply put, we need to develop better ways to create defensible arguments about independent failure. This is a tough nut to crack and I don’t have good ideas. I did, however, recently see an interesting talk with some ideas along these lines for the case of timing faults.

Margin for Software Systems

Margin is at the center of all reliable engineered systems, and yet the concept of margin is almost entirely absent from software. I’ve written about this before, so won’t repeat it here.

Modularity

Modularity is the only reason we can create large software systems at all. Even so, I’m convinced that modularity is far from being played out as a source of benefit and ideas. Here are some areas where work is needed.

The benefits we get from modularity are diminished if the interfaces between modules are the wrong ones. Our interfaces tend to be very long-lived (consider TCP/IP, the UNIX system call interface, and x86), and their suboptimality is a constant low-level drag on system construction. Furthermore, the increasing size of software systems has made it quite difficult to start from scratch, meaning that even academic researchers are quite often constrained by 30-year-old interfaces. Eddie Kohler’s work is an excellent recent example of interface re-thinking at the operating system level.

Big system problems often happen due to unintended interactions between components, which are basically failures of modularity. These failures occur because we pretend to be using abstractions, but we’re actually using pieces of code. Real software tends not to be very modular at all with respect to bugs, timing behavior, or memory allocation behavior. Coming to grips with the impact of leaky abstractions on system construction is a critical problem. Plugging the leaks is hard, and in most cases they have to be plugged one at a time. For example, switching to a type-safe language partially plugs the class of abstraction leaks caused by memory safety violations; synchronous languages like Esterel render systems independent of certain kinds of timing behavior; etc. An alternative to plugging leaks is to roll the leaking behaviors into the abstraction. For example, one of my favorite programming books is Leventhal and Saville’s 6502 Assembly Language Subroutines. Each routine contains documentation like this:

Registers Used: All

Execution time: NUMBER OF SHIFTS * (16 + 20 * LENGTH OF OPERAND IN BYTES) + 73 cycles

Data Memory Required: Two bytes anywhere in RAM plus two bytes on page 0.
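
To make the timing formula concrete (this is just my own plug-in example, not a number from the book): shifting a 4-byte operand by 3 positions would cost 3 * (16 + 20 * 4) + 73 = 361 cycles, a bound that can be carried directly into a worst-case timing argument for the calling code.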

Obviously it is challenging to scale this up, but how else are we going to reason about worst-case behavior of timing-sensitive codes?

Increasingly, some of the modules in a software system, like compilers or microkernels, will have been proved correct. In the not-too-distant future we’ll also be seeing verified device drivers, VMMs, and libraries. The impact of verified modules on software testing is not, as far as I know, well-understood. What I mean is: you sure as hell don’t stop testing just because something got proved correct — but how should you test differently?

Summary

We have a long way to go. The answers are pretty much all about modularity, testing, and the interaction between modularity and testing.

Almost Everything in Subversion

Every file I use on a day-to-day basis — excluding only data shared with other people, email folders, and bulk media — is kept in a big Subversion repository. For the five years that I’ve been doing this, I’ve averaged 3.5 commits per day. Overall it works really well. Advantages of this scheme include:

  • Seamless operation across Mac, UNIX, and Windows.
  • I’m always using a local copy of data, so access is fast and there’s no problem if I lose the network.
  • Synchronization is fast and low-bandwidth, so it works fine using airport and coffee shop connections.
  • My machines become stateless; reinstalling an OS is hassle-free and I never back up hard disks anymore.
  • Explicit add of files means that large temporary objects are never dragged across the network.

On the other hand, a few aspects of the plan are less than ideal:

  • Explicit push, pull, and add means I have to always remember to do these, though I seldom forget anymore.
  • Every now and then I have a file that seems a bit too large to check in to svn, but that doesn’t really fit into my bulk media backup plan.
  • Subversion’s diffing and patching are geared towards text files, and work less than ideally for binary objects such as the .docx and .pptx files I sometimes need to deal with.

But overall I’m totally happy with working this way and probably won’t change for years to come.

A Critical Look at the SCADE Compiler Verification Kit

While searching for related work on compiler testing and certification, I ran across the SCADE Compiler Verification Kit: a collection of SCADE-generated C code along with some test vectors. The idea is to spot compiler problems early and in a controlled fashion by testing the compiler using the kinds of C code that SCADE generates.

SCADE is a suite of tools for building safety critical embedded systems; it generates C code that is then cross-compiled for the target platform using a convenient compiler. Using C as a portable assembly language is a fairly common strategy due to the large variety of embedded platforms out there.

Given that there is a lot of variation in the quality of embedded C compilers (in other words, there are some really buggy compilers out there), something like the Compiler Verification Kit (CVK) probably had to be created. It’s a great idea and I’m sure it serves its purpose. Unfortunately, the developers over-state their claims. The CVK web page says:

Once this verification of the compiler is complete, no further low-level testing is needed on the SCADE-generated object code. [emphasis is theirs]

This is saying that since the CVK tests all language features and combinations of language features used by SCADE, the CVK will reveal all compiler bugs that matter. This claim is fundamentally wrong. Just as a random example, we’ve found compiler bugs that depend on the choice of a constant used in code. Is the CVK testing all possible choices of constants? Of course not.
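
To see why constants matter, consider a made-up example (this is not one of the actual bugs we found): a compiler typically lowers each of the following multiplications through a different combination of strength reduction and constant handling, so a code generation bug can be triggered by one constant and not by another.

  #include <stdint.h>

  /* Hypothetical illustration: three versions of the "same" operation that
     exercise different code generation paths, depending only on the constant. */
  uint32_t scale_a(uint32_t x) { return x * 3u; }     /* often lowered to shift+add    */
  uint32_t scale_b(uint32_t x) { return x * 16u; }    /* lowered to a single shift     */
  uint32_t scale_c(uint32_t x) { return x * 65535u; } /* may become shift and subtract */

A fixed body of generated code, no matter how thoroughly covered, exercises only the lowering paths that its particular constants happen to hit.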

The web page also states that the CVK has:

…the test vectors needed to achieve 100% MC/DC of structural code coverage.

No doubt this is true, but it is misleading. 100% coverage of the generated code is not what is needed here. 100% coverage of the compiler under test would be better, but even that would be insufficient to guarantee the absence of translation bugs. Creating safety-critical systems that work is a serious business, and it would be nice if the tool vendors in this sector took a little more trouble to provide accurate technical claims.