Who Verifies the Verifiers?

Verification tools, like all complex software systems, contain design flaws and implementation errors. Moreover, they are not easy to test. The other day I started to address the question: If someone claims that a piece of software has been formally verified, what should you believe about the software? The previous piece was non-technical; this one adds a bit of detail.

It often comes as a surprise to non-specialists that a formally verified program or system can still be wrong, but this is the case. To see how it happens, let’s break down a typical formal verification argument into the following pieces:

  1. A computer program or system that we want to reason about
  2. A model that assigns a mathematical meaning to the program
  3. An intuitive property that we wish to show the program has
  4. A mathematical formalization of the property
  5. A proof showing the model has the property

Even if the system to be verified contains no bugs and we have a useful property in mind, there is significant room for error in elements 2, 4, and 5 of this list. First I’ll discuss the difficulties, then give a few examples, and finally discuss solutions.


A lot of verification efforts go wrong when ascribing a mathematical meaning to the verified program. Although finding a faithful formalism for a small piece of code (merge sort, for example) is pretty easy, doing so for medium-sized or large codes in real programming languages remains a major challenge. These problems are exacerbated by the fact that most verification research focuses on reasoning about models of programs, as opposed to deriving good models in the first place.

Formalizing a system at the right level of abstraction is important. If the level of abstraction is too high, important details will be missing from the formalism. Some years ago I went to a talk by a well-known verification researcher who said he was providing a sound argument that a piece of C code had certain properties. I spoke up, pointing out that nothing in his analysis ruled out problems such as stack smashing attacks; he responded that his result was sound with respect to a model of C where memory safety errors were assumed not to occur. I wondered (but didn’t say aloud) why someone working for Microsoft would ignore the precise class of C program errors that leads to most of the problems with Microsoft’s software. Since that time, the state of the art has improved greatly, but we still see sloppy models of C with respect to integer behaviors, order of evaluation of arguments to functions, inline assembly code, and other details that can easily render a verification result meaningless.

There are plenty of other problems in formalizing a real language like C. The mathematical meaning of some parts of the standard is unclear. There are many dialects: Microsoft C, GNU C, ANSI C, C99, etc. Many multithreaded C and C++ programs depend on processor-level weak memory models, which are horrible. Header files and library functions pull all sorts of hard-to-formalize code into a program. A rigorous formalization of C++0x will be a major challenge because the language is massive.

If a system’s behavior is formalized at a level of detail that is too low, the complexity of the formalism may be so large that it cannot be dealt with. It is only fairly recently that proofs about non-trivial executable programs have become possible. Regardless of the level of abstraction that is chosen, the fundamental leakiness of all real abstractions can be a problem. C code dives down into assembly sometimes; the correctness of assembly code may depend on the timing properties of the chip being used; etc.

A pernicious problem is that models of programs may include subtle assumptions. For example, a CIL-based tool for analyzing C code (many such tools exist) implicitly assumes that the C compiler used to build the system evaluates function arguments in the same order that CIL does. Ferreting out all of these assumptions requires wizard-level expertise, and bogus assumptions may go undetected for years.


There are plenty of famous examples of faulty specifications, such as requiring that the output of a sort routine is sorted, but failing to specify that the output is a permutation of the input. Similarly, one traditional definition of correctness is “returns the correct result or else fails to terminate” — not too useful when verifying embedded software that actually needs the answer. These are silly examples, but checking the specification for a real piece of software is not easy, and a serious problem in the specification may invalidate the whole verification result.

Even relatively simple properties such as memory safety contain plenty of room for error: Do they account for threads, interrupts, signal handlers, and DMA? Are holes poked in the safety system for accessing device registers and invoking inline assembly?

It is, as far as I know, not yet clear how one might formalize a property such as “never enters a no-fly zone” for a UAV control system in terms of a relatively faithful system model that includes aerodynamics, jet engine behavior, lots of sensors and actuators, a network of computers, and similar important details.


At some computer science conferences, a common kind of paper proposes an improved way to perform formal verification. The results section of this kind of paper contains some tables showing how long it took to verify some different pieces of software. Typically verification is faster now, or perhaps verification becomes possible where it wasn’t possible before. The thing that is missing from many of these papers is a convincing argument that verification happened as opposed to, for example, a logic error causing the verifier to spin for a while and then return the wrong result. A nice recent paper showed that SAT and QBF solvers — the reasoning engines behind many program verification tools — are not as correct as we might have hoped.

Proofs are supposed to be formalized arguments that break intuitive arguments down into steps small enough that they are individually believable. The problem is that constructing such an argument about a real piece of software is tedious at about the same level as computing the average size of a grain of sand on the beach by measuring all of them. Thus, people have created tools such as HOL and Coq that can eliminate some of the tedium. However, the vast majority of program verification tools fail to emit a proof that could be checked using a mechanized prover. Rather, the proof is implicit in the execution of the tool. If we are to trust a result produced by this kind of tool, the trust has to have a different source.


One of my favorite papers that I’ve written described a tool that took a piece of executable software for AVR microcontrollers and returned an upper bound on the amount of memory that would be consumed by the call stack, including stack memory used by one or more interrupt handlers. Let’s look at the components of the argument made by this tool:

  • The system model is derived from the AVR documentation and is embedded in the C++ code comprising my analyzer.
  • The specification is of the form “under a set of assumptions, the stack memory consumed by this executable does not exceed N.” This was never formalized in math, but rather in English in the paper and also embedded in the code for the analyzer.
  • The proof is based on an abstract interpretation of each sequential flow of control (one per interrupt handler and one for the non-interrupt context) plus a customized algorithm for composing these partial results into an overall stack depth. The algorithms are described informally in the paper. No proof — formal or otherwise — was given.

At first glance it would seem that I’ve done everything possible to create an unconvincing argument. Why would anyone believe a result produced by this tool? The first major piece of evidence is based on the fact that the tool gives a numerical result — the worst-case stack depth — as opposed to the more typical yes/no answer. By comparing this depth against the worst observed stack depth executions of real AVR codes, it was shown to be conservative in complicated cases and exact in simple cases. Second, in the paper I tried to be very explicit about the assumptions under which the tool would produce valid results. The intent was for readers to think about these assumptions and the system model, and conclude that their combination is plausibly correct. Third, the AVR semantics and the abstract interpretation derived from it were extensively tested, for example by turning low-level analysis results into assertions and stress-testing systems compiled with these assertions. Finally, the tool is open source. None of these arguments creates compelling evidence that my tool is bug-free, of course, but they do serve to give us a bit of confidence in its results.

The best way to argue against subtle bugs would be to formalize the AVR instruction set’s semantics and the worst-case stack depth property in, for example, HOL4. Then, my tool would have to be altered to emit a proof witness to back up its stack depth claims. I thought about this back in 2003 when I was doing the work, and decided the 2-3 person years it probably would have required weren’t worth it. So what did my tool really accomplish? The best we can say is that it emits a claim that could, in principle, be formalized, barring design and implementation errors.

Let’s look at another example. A few years ago Patrick Cousot gave a talk called “Proving the Absence of Run-Time Errors in Safety-Critical Avionics Code.” The talk abstract states that

…abstract-interpretation-based static analyzers have no false negative hence are sound by design. As opposed to bug-finders, static analyzers will never omit to signal an error that can appear in some execution environment.

“Proving” is a pretty generous description of an argument that is buried in a large, complicated tool. Moreover, Astree is closed-source, preventing independent validation of this proof. If Astree has not been designed and implemented correctly, or if the semantics of C have not been formalized correctly, the proof holds no water. Was Astree validated, for example by turning its intermediate results into assertions? I assume so, since you’d have to be crazy not to use these kinds of methods to debug a large analyzer. But this is not described in any of the papers (that I’ve read, at least). Rather, the argument for correctness is basically “we’re smart, trust us.” But of course the avionics software developers could have saved everyone some time by making the same claim about their own system.

Does Astree’s result for the Airbus software pass the piano test? Speaking for myself: given the available published evidence, absolutely not. I’m not really criticizing Astree — which is an amazing piece of work — but rather taking a critical look at exactly what kind of argument it makes about the analyzed system.

Could Astree’s argument be formalized as a result that could be verified using a mechanical theorem prover? Again, yes — barring design and implementation errors. But it would probably take in excess of 10 person-years of effort. The Astree team is unlikely to invest that effort because they have already made public claims equivalent to stating that the tool is infallible. (This reminds me of a story I heard from a formal verification researcher who was told by a large safety-critical system vendor that they were allowed to prove the company’s software correct, but by definition they weren’t going to find any bugs in it.)

A rather different kind of example comes from Column 4 of Programming Pearls, where Bentley proves that a pseudocode binary search implementation is correct, and then in the next column, derives a C implementation. In 2006 Joshua Bloch noticed that the proof is faulty for any target programming language that has fixed-width integers, including C, when an array at least half the size of the maximum representable integer value is sorted. Bentley’s verification effort failed because his model assumed that integers in C behave mathematically, whereas integer overflows in C either have 2’s complement behavior or else completely destroy the meaning of the program, depending on whether the ints are unsigned or signed and depending on what version of C is being used. Unfaithful models of integer behavior are a common source of false verification results. Verification engineers and researchers have taken note of this and many modern tools have accurate models of integers.

The most convincing verification arguments are those where the proof is machine-checked. Recent examples include seL4, CompCert, and Myreen’s Lisp interpreter. All of these are great results. The one that I have the most experience with is CompCert, which my group has tested extensively. We found a few bugs in its unproved frontend and a few minor holes in its formalization of the PowerPC assembly language, but we did not find any of the incorrect safety checks in optimization passes that plague every other compiler we have studied. Furthermore, in recent versions CompCert provably (and in practice) respects the volatile invariant, whereas every other compiler we have tested gets it wrong.


The closest thing we have to a gold standard for mathematical proof is to make the proof checkable by machine. Could a proof checker be wrong? Sure, but people have made some very small, simple checkers, and different implementations of them can be run on totally different hardware and software systems, greatly reducing the scope for undetected proof errors. The main problem with machine-checked proofs is that creating them requires a pedantic level of commitment to detail that most people do not have. Proof-producing verifiers are the solution, but currently these are so hard to create that very few researchers do so. In the future we can look forward to program analysis infrastructure that handles most of the grunt-work of producing proofs, giving researchers (and system certifiers) the benefit without most of the pain.

Is there a gold standard for assigning meanings to realistic software artifacts? Generally not; but perhaps so, in a few very limited domains. A formalization of the ARM instruction set architecture by some folks at Cambridge is one example. There are three arguments in favor of trusting this particular formalization. First, Fox proved that it was implemented by a particular version of the ARM hardware. Second, the semantics is executable: it can be used to interpret ARM programs, and the results checked against the results on real hardware. Third, the semantics has been around for a while and has been used by many different research groups. It is surprising to some people that a formal model — the basis for all formal reasoning about programs — should itself require extensive verification and validation, but this is absolutely the case. In the long run I think we must hope that relatively few kinds of artifacts will require formal models, and that well-vetted ways of creating these models will emerge. The current situation, where for example everyone who verifies C code comes up with their own semantics, is pretty terrible.

What about program properties? There don’t seem to be widely accepted specifications for memory safety for C, exception safety for C++, or related reusable properties. There exist plenty of nice specification languages such as SPARK Ada’s and ACSL. Even so, creating nontrivial specifications is really hard.

In my “piano test” post I said that NASA’s “test what you fly, fly what you test” adage needed to be adapted for software. Here’s how to do it. The unverifiable elements of program verification — the models and specifications — need to be thoroughly validated through use, through many-eyes techniques, and through independent mathematical checks. They need to lead to executable system models that can be stress-tested. Whenever possible, intermediate results computed by analyzers must be turned into executable checks and the resulting checked codes must also be stress-tested.

In the long run, are formal verification tools that don’t produce proofs any good? Sure. First, we can gain trust in these tools just like we have gained trust in the software that today controls our avionics systems, antilock brakes, and pacemakers. Second, they can find bugs: a bug is a bug, regardless of the quality of the tool that found it. Third, these tools may be useful simply by reporting false positives that make us look closely at our software to make sure they’re false. There’s a great paper showing that only one static analysis tool refutes the null hypothesis that “static analyzer performance is equivalent to diagonal random guessing.”


A formal verification result is based on a mathematical property that attempts to formalize something we care about, a mathematical model that attempts to formalize the behavior of a program or system of interest, and a proof that attempts to argue that the model has the property. Practically speaking, the model, property, and proof may be one of:

  • informal and implicit (embedded in the source code of a checking tool, for example)
  • documented in technical English (or another human language)
  • written in informal math
  • fully formalized
  • formalized and also machine checked

At present, taking the “fully formalized and machine checked” option adds massive overhead to a verification effort. This is a problem because the systems that we want to verify are getting larger and more complicated each year. Advances in modularity and proof techniques are needed, and appear to be coming.

The Piano Test for Program Verification

[Update from Feb 1 2011: I’ve written a new post that adds some technical details.]

Here’s a little thought experiment:

You’re spending the day visiting the Utah computer science department. We’re chatting and I describe a software verification project where my group has proved that some particular version of OpenSSH is memory safe: it will never execute an array bounds violation, null pointer dereference, or similar. Since you are perhaps slightly gullible, you believe me. Now I ask you to look up, and for the first time you notice that a large piano is hanging from my office ceiling, directly above your head. The rope from which the piano is hanging is securely tied off to an I-beam, so there’s no reason to worry about it falling on you accidentally. However, the rope passes through a small box that is connected to my desktop computer using a USB cable. I explain to you that this box contains a very sharp, remotely-activated knife that can instantly sever the rope when commanded to by the computer (don’t worry about this software being wrong, I verified it too).

Now, since you believe my verification result, you should be perfectly willing to keep sitting in your chair while I attack OpenSSH using some sort of highly advanced fuzz tool that tries to crash it by generating evil inputs. Of course, I have rigged OpenSSH’s SIGSEGV handler so that if it ever fires, it will command the USB knife to sever the rope — and you will be squished. Do you keep sitting in the chair?

My opinion is that a person would have to be a bit foolish to sit under the piano. But forget about my work: how many verification researchers believe in their own work to this level? Probably not too many. The first problem is one of semantics: the word “verified” is used extremely loosely by members of the verification community (I recently reviewed a paper that kept saying things like “we rigorously prove” and “we formally verify” — despite the fact that the paper contained nothing but proof sketches). The second problem is deeper: it is extremely difficult to argue convincingly that a verification result is correct. By “correct” I mean not only “mathematically sound” but also “the result is faithful to the intent of the verification effort.”

Although the picture I’ve painted about dropping a piano is silly, the situation is realistic. For example, every time you use the transportation system in a developed country, you are gambling that the benefit of using the system outweighs the risk that some embedded computer will malfunction, harming you. Of course, we willingly make this bet every day, not because the stoplights and stability control systems are verified (they aren’t) but because we assume that if these systems worked well yesterday, they’ll work well today. The adage (supposedly from NASA) is “test what you fly, fly what you test” and it captures a fundamental fact about how we create reliable engineered systems: systems are tested in realistic environments, failures are rigorously studied, and the results are fed back into the design of the next iteration of the system. This is not incompatible with formal verification, but it’s also not a point of view that comes out very often in the verification literature.

Good Book: Idaho Falls

This book, like the one I wrote about yesterday, is a horror story for engineers. Idaho Falls is about the SL-1, a prototype nuclear reactor in the desert in Idaho. Although it had been designed for a 3 MW thermal capacity, in early 1961 its output briefly reached something like 18 GW when the single main control rod was pulled too far out of the core. This power surge caused the reactor to explode, killing its three operators and necessitating an elaborate cleanup effort and a painful investigation.

Idaho Falls paints a terrifying picture of the SL-1. Apparently it was poorly designed (single primary control rod, no containment vessel) and also had deteriorated significantly over time: not only would the control rods stick sometimes, but also the proper placement for the rods kept changing as neutron-absorbing boron flaked off of the reactor. Furthermore, oversight was limited and operating procedures were poorly documented.

A large part of Idaho Falls focuses on the people involved; this adds human interest, but in the end there is no convincing evidence that personality issues were a direct contributor to the reactor explosion. There is, however, an excellent anecdote describing what may be the least funny practical joke of all time: an operator (one of the three who was later killed in the explosion) turned off a fan that cooled part of the reactor and watched his colleagues panic for a while before reaching behind a control panel and switching it back on.

I came across this book in an indirect way: last summer, while driving across the Lost River Desert on the way to Borah Peak, I saw an interesting collection of buildings. A sign pointed to the Idaho National Laboratory, which I’d barely even heard of. After getting home, I read up a bit on this lab and eventually found a reference to Idaho Falls. It’s a quick read that people interested in the very early nuclear age should enjoy.

Good Book: Space Systems Failures

Space Systems Failures is like a horror novel for engineers: years of people’s lives and hundreds of millions of dollars are wasted because somebody crossed a wire or skipped a test. The real reasons for failures of launch vehicles and their payloads, however, are more interesting:

  • Margins are slim because adding margin is expensive
  • System state may evolve rapidly with respect to our ability to react, especially during launch or when a spacecraft is far from Earth
  • It’s hard to make things work in a vacuum, in high-radiation environments, and under extreme temperatures
  • Each launch is expensive and often involves untested elements

The result is that the space industry has not, and may never, reach the level of reliability seen for example in aviation. Space Systems Failures is a detailed, engineer-oriented account of a large number of failed space missions. The first part of the book covers launch vehicles and it is organized by rocket family. This part is interesting but perhaps too long: while the failures are surprisingly diverse at first, they begin to blur together after a while. The readability of this section is saved by the (sometimes dryly humorous) discussions of the historical and economic contexts for the various missions.

The second part of the book is about payload failures. It is more interesting for several reasons. First, the functionality of the payloads is diverse, compared to launch vehicles. Second, since failures happen on longer time scales, humans are in the loop. Surprisingly often, an initial apparent failure could be turned into partial or total success by talented ground personnel. Another surprise was how many failures stemmed from error-checking logic, such as shutting down a thruster that was judged to have been firing for too long, preventing a satellite from being stabilized. Unsurprisingly, many serious failures come from interesting feature interactions unforeseen by designers.

I learned a lot from this book, particularly about the environment in which spacecraft live and about the sensors, actuators, and other subsystems that spacecraft use to fulfill their functions. It was interesting to learn that something goes wrong on almost every mission, but often the fault is properly contained instead of propagating into total system failure. The authors conclude that far too many failures result from lessons being improperly learned or forgotten as the relentless drive to lower costs forces tests to be skipped and personnel to be cut.

The Synergy Between Delta Debugging and Compiler Optimization

Before reporting a compiler bug, it’s best to reduce the size of the failure-inducing input. For example, this morning I reported an LLVM bug where the compiler enters an infinite loop when compiling this C code:

static int foo (int si1, int si2) {
  return si1 - si2;
}

void bar (void) {
  unsigned char l_197;
  for (;; l_197 = foo (l_197, 1))
    ;
}

My bug report was a good one because the test program looks pretty close to minimal; it was reduced manually and automatically from code that was originally 25 KB. Testcase reduction is so important that the GCC folks have an entire document devoted to the topic, including instructions for how to use the Berkeley Delta implementation.

The Berkeley Delta is pretty awesome except when it is not. It often gets stuck at local minima that cannot be avoided except by making coordinated changes to multiple parts of a program. For example, it will never remove an argument from a function definition and also from all calls to the function. Therefore, a person works with delta like so:

  1. Run delta
  2. Do more reduction by hand
  3. Goto step 1

After spending more hours on this kind of activity than I’d care to admit, I realized that most of the stuff I was doing resembled compiler optimizations:

  • inlining a function
  • deleting a dead function argument
  • propagating a constant
  • reducing the dimension of an array
  • removing a level of indirection from a pointer

I think the similarity is a useful observation, but there are important differences between an optimizer and a testcase reducer:

  1. An optimizer must be semantics-preserving, whereas a testcase reducer doesn’t need to be
  2. An optimization is desirable when it speeds up the code, whereas a reduction step is desirable when it both simplifies the test case and also doesn’t make the bug disappear

Thus, to reuse an optimizer as a reducer, we would simply need to drive it in a typical Delta loop, being ready to back out any transformation which makes the bug go away or fails to make the test case simpler. I haven’t had time to do this; hopefully someone else will.

What Peers?

[This post was motivated by, and includes things I learned from, a discussion with some of my blogging and non-blogging colleagues.]

Ideally the peer review process benefits not only the conference or journal that does the reviewing, but also the authors who receive valuable feedback on their submission. And usually it does work like that: I’ve received a large amount of great feedback on my work through paper reviews. However, I feel like some part of the process isn’t working very well. On the reviewer side, I keep getting papers to review that are far enough from my areas of expertise that the reviews I write are superficial (over the weekend I reviewed a paper on mapping temperatures in small bodies of water using sonar — what am I supposed to say?). On the author side, at least 50% of reviews that I get are similarly flawed: they don’t tell me anything new or useful. This is confusing to me because I would have expected that the massive proliferation of conferences and tracks at conferences that has taken place in computer science would facilitate good matching of reviewers to papers. Somehow this doesn’t seem to be happening. In contrast, when I need feedback on a paper I typically send it off to a few acquaintances and friends, and usually get tremendously useful feedback. Also, when I send people feedback on their work I often get a reply along the lines of “thanks — that was way more helpful than the reviews we got from conference X.”

A few random guesses about what’s happening:

  • The peer review system is groaning under the overload of papers produced by all of us who are trying to get a higher h-index, trying to get a faculty position, trying to get tenure, etc.
  • Overworked program chairs are resorting to algorithmic methods for assigning papers to reviewers. This never, ever works as well as manual assignment (though it may be useful as a starting point).
  • There may be an increasing amount of work that crosses community and sub-community boundaries, making it harder to effectively pigeonhole a paper. Conferences and journals do not keep up well as boundaries and interests shift.

The interesting question is: how can we use blogs, online journals, and other modern mechanisms to better match up the people producing research to the people who are best able to evaluate it? It seems like some of the obvious answers, such as adding a reviewing facility to arXiv, suffer from problems where I submit a paper and then get my ten best friends to submit reviews saying how great it is.

Update from Jan 18, evening: Because it’s so easy to throw stones, I wanted to add a couple of positive anecdotes:

  • A student and I submitted a paper to POPL 2011. Although it was rejected, the reviews were super useful and totaled almost 25 KB of text (a bit over 4000 words) after deleting all boilerplate. This is especially amazing considering that POPL got a lot of submissions, or so I heard.
  • I was recently invited to be on the “external review committee” for ISMM, the International Symposium on Memory Management. The invitation said “ISMM reviews are often on the order of one typewritten page” which is again pretty amazing. I’ve submitted only one paper to ISMM and the reviews (again after removing all boilerplate) came out to nearly 21 KB (not quite 3400 words).

ISMM is an interesting conference because it is of unusually high quality for the relatively small community it serves. I’m not sure how they have accomplished that, but my guess is that it’s a fairly tight-knit community where there are good incentives to provide people with useful feedback.

How Good Does The Writing Need To Be?

One of the hard parts about reviewing for a conference is trying to rank imperfect papers. Does the novel paper without an evaluation get ranked higher than the incremental work with beautiful experiments? Does the interesting paper in a near-dead area get ranked above the borderline paper that is attacking an important problem?

Over the weekend I reviewed a dozen conference submissions that ended up being quite a bit more interesting than I had feared, but a good 5 or 6 of them had presentation problems. One paper described strong work and was well written in all respects except that it contained literally 50 or 60 errors of this general form:

One of the major reason for…

Obviously this hurts the paper, but how much? The larger context is that although English is the dominant language for scientific publishing, papers are increasingly coming from all over the world. More and more often, none of the authors has English as a first language, and it’s easy to imagine circumstances where it’s hard for the authors to find someone to do a solid proofreading pass. Having lived overseas for several years when I was younger, I can sympathize with how difficult it is to operate in a foreign language.

The position I’ve arrived at is to try to get a paper rejected when the level of presentation is so poor that it seriously interferes with the technical message, but to let errors slide otherwise. Ideally, I’d also include a detailed list of grammatical mistakes in my review, but in practice this is prohibitively time-consuming when papers contain many dozens of errors.

The 5+5 Commandments of a Ph.D.

[This post is co-authored and co-posted with my colleagues Matt and Suresh. Comments should all go to Suresh’s version.]

There have been a lot of Ph.D.-bashing articles lately. There have been some spirited defenses of a Ph.D. too. Most of these articles make good observations, but they’re often about the larger Ph.D. ecosystem and therefore fail to provide actionable advice to (potential) Ph.D. students.

We observe that most failures of the Ph.D. system — including both failure to get the degree and failure to see a good return on time and money invested in obtaining the degree — boil down to a small set of root causes. These causes are on both sides of the implicit contract between advisor and advisee. Here’s our pragmatic view of the conditions that need to be met for a Ph.D. to make sense. (Please keep in mind that we’re all computer science professors, though we’ve made an effort to avoid field-specificity.)

The advisor shall…

  1. Advise the student: help find a thesis topic, teach how to do research, write papers, give talks, etc.
  2. Provide protection from and information about funding concerns (to the level of expectations of the field, which vary widely).
  3. Proactively provide realistic, honest advice about post-Ph.D. career prospects.
  4. Provide early and clear guidance about the time frames and conditions for graduation.
  5. Introduce the student to the academic community, through conference talks, invited talks, letters of recommendation, etc.

The student shall…

  1. As early as possible, do due diligence in researching career prospects. It’s not hard to get people to talk about this and there’s also plenty of written advice, in books and on the web. Carefully filter what you read since the situations may be very different between engineering fields, science fields, and the humanities. There may also be significant differences between sub-fields such as theoretical computer science vs. operating systems. A new student should glance at job postings and NSF statistics to determine the ratio of new Ph.D.s to open tenure-track slots.
  2. As early as possible, determine if the actual career prospects are a reasonable match for their needs/expectations. Until the student makes her expectations clear, the advisor has no clue whether she simply must have an academic job or whether she’ll be perfectly happy getting a Ph.D. and then going to law school or being a stay-at-home parent.
  3. Not be deluded or blinded by catchphrases like “life of the mind.” Indeed, this life does exist, but probably only during the ABD portion of a Ph.D. A professor would be extremely lucky to live the life of the mind 15 hours a week, leaving 60 hours of advising, teaching, reviewing, writing grant proposals, traveling, and sitting in meetings.
  4. Be a good investment in terms of time and money. In other words, work hard. Students who periodically disappear for long bouts of skiing, soul searching, or contract work tend to be put on the back burner by their advisor, making it much more difficult to get re-engaged later on. An easy litmus test: if acting a certain way would get you fired from a real job, then it’s probably a bad idea to try that in a Ph.D. program too.
  5. Jump through the administrative hoops appropriately. These hoops are important and generally not too burdensome: take some classes, do a qualifying exam, write a proposal, and so on. They are easy to ignore until they become a problem. Your advisor is not likely to remind you, or even to remember, that you need to do them.

Since nothing is obvious on the Internet, a disclaimer: These edicts might come across as cold and overly pragmatic, and might suggest that we are ignoring the joy of discovery, the thrill of learning and the excitement of doing cutting-edge research that goes along with doing a Ph.D. Far from it: we’ve chosen this life because we experience all of this and enjoy it. But the easiest way to crash and burn in what is a long, multi-year haul is to forget about the brass tacks and float in the clouds.

Gear for Getting Outside in Winter

Since I’m not a big skier, I didn’t get out a lot in winter during my first few years in Utah. When springtime came, I would just tough it out and get into shape the hard way — go on a few long hikes and suffer appropriately. This worked because I was in my 20s. Now that I’m close to 40, that strategy works very poorly so I try to stay in shape all year. This means having some reasonable winter clothes, which is what this post is about.

The basic ideas behind putting together a good collection of clothing on a modest budget are:

  • Waterproof items are expensive and almost never necessary. The main exceptions are hard rain (rare in Utah) or spending a lot of time wallowing in wet snow (rare outside of self-arrest practice sessions).
  • Items should be multi-purpose. Many pieces of outdoor gear are things a person would want to own anyway, just to shovel the driveway or whatever.
  • Everything can be found for at least 30% off retail price (and often quite a bit more) if you are patient.

I’m posting this since it took me a while to come up with a system that’s warm, comfortable, and durable.


The Basics

These are all you need down to about 5°F, as long as you keep moving and there’s little wind. More layers are needed (see below) to provide margin against getting lost or hurt, to stay comfortable over a lunch break, or when there’s weather.


Soft Shell Pants

Soft shell pants are awesome: stretchy, warm, and breathable. The Mammut Champ pants I’ve used for about three years are not showing much wear and are comfortable in a wide variety of conditions. They seem to be a great choice.


Top

Layering is not needed: a single breathable, windproof, stretchy top is what you want. My top is a Sporthill “zone 3” model and it is excellent, though getting threadbare after 4 years of use. Many other companies make similar gear. The idea is to keep it simple: the shirt is a pullover with no pockets, hood, or any of that.

Socks and Boots

Smartwool socks are my favorite, and not too hard to find on sale. Otherwise anything that isn’t cotton should be fine. Boots don’t seem to matter that much; you just want decent treads. I’m not partial to Gore-Tex boots, but they’re hard to avoid unless you go to heavy leather boots.


Gaiters

Everyone in the Midwest and Northeast of the US has a hat and gloves, but most people who visit us from those regions have never heard of gaiters. I don’t understand this because gaiters are the best thing ever: they keep snow (and sticks and leaves) out of your boots and they keep your legs warm. I’d rather go snowshoeing in tennis shoes and gaiters than in a good pair of boots without gaiters.

Gloves and Hat

The “windstopper” kind of fleece seems to work well for a light hat and gloves. Gloves last a few years and hats last until you lose them.


The Extras

For trips longer than an hour or two, or for non-benign winter weather, some extra gear is nice.


Balaclava

Even with a good hat and temperatures not much below freezing, windy weather can easily lead to numb cheeks, neck, and chin. A balaclava solves this problem; in winter I almost always hike/run with a Smartwool balaclava in my pocket, since it weighs almost nothing and is really warm and comfortable. I also have a heavy storm balaclava but don’t use it that often.

Long Underwear

Backpacking in crappy weather taught me to love wool long underwear: it stays comfortable and doesn’t get (extremely) smelly after wearing it continuously for a week. Mine is Smartwool but there are several other brands like Ibex and Icebreaker that are probably just as good. This stuff is so comfortable I wear it around the house in winter.


Headlamp

A headlamp is a good idea for winter day hikes, and mandatory for evening hiking. I used to use a small headlamp with 3 AAA batteries inside the unit; these are awesome for emergency use and for backpacking trips, but don’t provide enough light to illuminate a trail if you’re moving fast. My current headlamp has an external battery pack, which is annoying, but it’s far brighter and can be used for trail running at night.


Poles

I usually hike with poles in winter; they provide extra points of balance on icy hillsides and make it much easier to self-extricate from a snowbank.

Shell Pants

As I said, these are seldom necessary unless it’s very windy or rainy. But still, it’s nice to have a pair. I bought cheapo Gore-Tex pants about 10 years ago and they still work.

Puffy Jacket

A puffy jacket is great to pull on as soon as you stop moving. Mine is a Montbell UL Thermawrap, which I love because it’s light and small and fairly close-fitting, so it layers well under other stuff. Since it’s thin, it’s not very warm by itself.

Shell or Soft-Shell Jacket

One of these is nice for the crappiest conditions.

Serious Hand Protection

You definitely want heavier gloves than the light ones described above. I also have a pair of pullover mittens, mainly for insurance. You can’t actually do anything with your hands while wearing the damn things, but I’m pretty sure that (pulled over warm gloves) they’d stave off frostbite in most conditions found in the lower 48.