Skip to content

Atomic Accidents

Although I was six years old when the Three Mile Island accident happened, I clearly remember grownups talking about it and being worried: the house my family lived in was only about 60 miles away from the meltdown. In those days there was also plenty of free-floating nuclear angst due to the cold war; this would occasionally condense into something like The Day After or Edge of Darkness. The latter remains one of the best things ever to be shown on television, I re-watch it every couple of years (the 1985 one, not the 2010 one).

James Mahaffey’s Atomic Accidents covers not only Three Mile Island, Chernobyl, and Fukushima, but also pretty much everything else that has gone wrong when humans tried to exploit nuclear fission or fusion. It’s a fascinating book as well as being — perhaps oddly — quite funny, and I had trouble putting it down.

I was surprised to learn how many nuclear reactors have been destroyed on purpose, and I was also surprised to learn how many nuclear weapons were temporarily lost by the US military: something like 60 in total. That’s really scary. But perhaps the most chilling image painted in Atomic Accidents is the criticality accident where a small nuclear reactor is accidentally created, usually by someone working in a fuel processing facility. Imagine doing something innocuous like turning on a stirrer or pouring a liquid into a different container, seeing a bright blue flash, and realizing that you’re dead on your feet. This fascinating report contains a lot of details.

The accidents in large reactor facilities have some depressing common elements. First, the situation is inherently dangerous due to this large system that, under certain conditions, will get into a runaway positive feedback loop. Second, the thing can’t just be shut down to zero power: residual radioactive decay generates heat that has to be gotten rid of, necessitating extraordinarily complex cooling systems and backup power systems behind those. Third, visibility into the operating reactor is often poor: in one early accident, a reactor core had been on fire for several days before this was realized. Finally, humans, caught in between all of these factors, don’t seem to reliably do the right thing at the right instant.

A lot of pop science is written by people whose understanding of the issues seems to be shallow, but that is not the case here: Mahaffey is clearly a real expert on the subject matter. On the other hand, he is not unbiased. For example, on page XIX:

To keep the industry alive, thriving, and growing, it is imperative that the general population not feel threatened by it.

On page XXI:

The purpose of this book is not to convince you that nuclear power is unsafe beyond reason, or that it will lead to the destruction of civilization. On the contrary, I hope to demonstrate that nuclear power is even safer than transportation by steam and may be one of the key things that will allow life on Earth to keep progressing…

The best we can say is that it’s nice that he is up-front about this. Mahaffey’s slanted point of view caused me real stomach trouble only once: by page 33 he has twice asked the question: “Could we eventually evolve into a race that can withstand high levels of radiation?” What? For the human race to evolve in such a fashion, those who cannot withstand high levels of radiation must die — or be sterilized — before they can reproduce, repeatedly, over a period of hundreds or thousands of years. This is what might happen if the entire surface of the earth became dangerously radioactive. What was going on in Mahaffey’s mind that made this disturbing idea seem so appealing that he had to mention it more than once before the end of the first chapter?

Non-Transparent Memory Safety

[This paper contains more detail about the work described in this post.]

Instrumenting C/C++ programs to trap memory safety bugs is a popular and important research topic. In general, a memory safety solution has three goals:

  • efficiency,
  • transparency, and
  • compatibility.

Efficiency is obvious. Transparency means that we can turn on memory safety with a switch, we don’t have to do anything at the program level. Compatibility means that safe and unsafe code can freely interact, especially when linking against libraries. Compatibility is tricky because it severely limits the ways in which we can change the layout of memory objects, as we might hope to do in order to store the length of an array along with its data.

One of my favorite memory safety solutions for C — the Deputy project from Berkeley — is distinct from most other work on this space because it does not have transparency as a goal. While this initially seems like a bad idea, and it will obviously limit the amount of legacy code that we can run under Deputy, I eventually came to realize that non-transparency can be a good thing. The goal of this piece is to explain why.

When you write a C or C++ program, you usually intend it to be memory safe. And in fact, a large proportion of C/C++ code in the wild is memory safe, meaning that for all valid inputs it fails to access out-of-bounds or unallocated storage (or it might mean something else, but let’s not worry about that). The problem, of course, is that a small fraction of C/C++ code is not memory safe and some of these errors have serious consequences.

For sake of argument, let’s say that you have written a piece of C code that is memory safe. With some effort you can do this for a small and perhaps for a medium-sized program. Now we might ask: Why is the program memory safe? Where does the memory safety live? Well, the memory safety resides in the logic of the program and perhaps also in the input domain. Unless we’ve used some sort of formal methods tool, the reasoning behind memory safety isn’t written down anywhere, so it’s impossible to verify.

Let’s take your memory safe C program and run it under a transparent memory safety solution like perhaps SoftBound + CETS. What we have now are two totally separate implementations of memory safety: one of them implicit and hard to get right, the other explicitly enforced by the compiler and runtime system.

Deputy is based on the premise that we don’t need two separate implementations of memory safety. Rather, Deputy is designed in such a way that the C programmer can tell the system just enough about her memory safety implementation that it can be checked. Let’s look at an example:

int lookup (int *array, int index) {
  return array[index];

If we don’t trust the developer to get memory safety right, we need to change the code to something like this:

int lookup (int *array, int index) {
  assert (index >= 0 && index < array.length);
  return array[index];

In the C programmer’s implementation of memory safety, the assertion is guaranteed not to fire by the surrounding program logic and by restrictions on the input domain. In a compatible memory safe C, the assertion must be statically or dynamically checked, meaning that we need to know how many int-typed variables are stored in the memory region starting at array. This is not so easy because C has no runtime representation for array lengths. The typical solution is to maintain some sort of fast lookup structure that maps pointers to lengths. A significant complication is that array might point into the middle of some other array. The code that actually executes would look something like this:

int lookup (int *array, int index) {
  check_read_ok (array + index, sizeof (int));
  return array[index];

Getting back to Deputy, the question is: How can the programmer communicate her memory safety argument to the system? It is done like this:

int lookup (int *COUNT(array.length) array, int index) {
  return array[index];

COUNT() is an annotation that tells Deputy what it needs to know in order to do a fast bounds check — no global lookup structure is necessary.

When I first saw the example above, I was not very impressed: it looks like Deputy is just being lazy and punting the problem back to me. But after using Deputy for a while, its genius became apparent. First, whenever I needed to tell Deputy something, the information was always available either in my head or in a convenient program variable. This is not a coincidence: if the information that Deputy requires is not available, then the code is probably not memory safe. Second, the annotations become incredibly useful documentation: they take memory safety information that is normally implicit and put it out in the open in a nice readable format. In contrast, a transparent memory safety solution is highly valuable at runtime but does not contribute to the understandability and maintainability of our code.

There are a number of other Deputy annotations, most notably NTS which is used to tell the system about a null-terminated string and NONNULL which of course indicates a non-null pointer. The Deputy Quick Reference shows the complete set of annotations and the Deputy Manual explains everything in more detail and has code examples. The Deputy paper focuses on more academic concerns and unfortunately contains only a single short example of Deputized C code.

Although the preceding example didn’t make this clear, applying Deputy to C code is pretty easy because the Deputy compiler uses type inference to figure out annotations within each function. Thus, many simple functions can be annotated at the prototype and the compiler takes care of the rest. In more involved situations, annotations are also necessary inside functions. The process for applying Deputy to legacy C code is to compile the code at which point Deputy says where annotations are missing. So you add them and repeat. It’s a nice process where you end up learning a lot about the code that you are annotating. In general, an incorrect annotation cannot lead to memory-unsafe behavior, but it can cause a memory safety violation to be incorrectly reported. (You can write truly unsafe code in Deputy using its UNSAFE annotation, but at least the unsafe code is obvious, as it is in Rust.) My guess is that people who enjoy using assertions would also enjoy Deputy; people who hate assertions may well have a different opinion.

Is Deputy perfect? Certainly not. Most seriously, it is only a partial memory safety solution and does not address use-after-free errors. Its memory safety guarantee does not hold if there are data races. One time I ran into a case where Deputy wouldn’t let me tell it the information that it needed to know, I believe it was when the size of an array was in a struct field. Finally, since it is based on CIL, Deputy supports C but not C++.

My group used Deputy as the basis for our Safe TinyOS project. TinyOS was a nice match for Deputy: the extremely lightweight runtime was suitable for embedded chips with 4 KB of RAM and the lack of use-after-free checking wasn’t a problem since TinyOS doesn’t have malloc/free. We found that in many cases it was sufficient to annotate the TinyOS interface files — which serve much the same role as C header files — and then Deputy didn’t need additional annotations. Here’s an example of an annotated interface:

   * @param  'message_t* ONE msg'        the received packet
   * @param  'void* COUNT(len) payload'  a pointer to the packet's payload
   * @param  len                         the length of the data region pointed to by payload
   * @return 'message_t* ONE'            a packet buffer for the stack to use for the next
   *                                     received packet.
  event message_t* receive(message_t* msg, void* payload, uint8_t len);

There are minor differences from standard Deputy, such as ONE pointers (they “point to one object”) instead of SAFE NONNULL, and we put the annotations into the comments, so they automatically get added to the interface documentation, instead of putting them directly into the function prototypes. There were also some changes under the hood. We found that Deputy was generally a pleasure to use and it caught some nasty bugs in various TinyOS programs.

The current status is that Deputy has not been supported for some time, so it would not be a good choice for a new project. The Deputy ESOP paper has been well cited (114 times according to Google Scholar) but the basic idea of memory safe C/C++ via annotations and type inference has not caught on, which is kind of a shame since I thought it was a nice design point. On the other hand, even if an updated implementation was available, in 2014 I would perhaps not use Deputy for a new safe low-level project, but would give Rust a try instead, since it has a good story not only for out-of-bounds pointers but also use-after-free errors.

Reviewing Research Papers Efficiently

The conference system that we use in computer science guarantees that several times a year, each of us will need to review a lot of papers, sometimes more than 20, in a fairly short amount of time. In order to focus reviewing energy where it matters most, it helps to review efficiently. Here are some ideas on how to do that.

Significant efficiency can come from recognizing papers that are not deserving of a full review. A paper might fall into this category if it is:

  • way too long
  • obviously outside the scope of the conference
  • significantly incomplete, such as an experimental paper that lacks results
  • a duplicate of a paper submitted or published by the same or different authors
  • an aged resubmission of a many-times-rejected paper that, for example, has not been updated to reference any work done in the last 15 years

These papers can be marked as “reject” and the review then contains a brief, friendly explanation of the problem. If there is controversy about the paper it will be discussed, but the most common outcome is for each reviewer to independently reach the same conclusion, causing the paper to be dropped from consideration early. Certain program committee members actively bid on these papers in order to minimize their amount of reviewing work.

Every paper that passes the quick smoke test has to be read in its entirety. Or perhaps not… I usually skip the abstract of a paper while reviewing it (you would read the abstract when deciding whether or not to read the paper — but here that decision has already been made). Rather, I start out by reading the conclusion. This is helpful for a couple of reasons. First, the conclusion generally lacks the motivational part of the paper which can be superfluous when one is closely familiar with the research area. Second — and there’s no nice way to say this — I’ve found that authors are more truthful when writing conclusions than they are when writing introductions. Perhaps the problem is that the introduction is often written early on, in the hopeful phase of a research project. The conclusion, on the other hand, is generally written during the grim final days — or hours — of paper preparation when the machines have wound down to an idle and the graphs are all plotted. Also, I appreciate the tone of a conclusion, which usually includes some text like: “it has been shown that 41% of hoovulators can be subsumed by frambulators.” This gives us something specific to look for while reading the rest of the paper: evidence supporting that claim. In contrast, the introduction probably spends about a page waxing eloquent on the number of lives that are put at risk every day by the ad hoc and perhaps unsound nature of the hoovulator.

Alas, other than the abstract trick, there aren’t really any good shortcuts during the “reading the paper” phase of reviewing a paper. The next place to save time is on writing the review. The first way to do this is to keep good notes while reading, either in ink on the paper or in a text file. Generally, each such comment will turn into a sentence or two in the final review. Therefore, once you finish reading the paper, your main jobs are (1) to make up your mind about the recommendation and (2) to massage the notes into a legible and useful form. The second way to save time is to decide what kind of review you are writing. If the paper is strong then your review is a persuasive essay with the goal of getting the rest of the committee to accept it. In this case it is also useful to give detailed comments on the presentation: which graphs need to be tweaked, which sentences are awkward, etc. If the paper needs to be rejected, then the purpose of the review is to convince the committee of this and also to help the authors understand where they took a wrong turn. In this case, detailed feedback about the presentation is probably not that useful. Alternatively, many papers at top conferences seem to be a bit borderline, and in this case the job of the reviewer is to provide as much actionable advice as possible to the authors about how to improve the work — this will be useful regardless of whether the paper is accepted or rejected.

I hope it is clear that I am not trying to help reviewers spend less total time reviewing. Rather, by adopting efficient reviewing practices, we can spend our time where it matters most. My observation is that the amount of time that computer scientists spend writing paper reviews varies tremendously. Some people spend almost no time at all whereas others produce reviews that resemble novellas. The amazing people who produce these reviews should embarrass all of us into doing a better job.

Update: Also see Shriram’s excellent notes about reviewing papers.

ALIVe: Automatic LLVM InstCombine Verifier

[This post was jointly written by Nuno Lopes, David Menendez, Santosh Nagarakatte, and John Regehr.]

A modern compiler is a big, complex machine that contains a lot of moving parts, including many different kinds of optimizations. One important class of optimization is peephole optimizations, each of which translates a short sequence of instructions into a more desirable sequence of instructions. For example, consider this LLVM code that first shifts an unsigned 32-bit value 29 bits to the left, then 29 bits to the right:

%1 = shl i32 %x, 29 
%2 = lshr i32 %1, 29

As long as %1 is not used anywhere else, this computation would be better written as:

%2 = and i32 %x, 7

Unsurprisingly, LLVM already knows how to perform this peephole optimization; the code implementing it can be found here.

Although most peephole transformations are pretty straightforward, the problem is that there are a lot of them, creating a lot of opportunities to make mistakes. Many of LLVM’s peephole optimizations can be found in its instruction combiner. According to Csmith, the instruction combiner was the single buggiest file in LLVM (it used to all be a single file) with 21 bugs found using random testing.

Wouldn’t it be nice if, instead of embedding peephole optimizations inside C++ code, they could be specified in a clearer fashion, and if bugs in them could be found automatically? These are some of the goals of a new project that we have been working on. So far, we have produced an early prototype of a tool called ALIVe that reads in the specification for one or more optimizations and then, for each one, either proves that it is correct or else provides a counterexample illustrating why it is wrong. For example, the optimization above can be written in ALIVe like this:

%1 = shl i32 %x, 29
%2 = lshr i32 %1, 29 
%2 = and i32 %x, 7

Each optimization that is fed to ALIVe has an input or left-hand side (LHS), before the =>, that specifies a pattern to look for in LLVM code. Each optimization also has an output or right-hand side (RHS) after the => that specifies some new LLVM code that has to refine the original code. Refinement happens when the new code produces the same effect as the old code for all inputs that do not trigger undefined behavior.

Here is what happens when the code above is provided to ALIVe:

$ < example1.opt  
Optimization: 1  
Precondition: true  
%1 = shl i32 %x, 29  
%2 = lshr i32 %1, 29 
%2 = and i32 %x, 7

Done: 1 
Optimization is correct!

(All of the example ALIVe files from this post can be found here.)

The proof is accomplished by encoding the meaning of the LHS and RHS in an SMT query and then asking a solver whether the resulting formula is satisfiable. The “Done: 1″ in the output means that ALIVe’s case splitter, which deals with instantiations of type variables, only had one case to deal with.

Of course, the optimization that we just specified is not a very good one: it handles only a single register width (32 bits) and a single shift amount (29 bits). The general form is a bit more interesting since the optimized code contains a constant not found on the LHS:

%1 = shl %x, C 
%2 = lshr %1, C 
%2 = and %x, (1<<(width(C)-C))-1

This also verifies as correct. To finish specifying this optimization, we also would want to support the case where the right shift comes before the left shift. At present, to do this in ALIVe you need to specify a second optimization rule, but in the future we will support transformations that are parameterized by lists of instructions.

Undefined Behavior

Undefined behavior makes optimizations much harder to think about. For example, let's look at PR20186, an LLVM wrong-code bug that we found while translating optimizations from the instruction combiner into ALIVe, where the optimization looks like this:

%a = sdiv %X, C 
%r = sub 0, %a 
%r = sdiv %X, -C

In other words, dividing an integer by a constant, and then negating the result, can be optimized into dividing the integer by the negated constant. The optimization is attractive since it reduces the number of instructions. It also looks reasonable at first glance since we all learned in school that -(X/C) is equal to X/(-C).

However, ALIVe is not happy with this optimization:

$ < example2.opt  
Optimization: 1  
Precondition: true  
%a = sdiv %X, C  
%r = sub 0, %a  
%r = sdiv %X, -C

Done: 1 
ERROR: Domain of definedness of Target is smaller than Source's for i2 %r

%X i2 = 2 (0x2) 
C i2 = 1 (0x1) 
%a i2 = 2 (0x2) 
Source value: 2 (0x2) 
Target value: undef

What's the problem? If C=1, then -C=-1. When %X=-2, %X/(-C) overflows, since 2 is not representable in a 2-bit signed representation (ALIVe gives counterexamples at the smallest bitwidth that is required to trigger an optimization bug). LLVM's language reference states that overflow in sdiv leads to undefined behavior. In summary, in the original code there was no undefined behavior when C=1, but the optimized version may be undefined in that case.

ALIVe is able to prove that a fixed version of this optimization is correct (but note that we've restricted the bitwidth since division is difficult for SMT solvers):

Optimization: 1
Precondition: ((C != 1) && !isSignBit(C))
%a = sdiv i16 %X, C
%r = sub 0, %a
%r = sdiv %X, -C

Done: 1
Optimization is correct!

And, in fact, the fix that was committed to LLVM's instruction combiner is equivalent to the one we've shown here.

A precondition specifies additional conditions beyond the occurrence of a pattern of instructions that must hold before the optimization is allowed to fire. The isSignBit() predicate is an LLVM function that tests whether a value is INT_MIN, where INT_MIN is the minimum integer value for a given bit-width. In general, it is not possible to call arbitrary LLVM code from ALIVe code, but we have reimplemented some commonly used functions, and will continue to add more of these as needed.

Another kind of undefined behavior bug that ALIVe can help find is incorrect propagation of the nsw/nuw flags which inform LLVM that an arithmetic operation is guaranteed to not overflow (nsw stands for "no signed wrap" and nuw is "no unsigned wrap"). For example, while translating instruction combiner patterns into ALIVe, we ran across PR20189 (which was independently discovered and then reported by another developer). This transformation wanted to convert x -nsw (-C) into x +nsw C, which is invalid; the optimization should only fire when both subtractions on the LHS are nsw.

So far, we have translated some of LLVM's instruction combiner (88 transformations, in total) into ALIVe; you can look at the results of that effort here. Each file in this directory corresponds to a file in LLVM's instruction combiner.

Design Goals

Tools based on formal methods are not always easy to use. One of our major design goals is that regular LLVM developers, and other interested people, can use ALIVe to explore new transformations and to evaluate the correctness of existing transformations. To meet this goal, we have designed ALIVe to look and act like the textual representation of LLVM IR. The major departure from LLVM is support for abstraction. As we saw above, optimizations must contain elements of abstraction in order to be effective in as wide a range of situations as possible. Syntactically, ALIVe supports abstraction via omission and via variables. For example, if you fail to specify the bit-width of a register, then ALIVe will assume that the optimization is intended to apply to all bit-widths. If an optimizations doesn't care about bitwidth but requires that two registers have the same width, then they can be given symbolic widths. Another aspect of usability is that ALIVe should clearly explain its reasoning not only in the case where something is wrong with an optimization, but also when an optimization is correct. We are still working out the details of this, but probably ALIVe will need to share some details about the results of its type inference. The reason that ALIVe must explain itself in the "correct optimization" case is to help avoid vacuously correct specifications, which are undesirable although not nearly as harmful as incorrect specifications.

The second major goal of ALIVe is to automate as many hard and tedious tasks as possible. Obviously, reasoning about the correctness of optimizations is one of these tasks. Another is inferring the placement of LLVM's undefined behavior qualifiers: nsw, nuw, and exact. Above, we saw an example where LLVM was erroneously optimistic about where such a flag needed to be placed. We believe that the instruction combiner contains many instances of the opposite error: undefined behavior flags are left out in cases where they could have been added. ALIVe will help find these, leading to more effective optimizations. Going forward, we also want to compile ALIVe specifications into efficient C++ code that could be included in LLVM. This has multiple benefits for LLVM developers: eliminating bugs in the instruction combiner and reducing the time to develop new instruction combiner patterns. Code review can then focus on whether the pattern is profitable, rather than worrying that it may be wrong.

Our third goal is to separate correctness from profitability. These are orthogonal concerns and neither is easy to deal with. Currently, this separation is accomplished by not reasoning about profitability at all. In some cases, the profitability of peephole optimizations is straightforward: if instructions can be removed, they should be. But in many other cases, transformations are desirable for a more subtle reason: they create conditions that are conducive to further optimization. Some of this is captured by canonicalization rules.

The Language

ALIVe abstracts away types. For example, the icmp instruction can either take a pair of integers of arbitrary bit-width or a pair of pointers. If you write an optimization that matches an icmp instruction, then ALIVe will try to prove the optimization correct for all possible integer bit-widths and pointer types, since an optimization may be correct for, say, i1 but not i8. Currently, however, ALIVe limits its proofs to bit-widths from 1 to 64 bits, and pointer sizes of 32 and 64 bits. Types may be named to express the constraint that a type is the same in different parts of an optimization.

ALIVe supports the specification of optimizations over several instruction types. Generally, anything that is left unspecified means that the optimization is intended to apply to all possible instantiations. For example, the following optimization encodes all possible choices of icmp comparisons and zero extensions:

%1 = icmp %n, %a 
%2 = zext %1 
%3 = icmp %n, %b 
%4 = zext %3 
%5 = or %2, %4 
%6 = icmp eq %5, 0 
%1 = icmp %n, %a 
%3 = icmp %n, %b 
%. = or %1, %3 
%6 = icmp eq %., 0

ALIVe assumes that identifiers matching the regex 'C\d*' are constants. This is not important for correctness, but it will be important once we start generating C++ code from the specifications.

Like LLVM, identifiers starting with % are unaliased temporaries that are likely to be mapped to registers by a backend. They hold arbitrary values (subject to preconditions) on the input side. If a temporary appears on both the RHS and the LHS, then the RHS must end up with the same value for the temporary as the LHS, for all valid inputs. A valid input is one that does not trigger undefined behavior in any instruction on the LHS. If a register appears only on the LHS or only on the RHS and it is not an input, then it is an intermediate value that imposes no correctness requirement. In practice, that value will be removed from the output if the number of users reaches zero.

What Next?

There's much left to do! We are still working on improving and extending the ALIVe language, as well as improving the performance of the tool and the quality of the diagnostics.

We welcome users and contributions from the LLVM community. We are looking for feedback regarding the specification language, the tool itself, how to best integrate ALIVe within the LLVM development process, and so on.

If you find a bug in ALIVe or find an LLVM bug using ALIVe or if you manage to verify an awesome optimization, please get in touch.


Although ALIVe is open source (Apache 2 license), it relies on the Z3 SMT solver. Z3 is licensed under the Microsoft Research License Agreement (MSR-LA), which forbids commercial usage. However -- after discussions with people at Microsoft -- it is our and Microsoft's understanding that using Z3 with ALIVe for the development of LLVM does not constitute commercial usage for the following reasons:

  1. LLVM is not a commercial product of any particular company.
  2. ALIVe is free.

Questions regarding Z3's license should be directed to Behrooz Chitsaz (, the Director of IP at Microsoft Research, who kindly offered to answer any question regarding the usage of Z3 within ALIVe. This statement does not constitute legal advice and it is not legally binding. Interested parties should seek professional consultation with an attorney.

Broads Fork

After moving to Utah I decided that regularly spending time in the mountains was one of the best ways to stay sane and healthy. Since I usually can’t make time for an all-day hike, I developed a habit getting up around 5, hiking hard for a couple of hours, and then getting into the office by 8:30 or 9. This was nice while it lasted but had to stop once I had kids. However, now that they’re a bit older, I hope to start doing early hikes again, at least occasionally.

One of my favorite trails for a quick hike is Broads Fork, which gains about 2000 feet over 2 miles, ending up at a pretty meadow with a small beaver pond. There never seem to be too many people here; the nearby Lake Blanche trail gets most of the traffic. The Broads Fork trailhead is about a 20 minute drive from the University of Utah or about 25 minutes from downtown SLC.

Finding Compiler Bugs by Removing Dead Code

I was pretty bummed to miss PLDI this year, it has been my favorite conference recently. One of the talks I was most interested in seeing was Compiler Validation via Equivalence Modulo Inputs by some folks at UC Davis. Although I had been aware of this paper (which I’ll call “the EMI paper” from now on) for a while, I was hesitant to write this post — the work is so close to my work that I can’t avoid having a biased view. So anyway, keep that in mind.

One of the things that makes testing hard is the necessity of oracles. An oracle tells us if a run of the software being tested was buggy or not. A cute idea that no doubt has been around for a long time, but which has more recently been called metamorphic testing, is to work around the oracle problem by taking an existing test case and changing it into a new test case whose output can be predicted. For example, if I have a test case (and expected answer) for a program that classifies a triangle as isosceles/scalene/right/etc., I can scale, translate, and rotate the triangle in my test case in order to create many new test cases that all have the same answer as the original one.

So how should one apply metamorphic testing to compilers? It’s not very hard to come up with bad ideas such as adding layers of parens, rewriting (x+y) to be (y+x), rewriting x to be (x+0), etc. The reason that these are bad ideas is that the changes will be trivially undone by the optimizer, resulting in poor testing of the optimizer logic. Some better ideas can be found in this paper on metamorphic compiler testing (IEEE paywall, sorry) which found a few GCC bugs.

The EMI paper is based on a particularly clever application of metamorphic testing where the program transformation is removal of dead code. Of course compilers know how to remove dead code, so how is this transformation any better than the bad ones I already mentioned? The idea is to remove dynamically dead code: code that is dead in some particular run. This kind of code is easy to detect using a code coverage tool. Of course this code may not be dead in other executions of the program, but this is fine: we’ll just need to be careful not to test the compiled program on anything other than the input that was originally used to do dead code discovery. So basically EMI works like this:

  1. Run a C program on whatever input we have sitting around, using compiler instrumentation to check which lines are executed.
  2. Create a new C program lacking randomly chosen pieces of code that did not execute in Step 1.
  3. Run the new program on the same input. Report a compiler bug if its output has changed.

Notice that the original C program had better not execute undefined behavior or rely on unspecified behavior or else we’ll get false positives.

The cleverness of EMI is not abstract or conceptual. Rather, EMI is clever because it works: at the time the paper was finalized the authors had reported 147 compiler bugs that were confirmed by developers, 110 of which have been fixed. This last number — fixed bugs — is the impressive one, since finding bugs that people care about enough to fix is generally a lot harder than just finding bugs.

The great thing about EMI is that it is a simple and extensible process. For example, it would not be hard to adapt the idea to C++. In contrast, random generation of a meaningful subset of C++11 is a project that we have been reluctant to start because we don’t yet know how to build this generator at an acceptable level of effort. Another easy extension to EMI would be adding or modifying dead code rather than simply deleting it. More generally, metamorphic compiler testing is probably an underused idea.

I was interested to read that the vast majority (all but four, it looks like) of the bugs discovered by EMI were triggered by mutated versions of Csmith programs. One reason that this is interesting is that since Csmith programs are “closed” — they take no inputs — the statically and dynamically dead code in such a program is precisely the same code. Therefore, an apparent advantage of using dynamic information — that it can remove code that is not dead in all executions — turns out to be a bit of a red herring. EMI works in this situation because the dead code elimination passes in compilers are not very precise.

An interesting question is: Why is Csmith+EMI so effective? One reason is that Csmith programs tend to contain a large amount of dead code, giving EMI a lot of room to play. It is just hard to generate expressive random code (containing lots of conditionals, etc.) that isn’t mostly dead, as far as we can tell. We’ve known this for a while and basically we don’t care — but we never guessed that it would turn out to be a hidden advantage.

Another problem with using EMI to mutate non-Csmith programs is that many real C programs execute undefined behaviors and even when they do not, it is generally difficult to verify that fact. Csmith, in contrast, has been somewhat co-designed with Frama-C such that the two tools work together with no additional effort. Automated undefined behavior detection is a crucial part of doing automated test-case reduction using C-Reduce.

One might ask: How useful is EMI when applied to C++ given that there is no C++smith? I look forward to learning the answer. The lack of a robust undefined behavior checker for C++ is another problem, although projects like LLVM’s UBsan are slowly chipping away at this.

The EMI authors say “… the majority of [Csmith’s] reported bugs were compiler crashes as it is difficult to steer its random program generation to specifically exercise a compiler’s most critical components—its optimization phases.” This doesn’t follow. The actual situation is subtle, but keep in mind that the entire purpose of Csmith is to exercise the compiler’s optimization phases. We spent years working on making Csmith good at this exact thing. We did in fact report more crash bugs than wrong code bugs but the real reasons are (1) we aggressively avoided duplicate bug reports by reporting only one wrong code bug at a time, and (2) wrong code bugs tend to be fixed much more slowly than crash bugs. In essence, the reasons that we reported fewer wrong code bugs than crash bugs are complex ones having more to do with social factors (and perhaps our own laziness) than to do with weaknesses of Csmith. Of course it might still be the case that EMI is better than Csmith at discovering middle-end optimizer bugs, but the EMI authors have not yet shown any evidence backing up that sort of claim. Finally, it is not necessarily useful to think of compiler crash bugs and wrong code bugs as being different things. The underlying bugs look much the same, the difference is often that in one case someone put the right assertion into the compiler (causing the inconsistency to be detected, leading to crash via assertion violation) and in the other case the inconsistency was undetected.

On a closely related note, after finishing this paper I was left asking: Given a limited testing budget, would it be better to run Csmith or to run Csmith+EMI? In other words, which method discovers either more bugs or higher-quality bugs? This would be a fun experiment to run, although there are some subtleties such as the fact that GCC and LLVM have (effectively) evolved to be somewhat resistant to Csmith, giving any new testing method an implicit advantage.

One thing about the EMI work that makes me selfishly happy is that over the last couple of years I’ve slacked off on reporting compiler bugs. This makes me feel guilty since Csmith continues to be capable of finding bugs in any given version of GCC or LLVM. Anyway, I feel like I’ve done my time here, so have fun with the bug reporting, guys!

In summary, EMI is very cool and anyone interested in compiler correctness should read the paper. It’s also worth thinking about how to apply its ideas (and metamorphic testing in general) to other application domains.

We Need Hardware Traps for Integer Overflow

Processors should support integer math instructions that optionally trap on overflow. Because popular architectures lack this feature, otherwise excellent modern systems programming languages, such as Rust, Go, and D, have default integer types that wrap. This is bad because unexpected wrapping causes programs to produce incorrect results, although of course integer overflow in a safe language is not the utter disaster that it is in C/C++, where integer overflow and type unsafety cooperate to create the worst kinds of bugs. The reason that Rust provides wrapping integers is that so far, the costs of a better semantics — both in terms of runtime overhead and in terms of implementation effort for the Rust team — exceed the perceived benefits. My belief is that hardware traps could change this inequation in a favorable way.

The lack of trapping math instructions doesn’t only hurt low-level languages. In JavaScript, where all numbers are semantically double-precision floats, replacing doubles with integer types in the runtime requires expensive software overflow checks. I’ve heard that this results in a 5-10% slowdown for basically all JavaScript code. In languages such as Python and Racket the default integer type overflows into bignums instead of doubles; Matthew Flatt tells me that Racket’s performance penalty due to software overflow checks probably exceeds 100% for jitted tight loops that do lots of integer math.

You might be saying to yourself: But I require wrapping integers to implement crypto codes and PRNGs and hash functions and stuff like that. The answer is easy: we provide wrapping operators that can be used in these specialized situations. One choice would be +., -., etc. In a unicode language we might use ⊞, ⊟, etc. (Jesse Ruderman suggested this, and I like the idea).

Architectures such as MIPS and Alpha support integer overflow traps. However, to a good approximation, the only architectures that matter right now are ARM’s and Intel’s. There are two issues in adding integer overflow traps to these ISAs. First, where do we get opcode space for the new trapping instructions? For x86 and x86-64, which support an elaborate system of instruction prefixes, it may make sense to use that mechanism, although this comes with a code size penalty. ARM has fixed-size instructions so finding space may be trickier there. A mail to a friend at ARM on this topic has so far gone unanswered, but I am guessing that this could be shoehorned into ARMv9 somehow. The second issue is the effect of integer overflow traps on chip area and critical path length. An experienced architect who I talked to doesn’t think these are serious problems, and in any case the complexity is dramatically lower than the complexity of implementing something like hardware support for array bounds checking.

This post isn’t as much of an opinion piece as a plea to the folks at ARM, Intel, and AMD: Please provide this feature. It is needed in order to make high-level languages faster and low-level languages saner.

UPDATE: Some data about the overhead of software integer overflow checking in C/C++ codes can be found in section IIID of this paper. Please keep in mind, however, that this implementation is tuned for debugging not performance. A highly tuned software integer undefined behavior checker for C/C++ could probably have overhead in the 5% range.

A Guide to Better Scripty Code for Academics

[Suresh suggested that I write a piece about unit testing for scripty academic software, but the focus changed somewhat while I was writing it.]

Several kinds of software are produced at universities. At one extreme we have systems like Racket and ACL2 and HotCRP that are higher quality than most commercial software. Also see the ACM Software System Award winners (though not all of them came from academia). I wrote an earlier post about how hard it is to produce that kind of code.

This piece is about a different kind of code: the scripty stuff that supports research projects by running experiments, computing statistics, drawing graphs, and that sort of thing. Here are some common characteristics of this kind of code:

  • It is often written in several different programming languages; for example R for statistics, Matplotlib for pretty pictures, Perl for file processing, C/C++ for high performance, and plenty of shell scripts and makefiles to tie it all together. Code in different languages may interact through the filesystem and also it may interact directly.
  • It seldom has users outside of the research group that produced it, and consequently it usually embeds assumptions about its operating environment: OS and OS version, installed packages, directory structure, GPU model, cluster machine names, etc.
  • It is not usually explicitly tested, but rather it is tested through use.

The problem is that when there aren’t any obvious errors in the output, we tend to believe that this kind of code is correct. This isn’t good, and it causes many of us to have some legitimate anxiety about publishing incorrect results. In fact, I believe that incorrect results are published frequently (though many of the errors are harmless). So what can we do? Here’s a non-orthogonal list.

Never Ignore Problems

Few things in research are worse than discovering a major error way too late and then finding out that someone else had noticed the problem months earlier but didn’t say anything. For example we’ll be tracking down an issue and will find a comment in the code like this:

  # dude why do I have to mask off the high bits or else this segfaults???

Or, worse, there’s no comment and we have to discover the offending commit the hard way — by understanding it. In any case, at this point we pull out our hair and grind our teeth because if the bug had been tracked down instead of hacked around, there would have been huge savings in terms of time, energy, and maybe face. As a result of this kind of problem, most of us have trained ourselves to be hyper-sensitive to little signs that the code is horked. But this only works if all members of the group are onboard.

Go Out of Your Way to Find Problems

Failing to ignore problems is a very low bar. We also have to actively look for bugs in the code. The problem is that because human beings don’t like being bothered with little details such as code that does not work, our computing environments tend to hide problems by default. It is not uncommon for dynamically and weakly typed programming languages to (effectively) just make up crap when you do something wrong, and of course these languages are the glue that makes everything work. To some extent this can be worked around by turning on flags such as -Wall in gcc and use warningsuse strict; in Perl. Bugs that occur when crossing layers of the system, such as calling into a different language or invoking a subprocess, can be particularly tricky. My bash scripts became a lot less buggy once I discovered the -e option. Many languages have a lint-like tool and C/C++ have Valgrind and UBSan.

One really nice thing about scripty research code is that there’s usually no reason to recover from errors. Rather, all dials can be set to “fail early, fail fast” and then we interactively fix any problems that pop up.

The basic rule is that if your programming environment supports optional warnings and errors, turn them all on (and then maybe turn off the most annoying ones). This tends to have a gigantic payoff in terms of code quality relative to effort. Also, internal sanity checks and assertions are worth their weight in gold.

Fight Confirmation Bias

When doing science, we formulate and test hypotheses. Although we are supposed to be objective, objectivity is difficult, and there’s even a term for this. According to Wikipedia:

Confirmation bias is the tendency of people to favor information that confirms their beliefs or hypotheses.

Why is this such a serious problem? For one thing, academia attracts very smart people who are accustomed to being correct. Academia also attracts people who prefer to work in an environment where bad ideas do not lead to negative economic consequences, if you see what I mean. Also, our careers depend on having good ideas that get good results. So we need our ideas to be good ones — the incentives point to confirmation bias.

How can we fight confirmation bias? Well, most of us who have been working in the field for more than a few years can easily bring to mind a few examples where we felt like fools after making a basic mistake. This is helpful in maintaining a sense of humility and mild research paranoia. Another useful technique is to assume that the people doing previous work were intelligent, reasonable people: if implementing their ideas does not give good results, then maybe we should figure out what we did wrong. In contrast, it is easy to get into the mindset that the previous work wasn’t very good. Evidence of this kind of thinking can be seen in the dismissive related work sections that one often sees.

Write Unit Tests

Modern programming languages come with good unit testing frameworks and I’ve noticed that the better students tend to instinctively write unit tests when they can. In contrast, us old fogies grew up as programmers long before the current testing culture developed and we have a harder time getting ourselves to do this.

But does unit testing even make sense for scripty code? In many cases it clearly doesn’t. On the other hand, Suresh gives the example where they are comparing various versions of an algorithm; in such a situation we might be able to run various data sets through all versions of the algorithm and make sure their results are consistent with each other. In other situations we’re forced to re-implement a statistical test or some other piece of fairly standard code; these can often be unit tested using easy cases. Mathematical functions often have properties that support straightforward smoke tests. For example, a function that computes the mean or median of a list should compute the same value when fed the same list twice.

Write Random Testers

It is often the case that an API that can be unit tested can also be fuzzed. Two things are required: a test-case generator and an oracle. The test-case generator can do something easy like randomly shuffling or subsetting existing data sets or it can make up new data sets from scratch. The oracle decides whether the code being tested is behaving correctly. Oracles can be weak (looking for crashes) or strong (looking for correct behavior). Many modern programming languages have a QuickCheck-like tool which can make it easier to create a fuzzer. This blog post and this one talk about random testing (as do plenty of others, this being one of my favorite subjects).

Clean Up and Document After the Deadline

As the deadline gets closer, the code gets crappier, including the 12 special cases that are necessary to produce those weird graphs that reviewer 2 wants. Cleaning this up and also documenting how the graphs for the paper were produced is surely one of the best investments we could make with our time.

Better Tooling

Let’s take it as a given that we’re doing code reviews, using modern revision control, unit testing frameworks, static and dynamic analysis tools, etc. What other tool improvements do we want to see? Phil Guo’s thesis has several examples showing how research programming could be improved by tools support. There’s a lot of potential for additional good work here.


There are plenty of easy ways to make scripty research code better. The important thing is that the people who are building the code — usually students — are actually doing this stuff and that they are receiving proper supervision and encouragement from their supervisors.

Early Superoptimizer Results

[Here’s a shortcut to the results. But it would be best to read the post first.]

Following my previous superoptimizer post, my student Jubi and I were getting up to speed on the prerequisites — SMT solvers, LLVM internals, etc. — when Googler Peter Collingbourne contacted me saying that he had recently gotten a superoptimizer up and running and might I be interested in working with him on it? I read his code and found it to be charmingly clear and simple. Also, one of my basic principles in doing research is to avoid competing, since competition wastes resources and burns students because the resulting race to publication effectively has an arbitrary winner. So I immediately started feeding bug reports to Peter.

The new superoptimizer, Souper, makes a few simplifying assumptions:

  • The only optimization candidates that it considers are the true and false values. Therefore, at present Souper only harvests expressions that compute an i1: a one-bit integer, which is how Booleans are represented in LLVM. Thus, the result of a Souper run is a collection of expressions that LLVM could have — but did not — evaluate to either true or false.
  • It doesn’t yet have models for all instructions or for all undefined behaviors for the instructions it does support.

These assumptions need to be relaxed. One generalization that should be pretty easy is to harvest expressions that end up as integers of arbitrary width. The interesting thing about this is that we cannot take time to check if every harvested expression evaluates to, for example, every possible value that an i32 can take. What we will do instead is to ask the SMT solver to synthesize the equivalent constant. The problem is that by default, when we make an equivalence query to an SMT solver, it is an unsat result that signals equivalence, and unsat doesn’t come with a model — it indicates failure to find a model. It turns out there’s a cute trick (which I learned from Nuno Lopes) involving a quantifier which flips a query around such that an equivalence results in sat, and therefore a model, from which we can pluck the synthesized constant. Consider this Z3/Python code where we’re asking, for a variety of constants c, how to express i*c (where i is an integer variable) in the form i<<x + i<<y + i<<z:

from z3 import *

s = Solver()

def checkit (c):
    i, x, y, z = BitVecs('i x y z',32)
    q = ForAll (i, i*c == ((i<<x) + (i<<y) + (i<<z)))
    s.add(x>=0, x<32)
    s.add(y>=0, y<32)
    s.add(z>=0, z<32)
    if s.check() == sat:
        m = s.model()
        print ("i * " + str(c) + 
               " == i<<" + str(m.evaluate(x)) + 
                " + i<<" + str(m.evaluate(y)) +
                " + i<<" + str(m.evaluate(z)))
        print "i * " + str(c) + " has no model"

for m in range(100):

This is just an example but it’s the kind of thing that might make sense on a small embedded processor where the integer multiply instruction is expensive or doesn’t exist. The results include:

i * 28 == i<<4 + i<<3 + i<<2
i * 29 has no model
i * 30 has no model
i * 31 has no model
i * 32 == i<<4 + i<<3 + i<<3
i * 33 == i<<4 + i<<4 + i<<0

The full set of results is here. I particularly enjoyed the solver's solutions for the first three cases. So we know that the synthesis part of a superoptimizer is possible and in fact probably not all that difficult. But that's a digression that we'll return to in a later post; let's get back to the main topic.

Now I'll show you how to read Souper's output. You may find it useful to keep the LLVM instruction set reference handy. Here's an optimization report:

%0:i32 = var 
%1:i32 = mul 4294967294:i32, %0
%2:i1 = eq 1:i32, %1
cand %2 0:i1

The first line tells us that %0 has type i32 -- a 32-bit integer -- corresponding to a signed or unsigned int in C/C++, and that it is a variable: an input to the superoptimized code that may hold any value. Reasoning about any-valued variables is hard but solvers are good at it and that is the entire point of the superoptimizer.

The second line tells us that %1 is a new i32 computed by multiplying %0 by -2. The third line tells us that %2 is a new i1 -- a Boolean or 1-bit integer -- computed by seeing if %1 is equal to 1. The last line, starting with "cand", is Souper telling us that it believes %2 will always take the value 0. If Souper tells us this when running on optimized code, it has found a missed optimization. In this case LLVM has missed the fact that multiplying an arbitrary value by an even number can never result in an odd number. Is this a useful optimization to implement in LLVM? I don't know, but GCC does it, see the bottom of this page.

Souper finds many missed optimizations that fit this general pattern:

%0:i32 = var 
%1:i64 = sext %0
%2:i64 = sdiv 2036854775807:i64, %1
%3:i1 = ne 0:i64, %2
cand %3 1:i1

Here the observation is that if we divide a large constant by an arbitrary 32-bit value, the result cannot be zero. GCC does not find this one.

Some Souper output contains path constraints:

%0:i32 = var 
%1:i1 = eq 0:i32, %0
pc %1 1:i1
%2:i32 = addnsw 1:i32, %0
%3:i1 = slt %2, 2:i32
cand %3 1:i1

Here, at line 3, we learn that %1 must take the value 1 in the remaining code due to a path constraint. In the original LLVM code there was a conditional branch exiting if %1 had the value 0. Since %1 has the value 1, we can infer, in the remaining code, that %0 contains 0. Thus, %2 contains 1 and the expression %2 < 2 must evaluate to true.

Finally, this charming example exploits the fact that if the product of two numbers is not zero, then neither of the numbers could have been zero:

%0:i32 = var 
%1:i32 = var 
%2:i32 = mul %0, %1
%3:i1 = eq 0:i32, %2
pc %3 0:i1
%4:i1 = eq 0:i32, %0
cand %4 0:i1

One more thing that you might see in the full set of results is an entry like this: %0 = block. This means (more or less) that %0 is a value that is going to pass through the code to a phi node without being otherwise used, this is useful for increasing Souper's precision.

I think that's about all you need to know in order to read the full set of results from a couple days of running Csmith, Souper, and C-Reduce in a loop. First, we wait for Souper to find a missed optimization and then second, we find a minimal C program that exhibits the missed optimization. The results have been ranked in a way that attempts to push more similar results (that are more likely to be duplicates) lower in the list.

So far, the most common pattern that comes out of Souper's findings is that LLVM needs an integer range analysis. Such an analysis would also help eliminate integer overflow checks, one of my hobby horses. LLVM also doesn't always propagate information that would be best represented at the bit level, such as the even/odd distinction required for the first optimization that I discussed. Finally, LLVM does not always learn from branches. My not-necessarily-educated guess is that all of this is a symptom of LLVM's heavy reliance on the instruction combiner, which is not so much an optimization pass as a loose federation of a few hundred peephole passes.

Some of the missing LLVM optimizations won't be hard to implement for people have passable C++ and who have spent some time becoming familiar with the instruction combiner. But here are a few things we need to keep in mind:

  • One might ask: Does it make sense to harvest missed optimizations from randomly generated code? My initial idea was that since Csmith's programs are free from undefined behaviors, the resulting optimizations would be less likely to be evil exploitation of undefined behaviors. But also I did it because it was easy and I was curious what the results would look like. My judgement is the the results are interesting enough to deserve a blog post. Perhaps an easier way to avoid exploiting undefined behavior would be to add a command line option telling Souper to avoid exploiting undefined behaviors.

  • For each missed optimization we should do a cost/benefit analysis. The cost of implementing a new optimization is making LLVM a bit bigger and a bit more likely to contain a bug. The benefit is potential speedup of code that contains the idioms.

  • Although the reduced C programs can be useful, you should look at Souper output first and the C code second. For one thing, the Boolean that Souper finds is sometimes a bit obscured in the C code. For another, the C-Reduce output is somewhat under-parenthesized -- it will test your knowledge of C's operator precedence rules. Finally C-Reduce has missed some opportunities to fold constants, so for example we see ~1 instead of -2 in the 2nd example from the top.

  • Each missed optimization found by Souper should be seen as a member of a class of missed optimizations. So the goal is obviously not to teach LLVM to recognize the specific cases found by Souper, but rather to teach it to be smart about some entire class of optimizations. My belief is that this generalization step can be somewhat automated, but that is a research problem.

  • Although all of the optimizations that I've looked at are correct, there's always the possibility that some of them are wrong, for example due to a bug in Souper or STP.

This article presents some very early results. I hope that it is the beginning of a virtuous circle where Souper and LLVM can both be strengthened over time. It will be particularly interesting to see what kinds of optimizations are missing in LLVM code emitted by rustc, GHC, or llgo.

UPDATE: Here are some bugs that people have filed or fixed in response to these results:

It's very cool that people are acting on this! Please let me know if you know of more results than are listed here.

This Is Not a Defect

In several previous blog entries I’ve mentioned that in some recent versions of C and C++, left-shifting a 1 bit into the high-order bit of a signed integer is an undefined behavior. In other words, if you have code that computes INT_MIN by evaluating 1<<31 (or 1<<(sizeof(int)*CHAR_BIT-1) if you want to be that way) or code that does a byte swap using a subexpression like c<<24, then your program most likely has no meaning according to the standards. And in fact, Clang's integer sanitizer confirms that most non-trivial codes, including several crypto libraries, are undefined according to this rule.

An obvious fix is to tighten up the standard a bit and specify that shifting a 1 into the sign bit has the expected effect, which is what all compilers that I am aware of already do. This is what the C++ standards committee is doing, though as far as I can tell the fix doesn't officially take effect until a TC, a "technical corrigendum," is issued -- and even that doesn't finalize the thing, but it seems near enough in practice.

Anyhow, today Nevin Liber pointed out that there's a bit of news here, which is that last month the C standards committee decided that this same issue is not a defect in C and that they'll reconsider it later on, which I guess is fine since compiler implementers are ignoring this particular undefined behavior, but it seems like a bit of a missed opportunity to (1) make the language slightly saner and (2) bring it into line with the existing practice. Also you might consider perusing the full set of defect reports if you want to be thankful that you did something other than attend a standards meeting last month.