Synthesizing Constants

(See this blog post for a short introduction to synthesis, or this paper for a long one.)

In this piece I want to discuss an aspect of program synthesis that sounds like it should be easy, but isn’t: synthesizing constant values. For example, consider trying to synthesize an optimized x86-64 implementation of this code:

long foo(long x) {
    return x / 52223;
}

The desired output is something like:

foo(long):
        movabs  rdx, -6872094784941870951
        mov     rax, rdi
        imul    rdx
        lea     rax, [rdx+rdi]
        sar     rdi, 63
        sar     rax, 15
        sub     rax, rdi
        ret

Assume for a moment that we know that this sequence of instructions will work, and we only lack the specific constants:

foo(long):
        movabs  rdx, C1
        mov     rax, rdi
        imul    rdx
        lea     rax, [rdx+rdi]
        sar     rdi, C2
        sar     rax, C3
        sub     rax, rdi
        ret

We need to find a 64-bit constant and two 6-bit constants such that the assembly code returns the same result as the C code for every value of the function parameter. Of course there’s a well-known algorithm that GCC and LLVM use to compute these constants, but the synthesizer doesn’t have it: it must operate from first principles.
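
Before worrying about how to find these constants, it helps to be precise about what it means for a choice of constants to “work.” Here is a rough C++ sketch (the function names are invented for illustration and are not part of any synthesizer) of the skeleton’s semantics with candidate constants plugged in, spot-checked against the specification; a real verifier would use a solver to establish equivalence for every input rather than testing a handful of them:

#include <cstdint>
#include <cstdio>

// Semantics of the instruction skeleton with candidate constants plugged in.
// "imul rdx" forms the full 128-bit signed product; rdx holds the high half.
// Right shifts of negative values are assumed to be arithmetic, as they are
// on mainstream compilers (and as "sar" is by definition).
static int64_t candidate(int64_t x, int64_t C1, int C2, int C3) {
  __int128 prod = (__int128)x * (__int128)C1;
  int64_t hi = (int64_t)(prod >> 64); // rdx after imul
  int64_t t = hi + x;                 // lea rax, [rdx+rdi]
  return (t >> C3) - (x >> C2);       // sar rax, C3; sar rdi, C2; sub
}

static int64_t spec(int64_t x) { return x / 52223; }

int main() {
  const int64_t C1 = -6872094784941870951LL;
  const int C2 = 63, C3 = 15;
  const int64_t tests[] = {0, 1, -1, 52222, 52223, -52224, INT64_MAX, INT64_MIN};
  for (int64_t x : tests) {
    if (candidate(x, C1, C2, C3) != spec(x)) {
      printf("mismatch at %lld\n", (long long)x);
      return 1;
    }
  }
  return 0;
}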

An even more difficult example is a 64-bit Hamming weight computation, where (assuming we don’t want to use vector instructions) we might hope to synthesize something like this:

popcount64(unsigned long):
        movabs  rdx, 6148914691236517205
        mov     rax, rdi
        shr     rax
        and     rax, rdx
        movabs  rdx, 3689348814741910323
        sub     rdi, rax
        mov     rax, rdi
        shr     rdi, 2
        and     rax, rdx
        and     rdi, rdx
        add     rdi, rax
        mov     rax, rdi
        shr     rax, 4
        add     rax, rdi
        movabs  rdi, 1085102592571150095
        and     rax, rdi
        movabs  rdi, 72340172838076673
        imul    rax, rdi
        shr     rax, 56
        ret

Again assuming that we know that this sequence of instructions will work, we still need to come up with 280 bits worth of constant values. One can imagine even more challenging problems where we want to synthesize an entire lookup table.

The running example I’ll use in this post is much easier:

uint32_t bar1(uint32_t x) {
  return ((x << 8) >> 16) << 8;
}

The code we want to synthesize is:

uint32_t bar2(uint32_t x) {
  return x & 0xffff00;
}

That is, we want to find C in this skeleton:

uint32_t bar3(uint32_t x) {
  return x & C;
}

Basic Constant Synthesis

We need to solve an “exists-forall” problem: does there exist a C such that the LHS (left-hand side, the original code) is equivalent to the RHS (right-hand side, the optimized code) for all values of x? In other words:

∃C. ∀x. LHS(x) = RHS(x, C)

(Here we’ll play fast and loose with the fact that in the real world we check refinement instead of equivalence — this doesn’t matter for purposes of this post.)

The specific formula that we want an answer for is:

∃C. ∀x. ((x << 8) >> 16) << 8 = x & C

A SAT solver can natively solve either an exists query (this is the definition of SAT) or else a forall query (by seeing if there exists a solution to the negated proposition). But, by itself, a SAT solver cannot solve an exists-forall query in one go. An SMT solver, on the other hand, can natively attack an exists-forall query, but in practice we tend to get better results by doing our own quantifier elimination based only on SAT calls. First, we ask the solver if there exists a C and x that make the LHS and RHS equivalent. If not, synthesis fails. If so, we issue a second query to see if C works for all values of x. If so, synthesis succeeds. If not, we add a constraint that this value of C doesn’t work, and we start the process again. The problem is that each pair of queries only rules out a single choice for C, making this process equivalent, in the worst case, to exhaustive guessing, which we cannot do when C is wide. We need to do better.
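
To make the guess-and-check loop concrete, here is a minimal sketch for the running example, written against Z3’s C++ API (z3++.h). It illustrates the naive scheme just described, not Souper’s actual implementation:

#include <z3++.h>
#include <iostream>

int main() {
  z3::context ctx;
  z3::expr x = ctx.bv_const("x", 32);
  z3::expr C = ctx.bv_const("C", 32);

  // LHS: ((x << 8) >> 16) << 8, using a logical right shift since x is unsigned.
  auto lhs = [&](z3::expr v) { return z3::shl(z3::lshr(z3::shl(v, 8), 16), 8); };

  // Exists query: is there a C (and some x) that makes the two sides agree?
  z3::solver guess(ctx);
  guess.add(lhs(x) == (x & C));

  while (guess.check() == z3::sat) {
    z3::expr cand = guess.get_model().eval(C, true);

    // Forall query, phrased as unsatisfiability of the negation: is there
    // any x for which this candidate C fails?
    z3::solver verify(ctx);
    verify.add(lhs(x) != (x & cand));
    if (verify.check() == z3::unsat) {
      std::cout << "synthesized C = " << cand << "\n";
      return 0;
    }
    // Rule out just this one candidate and try again; in the worst case this
    // degenerates into exhaustive guessing.
    guess.add(C != cand);
  }
  std::cout << "synthesis failed\n";
  return 1;
}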

Reducing the Number of Counterexamples using Specialization

Souper uses a technique that appears to be common knowledge among people who do this kind of work; unfortunately I don’t know where it was first published (Section 4.2 of this paper is one place it has appeared). The trick is simple: we choose a few values of x and use them to specialize both the LHS and RHS of the equation; you can think of this as borrowing some constraints from the forall phase and moving them into the exists phase (which becomes an “exists-forsome” query, if you will). In many cases this adds enough extra constraints that the solver can arrive at a workable constant within a few iterations. If we specialize x with 0 and -1 then in the general case we get:

∃C. LHS(0) = RHS(0, C) ∧ LHS(-1) = RHS(-1, C) ∧ ∀x. LHS(x) = RHS(x, C)

After constant folding, our running example becomes:

∃C. 0 = (0 & C) ∧ 0x00FFFF00 = (0xFFFFFFFF & C) ∧ ∀x. ((x << 8) >> 16) << 8 = x & C

The first choice, 0, turns out to be an unlucky one: after further simplification it comes out to “0 = 0”, giving the solver no extra information. The second choice, on the other hand, is extremely lucky: it can be rewritten as “0x00FFFF00 = C” which simply hands us the answer. In most cases things won’t work out so nicely, but specializing the LHS and RHS with fixed inputs helps enormously in practice.
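
In terms of the Z3 sketch above, specialization just amounts to seeding the “guess” solver with specialized copies of the equation before entering the loop. A hypothetical continuation of that sketch:

// Specialize the equation at x = 0 and x = -1 (0xFFFFFFFF) and add those
// copies to the "guess" solver; Z3 constant-folds the concrete LHS values.
z3::expr zero = ctx.bv_val(0, 32);
z3::expr ones = ctx.bv_val(-1, 32);
guess.add(lhs(zero) == (zero & C)); // folds to 0 = 0: no extra information
guess.add(lhs(ones) == (ones & C)); // folds to 0x00FFFF00 = C: the answer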

Some open questions remain: How many specific values should we try? How should these values be chosen? An obvious constraint is that the values chosen should be as different from each other as possible, though it isn’t clear what distance metric should be used. A related but less obvious criterion (this is Nuno Lopes’s idea, and we haven’t tried it out yet in Souper) is that the values should exercise as many different behaviors of each instruction as possible. For example, an addition instruction should have example inputs that overflow and also inputs that don’t. This requirement can be satisfied syntactically for instructions that are close to the inputs, but solver calls (or other symbolic methods) will be required to reach instructions that consume the outputs of other instructions. (In Souper’s case, solver calls are required anyway if there are any path conditions, which place arbitrary restrictions on the input space.) It also seems possible that we could tailor the input values to maximize how much of the RHS folds away (the LHS always folds completely down to a constant, of course).

Synthesis at Reduced Bitwidth

Instead of solving a difficult synthesis problem, we can instead solve a problem that is structurally identical, but uses narrower bitvectors (even two or three bits are often enough to capture the essence of a computation). This results in far better solver performance and, typically, a narrow synthesis result that does not contain constants will also work at the original, wider bitwidth (though obviously this needs to be verified by the solver). When the narrow result has constants, there’s a problem: we lack a principled method for deriving the wider constants from the narrow ones. One approach is to use heuristics. This paper suggests the following methods:

Here you should read BV(x,y) as “a bitvector of length y holding value x.” So, for example, rule 1 would extend an 8-bit variable containing 8 to a 16-bit variable containing 16. Rule 4 would extend an 8-bit variable containing 8 to a 16-bit variable containing 8. These seem reasonable, but notice that none of them would help with our running example.
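
To get a feel for why narrow problems are so much easier, here is a rough sketch (not Souper code) that brute-forces the running example after scaling it down to 8 bits, with the shift amounts scaled from 8/16/8 down to 2/4/2. At this width both the constant and the input space can be enumerated exhaustively, but the answer it finds, 0x3C, by itself tells us nothing about how to recover the 32-bit constant 0x00FFFF00:

#include <cstdint>
#include <cstdio>

// 8-bit analog of the running example: the shift amounts 8/16/8 at 32 bits
// become 2/4/2 at 8 bits.
static uint8_t lhs8(uint8_t x) {
  uint8_t a = (uint8_t)(x << 2);
  uint8_t b = (uint8_t)(a >> 4);
  return (uint8_t)(b << 2);
}

int main() {
  for (unsigned c = 0; c <= 0xff; c++) {
    bool works = true;
    for (unsigned x = 0; x <= 0xff && works; x++)
      works = lhs8((uint8_t)x) == ((uint8_t)x & (uint8_t)c);
    if (works)
      printf("C = 0x%02x works at 8 bits\n", c); // only 0x3c works
  }
  return 0;
}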

In Souper we have the additional problem that the code we’re trying to optimize usually contains constants, presenting us with a constant-narrowing problem that is analogous to the constant-widening problem we just discussed, but worse, because now any dodgy heuristics we come up with sit in front of the expensive synthesis process instead of behind it. It isn’t clear to us that there’s any satisfactory solution to this problem.

Getting More out of Counterexamples

In standard constant synthesis, each failed guess only adds the constraint that that particular constant doesn’t work — this fails to make much of a dent in an exponential search space. This paper and this one propose extracting additional information from each counterexample, cutting out a larger part of the search space. This also seems like an excellent direction.

Symbolic Constants

An alternate approach that avoids both narrowing and widening problems would be to treat constants symbolically. This, however, creates two new problems: deriving preconditions for optimizations and deriving functions to compute RHS constants in terms of elements of the LHS (constants, widths, etc.). It seems clear that in some cases this will be very difficult, for example imagine trying to derive, from first principles, the algorithm for computing the output constants for an arbitrary value of the LHS constant C:

long foo(long x) {
    return x / C;
}

Alive-Infer helps derive preconditions but it does not yet try to come up with the functions for computing constants in the synthesized output.

Since in some use cases (including my “destroy InstSimplify and InstCombine” dream) we want the generalized optimization anyway, this seems like a great direction for future work.

Conclusion

All of the techniques described in this post could be implemented inside the solver or outside it. Solvers like CVC4 already have sophisticated internal support for synthesis. The advantage of pushing more work into the solver is that it can exploit internal levers to guide the search, reuse data structures across synthesis iterations, and do other clever things that aren’t exposed by the solver’s API. The disadvantage is that we might have high-level information about the problem domain that is difficult to communicate effectively to the solver, that would help it do its job better.

In summary, constant synthesis is a hard problem that hasn’t received a lot of attention yet. My group is actively working on this and we have some ideas that aren’t mature enough to share yet. Please drop me a line if you know of any good techniques that I’ve missed here. Also, I’d be interested to hear from anyone working on benchmarks for constant synthesis.

Finally, tools such as Rosette and Sketch know how to synthesize constants.

Learning When Values are Changed by Implicit Integer Casts

C and C++ perform implicit casts when, for example, you pass an integer-typed variable to a function that expects a different type. When the target type is wider, there’s no problem, but when the target type is narrower, or when it is the same size but has the opposite signedness, integer values may silently change when the type changes. For example, this program:

#include <iostream>

int main() {
  int x = -3;
  unsigned y = x;
  std::cout << y << "\n";
}

prints 4294967293. Like unsigned integer wraparound (and unlike signed integer overflow) these changes of value are not undefined behavior, but they may be unintentional and may also be bugs. Clang recently gained support for dynamically detecting these value changes and either emitting a diagnostic or else terminating the program. Some fine-grained flags are available for controlling these diagnostics, but you can enable all of them (plus some others, such as signed overflow and unsigned wraparound checks) using -fsanitize=integer.

Now we get:

$ clang++ implicit2.cpp -fsanitize=integer
$ ./a.out 
implicit2.cpp:5:16: runtime error: implicit conversion from type 'int' of value -3 (32-bit, signed) to type 'unsigned int' changed the value to 4294967293 (32-bit, unsigned)
4294967293
$ 

Here’s a truncation example:

int main() {
  int x = 300;
  unsigned char c = x;
}

It gives:

$ clang++ implicit3.cpp -fsanitize=integer
$ ./a.out 
implicit3.cpp:3:21: runtime error: implicit conversion from type 'int' of value 300 (32-bit, signed) to type 'unsigned char' changed the value to 44 (8-bit, unsigned)
$

To suppress the diagnostic, we can make the conversion explicit using a cast:

int main() {
  int x = 300;
  unsigned char c = (unsigned char)x;
}

Different parts of this functionality landed in Clang before and after the 7.0.0 release — to get everything, you will want to build a Clang+LLVM from later than 1 Nov 2018 (or else wait for the 8.0.0 release in a few months).

How would you use these checks? Generally, they should be part of a testing campaign, perhaps in support of a code audit. If you enable them on a non-trivial code base, you will run into diagnostics that do not correspond to bugs, just because real C and C++ programs mix integer types so freely. I suggest starting with -fsanitize=implicit-integer-truncation. Let’s look at doing this to Clang+LLVM itself and then compiling a hello world program: here’s the output.

I’d be happy to hear about any interesting bugs located using these checks, if anyone wants to share.

Finally, see this tweet by Roman Lebedev (the author of these checks) showing that the runtime impact of -fsanitize-trap=implicit-conversion is very low.

What’s the difference between an integer and a pointer?

(This piece is an alternate introduction and advertisement for a soon-to-be-published research paper.)

In an assembly language we typically don’t have to worry very much about the distinction between pointers and integers. Some instructions happen to generate addresses whereas others behave arithmetically, but underneath there’s a single data type: bitvectors. At the opposite end of the PL spectrum, a high-level language won’t offer opportunities for pointer/integer confusion because the abstractions are completely firewalled off from each other. Also, of course, a high-level language may choose not to expose anything that resembles a pointer.

The problematic case, then, is the performance-oriented systems programming language: C, C++, Rust, and a few others. The issue is that:

  1. When possible, they pretend to be high-level, in order to justify optimizations that would be unsound at the assembly level. For example, they assume that a pointer derived from one heap allocation cannot alias a pointer derived from a different heap allocation.
  2. When necessary, they allow the programmer to manipulate pointers in low-level ways, such as inspecting the representation of a pointer, creating a new pointer at some offset from an existing one, storing unrelated data in unused parts of a pointer, and even fabricating a pointer out of whole cloth.

These goals conflict: the better a language is for justifying high-level optimizations, the worse it seems to be for implementing an OS kernel, and vice versa. There are a few ways we might deal with this:

  • We could require the high-level language to call out to low-level code written in a different language, as Java does with its JNI.
  • We could create a special low-level mode for a safe language where it can break the normal rules, as Rust does with its unsafe code.
  • We could go ahead and freely mix high-level and low-level code, hoping that the compiler can sort it all out, as C and C++ do.

Let’s look at an example:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi + diff);
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

When run (on my Mac) it gives:

$ clang-6.0 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7fcb0ec00300, *x = 7
y = 0x7fcb0ec00360, *y = 5
p = 0x7fcb0ec00300, *p = 7
$ gcc-8 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7f93b6400300, *x = 7
y = 0x7f93b6400360, *y = 5
p = 0x7f93b6400300, *p = 7

What happened here? First, we grabbed a couple of heap cells and initialized their contents. Second, we used integer math to fabricate a pointer p that has the same value as x, and stored through that pointer. Third, we asked the compiler what contents it thinks are in the locations pointed to by x, y, and p. Both GCC and Clang/LLVM are of the opinion that the store through p changed the value referred to by x. This is probably the result that we had hoped for, but let’s look into it a bit more deeply.

First, does this C program execute undefined behavior? It does not! Computing out-of-bounds pointers is UB but we are free to do whatever we like with integer-typed values that were derived from pointers. Second, does the C standard mandate the behavior that we saw here? It does not! In fact it gives only very sparse guidance about the behavior of integers derived from pointers and vice versa. You can read about that here.

Now let’s modify the program a tiny bit:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi - 96); // <--- CHANGED!
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

The only difference is in the line that computes p (marked above): instead of adding diff to yi, we added the observed value of the difference, -96. Now we’ll run it:

$ clang-6.0 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7ff21ac00300, *x = 7
y = 0x7ff21ac00360, *y = 5
p = 0x7ff21ac00300, *p = 7
$ gcc-8 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7fce5e400300, *x = 3
y = 0x7fce5e400360, *y = 5
p = 0x7fce5e400300, *p = 7

Clang behaves the same as before, but GCC no longer believes that storing through p results in x being overwritten, despite the fact that x and p have the same value. What is going on is that as far as GCC is concerned, since we computed the address of x through guesswork instead of through an actual observation, our guess is effectively illegitimate. (If this guessing game is too simple for you, there are entertaining variations that avoid the need for information from a previous run of the program, such as guessing addresses using an exhaustive loop or making huge allocations so that the resulting addresses are constrained by the pigeonhole principle.) Here again we are operating well outside of the C standard and are rather in a zone where compiler writers need to make decisions that balance optimization power against developers’ expectations. I like this example because it is weird and perhaps surprising.

What should we take away from this? First, it is a big mistake to try to understand pointers in a programming language as if they follow the same rules as pointer-sized integer values. Even people who have come to terms with UB and its consequences are often surprised by this. Second, these issues are not specific to C and C++, but rather are going to create problems in any language that wants highly optimizing compilers but also permits low-level pointer manipulation.

So what are the actual rules that Clang and GCC are following when compiling these examples? The short answer is that it’s complicated and not well documented. Some useful discussion can be found in this piece about pointer provenance. For a longer answer that is specific to LLVM, see this paper that will get presented at OOPSLA in November. Also, one of my coauthors wrote a blog post about this material.

Future Directions for Optimizing Compilers

I wanted to write a manifesto-ish sort of piece about what compilers are supposed to look like in the future. Nuno was the obvious coauthor since I’ve talked to him about this topic so much that I’m overall not really sure which parts started out as his ideas and which were mine. The article didn’t end up feeling like a good fit for a series of blog posts, so we’ve placed it on arXiv: Future Directions for Optimizing Compilers. The first paragraph of the paper gives the main idea:

As software becomes larger, programming languages become higher-level, and processors continue to fail to be clocked faster, we’ll increasingly require compilers to reduce code bloat, eliminate abstraction penalties, and exploit interesting instruction sets. At the same time, compiler execution time must not increase too much and also compilers should never produce the wrong output. This paper examines the problem of making optimizing compilers faster, less buggy, and more capable of generating high-quality output.

We plan to keep revising the paper and would be happy to receive feedback on it.

How LLVM Optimizes a Function

An optimizing, ahead-of-time compiler is usually structured as:

  1. A frontend that converts source code into an intermediate representation (IR).
  2. A target-independent optimization pipeline: a sequence of passes that successively rewrite the IR to eliminate inefficiencies and forms that cannot be readily translated into machine code. Sometimes called the “middle end.”
  3. A target-dependent backend that generates assembly code or machine code.

In some compilers the IR format remains fixed throughout the optimization pipeline; in others, the format or semantics change. In LLVM the format and semantics are fixed, and consequently it should be possible to run any sequence of passes you want without introducing miscompilations or crashing the compiler.

The sequence of passes in the optimization pipeline is engineered by compiler developers; its goal is to do a pretty good job in a reasonable amount of time. It gets tweaked from time to time, and of course there’s a different set of passes that gets run at each optimization level. A longish-standing topic in compiler research is to use machine learning or something to come up with a better optimization pipeline either in general or else for a specific application domain that is not well-served by the default pipelines.

Some principles for pass design are minimality and orthogonality: each pass should do one thing well, and there should not be much overlap in functionality. In practice, compromises are sometimes made. For example when two passes tend to repeatedly generate work for each other, they may be integrated into a single, larger pass. Also, some IR-level functionality such as constant folding is so ubiquitously useful that it might not make sense as a pass; LLVM, for example, implicitly folds away constant operations as instructions are created.
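
For instance, here is a small sketch against LLVM’s C++ API (header names and details vary somewhat across LLVM versions) showing that asking IRBuilder for an add of two constants hands back a folded constant rather than creating an add instruction:

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext Ctx;
  IRBuilder<> B(Ctx);
  // No add instruction is created here: the builder's constant folder hands
  // back a ConstantInt holding 5.
  Value *V = B.CreateAdd(B.getInt32(2), B.getInt32(3));
  if (auto *CI = dyn_cast<ConstantInt>(V))
    errs() << "folded to " << CI->getSExtValue() << "\n";
  return 0;
}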

In this post we’ll look at how some of LLVM’s optimization passes work. I’ll assume you’ve read this piece about how Clang compiles a function or alternatively that you more or less understand how LLVM IR works. It’s particularly useful to understand the SSA (static single assignment) form: Wikipedia will get you started and this book contains more information than you’re likely to want to know. Also see the LLVM Language Reference and the list of optimization passes.

We’re looking at how Clang/LLVM 6.0.1 optimizes this C++:

bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;
}

Keep in mind that the optimization pipeline is a busy place and we’re going to miss out on a lot of fun stuff such as:

  • inlining, an easy but super-important optimization, which can’t happen here since we’re looking at just one function
  • basically everything that is specific to C++ vs C
  • autovectorization, which is defeated by the early loop exit

In the text below I’m going to skip over every pass that doesn’t make any changes to the code. Also, we’re not even looking at the backend and there’s a lot going on there as well. Even so, this is going to be a bit of a slog! (Sorry to use images below, but it seemed like the best way to avoid formatting difficulties, I hate fighting with WP themes. Click on the image to get a larger version. I used icdiff.)

Here’s the IR file emitted by Clang (I manually removed the “optnone” attribute that Clang put in) and this is the command line that we can use to see the effect of each optimization pass:

opt -O2 -print-before-all -print-after-all is_sorted2.ll

The first pass is “simplify the CFG” (control flow graph). Because Clang does not optimize, IR that it emits often contains easy opportunities for cleanup:

Here, basic block 26 simply jumps to block 27. This kind of block can be eliminated, with jumps to it being forwarded to the destination block. The diff is a bit more confusing than it would have to be due to the implicit block renumbering performed by LLVM. The full set of transformations performed by SimplifyCFG is listed in a comment at the top of the pass:

This file implements dead code elimination and basic block merging, along with a collection of other peephole control flow optimizations. For example:

  • Removes basic blocks with no predecessors.
  • Merges a basic block into its predecessor if there is only one and the predecessor only has one successor.
  • Eliminates PHI nodes for basic blocks with a single predecessor.
  • Eliminates a basic block that only contains an unconditional branch.
  • Changes invoke instructions to nounwind functions to be calls.
  • Change things like “if (x) if (y)” into “if (x&y)”.

Most opportunities for CFG cleanup are a result of other LLVM passes. For example, dead code elimination and loop invariant code motion can easily end up creating empty basic blocks.

The next pass to run, SROA (scalar replacement of aggregates), is one of our heavy hitters. The name ends up being a bit misleading since SROA is only one of its functions. This pass examines each alloca instruction (function-scoped memory allocation) and attempts to promote it into SSA registers. A single alloca will turn into multiple registers when it is statically assigned multiple times and also when the alloca is a class or struct that can be split into its components (this splitting is the “scalar replacement” referred to in the name of the pass). A simple version of SROA would give up on stack variables whose addresses are taken, but LLVM’s version interacts with alias analysis and ends up being fairly smart (though the smarts are not needed in our example).

After SROA, all alloca instructions (and their corresponding loads and stores) are gone, and the code ends up being much cleaner and more amenable to subsequent optimization (of course, SROA cannot eliminate all allocas in general — this only works when the pointer analysis can completely eliminate aliasing ambiguity). As part of this process, SROA had to insert some phi instructions. Phis are at the core of the SSA representation, and the lack of phis in the code emitted by Clang tells us that Clang emits a trivial kind of SSA where communication between basic blocks is through memory, instead of being through SSA registers.

Next is “early common subexpression elimination” (CSE). CSE tries to eliminate the kind of redundant subcomputations that appear both in code written by people and in partially-optimized code. “Early CSE” is a fast, simple kind of CSE that looks for trivially redundant computations.

Here both %10 and %17 do the same thing, so uses of one of the values can be rewritten as uses of the other, and then the redundant instruction eliminated. This gives a glimpse of the advantages of SSA: since each register is assigned only once, there’s no such thing as multiple versions of a register. Thus, redundant computations can be detected using syntactic equivalence, with no reliance on a deeper program analysis (the same is not true for memory locations, which live outside of the SSA universe).

Next, a few passes that have no effect run, and then the “global variable optimizer” runs; it self-describes as:

This pass transforms simple global variables that never have their address taken. If obviously true, it marks read/write globals as constant, deletes variables only stored to, etc.

It makes this change:

The thing that has been added is a function attribute: metadata used by one part of the compiler to store a fact that might be useful to a different part of the compiler. You can read about the rationale for this attribute here.

Unlike other optimizations we’ve looked at, the global variable optimizer is interprocedural: it looks at an entire LLVM module. A module is more or less equivalent to a compilation unit in C or C++. In contrast, intraprocedural optimizations look at only one function at a time.

The next pass is the “instruction combiner”: InstCombine. It is a large, diverse collection of “peephole optimizations” that (typically) rewrite a handful of instructions connected by data flow into a more efficient form. InstCombine won’t change the control flow of a function. In this example it doesn’t have a lot to do:

Here instead of subtracting 1 from %1 to compute %4, we’ve decided to add -1. This is a canonicalization rather than an optimization. When there are multiple ways of expressing a computation, LLVM attempts to canonicalize to a (often arbitrarily chosen) form, which is then what LLVM passes and backends can expect to see. The second change made by InstCombine is canonicalizing two sign-extend operations (sext instructions) that compute %7 and %11 to zero-extends (zext). This is a safe transformation when the compiler can prove that the operand of the sext is non-negative. That is the case here because the loop induction variable starts at zero and stops before it reaches n (if n is negative, the loop never executes at all). The final change is adding the “nuw” (no unsigned wrap) flag to the instruction that produces %10. We can see that this is safe by observing that (1) the induction variable is always increasing and (2) that if a variable starts at zero and increases, it would become undefined by passing the sign wraparound boundary next to INT_MAX before it reaches the unsigned wraparound boundary next to UINT_MAX. This flag could be used to justify subsequent optimizations.

Next, SimplifyCFG runs for a second time, removing two empty basic blocks:

“Deduce function attributes” annotates the function further:

“Norecurse” means that the function is not involved in any recursive loop and a “readonly” function does not mutate global state. A “nocapture” parameter is not saved anywhere after its function returns and a “readonly” parameter refers to storage not modified by the function. Also see the full set of function attributes and the full set of parameter attributes.

Next, “rotate loops” moves code around in an attempt to improve conditions for subsequent optimizations:

Although this diff looks a bit alarming, there’s not all that much happening. We can see what happened more readily by asking LLVM to draw the control flow graph before and after loop rotation. Here are the before (left) and after (right) views:

The original code still matches the loop structure emitted by Clang:

  initializer
  goto COND
COND:
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  goto COND
EXIT:

Whereas the rotated loop looks like this:

  initializer
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  if (condition)
    goto BODY
  else
    goto EXIT
EXIT:

(Corrected as suggested by Johannes Doerfert below– thanks!)

The point of loop rotation is to remove one branch and to enable subsequent optimizations. I did not find a better description of this transformation on the web (the paper that appears to have introduced the term doesn’t seem all that relevant).

CFG simplification folds away the two basic blocks that contain only degenerate (single-input) phi nodes:

The instruction combiner rewrites “%4 = 0 s< (%1 - 1)” as “%4 = %1 s> 1”; this kind of thing is useful because it reduces the length of a dependency chain and may also create dead instructions (see the patch making this happen). This pass also eliminates another trivial phi node that was added during loop rotation:

Next, “canonicalize natural loops” runs; it self-describes as:

This pass performs several transformations to transform natural loops into a simpler form, which makes subsequent analyses and transformations simpler and more effective.

Loop pre-header insertion guarantees that there is a single, non-critical entry edge from outside of the loop to the loop header. This simplifies a number of analyses and transformations, such as LICM.

Loop exit-block insertion guarantees that all exit blocks from the loop (blocks which are outside of the loop that have predecessors inside of the loop) only have predecessors from inside of the loop (and are thus dominated by the loop header). This simplifies transformations such as store-sinking that are built into LICM.

This pass also guarantees that loops will have exactly one backedge.

Indirectbr instructions introduce several complications. If the loop contains or is entered by an indirectbr instruction, it may not be possible to transform the loop and make these guarantees. Client code should check that these conditions are true before relying on them.

Note that the simplifycfg pass will clean up blocks which are split out but end up being unnecessary, so usage of this pass should not pessimize generated code.

This pass obviously modifies the CFG, but updates loop information and dominator information.

Here we can see loop exit-block insertion happen:

Next up is “induction variable simplification“:

This transformation analyzes and transforms the induction variables (and computations derived from them) into simpler forms suitable for subsequent analysis and transformation.

If the trip count of a loop is computable, this pass also makes the following changes:

  1. The exit condition for the loop is canonicalized to compare the induction value against the exit value. This turns loops like: ‘for (i = 7; i*i < 1000; ++i)’ into ‘for (i = 0; i != 25; ++i)’
  2. Any use outside of the loop of an expression derived from the indvar is changed to compute the derived value outside of the loop, eliminating the dependence on the exit value of the induction variable. If the only purpose of the loop is to compute the exit value of some derived expression, this transformation will make the loop dead.

The effect of this pass here is simply to rewrite the 32-bit induction variable as 64 bits:

I don’t know why the zext — previously canonicalized from a sext — is turned back into a sext.

Now “global value numbering” performs a very clever optimization. Showing this one off is basically the entire reason that I decided to write this blog post. See if you can figure it out just by reading the diff:

Got it? Ok, the two loads in the loop on the left correspond to a[i] and a[i + 1]. Here GVN has figured out that loading a[i] is unnecessary because a[i + 1] from one loop iteration can be forwarded to the next iteration as a[i]. This one simple trick halves the number of loads issued by this function. Both LLVM and GCC only got this transformation recently.
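
To build some intuition about what GVN accomplished (the pass of course works on the IR, but its effect is easy to express at the source level), the transformed loop behaves roughly like this, with each array element loaded exactly once:

// Roughly what the loop computes after GVN's load forwarding: each element
// is loaded once and carried across iterations in a register.
bool is_sorted_after_gvn(int *a, int n) {
  if (n <= 1)
    return true;
  int prev = a[0];
  for (int i = 1; i < n; i++) {
    int cur = a[i];
    if (prev > cur)
      return false;
    prev = cur;
  }
  return true;
}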

You might be asking yourself if this trick still works if we compare a[i] against a[i + 2]. It turns out that LLVM is unwilling or unable to do this, but GCC is willing to throw up to four registers at this problem.

Now “bit-tracking dead code elimination” runs:

This file implements the Bit-Tracking Dead Code Elimination pass. Some instructions (shifts, some ands, ors, etc.) kill some of their input bits. We track these dead bits and remove instructions that compute only these dead bits.

But it turns out the extra cleverness isn’t needed since the only dead code is a GEP, and it is trivially dead (GVN removed the load that previously used the address that it computes):

Now the instruction combiner sinks an add into a different basic block. The rationale behind putting this transformation into InstCombine is not clear to me; perhaps there just wasn’t a more obvious place to do this one:

Now things get a tiny bit weird: “jump threading” undoes what “canonicalize natural loops” did earlier:

Then we canonicalize it back:

And CFG simplification flips it the other way:

And back:

And forth:

And back:

And forth:

Whew, we’re finally done with the middle end! The code at the right is what gets passed to (in this case) the x86-64 backend.

You might be wondering if the oscillating behavior towards the end of the pass pipeline is the result of a compiler bug. Keep in mind that this function is really, really simple, and that a whole bunch of passes that I didn’t mention (because they made no changes to the code) ran in between the flipping and flopping. As far as the second half of the middle end optimization pipeline goes, we’re basically looking at a degenerate execution here.

Acknowledgments: Some students in my Advanced Compilers class this fall provided feedback on a draft of this post (I also used this material as an assignment). I ran across the function used here in this excellent set of lecture notes about loop optimizations.

How Clang Compiles a Function

I’ve been planning on writing a post about how LLVM optimizes a function, but it seems necessary to first write about how Clang translates C or C++ into LLVM. This is going to be fairly high-level:

  • rather than looking at Clang’s internals, I’m going to focus on how Clang’s output relates to its input
  • we’re not going to look at any non-trivial C++ features

We’ll use this small function that I borrowed from these excellent lecture notes on loop optimizations:

bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;
}

Since Clang doesn’t do any optimization, and since LLVM IR was initially designed as a target for C and C++, the translation is going to be relatively easy. I’ll use Clang 6.0.1 (or something near it, since it hasn’t quite been released yet) on x86-64.

The command line is:

clang++ is_sorted.cpp -O0 -S -emit-llvm

In other words: compile the file is_sorted.cpp as C++ and then tell the rest of the LLVM toolchain: don’t optimize, emit assembly, but no, actually emit textual LLVM IR instead. Textual IR is bulky and not particularly fast to print or parse; the binary “bitcode” format is always preferred when a human isn’t in the loop. The full IR file is here; we’ll be looking at it in parts.

Starting at the top of the file we have:

; ModuleID = 'is_sorted.cpp'
source_filename = "is_sorted.cpp"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

Any text between a semicolon and the end of a line is a comment, so the first line does nothing, but if you care, an LLVM “module” is basically a compilation unit: a container for code and data. The second line doesn’t concern us. The third line describes some choices and assumptions made by the compiler; they don’t matter much for this post but you can read more here. The target triple is a gcc-ism that doesn’t concern us here either.

An LLVM function optionally has attributes:

; Function Attrs: noinline nounwind optnone uwtable

Some of them (like these) are supplied by the front end; others get added later by optimization passes. This is just a comment: the actual attributes are specified towards the end of the file. These attributes don’t have anything to do with the meaning of the code, so I’m not going to discuss them, but you can read about them here if you’re curious.

Ok, finally on to the function itself:

define zeroext i1 @_Z9is_sortedPii(i32* %a, i32 %n) #0 {

“zeroext” means that the return value of the function (i1, a one-bit integer) should be zero-extended, by the backend, to whatever width the ABI requires. Next comes the mangled name of the function, then its parameter list, which is basically the same as in the C++ code, except that the “int” type has been refined to 32 bits. The #0 connects this function to an attribute group near the bottom of the file.

Now on to the first basic block:

entry:
  %retval = alloca i1, align 1
  %a.addr = alloca i32*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store i32* %a, i32** %a.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

Every LLVM instruction has to live inside a basic block: a collection of instructions that is only entered at the top and only exited at the bottom. The last instruction of a basic block must be a terminator instruction: fallthrough is not allowed. Every function must have an entry block that has no predecessors (places to come from) other than the function entry itself. These properties and others are checked when an IR file is parsed, and can optionally be checked more times during compilation by the “module verifier.” The verifier is useful for debugging situations where a pass emits illegal IR.

The first four instructions in the entry basic block are “alloca”s: stack memory allocations. The first three are for implicit variables created during the compilation, the fourth is for the loop induction variable. Storage allocated like this can only be accessed using load and store instructions. The next three instructions initialize three of the stack slots: a.addr and n.addr are initialized using the values passed into the function as parameters and i is initialized to zero. The return value does not need to be initialized: this will be taken care of later by any code that wasn’t undefined at the C or C++ level. The last instruction is an unconditional branch to the subsequent basic block (we’re not going to worry about it here, but most of these unnecessary jumps can be elided by an LLVM backend).

You might ask: why does Clang allocate stack slots for a and n? Why not just use those values directly? In this function, since a and n are never modified, this strategy would work, but noticing this fact is considered an optimization and thus outside of the job Clang was designed to do. In the general case where a and n might be modified, they must live in memory as opposed to being SSA values, which are — by definition — given a value at only one point in a program. Memory cells live outside of the SSA world and can be modified freely. This may all seem confusing and arbitrary but these design decisions have been found to allow many parts of a compiler to be expressed in natural and efficient ways.

I think of Clang as emitting degenerate SSA code: it meets all the requirements for SSA, but only because basic blocks communicate through memory. Emitting non-degenerate SSA requires some care and some analysis and Clang’s refusal to do these leads to a pleasing separation of concerns. I haven’t seen the measurements but my understanding is that generating a lot of memory operations and then almost immediately optimizing many of them away isn’t a major source of compile-time overhead.

Next let’s look at how to translate a for loop. The general form is:

for (initializer; condition; modifier) {
  body
}

This is translated roughly as:

  initializer
  goto COND
COND:
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  goto COND
EXIT:

Of course this translation isn’t specific to Clang: any C or C++ compiler would do the same.

In our example, the loop initializer has been folded into the entry basic block. The next basic block is the loop condition test:

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %sub = sub nsw i32 %1, 1
  %cmp = icmp slt i32 %0, %sub
  br i1 %cmp, label %for.body, label %for.end

As a helpful comment, Clang is telling us that this basic block can be reached either from the for.inc or the entry basic block. This block loads i and n from memory, computes n - 1 (the “nsw” flag preserves the C++-level fact that signed overflow is undefined; without this flag an LLVM subtract would have two’s complement semantics), compares i against the decremented value using “signed less than”, and then finally branches either to the for.body basic block or the for.end basic block.

The body of the for loop can only be entered from the for.cond block:

for.body:
  %2 = load i32*, i32** %a.addr, align 8
  %3 = load i32, i32* %i, align 4
  %idxprom = sext i32 %3 to i64
  %arrayidx = getelementptr inbounds i32, i32* %2, i64 %idxprom
  %4 = load i32, i32* %arrayidx, align 4
  %5 = load i32*, i32** %a.addr, align 8
  %6 = load i32, i32* %i, align 4
  %add = add nsw i32 %6, 1
  %idxprom1 = sext i32 %add to i64
  %arrayidx2 = getelementptr inbounds i32, i32* %5, i64 %idxprom1
  %7 = load i32, i32* %arrayidx2, align 4
  %cmp3 = icmp sgt i32 %4, %7
  br i1 %cmp3, label %if.then, label %if.end

The first two lines load a and i into SSA registers; i is then widened to 64 bits so it can participate in an address computation. getelementptr (gep for short) is LLVM’s famously baroque pointer computation instruction; it even has its own FAQ. Unlike a machine language, LLVM does not treat pointers and integers the same way. This facilitates alias analysis and other memory optimizations. The code then goes on to load a[i] and a[i + 1], compare them, and branch based on the result.
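
Roughly speaking, each gep here performs the address arithmetic implied by &a[i]. In C++ terms (an illustrative helper, not an LLVM API) it computes:

#include <cstdint>

// Equivalent of "getelementptr inbounds i32, i32* %2, i64 %idxprom":
// the base pointer plus idxprom * sizeof(i32), i.e. the address &a[idxprom].
// The "inbounds" keyword additionally promises that the result stays inside
// the object that a points to.
int *gep_i32(int *a, int64_t idxprom) {
  return (int *)((char *)a + idxprom * (int64_t)sizeof(int));
}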

The if.then block stores 0 into the stack slot for the function return value and branches unconditionally to the function exit block:

if.then:
  store i1 false, i1* %retval, align 1
  br label %return

The else block is trivial:

if.end:
  br label %for.inc

And the block for adding one to the loop induction variable is easy as well:

for.inc:
  %8 = load i32, i32* %i, align 4
  %inc = add nsw i32 %8, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

This code branches back up to the loop condition test.

If the loop terminates normally, we want to return true:

for.end:
  store i1 true, i1* %retval, align 1
  br label %return

And finally whatever got stored into the return value stack slot is loaded and returned:

return:
  %9 = load i1, i1* %retval, align 1
  ret i1 %9

There’s a bit more at the bottom of the function but nothing consequential. Ok, this post became longer than I’d intended; the next one will look at what the IR-level optimizations do to this function.

(Thanks to Xi Wang and Alex Rosenberg for sending corrections.)

Software Engineering Takeaways

I had a great time this spring teaching a software engineering course for a new professional masters degree program created by my department. Since I didn’t use slides or hand out lecture notes, some students were asking if maybe I could write up a summary of what I wanted them to learn in the course. This sounded like a good idea and I figured I’d make a blog post out of it.

This course didn’t have a textbook, but we ended up going through much of Writing Solid Code and also several chapters of Code Complete. I particularly like Appendix D of the 2nd edition of Writing Solid Code, where Maguire describes how he conducts an interview for a programming position. The students also completed about a dozen (mostly) programming assignments, individually and in groups, that I’m not going to talk about here.

Testing

Being good at testing is sort of a superpower and computer science programs usually don’t do a great job teaching it. And testing truly is hard to teach since there’s a lot to it and so much of testing is either highly situational or else is about engaging the problem with the right mindset and plenty of energy: you can’t test well unless you truly want to make things fail. My approach to teaching this material is to focus on what I think of as testing’s three pillars:

  1. Test cases. These come from regressions, from systematic or randomized test-case generators, from requirements and specifications, and from corner cases in the implementation. In a few special situations, exhaustive testing works. Test cases should rarely be discarded but they can be segregated, for example into fast and slow ones, which run at different times. Test cases should exercise all relevant sources of input: not just file inputs, but also environment variables, user inputs, and things like the Windows registry, if the system under test depends on those things. The time at which parts of the test case arrive may be important. Though I forgot to give it to the class, Cem Kaner’s What Is a Good Test Case? is excellent.
  2. Oracles. Test cases are only useful insofar as we can tell if the system under test processes them correctly. Some oracles are easy and broadly applicable, such as detecting crashes, excessive resource use, and violations of programming language rules (e.g. using ASan and UBSan). The most useful oracles, however, are specific to the system under test. The one we see most often is to specify, by hand, the expected behavior of the system, based on our understanding of the requirements or specification. Differential testing — checking one implementation of a spec against another — is highly powerful, and more broadly applicable than it first appears: the reference implementation doesn’t need to implement the full functionality of the system, and sometimes it is simply a different mode of the same system. For example, an optimizing compiler can be tested against a non-optimizing version of itself. Function-inverse pairs make an excellent oracle, when applicable. In metamorphic testing we change the test case (add irrelevant levels of nesting to a C program, or remove dead code from it) in a way that shouldn’t change how it works. Assertions and checkReps are extremely valuable oracles. Finding good oracles requires creativity and attention to detail, but the potential rewards are high.
  3. Coverage. Every serious programming language has one or more tools for measuring code coverage, with the goal of finding code not exercised by a test suite. There are a lot of variants on coverage, but in practice we seldom see anything beyond line or branch coverage. If the coverage induced by a test suite is bad, the test suite itself is bad. If the coverage is good, then the test suite might be good, but this is not a sure thing.

Testing should be as frictionless as possible: investments in automation and parallelization often pay off. In class we saw how incredibly easy it is to run the tests for a project in Github on every commit using Travis.

One of my favorite kinds of testing is burning in an ADT implementation — this exercise brings the three pillars of testing together in a very satisfying way.
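
As a tiny, self-contained illustration of a function-inverse oracle (the little-endian serialize/parse pair here is just a stand-in for whatever pair of inverse functions your system actually has), a randomized round-trip test might look like this:

#include <cassert>
#include <cstdint>
#include <random>

// A small pair of inverse functions: serialize a uint32_t to little-endian
// bytes and parse it back.
static void put_le32(uint32_t v, uint8_t out[4]) {
  for (int i = 0; i < 4; i++)
    out[i] = (uint8_t)(v >> (8 * i));
}

static uint32_t get_le32(const uint8_t in[4]) {
  uint32_t v = 0;
  for (int i = 0; i < 4; i++)
    v |= (uint32_t)in[i] << (8 * i);
  return v;
}

int main() {
  std::mt19937 rng(12345); // fixed seed so any failure reproduces
  for (int i = 0; i < 1000000; i++) {
    uint32_t x = rng();
    uint8_t buf[4];
    put_le32(x, buf);
    assert(get_le32(buf) == x); // the function-inverse oracle
  }
  return 0;
}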

How SQLite is Tested is great supplemental reading.

Tool and Library Ecosystems

One of the defining characteristics of modern software engineering is its reliance on tooling to help us create large projects rapidly, with lots of high-quality libraries to build on. Every programming language has one or more collections of tools, each of which forms a more or less coherent ecosystem for getting programming tasks done: editing, refactoring, linting, profiling, building, packaging, measuring coverage, debugging, etc. The tricky bit for newcomers is to rapidly learn the shape of an ecosystem in order to avoid wasting time doing jobs that could be better accomplished with tool support.

Modularity

If testing is the reason that large software sometimes works, then modularity is the reason it can be constructed in the first place. Modularity is kind of a tough topic to cover in a course but we did spend time on API design, where I mainly wanted to convey that it’s hard to get right and there are plenty of pitfalls to avoid: excess mutability, unclear ordering constraints among API calls, long lists of parameters, poor choices for default values, hard-to-follow memory ownership protocols, cryptic names, and unwise exposure of implementation details. Something I tried to show the class is how a small group of people working closely together on a well-defined task can almost always succeed, but that — without some care and experience — multiple small groups who have accomplished tasks separately will invariably have problems integrating things they have developed separately.

Code Reviews

Testing only gets us so far and a huge number of problems can be headed off by putting two or more sets of eyes on code before it lands, whether this is done in Github or in a meeting room. We did a bit of code reviewing in class, though I feel like we should have done this more. Some aspects I find important are trying to stay constructive and objective, having clear goals for the code review, and making sure that outcomes get written down and acted upon.

Coding Styles

My view is that the details aren’t nearly as important as just having and following a coding style. This includes source code formatting details, identifier naming, language features to avoid, etc. This can result in a code base that is more approachable to newcomers and where some kinds of bugs are easier to spot. The various Google documents seem as good examples as any. Coding styles are particularly important for languages that are older, cruftier, and less safe. For example, LLVM lives in a C++ style that I generally find to be comfortable and tasteful.

Source Control Systems

Git is more or less the only game in town so there’s not much to talk about there. The thing I tried to convince the students here is that we don’t just “use git.” Rather, it’s crucial to define a sensible git-based workflow and get everyone onboard with it. The main thing is that for a variety of common use cases, we have a fairly mechanical set of steps to follow; git is confusing and badly designed enough that workflow innovation is best left to people with a lot of experience.

Critical Systems and Responsibility

We are far past the point of pretending that software is either ethically neutral or uniformly positive, and the 90s-era Internet optimism sounds pretty naive at this point. A significant fraction of today’s CS students are going to work on, at some point, software that is very much ethically non-neutral. This is a good post on that topic. We read various documents related to the Toyota unintended acceleration case and discussed other software safety problems like the Therac-25.

Backwards Compatibility

People tend to underestimate its importance and there are great stories on both sides. Microsoft’s commitment is incredible: a long time ago I had access to the Windows 2000 sources and ran across crazy things, like a copy of Windows 3.1 (in x86 assembly!) and routines for detecting legacy applications and implementing various buggy behaviors that they depended on. On the other side we have, for example, the Python 2/3 debacle.

Static Analysis

I tried to convince students that static analysis is useful, starting with compiler warnings. We read A Few Billion Lines of Code Later and they spent some time looking at issues pointed out by the Clang static analyzer.

Engineering for Performance

The message here is a bit subtle: we want to discourage premature optimization while still encouraging people to learn to build software that isn’t fundamentally too slow to meet its goals. So on one hand we don’t want to write everything in C “for speed” but on the other hand we need to avoid showstoppers such as Python, a bottleneck data structure stored on disk, a quadratic algorithm, etc.

For performance tuning, of course measurement is everything, so we need profilers and maybe also hardware-specific utilities like perf. It doesn’t hurt to know the strengths and weaknesses of both the compiler and the hardware platform, and we should always be ready to read some assembly. Chapters 25 and 26 of Code Complete 2e are a good resource for all of this, though they do not present a very nuanced view of what one can expect the compiler to accomplish.

Reporting Bugs

Getting a software developer to stop doing what they wanted to do, and to spend the day fixing a defect instead, can be difficult, but there are a few simple ingredients that can make bug reports vastly more effective. The report should be framed politely and matter-of-factly; implying that the developers are incompetent is rarely helpful. The issue must be reproducible and the bug report must describe all of the circumstances necessary to make the bug show itself. However, all nonessential circumstances should be left out, including any part of the test case that is not necessary for triggering the bug.

Software Process

There are plenty of software development process models, but I guess I don’t find any of these to be particularly compelling, so we didn’t spend much time on them. On the other hand, the elements of software process — estimation, requirements, modeling, architecture, implementation, quality assurance, maintenance, etc. — are hard to argue with. I spent a bit of effort trying to prepare students for the huge diversity in software development organizations they might see out in the world, from startups to banks to Googles to IBMs. I tried to get them used to the fact that in some cases management may have wildly unrealistic ideas about software development and that requirements can change at the drop of a hat.

The Human Element

Effectively reviewing someone’s code requires a light touch; you have to critique the code rather than judging its author. On the other side, it can be very difficult to gracefully process criticism of a piece of code that you’ve bled and sweated over for months. A bug report is really just an argument that a person should make a change to a piece of software. It is hard to implement a piece of software that solves someone’s problem without putting yourself in their place. Similarly, it is very difficult to solve a problem for someone you don’t like or respect or at least empathize with to some degree. UI design requires understanding the needs of people who don’t share your own abilities and cultural background. Ethical concerns are common, and probably should be thought about a lot more often than they are. Team dynamics are complex and managing people is really not easy. In summary, software development is fundamentally a human endeavor and we’d all do well to keep that in mind.

High Desert Camping

My brother Eric does a great job in the cool uncle role, whereas I’ve been typecast as the nerdy dad — so my kids were super excited when he decided to join us on a spring break camping trip. As I usually do, I obsessively searched Google Earth for a perfect campsite and ended up doing pretty well. The goal was to explore some remote canyons from a base camp on BLM land near the Maze District of Canyonlands National Park. This is a big, wild part of Utah: the official map shows several locations in the park that are a 5-6 hour drive from the ranger station at Hans Flat, which is itself at least 1.5 hours from the nearest town. A lot of these places can’t be visited without carrying extra gas, and even when you get to where you’re going there are very few foot trails, so most hikes end up being cross-country routes on difficult terrain.

During our first night it stormed, dropping more than a third of an inch of rain, which is quite a bit considering that the average annual rainfall here is around 9 inches.

A wet morning:

Our camp was at 6500′ and it looks like the snow line was around 9000′:

I was worried the rain had made some roads impassable, so we spent the day hiking the rim of nearby Happy Canyon. Since the Wingate sandstone layer tends to form long, impassable cliff systems, we didn’t have any way to get into the canyon itself.

A rough tool of some sort, perhaps a scraper, that Eric found:

Home sweet home. I don’t enjoy driving while pulling a trailer, but it’s really nice to be up off the ground and to be able to stand up in the tent.

Views from camp:

The next morning, on the way to our hike, we stopped to look into gigantic Millard Canyon. There are a couple of routes down to the bottom, put in by cowboys in the early 20th century, that probably get very little use, but we opted for a different hike. The mountains on the horizon are the La Sals, on the other side of the Green River and Moab.

Wild burros! Inside the park there’s no grazing and both plant and animal life are a lot more abundant than outside.

Doing some more wildlife spotting:

A slab-sided granary that we found in an alcove:

Fingerprints in the mud. These could be a few thousand years old, or could be as recent as 700 years old — people were forced to leave the region around 1300 AD due to climate change. My understanding is that humans arrived in this part of the world around 9000 years ago.

Half of a metate:

A very cool rock art panel:

A pair of granaries; it’s rare to find them with the lids intact:

This is either a huge arrowhead or a small spear point, broken in a couple of places. I’ve come up with a good tactic for making the kids feel better about not being able to carry away artifacts like this that they find: we take a GPS point so we can visit it in the future. This was found in a wash, far from any possible archaeological context.

The next day we visited High Spur slot canyon, an easy but extremely bumpy 13-mile drive past the ranger station. Colorful scenery near the trailhead:

Dropping into the drainage, which was a bit wet from the rain a couple of days earlier but happily free of deep pools, which would have been really cold:

Plenty of low-key obstacles like this one, but no larger ones until further down-canyon than we had time to go.

Yin and yang:

It turned out Eric had never been in a slot canyon! Here he’s having fun:

And here, perhaps, not quite as much fun. We ended up taking off our packs for this section. I’m fairly tolerant of enclosed spaces but would not be interested in exploring a canyon much narrower than this one. In a really tight canyon you are forced up off the ground level, sometimes many feet, and a fall would be disastrous.

Sometimes you go under:

Sometimes you go over. Getting wet when this is avoidable is considered bad form. Of course it was me who slipped into a pool first.

Kids looking over an obstacle of some sort:

Long road to nowhere:

Back at camp just in time to eat dinner in the light:

The stupid wax log I bought at the supermarket was very reluctant to burn, so the kids spent a long time trying to get a fire started as it got colder and darker and windier:

Eric and I sipped adult beverages and provided helpful suggestions.

The temperature was dropping fast; it got down to 24 F on our last night. In a high-altitude desert it can easily be 50 F warmer during the day than overnight.

On the last day we got up early and packed quickly since we had to get Eric to the SLC airport for an afternoon flight. Here the first sun is hitting the Henry Mountains:

This is the area I visited on my very first trip to southern Utah about 18 years ago. I love it!

Paths to External Engagement in Computer Science Research

The other day I wrote a post imploring academic computer scientists to at least occasionally break out of their research bubbles and engage with real engineering problems where their research matters. That post was, admittedly, a bit facile about how one might make this engagement happen. This piece suggests some ways. I don’t claim any particular authority or expertise, though I have — except where noted — tried out all of the suggestions below. Perhaps others will chime in with what has worked for them.

Talk to a lot of people and create a network. Networking is a really important skill. Back in the day Phil Agre’s Networking on the Network was what students would read. It predates modern social networks but contains a lot of timeless advice and in any case I don’t know of a better choice.

Attend conferences outside the academic mainstream. In some cases, such as SIGGRAPH and Supercomputing, there’s plenty of non-academic content at the conference you’re attending anyhow, but most of us aren’t so lucky. There’s a really wide range of industrial conferences; some of them definitely won’t be very interesting for academics so you need to do some homework ahead of time to find the one where the right people will be. For my current work, the LLVM Developers Meeting is the main event: almost all of the community’s heavy hitters are there and the technical sessions are amazing. The security community has plenty of great non-academic events. I’ve heard good things about !!Con and Strange Loop. In any case, the point of attending these conferences (besides the technical content) is to meet people who aren’t professors or students.

Spend a day visiting a company or a government agency. You need an invitation, but you can invite yourself if you can make a contact, for example through your advisor, a mutual friend, or someone you met at a conference. Talk to people there, get a sense of what they’re doing with their days, what they’re worried about, what they’re frustrated with. Give a talk about your work if possible, but this isn’t necessary. It often makes sense to do these visits when you’re already traveling.

Spend longer visiting a company or government agency. Depending on your career stage this could be an internship, a postdoc, a sabbatical, a summer, or a leave of absence. This is a chance to work closely with people for an extended period of time. A lot of people do this at some point in their careers and I think it’s a really good idea.

Engage on Twitter. It’s a weird, crowded, noisy place and it can take a while to find the right people to follow and longer to get them to follow you. The advantage of Twitter is that a huge number of smart people are out there and communicating with them is almost frictionless.

Blog. People are far more likely to read a blog entry than a paper, in my experience. Also, the readership is different, because non-academics are even less likely to read a paper than academics are. Realistically, starting a blog only makes sense if you have a fairly consistent stream of things to say that don’t fit into tweets and don’t really belong in academic papers. Building an audience takes time and requires a certain amount of regularity in writing; these don’t necessarily fit in very well with the academic binge model of working that many of us subscribe to. Another issue is that blogging doesn’t pay the bills, academically speaking — you should only do it because you want to, not because you expect any direct benefit to your career. I waited until getting tenure to start a blog for this reason, and also to make sure that I had at least a few years’ worth of ideas lined up.

Find people who are great at external engagement. Emulate them, collaborate with them, or join them. The Racket folks are amazing at this.

Release software. Put your stuff on Github, polish it up, and then tell people about it. Get users, accept pull requests, respond to feedback, fix bugs, add features, cut releases, and repeat. Either your code will provide people with a good value proposition or it won’t — either way you learn something. The caveats are that building a user base takes time, creating realistically usable software is like 25 times as much work as creating research-grade crapware, and only a small subset of computer science professors will value your contributions in this area. But it is enormously fun and anyway you don’t want to make the mistake of caring too much what professors think.

Engage with existing open source software. For many of us, there’s an obvious open source project that we could be contributing to or otherwise helping out. Find that project and read their mailing lists, look into the issue tracker, build and use the code, read the code, and maybe submit a patch or two. Beyond these things, consider attending their meetings or BoF sessions, if these exist. A reasonable long-term goal would be to use your work to make a significant improvement to the open source program.

Start a company. This one I haven’t done, though I know many people who have. It is a somewhat extreme option, as much a lifestyle choice as a research engagement strategy. Details are out of scope for this post and anyway I don’t know anything about doing this.

Ok, with all that said, I’ll make a prediction or two about what will happen if you follow these suggestions. First, you’ll often find it frustrating and some of the time you invest will be wasted. I’ve burned months on things that never bore the tiniest fruit; if I knew how to tell you to avoid this, I certainly would. Second, you’ll discover that the problems that people are having out there aren’t the ones that you would have predicted, nor are they the ones that your CS friends and colleagues predicted. You need to learn to listen to people, but often even the people having problems aren’t actually having the problems that they think they’re having (anyone who has worked tech support will tell you this is the case more often than not). You need to learn to observe carefully and read between the lines to figure out what’s really going on. Third, at some point you will run into the distinction between problem-driven research and solution-driven research. The former is like trying to cure cancer or put a person on Mars: the problem is everything and you’ll consider any solution that might work. The latter is where your research hammer is a damn good one and you’re never going to put it down: if it can’t solve someone’s problem, you’ll move on and find a different problem. Obviously there’s a spectrum — but you’ll need to decide where on it you sit.

Closing the Loop: The Importance of External Engagement in Computer Science Research

Computer scientists tend to work by separating the essence of a problem from its environment, solving it in an abstract form, and then figuring out how to make the abstract solution work in the real world. For example, there is an enormous body of work on solving searching and sorting problems and in general it applies to finding and rearranging things regardless of whether they live in memory, on disk, or in a filing cabinet.

To paint a bit of a caricature, we have the real world where:

  • all problems are engineering problems, and every engineering problem is distinct from the rest in small or large ways
  • abstractions are leaky: machines fail, bits flip, packets drop, and bugs exist
  • requirements are many-dimensional, diverse, and poorly specified
  • systems often involve human beings who are notoriously hard to reason about

Overall, engineering problems are messy and we usually can’t prove anything about anything, any more than we could prove the correctness and optimality of a bridge or lawnmower.

On the other hand, in the abstract problem space, we’ve dropped every detail that we don’t wish to reason about but retained (ideally) all of the important characteristics of the problem. We’re now dealing with a model — sometimes a formal mathematical model, other times an executable model such as a kernel or simulation — of part or all of the problem. Models can be rigorously analyzed for efficiency, optimality, correctness, and whatever else. Furthermore, connections between problems that initially seem very different may become more apparent in the abstract form.

This process of lifting engineering challenges into abstract problems, solving them, and applying the results — I’ll call it the computer science research loop — is so integral to the DNA of computer science research that it is simply assumed; people have a hard time imagining any other way to work. Also, it has been incredibly successful, which is unsurprising since we inherited it from mathematics where it had been giving good results ever since some prehistoric person observed that you could talk about the number three without actually having three things in front of you.

Here’s the loop in its simplest form:

Here are some ways the research loop shows up in areas that I’m familiar with:

  • In compilers we can develop a toy language that is much easier to compile but that (we argue) retains the essential features of real programming languages.
  • In compilers we can compile a benchmark suite instead of real applications, arguing that our results will translate over into practice.
  • In resource scheduling research it is typical to abstract away all details of the jobs being scheduled.
  • In databases or operating systems we can create a transaction engine or OS kernel that supports only a tiny subset of the features provided by SQL Server or Linux, arguing that the advantages displayed by our simplified model would not disappear if we took the trouble to implement all the boring stuff.

In all cases the goal is to elide details that make our work harder, but without oversimplifying. This piece is about an undesirable, but avoidable, second-order effect: it is common for both edges of the computer science research loop to be weaker than they could be.

The concrete-to-abstract edge suffers when people working on the abstract side don’t have deep expertise in the concrete problems they’re trying to solve, and it also tends to weaken over time as the character of the problem drifts, causing assumptions on the abstract side to be invalidated. The abstract side has a kind of inertia: infrastructure builds up in code, formalisms, books and papers, and mental models. It requires significant time, energy, and risk-taking to throw away parts of the abstract infrastructure and create fresh pieces. Abstractions that are out of touch with real-world problems can linger, producing papers and PhD theses, for decades.

The abstract-to-concrete edge of the research loop is also problematic: solving real engineering problems, or interacting with the people whose jobs are to solve those problems, can be difficult and time-consuming. It is generally much easier to work purely on the abstract side, and in fact our field’s mathematical roots encourage this behavior. Abstract work is, of course, fine as long as someone else is doing the grungy part, but in many cases that never happens because the abstract side has drifted off to the side of the real problems, becoming more elaborate and complex over time as the easy problems get mined out, and in the end there’s no realistic prospect of applying it.

I believe these issues cause computer science research, overall, to be less interesting and impactful than it could be. I also believe that mitigating the problem isn’t that difficult and that doing so tends to make researchers’ careers a lot more fun and rewarding.

The solution is for researchers to engage with the world outside of their research bubble. Working on real-time scheduling? Then go find some drone software and help its authors or users avoid deadline misses, or else contribute scheduling code to Linux. Working on concurrency control in databases? Figure out a way to integrate the new scheme into MySQL or something, instead of waiting for its developers to read your paper. Working on finding bugs in software? Report the bugs to the people who maintain the code and watch their reactions. It is particularly important that students do these things, first because their intuitions often aren’t yet well developed and second because I’ve noticed that quite a few CS graduate students are quietly and privately wondering if their work is good for anything in the end. It turns out there’s a way to answer this question: engage with the people whose problems you are solving. As a bonus you’ll publish fewer papers.

It is not the case that every piece of research should be applied research. Rather, good pure research usually stems from direct personal experience with problems on the engineering side of our world. It’s a bit of a subtle point: doing the grungy boots-on-the-ground work is how we build our intuitions about what kinds of solutions actually work, as opposed to just sounding good on paper. It is hard — though not impossible — to skip this step and still do great work.

Going a bit further, my experience is that much of the interesting action in research happens on the abstract-to-concrete edge of the CS research loop, even though this work is not glamorous or well-rewarded by program committees or grant panels. Even the old workhorses like sorting an array or implementing a key-value map became dramatically more interesting and complex in the context of a real machine, operating system, compiler, and workload.

Concretely, here are some things to look for that might indicate that a research community needs to tighten up its loop:

  • few members of the community are plugged into the concrete problem domain, and are providing fresh insights from developments there
  • few members of the community are moving abstract results into practice
  • members of the community are mainly interested in impressing each other (or, equivalently, papers that demonstrate external impact are not highly valued)
  • the community rewards complex solutions because they are innately interesting, as opposed to rewarding simple solutions because they have engineering merit
  • years of results cluster around the same baseline or benchmark set, instead of continually moving the bar higher

In conclusion, the tension between mathematics and engineering is alive and well in our field. My belief is that more of us, perhaps even most of us, should be skilled in, and actively working in, both modes of thinking.

Also see this followup post.