What’s the difference between an integer and a pointer?

(This piece is an alternate introduction and advertisement for a soon-to-be-published research paper.)

In an assembly language we typically don’t have to worry very much about the distinction between pointers and integers. Some instructions happen to generate addresses whereas others behave arithmetically, but underneath there’s a single data type: bitvectors. At the opposite end of the PL spectrum, a high-level language won’t offer opportunities for pointer/integer confusion because the abstractions are completely firewalled off from each other. Also, of course, a high-level language may choose not to expose anything that resembles a pointer.

The problematic case, then, is the performance-oriented systems programming language: C, C++, Rust, and a few others. The issue is that:

  1. When possible, they pretend to be high-level, in order to justify optimizations that would be unsound at the assembly level. For example, they assume that a pointer derived from one heap allocation cannot alias a pointer derived from a different heap allocation.
  2. When necessary, they allow the programmer to manipulate pointers in low-level ways, such as inspecting the representation of a pointer, creating a new pointer at some offset from an existing one, storing unrelated data in unused parts of a pointer, and even fabricating a pointer out of whole cloth.

These goals conflict: the better a language is for justifying high-level optimizations, the worse it seems to be for implementing an OS kernel, and vice versa. There are a few ways we might deal with this:

  • We could require the high-level language to call out to low-level code written in a different language, as Java does with its JNI.
  • We could create a special low-level mode for a safe language where it can break the normal rules, as Rust does with its unsafe code.
  • We could go ahead and freely mix high-level and low-level code, hoping that the compiler can sort it all out, as C and C++ do.

Let’s look at an example:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi + diff);
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

When run (on my Mac) it gives:

$ clang-6.0 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7fcb0ec00300, *x = 7
y = 0x7fcb0ec00360, *y = 5
p = 0x7fcb0ec00300, *p = 7
$ gcc-8 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7f93b6400300, *x = 7
y = 0x7f93b6400360, *y = 5
p = 0x7f93b6400300, *p = 7

What happened here? First, we grabbed a couple of heap cells and initialized their contents. Second, we used integer math to fabricate a pointer p that has the same value as x, and stored through that pointer. Third, we asked the compiler what contents it thinks are in the locations pointed to by x, y, and p. Both GCC and Clang/LLVM are of the opinion that the store through p changed the value referred to by x. This is probably the result that we had hoped for, but let’s look into it a bit more deeply.

First, does this C program execute undefined behavior? It does not! Computing out-of-bounds pointers is UB but we are free to do whatever we like with integer-typed values that were derived from pointers. Second, does the C standard mandate the behavior that we saw here? It does not! In fact it gives only very sparse guidance about the behavior of integers derived from pointers and vice versa. You can read about that here.
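
To make the distinction concrete, here's a minimal sketch (mine, not the standard's): the commented-out pointer arithmetic would be UB, while the integer arithmetic below it is permitted. What we may later do with an integer cast back to a pointer is exactly the murky part:

#include <stdint.h>
#include <stdlib.h>

int main(void) {
  int *x = malloc(4);
  // int *q = x + 10;      // UB: computes a pointer far out of bounds
  uintptr_t xi = (uintptr_t)x;
  uintptr_t qi = xi + 40;  // fine: just unsigned integer arithmetic
  (void)qi;
  free(x);
}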

Now let’s modify the program a tiny bit:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi - 96); // <--- CHANGED!
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

The only difference is the marked line: instead of adding diff to yi to compute p, we added the observed value of the difference, -96. Now we’ll run it:

$ clang-6.0 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7ff21ac00300, *x = 7
y = 0x7ff21ac00360, *y = 5
p = 0x7ff21ac00300, *p = 7
$ gcc-8 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7fce5e400300, *x = 3
y = 0x7fce5e400360, *y = 5
p = 0x7fce5e400300, *p = 7

Clang behaves the same as before, but GCC no longer believes that storing through p results in x being overwritten, despite the fact that x and p have the same value. What is going on is that as far as GCC is concerned, since we computed the address of x through guesswork instead of through an actual observation, our guess is effectively illegitimate. (If this guessing game is too simple for you, there are entertaining variations that avoid the need for information from a previous run of the program, such as guessing addresses using an exhaustive loop or making huge allocations so that the resulting addresses are constrained by the pigeonhole principle.) Here again we are operating well outside of the C standard and are rather in a zone where compiler writers need to make decisions that balance optimization power against developers’ expectations. I like this example because it is weird and perhaps surprising.
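
For example, the exhaustive-loop variation mentioned above might look something like this sketch (absurdly slow on a 64-bit address space, but legal in principle; how a compiler treats the store through p is, again, not pinned down by the standard):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  // scan the integers until one compares equal to x's address; the
  // winning guess is "just an integer" even though we know its value
  for (uintptr_t g = 0;; g++) {
    if (g == (uintptr_t)x) {
      int *p = (int *)g;
      *p = 7;
      break;
    }
  }
  printf("*x = %d\n", *x); // does the compiler believe the store hit x?
  free(x);
}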

What should we take away from this? First, it is a big mistake to try to understand pointers in a programming language as if they follow the same rules as pointer-sized integer values. Even people who have come to terms with UB and its consequences are often surprised by this. Second, these issues are not specific to C and C++, but rather are going to create problems in any language that wants highly optimizing compilers but also permits low-level pointer manipulation.

So what are the actual rules that Clang and GCC are following when compiling these examples? The short answer is that it’s complicated and not well documented. Some useful discussion can be found in this piece about pointer provenance. For a longer answer that is specific to LLVM, see this paper that will get presented at OOPSLA in November. Also, one of my coauthors wrote a blog post about this material.

Future Directions for Optimizing Compilers

I wanted to write a manifesto-ish sort of piece about what compilers are supposed to look like in the future. Nuno was the obvious coauthor since I’ve talked to him about this topic so much that I’m overall not really sure which parts started out as his ideas and which were mine. The article didn’t end up feeling like a good fit for a series of blog posts, so we’ve placed it on arXiv: Future Directions for Optimizing Compilers. The first paragraph of the paper gives the main idea:

As software becomes larger, programming languages become higher-level, and processors continue to fail to be clocked faster, we’ll increasingly require compilers to reduce code bloat, eliminate abstraction penalties, and exploit interesting instruction sets. At the same time, compiler execution time must not increase too much and also compilers should never produce the wrong output. This paper examines the problem of making optimizing compilers faster, less buggy, and more capable of generating high-quality output.

We plan to keep revising the paper and would be happy to receive feedback on it.

How LLVM Optimizes a Function

An optimizing, ahead-of-time compiler is usually structured as:

  1. A frontend that converts source code into an intermediate representation (IR).
  2. A target-independent optimization pipeline: a sequence of passes that successively rewrite the IR to eliminate inefficiencies and forms that cannot be readily translated into machine code. Sometimes called the “middle end.”
  3. A target-dependent backend that generates assembly code or machine code.

In some compilers the IR format remains fixed throughout the optimization pipeline; in others the format or semantics change. In LLVM the format and semantics are fixed, and consequently it should be possible to run any sequence of passes you want without introducing miscompilations or crashing the compiler.
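
For example, given an IR file foo.ll, something like this (using LLVM 6-era pass names) runs a small hand-picked pipeline:

opt -S -sroa -early-cse -simplifycfg foo.ll -o foo-opt.ll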

The sequence of passes in the optimization pipeline is engineered by compiler developers; its goal is to do a pretty good job in a reasonable amount of time. It gets tweaked from time to time, and of course there’s a different set of passes that gets run at each optimization level. A longish-standing topic in compiler research is to use machine learning or something to come up with a better optimization pipeline either in general or else for a specific application domain that is not well-served by the default pipelines.

Some principles for pass design are minimality and orthogonality: each pass should do one thing well, and there should not be much overlap in functionality. In practice, compromises are sometimes made. For example when two passes tend to repeatedly generate work for each other, they may be integrated into a single, larger pass. Also, some IR-level functionality such as constant folding is so ubiquitously useful that it might not make sense as a pass; LLVM, for example, implicitly folds away constant operations as instructions are created.

In this post we’ll look at how some of LLVM’s optimization passes work. I’ll assume you’ve read this piece about how Clang compiles a function or alternatively that you more or less understand how LLVM IR works. It’s particularly useful to understand the SSA (static single assignment) form: Wikipedia will get you started and this book contains more information than you’re likely to want to know. Also see the LLVM Language Reference and the list of optimization passes.

We’re looking at how Clang/LLVM 6.0.1 optimizes this C++:

bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;
}

Keep in mind that the optimization pipeline is a busy place and we’re going to miss out on a lot of fun stuff such as:

  • inlining, an easy but super-important optimization, which can’t happen here since we’re looking at just one function
  • basically everything that is specific to C++ vs C
  • autovectorization, which is defeated by the early loop exit

In the text below I’m going to skip over every pass that doesn’t make any changes to the code. Also, we’re not even looking at the backend, and there’s a lot going on there as well. Even so, this is going to be a bit of a slog! (Sorry to use images below, but it seemed like the best way to avoid formatting difficulties; I hate fighting with WP themes. Click on the image to get a larger version. I used icdiff.)

Here’s the IR file emitted by Clang (I manually removed the “optnone” attribute that Clang put in) and this is the command line that we can use to see the effect of each optimization pass:

opt -O2 -print-before-all -print-after-all is_sorted2.ll

The first pass is “simplify the CFG” (control flow graph). Because Clang does not optimize, IR that it emits often contains easy opportunities for cleanup:

Here, basic block 26 simply jumps to block 27. This kind of block can be eliminated, with jumps to it being forwarded to the destination block. The diff is a bit more confusing than it needs to be due to the implicit block renumbering performed by LLVM. The full set of transformations performed by SimplifyCFG is listed in a comment at the top of the pass:

This file implements dead code elimination and basic block merging, along with a collection of other peephole control flow optimizations. For example:

  • Removes basic blocks with no predecessors.
  • Merges a basic block into its predecessor if there is only one and the predecessor only has one successor.
  • Eliminates PHI nodes for basic blocks with a single predecessor.
  • Eliminates a basic block that only contains an unconditional branch.
  • Changes invoke instructions to nounwind functions to be calls.
  • Change things like “if (x) if (y)” into “if (x&y)”.

Most opportunities for CFG cleanup are a result of other LLVM passes. For example, dead code elimination and loop invariant code motion can easily end up creating empty basic blocks.

The next pass to run, SROA (scalar replacement of aggregates), is one of our heavy hitters. The name ends up being a bit misleading since SROA is only one of its functions. This pass examines each alloca instruction (function-scoped memory allocation) and attempts to promote it into SSA registers. A single alloca will turn into multiple registers when it is statically assigned multiple times and also when the alloca is a class or struct that can be split into its components (this splitting is the “scalar replacement” referred to in the name of the pass). A simple version of SROA would give up on stack variables whose addresses are taken, but LLVM’s version interacts with alias analysis and ends up being fairly smart (though the smarts are not needed in our example).

After SROA, all alloca instructions (and their corresponding loads and stores) are gone, and the code ends up being much cleaner and more amenable to subsequent optimization (of course, SROA cannot eliminate all allocas in general — this only works when the pointer analysis can completely eliminate aliasing ambiguity). As part of this process, SROA had to insert some phi instructions. Phis are at the core of the SSA representation, and the lack of phis in the code emitted by Clang tells us that Clang emits a trivial kind of SSA where communication between basic blocks is through memory, instead of being through SSA registers.
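
To illustrate the “scalar replacement” part of the name, here’s a hypothetical function (not the one we’re compiling) where SROA can split an aggregate and promote the pieces:

struct point { int x, y; };

int taxicab_origin(void) {
  struct point p; // one alloca at -O0, with p living in memory
  p.x = 3;
  p.y = 4;
  // p's address never escapes, so SROA can split p into two scalars,
  // promote each into an SSA register, and let later passes fold to 7
  return p.x + p.y;
}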

Next is “early common subexpression elimination” (CSE). CSE tries to eliminate the kind of redundant subcomputations that appear both in code written by people and in partially-optimized code. “Early CSE” is a fast, simple kind of CSE that looks for trivially redundant computations.

Here both %10 and %17 do the same thing, so uses of one of the values can be rewritten as uses of the other, and then the redundant instruction eliminated. This gives a glimpse of the advantages of SSA: since each register is assigned only once, there’s no such thing as multiple versions of a register. Thus, redundant computations can be detected using syntactic equivalence, with no reliance on a deeper program analysis (the same is not true for memory locations, which live outside of the SSA universe).
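
A hypothetical example of the kind of redundancy that Early CSE picks off:

int f(int a, int b) {
  int s = (a * b) + 1;
  int t = (a * b) + 2;
  // in SSA form the two "a * b" multiplies are syntactically identical,
  // so uses of one can be rewritten to use the other, and the now-dead
  // duplicate deleted
  return s * t;
}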

Next, a few passes run that have no effect, and then comes the “global variable optimizer,” which self-describes as:

This pass transforms simple global variables that never have their address taken. If obviously true, it marks read/write globals as constant, deletes variables only stored to, etc.

It makes this change:

The thing that has been added is a function attribute: metadata used by one part of the compiler to store a fact that might be useful to a different part of the compiler. You can read about the rationale for this attribute here.

Unlike other optimizations we’ve looked at, the global variable optimizer is interprocedural: it looks at an entire LLVM module. A module is more or less equivalent to a compilation unit in C or C++. In contrast, intraprocedural optimizations look at only one function at a time.

The next pass is the “instruction combiner”: InstCombine. It is a large, diverse collection of “peephole optimizations” that (typically) rewrite a handful of instructions connected by data flow into a more efficient form. InstCombine won’t change the control flow of a function. In this example it doesn’t have a lot to do:

Here instead of subtracting 1 from %1 to compute %4, we’ve decided to add -1. This is a canonicalization rather than an optimization. When there are multiple ways of expressing a computation, LLVM attempts to canonicalize to a (often arbitrarily chosen) form, which is then what LLVM passes and backends can expect to see. The second change made by InstCombine is canonicalizing two sign-extend operations (sext instructions) that compute %7 and %11 to zero-extends (zext). This is a safe transformation when the compiler can prove that the operand of the sext is non-negative. That is the case here because the loop induction variable starts at zero and stops before it reaches n (if n is negative, the loop never executes at all). The final change is adding the “nuw” (no unsigned wrap) flag to the instruction that produces %10. We can see that this is safe by observing that (1) the induction variable is always increasing and (2) that if a variable starts at zero and increases, it would become undefined by passing the sign wraparound boundary next to INT_MAX before it reaches the unsigned wraparound boundary next to UINT_MAX. This flag could be used to justify subsequent optimizations.

Next, SimplifyCFG runs for a second time, removing two empty basic blocks:

“Deduce function attributes” annotates the function further:

“Norecurse” means that the function is not involved in any recursive loop and a “readonly” function does not mutate global state. A “nocapture” parameter is not saved anywhere after its function returns and a “readonly” parameter refers to storage not modified by the function. Also see the full set of function attributes and the full set of parameter attributes.

Next, “rotate loops” moves code around in an attempt to improve conditions for subsequent optimizations:

Although this diff looks a bit alarming, there’s not all that much happening. We can see what happened more readily by asking LLVM to draw the control flow graph before and after loop rotation. Here are the before (left) and after (right) views:

The original code still matches the loop structure emitted by Clang:

  initializer
  goto COND
COND:
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  goto COND
EXIT:

Whereas the rotated loop looks like this:

  initializer
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  if (condition)
    goto BODY
  else
    goto EXIT
EXIT:

(Corrected as suggested by Johannes Doerfert below; thanks!)

The point of loop rotation is to remove one branch and to enable subsequent optimizations. I did not find a better description of this transformation on the web (the paper that appears to have introduced the term doesn’t seem all that relevant).

CFG simplification folds away the two basic blocks that contain only degenerate (single-input) phi nodes:

The instruction combiner rewrites “%4 = 0 s< (%1 - 1)” as “%4 = %1 s> 1”. This kind of thing is useful because it reduces the length of a dependency chain and may also create dead instructions (see the patch making this happen). This pass also eliminates another trivial phi node that was added during loop rotation:

Next, “canonicalize natural loops” runs; it self-describes as:

This pass performs several transformations to transform natural loops into a simpler form, which makes subsequent analyses and transformations simpler and more effective.

Loop pre-header insertion guarantees that there is a single, non-critical entry edge from outside of the loop to the loop header. This simplifies a number of analyses and transformations, such as LICM.

Loop exit-block insertion guarantees that all exit blocks from the loop (blocks which are outside of the loop that have predecessors inside of the loop) only have predecessors from inside of the loop (and are thus dominated by the loop header). This simplifies transformations such as store-sinking that are built into LICM.

This pass also guarantees that loops will have exactly one backedge.

Indirectbr instructions introduce several complications. If the loop contains or is entered by an indirectbr instruction, it may not be possible to transform the loop and make these guarantees. Client code should check that these conditions are true before relying on them.

Note that the simplifycfg pass will clean up blocks which are split out but end up being unnecessary, so usage of this pass should not pessimize generated code.

This pass obviously modifies the CFG, but updates loop information and dominator information.

Here we can see loop exit-block insertion happen:

Next up is “induction variable simplification”:

This transformation analyzes and transforms the induction variables (and computations derived from them) into simpler forms suitable for subsequent analysis and transformation.

If the trip count of a loop is computable, this pass also makes the following changes:

  1. The exit condition for the loop is canonicalized to compare the induction value against the exit value. This turns loops like 'for (i = 7; i*i < 1000; ++i)' into 'for (i = 0; i != 25; ++i)'
  2. Any use outside of the loop of an expression derived from the indvar is changed to compute the derived value outside of the loop, eliminating the dependence on the exit value of the induction variable. If the only purpose of the loop is to compute the exit value of some derived expression, this transformation will make the loop dead.

The effect of this pass here is simply to rewrite the 32-bit induction variable as 64 bits:

I don’t know why the zext — previously canonicalized from a sext — is turned back into a sext.

Now “global value numbering” performs a very clever optimization. Showing this one off is basically the entire reason that I decided to write this blog post. See if you can figure it out just by reading the diff:

Got it? Ok, the two loads in the loop on the left correspond to a[i] and a[i + 1]. Here GVN has figured out that loading a[i] is unnecessary because a[i + 1] from one loop iteration can be forwarded to the next iteration as a[i]. This one simple trick halves the number of loads issued by this function. Both LLVM and GCC only got this transformation recently.
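
In roughly equivalent C, the transformed loop behaves like this sketch (the real change happens at the IR level, of course):

#include <stdbool.h>

bool is_sorted_gvn(int *a, int n) {
  if (n - 1 > 0) {
    int prev = a[0]; // loaded once, before the loop
    for (int i = 0; i < n - 1; i++) {
      int next = a[i + 1]; // now the only load in the loop body
      if (prev > next)
        return false;
      prev = next; // this iteration's a[i + 1] is the next one's a[i]
    }
  }
  return true;
}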

You might be asking yourself if this trick still works if we compare a[i] against a[i + 2]. It turns out that LLVM is unwilling or unable to do this, but GCC is willing to throw up to four registers at this problem.

Now “bit-tracking dead code elimination” runs:

This file implements the Bit-Tracking Dead Code Elimination pass. Some instructions (shifts, some ands, ors, etc.) kill some of their input bits. We track these dead bits and remove instructions that compute only these dead bits.

But it turns out the extra cleverness isn’t needed since the only dead code is a GEP, and it is trivially dead (GVN removed the load that previously used the address that it computes):

Now the instruction combiner sinks an add into a different basic block. The rationale behind putting this transformation into InstCombine is not clear to me; perhaps there just wasn’t a more obvious place to do this one:

Now things get a tiny bit weird, “jump threading” undoes what “canonicalize natural loops” did earlier:

Then we canonicalize it back:

And CFG simplification flips it the other way:

And back:

And forth:

And back:

And forth:

Whew, we’re finally done with the middle end! The code at the right is what gets passed to (in this case) the x86-64 backend.

You might be wondering if the oscillating behavior towards the end of the pass pipeline is the result of a compiler bug, but keep in mind that this function is really, really simple, and a whole bunch of passes mixed in with the flipping and flopping went unmentioned because they didn’t make any changes to the code. As far as the second half of the middle-end optimization pipeline goes, we’re basically looking at a degenerate execution here.

Acknowledgments: Some students in my Advanced Compilers class this fall provided feedback on a draft of this post (I also used this material as an assignment). I ran across the function used here in this excellent set of lecture notes about loop optimizations.

How Clang Compiles a Function

I’ve been planning on writing a post about how LLVM optimizes a function, but it seems necessary to first write about how Clang translates C or C++ into LLVM. This is going to be fairly high-level:

  • rather than looking at Clang’s internals, I’m going to focus on how Clang’s output relates to its input
  • we’re not going to look at any non-trivial C++ features

We’ll use this small function that I borrowed from these excellent lecture notes on loop optimizations:

bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;
}

Since Clang doesn’t do any optimization, and since LLVM IR was initially designed as a target for C and C++, the translation is going to be relatively easy. I’ll use Clang 6.0.1 (or something near it, since it hasn’t quite been released yet) on x86-64.

The command line is:

clang++ is_sorted.cpp -O0 -S -emit-llvm

In other words: compile the file is_sorted.cpp as C++ and then tell the rest of the LLVM toolchain: don’t optimize, emit assembly... but no, actually, emit textual LLVM IR instead. Textual IR is bulky and not particularly fast to print or parse; the binary “bitcode” format is always preferred when a human isn’t in the loop. The full IR file is here; we’ll be looking at it in parts.

Starting at the top of the file we have:

; ModuleID = 'is_sorted.cpp'
source_filename = "is_sorted.cpp"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

Any text between a semicolon and the end of a line is a comment, so the first line does nothing, but if you care, an LLVM “module” is basically a compilation unit: a container for code and data. The second line doesn’t concern us. The third line describes some choices and assumptions made by the compiler; they don’t matter much for this post but you can read more here. The target triple is a gcc-ism that doesn’t concern us here either.

An LLVM function optionally has attributes:

; Function Attrs: noinline nounwind optnone uwtable

Some of them (like these) are supplied by the front end, others get added later by optimization passes. This is just a comment; the actual attributes are specified towards the end of the file. These attributes don’t have anything to do with the meaning of the code, so I’m not going to discuss them, but you can read about them here if you’re curious.

Ok, finally on to the function itself:

define zeroext i1 @_Z9is_sortedPii(i32* %a, i32 %n) #0 {

“zeroext” means that the return value of the function (i1, a one-bit integer) should be zero-extended, by the backend, to whatever width the ABI requires. Next comes the mangled name of the function, then its parameter list, which is basically the same as in the C++ code, except that the “int” type has been refined to 32 bits. The #0 connects this function to an attribute group near the bottom of the file.
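
If you’re curious, a demangler will decode the name:

$ c++filt _Z9is_sortedPii
is_sorted(int*, int)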

Now on to the first basic block:

entry:
  %retval = alloca i1, align 1
  %a.addr = alloca i32*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store i32* %a, i32** %a.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

Every LLVM instruction has to live inside a basic block: a collection of instructions that is only entered at the top and only exited at the bottom. The last instruction of a basic block must be a terminator instruction: fallthrough is not allowed. Every function must have an entry block that has no predecessors (places to come from) other than the function entry itself. These properties and others are checked when an IR file is parsed, and can optionally be checked more times during compilation by the “module verifier.” The verifier is useful for debugging situations where a pass emits illegal IR.

The first four instructions in the entry basic block are “alloca”s: stack memory allocations. The first three are for implicit variables created during the compilation, the fourth is for the loop induction variable. Storage allocated like this can only be accessed using load and store instructions. The next three instructions initialize three of the stack slots: a.addr and n.addr are initialized using the values passed into the function as parameters and i is initialized to zero. The return value does not need to be initialized: this will be taken care of later by any code that wasn’t undefined at the C or C++ level. The last instruction is an unconditional branch to the subsequent basic block (we’re not going to worry about it here, but most of these unnecessary jumps can be elided by an LLVM backend).

You might ask: why does Clang allocate stack slots for a and n? Why not just use those values directly? In this function, since a and n are never modified, this strategy would work, but noticing this fact is considered an optimization and thus outside of the job Clang was designed to do. In the general case where a and n might be modified, they must live in memory as opposed to being SSA values, which are — by definition — given a value at only one point in a program. Memory cells live outside of the SSA world and can be modified freely. This may all seem confusing and arbitrary but these design decisions have been found to allow many parts of a compiler to be expressed in natural and efficient ways.
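
As a sketch of the general case, here’s a hypothetical function that does modify its parameter; the trivial translation needs a memory cell for n because n is assigned more than once:

int count_down(int n) {
  while (n > 0)
    n = n - 1; // n is not single-assignment, so without phi nodes the
               // straightforward translation keeps it in a stack slot
  return n;
}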

I think of Clang as emitting degenerate SSA code: it meets all the requirements for SSA, but only because basic blocks communicate through memory. Emitting non-degenerate SSA requires some care and some analysis and Clang’s refusal to do these leads to a pleasing separation of concerns. I haven’t seen the measurements but my understanding is that generating a lot of memory operations and then almost immediately optimizing many of them away isn’t a major source of compile-time overhead.

Next let’s look at how to translate a for loop. The general form is:

for (initializer; condition; modifier) {
  body
}

This is translated roughly as:

  initializer
  goto COND
COND:
  if (condition)
    goto BODY
  else
    goto EXIT
BODY:
  body
  modifier
  goto COND
EXIT:

Of course this translation isn’t specific to Clang: any C or C++ compiler would do the same.

In our example, the loop initializer has been folded into the entry basic block. The next basic block is the loop condition test:

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %sub = sub nsw i32 %1, 1
  %cmp = icmp slt i32 %0, %sub
  br i1 %cmp, label %for.body, label %for.end

As a helpful comment, Clang is telling us that this basic block can be reached either from the for.inc or the entry basic block. This block loads i and n from memory, computes n - 1 (the “nsw” flag preserves the C++-level fact that signed overflow is undefined; without this flag an LLVM subtract would have two’s complement semantics), compares i against n - 1 using “signed less than”, and then finally branches either to the for.body basic block or the for.end basic block.
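
As a sketch of what the flag buys, here’s a hypothetical comparison that a compiler may fold to “true” precisely because signed overflow is undefined; with plain two’s complement subtraction (no nsw), x == INT_MIN would wrap around and falsify it:

int lt_self_minus_one(int x) {
  // with nsw semantics on the subtract, x - 1 < x simplifies to 1
  return x - 1 < x;
}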

The body of the for loop can only be entered from the for.cond block:

for.body:
  %2 = load i32*, i32** %a.addr, align 8
  %3 = load i32, i32* %i, align 4
  %idxprom = sext i32 %3 to i64
  %arrayidx = getelementptr inbounds i32, i32* %2, i64 %idxprom
  %4 = load i32, i32* %arrayidx, align 4
  %5 = load i32*, i32** %a.addr, align 8
  %6 = load i32, i32* %i, align 4
  %add = add nsw i32 %6, 1
  %idxprom1 = sext i32 %add to i64
  %arrayidx2 = getelementptr inbounds i32, i32* %5, i64 %idxprom1
  %7 = load i32, i32* %arrayidx2, align 4
  %cmp3 = icmp sgt i32 %4, %7
  br i1 %cmp3, label %if.then, label %if.end

The first two lines load a and i into SSA registers; i is then widened to 64 bits so it can participate in an address computation. getelementptr (gep for short) is LLVM’s famously baroque pointer computation instruction; it even has its own FAQ. Unlike a machine language, LLVM does not treat pointers and integers the same way. This facilitates alias analysis and other memory optimizations. The code then goes on to load a[i] and a[i + 1], compare them, and branch based on the result.
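
In C terms, the first gep computes something like the sketch below; gep itself only does address arithmetic, and the load is a separate instruction:

#include <stdint.h>

int load_elem(int *a, int32_t i) {
  // the sext widens i to 64 bits; the gep then scales it by sizeof(int)
  // and adds the result to the base pointer
  int *arrayidx = (int *)((char *)a + (int64_t)i * sizeof(int));
  return *arrayidx; // the load
}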

The if.then block stores 0 into the stack slot for the function return value and branches unconditionally to the function exit block:

if.then:
  store i1 false, i1* %retval, align 1
  br label %return

The else block is trivial:

if.end:
  br label %for.inc

And the block for adding one to the loop induction variable is easy as well:

for.inc:
  %8 = load i32, i32* %i, align 4
  %inc = add nsw i32 %8, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

This code branches back up to the loop condition test.

If the loop terminates normally, we want to return true:

for.end:
  store i1 true, i1* %retval, align 1
  br label %return

And finally whatever got stored into the return value stack slot is loaded and returned:

return:
  %9 = load i1, i1* %retval, align 1
  ret i1 %9

There’s a bit more at the bottom of the function but nothing consequential. Ok, this post became longer than I’d intended; the next one will look at what the IR-level optimizations do to this function.

(Thanks to Xi Wang and Alex Rosenberg for sending corrections.)

Software Engineering Takeaways

I had a great time this spring teaching a software engineering course for a new professional masters degree program created by my department. Since I didn’t use slides or hand out lecture notes, some students were asking if maybe I could write up a summary of what I wanted them to learn in the course. This sounded like a good idea and I figured I’d make a blog post out of it.

This course didn’t have a textbook, but we ended up going through much of Writing Solid Code and also several chapters of Code Complete. I particularly like Appendix D of the 2nd edition of Writing Solid Code, where Maguire describes how he conducts an interview for a programming position. The students also completed about a dozen (mostly) programming assignments, individually and in groups, that I’m not going to talk about here.

Testing

Being good at testing is sort of a superpower and computer science programs usually don’t do a great job teaching it. And testing truly is hard to teach since there’s a lot to it and so much of testing is either highly situational or else is about engaging the problem with the right mindset and plenty of energy: you can’t test well unless you truly want to make things fail. My approach to teaching this material is to focus on what I think of as testing’s three pillars:

  1. Test cases. These come from regressions, from systematic or randomized test-case generators, from requirements and specifications, and from corner cases in the implementation. In a few special situations, exhaustive testing works. Test cases should rarely be discarded but they can be segregated, for example into fast and slow ones, which run at different times. Test cases should exercise all relevant sources of input: not just file inputs, but also environment variables, user inputs, and things like the Windows registry, if the system under test depends on those things. The time at which parts of the test case arrive may be important. Though I forgot to give it to the class, Cem Kaner’s What Is a Good Test Case? is excellent.
  2. Oracles. Test cases are only useful insofar as we can tell if the system under test processes them correctly. Some oracles are easy and broadly applicable, such as detecting crashes, excessive resource use, and violations of programming language rules (e.g. using ASan and UBSan). The most useful oracles, however, are specific to the system under test. The one we see most often is to specify, by hand, the expected behavior of the system, based on our understanding of the requirements or specification. Differential testing — checking one implementation of a spec against another — is highly powerful, and more broadly applicable than it first appears: the reference implementation doesn’t need to implement the full functionality of the system, and sometimes it is simply a different mode of the same system. For example, an optimizing compiler can be tested against a non-optimizing version of itself. Function-inverse pairs make an excellent oracle, when applicable (see the sketch after this list). In metamorphic testing we change the test case (add irrelevant levels of nesting to a C program, or remove dead code from it) in a way that shouldn’t change how it works. Assertions and checkReps are extremely valuable oracles. Finding good oracles requires creativity and attention to detail, but the potential rewards are high.
  3. Coverage. Every serious programming language has one or more tools for measuring code coverage, with the goal of finding code not exercised by a test suite. There are a lot of variants on coverage, but in practice we seldom see anything beyond line or branch coverage. If the coverage induced by a test suite is bad, the test suite itself is bad. If the coverage is good, then the test suite might be good, but this is not a sure thing.
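
Here’s a sketch of a function-inverse oracle, with hypothetical encode/decode functions standing in for any serialize/parse or compress/decompress pair:

#include <assert.h>
#include <string.h>

// hypothetical functions under test
char *encode(const char *in);
char *decode(const char *in);

void roundtrip_oracle(const char *input) {
  char *enc = encode(input);
  char *dec = decode(enc);
  // the inverse property is the oracle: no hand-written expected output
  assert(strcmp(input, dec) == 0);
}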

Testing should be as frictionless as possible: investments in automation and parallelization often pay off. In class we saw how incredibly easy it is to run the tests for a project in Github on every commit using Travis.

One of my favorite kinds of testing is burning in an ADT implementation — this exercise brings the three pillars of testing together in a very satisfying way.

How SQLite is Tested is great supplemental reading.

Tool and Library Ecosystems

One of the defining characteristics of modern software engineering is its reliance on tooling to help us create large projects rapidly, with lots of high-quality libraries to build on. Every programming language has one or more collections of tools, each of which forms a more or less coherent ecosystem for getting programming tasks done: editing, refactoring, linting, profiling, building, packaging, measuring coverage, debugging, etc. The tricky bit for newcomers is to rapidly learn the shape of an ecosystem in order to avoid wasting time doing jobs that could be better accomplished with tool support.

Modularity

If testing is the reason that large software sometimes works, then modularity is the reason it can be constructed in the first place. Modularity is kind of a tough topic to cover in a course but we did spend time on API design, where I mainly wanted to convey that it’s hard to get right and there are plenty of pitfalls to avoid: excess mutability, unclear ordering constraints among API calls, long lists of parameters, poor choices for default values, hard-to-follow memory ownership protocols, cryptic names, and unwise exposure of implementation details. Something I tried to show the class is how a small group of people working closely together on a well-defined task can almost always succeed, but that — without some care and experience — multiple small groups who have accomplished their tasks separately will invariably have problems integrating the pieces they developed.

Code Reviews

Testing only gets us so far and a huge number of problems can be headed off by putting two or more sets of eyes on code before it lands, whether this is done in Github or in a meeting room. We did a bit of code reviewing in class, though I feel like we should have done this more. Some aspects I find important are trying to stay constructive and objective, having clear goals for the code review, and making sure that outcomes get written down and acted upon.

Coding Styles

My view is that the details aren’t nearly as important as just having and following a coding style. This includes source code formatting details, identifier naming, language features to avoid, etc. This can result in a code base that is more approachable to newcomers and where some kinds of bugs are easier to spot. The various Google documents seem as good examples as any. Coding styles are particularly important for languages that are older, cruftier, and less safe. For example, LLVM lives in a C++ style that I generally find to be comfortable and tasteful.

Source Control Systems

Git is more or less the only game in town so there’s not much to talk about there. The thing I tried to convince the students here is that we don’t just “use git.” Rather, it’s crucial to define a sensible git-based workflow and get everyone onboard with it. The main thing is that for a variety of common use cases, we have a fairly mechanical set of steps to follow; git is confusing and badly designed enough that workflow innovation is best left to people with a lot of experience.

Critical Systems and Responsibility

We are far past the point of pretending that software is either ethically neutral or uniformly positive, and the 90s-era Internet optimism sounds pretty naive at this point. A significant fraction of today’s CS students are going to work on, at some point, software that is very much ethically non-neutral. This is a good post on that topic. We read various documents related to the Toyota unintended acceleration case and discussed other software safety problems like the Therac-25.

Backwards Compatibility

People tend to underestimate its importance and there are great stories on both sides. Microsoft’s commitment is incredible: a long time ago I had access to the Windows 2000 sources and ran across crazy things, like a copy of Windows 3.1 (in x86 assembly!) and routines for detecting legacy applications and implementing various buggy behaviors that they depended on. On the other side we have, for example, the Python 2/3 debacle.

Static Analysis

I tried to convince students that static analysis is useful, starting with compiler warnings. We read A Few Billion Lines of Code Later and they spent some time looking at issues pointed out by the Clang static analyzer.

Engineering for Performance

The message here is a bit subtle: we want to discourage premature optimization while still encouraging people to learn to build software that isn’t fundamentally too slow to meet its goals. So on one hand we don’t want to write everything in C “for speed” but on the other hand we need to avoid showstoppers such as Python, a bottleneck data structure stored on disk, a quadratic algorithm, etc.

For performance tuning, of course measurement is everything, so we need profilers and maybe also hardware-specific utilities like perf. It doesn’t hurt to know the strengths and weaknesses of both the compiler and the hardware platform, and we should always be ready to read some assembly. Chapters 25 and 26 of Code Complete 2e are a good resource for all of this, though they do not present a very nuanced view of what one can expect the compiler to accomplish.

Reporting Bugs

Getting a software developer to stop doing what they wanted to do, and to spend the day fixing a defect instead, can be difficult, but there are a few simple ingredients that can make bug reports vastly more effective. The report should be framed politely and matter-of-factly; implying that the developers are incompetent is rarely helpful. The issue must be reproducible and the bug report must describe all of the circumstances necessary to make the bug show itself. However, all nonessential circumstances should be left out, including any part of the test case that is not necessary for triggering the bug.

Software Process

There are plenty of software development process models, but I guess I don’t find any of these to be particularly compelling, so we didn’t spend much time on them. On the other hand, the elements of software process — estimation, requirements, modeling, architecture, implementation, quality assurance, maintenance, etc. — are hard to argue with. I spent a bit of effort trying to prepare students for the huge diversity in software development organizations they might see out in the world, from startups to banks to Googles to IBMs. I tried to get them used to the fact that in some cases management may have wildly unrealistic ideas about software development and that requirements can change at the drop of a hat.

The Human Element

Effectively reviewing someone’s code requires a light touch; you have to critique the code rather than judging its author. On the other side, it can be very difficult to gracefully process criticism of a piece of code that you’ve bled and sweated over for months. A bug report is really just an argument that a person should make a change to a piece of software. It is hard to implement a piece of software that solves someone’s problem without putting yourself in their place. Similarly, it is very difficult to solve a problem for someone you don’t like or respect or at least empathize with to some degree. UI design requires understanding the needs of people who don’t share your own abilities and cultural background. Ethical concerns are common, and probably should be thought about a lot more often than they are. Team dynamics are complex and managing people is really not easy. In summary, software development is fundamentally a human endeavor and we’d all do well to keep that in mind.

Paths to External Engagement in Computer Science Research

The other day I wrote a post imploring academic computer scientists to at least occasionally break out of their research bubbles and engage with real engineering problems where their research matters. That post was, admittedly, a bit facile about how one might make this engagement happen. This piece suggests some ways. I don’t claim any particular authority or expertise, though I have — except where noted — tried out all of the suggestions below. Perhaps others will chime in with what has worked for them.

Talk to a lot of people and create a network. Networking is a really important skill. Back in the day Phil Agre’s Networking on the Network was what students would read. It predates modern social networks but contains a lot of timeless advice and in any case I don’t know of a better choice.

Attend conferences outside the academic mainstream. In some cases, such as SIGGRAPH and Supercomputing, there’s plenty of non-academic content at the conference you’re attending anyhow, but most of us aren’t so lucky. There’s a really wide range of industrial conferences; some of them definitely won’t be very interesting for academics so you need to do some homework ahead of time to find the one where the right people will be. For my current work, the LLVM Developers Meeting is the main event, almost all of the community’s heavy hitters are there and the technical sessions are amazing. The security community has plenty of great non-academic events. I’ve heard good things about !!Con and Strange Loop. In any case, the point of attending these conferences (besides the technical content) is to meet people who aren’t professors or students.

Spend a day visiting a company or a government agency. You need an invitation, but you can invite yourself if you can make a contact for example via your advisor, a mutual friend, or by meeting people at conferences. Talk to people there, get a sense of what they’re doing with their days, what they’re worried about, what they’re frustrated with. Give a talk about your work if possible, but this isn’t necessary. It often makes sense to do these visits when you’re already traveling.

Spend longer visiting a company or government agency. Depending on your career stage this could be an internship, a postdoc, a sabbatical, a summer, or a leave of absence. This is a chance to work closely with people for an extended period of time. A lot of people do this at some point in their careers and I think it’s a really good idea.

Engage on Twitter. It’s a weird, crowded, noisy place and it can take a while to find the right people to follow and longer to get them to follow you. The advantage of Twitter is that there’s a huge number of smart people out there and communicating with them is almost frictionless.

Blog. People are far more likely to read a blog entry than a paper, in my experience. Also, the readership is different, because non-academics are even less likely to read a paper than academics are. Realistically, starting a blog only makes sense if you have a fairly consistent stream of things to say that don’t fit into tweets and don’t really belong in academic papers. Building an audience takes time and requires a certain amount of regularity in writing; these don’t necessarily fit in very well with the academic binge model of working that many of us subscribe to. Another issue is that blogging doesn’t pay the bills, academically speaking — you should only do it because you want to, not because you expect any direct benefit to your career. I waited until getting tenure to start a blog for this reason, and also to make sure that I had at least a few years’ worth of ideas lined up.

Find people who are great at external engagement. Emulate them, collaborate with them, or join them. The Racket folks are amazing at this.

Release software. Put your stuff on Github, polish it up, and then tell people about it. Get users, accept pull requests, respond to feedback, fix bugs, add features, cut releases, and repeat. Either your code will provide people with a good value proposition or it won’t — either way you learn something. The caveats are that building a user base takes time, creating realistically usable software is like 25 times as much work as creating research-grade crapware, and only a small subset of computer science professors will value your contributions in this area. But it is enormously fun and anyway you don’t want to make the mistake of caring too much what professors think.

Engage with existing open source software. For many of us, there’s an obvious open source project that we could be contributing to or otherwise helping out. Find that project and read their mailing lists, look into the issue tracker, build and use the code, read the code, and maybe submit a patch or two. Beyond these things, consider attending their meetings or BoF sessions, if these exist. A reasonable long-term goal would be to use your work to make a significant improvement to the open source program.

Start a company. This one I haven’t done, though I know many people who have. It is a somewhat extreme option, as much a lifestyle choice as research engagement strategy. Details are out of scope of this post and anyway I don’t know anything about doing this.

Ok, with all that said, I’ll make a prediction or two about what will happen if you follow these suggestions. First, you’ll often find it frustrating and some of the time you invest will be wasted. I’ve burned months on things that never bore the tiniest fruit; if I knew how to tell you to avoid this, I certainly would. Second, you’ll discover that the problems that people are having out there aren’t the ones that you would have predicted, nor are they the ones that your CS friends and colleagues predicted. You need to learn to listen to people, but often even the people having problems aren’t actually having the problems that they think they’re having (anyone who has worked tech support will tell you this is the case more often than not). You need to learn to observe carefully and read between the lines to figure out what’s really going on. Third, at some point you will run into the distinction between problem-driven research and solution-driven research. The former is like trying to cure cancer or put a person on Mars: the problem is everything and you’ll consider any solution that might work. The latter is where your research hammer is a damn good one and you’re never going to put it down: if it can’t solve someone’s problem, you’ll move on and find a different problem. Obviously there’s a spectrum — but you’ll need to decide where on it you sit.

Closing the Loop: The Importance of External Engagement in Computer Science Research

Computer scientists tend to work by separating the essence of a problem from its environment, solving it in an abstract form, and then figuring out how to make the abstract solution work in the real world. For example, there is an enormous body of work on solving searching and sorting problems and in general it applies to finding and rearranging things regardless of whether they live in memory, on disk, or in a filing cabinet.

To paint a bit of a caricature, we have the real world where:

  • all problems are engineering problems, and every engineering problem is distinct from the rest in small or large ways
  • abstractions are leaky: machines fail, bits flip, packets drop, and bugs exist
  • requirements are many-dimensional, diverse, and poorly specified
  • systems often involve human beings who are notoriously hard to reason about

Overall, engineering problems are messy and we usually can’t prove anything about anything, any more than we could prove the correctness and optimality of a bridge or lawnmower.

On the other hand, in the abstract problem space, we’ve dropped every detail that we don’t wish to reason about but retained (ideally) all of the important characteristics of the problem. We’re now dealing with a model — sometimes a formal mathematical model, other times an executable model such as a kernel or simulation — of part or all of the problem. Models can be rigorously analyzed for efficiency, optimality, correctness, and whatever else. Furthermore, connections between problems that initially seem very different may become more apparent in the abstract form.

This process of lifting engineering challenges into abstract problems, solving them, and applying the results — I’ll call it the computer science research loop — is so integral to the DNA of computer science research that it is simply assumed; people have a hard time imagining any other way to work. Also, it has been incredibly successful, which is unsurprising since we inherited it from mathematics where it had been giving good results ever since some prehistoric person observed that you could talk about the number three without actually having three things in front of you.

Here’s the loop in its simplest form: lift a concrete problem into an abstract one, solve the abstract problem, then map the solution back down to the concrete domain.

Here are some ways the research loop shows up in areas that I’m familiar with:

  • In compilers we can develop a toy language that is much easier to compile but that (we argue) retains the essential features of real programming languages.
  • In compilers we can compile a benchmark suite instead of real applications, arguing that our results will translate over into practice.
  • In resource scheduling research it is typical to abstract away all details of the jobs being scheduled.
  • In databases or operating systems we can create a transaction engine or OS kernel that supports only a tiny subset of the features provided by SQL Server or Linux, arguing that the advantages displayed by our simplified model would not disappear if we took the trouble to implement all the boring stuff.

In all cases the goal is to elide details that make our work harder, but without oversimplifying. This piece is about an avoidable but undesirable second-order effect: it is common for both edges of the computer science research loop to be weaker than they could be.

The concrete-to-abstract edge suffers when people working on the abstract side don’t have deep expertise in the concrete problems they’re trying to solve, and it also tends to weaken over time as the character of the problem drifts, causing assumptions on the abstract side to be invalidated. The abstract side has a kind of inertia: infrastructure builds up in code, formalisms, books and papers, and mental models. It requires significant time, energy, and risk-taking to throw away parts of the abstract infrastructure and create fresh pieces. Abstractions that are out of touch with real-world problems can linger, producing papers and PhD theses, for decades.

The abstract-to-concrete edge of the research loop is also problematic: solving real engineering problems, or interacting with the people whose jobs are to solve those problems, can be difficult and time-consuming. It is generally much easier to work purely on the abstract side, and in fact our field’s mathematical roots encourage this behavior. Abstract work is, of course, fine as long as someone else is doing the grungy part, but in many cases that never happens because the abstract side has drifted off to the side of the real problems, becoming more elaborate and complex over time as the easy problems get mined out, and in the end there’s no realistic prospect of applying it.

I believe these issues cause computer science research, overall, to be less interesting and impactful than it could be. I also believe that mitigating the problem isn’t that difficult and that doing so tends to make researchers’ careers a lot more fun and rewarding.

The solution is for researchers to engage with the world outside of their research bubble. Working on real-time scheduling? Then go find some drone software and help its authors or users avoid deadline misses, or else contribute scheduling code to Linux. Working on concurrency control in databases? Figure out a way to integrate the new scheme into MySQL or something, instead of waiting for them to read your paper. Working on finding bugs in software? Report the bugs to the people who maintain the code and watch their reactions. It is particularly important that students do these things, first because their intuitions often aren’t as well-developed and second because I’ve noticed that quite a few CS graduate students are quietly and privately wondering if their work is good for anything in the end. It turns out there’s a way to answer this question: engage with the people whose problems you are solving. As a bonus you’ll publish fewer papers.

It is not the case that every piece of research should be applied research. Rather, good pure research usually stems from direct personal experience with problems on the engineering side of our world. It’s a bit of a subtle point: doing the grungy boots-on-the-ground work is how we build our intuitions about which kinds of solutions actually work and which merely sound good on paper. It is hard — though not impossible — to skip this step and still do great work.

Going a bit further, my experience is that much of the interesting action in research happens on the abstract-to-concrete edge of the CS research loop, even though this work is not glamorous or well-rewarded by program committees or grant panels. Even the old workhorses like sorting an array or implementing a key-value map become dramatically more interesting and complex in the context of a real machine, operating system, compiler, and workload.

Concretely, here are some things to look for that might indicate that a research community needs to tighten up its loop:

  • few members of the community are plugged into the concrete problem domain and bringing fresh insights from developments there
  • few members of the community are moving abstract results into practice
  • members of the community are mainly interested in impressing each other (or, equivalently, papers that demonstrate external impact are not highly valued)
  • the community rewards complex solutions because they are innately interesting, as opposed to rewarding simple solutions because they have engineering merit
  • years of results cluster around the same baseline or benchmark set, instead of continually moving the bar higher

In conclusion, the tension between mathematics and engineering is alive and well in our field. My belief is that more of us, perhaps even most of us, should be skilled in, and actively working in, both modes of thinking.

Also see this followup post.

The Basic Toolbox

This post is aimed at computer science students.

In the software engineering course I’m teaching this spring, I often find myself saying things like “you need to know a scripting language” or “everyone should be able to run a code coverage tool.” Finally, the other day, a student stopped me and asked for the whole list. In other words, what — in my opinion — is the collection of tools that someone graduating with a CS degree should know how to use? Of course I couldn’t answer this on the spot but I’ve been thinking about it since then. The basic idea is that for most any common situation, you should have a decent tool at hand and be able to start solving problems with it without too much fumbling around. (Keep in mind that this is a wish list for self-study: I doubt that any CS program teaches all of these. Also, I didn’t have all of these tool skills when I got my undergraduate CS degree, though I did by the time I got a PhD.)

A version control system: Git is the obvious choice; the main thing you should have is a basic GitHub-centric workflow including pull requests, remotes, dealing with merge conflicts, etc.

A text editor: We all end up using different editors from time to time, but we should each have a solid default choice that does a good job with most editing tasks. It should highlight and indent any common programming language, integrate with a spellchecker, easily load gigantic files, have nice regex-based search and replace, etc. There are plenty of choices; many CS people migrate to vim or emacs.

A graphing program: I routinely use gnuplot, graphviz, and PowerPoint to make figures. Lots of people like matplotlib.

A presentation tool: PowerPoint, Keynote, Google Slides, something LaTeX-based, etc.

An interactive debugger for native executables: LLDB, GDB, something IDE-based.

A generic build system: Make, CMake, etc.

A scripting language: This is for low-grade automation, quick and dirty data analysis tasks, etc. Python and JavaScript would seem like natural choices. Around 20 years ago I was an intern at a networking company and my supervisor popped out of a meeting with some data concerning switch errors, and asked me to do some analysis to locate the underlying pattern. I wasn’t sure how; he handed me a Perl book and I was able to get the job done before the meeting ended.

A shell language: This is probably bash or PowerShell, but there are plenty of other choices. There’s some overlap with scripting languages but I think there are two distinct niches here: a shell language is for scripting a smallish number of commands, doing a bit of error checking, and perhaps looping or interacting with the user slightly. This sort of job is a bit too cumbersome in Python, Perl, or JavaScript.

A systems language: This is for creating servers, daemons, and other code that wants to go fast, use little memory, have few dependencies, and interact tightly with the OS. C or C++ would be the obvious choices, but Rust and Go may be fine too.

A workhorse language: This is your default programming language for most tasks; it should have a huge collection of high-quality libraries, be pretty fast, run on all common platforms, have a great tool ecosystem, etc. Racket, Java, Scala, OCaml, C#, Swift, or Haskell would be great — even C++ would work.

A pocket calculator: This is your go-to REPL for basic arithmetic and conversions between number representations; it should be near-instantaneous to get answers. For reasons I no longer remember, I use gdb for this — typically multiple times in any work day. Old standbys like bc and dc also seem like bad choices. I’m curious what other people do here.
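
If the gdb habit sounds odd, all it amounts to is borrowing the debugger’s expression evaluator, which accepts C syntax and can print in any radix. A representative session (the values here are chosen arbitrarily for illustration) looks like:

$ gdb -q
(gdb) p 1234 * 56
$1 = 69104
(gdb) p/x $1
$2 = 0x10df0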

Tools for Programming Languages

There’s no reason these days to use a language that doesn’t have a good tool ecosystem. For any given language you should know how to use its interactive debugger, static and dynamic bug-finding tools, a profiler, a code coverage tool, a build system, a package manager, and perhaps a random test-case generator.

Secondary Tools

There are a lot of other tools that could have gone into my basic toolbox, such as a data analysis tool, a browser language, a cloud-based testing service, a statistics language, a typesetting system, a spreadsheet, a database, and a GUI builder/toolkit. I don’t consider these to be fundamental; of course, your mileage may vary.

Trust Boundaries in Software Systems

One of the big things that has changed in computer science education over the last 20 years is that it is now mandatory to prepare students for writing software that lives in a hostile environment. This content can’t be limited to a computer security course; it has to be spread throughout the curriculum. My experience, based on talking to people, looking through textbooks, and looking at lecture material on the web, is that overall we’re not yet doing a great job at this.

When teaching this subject, I’ve started using trust boundaries as an organizing principle. A trust boundary exists any time we (the system designers or system owners) trust code, data, or human actors on one side of an interface more than we trust the other side of the interface. Students need to be able to recognize, understand, fortify, and stress-test the trust boundaries in any system they have a stake in.

Trust boundaries aren’t hard to find: We just need to ask questions like “What are the consequences if this code/data became horribly malicious? Is that likely? Can we defend against it? Do we want to defend against it?” It is easy to conclude, for example, that a demonic garbage collector or OS kernel might not be something that we wish to defend against, but that we had better fortify our systems against toxic PNG files that we load from random web sites.

Some basic observations about trust boundaries:

  1. They’re everywhere, even inside code written by a single person. Anytime I put an assertion into my code, it’s a tacit acknowledgment that I don’t have complete trust that the property being asserted actually holds (see the sketch following this list).
  2. The seriousness of trust boundaries varies greatly, from mild mistrust within a software library all the way to major safety issues where a power plant connects to the internet.
  3. They change over time: a lot of our security woes stem from trust boundaries becoming more serious than they had been in the past. Email was not designed for security. The NSA wasn’t ready for Snowden. Embedded control systems weren’t intended to be networked. Libraries for decoding images, movies, and other compressed file formats that were developed in the 90s were not ready for the kinds of creative exploits that they faced later on.
  4. If you fail to recognize and properly fortify an important trust boundary, it is very likely that someone else will recognize it and then exploit it.
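
To make the first point concrete, here is a minimal sketch using a made-up mean() helper: even a private function written and called by one person draws a small trust boundary at its interface, and the assertion records exactly what the caller isn’t trusted to guarantee.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The assertion marks a small trust boundary: callers are not fully
   trusted to pass a valid, non-empty array. */
static double mean(const double *vals, size_t n) {
  assert(vals != NULL && n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += vals[i];
  return sum / (double)n;
}

int main(void) {
  double xs[] = {1.0, 2.0, 3.0};
  printf("mean = %f\n", mean(xs, 3));
  return 0;
}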

To deal with trust boundaries, we have all the usual techniques and organizing principles: input sanitization, defense in depth, sandboxing, secure authentication, least privilege, etc. The issue that I’m trying to respond to with this post is that, in my experience, it doesn’t really work to hand students these tools without some sort of framework they can use to help figure out where and when to deploy the different defenses. I’d be interested to hear how other CS instructors are dealing with these issues.

Stories Behind Papers: Integer Overflow

A couple of months ago Jean Yang and Vijay Chidambaram had a Twitter discussion about the stories behind research efforts that you might hear over coffee, but that usually don’t get written up. Vijay started a series of posts containing these. I thought I’d write up a couple of them myself. Alas, none will be particularly dramatic. This seems like a good one to start with.

Around the mid/late 2000s — perhaps starting with “Nearly All Binary Searches and Mergesorts are Broken” — I got interested in integer overflow bugs. At this point the security aspect of integer bugs in C and C++ was receiving plenty of attention, but I didn’t feel like very many people were looking at the broader issue of logic errors stemming from integer overflows. Even in functional languages with super-serious type systems and a focus on correctness, integer overflow was (and is) an often-neglected issue. These problems are fundamentally difficult to deal with at compile time.
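
The bug described in that post is worth seeing in miniature. The sketch below (my own, not taken from the post, and assuming a 32-bit int) shows the classic midpoint computation from binary search, which executes a signed overflow when the indices are large, alongside the standard rewrite that avoids it:

#include <limits.h>
#include <stdio.h>

int main(void) {
  /* Index values near the top of the int range, as in a very large array. */
  int low = INT_MAX - 2;
  int high = INT_MAX - 1;

  /* The classic bug: low + high overflows a signed int, which is
     undefined behavior in C and in practice yielded a negative midpoint. */
  /* int mid = (low + high) / 2; */

  /* The standard fix: high - low cannot overflow when low <= high. */
  int mid = low + (high - low) / 2;
  printf("mid = %d\n", mid);
  return 0;
}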

Anyhow, the thing that really got me motivated was the very limited understanding of undefined behavior that seemed to be par for the course in those days. Additionally, most of the existing research tools for detecting or mitigating integer overflows were operating on compiled code, while I believed this problem needed to be attacked near the source level.

By summer 2010 my student Peng Li (now at Baidu USA) had a modified version of Clang that emitted dynamic checks for integer overflow, divide by zero, value-losing typecasts, shifts past bitwidth, and that kind of thing into a compiled C or C++ program. We used this to test open source software and it turned out that basically all programs were executing a constant stream of undefined behavior. I fired off a few dozen bug reports. Since UB wasn’t widely understood at that time, many developers still had the attitude “that is OK since we did it intentionally” or “I am allowed to expect signed overflow in C/C++ to have two’s complement behavior because that is what the hardware does.” See for example the discussions that happened at PostgreSQL and PHP.
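
To give the flavor of the kind of code such a checker flags, here is a tiny program of my own construction (again assuming a 32-bit int) that compiles without complaint and then executes two distinct undefined behaviors at runtime:

#include <limits.h>
#include <stdio.h>

int main(void) {
  int x = INT_MAX;
  int s = 32;
  /* Signed integer overflow: undefined behavior in C and C++. */
  printf("%d\n", x + 1);
  /* Shifting a 32-bit int by 32 bits: also undefined behavior. */
  printf("%d\n", 1 << s);
  return 0;
}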

In early 2011 I visited Grigore Rosu’s group at UIUC to learn about their awesome new KCC tool. We needed something like this to filter out undefined programs that C-Reduce was creating while making minimal versions of bug-triggering programs from Csmith. During this visit I happened to be able to grab a few minutes with Vikram Adve and learned that he and his student Will Dietz were also working on LLVM-based integer overflow detection, and they also had a working prototype. Yikes! This wasn’t even close to the worst case scenario — which would have been learning about their work once they had a paper accepted somewhere — but it’s never fun to be scooped. Competition in research may occasionally be useful as a forcing function, but I am personally uninterested in competing. If smart people are working on a problem, I would like to leave them to it; they’ll most likely come up with a good solution. There are way too many other fun and important problems to work on for competing on any single problem to be attractive. Anyhow, luckily, Vikram and Will were happy to collaborate, so we started working together. I’m still happy with the resulting paper and I’m confident that it is better than what either of the groups would have produced working on its own.

One of our goals all along was to get integer overflow checks into Clang. This took a while longer and Will did most of the legwork. The job was made easier by the fact that by this time there was plenty of momentum towards dynamic undefined behavior detection in LLVM. For example, ASan was already part of the tree. There was an existing -fcatch-undefined-behavior flag that we fit into, but this pretty rapidly (in time for LLVM 3.3, I believe) got phased out in favor of the -fsanitize=undefined usage that Clang still uses.
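
For completeness, here is roughly what that looks like today. Compiling the little UB demo from above (call it ub.c; the file name is mine) with the modern flag produces runtime reports along these lines, though exact locations and wording vary by Clang version:

$ clang -fsanitize=undefined ub.c ; ./a.out
ub.c:8:18: runtime error: signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'
ub.c:10:18: runtime error: shift exponent 32 is too large for 32-bit type 'int'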

Overall, dynamic detection of integer-related undefined behaviors in C/C++ is not difficult, but convincing people to take these bugs seriously was a long struggle, and the broader issue of how integer overflows relate to program bugs is interesting and deep. Fundamentally, people usually do not want to reason about, and are not good at reasoning about, fixed-width integers, but on the other hand we don’t want to just put bignums everywhere in our low-level programming languages. The other thing I take away from this effort is how a lucky break and a willingness to collaborate were really important in making the work successful, and in fact my group continues to collaborate with Vikram’s.