Floating the Dirty Devil River

Packrafts are tough, light, individual-sized inflatable boats that people use to put together really amazing wilderness trips by combining rafting, hiking, and sometimes even biking. I’ve had sort of a low-grade obsession with packrafting since 2009 when I ran into some people in Alaska who were on their way to hike up and over the Brooks Range and paddle down to the Arctic Ocean. (I’m 99.9% certain it was these folks.) But anyway, I never managed to actually do any packrafting until spring break this year when — along with my friends Brian and Tim — I floated a section of the Dirty Devil River in southern Utah.

The Dirty Devil is a little tricky: it’s a seasonal desert river that, on average, only has enough flow for floating during February and March — and February can be really cold. Luckily the university’s spring break is in March, usually a superb time to be in the desert: the days are getting longer and warmer, the nights are still cold, and there’s not a single bug to be found. Usually you get snowed on for a couple hours, but the snow doesn’t stick around.

None of us had so much as touched a packraft, but all of us had canoed or kayaked before, so we figured a class 1-2 river should be doable. The Dirty Devil flows through a canyon 1500 feet (450 meters) deep and, once we started, there was no realistic possibility of escape except by finishing the trip 36 miles later.

We started out on a spectacular morning:

To set up a shuttle we drove down the rough but super fun Poison Springs Road; here Brian’s cramming a few last things into my vehicle with the DD running narrow and fast in the background:

There was quite a bit more water in the channel than we’d expected; it turned out that Happy Canyon, about eight miles upstream, was in the middle of a minor flash flood.

Next we returned to the paved road, drove a few miles, then down to the top of the western end of the Angel Trail (one of a number of trails with that name). Nobody else was there, and we didn’t see anyone until the last day of our trip. At this point the weather was deteriorating:

And soon we here hiking in the snow:

The snow didn’t last long, but it remained windy and not far above freezing; we were happy to have wetsuits and puffy jackets. We put in across the river from Robbers Roost Canyon, where Butch Cassidy and his gang used to hide out:

We paddled to Angel Cove, which has good camping and a nice spring. With the braided river and the wind it took forever to go just a couple of miles. We were pretty happy to get there, get into dry clothes, make some dinner. Also, the rest of our trip lacked reliable water sources (water from the DD is nasty) so we purified enough water to last the rest of the trip.

The next morning it was still cold, but nice and sunny:

We saw both sides of a ridge called the Saw Tooth, which the river meanders around:

We also passed No Man’s Canyon and Larry Canyon, both of which would have been great to spend a few days hiking, but we hadn’t been able to budget time for that.

The opaque water was hard to read and we did an awful lot of pulling our boats around the river trying to find the channel, which moved from bank to bank every few hundred yards:

This was exhausting and there was plenty of quicksand too. Time flew by as we moved down the river at less than a walking pace and it was around 3pm before we noticed that we should stop for lunch.

The scenery continued to develop:

We pushed late into the day:

And finally stopped at a nice camp site on the river bank where we watched the last of the sun on the canyon walls:

There was copious driftwood so we had a fire:

Morning was good:

Since we hadn’t made good time the previous day I got us up early, but it was just too cold to get into the water and anyway our wetsuits were frozen stiff. We had another cup of coffee and waited for the sun:

It ended up being warmer than the previous day — just about perfect weather — and for whatever reason we made much better time than we had earlier. At this innocuous-looking rest stop I hit some ugly quicksand and told the others to land somewhere else:

Brian landed a little ways to one side and just walked up the bank but Tim got mired much worse than I had, it probably took him 15 minutes to get out. We couldn’t even help.

Over the course of the day we passed Twin Corral Box Canyon and Sam’s Mesa Box Canyon, both huge and seldom-visited places I’d have been happy to spend a couple days exploring, but again we wanted to keep on schedule. Travel was generally easy and also the river gained some fun riffles as it started to drop more rapidly through the sandstone layers. It wasn’t possible to avoid hitting rocks since we couldn’t see through the water, but packrafts seem to be very tough. They’re also quite stable, I only worried about flipping once or twice, and never did.

By late afternoon we reached Happy Canyon — named, apparently, by old-time cowboys who were happy to find a way out of it — which was still running from its flash three days earlier:

No shortage of quicksand here either, ugh!

We didn’t take much time to explore, but rather found a place to camp nearby. By this point we were only eight miles from our takeout and I finally stopped worrying about getting trapped in the canyon if one of the boats popped or got lost. For whatever reason this night was extra cold, my water bottle was more than half frozen in the morning, meaning it had probably gotten below 25 degrees.

Nice light on the walls:

Happy Canyon has some of the finest walk-through narrows anywhere on the Colorado Plateau:

After this excursion it was time to pack up camp and float some more. We immediately hit a series of rock gardens:

Progress was overall slower than the previous day, but we moved along and the scenery kept being good.

And suddenly we reached the takeout, where we had adult beverages and loaded a lot of extremely sandy gear into Brian’s truck:

And then back to pick up my truck, which we reached just in time for sunset:

Here in the middle-right you can see the sawtooth ridge that we had floated past a few days earlier:

And finally, as tradition dictates, we stopped for burgers at Ray’s in Green River. This was a super good trip, but fairly difficult physically and perhaps a bit much to bite off for a group of 40-something first-time packrafters.

Verifying Popcount

Popcount is the function that returns the number of set bits in its argument. Showing that a popcount implementation does what it claims to do has become one of my favorite examples to use when I need to quickly show students how we can reason about programs mathematically. Something like a selection sort is probably the more standard example here, but sorting gets into some subtleties that I’d rather reserve for a second example. Popcount is so easy that this whole example can be done in a half hour.

The programming language and the particular algorithm don’t matter too much, here we’ll use what is perhaps the simplest popcount implementation in C or C++:

int popcount(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1;
    x >>= 1;
  return c;

This works, of course, because it scans the entire argument, incrementing the counter each time it finds a set bit. To show that it works in a more formal fashion, we can use methods more or less out of A Discipline of Programming, which allow us to combine obvious effects of individual statements into (hopefully) the desired high-level property, which we might as well annotate the program with right now:

int popcount(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1;
    x >>= 1;
  // c = POPCOUNT(orig_x)
  return c;

Here we’ve used a couple of tricks that need to be fully explained when giving this example to students who aren’t used to this kind of reasoning. First, since x is modified as the program executes, we’ve used orig_x as a shorthand for the original value of x when it was passed as an argument. (This syntax is a trick I learned from ACSL.) Second, here we’ve used the POPCOUNT function, which is the mathematical specification of popcount, which is a completely different thing than the C code we are trying to verify. Consistently applying the correct rules to mathematical and programmatic elements isn’t as straightforward as we would hope and in fact quite a few results from the bad old days of non-mechanized program verification are wrong in subtle ways.

To make our proof go, we’ll need to find the right loop invariant: a statement that is true before the loop executes and also true after every loop iteration. The trick when explaining this task to students is helping them realize that (for a simple example like this) if they understand how the algorithm works, they effectively already know the loop invariant. They just need some help in teasing it out into a usable form. One of the stumbling blocks is that every loop has many invariants, but most of them (for example c ≥ 0) aren’t helpful. We can avoid a lot of silly loop invariants by observing that the right invariant will be strong enough to imply the correctness criterion for the program while also being weak enough to already be true before the first loop iteration, but I’ve found that negative constraints like these are not very helpful for students. The trick, of course, is to capture the abstract fact that is being held in balance as the loop modifies pieces of program state, and to make a lecture like this work, we need to avoid the appearance of pulling this fact out of thin air. It defeats the entire purpose of this discussion if students walk away feeling like there was a magical step in the middle.

So anyway, by hook or by crook we need to get the audience to realize that the key to this algorithm is the loop invariant c + POPCOUNT(x) == POPCOUNT(orig_x). At this point we’re over the hump and we should mark up the code some more:

int popcount(uint64_t x) {
  int c = 0;
  // c + POPCOUNT(x) == POPCOUNT(orig_x) ???
  for (int i = 0; i < 64; i++) {
    // c + POPCOUNT(x) == POPCOUNT(orig_x)
    c += x & 1;
    x >>= 1;
    // c + POPCOUNT(x) == POPCOUNT(orig_x) ???
  // c + POPCOUNT(x) == POPCOUNT(orig_x) => c = POPCOUNT(orig_x) ???
  return c;

Here I’ve used question marks to indicate a proposition that is still in doubt. Now we’ve broken up the problem into manageable pieces and can visit them in turn:

  • The loop invariant is true before the loop executes because c = 0 and x = orig_x.
  • To see that the loop invariant is preserved by the loop body, we analyze two cases: x & 1 is either zero or one. If it is zero, then c is unchanged and also, clearly, shifting x right by one place does not change POPCOUNT(x). If it is 1 then c is incremented by one and shifting x right by one place has decreased its popcount by one. So we can see that either way, the invariant is maintained.
  • To prove that the loop invariant implies the overall correctness condition, we must observe that x = 0 after being shifted right 64 times. This is the sort of situation where we better be clear about the rules for shifting signed vs. unsigned values in C or C++.

Of course we’ve skipped over a lot of smaller steps, but this argument captures the essence of why the algorithm works. When presenting this particular algorithm, we can safely skip the termination proof since i is not modified in the loop body.

An entertaining variant is Kernighan’s popcount:

int popcount(uint64_t x) {
  int c = 0;
  while (x) {
    x &= x - 1;
  return c;

I didn’t use it in this post because I feel like it’s useful to show the analysis by cases in the simpler algorithm. Wikipedia has some faster and much harsher algorithms, which are the ones used in practice on processors that do not have a native popcount instruction.

Anyway, I presented this material in a software engineering class yesterday, and realized that I’d used this example a couple of times before, so it seemed worth writing up here.

Explaining Code using ASCII Art

People tend to be visual: we use pictures to understand problems. Mainstream programming languages, on the other hand, operate in an almost completely different kind of abstract space, leaving a big gap between programs and pictures. This piece is about pictures drawn using a text character set and then embedded in source code. I love these! The other day I asked around on Twitter for more examples and the responses far exceeded expectations (thanks everyone!). There are a ton of great examples in the thread; here I’ve categorized a few of them. Click on images go to the repositories.

Data Structures

One of the most common kinds of ASCII art in code is illustrating the shape of a data structure.

The example I started with comes from LLVM:

The layout of a data structure in the Jikes RVM:

A tree rotate in Musl:

Double-ended queue from the Rust library:

Swift compiler internals:

Malloc header layout:

State Machines

JavaScript profiling:

RPCs in Cloud Spanner:

I/O stream states:

Logical Structure in the Problem Domain

Control flow in an NWScript program being decompiled:

ECC internals:

Formatting numbers:

A quantum circuit:

Balancing memory management objectives in an OS kernel:

Subtyping relations (this is a very cool special case where the ASCII art is also code):

The format of a DBF file:

A lookup table for image processing:

Shape of a color function:

Structure of a URI:

A very quick tutorial on undo systems from emacs:


Attitude control in the Apollo Guidance Computer (!!!):

Image tiling:

Boomerang trajectories in Nethack:

Rendering CSS borders:


Speed control in a milling machine:

Scrolling web pages:

I hope you enjoyed these as much as I did!

Walking or Biking to NSF

Since the National Science Foundation funds a large fraction of academic computer science research in the USA, we often end up traveling to Washington to visit the NSF. This post is just to say that if you are traveling light, if you need some exercise, if you have a bit of free time, and if the weather is good, it is feasible and pleasant to walk or bike from Reagan Washington National Airport (DCA) to the new NSF location in Alexandria.

The short version is:

  • Don’t follow Google Maps in pedestrian mode, which finds a not-great route
  • Do follow the Mount Vernon Trail, which connects DCA and Old Town Alexandria in a pretty direct fashion

I’ve walked this route in both directions; it’s roughly five miles and takes me 80 or 90 minutes (I’m a pretty fast walker so ymmv). I have not biked it, but Capital Bikeshare has several stations near DCA and quite a few stations scattered around Alexandria — so this seems like a great option, assuming that your luggage is in a backpack or something else hands-free.

Here’s an overview of the whole route:

To help orient yourself, the beltway runs across the bottom of the image, the Potomac River runs north to south, you can see DCA’s runways towards the top, and Old Town Alexandria is in the triangular area in between the Potomac, the beltway, and the red walking route.

The only potentially tricky bit is locating the Mt Vernon Trail. Starting anywhere in the airport, walk to Terminal A and go outside to the semicircle-shaped pickup/dropoff area, and then find a sidewalk going roughly southwest. The road you want to walk along is South Smith Boulevard. Don’t cross under the Metro tracks and also don’t go down a level to Thomas Avenue, which gives access to some of the buildings right next to the airport terminal. This may be helpful:

In less than half a mile, as you reach the end of the airport-related buildings, the Mount Vernon Trail, paved with a dotted yellow center line, is easy to see and you should move over to it here (open kmz files in Google Earth). This is all pretty obvious on the ground, and there are some helpful signs. Whatever you do, don’t cross the George Washington Parkway — the busy divided highway that runs nearby.

Follow the Mount Vernon Trail past Daingerfield Island to Alexandria, at which point the trail forks and the Mount Vernon Trail heads over towards the Alexandria waterfront. This is a nice walk if you have time, but it’s out of the way and you can get to NSF more quickly by moving to East Abingdon Drive here and then winding your way through residential Alexandria towards King Street and then NSF.

As long as you have some sort of map this whole route is very easy to follow, as is the reverse route back to the airport.

Synthesizing Constants

(See this blog post for a short introduction to synthesis, or this paper for a long one.)

In this piece I want to discuss an aspect of program synthesis that sounds like it should be easy, but isn’t: synthesizing constant values. For example, consider trying to synthesize an optimized x86-64 implementation of this code:

long foo(long x) {
    return x / 52223;

The desired output is something like:

        movabs  rdx, -6872094784941870951
        mov     rax, rdi
        imul    rdx
        lea     rax, [rdx+rdi]
        sar     rdi, 63
        sar     rax, 15
        sub     rax, rdi

Assume for a moment that we know that this sequence of instructions will work, and we only lack the specific constants:

        movabs  rdx, C1
        mov     rax, rdi
        imul    rdx
        lea     rax, [rdx+rdi]
        sar     rdi, C2
        sar     rax, C3
        sub     rax, rdi

We need to find a 64-bit constant and two 6-bit constants such that the assembly code returns the same result as the C code for every value of the function parameter. Of course there’s a well-known algorithm that GCC and LLVM use to compute these constants, but the synthesizer doesn’t have it: it must operate from first principles.

An even more difficult example is a 64-bit Hamming weight computation, where (assuming we don’t want to use vector instructions) we might hope to synthesize something like this:

popcount64(unsigned long):
        movabs  rdx, 6148914691236517205
        mov     rax, rdi
        shr     rax
        and     rax, rdx
        movabs  rdx, 3689348814741910323
        sub     rdi, rax
        mov     rax, rdi
        shr     rdi, 2
        and     rax, rdx
        and     rdi, rdx
        add     rdi, rax
        mov     rax, rdi
        shr     rax, 4
        add     rax, rdi
        movabs  rdi, 1085102592571150095
        and     rax, rdi
        movabs  rdi, 72340172838076673
        imul    rax, rdi
        shr     rax, 56

Again assuming that we know that this sequence of instructions will work, we still need to come up with 280 bits worth of constant values. One can imagine even more challenging problems where we want to synthesize an entire lookup table.

The running example I’ll use in this post is much easier:

uint32_t bar1(uint32_t x) {
  return ((x << 8) >> 16) << 8;

The code we want to synthesize is:

uint32_t bar2(uint32_t x) {
  return x & 0xffff00;

That is, we want to find C in this skeleton:

uint32_t bar3(uint32_t x) {
  return x & C;

Basic Constant Synthesis

We need to solve an “exists-forall” problem: does there exist a C such that the LHS (left-hand side, the original code) is equivalent to the RHS (right-hand side, the optimized code) for all values of x? In other words:

(Here we’ll play fast and loose with the fact that in the real world we check refinement instead of equivalence — this doesn’t matter for purposes of this post.)

The specific formula that we want an answer for is:

A SAT solver can natively solve either an exists query (this is the definition of SAT) or else a forall query (by seeing if there exists a solution to the negated proposition). But, by itself, a SAT solver cannot solve an exists-forall query in one go. An SMT solver, on the other hand, can natively attack an exists-forall query, but in practice we tend to get better results by doing our own quantifier elimination based only on SAT calls. First, we ask the solver if there exists a C and x that make the LHS and RHS equivalent. If not, synthesis fails. If so, we issue a second query to see if C works for all values of x. If so, synthesis succeeds. If not, we add a constraint that this value of C doesn’t work, and we start the process again. The problem is that each pair of queries only rules out a single choice for C, making this process equivalent, in the worst case, to exhaustive guessing, which we cannot do when C is wide. We need to do better.

Reducing the Number of Counterexamples using Specialization

Souper uses a technique that appears to be common knowledge among people who do this kind of work; unfortunately I don’t know where it was first published (Section 4.2 of this paper is one place it has appeared). The trick is simple: we choose a few values of x and use it to specialize both the LHS and RHS of the equation; you can think of this as borrowing some constraints from the forall phase and moving them into the exists phase (which becomes an “exists-forsome” query, if you will). In many cases this adds enough extra constraints that the solver can arrive at a workable constant within a few iterations. If we specialize x with 0 and -1 then in the general case we get:

After constant folding, our running example becomes:

The first choice, 0, turns out to be an unlucky one: after further simplification it comes out to “0 = 0”, giving the solver no extra information. The second choice, on the other hand, is extremely lucky: it can be rewritten as “0x00FFFF00 = C” which simply hands us the answer. In most cases things won’t work out so nicely, but specializing the LHS and RHS with fixed inputs helps enormously in practice.

Some open questions remain: How many specific values should we try? How should these values be chosen? An obvious constraint is that the values chosen should be as different from each other as possible, though it isn’t clear what distance metric should be used. A related but less obvious criterion (this is Nuno Lopes’s idea, and we haven’t tried it out yet in Souper) is that the values should exercise as many different behaviors of each instruction as possible. For example, an addition instruction should have example inputs that overflow and also inputs that don’t. This requirement can be satisfied syntactically for instructions that are close to the inputs, but solver calls (or other symbolic methods) will be required to reach instructions that consume the outputs of other instructions. (In Souper’s case, solver calls are required anyway if there are any path conditions, which place arbitrary restrictions on the input space.) It also seems possible that we could tailor the input values to maximize how much of the RHS folds away (the LHS always folds completely down to a constant, of course).

Synthesis at Reduced Bitwidth

Instead of solving a difficult synthesis problem, we can instead solve a problem that is structurally identical, but uses narrower bitvectors (even two or three bits are often enough to capture the essence of a computation). This results in far better solver performance and, typically, a narrow synthesis result that does not contain constants will also work at the original, wider bitwidth (though obviously this needs to be verified by the solver). When the narrow result has constants, there’s a problem: we lack a principled method for deriving the wider constants from the narrow ones. One approach is to use heuristics. This paper suggests the following methods:

Here you should read BV(x,y) as “a bitvector of length y holding value x.” So, for example, rule 1 would extend an 8-bit variable containing 8 to a 16-bit variable containing 16. Rule 4 would extend an 8-bit variable containing 8 to a 16-bit variable containing 8. These seem reasonable, but notice that none of them would help with our running example.

In Souper we have the additional problem that the codes we’re trying to optimize usually contain constants, presenting us with a constant-narrowing problem that is analogous to the constant-widening problem that we just discussed, but worse because now any dodgy heuristics that we come up with are located in front of the expensive synthesis process instead of in back of it. It isn’t clear to us that there’s any satisfactory solution to this problem.

Getting More out of Counterexamples

In standard constant synthesis, each failed guess only adds the constraint that that particular constant doesn’t work — this fails to make much of a dent in an exponential search space. This paper and this one propose extracting additional information from each counterexample, cutting out a larger part of the search space. This also seems like an excellent direction.

Symbolic Constants

An alternate approach that avoids both narrowing and widening problems would be to treat constants symbolically. This, however, creates two new problems: deriving preconditions for optimizations and deriving functions to compute RHS constants in terms of elements of the LHS (constants, widths, etc.). It seems clear that in some cases this will be very difficult, for example imagine trying to derive, from first principles, the algorithm for computing the output constants for an arbitrary value of the LHS constant C:

long foo(long x) {
    return x / C;

Alive-Infer helps derive preconditions but it does not yet try to come up with the functions for computing constants in the synthesized output.

Since in some use cases (including my “destroy InstSimplify and InstCombine” dream) we want the generalized optimization anyway, this seems like a great direction for future work.


All of the techniques described in this post could be implemented inside the solver or outside it. Solvers like CVC4 already have sophisticated internal support for synthesis. The advantage of pushing more work into the solver is that it can exploit internal levers to guide the search, reuse data structures across synthesis iterations, and do other clever things that aren’t exposed by the solver’s API. The disadvantage is that we might have high-level information about the problem domain that is difficult to communicate effectively to the solver, that would help it do its job better.

In summary, constant synthesis is a hard problem that hasn’t received a lot of attention yet. My group is actively working on this and we have some ideas that aren’t mature enough to share yet. Please drop me a line if you know of any good techniques that I’ve missed here. Also, I’d be interested to hear from anyone working on benchmarks for constant synthesis.

Finally, tools such as Rosette and Sketch know how to synthesize constants.

Learning When Values are Changed by Implicit Integer Casts

C and C++ perform implicit casts when, for example, you pass an integer-typed variable to a function that expects a different type. When the target type is wider, there’s no problem, but when the target type is narrower or when it is the same size and the other signedness, integer values may silently change when the type changes. For example, this program:

#include <iostream>

int main() {
  int x = -3;
  unsigned y = x;
  std::cout << y << "\n";

prints 4294967293. Like unsigned integer wraparound (and unlike signed integer overflow) these changes of value are not undefined behavior, but they may be unintentional and may also be bugs. As of recently, Clang contains support for dynamically detecting value changes and either providing a diagnostic or else terminating the program. Some fine-grained flags are available for controlling these diagnostics, but you can enable all of them (plus some others, such as signed overflow and unsigned wraparound checks) using -fsanitize=integer.

Now we get:

$ clang++ implicit2.cpp -fsanitize=integer
$ ./a.out 
implicit2.cpp:5:16: runtime error: implicit conversion from type 'int' of value -3 (32-bit, signed) to type 'unsigned int' changed the value to 4294967293 (32-bit, unsigned)

Here’s a truncation example:

int main() {
  int x = 300;
  unsigned char c = x;

It gives:

$ clang++ implicit3.cpp -fsanitize=integer
$ ./a.out 
implicit3.cpp:3:21: runtime error: implicit conversion from type 'int' of value 300 (32-bit, signed) to type 'unsigned char' changed the value to 44 (8-bit, unsigned)

To suppress the diagnostic, we can make the conversion explicit using a cast:

int main() {
  int x = 300;
  unsigned char c = (unsigned char)x;

Different parts of this functionality landed in Clang before and after the 7.0.0 release — to get everything, you will want to build a Clang+LLVM from later than 1 Nov 2018 (or else wait for the 8.0.0 release in a few months).

How would you use these checks? Generally, they should be part of a testing campaign, perhaps in support of a code audit. If you enable them on a non-trivial code base, you will run into diagnostics that do not correspond to bugs, just because real C and C++ programs mix integer types so freely. I suggest starting with -fsanitize=implicit-integer-truncation. Let’s look at doing this to Clang+LLVM itself and then compiling a hello world program: here’s the output.

I’d be happy to hear about any interesting bugs located using these checks, if anyone wants to share.

Finally, see this tweet by Roman Lebedev (the author of these checks) showing that the runtime impact of -fsanitize-trap=implicit-conversion is very low.

What’s the difference between an integer and a pointer?

(This piece is an alternate introduction and advertisement for a soon-to-be-published research paper.)

In an assembly language we typically don’t have to worry very much about the distinction between pointers and integers. Some instructions happen to generate addresses whereas others behave arithmetically, but underneath there’s a single data type: bitvectors. At the opposite end of the PL spectrum, a high-level language won’t offer opportunities for pointer/integer confusion because the abstractions are completely firewalled off from each other. Also, of course, a high-level language may choose not to expose anything that resembles a pointer.

The problematic case, then, is the performance-oriented systems programming language: C, C++, Rust, and a few others. The issue is that:

  1. When possible, they pretend to be high-level, in order to justify optimizations that would be unsound at the assembly level. For example, they assume that a pointer derived from one heap allocation cannot alias a pointer derived from a different heap allocation.
  2. When necessary, they allow the programmer to manipulate pointers in low-level ways, such as inspecting the representation of a pointer, creating a new pointer at some offset from an existing one, storing unrelated data in unused parts of a pointer, and even fabricating a pointer out of whole cloth.

These goals conflict: the better a language is for justifying high-level optimizations, the worse it seems to be for implementing an OS kernel, and vice versa. There are a few ways we might deal with this:

  • We could require the high-level language to call out to low-level code written in a different language, as Java does with its JNI.
  • We could create a special low-level mode for a safe language where it can break the normal rules, as Rust does with its unsafe code.
  • We could go ahead and freely mix high-level and low-level code, hoping that the compiler can sort it all out, as C and C++ do.

Let’s look at an example:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi + diff);
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);

When run (on my Mac) it gives:

$ clang-6.0 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7fcb0ec00300, *x = 7
y = 0x7fcb0ec00360, *y = 5
p = 0x7fcb0ec00300, *p = 7
$ gcc-8 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7f93b6400300, *x = 7
y = 0x7f93b6400360, *y = 5
p = 0x7f93b6400300, *p = 7

What happened here? First, we grabbed a couple of heap cells and initialized their contents. Second, we used integer math to fabricate a pointer p that has the same value as x, and stored through that pointer. Third, we asked the compiler what contents it thinks are in the locations pointed to by x, y, and p. Both GCC and Clang/LLVM are of the opinion that the store through p changed the value referred to by x. This is probably the result that we had hoped for, but let’s look into it a bit more deeply.

First, does this C program execute undefined behavior? It does not! Computing out-of-bounds pointers is UB but we are free to do whatever we like with integer-typed values that were derived from pointers. Second, does the C standard mandate the behavior that we saw here? It does not! In fact it gives only very sparse guidance about the behavior of integers derived from pointers and vice versa. You can read about that here.

Now let’s modify the program a tiny bit:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi - 96); // <--- CHANGED!
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);

The only difference is at line 13: instead of adding diff to yi to compute p, we added the observed value of the difference: -96. Now we’ll run it:

$ clang-6.0 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7ff21ac00300, *x = 7
y = 0x7ff21ac00360, *y = 5
p = 0x7ff21ac00300, *p = 7
$ gcc-8 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7fce5e400300, *x = 3
y = 0x7fce5e400360, *y = 5
p = 0x7fce5e400300, *p = 7

Clang behaves the same as before, but GCC no longer believes that storing through p results in x being overwritten, despite the fact that x and p have the same value. What is going on is that as far as GCC is concerned, since we computed the address of x through guesswork instead of through an actual observation, our guess is effectively illegitimate. (If this guessing game is too simple for you, there are entertaining variations that avoid the need for information from a previous run of the program, such as guessing addresses using an exhaustive loop or making huge allocations so that the resulting addresses are constrained by the pigeonhole principle.) Here again we are operating well outside of the C standard and are rather in a zone where compiler writers need to make decisions that balance optimization power against developers’ expectations. I like this example because it is weird and perhaps surprising.

What should we take away from this? First, it is a big mistake to try to understand pointers in a programming language as if they follow the same rules as pointer-sized integer values. Even people who have come to terms with UB and its consequences are often surprised by this. Second, these issues are not specific to C and C++, but rather are going to create problems in any language that wants highly optimizing compilers but also permits low-level pointer manipulation.

So what are the actual rules that Clang and GCC are following when compiling these examples? The short answer is that it’s complicated and not well documented. Some useful discussion can be found in this piece about pointer provenance. For a longer answer that is specific to LLVM, see this paper that will get presented at OOPSLA in November. Also, one of my coauthors wrote a blog post about this material.

Future Directions for Optimizing Compilers

I wanted to write a manifesto-ish sort of piece about what compilers are supposed to look like in the future. Nuno was the obvious coauthor since I’ve talked to him about this topic so much that I’m overall not really sure which parts started out as his ideas and which were mine. The article didn’t end up feeling like a good fit for a series of blog posts, so we’ve placed it on arXiv: Future Directions for Optimizing Compilers. The first paragraph of the paper gives the main idea:

As software becomes larger, programming languages become higher-level, and processors continue to fail to be clocked faster, we’ll increasingly require compilers to reduce code bloat, eliminate abstraction penalties, and exploit interesting instruction sets. At the same time, compiler execution time must not increase too much and also compilers should never produce the wrong output. This paper examines the problem of making optimizing compilers faster, less buggy, and more capable of generating high-quality output.

We plan to keep revising the paper and would be happy to receive feedback on it.

How LLVM Optimizes a Function

An optimizing, ahead-of-time compiler is usually structured as:

  1. A frontend that converts source code into an intermediate representation (IR).
  2. A target-independent optimization pipeline: a sequence of passes that successively rewrite the IR to eliminate inefficiencies and forms that cannot be readily translated into machine code. Sometimes called the “middle end.”
  3. A target-dependent backend that generates assembly code or machine code.

In some compilers the IR format remains fixed throughout the optimization pipeline, in others the format or semantics change. In LLVM the format and semantics are fixed, and consequently it should be possible to run any sequences of passes you want without introducing miscompilations or crashing the compiler.

The sequence of passes in the optimization pipeline is engineered by compiler developers; its goal is to do a pretty good job in a reasonable amount of time. It gets tweaked from time to time, and of course there’s a different set of passes that gets run at each optimization level. A longish-standing topic in compiler research is to use machine learning or something to come up with a better optimization pipeline either in general or else for a specific application domain that is not well-served by the default pipelines.

Some principles for pass design are minimality and orthogonality: each pass should do one thing well, and there should not be much overlap in functionality. In practice, compromises are sometimes made. For example when two passes tend to repeatedly generate work for each other, they may be integrated into a single, larger pass. Also, some IR-level functionality such as constant folding is so ubiquitously useful that it might not make sense as a pass; LLVM, for example, implicitly folds away constant operations as instructions are created.

In this post we’ll look at how some of LLVM’s optimization passes work. I’ll assume you’ve read this piece about how Clang compiles a function or alternatively that you more or less understand how LLVM IR works. It’s particularly useful to understand the SSA (static single assignment) form: Wikipedia will get you started and this book contains more information than you’re likely to want to know. Also see the LLVM Language Reference and the list of optimization passes.

We’re looking at how Clang/LLVM 6.0.1 optimizes this C++:

bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;

Keep in mind that the optimization pipeline is a busy place and we’re going to miss out on a lot of fun stuff such as:

  • inlining, an easy but super-important optimization, which can’t happen here since we’re looking at just one function
  • basically everything that is specific to C++ vs C
  • autovectorization, which is defeated by the early loop exit

In the text below I’m going to skip over every pass that doesn’t make any changes to the code. Also, we’re not even looking at the backend and there’s a lot going on there as well. Even so, this is going to be a bit of a slog! (Sorry to use images below, but it seemed like the best way to avoid formatting difficulties, I hate fighting with WP themes. Click on the image to get a larger version. I used icdiff.)

Here’s the IR file emitted by Clang (I manually removed the “optnone” attribute that Clang put in) and this is the command line that we can use to see the effect of each optimization pass:

opt -O2 -print-before-all -print-after-all is_sorted2.ll

The first pass is “simplify the CFG” (control flow graph). Because Clang does not optimize, IR that it emits often contains easy opportunities for cleanup:

Here, basic block 26 simply jumps to block 27. This kind of block can be eliminated, with jumps to it being forwarded to the destination block. The diff is a bit more confusing that it would have to be due to the implicit block renumbering performed by LLVM. The full set of transformations performed by SimplifyCFG is listed in a comment at the top of the pass:

This file implements dead code elimination and basic block merging, along with a collection of other peephole control flow optimizations. For example:

  • Removes basic blocks with no predecessors.
  • Merges a basic block into its predecessor if there is only one and the predecessor only has one successor.
  • Eliminates PHI nodes for basic blocks with a single predecessor.
  • Eliminates a basic block that only contains an unconditional branch.
  • Changes invoke instructions to nounwind functions to be calls.
  • Change things like “if (x) if (y)” into “if (x&y)”.

Most opportunities for CFG cleanup are a result of other LLVM passes. For example, dead code elimination and loop invariant code motion can easily end up creating empty basic blocks.

The next pass to run, SROA (scalar replacement of aggregates), is one of our heavy hitters. The name ends up being a bit misleading since SROA is only one of its functions. This pass examines each alloca instruction (function-scoped memory allocation) and attempts to promote it into SSA registers. A single alloca will turn into multiple registers when it is statically assigned multiple times and also when the alloca is a class or struct that can be split into its components (this splitting is the “scalar replacement” referred to in the name of the pass). A simple version of SROA would give up on stack variables whose addresses are taken, but LLVM’s version interacts with alias analysis and ends up being fairly smart (though the smarts are not needed in our example).

After SROA, all alloca instructions (and their corresponding loads and stores) are gone, and the code ends up being much cleaner and more amenable to subsequent optimization (of course, SROA cannot eliminate all allocas in general — this only works when the pointer analysis can completely eliminate aliasing ambiguity). As part of this process, SROA had to insert some phi instructions. Phis are at the core of the SSA representation, and the lack of phis in the code emitted by Clang tells us that Clang emits a trivial kind of SSA where communication between basic blocks is through memory, instead of being through SSA registers.

Next is “early common subexpression elimination” (CSE). CSE tries to eliminate the kind of redundant subcomputations that appear both in code written by people and in partially-optimized code. “Early CSE” is a fast, simple kind of CSE that looks for trivially redundant computations.

Here both %10 and %17 do the same thing, so uses of one of the values can be rewritten as uses of the other, and then the redundant instruction eliminated. This gives a glimpse of the advantages of SSA: since each register is assigned only once, there’s no such thing as multiple versions of a register. Thus, redundant computations can be detected using syntactic equivalence, with no reliance on a deeper program analysis (the same is not true for memory locations, which live outside of the SSA universe).

Next, a few passes that have no effect run, and then “global variable optimizer” which self-describes as:

This pass transforms simple global variables that never have their address taken. If obviously true, it marks read/write globals as constant, deletes variables only stored to, etc.

It makes this change:

The thing that has been added is a function attribute: metadata used by one part of the compiler to store a fact that might be useful to a different part of the compiler. You can read about the rationale for this attribute here.

Unlike other optimizations we’ve looked at, the global variable optimizer is interprocedural, it looks at an entire LLVM module. A module is more or less equivalent to a compilation unit in C or C++. In contrast, intraprocedural optimizations look at only one function at a time.

The next pass is the “instruction combiner“: InstCombine. It is a large, diverse collection of “peephole optimizations” that (typically) rewrite a handful of instructions connected by data flow into a more efficient form. InstCombine won’t change the control flow of a function. In this example it doesn’t have a lot to do:

Here instead of subtracting 1 from %1 to compute %4, we’ve decided to add -1. This is a canonicalization rather than an optimization. When there are multiple ways of expressing a computation, LLVM attempts to canonicalize to a (often arbitrarily chosen) form, which is then what LLVM passes and backends can expect to see. The second change made by InstCombine is canonicalizing two sign-extend operations (sext instructions) that compute %7 and %11 to zero-extends (zext). This is a safe transformation when the compiler can prove that the operand of the sext is non-negative. That is the case here because the loop induction variable starts at zero and stops before it reaches n (if n is negative, the loop never executes at all). The final change is adding the “nuw” (no unsigned wrap) flag to the instruction that produces %10. We can see that this is safe by observing that (1) the induction variable is always increasing and (2) that if a variable starts at zero and increases, it would become undefined by passing the sign wraparound boundary next to INT_MAX before it reaches the unsigned wraparound boundary next to UINT_MAX. This flag could be used to justify subsequent optimizations.

Next, SimplifyCFG runs for a second time, removing two empty basic blocks:

Deduce function attributes” annotates the function further:

“Norecurse” means that the function is not involved in any recursive loop and a “readonly” function does not mutate global state. A “nocapture” parameter is not saved anywhere after its function returns and a “readonly” parameter refers to storage not modified by the function. Also see the full set of function attributes and the full set of parameter attributes.

Next, “rotate loops” moves code around in an attempt to improve conditions for subsequent optimizations:

Although this diff looks a bit alarming, there’s not all that much happening. We can see what happened more readily by asking LLVM to draw the control flow graph before and after loop rotation. Here are the before (left) and after (right) views:

The original code still matches the loop structure emitted by Clang:

  goto COND
  if (condition)
    goto BODY
    goto EXIT
  goto COND

Whereas the rotated loop looks like this:

  if (condition)
    goto BODY
    goto EXIT
  if (condition)
    goto BODY
    goto EXIT

(Corrected as suggested by Johannes Doerfert below– thanks!)

The point of loop rotation is to remove one branch and to enable subsequent optimizations. I did not find a better description of this transformation on the web (the paper that appears to have introduced the term doesn’t seem all that relevant).

CFG simplification folds away the two basic blocks that contain only degenerate (single-input) phi nodes:

The instruction combiner rewrites “%4 = 0 s< (%1 - 1)" as "%4 = %1 s> 1″, this kind of thing is useful because it reduces the length of a dependency chain and may also create dead instructions (see the patch making this happen). This pass also eliminates another trivial phi node that was added during loop rotation:

Next, “canonicalize natural loops” runs, it self-describes as:

This pass performs several transformations to transform natural loops into a simpler form, which makes subsequent analyses and transformations simpler and more effective.

Loop pre-header insertion guarantees that there is a single, non-critical entry edge from outside of the loop to the loop header. This simplifies a number of analyses and transformations, such as LICM.

Loop exit-block insertion guarantees that all exit blocks from the loop (blocks which are outside of the loop that have predecessors inside of the loop) only have predecessors from inside of the loop (and are thus dominated by the loop header). This simplifies transformations such as store-sinking that are built into LICM.

This pass also guarantees that loops will have exactly one backedge.

Indirectbr instructions introduce several complications. If the loop contains or is entered by an indirectbr instruction, it may not be possible to transform the loop and make these guarantees. Client code should check that these conditions are true before relying on them.

Note that the simplifycfg pass will clean up blocks which are split out but end up being unnecessary, so usage of this pass should not pessimize generated code.

This pass obviously modifies the CFG, but updates loop information and dominator information.

Here we can see loop exit-block insertion happen:

Next up is “induction variable simplification“:

This transformation analyzes and transforms the induction variables (and computations derived from them) into simpler forms suitable for subsequent analysis and transformation.

If the trip count of a loop is computable, this pass also makes the following changes:

  1. The exit condition for the loop is canonicalized to compare the induction value against the exit value. This turns loops like: ‘for (i = 7; i*i < 1000; ++i)' into 'for (i = 0; i != 25; ++i)'
  2. Any use outside of the loop of an expression derived from the indvar is changed to compute the derived value outside of the loop, eliminating the dependence on the exit value of the induction variable. If the only purpose of the loop is to compute the exit value of some derived expression, this transformation will make the loop dead.

The effect of this pass here is simply to rewrite the 32-bit induction variable as 64 bits:

I don’t know why the zext — previously canonicalized from a sext — is turned back into a sext.

Now “global value numbering” performs a very clever optimization. Showing this one off is basically the entire reason that I decided to write this blog post. See if you can figure it out just by reading the diff:

Got it? Ok, the two loads in the loop on the left correspond to a[i] and a[i + 1]. Here GVN has figured out that loading a[i] is unnecessary because a[i + 1] from one loop iteration can be forwarded to the next iteration as a[i]. This one simple trick halves the number of loads issued by this function. Both LLVM and GCC only got this transformation recently.

You might be asking yourself if this trick still works if we compare a[i] against a[i + 2]. It turns out that LLVM is unwilling or unable to do this, but GCC is willing to throw up to four registers at this problem.

Now “bit-tracking dead code elimination” runs:

This file implements the Bit-Tracking Dead Code Elimination pass. Some instructions (shifts, some ands, ors, etc.) kill some of their input bits. We track these dead bits and remove instructions that compute only these dead bits.

But it turns out the extra cleverness isn’t needed since the only dead code is a GEP, and it is trivially dead (GVN removed the load that previously used the address that it computes):

Now the instruction combiner sinks an add into a different basic block. The rationale behind putting this transformation into InstCombine is not clear to me; perhaps there just wasn’t a more obvious place to do this one:

Now things get a tiny bit weird, “jump threading” undoes what “canonicalize natural loops” did earlier:

Then we canonicalize it back:

And CFG simplification flips it the other way:

And back:

And forth:

And back:

And forth:

Whew, we’re finally done with the middle end! The code at the right is what gets passed to (in this case) the x86-64 backend.

You might be wondering if the oscillating behavior towards the end of the pass pipeline is the result of a compiler bug, but keep in mind that this function is really, really simple and there were a whole bunch of passes mixed in with the flipping and flopping, but I didn’t mention them because they didn’t make any changes to the code. As far as the second half of the middle end optimization pipeline goes, we’re basically looking at a degenerate execution here.

Acknowledgments: Some students in my Advanced Compilers class this fall provided feedback on a draft of this post (I also used this material as an assignment). I ran across the function used here in this excellent set of lecture notes about loop optimizations.

How Clang Compiles a Function

I’ve been planning on writing a post about how LLVM optimizes a function, but it seems necessary to first write about how Clang translates C or C++ into LLVM. This is going to be fairly high-level:

  • rather than looking at Clang’s internals, I’m going to focus on how Clang’s output relates to its input
  • we’re not going to look at any non-trivial C++ features

We’ll use this small function that I borrowed from these excellent lecture notes on loop optimizations:

bool is_sorted(int *a, int n) {
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      return false;
  return true;

Since Clang doesn’t do any optimization, and since LLVM IR was initially designed as a target for C and C++, the translation is going to be relatively easy. I’ll use Clang 6.0.1 (or something near it, since it hasn’t quite been released yet) on x86-64.

The command line is:

clang++ is_sorted.cpp -O0 -S -emit-llvm

In other words: compile the file is_sorted.cpp as C++ and then tell the rest of the LLVM toolchain: don’t optimize, emit assembly, but no actually emit textual LLVM IR instead. Textual IR is bulky and not particularly fast to print or parse; the binary “bitcode” format is always preferred when a human isn’t in the loop. The full IR file is here, we’ll be looking at it in parts.

Starting at the top of the file we have:

; ModuleID = 'is_sorted.cpp'
source_filename = "is_sorted.cpp"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

Any text between a semicolon and the end of a line is a comment, so the first line does nothing, but if you care, an LLVM “module” is basically a compilation unit: a container for code and data. The second line doesn’t concern us. The third line describes some choices and assumptions made by the compiler; they don’t matter much for this post but you can read more here. The target triple is a gcc-ism that doesn’t concern us here either.

An LLVM function optionally has attributes:

; Function Attrs: noinline nounwind optnone uwtable

Some of them (like these) are supplied by the front end, others get added later by optimization passes. This is just a comment, the actual attributes are specified towards the end of the file. These attributes don’t have anything to do with the meaning of the code, so I’m not going to discuss them, but you can read read about them here if you’re curious.

Ok, finally on to the function itself:

define zeroext i1 @_Z9is_sortedPii(i32* %a, i32 %n) #0 {

“zeroext” means that the return value of the function (i1, a one-bit integer) should be zero-extended, by the backend, to whatever width the ABI requires. Next comes the mangled name of the function, then its parameter list, which is basically the same as in the C++ code, except that the “int” type has been refined to 32 bits. The #0 connects this function to an attribute group near the bottom of the file.

Now on to the first basic block:

  %retval = alloca i1, align 1
  %a.addr = alloca i32*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store i32* %a, i32** %a.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

Every LLVM instruction has to live inside a basic block: a collection of instructions that is only entered at the top and only exited at the bottom. The last instruction of a basic block must be a terminator instruction: fallthrough is not allowed. Every function must have an entry block that has no predecessors (places to come from) other than the function entry itself. These properties and others are checked when an IR file is parsed, and can optionally be checked more times during compilation by the “module verifier.” The verifier is useful for debugging situations where a pass emits illegal IR.

The first four instructions in the entry basic block are “alloca”s: stack memory allocations. The first three are for implicit variables created during the compilation, the fourth is for the loop induction variable. Storage allocated like this can only be accessed using load and store instructions. The next three instructions initialize three of the stack slots: a.addr and n.addr are initialized using the values passed into the function as parameters and i is initialized to zero. The return value does not need to be initialized: this will be taken care of later by any code that wasn’t undefined at the C or C++ level. The last instruction is an unconditional branch to the subsequent basic block (we’re not going to worry about it here, but most of these unnecessary jumps can be elided by an LLVM backend).

You might ask: why does Clang allocate stack slots for a and n? Why not just use those values directly? In this function, since a and n are never modified, this strategy would work, but noticing this fact is considered an optimization and thus outside of the job Clang was designed to do. In the general case where a and n might be modified, they must live in memory as opposed to being SSA values, which are — by definition — given a value at only one point in a program. Memory cells live outside of the SSA world and can be modified freely. This may all seem confusing and arbitrary but these design decisions have been found to allow many parts of a compiler to be expressed in natural and efficient ways.

I think of Clang as emitting degenerate SSA code: it meets all the requirements for SSA, but only because basic blocks communicate through memory. Emitting non-degenerate SSA requires some care and some analysis and Clang’s refusal to do these leads to a pleasing separation of concerns. I haven’t seen the measurements but my understanding is that generating a lot of memory operations and then almost immediately optimizing many of them away isn’t a major source of compile-time overhead.

Next let’s look at how to translate a for loop. The general form is:

for (initializer; condition; modifier) {

This is translated roughly as:

  goto COND
  if (condition)
    goto BODY
    goto EXIT
  goto COND

Of course this translation isn’t specific to Clang: any C or C++ compiler would do the same.

In our example, the loop initializer has been folded into the entry basic block. The next basic block is the loop condition test:

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %sub = sub nsw i32 %1, 1
  %cmp = icmp slt i32 %0, %sub
  br i1 %cmp, label %for.body, label %for.end

As a helpful comment, Clang is telling us that this basic block can be reached either from the for.inc or the entry basic block. This block loads i and n from memory, decrements n (the “nsw” flag preserves the C++-level fact that signed overflow is undefined; without this flag an LLVM subtract would have two’s complement semantics), compares the decremented value against i using “signed less than”, and then finally branches either to the for.body basic block or the for.end basic block.

The body of the for loop can only be entered from the for.cond block:

  %2 = load i32*, i32** %a.addr, align 8
  %3 = load i32, i32* %i, align 4
  %idxprom = sext i32 %3 to i64
  %arrayidx = getelementptr inbounds i32, i32* %2, i64 %idxprom
  %4 = load i32, i32* %arrayidx, align 4
  %5 = load i32*, i32** %a.addr, align 8
  %6 = load i32, i32* %i, align 4
  %add = add nsw i32 %6, 1
  %idxprom1 = sext i32 %add to i64
  %arrayidx2 = getelementptr inbounds i32, i32* %5, i64 %idxprom1
  %7 = load i32, i32* %arrayidx2, align 4
  %cmp3 = icmp sgt i32 %4, %7
  br i1 %cmp3, label %if.then, label %if.end

The first two lines load a and i into SSA registers; i is then widened to 64 bits so it can participate in an address computation. getelementptr (gep for short) is LLVM’s famously baroque pointer computation instruction, it even has its own faq. Unlike a machine language, LLVM does not treat pointers and integers the same way. This facilitates alias analysis and other memory optimizations. The code then goes on to load a[i] and a[i + 1], compare them, and branch based on the result.

The if.then block stores 0 into the stack slot for the function return value and branches unconditionally to the function exit block:

  store i1 false, i1* %retval, align 1
  br label %return

The else block is trivial:

  br label %for.inc

And the block for adding one to the loop induction variable is easy as well:

  %8 = load i32, i32* %i, align 4
  %inc = add nsw i32 %8, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

This code branches back up to the loop condition test.

If the loop terminates normally, we want to return true:

  store i1 true, i1* %retval, align 1
  br label %return

And finally whatever got stored into the return value stack slot is loaded and returned:

  %9 = load i1, i1* %retval, align 1
  ret i1 %9

There’s a bit more at the bottom of the function but nothing consequential. Ok, this post became longer than I’d intended, the next one will look at what the IR-level optimizations do to this function.

(Thanks to Xi Wang and Alex Rosenberg for sending corrections.)