Checking Up on Dataflow Analyses

An important tool for reasoning about programs without running them is dataflow analysis, which can be used to compute facts such as:

  • an integer-valued variable is non-negative at a particular program point
  • a conditional branch always goes one way
  • an indirect write cannot modify a particular array

Dataflow facts drive optimizations such as constant propagation and dead code elimination. Outside of compilers, dataflow analyses are important in tools that prove safety properties, such as absence of undefined behaviors in C or C++ programs. Dataflow analysis is backed by a rich academic literature and tool writers have decades of experience implementing it. This set of lecture notes is my favorite introduction to dataflow analysis.

Despite all of the nice theorems, dataflow analyses are hard to get right, in part because people are just not very good at reasoning about properties that hold over sets of program executions. As a simple exercise in this kind of thinking, try this: you have a variable A whose value is known to be in the range [0..1000] and a variable B that is known to be in the range [10000..10050]. The program computes the bitwise xor of A and B and assigns the result into C. Provide tight bounds on the values that C can take. If you find this to be easy, that’s great — but consider that you then have to make this work for all intervals and after that there are about 9,999 similarly fiddly things left to implement and any mistakes you make are likely to result in miscompilations.
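If you want to check an answer, exhaustive enumeration is cheap at these sizes; here's a quick brute-force sketch:

#include <stdio.h>

/* Enumerate both intervals and track the extreme values of A ^ B.
   Fine for checking the puzzle; obviously not how an interval
   transfer function for xor gets implemented. */
int main(void) {
  int min = -1, max = -1;
  for (int a = 0; a <= 1000; a++) {
    for (int b = 10000; b <= 10050; b++) {
      int c = a ^ b;
      if (min == -1 || c < min) min = c;
      if (c > max) max = c;
    }
  }
  printf("C is in [%d..%d]\n", min, max);
  return 0;
}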

So now let’s finally get to the point of this post. The point is that since dataflow is hard, if you implement a dataflow analysis, you should also implement a dynamic checker that looks for cases where the analysis has come to the wrong conclusion. To keep going with the running example, the program being analyzed contains this line of code:

// A in [0..1000] and B in [10000..10050]
C = A ^ B;

If our interval analysis says that C has to be in the range [9225..12255], we would rewrite the code to add an assertion:

// A in [0..1000] and B in [10000..10050]
C = A ^ B;
assert(C >= 9225 && C <= 12255);

Now we can run the instrumented program and, if we’re very lucky, it will execute this code with values such as A = 830 and B = 10041, making C = 9223. Since 9223 lies outside the claimed interval, we get an assertion violation telling us that our analysis is unsound and giving us something to debug. The other way to debug this sort of problem is to backtrack from a miscompilation — an experience enjoyed by basically nobody.

Every dataflow analysis that I’ve implemented or supervised the implementation of has gotten a dynamic checker. This is a great testing technique that has numerous advantages. It finds bugs in practice. It can be automated, either using a program generator like Csmith or else using a regular test suite. The annotated programs serve as a useful way to look at the precision of an analysis: if not very many assertions show up, or if the bounds are very loose, then we probably need to do some tuning for precision. Finally, if we don’t find any bugs, then the dataflow analysis is probably (please excuse me, formal methods people) correct enough for practical purposes.

Of course, not everything that can be computed by a dataflow analysis can be expressed in code. For example, in C and C++ there’s no lightweight way to assert that a pointer refers to valid storage or that a variable is uninitialized. E-ACSL is a neat piece of related work in this general area.

Just recently I’ve been looking at the efficiency of integer overflow checking using LLVM, and since then I’ve written some code that uses integer range dataflow facts to remove overflow checks that can be proved to not fire. Here’s the under-review patch. The ranges are computed by an LLVM pass called LazyValueInfo (LVI), which had better be correct if we’re going to rely on it, so I wrote a little LLVM pass that does the necessary checking. In between the optimization and code generation phases, it processes each function, asking LVI for the range of each integer-typed value and inserting code that checks that the value actually stays in that range.
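Conceptually, for a value v where LVI reports a non-trivial range [lo, hi), the inserted code behaves like this sketch (the helper and its name are mine, not something the pass actually emits as a call):

/* Trap if the dataflow fact is wrong: conceptually what gets
   inserted for a value that LVI claims lies in [lo, hi). */
static inline void lvi_check(long v, long lo, long hi) {
  if (v < lo || v >= hi)
    __builtin_trap(); /* the analysis result was unsound */
}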

(If you build this pass, you’ll need a slightly hacked LLVM, see the README.) Although it might seem tempting to run the optimizers again on the instrumented code in order to clean it up a bit, this would be a very bad idea: the very dataflow analyses that we’re trying to check up on would be used to drive optimizations, which could easily end up incorrectly deleting our checks.

So far, some testing using Csmith hasn’t turned up any bugs in LVI, which is great. Less great is the fact that it drops precision all over the place: a lot of tuning up is needed before it can serve as the basis for a really strong redundant overflow check remover.

The technique I’m describing is not as widely known or as widely used as it should be, considering how easy and effective it is. As for the puzzle at the start of this post, the best answer is that C = [9216..10239].

UPDATE: The latest version of my LLVM dataflow checker also validates the results of computeKnownBits(), which tries to figure out which bits of a value must be either zero or one.

Here’s a bit of further reading about how these techniques (and others) were applied to the Frama-C static analyzer.

Efficient Integer Overflow Checking in LLVM

(Here’s some optional background reading material.)

We want fast integer overflow checking. Why? First, if the undefined behavior sanitizers go faster then testing goes faster. Second, when overhead drops below a certain point people will become willing to use UBSan to harden production code against integer overflows. This is already being done in parts of Android. It isn’t the kind of thing to do lightly: some of LLVM’s sanitizers, such as ASan, will increase an application’s attack surface. Even UBSan in trapping mode — which, as far as I know, does not increase attack surface — could easily enable remote DoS attacks. But this is beside the point.

Let’s start with this function:

int foo(int x, int y) {
  return x + y; 
}

When compiled with trapping integer overflow checks (as opposed to checks that provide diagnostics and optionally continue executing), Clang 3.8 at -O2 gives:

foo:
        addl    %esi, %edi
        jo      .LBB0_1
        movl    %edi, %eax
        retq
.LBB0_1:
        ud2

This code is efficient; here Chandler Carruth explains why. On the other hand, signed integer overflow checking slows down SPEC CINT 2006 by 11.8% overall, with slowdown ranging from negligible (GCC, Perl, OMNeT++) to about 20% (Sjeng, H264Ref) to about 40% (HMMER).
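In source terms, the trapping check behaves roughly like this sketch written with the GCC/Clang overflow builtins (the compiler emits the flags-based jo/ud2 sequence directly rather than calling a helper):

#include <stdlib.h>

/* Roughly the semantics of a trapping checked add; abort() stands
   in for the ud2 trap in the assembly above. */
int checked_add(int x, int y) {
  int sum;
  if (__builtin_add_overflow(x, y, &sum))
    abort();
  return sum;
}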

Why do fast overflow checks add overhead? The increase in object code size due to overflow checking is less than 1% so there’s not going to be much trouble in the icache. Looking at HMMER, for example, we see that it spends >95% of its execution time in a function called P7Viterbi(). This function can be partially vectorized, but the version with integer overflow checks doesn’t get vectorized at all. In other words, most of the slowdown comes from integer overflow checks interfering with loop optimizations. In contrast, GCC and Perl probably don’t get much benefit from advanced loop optimizations in the first place, hence the lack of slowdown there.
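Here is a toy version of the issue; a loop of this shape vectorizes nicely as-is, but under -fsanitize=signed-integer-overflow each addition carries a potential side exit, which defeats the vectorizer:

/* Easily vectorized as plain C; with trapping overflow checks,
   every signed add becomes a branch out of the loop. */
void accumulate(int *restrict a, const int *restrict b, int n) {
  for (int i = 0; i < n; i++)
    a[i] += b[i];
}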

Here I’ll take a second to mention that I had to hack SPEC slightly so that signed integer overflows wouldn’t derail my experiments. The changes appear to be performance-neutral. Only GCC, Perl, and H264Ref have signed overflows. Here’s the patch for SPEC CPU 2006 V1.2. All performance numbers in this post were taken on an i7-5820K (6-core Haswell-E at 3.3 GHz), running Ubuntu 14.04 in 64-bit mode, with frequency scaling disabled.

Now for the fun part: making code faster by removing overflow checks that provably don’t fire. At -O0 the SPEC INT 2006 benchmarks have 67,678 integer overflow checks, whereas at -O3 there are 38,527. So that’s nice: LLVM 3.8 can already get rid of 43% of naively inserted checks.

Let’s look at some details. First, the good news: all of the overflow checks in these functions are optimized away by LLVM:

int foo2(int x, int y) {
  return (short)x + (short)y;
}

int foo3(int x, int y) {
  return (long)x + (long)y;
}

int foo4(int x, int y) {
  return (x >> 1) + (y >> 1);
}

int32_t foo5(int32_t x, int32_t y) {
  const int32_t mask = ~(3U << 30);
  return (x & mask) + (y & mask);
}

int32_t foo6(int32_t x, int32_t y) {
  const int32_t mask = 3U << 30;
  return (x | mask) + (y | mask);
}

int32_t foo7(int32_t x, int32_t y) {
  const int32_t mask = 1U << 31;
  return (x | mask) + (y & ~mask);
}

See for yourself here. The common theme across these functions is that each overflow check can be seen to be unnecessary by looking at the high-order bits of its inputs. The code for these optimizations is here.

Now on the other hand, LLVM is unable to see that these functions don’t require overflow checks:

int foo8(int x) {
  return x < INT_MAX ? x + 1 : INT_MAX;
}

int foo9(int x, int y) {
  if (x < 10 && x > -10 && y < 10 && y > -10)
    return x + y;
  return 0;
}

Here’s another one where the check doesn’t get eliminated:

void foo10(char *a) {
  for (int i = 0; i < 10; i++)
    a[i] = 0;
}

There’s good news: Sanjoy Das has a couple of patches that, together, solve this problem. Their overall effect on SPEC is to reduce the overhead of signed integer overflow checking to 8.7%.

Finally, this function should get one overflow check rather than three:

unsigned foo11(unsigned *a, int i) {
  return a[i + 3] + a[i + 5] + a[i + 2];
}

This example (or something like it) is from Nadav Rotem on twitter; his redundant overflow check removal pass for Swift eliminates this sort of thing at the SIL level. I’ve done a bit of work on bringing these ideas to LLVM, and will hopefully have more to write about that later on.

In summary, signed integer overflow checks in LLVM are fast but they get in the way of the optimizers, which haven’t yet been systematically taught to see through them or eliminate them. There’s plenty of low-hanging fruit, and hopefully we can get the overhead down to the point where people can turn on overflow checking in production codes without thinking too hard about the tradeoffs.

Addenda:

I haven’t forgotten about Souper! We’ve taught it to eliminate integer overflow checks. I’ll write about that later too.

See Dan Luu’s article on this topic.

SQLite with a Fine-Toothed Comb

One of the things I’ve been doing at Trust-in-Soft is looking for defects in open-source software. The program I’ve spent the most time with is SQLite: an extremely widely deployed lightweight database. At ~113 KSLOC of pointer-intensive code, SQLite is too big for easy static verification. On the other hand, it has an incredibly impressive test suite that has already been used to drive dynamic tools such as Valgrind, ASan, and UBSan. I’ve been using these tests with tis-interpreter.

Here I try to depict the various categories of undefined behaviors (UBs); this post is mainly about the blue-shaded area of the figure.

This figure leaves out many UBs (there are a few hundred in all) and is not to scale.

For honesty’s sake I should add that using tis-interpreter isn’t painless (it wasn’t designed as a from-scratch C interpreter, but rather is an adaptation of a sound formal methods tool). It is slower even than Valgrind. It has trouble with separately compiled libraries and with code that interacts with the system, such as mmap() calls. As I tried to show in the figure above, when run on code that is clean with respect to ASan and UBSan, it tends to find fiddly problems that a lot of people aren’t going to want to hear about. In any case, one of the reasons that I’m pushing SQLite through tis-interpreter is to help us find and fix the pain points.

Next let’s look at some bugs. I’ve been reporting issues as I go, and a number of things that I’ve reported have already been fixed in SQLite. I’ll also discuss how these bugs relate to the idea of a Friendly C dialect.

Values of Dangling Pointers

SQLite likes to use — but not dereference — pointers to heap blocks that have been freed. It did this at quite a few locations. For example, at a time when zHdr is dangling:

if( (zHdr>=zEndHdr && (zHdr>zEndHdr 
  || offset64!=pC->payloadSize))
 || (offset64 > pC->payloadSize)
){
  rc = SQLITE_CORRUPT_BKPT;
  goto abort_due_to_error;
}

These uses are undefined behavior, but are they compiled harmfully by current C compilers? I have no evidence that they are, but in other situations the compiler will take advantage of the fact that a pointer is dangling; see this post and also Winner #2 here. You play with dangling pointers at your own risk. Valgrind and ASan make no attempt to catch these uses, as far as I know.

Using the value of a dangling pointer is a nettlesome UB, causing inconvenience while — as far as I can tell — giving the optimizer almost no added power in realistic situations. Eliminating it is a no-brainer for Friendly C.

Uses of Uninitialized Storage

I found several reads of uninitialized storage. This is somewhere between unspecified and undefined behavior.

One idiom was something like this:

int dummy;
some sort of loop {
  ...
  // we don't care about function()'s return value
  // (but its other callers might)
  dummy += function();
  ...
}
// dummy is not used again

Here the intent is to avoid a compiler warning about an ignored return value. Of course a better alternative is to initialize dummy; the compiler can still optimize away the unwanted bookkeeping if function() is inlined or otherwise specialized.

At least one uninitialized read that we found was potentially harmful, though we couldn’t make it behave unpredictably. Also, it had not been found by Valgrind. Both of these facts — the predictability and the lack of an alarm — might be explained by the compiler reusing a stack slot that had previously contained a different local variable. Of course we must not count on such coincidences working out well.

A Friendly C could leave reads of uninitialized storage alone, on the theory that tool support for detecting this class of error is good enough. This is the solution I would advocate. A heavier-handed alternative would be compiler-enforced zeroing of heap blocks and automatic variables.

Out-of-Bounds Pointers

In C you are not allowed to compute — much less use or dereference — a pointer that isn’t inside an object or one element past its end.
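For example:

void oob_examples(void) {
  int a[10];
  int *end = a + 10;   /* fine: one element past the end */
  int *before = a - 1; /* UB: merely computing this pointer */
  int *past = a + 11;  /* UB as well, even if never dereferenced */
  (void)end; (void)before; (void)past;
}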

SQLite’s vdbe struct has a member called aMem that uses 1-based array indexing. To avoid wasting an element, this array is initialized like this:

p->aMem = allocSpace(...);
p->aMem--;

I’ve elided a bunch of code; the full version is in sqlite3VdbeMakeReady() in this file. The real situation is more complicated since allocSpace() isn’t just a wrapper for malloc(): UB only happens when aMem points to the beginning of a block returned by malloc(). This could be fixed by avoiding the 1-based addressing, by allocating a zero element that would never be used, or by reordering the struct fields.
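Here’s a sketch of the hazard and of the extra-element fix, using a stand-in Mem type since the real code is more involved:

#include <stdlib.h>

typedef struct { int i; } Mem; /* stand-in for SQLite's Mem */

Mem *make_ub(size_t n) {
  Mem *p = malloc(n * sizeof *p);
  return p - 1; /* UB: the result points before the block */
}

Mem *make_ok(size_t n) {
  return malloc((n + 1) * sizeof(Mem)); /* element 0 is simply never used */
}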

Other out-of-bounds pointers in SQLite were computed when pointers into character arrays went past the end. This particular class of UB is commonly seen even in hardened C code. It is probably only a problem when either the pointer goes far out of bounds or else when the object in question is allocated near the end of the address space. In these cases, the OOB pointer can wrap, causing a bounds check to fail to trigger, potentially causing a security problem. The full situation involves an undesirable UB-based optimization and is pretty interesting. A good solution, as the LWN article suggests, is to move the bounds checks into the integer domain. The Friendly C solution would be to legitimize creation of, and comparison between, OOB pointers. Alternately, we could beef up the sanitizers to complain about these things.
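Concretely, moving such a check into the integer domain looks something like this sketch (assuming p and end point into the same object with p <= end):

#include <stddef.h>

/* May compute an out-of-bounds pointer, which is UB and can wrap: */
int bad_check(const char *p, const char *end, size_t len) {
  return p + len <= end;
}

/* No new pointer is created, so no UB and no wraparound: */
int good_check(const char *p, const char *end, size_t len) {
  return len <= (size_t)(end - p);
}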

Illegal Arguments

It is UB to call memset(), memcpy(), and other library functions with invalid or null pointer arguments. GCC actively exploits this UB to optimize code. SQLite had one place where it called memset() with an invalid pointer and another where it called memcpy() with a null pointer. In both cases the length argument was zero, so the calls were otherwise harmless. A Friendly C dialect would only require each pointer to refer to as much storage as the length argument implies is necessary, including none at all.
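In other words, even a zero-length call needs valid pointers, and the guard is trivial once you know it’s needed; a sketch:

#include <string.h>

void clear(char *p, size_t n) {
  memset(p, 0, n); /* UB if p is null or invalid, even when n == 0 */
}

void clear_safely(char *p, size_t n) {
  if (n != 0)
    memset(p, 0, n); /* the zero-length case never reaches memset() */
}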

Comparisons of Pointers to Unrelated Objects

When the relational operators >, >=, <, <= are used to compare two pointers, the program has undefined behavior unless both pointers refer to the same object, or to the location one past its end. SQLite, like many other programs, wants to do these kinds of comparisons. The solution is to reduce undefined behavior to implementation defined behavior by casting to uintptr_t and comparing the resulting integers. For example, SQLite now has this method for checking if a pointer (that might be to another object) lies within a memory range:

# define SQLITE_WITHIN(P,S,E) \
    ((uintptr_t)(P)>=(uintptr_t)(S) && \
     (uintptr_t)(P)<(uintptr_t)(E))

Uses of this macro can be found in btree.c.

With respect to pointer comparisons, tis-interpreter’s (and Frama-C’s) intentions are stronger than just detecting undefined behavior: we want to ensure that execution is deterministic. Comparisons between unrelated objects destroy determinism because the allocator makes no guarantees about their relative locations. On the other hand, if the pair of comparisons in a SQLITE_WITHIN call is treated atomically, and if S and E point into the same object, there is no determinism violation. We added a “within” builtin to Frama-C that can be used without violating determinism and also a bare pointer comparison that — if used — breaks the guarantee that Frama-C’s results hold for all possible allocation orders. Sound analysis of programs that depend on the relative locations of different allocated objects is a research problem and something like a model checker would be required.

In Friendly C, comparisons of pointers to unrelated objects would act as if they had already been cast to uintptr_t. I don’t think this ties the hands of the optimizer at all.

Summary

SQLite is a carefully engineered and thoroughly tested piece of software. Even so, it contains undefined behaviors because, until recently, no good checker for these behaviors existed. If anything is going to save us from UB hell, it’s tools combined with developers who care to listen to them. Richard Hipp is a programming hero who always responds quickly and has been receptive to the fiddly kinds of problems I’m talking about here.

What to Do Now

Although the situation is much, much better than it was five years ago, C and C++ developers will not be on solid ground until every undefined behavior falls into one of these two categories:

  • Erroneous UBs for which reliable and ubiquitous (but perhaps optional and inefficient) detectors exist. These can work either at compile time or run time.
  • Benign behaviors for which compiler developers have provided a documented semantics.

The Strict Aliasing Situation is Pretty Bad

I’ll start with a quick review of the strict aliasing rules in C and C++ and then present some less well-known material.

Strict Aliasing

Compiler optimizations are often shot down by the potential for pointers to be aliases. For example, although we might naively expect a compiler to optimize this function to return zero, that cannot happen because x and y might refer to the same location:

int foo(int *x, int *y) {
  *x = 0;
  *y = 1;
  return *x;
}

Generated code typically looks like this:

foo:    movl    $0, (%rdi)
        movl    $1, (%rsi)
        movl    (%rdi), %eax
        ret

Failure to optimize is particularly frustrating when we know for sure that x and y are not aliases. The “restrict” keyword in C was introduced to help solve this sort of problem but we’re not going to talk about that today. Rather, we’re going to talk about an orthogonal solution, the aliasing rules in C and C++ that permit the compiler to assume that an object will not be aliased by a pointer to a different type. Often this is called “strict aliasing” although that term does not appear in the standards. Consider, for example, this variant of the program above:

int foo(int *x, long *y) {
  *x = 0;
  *y = 1;
  return *x;
}

Since a pointer-to-int and a pointer-to-long may be assumed to not alias each other, the function can be compiled to return zero:

foo:    movl    $0, (%rdi)
        xorl    %eax, %eax
        movq    $1, (%rsi)
        ret

As we see here, the aliasing rules give C and C++ compilers leverage that they can use to generate better code. On the other hand, since C is a low-level language and C++ can be used as a low-level language, both permit casting between pointer types, which can end up creating aliases that violate the compiler’s assumptions. For example, we might naively write code like this to access the representation of a floating point value:

unsigned long bogus_conversion(double d) {
  unsigned long *lp = (unsigned long *)&d;
  return *lp;
}

This function is undefined under the aliasing rules. While this particular function happens to be compiled into the same code that would be emitted without strict aliasing, it is easy to write incorrect code of this kind that looks like it is getting broken by the optimizer:

#include <stdio.h>

long foo(int *x, long *y) {
  *x = 0;
  *y = 1;
  return *x;
}

int main(void) {
  long l;
  printf("%ld\n", foo((int *)&l, &l));
}

$ gcc-5 strict.c ; ./a.out
1
$ gcc-5 -O2 strict.c ; ./a.out
0
$ clang strict.c ; ./a.out
1
$ clang -O2 strict.c ; ./a.out
0

An exception to the strict aliasing rules is made for pointers to character types, so it is always OK to inspect an object’s representation via an array of chars. This is necessary to make memcpy-like functions work properly.
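For example, here is a UB-free version of bogus_conversion from above; this sketch assumes sizeof(unsigned long) == sizeof(double), which holds on x86-64 Linux:

unsigned long ok_conversion(double d) {
  unsigned long l;
  unsigned char *src = (unsigned char *)&d;
  unsigned char *dst = (unsigned char *)&l;
  for (unsigned i = 0; i < sizeof l; i++)
    dst[i] = src[i]; /* character-typed accesses may alias anything */
  return l;
}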

So far, this is very well known. Now let’s look at a few consequences of strict aliasing that are perhaps not as widely known.

Physical Subtyping is Broken

An old paper that I like uses the term “physical subtyping” to refer to the struct-based implementation of inheritance in C. Searching for “object oriented C” returns quite a few links. Additionally, many large C systems (the Linux kernel for example) implement OO-like idioms. Any time this kind of code casts between pointer types and dereferences the resulting pointers, it violates the aliasing rules. Many aliasing rule violations can be found in this book about object oriented C. Some build systems, such as Linux’s, invoke GCC with its -fno-strict-aliasing flag to avoid problems.

Update based on some comments from Josh Haberman and Sam Tobin-Hochstadt: It looks like the specific case where the struct representing the derived type includes its parent as its first member should not trigger UB. The language in this part of the standard is very hard to parse out.

This program from the Cerberus project illustrates the problem with changing the type of a pointer to struct:

#include <stdio.h>

typedef struct { int i1; } s1;
typedef struct { int i2; } s2;

void f(s1 *s1p, s2 *s2p) {
  s1p->i1 = 2;
  s2p->i2 = 3;
  printf("%i\n", s1p->i1);
}

int main() {
  s1 s = {.i1 = 1};
  f(&s, (s2 *)&s);
}

$ gcc-5 effective.c ; ./a.out
3
$ gcc-5 -O2 effective.c ; ./a.out
2
$ clang-3.8 effective.c ; ./a.out
3
$ clang-3.8 -O2 effective.c ; ./a.out
3

Chunking Optimizations Are Broken

Code that processes bytes one at a time tends to be slow. While an optimizing compiler can sometimes make a naive character-processing loop much faster, in practice we often need to help the compiler out by explicitly processing word-sized chunks of bytes at a time. Since the data reinterpretation is generally done by casting to a non-character-typed pointer, the resulting accesses are undefined. Search the web for “fast memcpy” or “fast memset”: many of the hits will return erroneous code. Example 1, example 2, example 3. Although I have no evidence that it is being miscompiled, OpenSSL’s AES implementation uses chunking and is undefined.

One way to get chunking optimizations without UB is to use GCC’s may_alias attribute, as seen here in Musl. This isn’t supported even by Clang, as far as I know.
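For illustration, here is a minimal may_alias-based word-at-a-time zeroing loop; the names are mine, not Musl’s, and the byte prologue keeps the word-sized stores aligned:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t __attribute__((__may_alias__)) word;

void zero_bytes(char *p, size_t n) {
  for (; ((uintptr_t)p % sizeof(word)) != 0 && n > 0; n--, p++)
    *p = 0; /* align to a word boundary */
  for (; n >= sizeof(word); n -= sizeof(word), p += sizeof(word))
    *(word *)p = 0; /* aliasing-safe word-sized stores */
  for (; n > 0; n--, p++)
    *p = 0; /* trailing bytes */
}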

Offset Overlap is Bad

Here is a devilish little program by Richard Biener and Robbert Krebbers that I found via the Cerberus report:

#include <stdio.h>
#include <stdlib.h>

struct X {
  int i;
  int j;
};

int foo(struct X *p, struct X *q) {
  q->j = 1;
  p->i = 0;
  return q->j;
}

int main() {
  unsigned char *p = malloc(3 * sizeof(int));
  printf("%i\n", foo((struct X *)(p + sizeof(int)),
                     (struct X *)p));
}

Both LLVM and GCC treat it as ill-formed and optimize accordingly:

$ clang-3.8 krebbers.c ; ./a.out
0
$ clang-3.8 -O2 krebbers.c ; ./a.out
1
$ gcc-5 krebbers.c ; ./a.out
0
$ gcc-5 -O2 krebbers.c ; ./a.out
1

int8_t and uint8_t Are Not Necessarily Character Types

This bug (and some linked discussions) indicate that compiler developers don’t necessarily consider int8_t and uint8_t to be character types for aliasing purposes. Wholesale replacement of character types with standard integer types — as advocated here, for example — would almost certainly lead to interesting strict aliasing violations when the resulting code was run through a compiler that doesn’t think int8_t and uint8_t are character types. Happily, no compiler has done this yet (that I know of).
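Here’s a sketch of the kind of code that would break: it is fine when written with unsigned char, but would become a strict aliasing violation under a compiler that decides uint8_t is not a character type:

#include <stdint.h>

uint32_t g;

uint8_t first_byte_maybe_ub(void) {
  return *(uint8_t *)&g; /* only OK if uint8_t is a character type */
}

unsigned char first_byte_ok(void) {
  return *(unsigned char *)&g; /* always OK: character-typed access */
}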

Summary

A lot of C code is broken under strict aliasing. Separate compilation is probably what protects us from broader compiler exploitation of the brokenness, but it is a very poor kind of protection. Static and dynamic checking tools are needed. If I were writing correctness-oriented C that relied on these casts I wouldn’t even consider building it without -fno-strict-aliasing.

Pascal Cuoq provided feedback on a draft of this piece.

A Variable Argument Hazard

Variable argument functions in C and C++ can be tricky to use correctly, and they typically only get compiler-based type checking for special cases such as printf(). Recently I ran across an entertaining variable argument bug using tis-interpreter.

Although I will use code from SQLite to illustrate this bug, this bug has not been found in SQLite itself. Rather, it was in some code used to test that database.

SQLite’s API has a configuration function that takes variable arguments:

int sqlite3_config(int op, ...);

Several of its configuration commands, such as this one, expect pointer-typed arguments:

case SQLITE_CONFIG_LOG: {
  typedef void(*LOGFUNC_t)(void*,int,const char*);
  sqlite3GlobalConfig.xLog = va_arg(ap, LOGFUNC_t);
  sqlite3GlobalConfig.pLogArg = va_arg(ap, void*);
  break;
}

We’re only allowed to execute this code when the first two variable arguments are respectively compatible with LOGFUNC_t and void *; otherwise the behavior is undefined.

The test code contains calls that look like this:

sqlite3_config(SQLITE_CONFIG_LOG, 0, pLog); 

At first glance there’s no problem — normally it is fine to pass a 0 to a function that expects a pointer argument: the 0 is turned into an appropriately-typed null pointer. On the other hand, when a function takes variable arguments (or when it lacks a prototype) the compiler cannot do this kind of caller-side type conversion; instead a different set of rules, the “default argument promotions,” is applied, resulting in a zero-valued int being passed to sqlite3_config() in the first variable argument position. Another fun aspect of the default argument promotions is that they force all float arguments to be promoted to double. These rules seem to be the same in C++, though they would fire less often due to C++’s stricter rules about function prototypes.
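The float-to-double promotion is easy to demonstrate; in this sketch the callee must read doubles even when the caller wrote floats:

#include <stdarg.h>

double sum_doubles(int n, ...) {
  va_list ap;
  double total = 0.0;
  va_start(ap, n);
  for (int i = 0; i < n; i++)
    total += va_arg(ap, double); /* va_arg(ap, float) would be UB */
  va_end(ap);
  return total;
}

Calling sum_doubles(2, 1.0f, 2.5f) works because both floats arrive as doubles.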

This problem was not noticed because the code happens to be compiled benignly on both x86 and x64 platforms. On x64 Linux, pointers and ints are not the same size, but the calling convention says that the first six integer and pointer arguments are passed in 64-bit registers — so the problem gets covered up by an implicit promotion to 64 bits. The ABI does not guarantee that the higher bits are zeroed in this case, but in this example they happen to be cleared by GCC and LLVM.

On x86 Linux, all arguments are passed in memory, but since pointers and integers are the same size, the problem is again covered up. An ABI that represents a null pointer differently than a zero-valued long integer would expose this bug right away, as would an x64 program that passes at least six variable arguments. In any case, the fix is easy:

sqlite3_config(SQLITE_CONFIG_LOG, (LOGFUNC_t)0, pLog); 

In summary, be careful with the default argument promotions that are done when calling variable argument functions.

Thanks to Pascal Cuoq who provided comments on this piece.

Also see this article by Rich Felker and this one by Jens Gustedt.

The Problem with Friendly C

I’ll assume you’re familiar with the Proposal for Friendly C and perhaps also Dan Bernstein’s recent call for a Boring C compiler. Both proposals are reactions to creeping exploitation of undefined behaviors as C/C++ compilers get better optimizers. The idea in both cases is that we want old code to just keep working, with latent bugs remaining latent.

After publishing the Friendly C Proposal, I spent some time discussing its design with people, and eventually I came to the depressing conclusion that there’s no way to get a group of C experts — even if they are knowledgeable, intelligent, and otherwise reasonable — to agree on the Friendly C dialect. There are just too many variations, each with its own set of performance tradeoffs, for consensus to be possible. To get a taste of this, notice that in the comments for the Friendly C post, several people disagree with what I would consider an extremely non-controversial design choice for Friendly C: memcpy() should have memmove() semantics. Another example is what should be done when a 32-bit integer is shifted by 32 places (this is undefined behavior in C and C++). Stephen Canon pointed out on twitter that there are many programs typically compiled for ARM that would fail if this produced something besides 0, and there are also many programs typically compiled for x86 that would fail when this evaluates to something other than the original value. So what is the friendly thing to do here? I’m not even sure; perhaps the baseline should be “do what gcc 3 would do on that target platform.” The situation gets worse when we start talking about a friendly semantics for races and OOB array accesses.

I don’t even want to entertain the idea of a big family of Friendly C dialects, each making some niche audience happy; that is not really an improvement over our current situation.

Luckily there’s an easy way forward, which is to skip the step where we try to get consensus. Rather, an influential group such as the Android team could create a friendly C dialect and use it to build the C code (or at least the security-sensitive C code) in their project. My guess is that if they did a good job choosing the dialect, others would start to use it, and at some point it becomes important enough that the broader compiler community can start to help figure out how to better optimize Friendly C without breaking its guarantees, and maybe eventually the thing even gets standardized. There’s precedent for organizations providing friendly semantics; Microsoft, for example, provides stronger-than-specified semantics for volatile variables by default on platforms other than ARM.

Since we published the Friendly C proposal, people have been asking me how it’s going. This post is a long-winded way of saying that I lost faith in my ability to push the work forward. However, I still think it’s a great idea and that there are people besides me who can make it happen.

Reducers are Fuzzers

A test case isn’t just a test case: it lives in the (generally extremely large) space of inputs for the software system you are testing. If we have a test case that triggers a bug, one way to look at it is as a point in that space, surrounded by a region of nearby inputs that trigger the same bug.

The set of test cases triggering a bug is a useful notion since we can search it. For example, a test case reducer is a program that searches for the smallest test case triggering a bug. It requires a way to transform a test case into a smaller one, for example by deleting part of it. The new variant of the test case may or may not trigger the bug. The process iterates: create a smaller variant, check whether it still triggers the bug, keep it if so, and otherwise discard it and try a different transformation.

I’ve spent a lot of time watching reducers run, and one thing I’ve noticed is that the reduction process often triggers bugs unrelated to the bug that is the subject of the reduction.

Sometimes this is undesirable, such as when a lax interestingness test permits the reduction of one bug to get hijacked by a different bug. This happens all the time when reducing segfaults, which are hard to tell apart. But on the other hand, if we’re looking for bugs then this phenomenon is a useful one.

It seems a bit counterintuitive that test case reduction would lead to the discovery of new bugs since we might expect that the space of inputs to a well-tested software system is mostly non-bug-triggering with a few isolated pockets of bug-triggering inputs scattered here and there. I am afraid that that view might not be realistic. Rather, all of the inputs we usually see occupy a tiny portion of the space of inputs, and it is surrounded by huge overlapping clouds of bug-triggering inputs. Fuzzers can push the boundaries of the space of inputs that we can test, but not by as much as people generally think. Proofs remain the only way to actually show that a piece of software does the right thing over any significant chunk of its input space. But I digress. The important fact is that reducers are decently effective mutation-based fuzzers.

In the rest of this post I’d like to push that idea a bit farther by doing a reduction that doesn’t correspond to a bug and seeing if we can find some bugs along the way. We’ll start with this exciting C++ program

#include <iostream>

int main() {
  std::cout << "Hello World!++" << std::endl;
}

and we’ll reduce it under the criterion that it remains a viable Hello World implementation. First we’ll preprocess and then use delta’s topformflat to smoosh everything together nicely:

g++ -std=c++11 -E -P -O3 hello-orig.cpp | topformflat > hello.cpp

You might be saying to yourself something like “it sure is stupid that John is passing an optimization flag to the preprocessor,” but trust me that it actually does change the emitted code. I didn’t check if it makes a difference here but I’ve learned to avoid trouble by just passing the same flags to the preprocessor as to the compiler.

Anyhow, the result is 550 KB of goodness.

Here’s the code that checks if a variant is interesting — that is, if it acts like a Hello World implementation:

g++ -O3 -std=c++11 -w hello.cpp >/dev/null 2>&1 &&
ulimit -t 1 &&
./a.out | grep Hello

The ulimit is necessary because infinite loops sometimes get into the program that is being reduced.

To find compiler crashes we’ll need a slightly more elaborate test:

if
  g++ -O3 -std=c++11 -w hello.cpp >compiler.out 2>&1
then
  ulimit -t 1 &&
  ./a.out | grep Hello
else
  if
    grep 'internal compiler error' compiler.out
  then
    exit 101
  else
    exit 1
  fi
fi

When the compiler fails we look at its output and, if it contains evidence of a compiler bug, exit with code 101, which will tell C-Reduce that it should save a copy of the input files that made this happen.

The compiler we’ll use is g++ r231221, the development head from December 3 2015. Let’s get things going:

creduce --nokill --also-interesting 101 --no-default-passes \
  --add-pass pass_clex rm-tok-pattern-4 10 ../test.sh hello.cpp

The --also-interesting 101 option indicates that the interestingness test will use process exit code 101 to tell C-Reduce to make a snapshot of the directory containing the files being reduced, so we can look at it later. --no-default-passes clears C-Reduce’s pass schedule and --add-pass pass_clex rm-tok-pattern-4 10 adds a single pass that uses a small sliding window to remove tokens from the test case. The issue here is that not all of C-Reduce’s passes are equally effective at finding bugs. Some passes, such as the one that removes dead variables and the one that removes dead functions, will probably never trigger a compiler bug. Other passes, such as the one that removes chunks of lines from a test case, eliminate text from the test case so rapidly that effective fuzzing doesn’t happen. There are various ways to deal with this problem, such as probabilistically rejecting improvements or rejecting improvements that are too large, but for this post I’ve chosen the simple expedient of running just one pass that (1) makes progress very slowly and (2) seems to be a good fuzzer.

The dynamics of this sort of run are interesting: as the test case walks around the space of programs, you can actually see it brush up against a compiler bug and then wander off into the weeds again.

The results are decent: after about 24 hours, C-Reduce caused many segfaults in g++ and triggered six different internal compiler errors: 1, 2, 3, 4, 5, 6. One of these was already reported, another looks probably like a duplicate, and four appear to be new.

I did a similar run against Clang++ r254595, again the development head from December 3. This produced segfaults and also triggered 25 different assertions:

llvm/include/llvm/Support/Casting.h:230: typename cast_retty::ret_type llvm::cast(Y &) [X = clang::FunctionProtoType, Y = clang::QualType]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed.
llvm/include/llvm/Support/Casting.h:95: static bool llvm::isa_impl_cl::doit(const From *) [To = clang::FunctionTemplateDecl, From = const clang::Decl *]: Assertion `Val && "isa<> used on a null pointer"' failed.
llvm/tools/clang/lib/AST/Decl.cpp:2134: clang::APValue *clang::VarDecl::evaluateValue(SmallVectorImpl &) const: Assertion `!Init->isValueDependent()' failed.
llvm/tools/clang/lib/AST/Decl.cpp:2181: bool clang::VarDecl::checkInitIsICE() const: Assertion `!Init->isValueDependent()' failed.
llvm/tools/clang/lib/AST/ExprCXX.cpp:451: static clang::DependentScopeDeclRefExpr *clang::DependentScopeDeclRefExpr::Create(const clang::ASTContext &, clang::NestedNameSpecifierLoc, clang::SourceLocation, const clang::DeclarationNameInfo &, const clang::TemplateArgumentListInfo *): Assertion `QualifierLoc && "should be created for dependent qualifiers"' failed.
llvm/tools/clang/lib/AST/../../include/clang/AST/TypeNodes.def:98: clang::TypeInfo clang::ASTContext::getTypeInfoImpl(const clang::Type *) const: Assertion `!T->isDependentType() && "should not see dependent types here"' failed.
llvm/tools/clang/lib/CodeGen/CodeGenModule.cpp:623: llvm::StringRef clang::CodeGen::CodeGenModule::getMangledName(clang::GlobalDecl): Assertion `II && "Attempt to mangle unnamed decl."' failed.
llvm/tools/clang/lib/CodeGen/../../include/clang/AST/Expr.h:134: void clang::Expr::setType(clang::QualType): Assertion `(t.isNull() || !t->isReferenceType()) && "Expressions can't have reference type"' failed.
llvm/tools/clang/lib/Lex/PPCaching.cpp:101: void clang::Preprocessor::AnnotatePreviousCachedTokens(const clang::Token &): Assertion `CachedTokens[CachedLexPos-1].getLastLoc() == Tok.getAnnotationEndLoc() && "The annotation should be until the most recent cached token"' failed.
llvm/tools/clang/lib/Parse/../../include/clang/Parse/Parser.h:2256: void clang::Parser::DeclaratorScopeObj::EnterDeclaratorScope(): Assertion `!EnteredScope && "Already entered the scope!"' failed.
llvm/tools/clang/lib/Sema/../../include/clang/AST/DeclTemplate.h:1707: void clang::ClassTemplateSpecializationDecl::setInstantiationOf(clang::ClassTemplatePartialSpecializationDecl *, const clang::TemplateArgumentList *): Assertion `!SpecializedTemplate.is() && "Already set to a class template partial specialization!"' failed.
llvm/tools/clang/lib/Sema/../../include/clang/Sema/Lookup.h:460: clang::NamedDecl *clang::LookupResult::getFoundDecl() const: Assertion `getResultKind() == Found && "getFoundDecl called on non-unique result"' failed.
llvm/tools/clang/lib/Sema/SemaDecl.cpp:10455: clang::Decl *clang::Sema::ActOnParamDeclarator(clang::Scope *, clang::Declarator &): Assertion `S->isFunctionPrototypeScope()' failed.
llvm/tools/clang/lib/Sema/SemaDeclCXX.cpp:11373: ExprResult clang::Sema::BuildCXXDefaultInitExpr(clang::SourceLocation, clang::FieldDecl *): Assertion `Lookup.size() == 1' failed.
llvm/tools/clang/lib/Sema/SemaExpr.cpp:2274: ExprResult clang::Sema::ActOnIdExpression(clang::Scope *, clang::CXXScopeSpec &, clang::SourceLocation, clang::UnqualifiedId &, bool, bool, std::unique_ptr, bool, clang::Token *): Assertion `R.getAsSingle() && "There should only be one declaration found."' failed.
llvm/tools/clang/lib/Sema/SemaExprCXX.cpp:2272: clang::FunctionDecl *clang::Sema::FindUsualDeallocationFunction(clang::SourceLocation, bool, clang::DeclarationName): Assertion `Matches.size() == 1 && "unexpectedly have multiple usual deallocation functions"' failed.
llvm/tools/clang/lib/Sema/SemaExprCXX.cpp:6663: ExprResult clang::Sema::CorrectDelayedTyposInExpr(clang::Expr *, clang::VarDecl *, llvm::function_ref): Assertion `TyposInContext < ~0U && "Recursive call of CorrectDelayedTyposInExpr"' failed.
llvm/tools/clang/lib/Sema/SemaExprMember.cpp:91: IMAKind ClassifyImplicitMemberAccess(clang::Sema &, const clang::LookupResult &): Assertion `!R.empty() && (*R.begin())->isCXXClassMember()' failed.
llvm/tools/clang/lib/Sema/SemaLookup.cpp:1904: bool clang::Sema::LookupQualifiedName(clang::LookupResult &, clang::DeclContext *, bool): Assertion `(!isa(LookupCtx) || LookupCtx->isDependentContext() || cast(LookupCtx)->isCompleteDefinition() || cast(LookupCtx)->isBeingDefined()) && "Declaration context must already be complete!"' failed.
llvm/tools/clang/lib/Sema/SemaLookup.cpp:2729: Sema::SpecialMemberOverloadResult *clang::Sema::LookupSpecialMember(clang::CXXRecordDecl *, clang::Sema::CXXSpecialMember, bool, bool, bool, bool, bool): Assertion `CanDeclareSpecialMemberFunction(RD) && "doing special member lookup into record that isn't fully complete"' failed.
llvm/tools/clang/lib/Sema/SemaOverload.cpp:11671: ExprResult clang::Sema::CreateOverloadedBinOp(clang::SourceLocation, unsigned int, const clang::UnresolvedSetImpl &, clang::Expr *, clang::Expr *): Assertion `Result.isInvalid() && "C++ binary operator overloading is missing candidates!"' failed.
llvm/tools/clang/lib/Sema/SemaTemplate.cpp:2906: ExprResult clang::Sema::BuildTemplateIdExpr(const clang::CXXScopeSpec &, clang::SourceLocation, clang::LookupResult &, bool, const clang::TemplateArgumentListInfo *): Assertion `!R.empty() && "empty lookup results when building templateid"' failed.
llvm/tools/clang/lib/Sema/SemaTemplateDeduction.cpp:609: (anonymous namespace)::PackDeductionScope::PackDeductionScope(clang::Sema &, clang::TemplateParameterList *, SmallVectorImpl &, clang::sema::TemplateDeductionInfo &, clang::TemplateArgument): Assertion `!Packs.empty() && "Pack expansion without unexpanded packs?"' failed.
llvm/tools/clang/lib/Sema/SemaTemplateInstantiate.cpp:2781: llvm::PointerUnion *clang::LocalInstantiationScope::findInstantiationOf(const clang::Decl *): Assertion `isa(D) && "declaration not instantiated in this scope"' failed.
llvm/tools/clang/lib/Sema/SemaTemplateVariadic.cpp:290: bool clang::Sema::DiagnoseUnexpandedParameterPack(clang::Expr *, clang::Sema::UnexpandedParameterPackContext): Assertion `!Unexpanded.empty() && "Unable to find unexpanded parameter packs"' failed.

I have to admit that I felt a bit overwhelmed by 25 potential bug reports, and I haven’t reported any of these yet. My guess is that a number of them are already in the bugzilla since people have been fuzzing Clang lately. Anyway, I’ll try to get around to reducing and reporting these. Really, this all needs to be automated so that when subsequent reductions find still more bugs, these just get added to the queue of reductions to run.

If you were interested in reproducing these results, or in trying something similar, you would want to use C-Reduce’s master branch. I ran everything on an Ubuntu 14.04 box. While preparing this post I found that different C-Reduce command line options produced widely varying numbers and kinds of crashes.

Regarding previous work, I believe — but couldn’t find written down — that the CERT BFF watches out for new crashes when running its reducer. In a couple of papers written by people like Alex Groce and me, we discussed the fact that reducers often slip from triggering one bug to another.

The new thing in this post is to show that triggering new bugs while reducing isn’t just some side effect. Rather, we can go looking for trouble, and we can do it without being given a bug to reduce in the first place. A key enabler for easy bug-finding with C-Reduce was finding a simple communication mechanism by which the interestingness test can give C-Reduce a bit of out-of-band information that a variant should be saved for subsequent inspection. I’m not trying to claim that reducers are awesome fuzzers, but on the other hand, it might be a little bit awesome that mutating Hello World resulted in triggering 25 different assertion violations in a mature and high-quality compiler. I bet we’d have done even better by starting with a nice big fat Boost application.

Multi-Version Execution Defeats a Compiler-Bug-Based Backdoor

[This piece is jointly authored by Cristian Cadar, Luís Pina, and John Regehr]

What should you do if you’re worried that someone might have exploited a compiler bug to introduce a backdoor into code that you are running? One option is to find a bug-free compiler. Another is to run versions of the code produced by multiple compilers and to compare the results (of course, under the additional assumption that the same bug does not affect all the compilers). For some programs, such as those whose only effect is to produce a text file, comparing the output is easy. For others, such as servers, this is more difficult and specialized system support is required.

Today we’ll look at using Varan the Unbelievable to defeat the sudo backdoor from the PoC||GTFO article. Varan is a multi-version execution system that exploits the fact that if you have some unused cores, running additional copies of a program can be cheap. Varan designates a leader process whose system call activity is recorded in a shared ring buffer, and one or more follower processes that read results out of the ring buffer instead of actually issuing system calls.

Compilers have a lot of freedom while generating code, but the sequence of system calls executed by a program represents its external behaviour and in most cases the compiler is not free to change it at all. There might be slight variations, e.g. due to different compilers using different libraries, but these can be easily handled by Varan. Since all correctly compiled variants of a program should have the same external behaviour, any divergence in the sequence of system calls across versions flags a potential security attack, in which case Varan stops the program before any harm is done.

Typically, Varan runs the leader process at full speed while also recording the results of its system calls into the ring buffer. However, when used in a security-sensitive setting, Varan can designate some system calls as blocking, meaning that the leader cannot execute those syscalls until all followers have reached that same program point without diverging. For sudo, we designate execve as blocking, since that is a point at which sudo might perform an irrevocably bad action.

So here’s the setup:

  1. We have a patched version of sudo 1.8.13 from the PoC||GTFO article. It runs correctly and securely when compiled by a correct C compiler, but improperly gives away root privileges when compiled by Clang 3.3 because the patch was designed to trigger a wrong-code bug in that compiler.
  2. We are going to pretend that we don’t know about the Clang bug and the backdoor. We compile two versions of the patched sudo: one with Clang 3.3, the other with the default system compiler, GCC 4.8.4.
  3. We run these executables under Varan. Since the critical system call execve is blocking, it doesn’t much matter which version is the leader and which is the follower.

Now let’s visit an Ubuntu 14.04 VM where both versions of sudo (setuid root, of course) and Varan are installed. We’re using a user account that is not in the sudoers file — it should not be allowed to get root privileges under any circumstances. First let’s make sure that a sudo that was properly compiled (using GCC) works as expected:

$ /home/varan/sudo-1.8.13/install/bin/sudo-gcc cat /etc/shadow
Password: 
test is not in the sudoers file.  This incident will be reported.

Next, we make sure that the backdoor is functioning as intended:

$ /home/varan/sudo-1.8.13/install/bin/sudo-clang cat /etc/shadow
Password: 
root:!:16693:0:99999:7:::
...
test:$6$nL9rosCD$A.prR28WWZYgDHy.m6ovWYaRhP/OhOX7V9nDDirm9ZffjTWhLdJ7C171wTfLBpQNC2EoeEKAJpcEz9gV9ravL/:16693:0:99999:7:::
messagebus:*:16757:0:99999:7:::
colord:*:16757:0:99999:7:::

So far so good. Next let’s try the gcc-compiled sudo as the leader with the backdoored sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-gcc \
          /home/varan/sudo-1.8.13/install/bin/sudo-clang -- cat /etc/shadow
Password: 
test is not in the sudoers file.  This incident will be reported.
$

What happened here is that the gcc-compiled leader runs as before, since it doesn’t ever try to execute an execve call. When the backdoored follower tries to execute the malicious execve call, Varan detects the divergence and terminates both processes safely.

Now let’s try switching around the leader and follower, i.e., run the backdoored sudo as the leader with the gcc-compiled sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-clang \
          /home/varan/sudo-1.8.13/install/bin/sudo-gcc -- cat /etc/shadow
Password: 
$

This time the leader tries to execute the malicious execve call, and Varan blocks its execution until the follower reaches the same system call or diverges. In this case, the follower tries to execute a write system call (to print “test is not in the sudoers file...”) and thus Varan detects divergence and again terminates execution safely.

In this example, we only ran two versions in parallel, but Varan can run more than two versions. In terms of performance and resource utilization, security applications like sudo are a great match for multi-version execution: they are not CPU-bound, so any performance degradation is imperceptible to the user, and the extra cores are needed only briefly, during the critical security validation checks. We are looking into applying this approach to other critical security applications (e.g. ssh-agent and password managers), and are investigating a way of hardening executables by generating a single binary with Varan and a bunch of versions, each version generated by a different compiler. We can then deploy this hardened executable instead of the original program.

Of course, Varan can detect misbehavior other than compiler-bug-based backdoors. Divergence could be caused by a memory or CPU glitch, by a plain old compiler bug that is triggered unintentionally instead of being triggered by an adversarial patch, or by a situation where an application-level undefined behavior bug has been exploited by only one of the compilers, or even where both compilers exploited the bug but not in precisely the same way. A nice thing about N-version programming at the system call level is that it won’t bother us about transient divergences that do not manifest as externally visible behaviour through a system call.

We’ll end by pointing out a piece of previous work along these lines: the Boeing 777 uses compiler-based and also hardware-based N-version diversity: there is a single version of the Ada avionics software that is compiled by three different compilers and then it runs on three different processors: a 486, a 68040, and an AMD 29050.

Testcase Reduction for Non-Preprocessed C and C++

C-Reduce takes a C or C++ file that triggers a bug in a compiler (or other tool that processes source code) and turns it into the smallest possible test case that still triggers the bug. Most often, we try to reduce code that has already been preprocessed. This post is about how to reduce non-preprocessed code, which is sometimes necessary when — due to use of an integrated preprocessor — the compiler bug goes away when a separate preprocessing step is used.

The first thing we need to do is get all of the necessary header files into one place. This is somewhat painful due to things like computed includes and #include_next. I wrote a script that follows the transitive includes, renaming files and copying them over; it works fine on Linux but sometimes fails on OS X, and I haven’t figured out why yet. Trust me that you do not want to look too closely at the Boost headers.

Second, we need to reduce multiple files at once, since they have not yet been glommed together by the preprocessor. C-Reduce, which is a collection of passes that iterate to fixpoint, does this by running each pass over each file before proceeding to the next pass. This seems to work well. A side benefit of implementing multi-file reduction is that it has other uses such as reducing programs that trigger link-time optimization bugs, which inherently requires multiple files.

Non-preprocessed code contains comments, so C-Reduce has a special pass for stripping those out. We don’t want to do this before running C-Reduce because removing comments might make the bug we’re chasing go away. Another pass specifically removes #include directives which tend to be deeply nested in some C++ libraries.

#ifdef … #endif pairs are hard to eliminate from first principles because they are often not located near each other in the file being reduced, but you still need to eliminate both at once. At first this sounded like a hard problem to solve, but then I found Tony Finch’s excellent unifdef tool and wrote a C-Reduce pass that simply calls it for every relevant preprocessor symbol.

Finally, it is often the case that a collection of reduced header files contains long chains of trivial #includes. C-Reduce fixes these with a pass that replaces an #include with the included text when the included file is very small.

What’s left to do? The only obvious thing on my list is selectively evaluating the substitutions suggested by #define directives. I will probably only do this by shelling out to an external tool, should someone happen to write it.

In summary, reducing non-preprocessed code is not too hard, but some specific support is required in order to do a good job of it. If you have a testcase reduction problem that requires multi-file reduction or needs to run on non-preprocessed code, please try out C-Reduce and let us know — at the creduce-dev mailing list — how it worked. The features described in this post aren’t yet in a release, just build and install our master branch.

As a bonus, since you’re dying to know, I’ll show you what C-Reduce thinks is the minimal hello world program in C++11. From 127 header files + the original source file, it creates 126 empty files plus this hello.cpp:

#include "ostream"
namespace std {
basic_ostream<char> cout;
ios_base::Init x0;
}
int main() { std::cout << "Hello"; }

And this ostream:

namespace
{
typedef __SIZE_TYPE__ x0;
typedef __PTRDIFF_TYPE__ x1;
}
namespace std
{
template < class > struct char_traits;
}
typedef x1 x2;
namespace std
{
template < typename x3, typename = char_traits < x3 > >struct basic_ostream;
template <> struct char_traits<char>
{
    typedef char x4;
    static x0 x5 ( const x4 * x6 )
    {
        return __builtin_strlen ( x6 );
    }
};
template < typename x3, typename x7 > basic_ostream < x3,
         x7 > &__ostream_insert ( basic_ostream < x3, x7 > &, const x3 *, x2 );
struct ios_base
{
    struct Init
    {
        Init (  );
    };
};
template < typename, typename > struct basic_ostream
{
};
template < typename x3,
         typename x7 > basic_ostream < x3 > operator<< ( basic_ostream < x3,
                 x7 > &x8, const x3 * x6 )
{
    __ostream_insert ( x8, x6, x7::x5 ( x6 ) );
    return x8;
}
}

It’s obviously not minimal, but believe me, we have already put a lot of work into domain-specific C++ reduction tricks. If you see something here that can be gotten rid of and want to take a crack at teaching our clang_delta tool to do the transformation, we’ll take a pull request.

A Few Synthesizing Superoptimizer Results

For this post, I crippled Souper by disabling its path conditions and limiting the depth of harvested expressions to two LLVM instructions. The first goal was to create a nice easy burn-in test for Souper’s instruction synthesizer, which uses a variant of this method; the second goal was to see if depth-limited, path-condition-free expressions would result in optimizations that are easier to understand and implement in LLVM.

We harvested LLVM expressions from SPEC CINT 2006, leaving out omnet, which caused a compiler crash. For each of the ~30,000 unique harvested expressions, we ran Souper’s synthesizer, which attempts to create a functionally equivalent but cheaper version of the expression. The results are ranked by the average of the optimization’s rank in terms of static and dynamic profile counts.

Here are the results.

One thing you’ll notice is a lot of this sort of thing:

%0:i8 = var
%1:i8 = and 1:i8, %0
%2:i1 = ne 0:i8, %1
infer %2
%3:i1 = trunc %0
result %3

The backends don’t usually care which of these they get, but even so this seems like a nice easy simplification to implement at the IR level.

My favorite optimizations from this batch are where Souper uses bit-level tricks to remove select instructions; here’s an example where the optimized version seems to lead to better code on the architectures that I’ve tried:

%0:i1 = var
%1:i64 = var
%2:i64 = select %0, %1, 1:i64
%3:i1 = eq 0:i64, %2
infer %3
%4:i64 = zext %0
%5:i1 = ult %1, %4
result %5

On the other hand, it is clear that there are many select-removal “optimizations” in this batch that are just not a good idea. We are not finding it easy to automatically choose which of two LLVM expressions is preferable. For now we just say that most instructions cost 1, select and div/rem cost 3, and constants and phis are free. Obviously this is just a first hack. We’d appreciate feedback on this aspect of the work (and others, of course). And obviously let us know if you find bugs in the results.

We might ask what can cause Souper to not find an optimization. First, Souper’s grab-bag of components for synthesis contains one of each of the following: shl, lshr, ashr, and, or, xor, add, sub, slt, ult, sle, ule, eq, ne, and constant. If an optimization requires, for example, two xors, we won’t find it. Of course we could give the synthesizer more components to work with but that slows it down. Sext/zext/trunc aren’t considered basic components by Souper, but are added as necessary to make bitwidths match up. This leads us to the second cause of missed optimizations: Souper won’t make up bitwidths not found in the original code. So, if there’s a trick where the arguments need to be extended from 32 bits to 64 and then truncated later, we won’t find that. Again this isn’t hard to fix, but it does appear to be hard to fix without slowing down synthesis a lot. Finally, if the solver (Z3 here) uses too much time or memory, or if some limits on loops inside the synthesizer are exceeded, we’ll miss out on whatever we might have found if we were willing to use more resources.

I’d like to distill the synthesis problems that come from optimizing LLVM code into some sort of synthesis benchmark that researchers can use to evaluate their work. We would say that your synthesis engine is better than ours (on this benchmark) if it can save more total static or dynamic profile weight in a given amount of time, or if it can produce equivalent results in less time. Of course we’ll have to agree on a cost model. I just attended the 2015 SMT Workshop the other day (where I talked about Souper!) and it sounds like there is plenty of interest in the solver community in providing better support for synthesis, which is great!

We have a medium-term goal of taking an optimization discovered by Souper and generalizing it into something that is very close to what an LLVM developer would implement. Since Souper has no capacity for abstraction, we’ll turn its optimizations into Alive patterns and then derive not only any preconditions that need to hold before the optimization can fire but also code for converting constant values on the left hand side into constants on the right hand side. For example let’s say that Souper finds this optimization:

%1 = lshr i32 %in, 3
%out = shl nuw i32 %1, 3
  =>
%out = and i32 4294967288, %in

The general form that we’d like to derive is:

%1 = lshr %in, C
%out = shl %1, C
  =>
%out = and -(1<<C), %in

Can we derive -(1<<C) from scratch? Time will tell. If this works, it’ll be useful for two reasons. First, it’ll effectively deduplicate Souper’s output: many specific optimizations will all boil down to the same general one. Second, the optimization will be easier to implement in LLVM if a human doesn’t have to do the generalization work. Some work by Nuno Lopes contains the basic ideas that we’ll use.
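As a sanity check on the generalized pattern (an exhaustive test at a small width, not a proof):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  for (uint32_t c = 0; c < 16; c++)
    for (uint32_t x = 0; x <= 0xffff; x++)
      if (((x >> c) << c) != (x & -(UINT32_C(1) << c)))
        printf("mismatch: x=%u c=%u\n", (unsigned)x, (unsigned)c);
  return 0;
}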

Something else I’ve been thinking about is the fact that all compilers have weaknesses of the kind that we point out in this post. The OPTGEN paper has some concrete data for GCC, LLVM, and Intel CC. I’ve heard that the Microsoft C/C++ compiler could stand to be improved in this respect too. CompCert needs a ton of work, the last time I checked. Wouldn’t it be cool if the work we’re doing for Souper and Alive could apply more broadly than just to LLVM? One observation is that if we deal only with integer values, the IR semantics in each of these compilers is probably broadly the same (leaving out some tricky issues like IR-level undefined behaviors). There’s probably no fundamental reason we can’t just reuse integer optimizations across multiple compilers.

Here are a few offers I’d like to extend:

  1. If you are an LLVM frontend developer, I’d like to run Souper on some representative code from your frontend. It would be easiest if you harvested optimized LLVM code into Souper yourself (I’ll have to write up instructions) but if not, you can send me some bitcode files — in this case we won’t get dynamic profile information. A few GB would not be too much. The effect will be that we can help the LLVM middle-end optimizers do a better job on code that you care about.
  2. If you develop a compiler other than LLVM, even a closed-source one, let’s figure out a way to translate an integer subset of your IR into Souper, so we can look for missed optimizations. My guess is that if you target a subset of Souper that doesn’t have phis or path conditions, this isn’t very difficult. You will basically end up doing the same kind of depth-bounded expression harvesting that I did.
  3. If you are doing research on program synthesis, and if you might be interested in working towards some benchmarks about optimizing LLVM code, let me know and let’s figure out how to make that happen.

The next set of results that I post will be similar to these, but they’ll show off Souper’s ability to exploit the results of one or two of LLVM’s dataflow analyses. Finally I want to add that although I’ve been writing up these posts, Souper is a group effort.