
Multi-Version Execution Defeats a Compiler-Bug-Based Backdoor

[This piece is jointly authored by Cristian Cadar, Luís Pina, and John Regehr]

What should you do if you’re worried that someone might have exploited a compiler bug to introduce a backdoor into code that you are running? One option is to find a bug-free compiler. Another is to run versions of the code produced by multiple compilers and to compare the results (of course, under the additional assumption that the same bug does not affect all the compilers). For some programs, such as those whose only effect is to produce a text file, comparing the output is easy. For others, such as servers, this is more difficult and specialized system support is required.

Today we’ll look at using Varan the Unbelievable to defeat the sudo backdoor from the PoC||GTFO article. Varan is a multi-version execution system that exploits the fact that if you have some unused cores, running additional copies of a program can be cheap. Varan designates a leader process whose system call activity is recorded in a shared ring buffer, and one or more follower processes that read results out of the ring buffer instead of actually issuing system calls.

Compilers have a lot of freedom while generating code, but the sequence of system calls executed by a program represents its external behaviour, and in most cases the compiler is not free to change it at all. There might be slight variations, e.g. due to different compilers using different libraries, but these can be easily handled by Varan. Since all correctly compiled variants of a program should have the same external behaviour, any divergence in the sequence of system calls across versions flags a potential security attack, in which case Varan stops the program before any harm is done.

Typically, Varan runs the leader process at full speed while also recording the results of its system calls into the ring buffer. However, when used in a security-sensitive setting, Varan can designate some system calls as blocking, meaning that the leader cannot execute those syscalls until all followers have reached that same program point without diverging. For sudo, we designate execve as blocking, since that is a point at which sudo might perform an irrevocably bad action.

So here’s the setup:

  1. We have a patched version of sudo 1.8.13 from the PoC||GTFO article. It runs correctly and securely when compiled by a correct C compiler, but improperly gives away root privileges when compiled by Clang 3.3 because the patch was designed to trigger a wrong-code bug in that compiler.
  2. We are going to pretend that we don’t know about the Clang bug and the backdoor. We compile two versions of the patched sudo: one with Clang 3.3, the other with the default system compiler, GCC 4.8.4.
  3. We run these executables under Varan. Since the critical system call execve is blocking, it doesn’t much matter which version is the leader and which is the follower.

Now let’s visit an Ubuntu 14.04 VM where both versions of sudo (setuid root, of course) and Varan are installed. We’re using a user account that is not in the sudoers file — it should not be allowed to get root privileges under any circumstances. First let’s make sure that a sudo that was properly compiled (using GCC) works as expected:

$ /home/varan/sudo-1.8.13/install/bin/sudo-gcc cat /etc/shadow
test is not in the sudoers file.  This incident will be reported.

Next, we make sure that the backdoor is functioning as intended:

$ /home/varan/sudo-1.8.13/install/bin/sudo-clang cat /etc/shadow

So far so good. Next let’s try the gcc-compiled sudo as the leader with the backdoored sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-gcc \
          /home/varan/sudo-1.8.13/install/bin/sudo-clang -- cat /etc/shadow
test is not in the sudoers file.  This incident will be reported.

What happened here is that the gcc-compiled leader runs as before, since it doesn’t ever try to execute an execve call. When the backdoored follower tries to execute the malicious execve call, Varan detects the divergence and terminates both processes safely.

Now let’s try switching around the leader and follower, i.e., run the backdoored sudo as the leader with the gcc-compiled sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-clang \
          /home/varan/sudo-1.8.13/install/bin/sudo-gcc -- cat /etc/shadow

This time the leader tries to execute the malicious execve call, and Varan blocks its execution until the follower reaches the same system call or diverges. In this case, the follower tries to execute a write system call (to print “test is not in the sudoers file...”) and thus Varan detects divergence and again terminates execution safely.

In this example, we only ran two versions in parallel, but Varan can run more than two versions. In terms of performance and resource utilization, security applications like sudo are a great match for multi-version execution: they are not CPU-bound, so any performance degradation is imperceptible to the user, and the extra cores are needed only briefly, during the critical security validation checks. We are looking into applying this approach to other critical security applications (e.g. ssh-agent and password managers), and are investigating a way of hardening executables by generating a single binary with Varan and a bunch of versions, each version generated by a different compiler. We can then deploy this hardened executable instead of the original program.

Of course, Varan can detect misbehavior other than compiler-bug-based backdoors. Divergence could be caused by a memory or CPU glitch, by a plain old compiler bug that is triggered unintentionally instead of being triggered by an adversarial patch, or by a situation where an application-level undefined behavior bug has been exploited by only one of the compilers, or even where both compilers exploited the bug but not in precisely the same way. A nice thing about N-version programming at the system call level is that it won’t bother us about transient divergences that do not manifest as externally visible behaviour through a system call.

We’ll end by pointing out a piece of previous work along these lines: the Boeing 777 uses both compiler-based and hardware-based N-version diversity. A single version of the Ada avionics software is compiled by three different compilers and runs on three different processors: an Intel 486, a Motorola 68040, and an AMD 29050.

Testcase Reduction for Non-Preprocessed C and C++

C-Reduce takes a C or C++ file that triggers a bug in a compiler (or other tool that processes source code) and turns it into the smallest possible test case that still triggers the bug. Most often, we try to reduce code that has already been preprocessed. This post is about how to reduce non-preprocessed code, which is sometimes necessary when — due to use of an integrated preprocessor — the compiler bug goes away when a separate preprocessing step is used.

The first thing we need to do is get all of the necessary header files into one place. This is somewhat painful due to things like computed includes and #include_next. I wrote a script that follows the transitive includes, renaming files and copying them over; it works fine on Linux but sometimes fails on OS X, and I haven’t figured out why yet. Trust me that you do not want to look too closely at the Boost headers.

Second, we need to reduce multiple files at once, since they have not yet been glommed together by the preprocessor. C-Reduce, which is a collection of passes that iterate to fixpoint, does this by running each pass over each file before proceeding to the next pass. This seems to work well. A side benefit of implementing multi-file reduction is that it has other uses, such as reducing programs that trigger link-time optimization bugs, which inherently requires multiple files.

Non-preprocessed code contains comments, so C-Reduce has a special pass for stripping those out. We don’t want to do this before running C-Reduce because removing comments might make the bug we’re chasing go away. Another pass specifically removes #include directives, which tend to be deeply nested in some C++ libraries.

#ifdef … #endif pairs are hard to eliminate from first principles because they are often not located near each other in the file being reduced, yet both must be removed at once. At first this sounded like a hard problem, but then I found Tony Finch’s excellent unifdef tool and wrote a C-Reduce pass that simply calls it for every relevant preprocessor symbol.

Finally, it is often the case that a collection of reduced header files contains long chains of trivial #includes. C-Reduce fixes these with a pass that replaces an #include with the included text when the included file is very small.

What’s left to do? The only obvious thing on my list is selectively evaluating the substitutions suggested by #define directives. I will probably only do this by shelling out to an external tool, should someone happen to write it.

In summary, reducing non-preprocessed code is not too hard, but some specific support is required in order to do a good job of it. If you have a testcase reduction problem that requires multi-file reduction or needs to run on non-preprocessed code, please try out C-Reduce and let us know — at the creduce-dev mailing list — how it worked. The features described in this post aren’t yet in a release, just build and install our master branch.

As a bonus, since you’re dying to know, I’ll show you what C-Reduce thinks is the minimal hello world program in C++11. From 127 header files + the original source file, it creates 126 empty files plus this hello.cpp:

#include "ostream"
namespace std {
basic_ostream cout;
ios_base::Init x0;
}
int main() { std::cout << "Hello"; }

And this ostream:

typedef __SIZE_TYPE__ x0;
typedef __PTRDIFF_TYPE__ x1;
namespace std {
template < class > struct char_traits;
typedef x1 x2;
}
namespace std {
template < typename x3, typename = char_traits < x3 > >struct basic_ostream;
template <> struct char_traits <char> {
    typedef char x4;
    static x0 x5 ( const x4 * x6 ) {
        return __builtin_strlen ( x6 );
    }
};
template < typename x3, typename x7 > basic_ostream < x3,
         x7 > &__ostream_insert ( basic_ostream < x3, x7 > &, const x3 *, x2 );
struct ios_base {
    struct Init {
        Init (  );
    };
};
template < typename, typename > struct basic_ostream {
};
template < typename x3,
         typename x7 > basic_ostream < x3 > operator<< ( basic_ostream < x3,
                 x7 > &x8, const x3 * x6 ) {
    __ostream_insert ( x8, x6, x7::x5 ( x6 ) );
    return x8;
}
}

It’s obviously not minimal, but believe me, we have already put a lot of work into domain-specific C++ reduction tricks. If you see something here that can be gotten rid of and want to take a crack at teaching our clang_delta tool to do the transformation, we’ll take a pull request.

A Few Pictures

Paris was very quiet on Saturday and people on the street looked tired, having (like us) stayed up most of the night watching the news, listening to sirens, and worrying about things. Today was sunny and warm and things seemed more normal; plenty of folks out jogging, sitting in parks, other usual weekend activities.

Classic Bug Reports

A bug report is sometimes entertaining either because of the personalities involved or because of the bug itself. Here is a collection of links into public bug trackers; I learned about most of these in a recent Twitter thread.

I hope you enjoy these as much as I do. Thanks to everyone who contributed links.

Updates from comments and Reddit:

API Fuzzing vs. File Fuzzing: A Cautionary Tale

Libraries that provide APIs should be rock solid, and so should file parsers. Although we can use fuzzing to ensure the solidity of both kinds of software, there are some big differences in how we do that.

  1. A file parser should be fully robust: it isn’t allowed to crash even if presented with a corrupted file or the wrong kind of file. Although a library that implements an API usually does some validation of arguments, this is generally a debugging aid rather than an attempt to withstand hostility.
  2. A file usually originates outside of a trust boundary. There generally is not a trust boundary between a library and the program that uses the library.
  3. A file parser can perform arbitrary checks over its inputs. A C/C++ library providing an API cannot check some properties of its input. For example, it cannot check that a pointer refers to valid storage. Moreover, APIs are often so much in the critical path that full error checking would not be desirable even if it were possible.

For fuzzing campaigns the point is this:

The sole concern of a file fuzzer is to generate interesting files that expose bugs in the file parser or in subsequent logic. An API fuzzer must balance two goals. Like the file fuzzer, it needs to expose bugs by using the API in interesting ways. However, at the same time, it must stay within the usage prescribed by the API — careful reading of the documentation is required, as well as a bit of finesse and good taste.

As a concrete example, let’s look at libpng, which provides both a file parser and an API. To fuzz the file parsing side, we generate evil pngish files using something like afl-fuzz and we wait for the library to do something wrong. But what about fuzzing the API? We’ll need to look at the documentation first, where we find text like this:

To write a PNG file using the simplified API:

  1) Declare a 'png_image' structure on the stack and memset()
     it to all zero.

  2) Initialize the members of the structure that describe the
     image, setting the 'format' member to the format of the
     image in memory.

  3) Call the appropriate png_image_write... function with a
     pointer to the image to write the PNG data.

Hopefully it is immediately obvious that calling random API functions and passing random crap to them is not going to work. This will, of course, cause libpng to crash, but the crashes will not be interesting because they will not come from valid usage of the library.

You might ask why anyone would care to fuzz an API. Aren’t trust boundaries the things that matter? The answer is easy: If you are looking to find or prevent exploitable vulnerabilities, you should always focus on fuzzing at trust boundaries. On the other hand, if you are more generally interested in reliable software, you should fuzz APIs as well. We might use API fuzzing to ensure that printf prints floats correctly and that the red-black tree you coded up late last night doesn’t spuriously drop elements.

Now let’s look at a blog post by GDS from last week about fuzzing mbed TLS. There’s something kind of interesting going on here: they are doing API fuzzing (of the mbed TLS library) but they are doing it using afl-fuzz: a file fuzzer. This works well because afl-fuzz provides the data that is “sent” between the client and server (sent is in quotes because this fuzzing effort gains speed and determinism by putting the client and server in the same process).

At the bottom of the post we read this:

Our fuzzing scans discovered client-side NULL pointer dereference vulnerabilities in mbed TLS which affect all maintained versions of mbed TLS. ARM fixed the issues in versions 2.1.1, 1.3.13 and PolarSSL version 1.2.16. See the release notes for details.

However, the corresponding text in the release notes is this:

Fabian Foerg of Gotham Digital Science found a possible client-side NULL pointer dereference, using the AFL Fuzzer. This dereference can only occur when misusing the API, although a fix has still been implemented.

In other words, the blog post author disagrees with the mbed TLS maintainers about whether a bug has been discovered or not. It is a matter of debate whether or not this kind of crash represents a bug to be fixed. I would tend to agree with the mbed TLS maintainers’ decision to err on the side of defensive programming here. But was there a vulnerability? That seems like a stretch.

In summary, API fuzzing is different from file fuzzing. Good API fuzzing requires exploration of every legal corner of the API and no illegal corners. Ambiguous situations will come up, necessitating judgement calls and perhaps even discussions with the providers of the API.

Comments on a Formal Verification of PolarSSL

The C language has given the world many enduring gifts such as buffer overflows, uninitialized variables, and use-after-free errors. Since rewriting a code base in some other language is not easy, we’re often stuck trying to eliminate bugs in legacy C before they bite us, and of course bugs in network-facing code sometimes bite us very hard. In this post I’ll be discussing TrustInSoft’s Verification Kit for PolarSSL 1.1.8, which I’ll call “the report” throughout this post. Just to be clear, I’m not unbiased — I’m working for these folks this year — so I’ll try to keep the cheerleading to a minimum.

The crux is:

This report states the total immunity of the SSL server implemented by the PolarSSL 1.1.8 library to [CWE-119, CWE-120, CWE-121, CWE-122, CWE-123, CWE-124, CWE-125, CWE-126, CWE-127, CWE-369, CWE-415, CWE-416, CWE-457, CWE-476, CWE-562, and CWE-690] if it is deployed according to the Secure Deployment Guide detailed in section 4.4.

In other words, a collection of nasty undefined behaviors that are notoriously hard to get rid of have been gotten rid of, provided that you are careful to follow the instructions.

The thing that struck me while reading the report is that I’ve never seen a document quite like it. I’ll next try to nail down that idea by explaining what the report is not.

It’s not a piece of formal methods research, which would need to demonstrate the novelty and utility of a verification technique. Rather, the verification techniques are completely deemphasized in the report in order to concentrate on the code that is being verified. In fact, one of the things that bothers me about formal verification research is that so much of it is concerned with new notations and clever algorithms instead of focusing on the thing I’m interested in: the interplay between the verification techniques and the code being verified.

It’s neither a formal proof nor a code review. Rather, nearly all of the reasoning about code has been delegated to the analysis tool, which must be trusted to come to the right conclusions. Why would we trust a formal verification tool? For one thing, the analyzer has been validated on many other codes including programs generated by Csmith (we wrote a short paper about that work). For another, the analyzer identified some legitimate shortcomings in PolarSSL that needed to be fixed before the verification would go through (these are in Section 4.4.3). Additionally, a much more serious PolarSSL bug was found earlier in the verification process. It is not mentioned in this report since it had been fixed before version 1.1.8 was released.

Because it makes a guarantee, it is fundamentally different from a report based on unsound static analysis. By early 2014, OpenSSL had been examined by various unsound analyzers and yet none of them found Heartbleed. In contrast, if someone had used sound static analysis to verify OpenSSL 1.0.1f, it is a certainty that Heartbleed would have been among the defects that were discovered. (We have to be a little bit careful when making this kind of statement. For example, the bug won’t be discovered if the OpenSSL being verified was compiled using -DOPENSSL_NO_HEARTBEATS. The buggy code must lie within the verification perimeter that I’m about to discuss.)

So what is this report? It is, mainly, three things:

The definition of a perimeter within which its claims apply. This perimeter has three aspects. First, it excludes some parts of PolarSSL, such as X.509, that were not verified. Second, it excludes some ways in which the code could be used by defining drivers that exercise API calls in certain ways. For example, the server driver that spans pages 16 and 17 initializes the library, makes an arbitrary number of read and write calls, and then shuts it down. The guarantees in the report would not apply to a different usage, such as one that started writing data without properly initializing the library. Nor would they apply to usage of the library by a buggy program that corrupted PolarSSL’s state using a stray indirect write. Third, some configurations of the code are placed outside of the perimeter. The issue is that C code is implicitly parameterized by a collection of implementation defined behaviors, such as the representation of integers, and it is also typically explicitly parameterized by preprocessor directives. A real C program is actually a shorthand for a huge family of different programs; any formal verification effort either has to pin down which member of the family is being verified or else reason about the entire family. Reasoning about the entire family in a sound and precise way is an open problem.

A sketch of the verification strategy, which is divide and conquer. Each of chapters 5-10 describes the verification of a component. The interfaces between components are specified in ACSL, which is designed in such a way that when two components are verified against opposite sides of an interface, their overall verification can be taken for granted.

A justification of alarms. The analysis tool signals an alarm any time it fails to prove that an operation (for example, incrementing a signed integer) is safe. Each alarm must be investigated to determine if it is a true alarm (exposing a defect in the code) or a false alarm (exposing an imprecision in the analysis). Appendix A of the report contains this analysis. It turns out to be easy, in theory, to avoid the imprecisions that caused these alarms: the necessary techniques have been known for decades. The issue is that there are engineering reasons for avoiding these static analysis techniques: their results can be very difficult for people to understand and also they can be computationally expensive. An entertaining exercise would be to pick a few parts of Appendix A, download PolarSSL 1.1.8, and make sure you agree with the reasoning from the report.

That’s all there is to it, though if you read the report you’ll see that the reasoning sometimes becomes quite detailed and technical.

Finally, I’ll try to anticipate a few questions that you might have at this point.

How much effort would be required to track the upstream PolarSSL (now mbed TLS)? A majority of local changes will just go through the verification automatically. Larger-scale changes might raise one or more new alarms that need to be investigated.

Under what conditions does the claim of “total immunity to UBs” apply? In addition to the conditions described in the report, the intellectual analyses in Appendix A must be correct, the verification tool must not have made a mistake, and the C compiler and the platform (OS + standard libs + hardware) must not make mistakes.

If we don’t want UBs, why not just use a TLS implementation written in Java or C# or some other safe language? Of course that’s a fine idea! But we’ll have to somehow shoehorn a Java runtime onto those embedded processors with ~64 KB of RAM where PolarSSL / mbed TLS run. Also, I’m not sure that it is clear how to implement crypto in higher-level languages in such a way that it resists timing attacks (this is hard enough in C). Finally, safe languages are only dynamically safe: a buffer overrun bug in a Java program will lead to an exception, and perhaps to a DoS attack. The report makes a guarantee of static safety, meaning that buffer overrun bugs do not exist.

Where is the corresponding document for OpenSSL? It doesn’t exist. Based on lines of code, it would probably be about four times longer than the PolarSSL report, and would require at least four times the effort to create.

What is the future of this kind of report? It’s all about economics. Given a piece of C code, we can choose to inspect it, to test it by hand, to fuzz it, to subject it to unsound static analysis, and to subject it to sound static analysis. Obviously, all of these validation techniques have merit — the question is how many resources to allocate to each of them. Larger, more rapidly evolving code bases tend to favor cheaper validation methods whereas smaller, security-critical codes are amenable to formal verification. The reach of formal verification techniques is increasing as techniques improve, but also software keeps getting bloatier.

How should I think or act differently based on the information in this report (and others like it)? Well, this is the central question. Formal verification has made major progress in the last 10-15 years and we now have extremely impressive technical results including CompCert and seL4. However, I think it’s fair to say that almost everyone is having trouble figuring out how to fit these results into the big picture. From the client’s point of view, formal verification tends to be somewhat mysterious, as if a mathematically adept wizard had sprinkled some magic verification dust over a source code repository, which now glows with correctness. Let’s take a quick look at this article about seL4 that appeared in New Scientist a few days ago. On Twitter, Gerwin Klein — the leader of the L4.verified project — said:

The sad truth is that this article is not that bad compared to others. I’ve seen way worse. Nuance is hard..

What he says is true: the article, while being sort of decent, is just extremely far from conveying accurate technical information. Here’s a more nuanced version.

The PolarSSL report can be best understood as providing details that are necessary for the next level of people to crack open the black box surrounding this particular formal verification effort. We didn’t need another article about abstract interpretation, we needed specific technical claims about a particular piece of source code, followed by details that back up the claim. Much more work on communicating verification results is needed, but this is a good piece of progress.

Secret Coders

coders Although I’m not sure that I’ve mentioned it here before, I’m a pretty big comic book nerd. So I was psyched when, late last year, Gene Luen Yang mailed me asking if I’d like a review copy of his upcoming graphic novel. I love Gene’s Avatar comics, which I had been reading with my kids, as well as his earlier American Born Chinese. What I hadn’t known is that Gene is a former computer engineer and high school computer science teacher.

Gene’s mail described Secret Coders — which comes out at the end of September 2015 — as sort of a Harry Potter but with programming instead of magic. It stars plucky Hopper Gracie who has to adjust to life at her strange new middle school while also, with her new buddy Eni, unraveling its many mysteries, some of which might involve her missing father. Along the way, the reader gets a clever introduction to binary numbers and Logo programming, the latter via a slightly mysterious robotic turtle character that obeys verbal commands. You can read an excerpt here. The book’s web site also has a 5-minute video introducing Logo, which makes it feel like this series might end up being as much of a MOOC as a graphic novel.

My 10 year old read Secret Coders almost as soon as it arrived. I should mention that he has read more of Gene’s books than I have (perhaps using the school library? I’m not totally sure) and has loved all of them. The kid is also a decent Scratch programmer and something of a novice at Python. He gave the book high ratings both for plot and for CS content. My eight year old hasn’t read the book yet; if he does, I’ll update this post with additional findings. Anyway, at $5.50 (Amazon price) this book is a great deal and I’d recommend it to folks who are hoping to get a young person interested in programming.

Update from Sep 17: The 8 year old is reading Secret Coders now and loving it. Too early to tell if he’ll pick up the programming content.

Sabbatical at TrustInSoft

At the beginning of September I started at TrustInSoft, a Paris-based startup where I’ll be working for the next 10 months. I’ll post later about what I’m doing here, for now a bit about the company. TrustInSoft was founded by Pascal Cuoq, Fabrice Derepas, and Benjamin Monate: computer science researchers who (among others) created the Frama-C static analyzer for C code while working at CEA, the Atomic Energy Commission in France. They spun off a company whose business is guaranteeing the absence of undefined behavior bugs in C code, often deeply embedded software that needs to just work. Of course this is a mission that I believe in, and also I already know Pascal quite well and had worked with him.

The logistics of moving my family overseas for a year turned out to be more painful than I’d anticipated, but since we arrived it has been great. I’m super happy to be without a car, for example. Paris is amazing due to the density of bakeries and other shops, and a street right outside our apartment has a big open-air market three times a week. I’ve long enjoyed spending time in France and it has been fun to start brushing the dust off of my bad-but-sometimes-passable French, which was actually pretty good when I lived in Morocco in the 1980s. One of the funny things I’ve noticed is that even now, I have a much easier time understanding someone from North Africa than I do a random French person.

We spent August getting settled and being tourists, a few pics follow.

I know this bridge.

Hemingway talks about kids sailing these little boats in the Jardin du Luxembourg in the 1920s, I wonder how long it has been going on?

Thinnest building in Paris?

Dinner at one of the Ottolenghi restaurants in London with friends Edye and Alastair. I’ve been dreaming about eating at one of these for years!

Bletchley Park was super great to visit.

Small World and scotch is a good combination.

Back in Paris, one of the coolest art installations I’ve seen out at the Parc de la Villette.

I’m sort of obsessed with these spray-painted spiders that are all over the city.

A Few Synthesizing Superoptimizer Results

For this post, I crippled Souper by disabling its path conditions and limiting the depth of harvested expressions to two LLVM instructions. The first goal was to create a nice easy burn-in test for Souper’s instruction synthesizer, which uses a variant of this method; the second goal was to see if depth-limited, path-condition-free expressions would result in optimizations that are easier to understand and implement in LLVM.

We harvested LLVM expressions from SPEC CINT 2006, leaving out omnet, which caused a compiler crash. For each of the ~30,000 unique harvested expressions, we ran Souper’s synthesizer, which attempts to create a functionally equivalent but cheaper version of the expression. The results are ranked by the average of the optimization’s rank in terms of static and dynamic profile counts.

Here are the results

One thing you’ll notice is a lot of this sort of thing:

%0:i8 = var
%1:i8 = and 1:i8, %0
%2:i1 = ne 0:i8, %1
infer %2
%3:i1 = trunc %0
result %3

The backends don’t usually care which of these they get, but even so this seems like a nice easy simplification to implement at the IR level.

My favorite optimizations from this batch are where Souper uses bit-level tricks to remove select instructions; here’s an example where the optimized version seems to lead to better code on the architectures that I’ve tried:

%0:i1 = var
%1:i64 = var
%2:i64 = select %0, %1, 1:i64
%3:i1 = eq 0:i64, %2
infer %3
%4:i64 = zext %0
%5:i1 = ult %1, %4
result %5

On the other hand, it is clear that there are many select-removal “optimizations” in this batch that are just not a good idea. We are not finding it easy to automatically choose which of two LLVM expressions is preferable. For now we just say that most instructions cost 1, select and div/rem cost 3, and constants and phis are free. Obviously this is just a first hack. We’d appreciate feedback on this aspect of the work (and others, of course). And obviously let us know if you find bugs in the results.

We might ask what can cause Souper to not find an optimization. First, Souper’s grab-bag of components for synthesis contains one of each of the following: shl, lshr, ashr, and, or, xor, add, sub, slt, ult, sle, ule, eq, ne, and constant. If an optimization requires, for example, two xors, we won’t find it. Of course we could give the synthesizer more components to work with but that slows it down. Sext/zext/trunc aren’t considered basic components by Souper, but are added as necessary to make bitwidths match up. This leads us to the second cause of missed optimizations: Souper won’t make up bitwidths not found in the original code. So, if there’s a trick where the arguments need to be extended from 32 bits to 64 and then truncated later, we won’t find that. Again this isn’t hard to fix, but it does appear to be hard to fix without slowing down synthesis a lot. Finally, if the solver (Z3 here) uses too much time or memory, or if some limits on loops inside the synthesizer are exceeded, we’ll miss out on whatever we might have found if we were willing to use more resources.

I’d like to distill the synthesis problems that come from optimizing LLVM code into some sort of synthesis benchmark that researchers can use to evaluate their work. We would say that your synthesis engine is better than ours (on this benchmark) if it can save more total static or dynamic profile weight in a given amount of time, or if it can produce equivalent results in less time. Of course we’ll have to agree on a cost model. I attended the 2015 SMT Workshop the other day (where I talked about Souper!) and it sounds like there is plenty of interest in the solver community in providing better support for synthesis, which is great!

We have a medium-term goal of taking an optimization discovered by Souper and generalizing it into something that is very close to what an LLVM developer would implement. Since Souper has no capacity for abstraction, we’ll turn its optimizations into Alive patterns and then derive not only any preconditions that need to hold before the optimization can fire but also code for converting constant values on the left hand side into constants on the right hand side. For example let’s say that Souper finds this optimization:

%1 = lshr i32 %in, 3
%out = shl nuw i32 %1, 3
  =>
%out = and i32 4294967288, %in

The general form that we’d like to derive is:

%1 = lshr %in, C
%out = shl %1, C
  =>
%out = and -(1<<C), %in

Can we derive -(1<<C) from scratch? Time will tell. If this works, it’ll be useful for two reasons. First, it’ll effectively deduplicate Souper’s output: many specific optimizations will all boil down to the same general one. Second, the optimization will be easier to implement in LLVM if a human doesn’t have to do the generalization work. Some work by Nuno Lopes contains the basic ideas that we’ll use.
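The generalized pattern is at least easy to sanity-check numerically. The snippet below (my own check, not derived output) verifies that clearing the low C bits with the shift pair matches the and with -(1<<C), modulo 2^32:

```python
# Check ((x >> C) << C) == x & -(1 << C) for 32-bit x and every shift C.
# The constant 4294967288 in the concrete example above is -(1 << 3)
# reduced mod 2^32, i.e., 0xFFFFFFF8.
MASK32 = (1 << 32) - 1
assert (-(1 << 3)) & MASK32 == 4294967288
for C in range(32):
    mask = (-(1 << C)) & MASK32
    for x in (0, 1, (1 << C) - 1, 1 << C, 0xDEADBEEF, MASK32):
        assert (((x >> C) << C) & MASK32) == (x & mask)
```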

Something else I’ve been thinking about is the fact that all compilers have weaknesses of the kind that we point out in this post. The OPTGEN paper has some concrete data for GCC, LLVM, and Intel CC. I’ve heard that the Microsoft C/C++ compiler could stand to be improved in this respect too. CompCert needs a ton of work, the last time I checked. Wouldn’t it be cool if the work we’re doing for Souper and Alive could apply more broadly than just to LLVM? One observation is that if we deal only with integer values, the IR semantics in each of these compilers is probably broadly the same (leaving out some tricky issues like IR-level undefined behaviors). There’s probably no fundamental reason we can’t just reuse integer optimizations across multiple compilers.

Here are a few offers I’d like to extend:

  1. If you are an LLVM frontend developer, I’d like to run Souper on some representative code from your frontend. It would be easiest if you harvested optimized LLVM code into Souper yourself (I’ll have to write up instructions) but if not, you can send me some bitcode files — in this case we won’t get dynamic profile information. A few GB would not be too much. The effect will be that we can help the LLVM middle-end optimizers do a better job on code that you care about.
  2. If you develop a compiler other than LLVM, even a closed-source one, let’s figure out a way to translate an integer subset of your IR into Souper, so we can look for missed optimizations. My guess is that if you target a subset of Souper that doesn’t have phis or path conditions, this isn’t very difficult. You will basically end up doing the same kind of depth-bounded expression harvesting that I did.
  3. If you are doing research on program synthesis, and if you might be interested in working towards some benchmarks about optimizing LLVM code, let me know and let’s figure out how to make that happen.

The next set of results that I post will be similar to these, but they’ll show off Souper’s ability to exploit the results of one or two of LLVM’s dataflow analyses. Finally I want to add that although I’ve been writing up these posts, Souper is a group effort.

Nibble Sort Denouement

Back in January my nibble sort contest resulted in entries that dramatically exceeded my expectations. Since then I’ve been meaning to write up a post explaining the various strategies that people used, but I never got it written; you don’t care about my excuses, so I won’t offer any. However! I want to point everyone to three great blog entries:

The thing that I liked most about this contest is that the fastest entries ended up being way faster than I predicted — mostly because I didn’t understand how excellent the Haswell vector unit is. I’ve sent my standard programming contest prize (Amazon gift certificate covering the cost of a paper copy of Hacker’s Delight) to Alexander and Jeff (Jerome’s non-SIMD entry, alas, arrived a bit after the contest expired). Thanks again everyone!

Final note: as a compiler researcher I think these contest entries could be very useful as a case study in how variance among algorithms and among implementations of algorithms can lead to extreme performance differences. In the long run we need compilers that understand how to automatically switch between different versions such as these. Regular compiler technology won’t cut it, nor will (at present) program synthesis. But it’s a good goal.